Notes On Awk And Sed

awk and sed are two great text processing tools in linux. awk is both a programming language and text processor that can be used to manipulate text data in very useful ways.

awk

The format of an awk command is

1
awk 'BEGIN { actions; } /<search-pattern>/ { actions; } END { actions; }' <input-file>

The BEGIN clause is the commands to execute before the file processing, and the END clause is the commands to execute after the file processing.

The awk will process the file line by line, if the <search-pattern> is not given, it will process all the lines of the file; if the <search-pattern> is given, it will use the search portion to decide if the current line reflects the pattern, and then performs the actions on matches; if the actions for processing each line are not given, the default behaviour is to print the line.

Internal variables

The awk uses some internal variables to assign certain pieces of information as it processes a file.

The internal variables that awk uses are:

FILENAME: References the current input file.
FNR: References the number of the current record relative to the current input file.
FS: The current field separator used to denote each field in a record. By default, this is set to whitespace.
NF: The number of fields in the current record.
NR: The number of the current record.
OFS: The field separator for the outputted data. By default, this is set to whitespace.
ORS: The record separator for the outputted data. By default, this is a newline character.
RS: The record separator used to distinguish separate records in the input file. By default, this is a newline character.

Following is an example:

1
2
3
sudo awk 'BEGIN { FS=":"; print "User\t\tUID\t\tGID\t\tHome\t\tShell\n--------------"; }
{print $1,"\t\t",$3,"\t\t",$4,"\t\t",$6,"\t\t",$7;}
END { print "---------\nFile Complete" }' /etc/passwd

Field searching and Compound expressions

By using the <search-pattern> directly, awk will search the whole line for the matches, sometimes we want to search in a specific field, this can be done as

1
awk '$<field-number> ~ /<search-pattern>/ { actions; }' <input-file>

where the <field-number> denotes which field we want to search, it begins from 1, while $0 denotes the whole line. And take care of the ~ in the expression, it is important.

Sometimes we need further complicated logic for determine whether to process a certain line, awk allows us to compose compound expressions. For example

1
awk '$2 !~ /^sa/ && $1 < 5 {print;}' example.txt

In the example, it specifies a compound expression that the second field should not begin with sa, and the first field should be a number that is less than 5, and print all the lines in the example.txt that satisfies the compound condition.

And it is also possible to utilize external script with awk:

1
awk -f <ext-script.awk> <input-file>

where <ext-script.awk> is like

1
2
3
BEGIN { actions; } 
/<search-pattern>/ { actions; } 
END { actions; }

For more examples, see here.

sed

The sed is a non-interactive stream editor. It receives text input, whether from stdin or from a file, performs certain operations on specified lines of the input, one line at a time, then outputs the result to stdout or to a file.

The sed determines which lines of its input that it will operate on from the address range passed to it. Specify this address range either by line number or by a pattern to match.

The sed provides a large amount of commands, the most used three are print, delete, and substitute.

Following is a list of some commonly used commands:

-e: Interpret the next string as an editing instruction.
p: The action of print the line

1
sed -e 'p' <file>

It prints all the lines of the <file> to the screen.

d: The action of delete the line from the output

1
sed -e 'd' <file>

It prints nothing to the screen, and the content of <file> keeps not modified, as the delete means to delete the line in the line buffer of sed

[line-address][action]: specify the line of the file to take the action

1
sed -e '1d' <file>

It deletes the first line of the <file> from the output to the screen.

[line-range-address][action]: specify the range of lines of the file to take the action

1
sed -e '1,10d' <file>

It prints the lines from 11th line of the <file> to the screen.

[regexp-address][action]: specify the line to take action with regular expression

1
sed -e '/^#/d' <file>

It ignores all the lines of <file> start with ‘#’ from printing to the screen.

[regexp-address-begin],[regexp-address-end][action]: specify the lines to take action from the line matches [regexp-address-begin] up to and including the line matches [regexp-address-end]

1
sed -e '/BEGIN/,/END/p' <file>

It prints a block of lines of <file> start with a line containing ‘BEGIN’ end with a line containing ‘END’.

s/pattern1/pattern2/: Substitute pattern2 for first instance of pattern1 in a line

1
sed -e 's/##/--/' <file>

It prints every line of <file> with first occurrence of ‘##’ substituted with ‘–’.

1
sed -e 's/.*/some-word &/' <file>

The & in pattern2 refers to the matched content in pattern1.

1
sed -e 's/\(.*\) \(.*\) \(.*\)/some-word \1 some-other-word \2 some-more-other-word \3/' <file>

The \num in pattern2 refers to the nth matched content in pattern1.

g: Operate on every pattern match within each matched line of input

1
sed -e 's/foo/bar/' <file>

It prints every line of <file> with every occurrence of ‘foo’ substituted with ‘bar’.

s:pattern1:pattern2:g: Substitution with ‘:’ as the separator
[line-range-address]/s/pattern1/pattern2/: Substitute pattern2 for first instance of pattern1 in a line, over lines in range
[line-range-address]/y/pattern1/pattern2/: Replace any character in pattern1 with the corresponding character in pattern2, over lines in range (equivalent of tr)
[address] i pattern Filename: Insert pattern at address indicated in file Filename. Usually used with -i in-place option
-e '[action1];[action2]': specify multiple actions

1
sed -e '=;p' <file>

Action(command) = tells sed to print the line number, the whole command prints every line of <file> with line number at the head of the line.

-e '[action1]' -e '[action2]': specify multiple actions
-f <script-file>.sed: specify actions from script file with file name end with ‘.sed’
-e {...}: specify multiple actions

1
2
3
4
5
6
1,/^END/{
        s/[Ll]inux/GNU\/Linux/g 
        s/samba/Samba/g 
        s/posix/POSIX/g 
       p
}

It specifies the actions to take from the first line to the line start with ‘END’

i: insert before the current line

1
2
3
4
5
i\
first line\
second line\
third line
}

It insert multiple lines before the current line.

a: insert after the current line
c: replace the current line
'': Strong quotes protect the RE characters in the instruction from reinterpretation as special characters by the body of the script
"": Double quotes allow expansion in the instruction

three sources: source 1, source 2, source 3.