Liveblogging: Senior Skills: Grok awk

[author’s note: personally, I use awk a bunch in MySQL DBA work, for tasks like scrubbing data from a production export for use in qa/dev, but I usually have to resort to Perl for the really complex stuff. Now I know how to do more of that in awk.]

Basics:
By default, fields are separated by any amount of whitespace. The -F option to awk changes the separator on the command line.
Print the first field, where fields are separated by a colon:
awk -F: '{print $1}' /etc/passwd

Print the first and fifth field:
awk -F: '{print $1,$5}' /etc/passwd

awk can pattern match and read files itself, so you can replace:
grep foo /etc/passwd | awk -F: '{print $1,$5}'
with:
awk -F: '/foo/ {print $1,$5}' /etc/passwd

NF = built-in variable (no $ when you refer to it) that holds the number of fields
This will print the first and last fields of lines where the first field matches “foo”
awk -F: '$1 ~ /foo/ {print $1,$NF}' /etc/passwd

NF = number of fields, e.g. "7"
$NF = value of the last field, e.g. "/bin/bash"
(similarly, NR is record number)
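
For example, NR can be used to number lines as they are read (any file works here; /etc/passwd is just a convenient sample):
awk '{print NR, $0}' /etc/passwd
This prints each line prefixed with its record number, much like cat -n.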


Awk makes assumptions about input, variables, and processing that you’d otherwise have to code yourself.

– “main loop” of input processing is done for you
– awk initializes variables for you, to 0
– input is viewed by awk as ‘records’ which are splittable into ‘fields’

This all makes a lot of operations very concise in awk; many things can be done with a one-liner that would otherwise require several lines of code.
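
A small sketch of what that buys you: the variable n below is never declared or initialized, and the per-line loop is implicit, yet this counts the lines in a file much like wc -l:
awk '{n += 1} END {print n}' /etc/passwd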

awk key points:
– splits text into fields
– default delimiter is “any number of spaces”
– reference fields as $1, $2, …
– $0 is entire line
– create filters using ‘addresses’ which can be regexps (similar to sed)
– Turing-complete language
– has if, while, for, do-while, etc
– built-in math functions like exp, log, rand, sin, cos
– built-in string functions like sub, split, index, toupper/tolower (see the short example after this list)
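
A quick sketch combining a few of these (the UID cutoff of 1000 is just an arbitrary example value): print the login name upper-cased, plus its length, for accounts whose third field is at least 1000:
awk -F: '{ if ($3 >= 1000) print toupper($1), length($1) }' /etc/passwd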

Patterns and actions
Pattern is first, then action(s)
Actions are enclosed in {}

only a pattern, no action:
'length>42'
but the default action is to print the whole line, so this will actually do something: print every line whose length is greater than 42 characters (length with no argument operates on $0, the entire line).
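
As a complete command (again using /etc/passwd only as a handy sample file):
awk 'length > 42' /etc/passwd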

only action, no pattern:
{print $2,$1}
do this to all lines of input

NR % 3 == 0
print every third line (the pattern is NR mod 3 == 0, with the default print action)

{print $1, $NF, $(NF-1)}
print the first field, last field, and 2nd to last field

built-in variables
NF and NR we’ve already covered
FS = field separator (can be a regexp)
OFMT = output format for numbers (default %.6g); a short example of both follows
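
For instance, FS can be set in a BEGIN block instead of on the command line, and OFMT controls how non-integer numbers are printed (a minimal sketch):
awk 'BEGIN {FS=":"} {print $1}' /etc/passwd           # equivalent to -F:
echo "2 3" | awk 'BEGIN {OFMT="%.3f"} {print $1/$2}'  # prints 0.667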

Patterns
– used to filter lines processed by awk
– can be regexp
/^root/ is the pattern in the following
awk -F: '/^root/ {print $1,$NF}' /etc/passwd

– Patterns can use fields and relational operators
To print 1st, 4th and last field if value of 4th field >10:
awk -F: '$4 > 10 {print $1, $4, $NF}' /etc/passwd

To do the same but also skip comment lines (those starting with #):
awk -F: '$0 !~ /^#/ && $4 > 10 {print $1, $4, $NF}' /etc/passwd

Range patterns
sed-like addressing: you can have start and end addresses
awk 'NR==1,NR==3'
prints only first three lines of the file
You can use regular expressions in range patterns:
awk -F: '/^root/,/^daemon/ {print $1,$NF}' /etc/passwd
Printing starts at the line that begins with “root”; the last line processed is the line that begins with “daemon”.

Range pattern “gotcha” – can’t mix a range with other patterns:
To do “start at a non-commented line where the value of $4 is less than or equal to 10, end at the first line where the value of $4 is greater than 10”:

This does not work!
awk -F: '$0 !~ /^#/ $4 <= 10, $4 > 10' /etc/passwd

This is how to do it; {next} is an action that skips to the next input line:
awk -F: '$0 ~ /^#/ {next} $4 <= 10, $4 > 10 {print $1, $4}' /etc/passwd

Basic Aggregation
awk -F: '$3 > 100 {x+=1; print x}' /etc/passwd
This prints a line of output as each matching line is processed, showing the running count in x.

awk -F: '$3 > 100 {x+=1} END {print x}' /etc/passwd
This runs the {print x} action only after the entire file has been processed, so it prints only the final value of x.
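
The same END idiom works for sums, not just counts. A small sketch (the choice of the UID field here is just illustrative) that prints the average value of the third field:
awk -F: '{sum += $3; n += 1} END {if (n > 0) print sum/n}' /etc/passwd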

Arrays:
awk has support for arrays.
Technically multi-dimensional arrays are not supported, but array indexes can be arbitrary strings, not just integers, so every awk array works as an associative array.

Example:
awk -F, '{x[$1] = $2*($4 - $3)} END {for(key in x) {print key, x[key]}}' stocks.txt

The part before END builds the associative array; the END block prints it out.
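
Another associative-array sketch, this time against /etc/passwd rather than the stocks file: count how many accounts use each shell ($NF is the last field, i.e. the shell):
awk -F: '{count[$NF] += 1} END {for (shell in count) print shell, count[shell]}' /etc/passwd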

Extreme data munging:
awk -F, '{x[$1] = $2*($4 - $3)} END {for(z in x) {print z, x[z]}}' stocks.txt

Given a stocks.txt containing:

ABC,100,12.14,19.12
FOO,100,24.01,17.45

the output is:
ABC 698
FOO -656

For the line “ABC,100,12.14,19.12”
the calculation becomes

x[ABC] = 100 * (19.12 - 12.14) = 698
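
Similarly, for the FOO line: x[FOO] = 100 * (17.45 - 24.01) = -656, i.e. a loss.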

Aggregate across multiple variables:
awk -F, '{x[$1] = $2*($4 - $3); y+=x[$1]} END {for(z in x) {print z, x[z]}; print "Net:"y}' stocks.txt

Note that y is a running *sum* (not a running count like before).

Now, the above is hard to read; as a standalone awk script it is much easier to follow:

#!/usr/bin/awk -f

BEGIN { FS = "," }
{
  x[$1] = $2*($4 - $3)   # per-symbol gain/loss
  y += x[$1]             # running net total
}
END {
  for (z in x) {
    print z, x[z]
  }  # end for loop
  print "Net:" y
}  # end END block
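
Assuming the script is saved as, say, net.awk (the file name here is made up), it can be run either directly or via -f:
chmod +x net.awk && ./net.awk stocks.txt
awk -f net.awk stocks.txt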

This was liveblogged, so please point out any issues, as they may be typos on my part….
