213 lines
		
	
	
	
		
			4.8 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			213 lines
		
	
	
	
		
			4.8 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
---
 | 
						|
tags:
 | 
						|
  - shell
 | 
						|
  - awk
 | 
						|
---
 | 
						|
 | 
						|
# Awk
 | 
						|
 | 
						|
> Awk is a programming language designed for text processing and data
 | 
						|
> extraction. It was created in the 1970s and remains widely used today for
 | 
						|
> tasks such as filtering and transforming text data, generating reports, and
 | 
						|
> performing basic calculations. Awk is known for its simplicity and
 | 
						|
> versatility, making it a popular tool for Unix system administrators and data
 | 
						|
> analysts.
 | 
						|
 | 
						|
## Invocation
 | 
						|
 | 
						|
We can use `awk` directly in `stdin` or we can reference `.awk` files for more
 | 
						|
elaborate scripts
 | 
						|
 | 
						|
```bash
 | 
						|
# CLI
 | 
						|
awk [program] file1, file2, file3
 | 
						|
 | 
						|
# Script file
 | 
						|
awk -f [ref_to_script_file] file1, file2, file3
 | 
						|
```
 | 
						|
 | 
						|
We can also pipe to it. This piped command receives output from the `echo`
 | 
						|
command and prints the value in the last field for each record:
 | 
						|
 | 
						|
```bash
 | 
						|
echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
 | 
						|
```
 | 
						|
 | 
						|
## Syntactic structure
 | 
						|
 | 
						|
`awk` is a line-oriented language.
 | 
						|
 | 
						|
An `awk` program consists in a sequence of **pattern: action** statements and
 | 
						|
optional functional definitions.
 | 
						|
 | 
						|
For most of the examples we will use this list as the input:
 | 
						|
 | 
						|
```
 | 
						|
cloud
 | 
						|
existence
 | 
						|
ministerial
 | 
						|
falcon
 | 
						|
town
 | 
						|
sky
 | 
						|
top
 | 
						|
bookworm
 | 
						|
bookcase
 | 
						|
war
 | 
						|
Peter 89
 | 
						|
Lucia 95
 | 
						|
Thomas 76
 | 
						|
Marta 67
 | 
						|
Joe 92
 | 
						|
Alex 78
 | 
						|
Sophia 90
 | 
						|
Alfred 65
 | 
						|
Kate 46
 | 
						|
```
 | 
						|
 | 
						|
> `awk` particularly lends itself to inputs that are structured by whitespace or
 | 
						|
> in columns, like what you get from commands like `ls` and `grep`
 | 
						|
 | 
						|
### Patterns and actions
 | 
						|
 | 
						|
The basic structure of an `awk` script is as follows:
 | 
						|
 | 
						|
```
 | 
						|
pattern {action}
 | 
						|
```
 | 
						|
 | 
						|
A **pattern** is what you want to match against. It can be a literal string or a
 | 
						|
regex. The **action** is what process you want to execute against the lines in
 | 
						|
the input that match the pattern.
 | 
						|
 | 
						|
The following script prints the line that matches `Joe`:
 | 
						|
 | 
						|
```bash
 | 
						|
awk '/Joe/ {print}' list.txt
 | 
						|
```
 | 
						|
 | 
						|
`/Joe/` is the patttern and `{print}` is the action.
 | 
						|
 | 
						|
### Lines, records, fields
 | 
						|
 | 
						|

 | 
						|
 | 
						|
When `awk` receives a file it divides the lines into **records**.
 | 
						|
 | 
						|
Each line `awk` receives is broken up into a sequence of **fields**.
 | 
						|
 | 
						|
The fields are accessed by special variables:
 | 
						|
 | 
						|
- `$1` reads the first field, `$2` reads the second field and so on.
 | 
						|
 | 
						|
- The variable `$0` refers to the whole record
 | 
						|
 | 
						|
So, in the picture `cloud existence ministerial` corresponse to `$1` `$2` `$3`
 | 
						|
 | 
						|
## Basic examples
 | 
						|
 | 
						|
**_Match a pattern_**
 | 
						|
 | 
						|
```bash
 | 
						|
awk '/book/ { print }' list.txt
 | 
						|
# bookworm
 | 
						|
# bookcase
 | 
						|
```
 | 
						|
 | 
						|
**_Print all words that are longer that five characters_**
 | 
						|
 | 
						|
```bash
 | 
						|
awk 'length($1) > 5 { print $0 }' list.txt
 | 
						|
```
 | 
						|
 | 
						|
For the first field of every line (we only have one field per line), if it is
 | 
						|
greater than 5 characters print it. The "every line" part is provided for via
 | 
						|
the all fields variable - `$0`.
 | 
						|
 | 
						|
We actually don't need to include the `{ print $0 }` action, as this is the
 | 
						|
default behaviour. We could have just put `length($1) > 5 list.txt`
 | 
						|
 | 
						|
**_Print all words that do not have three characters_**
 | 
						|
 | 
						|
```bash
 | 
						|
awk '!(length($1) == 3)' list.txt
 | 
						|
```
 | 
						|
 | 
						|
Here we negate by prepending the pattern with `!` and wrapping it in
 | 
						|
parentheses.
 | 
						|
 | 
						|
**_Return words that are either three characters or four characters in length_**
 | 
						|
 | 
						|
```
 | 
						|
awk '(length($1) == 3) || (length($1) == 4)' list.txt
 | 
						|
```
 | 
						|
 | 
						|
Here we use the logical OR to match against more than one pattern. Notice that
 | 
						|
whenever we use a Boolean operator such as NOT or OR, we wrap our pattern in
 | 
						|
parentheses.
 | 
						|
 | 
						|
**_Match and string-interpolate the output_**
 | 
						|
 | 
						|
```bash
 | 
						|
awk 'length($1) > 0  {print $1, "has", length($1), "chars"}' list.txt
 | 
						|
 | 
						|
# storeroom has 9 chars
 | 
						|
# tree has 4 chars
 | 
						|
# cup has 3 chars
 | 
						|
```
 | 
						|
 | 
						|
**_Match against a numerical property_**
 | 
						|
 | 
						|
```bash
 | 
						|
awk '$2 >= 90 { print $0 }' scores.txt
 | 
						|
 | 
						|
# Lucia 95
 | 
						|
# Joe 92
 | 
						|
# Sophia 90
 | 
						|
```
 | 
						|
 | 
						|
This returns the records where there is a secondary numerical field that is
 | 
						|
greater than 90.
 | 
						|
 | 
						|
**_Match a field against a regular expression_**
 | 
						|
 | 
						|
```bash
 | 
						|
awk '$1 ~ /^[b,c]/ {print $1}' words.txt
 | 
						|
```
 | 
						|
 | 
						|
This matches all the fields in the `$1` place that begin with 'b' or 'c'.
 | 
						|
 | 
						|
The tilde is the regex match operator. You must be passing a regex to use it,
 | 
						|
otherwise use `==`.
 | 
						|
 | 
						|
## Syntactic shorthands
 | 
						|
 | 
						|
- For a statement like `awk 'length($1) > 5 { print $0 }' list.txt`. We actually
 | 
						|
  don't need to include the `{ print $0 }` action, as this is the default
 | 
						|
  behaviour and it is implied. We could have just put `length($1) > 5 list.txt`.
 | 
						|
 | 
						|
https://zetcode.com/lang/awk/
 | 
						|
 | 
						|
## Built-in variables
 | 
						|
 | 
						|
### `NF`
 | 
						|
 | 
						|
The value of `NF` is the **number** of **fields** in the current record. `Awk`
 | 
						|
automatically updates the value of `NF` every time it reads a record.
 | 
						|
 | 
						|
No matter how many fields there are, the last value in a record can always be
 | 
						|
represented by `$NF`.
 | 
						|
 | 
						|
### `NR`
 | 
						|
 | 
						|
`NR` represents the **number** of **records**. It is set at the point at which
 | 
						|
the file is read.
 | 
						|
 | 
						|
### `FS`
 | 
						|
 | 
						|
`FS` represents the **field separator**. The default field separator is a space.
 | 
						|
We can specify a different separator with the `-F` flag. E.g to separate by
 | 
						|
comma:
 | 
						|
 | 
						|
```bash
 | 
						|
awk -F, '{print $1 }' list.txt
 | 
						|
```
 |