> Awk is a programming language designed for text processing and data extraction. It was created in the 1970s and remains widely used today for tasks such as filtering and transforming text data, generating reports, and performing basic calculations. Awk is known for its simplicity and versatility, making it a popular tool for Unix system administrators and data analysts.
## Invocation
We can pass an `awk` program inline on the command line, with input coming from a file or from `stdin`, or we can keep more elaborate scripts in `.awk` files and run them with the `-f` flag.
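As a rough sketch of these invocation styles, the snippet below runs the same one-line program inline, from a script file via `-f`, and on piped `stdin`. The file names `list.txt` and `lengths.awk` and the sample words are made up for illustration.

```bash
# Hypothetical sample data: one word per line.
tmpdir=$(mktemp -d)
printf 'apple\nkiwi\nbanana\n' > "$tmpdir/list.txt"

# 1. Inline program, passed as a single-quoted argument.
inline_out=$(awk '{ print length($1), $1 }' "$tmpdir/list.txt")

# 2. The same program stored in a (hypothetical) script file, loaded with -f.
printf '{ print length($1), $1 }\n' > "$tmpdir/lengths.awk"
file_out=$(awk -f "$tmpdir/lengths.awk" "$tmpdir/list.txt")

# 3. No file argument: awk reads from stdin instead.
stdin_out=$(printf 'apple\nkiwi\nbanana\n' | awk '{ print length($1), $1 }')

echo "$inline_out"
rm -rf "$tmpdir"
```

All three forms should print the same three lines, each word preceded by its length.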
The basic structure of an `awk` script is as follows:
```
pattern {action}
```
A **pattern** is the condition you want to match each line of input against. It can be a regex (such as `/foo/`) or an expression such as a string or numeric comparison. The **action** is what you want to execute for the lines in the input that match the pattern.
**_Print all words that are longer than five characters_**
```bash
awk 'length($1) > 5 { print $0 }' list.txt
```
For every line, if the first field (we only have one field per line) is longer than 5 characters, print the line. `$1` is the first field, while `$0` refers to the whole current line.
We actually don't need to include the `{ print $0 }` action, as printing the line is the default behaviour. We could have just written `awk 'length($1) > 5' list.txt`.
**_Print all words that do not have three characters_**
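One possible way to write this, assuming `list.txt` holds one word per line (the sample words here are invented), is to negate a length comparison with `!`:

```bash
# Hypothetical sample data: one word per line.
tmpdir=$(mktemp -d)
printf 'cat\napple\ndog\nbanana\nkiwi\n' > "$tmpdir/list.txt"

# NOT: keep lines whose first field is not exactly three characters long.
# No action is given, so awk prints each matching line.
not_three=$(awk '!(length($1) == 3)' "$tmpdir/list.txt")

echo "$not_three"
rm -rf "$tmpdir"
```

The same filter could also be written as `length($1) != 3`; the negated form is used here only to show the NOT operator.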
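Boolean operators let us negate or combine patterns: `!` is NOT and `||` is OR. Whenever we use a Boolean operator, we wrap each pattern in parentheses; this is not strictly required, but it makes the grouping explicit.
For example, `($1 ~ /^b/) || ($1 ~ /^c/)` uses the logical OR to match against more than one pattern. It matches every line whose `$1` field begins with 'b' or 'c'.
The tilde is the regex match operator. Its right-hand side is treated as a regex; to compare against an exact string, use `==` instead.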
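The OR pattern and the difference between `~` and `==` can be sketched as follows. This is an illustrative reconstruction with invented sample words, not necessarily the exact command these notes were written against.

```bash
# Hypothetical sample data: one word per line.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\ncat\ndate\n' > "$tmpdir/list.txt"

# OR: first field starts with 'b' or starts with 'c' (regex match via ~).
b_or_c=$(awk '($1 ~ /^b/) || ($1 ~ /^c/)' "$tmpdir/list.txt")

# Exact string comparison uses == rather than ~.
exact=$(awk '$1 == "cat"' "$tmpdir/list.txt")

echo "$b_or_c"
echo "$exact"
rm -rf "$tmpdir"
```

Here the regex version selects `banana`, `cherry` and `cat`, while the `==` version selects only the line that is exactly `cat`.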
## Syntactic shorthands
- A statement like `awk 'length($1) > 5 { print $0 }' list.txt` does not need the `{ print $0 }` action, because printing the whole line is the default behaviour and is implied. We could have just written `awk 'length($1) > 5' list.txt`.
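A quick way to check this shorthand is to run both forms on the same input and compare the results. The sample words below are assumptions made for the sake of the example.

```bash
# Hypothetical sample data: one word per line.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/list.txt"

# Explicit action.
explicit=$(awk 'length($1) > 5 { print $0 }' "$tmpdir/list.txt")

# Pattern only: the print action is implied.
implicit=$(awk 'length($1) > 5' "$tmpdir/list.txt")

[ "$explicit" = "$implicit" ] && echo "same output"
rm -rf "$tmpdir"
```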
### `NF`
The value of `NF` is the **number** of **fields** in the current record. `Awk` automatically updates the value of `NF` every time it reads a record.
No matter how many fields there are, the last field in a record can always be referenced with `$NF`.
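To make `NF` and `$NF` concrete, here is a small sketch that prints the field count and the last field of each line. The input lines are invented and use the default whitespace separator.

```bash
# Three records with 3, 2 and 1 whitespace-separated fields.
nf_out=$(printf 'one two three\nalpha beta\nsolo\n' | awk '{ print NF, $NF }')

echo "$nf_out"
```

Each output line shows how many fields the record had, followed by whichever field came last.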
### `NR`
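`NR` represents the **number** of **records** read so far. `Awk` increments it each time it reads a record, so inside an action it holds the current record (line) number, and in an `END` block it holds the total number of records read.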
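As an illustration of `NR`, the sketch below numbers each input line and then reports the total from an `END` block. The input text is made up.

```bash
# NR is the running record count inside the main action,
# and the total count once all input has been read (END).
nr_out=$(printf 'first\nsecond\nthird\n' |
  awk '{ print NR ": " $0 } END { print "total: " NR }')

echo "$nr_out"
```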
### `FS`
`FS` represents the **field separator**. The default field separator is a single space, which `awk` treats as any run of spaces or tabs. We can specify a different separator with the `-F` flag. For example, `-F,` splits each record on commas.
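A minimal sketch of comma separation, using invented CSV-style rows: the `-F` flag and an explicit `FS` assignment in a `BEGIN` block should behave the same way here.

```bash
# Hypothetical comma-separated input: name,age,city
csv='alice,30,London
bob,25,Paris'

# Set the separator with the -F flag.
flag_out=$(printf '%s\n' "$csv" | awk -F, '{ print $2 }')

# Equivalent: assign FS before any input is read.
begin_out=$(printf '%s\n' "$csv" | awk 'BEGIN { FS = "," } { print $2 }')

echo "$flag_out"
```

Both commands print the second column, the ages in this sample.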