Command Line Analytics, Part II

In a previous post, we took a look at the critical role log files play in administering servers, and also dissected the content of an Nginx access log. With that information in mind, let's turn our attention to parsing out data points from the command line.

You'll recall that the first field of /var/log/nginx/access.log is the IP address of a client accessing our server. We can tease that field out with:

awk '{print $1}' access.log,

where awk prints only the column we want. This returns the following (abridged) list of IP addresses:

124.88.64.192
54.147.255.97
37.32.43.127
95.213.177.126
66.249.69.98
66.249.69.121
66.249.69.126
66.249.69.123
173.249.63.71
173.249.63.71
173.249.63.71
173.249.63.71
120.79.157.213
34.235.161.174
34.235.161.174
34.235.161.174
120.79.157.213
120.79.157.213
120.79.157.213
120.79.157.213 .

For a total count of IP addresses, we can invoke awk with the END rule and the special variable NR to get a tally, e.g.:

awk 'END{print NR}' access.log.

Here awk processes the access.log file to its end, at which point the END rule prints NR. That variable holds the number of records read so far, so once the input is exhausted it equals the total line count (93 in our case). You'll notice, though, that there are a number of identical IP addresses in our output. To filter out these duplicates, we can use awk like this:
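To see how NR behaves, here is a minimal sketch run against a three-line stand-in log. The entries are made up for illustration (they use documentation-only IP ranges) and are not taken from the real access.log.

```shell
# Build a tiny stand-in log (hypothetical data, not from the real access.log).
log=$(mktemp)
printf '%s\n' \
  '192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 612' \
  '198.51.100.7 - - [01/Jan/2024:00:00:02 +0000] "GET /about HTTP/1.1" 200 845' \
  '192.0.2.10 - - [01/Jan/2024:00:00:03 +0000] "GET /contact HTTP/1.1" 404 153' > "$log"

# NR grows by one per record, so in the main rule it acts as a line number.
numbered=$(awk '{print NR, $1}' "$log")

# In the END rule no records remain, so NR holds the final (total) count.
total=$(awk 'END{print NR}' "$log")

printf '%s\n' "$numbered"
echo "total: $total"   # prints "total: 3"
rm -f "$log"
```

The same variable serves both purposes: a running line number while awk is reading, and a grand total once it is done.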

awk 'NR>1{column[$1]++} END{for(IP in column) print IP}' /var/log/nginx/access.log

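(source). Note that the NR>1 condition skips the first line of the file, which suits data with a header row; an access log has no header, so here the first record is never examined, and you can drop the condition to include it. To get a total count of the unique addresses, we can pipe (|) our output to the word count utility: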

awk 'NR>1{column[$1]++} END{for(IP in column) print IP}' /var/log/nginx/access.log |wc --lines,

which yields 48.
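For comparison, a common awk idiom gets the same de-duplicated list without an END loop. The sketch below is one possible variant, not the command from the source above: unique_ips is a hypothetical helper name, the sample data is made up, and it reads every line, including the first.

```shell
# unique_ips is a hypothetical helper name: print each client IP once,
# in order of first appearance.
unique_ips() {
  # seen[$1]++ evaluates to 0 the first time an IP shows up, so the negated
  # test is true only on that first sighting.
  awk '!seen[$1]++ {print $1}' "$1"
}

# Stand-in data (made up for illustration; only the first field matters here).
sample=$(mktemp)
printf '%s\n' \
  '192.0.2.10 - - "GET / HTTP/1.1" 200' \
  '198.51.100.7 - - "GET / HTTP/1.1" 200' \
  '192.0.2.10 - - "GET /about HTTP/1.1" 200' > "$sample"

unique_ips "$sample"            # lists 192.0.2.10, then 198.51.100.7
unique_ips "$sample" | wc -l    # counts the distinct addresses
rm -f "$sample"
```

Because the array is consulted as each line arrives, the addresses come out in the order they first appear in the log, whereas a for-in loop over an awk array makes no ordering promise.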

Tallying the number of times each unique IP address appears in our log file is nifty:

awk '{print $1|"sort --numeric|uniq --count"}' access.log,

and offers the following (abridged):

1 31.47.103.179
3 34.235.161.174
1 37.32.43.127
1 40.77.167.7
1 54.147.255.97
1 54.175.69.50
2 58.19.0.130
1 60.1.134.144
1 60.12.18.6
1 66.249.69.121
1 66.249.69.123
1 66.249.69.126
1 66.249.69.98
3 74.101.21.230
1 84.236.82.57
1 89.162.35.203
1 95.213.177.123
1 95.213.177.126
2 100.43.90.123
1 106.45.0.223 .
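The tally can also be kept inside awk itself, using the same associative-array trick we used for de-duplication. This is a rough sketch under a few assumptions: ip_counts is a hypothetical helper name, the sample lines are invented, and the trailing sort is there only because awk does not define the order in which a for-in loop visits array keys.

```shell
# ip_counts is a hypothetical helper name: print "count ip" for each client IP.
ip_counts() {
  # count[$1]++ tallies requests per IP. The sort orders the result by count
  # (numerically), then by the IP's text, so the listing is repeatable.
  awk '{count[$1]++} END{for (ip in count) print count[ip], ip}' "$1" |
    sort -k1,1n -k2,2
}

# Stand-in data (made up for illustration).
sample=$(mktemp)
printf '%s\n' \
  '192.0.2.10 - - "GET / HTTP/1.1" 200' \
  '198.51.100.7 - - "GET / HTTP/1.1" 200' \
  '192.0.2.10 - - "GET /about HTTP/1.1" 200' > "$sample"

ip_counts "$sample"
# 1 198.51.100.7
# 2 192.0.2.10
rm -f "$sample"
```

Counting in awk means the log is tallied in a single pass, with the sort only having to order one line per distinct address rather than one line per request.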

Finally, let's list the 10 IP addresses that appear most frequently in our log file:

awk '{print $1|"sort --numeric|uniq --count|sort --numeric|tail -10"}' access.log.

Strictly speaking, IPs that share a count are tied:

2 199.249.230.85
2 201.103.24.153
2 58.19.0.130
3 207.46.13.79
3 34.235.161.174
3 74.101.21.230
4 173.249.63.71
6 120.79.157.213
7 157.55.39.248
24 172.254.43.243

but the intent of the query should be clear.
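If you would rather see the busiest clients first, one option is to reverse the numeric sort and take the head of the list instead of the tail. The sketch below assumes a hypothetical helper named top_ips that takes the log path and an optional count (defaulting to 10); the sample data is invented.

```shell
# top_ips is a hypothetical helper name: top_ips LOGFILE [N]
# Prints the N most frequent client IPs, highest count first (N defaults to 10).
top_ips() {
  awk '{print $1}' "$1" |   # keep only the client IP field
    sort |                  # group identical addresses together for uniq
    uniq -c |               # prefix each address with its count
    sort -rn |              # order by count, largest first
    head -n "${2:-10}"      # keep the first N lines
}

# Stand-in data (made up for illustration).
sample=$(mktemp)
printf '%s\n' \
  '192.0.2.10 - - "GET / HTTP/1.1" 200' \
  '203.0.113.5 - - "GET / HTTP/1.1" 200' \
  '192.0.2.10 - - "GET /about HTTP/1.1" 200' \
  '198.51.100.7 - - "GET / HTTP/1.1" 200' \
  '203.0.113.5 - - "GET /about HTTP/1.1" 200' \
  '192.0.2.10 - - "GET /contact HTTP/1.1" 404' > "$sample"

top_ips "$sample" 2   # 192.0.2.10 (3 requests), then 203.0.113.5 (2 requests)
rm -f "$sample"
```

As with the tail-based version, addresses that share a count are tied, so which of them lands on the cutoff line is incidental.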

We now have a few actionable data points that we can use to inform future decisions. This, though, is just the tip of the proverbial iceberg, as there's much more data we can extract with a little grit (and the right command line arguments).

Cheers.