Linux One Liner Of The Day - Calculate the top 10 IP addresses from a log file

Sometimes you are faced with a problem at the Linux command line that tries hard to force you to write a script - in Perl, in Python, or just as a quick and dirty shell script.

But most of the time, you do not even need to leave the command line to solve the task.

The trick is just to know the right tools and combine them cleverly.

Just as with the example I want to show you here.

And you hear me saying this once again: There are plenty of other ways at the Linux command line to solve this or similar tasks. So take this example to get inspiration for how to tackle tasks like this. And take it as a demonstration of the power that the Linux command line gives you.

That said - let’s jump in and have a look at the task we want to solve today:

Calculate the top 10 IP addresses hitting a website based on the web server’s log file

… without scripting! Just by using a clever combination of Linux command line tools.

The basic layout of a web server’s log file

If you are not familiar with web server log files, let’s start with two example lines from the log file I want to analyze for this example:

183.136.225.14 - - [13/Jun/2021:03:03:34 +0200] "GET / HTTP/1.1" 200 5893 "-"
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
   Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE"
119.28.239.222 - - [13/Jun/2021:03:13:00 +0200] "GET / HTTP/1.1" 200 5880
   "http://64.123.123.123" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
   (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"

Every line in this log file stands for a single request from a client hitting the web server.

And as you see, every single line looks like this:

IP-ADDRESS - - [TIMESTAMP] REQUEST & REQUEST-INFORMATION

where the first part of every line shows the IP address where the request came from.

And our goal is now to take the whole log file and generate the top 10 IP addresses that sent the most requests to my web server.

And if we wanna have a top 10 list of the IP addresses, we first need to extract them from the log.

For this step I wanna show you two different approaches:

Extract the IP addresses with cut

The first approach is to use the command-line tool “cut” for extracting the IP addresses.

This tool is the command to use if you want to extract fields from lines of text, as long as these fields are separated from each other by a single, dedicated character.

This fits our example here perfectly:

  • We are interested in the first field of every line.
  • And this first field is separated by a single space from the rest of the line.

So we can simply call “cut” in the following way to extract all the IP addresses from a log file called “demolog”:

cut -d " " -f 1 demolog

Where -d " " sets the field delimiter to a single space, and -f 1 tells “cut” to print out only the first field of every line.

The output representing the two example log lines from above will just show the two IP addresses:

183.136.225.14
119.28.239.222
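
By the way, “cut” is not limited to a single field. Here is a minimal sketch (assuming the same space-separated log layout as above) that extracts the IP address together with the beginning of the timestamp:

cut -d " " -f 1,4 demolog

This would print lines like "183.136.225.14 [13/Jun/2021:03:03:34" - the comma-separated field list hands you both columns at once.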

For the example here, using cut is the simplest solution to extract the IP addresses.

But what if the data you are interested in cannot be extracted via simple columns?

Then you can reach for a slightly bigger tool: the powerful search tool “grep”.

Extract the IP addresses with grep

The great thing about “grep” is that you can search text-based data by regular expressions.

With these powerful search patterns, you can find the data you are interested in everywhere.

(Assuming, you are able to formulate a regular expression for the data you are searching for.)

And yes - often this is the hardest part to do: to formulate the regular expression.

Even to write a regular expression that matches all possible IP addresses and only them is far from trivial.

But for the sake of simplicity here, let’s cheat a bit and use a search pattern that is far from perfect but solves our problem perfectly well:

grep "^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" demolog

With this search pattern we are matching sequences of four decimal numbers, each separated from the next by a single dot:

  • [0-9] stands for a digit from zero to nine
  • [0-9]* stands for a digit from zero to nine repeated any number of times (also “zero times”)
  • \. stands for a literal dot (which needs to be escaped in a regular expression with a preceding backslash)

And since we are already cheating anyway, we can say that we are only interested in these numbers if they appear at the start of the line - hence the leading ^ in front of the regular expression.
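
If you want a slightly stricter pattern, grep’s extended regular expressions (the "-E" switch) let you write the same idea more compactly - still not a perfect IP validator (it would happily accept 999.999.999.999), but closer:

grep -E "^([0-9]{1,3}\.){3}[0-9]{1,3}" demolog

Here {1,3} limits every number to one to three digits, and the group ([0-9]{1,3}\.){3} repeats the “digits plus dot” part three times.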

OK Robert - that sounds great, but doesn’t grep print out _the_whole_line_ if the pattern matches?

Glad you asked! 😉

But yes - the typical behavior of grep is to print out the whole line if a match for the regular expression is found. So in our example we would get every single line of the file - not that useful at all.

But if you ask grep for help, it will be happy to assist you:

grep -o "^[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" demolog

The magic here is the "-o" command line switch. When using this, grep will - instead of printing out the whole line - print out just the part of the text that matches the regular expression.

And voila! The output now looks exactly the same as the output of the cut command from above.

183.136.225.14
119.28.239.222

As you see - as soon as you can describe the data you are interested in as a regular expression, you can use “grep” to extract only the data of interest.

Just add the “-o” command line switch.
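
A quick standalone sketch to see this behavior in isolation:

echo "request took 42 ms" | grep -o "[0-9][0-9]*"

This prints just "42" - only the matching part, not the whole line.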

Side note: although data extraction with grep is way more powerful than extracting data fields with cut, I’ll stick with “cut” for the rest of this article - just for a cleaner looking command line. (And isn’t simplicity always king?)

Put the lines in order with sort

Now that we have extracted all the IP addresses, we need to put them in order to process them further.

And if you need to “put something in order” - if you need to sort something at the Linux command line - then the tool “sort” is the way to go. (An easy to remember name, isn’t it?)

Sort lets you sort existing data in different ways - for instance alphabetically or numerically. For our task it’s quite enough to sort the addresses in a simple alphabetical way.
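
To see what this simple alphabetical sort buys us, here is a minimal sketch with three lines of toy data:

printf "b\na\nb\n" | sort

This prints "a", "b", "b" - identical lines end up right next to each other, and that is exactly what we will need in the following step.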

To do this, we just feed the output of the cut command through “sort”:

cut -d " " -f 1 demolog | sort

As a result, we will see a sorted list of all IP addresses contained in the log file.

...
94.102.49.193
94.102.49.193
94.102.49.193
94.102.56.18
94.158.220.2
94.199.18.198
94.199.18.198
94.199.18.198
...

As you see from the excerpt above, some IP addresses are responsible for a single request only while others have hit the website multiple times.

The next step now is to count the occurrences of every repeated IP address. (Do you remember that we wanna create a top 10 list of the IP addresses?)

Count duplicate lines with uniq

If we wanna count the repeated IP addresses, we now simply need to count the duplicates of every single line we have.

Every time you need to deal with duplicates at the Linux shell, the tool “uniq” is the way to go.

With “uniq” you can remove duplicates from text-based data, or you can just identify (show) the duplicates without deleting them.

And the third thing “uniq” can help you with is exactly the way we wanna use it here: counting the duplicates.

Just feed sorted data to "uniq -c", and uniq will give you every unique line it sees together with the number of its occurrences:

cut -d " " -f 1 demolog | sort | uniq -c

As a result you will see output lines like the following:

...
3 94.102.49.193
1 94.102.56.18
1 94.158.220.2
3 94.199.18.198
...

Every line is now printed out only once but with a leading number showing the count of every line.
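
One thing worth knowing here: uniq only compares adjacent lines - that is exactly why we sorted the data first. A quick sketch with unsorted toy data shows the difference:

printf "a\nb\na\n" | uniq -c

1 a
1 b
1 a

Without the sort, the two “a” lines are not combined, and the counts would be wrong.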

Almost there

Only two last steps left to complete our mission: Generate the top 10 list and print the result “nicely” to the console.

Use sort a second time, this time to sort by number

Now it’s time to sort the resulting output lines from the last step. But this time we cannot simply sort alphabetically - we need to sort numerically.

For this, just ask sort for assistance with the command line switch "-n":

cut -d " " -f 1 demolog | sort | uniq -c | sort -n

As you see in the output, the lines are perfectly sorted now:

...
12 62.210.123.29
13 138.197.131.66
14 173.208.244.92
14 198.204.240.244
15 107.150.52.198
...
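
In case you wonder why the plain alphabetical sort would not do here - a quick sketch:

printf "105\n12\n5\n" | sort

This prints "105", "12", "5" in exactly this order, because alphabetically the character “1” comes before “5”. With "sort -n" you get the expected "5", "12", "105" instead.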

And to get just the last ten lines here (remember, we wanna see a top 10 list), we ask the tail command for help.

Get only the last 10 lines with tail

This tool gives you by default just the last 10 lines of a file or data stream:

cut -d " " -f 1 demolog | sort | uniq -c | sort -n | tail

As you see, the output now ends with our “winner” - the IP address that is responsible for the most requests:

...
96 13.232.96.15
96 52.65.15.196
105 13.124.222.242
135 45.146.165.123
531 64.123.123.123
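
By the way, you can see tail’s default of ten lines with a tiny sketch:

seq 1 12 | tail

This prints just the numbers 3 to 12 - the last ten lines of the stream.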

But because every traditional “top 10 list” doesn’t end with the winner but orders the positions in descending order, we simply reverse our output data stream with tac:

cut -d " " -f 1 demolog | sort | uniq -c | sort -n | tail | tac

This looks much better:

531 64.123.123.123
135 45.146.165.123
105 13.124.222.242
96 52.65.15.196
96 13.232.96.15
...
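
If “tac” is new to you: the name is simply “cat” spelled backwards, and that is exactly what it does - it prints the lines of its input in reverse order. A minimal sketch:

printf "1\n2\n3\n" | tac

This prints "3", "2", "1".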

Hey Robert - are you serious about using “tac” here? Don’t you know the “-r” switch for sort?

Sure I do. I just wanted to show you one more useful tool here 😉

So ok - instead of reversing the sorted output with tac, we can simply sort our data numerically in descending order:

cut -d " " -f 1 demolog | sort | uniq -c | sort -nr

the "-r" switch stands for “reverse” here.

Get only the first 10 lines with head

Now that we have all the IPs ordered by the number of requests they made in descending order, we need to take the first 10 lines of the output instead of the last 10:

cut -d " " -f 1 demolog | sort | uniq -c | sort -nr | head

The tool “head” works like “tail”, but it gives you the first lines of some data instead of the last ones.

And just like tail, head will give you by default exactly ten lines of output. If you need a different number, just name it with the "-n" command line switch. With "head -n 5" you would get, for instance, only the first five lines.
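
So if you wanted, say, a top 5 list instead of a top 10 list, the pipeline would simply become:

cut -d " " -f 1 demolog | sort | uniq -c | sort -nr | head -n 5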

Now we could say that we have solved our task.

But we could also get a little bit fancier and make the output a little nicer …

Format the resulting data with awk

As a last step I wanna show you a fast way to format the output of structured data in every single way you can think of.

Maybe you need the resulting data as a CSV file, or you need the result formatted as JSON?

For this you should take a closer look at “awk”.

This tool gives you the power of a whole scripting language on its own.

But for the sake of the task here, let’s leverage the ability of “awk” to read in structured data with ease and to give you direct access to every single data field:

If you have structured data, where

  • every data set is formed by a single line and
  • every data field within every line is represented by a single column

then “awk” gives you direct access to every single data field via built-in variables ($1, $2, $3, and so on …).
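
A quick standalone sketch of these field variables:

echo "one two three" | awk '{print $2}'

This prints just "two" - awk has split the line at whitespace, and $2 addresses the second field.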

So for example with the following command line

cut -d " " -f 1 demolog | sort | uniq -c | sort -nr | head | awk '{print $1}'

we would just print out the first field of every line - the numbers:

531
135
105
96
96
...

And if we also print out the IP addresses from the second field and add a little bit of text as formatting, the result can be really impressive 😉

cut -d " " -f 1 demolog | sort | uniq -c | sort -nr | head | awk '{print $2 " --> " $1}'

This would give you an output like this:

64.123.123.123 --> 531
45.146.165.123 --> 135
13.124.222.242 --> 105
52.65.15.196 --> 96
...
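
And if you needed the result as a CSV file, as mentioned above, a minimal sketch could look like this (the header line "ip,requests" is just an example of my own choosing):

cut -d " " -f 1 demolog | sort | uniq -c | sort -nr | head | awk 'BEGIN {print "ip,requests"} {print $2 "," $1}'

The BEGIN block runs once before the first input line is read - a handy place for headers.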

Mission Accomplished

So to summarize: on the Linux shell, you can use the following command line to calculate the top 10 IP addresses that sent the most requests to a website, based on the web server’s log file:

cut -d " " -f 1 demolog | sort | uniq -c | sort -nr | head | awk '{print $2 " --> " $1}'

… without any scripting at all.

Have fun🙂

Here is what to do next

If you followed me through this article, you have certainly realized that knowing some internals about how things work at the Linux command line can save you a lot of time and frustration.

And sometimes it’s just fun to leverage these powerful mechanics.

If you wanna know more about such “internal mechanisms” of the Linux command line - written especially for Linux beginners - have a look at “The Linux Beginners Framework”.

In this framework I guide you through 5 simple steps to feel comfortable at the Linux command line.

This framework comes as a free PDF and you can get it here.

Wanna take an unfair advantage?

When it comes to working on the Linux command line, at the end of the day it is always about knowing the right tool for the right task.

And it is about knowing the tools that are most certainly available on the Linux system you are currently on.

To give you all the tools for your day-to-day work at the Linux command line, I have created “The ShellToolbox”.

This book gives you everything

  • from the very basic commands, through
  • everything you need for working with files and filesystems,
  • managing processes,
  • managing users and permissions, through
  • software management,
  • hardware analysis and
  • simple shell-scripting to the tools you need for
  • doing simple “networking stuff”.

Everything in one single, easy-to-read book. With explanations and example calls for illustration.

If you are interested, go to shelltoolbox.com and have a look (as long as it is available).