-sh

Non-trivial command-line fu

awk '!_[$0]++' file

— -sh (@rtfmsh) February 13, 2013

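This little command to remove duplicate lines from the input without using sort(1) and uniq(1) has made the rounds on the internet for a while. I last encountered it in June of 2012, and there are many online resources explaining the details, but let's just quickly summarize:

- awk evaluates the expression for every input line and, since no action is given, prints the line whenever the expression is true
- $0 is the entire current line, and _ is just a tersely named associative array indexed by it
- an array element that has not been set yet evaluates to 0 in a numeric context
- ++ is a post-increment: the expression sees the old value, and only then is the element incremented
- ! negates that old value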

Hence: upon first encountering a line, the value of _[$0] will be 0; we negate that value, yielding 1 (i.e., true), and thus print the line. We then increment that value. Any subsequent time we encounter this line, we will have a value larger than 0, which, when negated, becomes false, and the output of the line is thus suppressed.
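If the one-liner is too terse for your taste, the same logic can be spelled out. This is just a sketch of an equivalent long form; the function name dedup and the array name seen are mine, not part of the original command:

```shell
# Long-form equivalent of awk '!_[$0]++'.
# "dedup" and "seen" are illustrative names.
dedup() {
    awk '{
        if (seen[$0] == 0) {   # an unset element compares equal to 0
            print              # first occurrence: print the line
        }
        seen[$0]++             # count this occurrence
    }' "$@"
}

# Prints b, a, c: the first occurrence of each line, in input order.
printf 'b\na\nb\nc\na\n' | dedup
```

The terse version folds the test, the print and the increment into a single pattern expression, which is where the brevity comes from.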

So this is nifty, but other than geek cred, what do we get from using this? In the last three posts, we've done a lot of sort | uniq-ing. This combination of commands (with or without uniq -c or sort -n thrown into the mix) is ubiquitous. It is also very readable and clear.

But it has one significant disadvantage: in order to sort the input, the entire input either has to fit into memory or has to be stored, in chunks, in temporary files. If you are on a read-only file system or simply do not have much disk space left, you are screwed:

$ du -h l
542M    l
$ df -h .
Filesystem        Size       Used      Avail %Cap Mounted on
/dev/xbd3a         23G        21G       364M  98% /
$ time sort l | uniq

/: write failed, file system is full
sort: No space left on device
/: write failed, file system is full
   39.95s real     1.98s user     1.43s system
$

Sorting is also slow, and, if you are only interested in removing duplicate lines, entirely unnecessary. The only reason you're sorting the data is that uniq(1) can only detect duplicates if they're adjacent -- you're not actually interested in the data being sorted. In fact, there are cases where you explicitly do not want the data to be sorted, i.e., you need it to retain its original order.
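To see the adjacency limitation in action, here is a tiny example. The helper name sample and its five lines of input are made up for illustration:

```shell
# Made-up sample input: "b" and "a" repeat, but only one pair of
# duplicates ("b b") is adjacent.
sample() { printf 'b\na\nb\nb\na\n'; }

# uniq alone only collapses the adjacent pair.
sample | uniq              # prints: b a b a

# The awk idiom catches every repeat and keeps the original order.
sample | awk '!_[$0]++'    # prints: b a
```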

So the awk '!_[$0]++' trick solves those problems:

$ time awk '!_[$0]++' l >/dev/null
   11.31s real     8.70s user     1.08s system
$ 

Now one thing to note is that the output of sort | uniq and that of awk '!_[$0]++' are not going to be identical, since in the first case the data was sorted, while in the second it wasn't. That is, awk '!_[$0]++' is not equivalent to sort | uniq; it behaves like a uniq(1) that also catches non-adjacent duplicates and keeps the first occurrence of each line in its original position -- with the additional benefit of being able to take unsorted input of much larger size.
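If you want to convince yourself that both approaches yield the same set of lines and differ only in order, you can sort the awk output and compare. A rough sketch follows; the helper name same_lines is hypothetical, cksum merely stands in for a full comparison, and mktemp is assumed to be available:

```shell
# Hypothetical helper: succeeds if both approaches produce the same
# set of lines for the file given as $1.
same_lines() {
    a=$(LC_ALL=C sort "$1" | uniq | cksum)
    b=$(awk '!_[$0]++' "$1" | LC_ALL=C sort | cksum)
    [ "$a" = "$b" ]
}

# Small demo on a throwaway file.
tmp=$(mktemp)
printf 'b\na\nb\nc\na\n' > "$tmp"
same_lines "$tmp" && echo "same set of lines"
rm -f "$tmp"
```

Of course this sorts the data after all, so it is only useful as a sanity check on input small enough to sort.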

Note also that for very large input, you may still run out of memory: if the unique lines in your input do not all fit into memory (as they have to, since each unique line is now a key in your hash), then awk will croak. In that case, and if you happen to have sufficient disk space, using sort | uniq may actually be your only solution. As usual, no silver bullet.
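If you end up falling back to sort(1) but still need the original order, one possible workaround is to number the lines, sort by content, keep the lowest-numbered line of each group, and then sort by number again. This is not from the original tweet, just a sketch; the function name dedup_keep_order is made up, and it needs the same temporary disk space as any other large sort:

```shell
# Order-preserving dedup that leans on sort's temporary files instead
# of awk's memory. "dedup_keep_order" is an illustrative name.
dedup_keep_order() {
    tab=$(printf '\t')
    # 1. Prefix every line with its line number and a tab.
    awk '{ print NR "\t" $0 }' "$@" |
        # 2. Group identical lines together, lowest line number first.
        LC_ALL=C sort -t "$tab" -k2 -k1,1n |
        # 3. Keep only the first line of each group (constant memory).
        awk '{
            key = substr($0, index($0, "\t") + 1)
            if (NR == 1 || key != prev) print
            prev = key
        }' |
        # 4. Restore the original order and strip the line numbers.
        LC_ALL=C sort -n |
        cut -f2-
}

# Prints b, a, c, just like awk '!_[$0]++' would.
printf 'b\na\nb\nc\na\n' | dedup_keep_order
```

This costs two sorts instead of none, so it only makes sense once the hash no longer fits into memory.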

2013-02-13

