Non-trivial command-line fu
awk '!_[$0]++' file
— -sh (@rtfmsh) February 13, 2013
This little command to remove duplicate lines from the input without using sort(1) and uniq(1) has made the rounds on the internet for a while. I last encountered it in June of 2012, and there are many online resources explaining the details, but let's just quickly summarize:
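awk evaluates the expression _[$0]++ for every input line. Here _ is nothing special, just an (arbitrarily named) associative array, and $0 is the current line, so the expression looks up the line in the array and post-increments its value. The first time a given line is seen, the stored value is 0, the post-increment yields that 0, and the leading ! turns the pattern true, so awk performs its default action and prints the line. Every later occurrence of the same line yields a non-zero value, the pattern is false, and the line is suppressed.

Spelled out with an explicit action and a more descriptive array name (seen here is merely a stand-in for _), the one-liner amounts to:

$ awk '!seen[$0] { print; seen[$0] = 1 }' file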
So this is nifty, but other than geek creds, what do we get from using this? In the last three posts, we've done a lot of sort | uniq -ing. This combination of commands (with or without uniq -c or sort -n thrown into the mix) is ubiquitous. It is also very readable and clear.
But it has one significant disadvantage: in order to sort the input, the entire input either has to fit into memory, or chunks of it have to be stored in temporary files. If you are on a read-only file system or simply do not have much disk space left, you are screwed:
$ du -h l
542M l
$ df -h .
Filesystem Size Used Avail %Cap Mounted on
/dev/xbd3a 23G 21G 364M 98% /
$ time sort l | uniq
/: write failed, file system is full
sort: No space left on device
/: write failed, file system is full
39.95s real 1.98s user 1.43s system
$
Sorting is also slow and, if you are only interested in removing duplicate lines, entirely unnecessary. The only reason you're sorting the data is that uniq(1) can only detect duplicates if they are adjacent -- you're not actually interested in the data being sorted. In fact, there are cases where you explicitly do not want the data to be sorted, i.e. you need it to retain its original order.
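A quick illustration of the difference, using a tiny bit of made-up input:

$ printf 'b\na\nb\na\n' | sort | uniq
a
b
$ printf 'b\na\nb\na\n' | awk '!_[$0]++'
b
a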
So the awk '!_[$0]++' trick solves those problems:
$ time awk '!_[$0]++' l >/dev/null
11.31s real 8.70s user 1.08s system
$
Now one thing to note is that the output of sort | uniq and that of awk '!_[$0]++' are not going to be identical: in the first case the data was sorted, in the second it wasn't. That is, awk '!_[$0]++' is not a drop-in replacement for sort | uniq; it behaves more like uniq(1) in that it preserves the input order -- with the additional benefits of catching duplicates even when they are not adjacent and of handling unsorted input of much larger size.
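If you want to convince yourself that both approaches yield the same set of unique lines, you can sort the awk output and compare the results (the file and output names here are, of course, just placeholders):

$ awk '!_[$0]++' file | sort >out.awk
$ sort file | uniq >out.uniq
$ cmp out.awk out.uniq && echo identical
identical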
Note also that for very large input, you may still run out of memory: if all unique lines in your input do not actually fit into memory (as they have to, since each unique line is now a key in your hash), then awk will croak. In that case, and if you happen to have sufficient disk space, using sort | uniq may actually be your only solution. As usual, no silver bullet.
2013-02-13