-sh

Non-trivial command-line fu

<f tr '[:punct:]' ' '|tr '[:space:]' '\n'|tr '[:upper:]' '[:lower:]'|egrep '^[a-z]+$'|sort|uniq -c|awk '{print length() " " $2;}'|sort -n

— -sh (@rtfmsh) February 10, 2013

Picking up where we left off yesterday, we can extract some more information from our input text. Let's try to ascertain what the longest word used by Shakespeare in all his works might be. Here, we suppress the count of the unique words found (with all the flaws previously discussed) and instead print each word together with its length. Important trivia thusly derived: the bard's four longest words are:

$ <shakespeare tr '[:punct:]' ' ' | tr '[:space:]' '\n' |       \
        tr '[:upper:]' '[:lower:]' | egrep '^[a-z]+$' |         \
        sort | uniq | awk '{ print length() " " $1; }' |        \
        sort -n | tail -4
17 anthropophaginian
17 indistinguishable
17 undistinguishable
27 honorificabilitudinitatibus
$ 

(The last one is not actually a typo or an error in our character translation, you ignoramus!)
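If you want to convince yourself of what each stage does, here's the same pipeline run on a toy input (the sample sentence is my own, not read from the shakespeare file):

```shell
# Tokenize, lowercase, dedupe, and print the four longest words --
# same pipeline as above, fed a single hand-picked sentence.
echo "To be, or not to be: that is the Question." |
    tr '[:punct:]' ' ' | tr '[:space:]' '\n' |
    tr '[:upper:]' '[:lower:]' | grep -E '^[a-z]+$' |
    sort | uniq | awk '{ print length() " " $1; }' |
    sort -n | tail -4
# 3 not
# 3 the
# 4 that
# 8 question
```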

What is the most common word length, you ask? Why, out of the 23790 distinct words found, it appears to be 7:

$ <shakespeare tr '[:punct:]' ' ' | tr '[:space:]' '\n' |       \
        tr '[:upper:]' '[:lower:]' | egrep '^[a-z]+$' |         \
        sort | uniq | wc -l
  23790
$ <shakespeare tr '[:punct:]' ' ' | tr '[:space:]' '\n' |       \
        tr '[:upper:]' '[:lower:]' | egrep '^[a-z]+$' |         \
        sort | uniq | awk '{ print length(); }' | sort -n  |    \
        uniq -c | sort -n
   1 27
   3 17
   4 16
  23 1
  31 15
  52 14
 121 2
 196 13
 430 12
 673 3
 956 11
1700 10
2024 4
2651 9
3027 5
3611 8
4003 6
4284 7
$ 

But that's not very pretty. Let's "draw a graph":

$ <shakespeare tr '[:punct:]' ' ' | tr '[:space:]' '\n' |       \
        tr '[:upper:]' '[:lower:]' | egrep '^[a-z]+$' |         \
        sort | uniq | awk '{ print length(); }' | sort -n  |    \
        uniq -c | perl -le 'while (<>) {
                          m/(.*) (.*)/;
                          printf("$2 %s\n", "x" x (($1 > 50) ? $1/50 : 1)); }'
 1 x
 2 xx
 3 xxxxxxxxxxxxx
 4 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 5 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 6 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 7 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 8 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 9 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
10 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
11 xxxxxxxxxxxxxxxxxxx
12 xxxxxxxx
13 xxx
14 x
15 x
16 x
17 x
27 x

With this little perl trick (previously mentioned here), we print one 'x' for every 50 words of a given length, with a minimum of one 'x' per bucket. (Feel free to suggest awk(1) or other solutions.) As usual, we're cheating and losing precision in favor of a prettier representation, but that's alright; "pretty" doesn't have to be precise as long as we get the right result.
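Since we asked: here is one possible awk(1) way to do the same thing. A sketch only -- for illustration it is fed a few hand-picked sample lines in uniq -c format rather than the real pipeline output:

```shell
# One awk(1) alternative to the perl snippet: read "count length"
# lines as produced by uniq -c, print the length followed by one 'x'
# per 50 occurrences (but always at least one).
# The three input lines below are made-up samples, not real data.
printf '%s\n' '   1 27' ' 121 2' '4284 7' |
    awk '{
        n = ($1 > 50) ? int($1 / 50) : 1;   # how many x-es to draw
        bar = "";
        while (n-- > 0) bar = bar "x";
        printf("%s %s\n", $2, bar);
    }'
# 27 x
# 2 xx
# 7 xxxxx... (85 x-es)
```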

2013-02-10

