Non-trivial command-line fu
<file tr '[:punct:]' ' ' | tr '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | egrep '^[a-z]+$' | sort | uniq -c | sort -n
— -sh (@rtfmsh) February 9, 2013
Quick, what are the most frequently used words in Don Quixote? Sure, it's gotta be "the", "and", "to", "of", "don", and "quixote". But just what are the numbers? Does the tall tale of our noble knight-errant more frequently mention his celebrated steed or the lady of his captive heart? Eh? Eh?
The approach we're taking here is to first attempt to identify what exactly a "word" is. So we start by replacing punctuation with spaces. Next we split the input into line-based records, one token per line, fold everything to lowercase, and then weed out everything that doesn't look like an English word (which, for simplicity, we define to be a sequence of alphabetical characters). Following that is the usual sort(1)/uniq(1) dance.
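Spelled out one stage per line, the pipeline might look like the sketch below. The wordfreq name is made up for illustration, and grep -E stands in for egrep, which newer GNU grep releases flag as obsolescent; otherwise the stages are those of the one-liner above.

```shell
# Hypothetical helper wrapping the one-liner; reads text on stdin.
wordfreq() {
    tr '[:punct:]' ' ' |            # every punctuation character becomes a space
    tr '[:space:]' '\n' |           # any whitespace becomes a newline: one token per line
    tr '[:upper:]' '[:lower:]' |    # fold case so "The" and "the" count together
    grep -E '^[a-z]+$' |            # keep purely alphabetic tokens; drops blanks and numbers
    sort | uniq -c | sort -n        # count identical tokens, least frequent first
}

# Small made-up sample instead of the full book.
printf 'The knight and the squire. The END, at 1605!\n' | wordfreq
```

On the sample, "the" ends up on the last line with a count of 3, and "1605" never makes it past the grep stage.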
As so often, the problems with anything relating to human language lie in the flawed definitions we apply. We replace all punctuation with spaces, but that breaks up hyphenated words: "middle-aged" becomes two words, for example. Can we improve on that?
tr(1)'s [:punct:] class basically applies ispunct(3), which in turn tests for any printing character except space or a character for which isalnum(3) is true.
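To see which characters that covers, one can feed the printable ASCII range through tr(1) itself. A minimal sketch, assuming the C locale; the ascii_punct function name is hypothetical:

```shell
# Hypothetical helper: list the ASCII characters tr(1) treats as punctuation.
ascii_punct() {
    # Emit character codes 33 through 126 (printable ASCII minus space),
    # followed by a newline.
    awk 'BEGIN { for (i = 33; i < 127; i++) printf "%c", i; print "" }' |
        # -c complements the set, -d deletes: everything that is neither
        # punctuation nor a newline is removed. LC_ALL=C pins the locale.
        LC_ALL=C tr -cd '[:punct:]\n'
}

ascii_punct
```

In the C locale this leaves 32 characters, the hyphen and the apostrophe among them, which is why "middle-aged" falls apart.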
So we could try to replace '[:punct:]' with an explicit set that leaves out the hyphen and the apostrophe, such as '`~!@#$%^&*()_+=|\\}\]{\[:;<,>.?/"', but of course then we have to account for the various special cases in the text, such as passages where what would usually be printed as an em dash is set off from the text using "--":
Those whom I have inspired with love by letting them see me, I have by words undeceived, and if their longings live on hope--and I have given none to Chrysostom or to any other--it cannot justly be said that the death of any is my doing, for it was rather his own obstinacy than my cruelty that killed him;
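One way to handle both cases is sketched below. It is not the explicit character set from above but a swapped-in variant: "--" is turned into a separator first, and then everything that is neither a letter nor a hyphen ends a token. The wordfreq_hyphen name is hypothetical, and apostrophes still split words ("don't" becomes "don" and "t"), just as in the original pipeline.

```shell
# Hypothetical variant of the pipeline that keeps hyphenated words intact.
wordfreq_hyphen() {
    sed 's/--/ /g' |                  # "--" stands in for a dash: make it a separator
    tr '[:upper:]' '[:lower:]' |      # fold case
    tr -cs 'a-z-' '\n' |              # anything but a letter or "-" ends a token;
                                      # -s squeezes runs of newlines into one
    grep -E '^[a-z]+(-[a-z]+)*$' |    # letters, optionally joined by single hyphens
    sort | uniq -c | sort -n          # count identical tokens, least frequent first
}

# Small made-up sample containing both a hyphenated word and "--".
printf 'A middle-aged knight--and a Middle-Aged squire--rode on.\n' | wordfreq_hyphen
```

Here "middle-aged" is counted twice as a single word, while "knight" and "and" remain separate tokens.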
Language is messy. Our solution involves trade-offs and approximations of what we want. As so often, "good enough" has to do.
A more complete approach to word frequencies is, of course, a Word Cloud. The programs generating these have to account for such issues, and are generally quite a bit smarter. For example, they ignore common words (such as "the", "and", etc.) and group similar words ("learned", "learns", "learning" all become "learning"). Here's a word cloud for "Don Quixote" generated by TagCrowd:
But to answer at least one of our initial questions:
$ <don-quixote tr '[:punct:]' ' ' | tr '[:space:]' '\n' | \
tr '[:upper:]' '[:lower:]' | egrep '^[a-z]+$' | \
sort | uniq -c | sort -n | \
egrep -w "(rocinante|dulcinea|sancho|quixote)"
210 rocinante
292 dulcinea
2205 sancho
2327 quixote
$
2013-02-09