Non-trivial command-line fu
<file tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | fold -w1 | sort | uniq -c | sort -n
— -sh (@rtfmsh) February 12, 2013
After having sorted out word frequency as well as word-length and word-length frequency, it's now time to look at letter frequency of input texts. Practical application: if you were playing "Wheel of Fortune", which letters would it make sense to guess first?
$ <shakespeare tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
fold -w1 | sort | uniq -c | sort -n | tail -5
255038 i
290083 a
315836 o
331139 t
448878 e
$
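For reference, here is the same pipeline broken out stage by stage. This is only a sketch: the function name letterfreq is made up for illustration, and it swaps the '[:print:]' filter for '[:alpha:]' so that digits and punctuation are not counted alongside the letters. Depending on your tr(1) implementation and locale, accented letters may get dropped by that filter.

```shell
# letterfreq: hypothetical helper, counts how often each letter occurs on stdin
letterfreq() {
    tr -cd '[:alpha:]' |            # keep only letters; drops digits, punctuation, whitespace
    tr '[:upper:]' '[:lower:]' |    # fold case so 'E' and 'e' are counted together
    fold -w1 |                      # emit one character per line
    sort |                          # group identical letters so uniq can count them
    uniq -c |                       # prefix each distinct letter with its count
    sort -n                         # order by count, most frequent last
}

# Show the two most frequent letters of a small sample
printf 'Hello, World 123\n' | letterfreq | tail -2
```

With the sample input, 'l' (three occurrences) ends up last, right after 'o' (two occurrences); the exact padding of the count column depends on your uniq(1).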
Of course the letter frequency will vary depending on the input text, and more so depending on the input language. For example, using the German original text of Faust, you get a slightly different distribution:
$ <faust tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
fold -w1 | sort | uniq -c | sort -n | tail -5
10749 t
11013 r
13098 i
15182 n
24759 e
$
Italian: Using Dante's Inferno:
$ <inferno tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
fold -w1 | sort | uniq -c | sort -n | tail -5
9621 n
13967 o
14493 i
15217 a
16992 e
$
French: Using Jules Verne's 20000 Lieues sous les mers:
$ <verne tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
fold -w1 | sort | uniq -c | sort -n | tail -5
50879 i
52860 n
54950 a
60067 s
102821 e
$
Spanish: Back to El ingenioso hidalgo don Quijote de la Mancha:
$ <don-quijote tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
fold -w1 | sort | uniq -c | sort -n | tail -5
109490 n
126489 s
154709 o
194365 a
223827 e
$
fold(1) is one of those rarely used commands that you find out about only by coincidence, I think. More frequently used may be fmt(1), with its useful vi(1) application:
!}fmt
(That'd be equivalent to gqap in vim(1) -- which, by the way, is different from vi(1), and if I invoke vi I really, really do not want it to have syntax highlighting and all that other nonsense, thank you very much, Linux.)
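If you haven't used either command, a quick side-by-side may help. This is a small sketch with a made-up sample sentence; the exact line breaks fmt(1) picks can differ between implementations, so no output is shown for it.

```shell
text='the quick brown fox jumps over the lazy dog'

# fold breaks at exactly the given width, even in the middle of a word
printf '%s\n' "$text" | fold -w 10

# fold -s prefers to break after a space within that width
printf '%s\n' "$text" | fold -s -w 10

# fmt reflows whole words so that lines fit the target width
printf '%s\n' "$text" | fmt -w 20

# fold -w1 is the degenerate case used above: one character per line
printf 'abc\n' | fold -w1
```

fold(1) only cares about column counts, which is exactly what makes -w1 handy for splitting text into single characters; fmt(1) works on words and paragraphs, which is what you want when rewrapping prose.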
2013-02-12