-sh

Non-trivial command-line fu

<file tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | fold -w1 | sort | uniq -c | sort -n

— -sh (@rtfmsh) February 12, 2013

After having sorted out word frequency as well as word length and word-length frequency, it's now time to look at the letter frequency of input texts. Practical application: if you were playing "Wheel of Fortune", which letters would it make sense to guess first?

$ <shakespeare tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
        fold -w1 | sort | uniq -c | sort -n  | tail -5
255038 i
290083 a
315836 o
331139 t
448878 e
$ 
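
Spelled out stage by stage, the pipeline might be read as follows. This is only a sketch: the function name letter_freq is made up for illustration and does not appear in the original one-liner. Note that punctuation and digits survive the filters and get counted as well; they just do not make it into the top five here.

```shell
# letter_freq: hypothetical helper name, used here for illustration only.
# Reads text on stdin and prints "count character" lines, least frequent first.
letter_freq() {
    tr -cd '[:print:]' |          # delete everything that is not printable (newlines included)
    tr '[:upper:]' '[:lower:]' |  # lowercase, so 'E' and 'e' are counted together
    tr -d ' ' |                   # delete spaces
    fold -w1 |                    # wrap at width 1: one character per line
    sort | uniq -c |              # group identical lines and count them
    sort -n                       # order by count, ascending
}

# The last two lines report 'o' (2 occurrences) and 'l' (3 occurrences).
printf 'Hello, World\n' | letter_freq | tail -2
```

How accented letters such as ä or é are treated depends on the locale and on the tr(1) implementation, so the counts for non-English texts are best taken as approximate.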

Of course the letter frequency will vary depending on the input text, and more so depending on the input language. For example, using the German original text of Faust, you get a slightly different distribution:

$ <faust tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
        fold -w1 | sort | uniq -c | sort -n  | tail -5
10749 t
11013 r
13098 i
15182 n
24759 e
$ 

Italian: Using Dante's Inferno:

$ <inferno tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
        fold -w1 | sort | uniq -c | sort -n  | tail -5
 9621 n
13967 o
14493 i
15217 a
16992 e
$

French: Using Jules Verne's 20000 Lieues sous les mers:

$ <verne tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
        fold -w1 | sort | uniq -c | sort -n  | tail -5
 50879 i
 52860 n
 54950 a
 60067 s
102821 e
$

Spanish: Back to El ingenioso hidalgo don Quijote de la Mancha:

$ <don-quijote tr -cd '[:print:]' | tr '[:upper:]' '[:lower:]' | tr -d ' ' | \
        fold -w1 | sort | uniq -c | sort -n  | tail -5
109490 n
126489 s
154709 o
194365 a
223827 e
$
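
Since these texts differ a lot in size, the raw counts are hard to compare across languages. One possible way around that is to print each letter's share of the total instead. The sketch below is an assumption-laden variant, not the original one-liner: the name letter_share is hypothetical, and it filters on '[:alpha:]' rather than '[:print:]' so that only letters enter the total.

```shell
# letter_share: hypothetical helper, not part of the original post.
# Reads text on stdin and prints each letter's percentage of all letters,
# least frequent first.
letter_share() {
    tr -cd '[:alpha:]' |          # keep letters only (differs from the original filter)
    tr '[:upper:]' '[:lower:]' |
    fold -w1 | sort | uniq -c | sort -n |
    awk '{ count[NR] = $1; letter[NR] = $2; total += $1 }
         END { for (i = 1; i <= NR; i++)
                   printf "%5.2f%% %s\n", 100 * count[i] / total, letter[i] }'
}

# Tiny demonstration: two of the three letters are 'a'.
printf 'aab\n' | letter_share
```

Used on the files above, that would be something like "<faust letter_share | tail -5".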

fold(1) is one of those rarely used commands that you find out about only by coincidence, I think. More frequently used may be fmt(1), with its useful vi(1) application:

!}fmt
(That'd be roughly equivalent to gqap in vim(1) -- which, by the way, is different from vi(1), and if I invoke vi I really, really do not want it to have syntax highlighting and all that other nonsense, thank you very much, Linux.)
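
To make the difference between the two commands concrete, here is a small sketch. The sample sentence and the widths are arbitrary choices for illustration, and since fmt(1) is not specified by POSIX, its exact line breaks may differ between systems.

```shell
text='the quick brown fox jumps over the lazy dog'

# fold cuts at exactly the given width, even in the middle of a word;
# the first line of output here is "the quic".
printf '%s\n' "$text" | fold -w 8

# With -w1, as in the letter-frequency pipeline, every character
# ends up on a line of its own.
printf 'abc\n' | fold -w1

# fmt refills the text word by word, keeping lines within the given
# width without splitting words.
printf '%s\n' "$text" | fmt -w 20
```

That is why fold(1) suits the character-splitting job above, while fmt(1) is the one you want for rewrapping a paragraph of prose.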

2013-02-12

