Kenneth Church, "Unix for Poets" [pdf]. Please note that some of the syntax in this document is deprecated: tail -n +2 should be used instead of tail +2.
William Poser, Unix Tools for Linguists. Note that the outgoing links for individual commands are broken, but the page provides an excellent summary.
Extracting a list of word types and their frequencies is among the most basic operations you will perform on a corpus. How can we accomplish this? The central step is breaking text lines, which contain multiple words each, into many more lines containing exactly one word each. In English, word boundaries are marked with a space; what we need to do, then, is convert every space character into a newline character. There are additional considerations, such as neutralizing case distinctions and the treatment of punctuation. Below, we will learn the necessary commands.
First off, your locale needs to be set to en_US.ISO-8859-1 for the commands below to behave precisely the way they should. Follow these instructions (OS-X here, cygwin here).
(OS-X) Installing a newer version of sed
It turns out that the version of sed included in OS-X is ancient (pre v.3.02, a BSD implementation) and does not support '\n', '\t', and so on. The current working version is 4.1.5-11 (the GNU implementation). The best way to update these Unix software packages on OS-X is through Fink. Download and install Fink, and then you can update sed through its interface.
The newer version is called gsed (for GNU sed) and is installed as /sw/bin/gsed. You can automatically call this command whenever you use sed by setting an alias in your .bash_profile file. Using pico, edit the file to include the following line:
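The alias line itself does not survive in this copy of the page; presumably it is something like the following (a sketch, assuming the Fink install path /sw/bin/gsed mentioned above):

```shell
# Presumed ~/.bash_profile line: make 'sed' invoke the newer GNU sed from Fink
alias sed='/sw/bin/gsed'
```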
Translating characters using tr
- tr is a unix utility that translates characters into other characters. The following command will replace every 'o' character in the file with 'x':
tr 'o' 'x' < austen-emma.txt
Unlike most other unix commands, tr accepts input only on standard input, which means this will NOT work: tr 'o' 'x' austen-emma.txt. Instead, you need to explicitly feed the source file to tr as standard input with the 'less than' symbol '<'. An alternative is to use the cat command to pipe the file content into tr. (Note: more on '<' and cat at the bottom of this page.)
cat austen-emma.txt | tr 'o' 'x'
The echo command comes in handy for quick trial runs and debugging: (Note: $ is my command-line prompt and is not part of the command syntax.)
$ echo "Hello world" | tr 'o' 'x'
Hellx wxrld
- The newline character is most commonly represented as '\n'. Therefore, to convert every space into a new line (thereby getting one word per line), you can issue the following:
$ echo "It's 12 o'clock now." | tr ' ' '\n'
It's
12
o'clock
now.
- Instead of two single characters, you can specify two matching sets of characters:
$ echo "It's 12 o'clock now. Call Ted." | tr 'aeiou' 'AEIOU'
It's 12 O'clOck nOw. CAll TEd.
Because of this property of tr syntax, it is most commonly used as a quick method of lowercase/uppercase conversion:
$ echo "It's 12 o'clock now. Call Ted." | tr '[A-Z]' '[a-z]'
it's 12 o'clock now. call ted.
- At this point, you might be tempted to formulate something like the following, thinking it would replace all instances of 'that' with 'this' in the text:
$ echo "I like that hat." | tr 'that' 'this'
I like shis his.
(Puzzled? Go back and examine the 'aeiou' example above.) This is because tr operates on individual characters ('t', 'h', ...) and not strings ('that'). Converting the string 'that' into another string 'this' is a string-based operation; for this, we need sed.
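The two behaviors can be put side by side (a quick sketch):

```shell
# tr maps each character of 'that' to the corresponding character of 'this':
# 't'->'t', 'h'->'h', 'a'->'i', 't'->'s' (the later 't'->'s' mapping wins),
# so every t, h, and a anywhere in the input is affected:
echo "I like that hat." | tr 'that' 'this'
# -> I like shis his.

# sed substitutes the string 'that' as a whole, leaving other words alone:
echo "I like that hat." | sed 's/that/this/g'
# -> I like this hat.
```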
Transforming strings using sed
- sed (short for 'stream editor') is a powerful tool used for transforming streaming text. For example, you can replace all instances of 'Emma' with 'Queen Elizabeth' in this corpus file (Jane Austen's Emma):
sed 's/Emma/Queen Elizabeth/g' austen-emma.txt
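A quick echo-based trial of the same substitution (the sentence here is made up):

```shell
# 'g' makes the substitution apply to every occurrence on the line
echo "Emma met Emma at noon." | sed 's/Emma/Queen Elizabeth/g'
# -> Queen Elizabeth met Queen Elizabeth at noon.
```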
- sed syntax is as follows:
|sed 's/string1/string2/' file ||Prints out lines of file while substituting ('s') string1 with string2 once per line|
|sed 's/string1/string2/g' file ||Same as above, but string replacement is done globally ('g') throughout line|
|sed -r 's/string1/string2/g' file ||Same as above, but strings contain (extended) regular expression ('-r') |
|sed 's/.../.../g; s/.../.../g' file ||Separate multiple transformations with ';'|
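A quick trial of the last row, chaining two global substitutions in one sed call (made-up sentence):

```shell
# Two substitutions, separated by ';', each applied globally
echo "good morning, good night" | sed 's/good/bad/g; s/night/day/g'
# -> bad morning, bad day
```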
- Without the 'g' switch, sed's default behavior is to perform only one replacement per line and stop:
$ echo "laa dee daa" | sed 's/aa/uu/'
luu dee daa
For our corpus processing purposes, we need such transformations to be applied to every applicable instance. Therefore, it is highly recommended that you get into the habit of supplying 'g'.
$ echo "laa dee daa" | sed 's/aa/uu/g'
luu dee duu
- When you want to apply multiple transformations, you can separate them with a semicolon ';':
sed 's/Emma/Juliet/g; s/Mr. Knightley/Romeo/g' austen-emma.txt
- The '-r' switch signals that the strings are to be interpreted as (extended) regular expressions. With the help of the regular expression formalism, we can match beyond literal strings. For example, regular expression 'o+' matches not only 'o' but also 'oo', 'ooo', 'oooo', and so on:
$ echo "Ooooooh, what a wonderful book." | sed -r 's/o+/_/g'
O_h, what a w_nderful b_k.
Regular expressions are extremely powerful; they let users capture an infinite set of strings with a finite description. In addition, they provide many handy shortcuts that represent a group of characters.
For example, [:punct:] stands for all punctuation symbols in English, and the following command maps any sequence of punctuation symbols to a space:
$ echo "It's 12 o'clock now...! Call Ted." | sed -r 's/[[:punct:]]+/ /g'
It s 12 o clock now Call Ted
- Lastly, the following sed command is identical in effect to tr ' ' '\n' above.
$ echo "It's 12 o'clock now." | sed -r 's/ /\n/g'
It's
12
o'clock
now.
Putting it all together: compiling a tokenized word file
- The goal is to extract a tokenized word file from austen-emma.txt, with one word per line. First we start out by noting the total # of lines and words in the file. There are 16,823 lines and 158,167 words in this text:
$ wc austen-emma.txt
16823 158167 887071 austen-emma.txt
- Let's begin by converting all uppercase letters into lowercase. (Some might prefer not to lowercase proper nouns, but for now we're applying this across the board.) It's a good idea to pipe the output into less, more or head so we can eyeball the results along the way:
tr '[A-Z]' '[a-z]' < austen-emma.txt | more
- Now let's get rid of all punctuation symbols by converting them into a space. (Again, best practice is to tokenize them apart rather than get rid of them entirely; we will learn how to do this later.)
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | more
Note that the '+' syntax was not used here; this causes '...!' to be translated into four spaces, which will in turn become empty lines in our next step. But don't worry -- the empty lines will be eliminated at the end.
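The difference the '+' makes can be seen directly (a sketch; GNU sed's -r is assumed, per the setup above):

```shell
# Without '+', each of the four punctuation characters becomes its own space:
echo "now...!" | sed -r 's/[[:punct:]]/ /g'
# -> "now    " (four trailing spaces)

# With '+', the whole punctuation run collapses into a single space:
echo "now...!" | sed -r 's/[[:punct:]]+/ /g'
# -> "now "
```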
- Next, we separate words into individual lines by converting spaces into a new line character: (The two sed commands can be collapsed into one as sed -r 's/[[:punct:]]/ /g; s/ /\n/g' instead, but it's just as easy to pipe.)
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | more
- This looks almost done, but you will notice that the output contains many empty lines. To get rid of them, you can use grep. Not surprisingly, grep too can take regular expression arguments; the switch that enables (extended) regular expressions is '-E'. An empty line is represented as '^$'. (More on this later.) Therefore, the grep command that keeps everything except empty lines is grep -v -E '^$':
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | grep -v -E '^$' | more
- The output looks good! We'll save it into a file now:
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | grep -v -E '^$' > austen-emma.words
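Since the pipeline is long, here is the same sequence run on a tiny made-up sample, so each stage can be tried without the corpus file:

```shell
# lowercase -> punctuation to spaces -> spaces to newlines -> drop empty lines
echo "Emma said: 'Hello, hello!'" \
    | tr '[A-Z]' '[a-z]' \
    | sed -r 's/[[:punct:]]/ /g' \
    | sed -r 's/ /\n/g' \
    | grep -v -E '^$'
# -> emma, said, hello, hello -- one word per line
```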
- This is what the tokenized word file looks like:
$ head -15 austen-emma.words
- Now let's see how many words there are, this time by line-counting:
$ wc -l austen-emma.words
There are more words now -- why is this? (Hint: What happened to words like "you're"?)
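The hint can be checked directly: the apostrophe is a punctuation symbol, so a contraction is split into two lines (sketch, GNU sed assumed):

```shell
# "you're" -> "you re" -> two lines, hence a higher word count
echo "you're" | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g'
# -> you
#    re
```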
More ways than one: comparing files with diff
- As is often the case in text processing tasks, there are many alternative ways to obtain the same or a near-identical result. In Unix for Poets, Church accomplishes word tokenization using tr alone ('\012' and '\n' both represent the newline character):
tr '[A-Z]' '[a-z]' < austen-emma.txt | tr -sc '[a-z]' '\012' > austen-emma.church.words
- His formulation differs from ours in that (1) it also strips out numbers, and (2) it fails to strip square brackets (in tr, the brackets in '[a-z]' are literal characters, so they count as part of the kept set). How can we compare the two results? We can use diff to compare the two output files:
$ diff austen-emma.words austen-emma.church.words | tail
- When the contents of two files are identical, diff prints out nothing:
$ head -200 austen-emma.txt | tail -10 > dada
$ tail -n +191 austen-emma.txt | head -10 > baba
$ diff dada baba
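The same identity can be checked on a generated file (a sketch using seq; both commands pick out lines 191-200):

```shell
tmp=$(mktemp -d)
seq 1 300 > "$tmp/nums"
head -200 "$tmp/nums" | tail -10   > "$tmp/dada"  # first 200 lines, last 10 of those
tail -n +191 "$tmp/nums" | head -10 > "$tmp/baba" # from line 191 on, first 10
diff "$tmp/dada" "$tmp/baba"                      # prints nothing: identical
```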
- A question: How should we modify our original command formula so that it eliminates numerals as well? Hint: there are regular expressions that represent digits, à la [a-z] and [:punct:]. Try this page.
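One possible answer (a sketch; [:digit:] is the POSIX class for 0-9, parallel to [:punct:]):

```shell
# Treat digits like punctuation: map runs of either to a space
echo "It's 12 o'clock now." | sed -r 's/[[:punct:][:digit:]]+/ /g'
# -> "It s   o clock now " (the 12 is gone)
```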
Compiling a word-frequency table
- So why go through all the trouble of splitting words into separate lines, you ask? Because unix commands, and text processing in general, are heavily line-oriented: programs and commands typically operate on a line-by-line basis.
- If that's not enough to convince you, starting from the word-tokenized file that we created above, we can easily proceed to compile a list of word types (unique words) in the text:
Here is how we got the list. First, the lines in the tokenized word list are alphabetically sorted (sort), and then runs of consecutive identical lines are collapsed into one (uniq).
$ sort austen-emma.words | uniq | head -15
- Now a word frequency table is only a few tweaks away. This time, we print a count in front of each line while collapsing identical lines (uniq -c), and then sort the result in reverse numerical order (sort -nr) instead of the default alphabetical order.
$ sort austen-emma.words | uniq -c | sort -nr | head -15
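On a tiny stand-in word list (no corpus file needed), the same pipeline yields a count-ordered table:

```shell
# Three 'the' tokens, one 'cat', one 'hat' -> counts in descending order
printf 'the\ncat\nthe\nhat\nthe\n' | sort | uniq -c | sort -nr
# the most frequent type, 'the' (count 3), comes out on top
```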
- So far we discussed how to process a single text file. Now, how can we process an entire corpus, consisting of many text files? There are two general approaches:
- Process individual text files to obtain separate word tokenization files for each. Then conflate the individual results into a single result representing the entire corpus.
- Process the text files all at once, by globbing the files together and forming a single standard input stream to feed into the command-line process. In other words, think *.txt.
- If taking the second approach, a word of warning about using '<': it takes a single file name as its argument, which means none of the following works:
tr '[A-Z]' '[a-z]' < austen-emma.txt austen-persuasion.txt
tr '[A-Z]' '[a-z]' < austen-*.txt
tr '[A-Z]' '[a-z]' < *.txt
THESE DO NOT WORK!!
Hence, to feed multiple files into tr as a single standard input stream, you need to resort to cat:
cat austen-emma.txt austen-persuasion.txt | tr '[A-Z]' '[a-z]'
cat austen-*.txt | tr '[A-Z]' '[a-z]'
cat *.txt | tr '[A-Z]' '[a-z]'
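Combining cat with the tokenization pipeline above, a whole corpus can then be processed in one go (a sketch using two tiny made-up files):

```shell
tmp=$(mktemp -d)
echo "Emma woke."  > "$tmp/a.txt"
echo "Anne slept." > "$tmp/b.txt"
# Glob the files into one stream, then tokenize as before
cat "$tmp"/*.txt \
    | tr '[A-Z]' '[a-z]' \
    | sed -r 's/[[:punct:]]/ /g' \
    | sed -r 's/ /\n/g' \
    | grep -v -E '^$'
# -> emma, woke, anne, slept -- one word per line
```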