Kenneth Church, "Unix for Poets" [pdf]. Please note that some of the syntax in this document is deprecated: tail -n +2 should be used instead of tail +2.
William Poser, Unix Tools for Linguists. Note that the outgoing links for individual commands are broken, but the page provides an excellent summary.
Extracting a list of word types and their frequencies is among the most basic operations you will perform on a corpus. How can we accomplish this? The central step is breaking text lines, which contain multiple words each, into many more lines containing exactly one word each. In English, word boundaries are marked with a space; what we need to do, then, is convert every space character into a newline character. There are additional considerations, such as neutralizing case distinctions and the treatment of punctuation. Below, we will learn the necessary commands.
First off, your locale needs to be set to en_US.ISO-8859-1 for the commands below to behave precisely the way they should. Follow these instructions (OS-X here, cygwin here).
(OS-X) Installing a newer version of sed
It turns out that the version of sed included in OS-X is ancient (pre v.3.02, a BSD implementation) and does not support '\n', '\t', and so on. The current working version is 4.1.5-11 (the GNU implementation). The best way to update these Unix software packages on OS-X is through Fink. Download and install Fink, and then you can update sed through its interface.
The newer version is called gsed (for GNU sed) and is installed as /sw/bin/gsed. You can automatically call this command whenever you use sed by setting an alias in your .bash_profile file. Using pico, edit the file to include the following line:
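The alias line itself does not survive in this copy of the page; presumably it is something like the following (a sketch, assuming the Fink install path /sw/bin/gsed mentioned above):

```shell
# Presumed ~/.bash_profile line: make 'sed' invoke the newer GNU sed from Fink
alias sed='/sw/bin/gsed'
```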
Translating characters using tr
- tr is a unix utility that translates characters into other characters. The following command will replace every 'o' character in the file with 'x':
tr 'o' 'x' < austen-emma.txt
Unlike most other unix commands, tr accepts input only on standard input, which means this will NOT work: tr 'o' 'x' austen-emma.txt. Instead, you need to explicitly feed the source file to tr as standard input with the 'less than' symbol '<'. An alternative is to use the cat command to pipe the file content into tr. (Note: more on '<' and cat at the bottom of this page.)
cat austen-emma.txt | tr 'o' 'x'
The echo command comes in handy for quick trial runs and debugging: (Note: $ is my command-line prompt and is not part of the command syntax.)
$ echo "Hello world" | tr 'o' 'x'
Hellx wxrld
- The newline character is most commonly represented as '\n'. Therefore, to convert every space into a new line (thereby getting one word per line), you can issue the following:
$ echo "It's 12 o'clock now." | tr ' ' '\n'
It's
12
o'clock
now.
- Instead of two single characters, you can specify two matching sets of characters:
$ echo "It's 12 o'clock now. Call Ted." | tr 'aeiou' 'AEIOU'
It's 12 O'clOck nOw. CAll TEd.
Because of this property of tr syntax, it is most commonly used as a quick method of lowercase/uppercase conversion:
$ echo "It's 12 o'clock now. Call Ted." | tr '[A-Z]' '[a-z]'
it's 12 o'clock now. call ted.
- At this point, you might be tempted to formulate something like the following, thinking it would replace all instances of 'that' with 'this' in the text:
$ echo "I like that hat." | tr 'that' 'this'
I like shis his.
(Puzzled? Go back and examine the 'aeiou' example above.) This is because tr operates on individual characters ('t', 'h', ...) and not strings ('that'). Converting the string 'that' into another string 'this' is a string-based operation; for this, we need sed.
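The two behaviors can be put side by side (a quick sketch):

```shell
# tr maps each character of 'that' to the corresponding character of 'this':
# 't'->'t', 'h'->'h', 'a'->'i', 't'->'s' (the later 't'->'s' mapping wins),
# so every t, h, and a anywhere in the input is affected:
echo "I like that hat." | tr 'that' 'this'
# -> I like shis his.

# sed substitutes the string 'that' as a whole, leaving other words alone:
echo "I like that hat." | sed 's/that/this/g'
# -> I like this hat.
```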
Transforming strings using sed
- sed (short for 'stream editor') is a powerful tool used for transforming streaming text. For example, you can replace all instances of 'Emma' with 'Queen Elizabeth' in this corpus file (Jane Austen's Emma):
sed 's/Emma/Queen Elizabeth/g' austen-emma.txt
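A quick echo-based trial of the same substitution (the sentence here is made up):

```shell
# 'g' makes the substitution apply to every occurrence on the line
echo "Emma met Emma at noon." | sed 's/Emma/Queen Elizabeth/g'
# -> Queen Elizabeth met Queen Elizabeth at noon.
```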
- sed syntax is as follows:
|sed 's/string1/string2/' file ||Prints out lines of file while substituting ('s') string1 with string2 once per line|
|sed 's/string1/string2/g' file ||Same as above, but string replacement is done globally ('g') throughout line|
|sed -r 's/string1/string2/g' file ||Same as above, but strings contain (extended) regular expression ('-r') |
|sed 's/.../.../g; s/.../.../g' file ||Separate multiple transformations with ';'|
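A quick trial of the last row, chaining two global substitutions in one sed call (made-up sentence):

```shell
# Two substitutions, separated by ';', each applied globally
echo "good morning, good night" | sed 's/good/bad/g; s/night/day/g'
# -> bad morning, bad day
```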
- Without the 'g' switch, sed's default behavior is to perform only one replacement per line and stop:
$ echo "laa dee daa" | sed 's/aa/uu/'
luu dee daa
For our corpus processing purposes, we need such transformations to be applied to every applicable instance. Therefore, it is highly recommended that you get into the habit of supplying 'g'.
$ echo "laa dee daa" | sed 's/aa/uu/g'
luu dee duu
- When you want to apply multiple transformations, you can separate them with a semicolon ';':
sed 's/Emma/Juliet/g; s/Mr. Knightley/Romeo/g' austen-emma.txt
- The '-r' switch signals that the strings are to be interpreted as (extended) regular expressions. With the help of the regular expression formalism, we can match beyond literal strings. For example, regular expression 'o+' matches not only 'o' but also 'oo', 'ooo', 'oooo', and so on:
$ echo "Ooooooh, what a wonderful book." | sed -r 's/o+/_/g'
O_h, what a w_nderful b_k.
Regular expressions are extremely powerful; they let users capture an infinite set of strings with a finite description. In addition, they provide many handy shortcuts that represent a group of characters.
For example, [:punct:] stands for all punctuation symbols in English, and the following command maps any sequence of punctuation symbols to a space:
$ echo "It's 12 o'clock now...! Call Ted." | sed -r 's/[[:punct:]]+/ /g'
It s 12 o clock now Call Ted
- Lastly, the following sed command is identical in effect to tr ' ' '\n' above.
$ echo "It's 12 o'clock now." | sed -r 's/ /\n/g'
It's
12
o'clock
now.
Putting it all together: compiling a tokenized word file
- The goal is to extract a tokenized word file from austen-emma.txt, with one word per line. First we start out by noting the total # of lines and words in the file. There are 16,823 lines and 158,167 words in this text:
$ wc austen-emma.txt
16823 158167 887071 austen-emma.txt
- Let's begin by converting all uppercase letters into lowercase. (Some might prefer not to lowercase proper nouns, but for now we're applying this across the board.) It's a good idea to pipe the output into less, more or head so we can eyeball the results along the way:
tr '[A-Z]' '[a-z]' < austen-emma.txt | more
- Now let's get rid of all punctuation symbols by converting them into a space. (Again, best practice is to tokenize them apart rather than get rid of them entirely; we will learn how to do this later.)
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | more
Note that the '+' syntax was not used here; this causes '...!' to be translated into four spaces, which will in turn become empty lines in our next step. But don't worry -- the empty lines will be eliminated at the end.
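The difference the '+' makes can be seen directly (a sketch; GNU sed's -r is assumed, per the setup above):

```shell
# Without '+', each of the four punctuation characters becomes its own space:
echo "now...!" | sed -r 's/[[:punct:]]/ /g'
# -> "now    " (four trailing spaces)

# With '+', the whole punctuation run collapses into a single space:
echo "now...!" | sed -r 's/[[:punct:]]+/ /g'
# -> "now "
```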
- Next, we separate words into individual lines by converting spaces into a new line character: (The two sed commands can be collapsed into one as sed -r 's/[[:punct:]]/ /g; s/ /\n/g' instead, but it's just as easy to pipe.)
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | more
- This looks almost done, but you will notice that the output contains many empty lines. To get rid of them, you can use grep. Not surprisingly, grep too can take regular expression arguments; the switch that enables (extended) regular expressions is '-E'. An empty line is represented as '^$'. (More on this later.) Therefore, the grep command that keeps everything except empty lines is grep -v -E '^$':
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | grep -v -E '^$' | more
- The output looks good! We'll save it into a file now:
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | grep -v -E '^$' > austen-emma.words
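Since the pipeline is long, here is the same sequence run on a tiny made-up sample, so each stage can be tried without the corpus file:

```shell
# lowercase -> punctuation to spaces -> spaces to newlines -> drop empty lines
echo "Emma said: 'Hello, hello!'" \
    | tr '[A-Z]' '[a-z]' \
    | sed -r 's/[[:punct:]]/ /g' \
    | sed -r 's/ /\n/g' \
    | grep -v -E '^$'
# -> emma, said, hello, hello -- one word per line
```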
- This is what the tokenized word file looks like:
$ head -15 austen-emma.words
- Now let's see how many words there are, this time by line-counting:
$ wc -l austen-emma.words
There are more words now -- why is this? (Hint: What happened to words like "you're"?)
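The hint can be checked directly: the apostrophe is a punctuation symbol, so a contraction is split into two lines (sketch, GNU sed assumed):

```shell
# "you're" -> "you re" -> two lines, hence a higher word count
echo "you're" | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g'
# -> you
#    re
```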
More ways than one: comparing files with diff
- As is often the case in text processing tasks, there are many alternative ways to obtain the same or a near-identical result. In Unix for Poets, Church accomplishes word tokenization using tr alone ('\012' and '\n' both represent the newline character):
tr '[A-Z]' '[a-z]' < austen-emma.txt | tr -sc '[a-z]' '\012' > austen-emma.church.words
- His formulation differs from ours in that (1) it also strips out numbers, and (2) it fails to strip square brackets (in tr, the brackets in '[a-z]' are literal characters, so they count as part of the kept set). How can we compare the two results? We can use diff to compare the two output files:
$ diff austen-emma.words austen-emma.church.words | tail
- When the contents of two files are identical, diff prints out nothing:
$ head -200 austen-emma.txt | tail -10 > dada
$ tail -n +191 austen-emma.txt | head -10 > baba
$ diff dada baba
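The same identity can be checked on a generated file (a sketch using seq; both commands pick out lines 191-200):

```shell
tmp=$(mktemp -d)
seq 1 300 > "$tmp/nums"
head -200 "$tmp/nums" | tail -10   > "$tmp/dada"  # first 200 lines, last 10 of those
tail -n +191 "$tmp/nums" | head -10 > "$tmp/baba" # from line 191 on, first 10
diff "$tmp/dada" "$tmp/baba"                      # prints nothing: identical
```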
- A question: How should we modify our original command formula so that it eliminates numerals as well? Hint: there are regular expressions that represent digits, à la [a-z] and [:punct:]. Try this page.
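One possible answer (a sketch; [:digit:] is the POSIX class for 0-9, parallel to [:punct:]):

```shell
# Treat digits like punctuation: map runs of either to a space
echo "It's 12 o'clock now." | sed -r 's/[[:punct:][:digit:]]+/ /g'
# -> "It s   o clock now " (the 12 is gone)
```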
Compiling a word-frequency table
- So why go through all the trouble of splitting words into separate lines, you ask? Because unix commands, and text processing in general, are heavily line-oriented: programs and commands typically operate on a line-by-line basis.
- If that's not enough to convince you, starting from the word-tokenized file that we created above, we can easily proceed to compile a list of word types (unique words) in the text:
Here is how we got the list. First, the lines in the tokenized word list are alphabetically sorted (sort), and then runs of consecutive identical lines are collapsed into one (uniq).
$ sort austen-emma.words | uniq | head -15
- Now a word frequency table is only a few tweaks away. This time, we print a count in front of each line while collapsing identical lines (uniq -c), and then sort the result in reverse numerical order (sort -nr) instead of the default alphabetical order.
$ sort austen-emma.words | uniq -c | sort -nr | head -15
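On a tiny stand-in word list (no corpus file needed), the same pipeline yields a count-ordered table:

```shell
# Three 'the' tokens, one 'cat', one 'hat' -> counts in descending order
printf 'the\ncat\nthe\nhat\nthe\n' | sort | uniq -c | sort -nr
# the most frequent type, 'the' (count 3), comes out on top
```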
- So far we discussed how to process a single text file. Now, how can we process an entire corpus, consisting of many text files? There are two general approaches:
- Process individual text files to obtain separate word tokenization files for each. Then conflate the individual results into a single result representing the entire corpus.
- Process the text files all at once, by globbing the files together and forming a single standard input stream to feed into the command-line process. In other words, think *.txt.
- If taking the second approach, a word of warning about using '<': it takes a single file name as its argument, which means none of the following works:
tr '[A-Z]' '[a-z]' < austen-emma.txt austen-persuasion.txt
tr '[A-Z]' '[a-z]' < austen-*.txt
tr '[A-Z]' '[a-z]' < *.txt
THESE DO NOT WORK!!
Hence, to feed multiple files into tr as a single standard input stream, you need to resort to cat:
cat austen-emma.txt austen-persuasion.txt | tr '[A-Z]' '[a-z]'
cat austen-*.txt | tr '[A-Z]' '[a-z]'
cat *.txt | tr '[A-Z]' '[a-z]'
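Combining cat with the tokenization pipeline above, a whole corpus can then be processed in one go (a sketch using two tiny made-up files):

```shell
tmp=$(mktemp -d)
echo "Emma woke."  > "$tmp/a.txt"
echo "Anne slept." > "$tmp/b.txt"
# Glob the files into one stream, then tokenize as before
cat "$tmp"/*.txt \
    | tr '[A-Z]' '[a-z]' \
    | sed -r 's/[[:punct:]]/ /g' \
    | sed -r 's/ /\n/g' \
    | grep -v -E '^$'
# -> emma, woke, anne, slept -- one word per line
```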