LING 2050 Special Topics in Linguistics: Corpus Linguistics, University of Pittsburgh


Lab 4
Objectives: compiling N-gram lists from a corpus; basics of regular expressions

Reference:

Kenneth Church, "Unix for Poets" [pdf]. Please note that some of the syntax in this document is deprecated: tail -n +2 should be used instead of tail +2.
Regular Expressions - User guide

Overview

N-gram statistics give you quick access to collocation patterns in a corpus. How can we compile N-grams from a corpus using Unix tools? The key is in recognizing that the tokenized word list file we built in Lab 3 represents 1-grams. We start out with this file and build up from there.

Building N-gram files

Let's start off with Austen's Emma. The original corpus text file looks like this:
$ head austen-emma.txt
[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world

Now recall that the austen-emma.words file is the tokenized word list file we constructed from the corpus text file, which looks like:
$ head -12 austen-emma.words
emma
by
jane
austen
1816
volume
i
chapter
i
emma
woodhouse
handsome

Now, a bigram list should look like the following. The two words on each line are separated by a TAB character:
emma	by
by	jane
jane	austen
austen	1816
1816	volume
volume	i
i	chapter
chapter	i
i	emma
emma	woodhouse
woodhouse	handsome
handsome	clever

Above, the first column is exactly the same as the lines of the tokenized word file. The second column is the same list, except that it starts from the 2nd line. The second column can therefore be generated from the tokenized word file using the tail command:
$ tail -n +2 austen-emma.words > austen-emma.words2
$ more austen-emma.words2
by
jane
austen
1816
volume
i
chapter
i
emma
woodhouse

Once we have separate files for the first and the second column, paste is the command to use to paste lines from multiple files together, separated by a TAB character:
$ paste austen-emma.words austen-emma.words2 | more
emma	by
by	jane
jane	austen
austen	1816
1816	volume
volume	i
i	chapter
chapter	i
i	emma
emma	woodhouse
woodhouse	handsome
handsome	clever
clever	and

The output looks good, so we're going to save it to a bigram file called austen-emma.2gram:
$ paste austen-emma.words austen-emma.words2 > austen-emma.2gram
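To see the whole pipeline at a glance, here is a minimal sketch that runs the same three steps on a five-word toy list. The file name toy.words and the temporary directory are made up for illustration; they simply stand in for austen-emma.words.

```shell
# Toy stand-in for austen-emma.words: one token per line (hypothetical file).
workdir=$(mktemp -d)
printf '%s\n' emma by jane austen 1816 > "$workdir/toy.words"

# Column 2 is the same list, starting from the 2nd line.
tail -n +2 "$workdir/toy.words" > "$workdir/toy.words2"

# Paste the two columns side by side, separated by a TAB.
paste "$workdir/toy.words" "$workdir/toy.words2" > "$workdir/toy.2gram"

cat "$workdir/toy.2gram"
```

Note that the last line of the pasted file has an empty second column, because the shifted file is one line shorter than the original.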

The command that often goes with paste is cut, which is used to 'cut out' certain portions from each line of a file. In our case, to cut out the 2nd column, cut -f2 ('f' for 'field') can be used. The syntax of paste and cut:

paste file1 file2 file3    pastes respective lines from the files into one line, separated by a TAB
cut -f2 file               cut out the 2nd field (column) from file
cut -fm,n file             cut out the m-th and the n-th fields (columns) from file. Fields are TAB-separated
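Here is a small sketch of paste and cut working together. The three input files (words, tags, lengths) are hypothetical toy data, not part of the lab materials.

```shell
# Three toy one-column files (hypothetical data for illustration only).
workdir=$(mktemp -d)
printf '%s\n' the cat sat > "$workdir/words.txt"
printf '%s\n' DET NOUN VERB > "$workdir/tags.txt"
printf '%s\n' 3 3 3 > "$workdir/lengths.txt"

# paste: line k of each file becomes one TAB-separated line k.
paste "$workdir/words.txt" "$workdir/tags.txt" "$workdir/lengths.txt" > "$workdir/table.tsv"

# cut -f2: keep only the 2nd column.
cut -f2 "$workdir/table.tsv"

# cut -f1,3: keep the 1st and 3rd columns, still TAB-separated.
cut -f1,3 "$workdir/table.tsv"
```

In this sense cut undoes what paste did: pasting columns together and then cutting one back out returns the original column.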

Now for some questions:

How would you generate a 3-gram file? A 4-gram file?
What are the top 5 most frequent 2-grams? Top 5 most frequent 3-grams and 4-grams?

Regular Expressions

First off, some key terms and definitions below. See this page for explanation.

literal
metacharacter
escape sequence
target string
search expression

We have been using Regular Expressions (Regex) with grep and sed. For example:
$ echo 'Ooooooh, what a wonderful book.' | grep -E 'o+'
Ooooooh, what a wonderful book.
What are the key terms used in this regular expression matching? What about in the example below?
$ echo 'What is 5+4?' | grep -E '\+.\?'
What is 5+4?

If you use grep --color (here I have set an alias so I don't have to specify --color every time), you will see the matched portion of the target string highlighted in red. If the color option does not work with your version of grep, you can use grep -o instead, where -o is a switch for printing only the matched parts, each on a separate line. This is obviously less convenient than seeing the matched parts in their original context:
$ echo 'Ooooooh, what a wonderful book.' | grep -E -o 'o+'
ooooo
o
oo
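Since grep -o prints one match per line, its output can be piped into the counting tools we already know. A minimal sketch, where the variable names are just for illustration:

```shell
sentence='Ooooooh, what a wonderful book.'

# Each match lands on its own line, so wc -l counts the matches.
# tr strips the padding that some versions of wc add around the number.
match_count=$(printf '%s\n' "$sentence" | grep -E -o 'o+' | wc -l | tr -d ' ')

echo "$match_count"
```

For this sentence the count is 3, one for each line of the grep -o output.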

There are many kinds of metacharacters, which are part of Regular Expression syntax. Please refer to this page for explanation.
Brackets, ranges and negation:

character  example                 matches                                           does not match
[ ] -      [a-z]                   'd', 'Kaib', '432a...'                            'A', '2344', 'AB-2'
^          [^a-z]                  'A', 'Abcd', '26'                                 'a', 'abcd'
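You can check bracket expressions yourself with a few lines of shell. The sketch below uses a small helper function, try, which is made up for this illustration and is not a standard command. It also sets LC_ALL=C, on the assumption that in some other locales a range such as [a-z] may cover more than the 26 lowercase ASCII letters.

```shell
# Use byte-wise ranges so that [a-z] means the lowercase ASCII letters only.
export LC_ALL=C

# try PATTERN STRING: hypothetical helper that reports whether grep finds a match.
try() {
  if printf '%s\n' "$2" | grep -q -E -e "$1"; then
    printf '%s\tmatches\t%s\n' "$1" "$2"
  else
    printf '%s\tdoes not match\t%s\n' "$1" "$2"
  fi
}

try '[a-z]' 'Kaib'
try '[a-z]' '2344'
try '[^a-z]' 'Abcd'
try '[^a-z]' 'abcd'
```

Remember that grep reports a match as soon as any part of the string fits the pattern, which is why 'Kaib' matches [a-z] despite its capital letter.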

Positioning ('anchors'):

^          ^T                      'The', 'The boy'                                  'this girl', 'He said. The girl was happy.'
$          e$                      'e', 'debate'                                     'e.', 'she said'
.          .                       'a', 'B', 'e48', ' ', 'but...'                    ''
\b         e\b                     'the', 'and the girl', 'she ate.', 'ice-coffee'   'them', 'the88'
\B         e\B                     'baked', 'them', 'the7'                           'the', 'ice', 'she ate.'
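Anchors are easiest to understand when you run them over several lines at once, since grep tests each line separately. Below is a minimal sketch using strings from the table; the variable name lines is made up, and the \b example assumes a grep that supports it, as GNU grep does.

```shell
# Candidate strings, one per line.
lines=$(printf '%s\n' 'The boy' 'this girl' 'He said. The girl was happy.' 'debate' 'she said' 'them')

# Lines that begin with T.
printf '%s\n' "$lines" | grep -E '^T'

# Lines that end with e.
printf '%s\n' "$lines" | grep -E 'e$'

# Lines with an e at the end of a word (assumes \b is supported, as in GNU grep).
printf '%s\n' "$lines" | grep -E 'e\b'
```

The third string contains 'The' but is not selected by ^T, because the T is not at the beginning of the line.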

Iteration ('quantifiers'):

?          ed?e                    'ee', 'ede'                                       'edde', 'eddde', 'edte'
*          ed*e                    'ee', 'ede', 'edde', 'eddde'                      'edte'
+          ed+e                    'ede', 'edde', 'eddde'                            'ee', 'eed', 'ete'
{n}        d{3}                    'ddd', 'dddd', 'ddddd'                            'd', 'dd'
{n,m}      ab{3,5}a                'abbba', 'abbbba', 'abbbbba'                      'aba', 'abba', 'abbbbbba'
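One way to compare quantifiers is to count how many strings from a fixed list each pattern accepts. The sketch below does this with grep -c; the helper count_matches and the variable candidates are hypothetical names chosen for this example.

```shell
# Candidate strings, one per line.
candidates=$(printf '%s\n' ee ede edde eddde edte)

# count_matches PATTERN: hypothetical helper, counts the candidate lines that match.
count_matches() {
  printf '%s\n' "$candidates" | grep -c -E -e "$1"
}

count_matches 'ed?e'       # zero or one d between the two e's
count_matches 'ed*e'       # zero or more d's
count_matches 'ed+e'       # one or more d's
count_matches 'ed{2,3}e'   # two or three d's
```

The counts come out as 2, 4, 3 and 2. In every case 'edte' is rejected, since none of these patterns allows a t between the two e's.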

Others:

()         a(bc)*d                 'ad', 'abcd', 'abcbcd'                            'abd', 'abccbd', 'acbd'
|          ((Dec|Nov)em|Octo)ber   'December', 'November', 'October'                 'September', 'June'
\          \+                      '2+3'                                             '\'
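Grouping, alternation and escaping can be tried out the same way. This is a minimal sketch; the variable name months is made up, and grep -o is used to show exactly which part of the string was matched.

```shell
months=$(printf '%s\n' September October November December June)

# Grouping plus alternation: selects only the three months the pattern spells out.
printf '%s\n' "$months" | grep -E '((Dec|Nov)em|Octo)ber'

# A group is repeated as a unit; -o prints just the matched part.
printf '%s\n' 'xx abcbcd yy' | grep -E -o 'a(bc)*d'

# An escaped + is a literal plus sign ...
printf '%s\n' '2+3' | grep -E -o '\+'

# ... while an unescaped + is a quantifier: here, one or more 2's.
printf '%s\n' '2+3' | grep -E -o '2+'
```

The last two commands use the same target string but print different parts of it: the first prints the plus sign, the second prints the 2.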