Reference:
Overview
N-gram statistics give you quick access to collocation patterns in a corpus. How can we compile N-grams from a corpus using Unix tools? The key is recognizing that the tokenized word list file we built in Lab 3 already represents 1-grams. We start with this file and build up from there.
Building N-gram files
- Let's start off with Austen's Emma. The original corpus text file looks like this:
$ head austen-emma.txt
[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
- Now recall that the austen-emma.words file is the tokenized word list file we constructed from the corpus text file, which looks like:
$ head -12 austen-emma.words
emma
by
jane
austen
1816
volume
i
chapter
i
emma
woodhouse
handsome
- Now, a bigram list should look like the following. The two words on each line are separated by a TAB character:
emma by
by jane
jane austen
austen 1816
1816 volume
volume i
i chapter
chapter i
i emma
emma woodhouse
woodhouse handsome
handsome clever
- Above, the first column is exactly the same as the lines in the tokenized word file. The second column is the same list, except that it starts from the 2nd line. The second column can therefore be generated from the tokenized word file using the tail command, where -n +2 means "print from line 2 onward":
$ tail -n +2 austen-emma.words > austen-emma.words2
$ more austen-emma.words2
by
jane
austen
1816
volume
i
chapter
i
emma
woodhouse
- Once we have separate files for the first and the second column, paste is the command to use to paste lines from multiple files together, separated by a TAB character:
$ paste austen-emma.words austen-emma.words2 | more
emma by
by jane
jane austen
austen 1816
1816 volume
volume i
i chapter
chapter i
i emma
emma woodhouse
woodhouse handsome
handsome clever
clever and
- The output looks good, so we're going to save it to a bigram file called austen-emma.2gram:
$ paste austen-emma.words austen-emma.words2 > austen-emma.2gram
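One detail worth checking on a small scale: the second file is one line shorter than the first, so paste pairs the final word with an empty field. The following is a minimal sketch on a six-word toy list, not on the real corpus; the file names and the temporary directory are made up for illustration, and removing the last line with sed is just one possible way to handle it.

```shell
# Toy stand-in for austen-emma.words (hypothetical names, temporary directory).
workdir=$(mktemp -d)
printf '%s\n' emma woodhouse handsome clever and rich > "$workdir/toy.words"

# Second column: the same list, starting from line 2.
tail -n +2 "$workdir/toy.words" > "$workdir/toy.words2"

# Paste the two columns; the last line pairs 'rich' with an empty field.
paste "$workdir/toy.words" "$workdir/toy.words2" > "$workdir/toy.2gram.raw"

# Drop that incomplete last line so only true bigrams remain.
sed '$d' "$workdir/toy.2gram.raw" > "$workdir/toy.2gram"

cat "$workdir/toy.2gram"
```

Whether to keep or drop the incomplete last line is a design choice; for frequency counts over a whole novel it changes the total by only one entry. Remove the temporary directory with rm -r "$workdir" when done.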
- The command that often goes with paste is cut, which is used to 'cut out' certain portions from each line of a file. In our case, to cut out the 2nd column, cut -f2 ('f' for 'field') can be used. The syntax of paste and cut:
paste file1 file2 file3 | pastes respective lines from the files into one line, separated by a TAB |
cut -f2 file | cut out the 2nd field (column) from file |
cut -fm,n file | cut out the m-th and the n-th fields (columns) from file. Fields are TAB-separated |
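To see the cut syntax in action, here is a small sketch on made-up data: a two-line table with three TAB-separated fields, held in a shell variable rather than a file. The variable names are arbitrary and not part of the lab files.

```shell
# A small TAB-separated table (made-up data, three fields per line).
table=$(printf 'emma\twoodhouse\thandsome\nclever\tand\trich\n')

# Second field only.
second=$(printf '%s\n' "$table" | cut -f2)

# First and third fields; the output is still TAB-separated.
first_third=$(printf '%s\n' "$table" | cut -f1,3)

printf '%s\n' "$second"
printf '%s\n' "$first_third"
```

The same options work with a file name as the last argument instead of piped input.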
- Now for some questions:
- How would you generate a 3-gram file? A 4-gram file?
- What are the top 5 most frequent 2-grams? The top 5 most frequent 3-grams and 4-grams?
Regular Expressions
- First off, some key terms and definitions below. See this page for explanation.
- literal
- metacharacter
- escape sequence
- target string
- search expression
- We have been using Regular Expressions (Regex) with grep and sed. For example:
$ echo 'Ooooooh, what a wonderful book.' | grep -E 'o+'
Ooooooh, what a wonderful book.
What are the key terms used in this regular expression matching? What about in the example below?
$ echo 'What is 5+4?' | grep -E '\+.\?'
What is 5+4?
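The difference between a metacharacter and its escape sequence can be checked directly. The sketch below runs the same target string against an unescaped and an escaped plus sign; the variable names are arbitrary, and grep -q (quiet mode, exit status only) is used so the result can be stored.

```shell
target='What is 5+4?'

# Unescaped: '+' is a quantifier, so '5+4' means one or more 5s followed by 4.
if printf '%s\n' "$target" | grep -E -q '5+4'; then
    unescaped='match'
else
    unescaped='no match'
fi

# Escaped: '\+' stands for a literal plus sign.
if printf '%s\n' "$target" | grep -E -q '5\+4'; then
    escaped='match'
else
    escaped='no match'
fi

printf '%s -> %s\n' '5+4' "$unescaped"
printf '%s -> %s\n' '5\+4' "$escaped"
```

The unescaped pattern fails here because the target contains no '5' directly followed by '4'.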
- If you use grep --color (here I have set an alias so I don't have to specify --color every time), the matched portion of the target string is highlighted in red. If the color option does not work with your version of grep, you can use grep -o instead, where -o is a switch that prints only the matched parts, each on a separate line. This is obviously less convenient than seeing the matched parts in their original context:
$ echo 'Ooooooh, what a wonderful book.' | grep -E -o 'o+'
ooooo
o
oo
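Because grep -o prints one matched part per line, its output can be fed to the other line-based tools. As a minimal sketch, the following counts the matched parts with wc -l; the variable names are arbitrary.

```shell
target='Ooooooh, what a wonderful book.'

# One matched part per line.
matches=$(printf '%s\n' "$target" | grep -E -o 'o+')

# Count the lines, i.e. the number of matched parts (tr strips padding some wc versions add).
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

printf '%s\n' "$matches"
printf 'number of matches: %s\n' "$count"
```

Note that the initial capital 'O' is not part of any match, since the pattern is case-sensitive.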
- There are many kinds of metacharacters, which are part of Regular Expression syntax. Please refer to this page for explanation.
Brackets, ranges and negation:
character | example | matches | does not match |
[ ] | [a-z] | 'd', 'Kaib', '432a...' | 'A', '2344', 'AB-2' |
- | [a-z] | 'd', 'Kaib', '432a...' | 'A', '2344', 'AB-2' |
^ | [^a-z] | 'A', 'Abcd', '26' | 'a', 'abcd' |
Positioning ('anchors'):
^ | ^T | 'The', 'The boy' | 'this girl', 'He said. The girl was happy.' |
$ | e$ | 'e', 'debate' | 'e.', 'she said' |
. | . | 'a', 'B', 'e48', ' ', 'but...' | '' |
\b | e\b | 'the', 'and the girl', 'she ate.', 'ice-coffee' | 'them', 'the88' |
\B | e\B | 'baked', 'them', 'the7' | 'the', 'she ate.', 'ice-box' |
Iteration ('quantifiers'):
? | ed?e | 'ee','ede' | 'edde', 'eddde', 'edte' |
* | ed*e | 'ee', 'ede', 'edde', 'eddde' | 'edte' |
+ | ed+e | 'ede', 'edde', 'eddde' | 'ee', 'eed', 'ete' |
{n} | d{3} | 'ddd', 'dddd', 'ddddd' | 'd', 'dd' |
{n,m} | ab{3,5}a | 'abbba', 'abbbba', 'abbbbba' | 'aba', 'abba', 'abbbbbba' |
Others:
() | a(bc)*d | 'ad', 'abcd', 'abcbcd' | 'abd', 'abccbd', 'acbd' |
| | ((Dec|Nov)em|Octo)ber | 'December', 'November', 'October' | 'September', 'June' |
\ | \+ | '2+3' | '\' |
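Rows of the tables above can be verified from the command line. The sketch below defines a small helper, check (a hypothetical name, not a standard command), that reports whether grep -E finds a pattern anywhere in a string. It sets LC_ALL=C on the assumption that bracket ranges such as [a-z] should follow plain ASCII order, which is not guaranteed in every locale.

```shell
# check PATTERN STRING: report whether grep -E finds PATTERN anywhere in STRING.
# LC_ALL=C keeps ranges like [a-z] in ASCII order (an assumption about the intended behavior).
check() {
    if printf '%s\n' "$2" | LC_ALL=C grep -E -q -- "$1"; then
        printf '%s\t%s\tmatch\n' "$1" "$2"
    else
        printf '%s\t%s\tno match\n' "$1" "$2"
    fi
}

check '[a-z]' '432a...'
check '[a-z]' 'AB-2'
check '^T' 'He said. The girl was happy.'
check 'e$' 'debate'
check 'ed?e' 'edde'
check 'ab{3,5}a' 'abbbbbba'
check 'a(bc)*d' 'abcbcd'
check '((Dec|Nov)em|Octo)ber' 'September'
```

Each output line shows the pattern, the target string, and the verdict, separated by TABs, so the results can be compared against the 'matches' and 'does not match' columns.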