
Lab 4

Objectives: compiling N-gram lists from a corpus; basics of regular expressions
Reference:

Overview

N-gram statistics give you quick access to collocation patterns in a corpus. How can we compile N-grams from a corpus using Unix tools? The key is recognizing that the tokenized word list file we built in Lab 3 is already a list of 1-grams. We start with this file and build up from there.

Building N-gram files

  1. Let's start with Austen's Emma. The original corpus text file looks like this:
    $ head austen-emma.txt
    [Emma by Jane Austen 1816]
    
    VOLUME I
    
    CHAPTER I
    
    
    Emma Woodhouse, handsome, clever, and rich, with a comfortable home
    and happy disposition, seemed to unite some of the best blessings
    of existence; and had lived nearly twenty-one years in the world
    
  2. Now recall that the austen-emma.words file is the tokenized word list file we constructed from the corpus text file, which looks like:
    $ head -12 austen-emma.words 
    emma
    by
    jane
    austen
    1816
    volume
    i
    chapter
    i
    emma
    woodhouse
    handsome
    
  3. A bigram list should look like the following, with the two words on each line separated by a TAB character:
    emma    by
    by      jane
    jane    austen
    austen  1816
    1816    volume
    volume  i
    i       chapter
    chapter i
    i       emma
    emma    woodhouse
    woodhouse       handsome
    handsome        clever
    
  4. Above, the first column is exactly the same as the lines in the tokenized word file. The second column is the same list shifted up by one: it starts with the 2nd line. It can therefore be generated from the tokenized word file using the tail command, where -n +2 means 'output starting from line 2':
    $ tail -n +2 austen-emma.words > austen-emma.words2
    $ more austen-emma.words2
    by
    jane
    austen
    1816
    volume
    i
    chapter
    i
    emma
    woodhouse
    
  5. Once we have separate files for the first and second columns, the paste command joins corresponding lines from multiple files, separated by a TAB character:
    $ paste austen-emma.words austen-emma.words2 | more 
    emma    by
    by      jane
    jane    austen
    austen  1816
    1816    volume
    volume  i
    i       chapter
    chapter i
    i       emma
    emma    woodhouse
    woodhouse       handsome
    handsome        clever
    clever  and
    
  6. The output looks good, so we're going to save it to a bigram file called austen-emma.2gram:
    $ paste austen-emma.words austen-emma.words2 > austen-emma.2gram
    
  7. The command that often goes with paste is cut, which is used to 'cut out' certain portions from each line of a file. In our case, to cut out the 2nd column, cut -f2 ('f' for 'field') can be used. The syntax of paste and cut:
    paste file1 file2 file3    pastes respective lines from the files into one line, separated by a TAB
    cut -f2 file               cuts out the 2nd field (column) from file
    cut -fm,n file             cuts out the m-th and the n-th fields (columns) from file; fields are TAB-separated

  8. Now for some questions:

    • How would you generate a 3-gram file? A 4-gram file?
    • What are the top 5 most frequent 2-grams? Top 5 most frequent 3-grams and 4-grams?
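
Steps 4 to 7 above can be tried end to end on a tiny word list before running them on the full corpus. The sketch below is only an illustration: toy.words and its four words are made-up stand-ins for austen-emma.words, and the temporary directory is an arbitrary choice. It pipes the shifted list straight into paste (a lone - tells paste to read that column from standard input), so no intermediate file is needed. Because the shifted column is one line shorter, the last pasted line holds a single word followed by a TAB; here it is dropped with sed '$d', which is one option among several.

```shell
# Toy word list standing in for austen-emma.words (hypothetical data).
tmp=$(mktemp -d)
printf '%s\n' emma by jane austen > "$tmp/toy.words"

# Column 2 is the same list starting from line 2; "-" makes paste read it from stdin.
tail -n +2 "$tmp/toy.words" | paste "$tmp/toy.words" - > "$tmp/toy.2gram.raw"

# The last line has no second word (just "austen" plus a TAB), so drop it.
sed '$d' "$tmp/toy.2gram.raw" > "$tmp/toy.2gram"
cat "$tmp/toy.2gram"

# cut -f2 recovers the second column from the TAB-separated bigram file.
cut -f2 "$tmp/toy.2gram"
```

The same incomplete last line is present in austen-emma.2gram as built in step 6; whether to drop it is a judgment call.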

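For the questions above, one possible approach (a sketch, not the only answer) is to shift the word list by two lines to get a third column, paste the three columns together, and then count identical lines with sort and uniq -c. The toy word list below is made up for illustration, and awk 'NF == 3' is just one assumed way to discard the incomplete lines at the end of the pasted file.

```shell
# Toy word list (hypothetical data); the real input would be austen-emma.words.
tmp=$(mktemp -d)
printf '%s\n' to be or not to be or > "$tmp/toy.words"

# Columns 2 and 3 are the same list starting from lines 2 and 3.
tail -n +2 "$tmp/toy.words" > "$tmp/toy.words2"
tail -n +3 "$tmp/toy.words" > "$tmp/toy.words3"

# Paste the three columns, then keep only lines that have all three words.
paste "$tmp/toy.words" "$tmp/toy.words2" "$tmp/toy.words3" \
    | awk 'NF == 3' > "$tmp/toy.3gram"

# Count identical trigrams and list the most frequent ones first.
top=$(sort "$tmp/toy.3gram" | uniq -c | sort -rn | head -5)
printf '%s\n' "$top"
```

Extending this to 4-grams would need one more shifted file and one more argument to paste.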

Regular Expressions

  1. First off, some key terms and definitions below. See this page for explanation.

    • literal
    • metacharacter
    • escape sequence
    • target string
    • search expression

  2. We have been using Regular Expressions (Regex) with grep and sed. For example:
    $ echo 'Ooooooh, what a wonderful book.' | grep -E 'o+'
    Ooooooh, what a wonderful book.
    
    How do the key terms above apply to this regular expression match? What about the example below?
    $ echo 'What is 5+4?' | grep -E '\+.\?'
    What is 5+4?
    

  3. If you use grep --color (here I have set an alias so I don't have to specify --color every time), the matched portion of the target string is highlighted in red. If the color option does not work with your version of grep, you can use grep -o instead; -o prints only the matched parts, each on its own line. This is less convenient than seeing the matched parts in their original context:
    $ echo 'Ooooooh, what a wonderful book.' | grep -E -o 'o+'
    ooooo
    o
    oo
    

  4. There are many kinds of metacharacters, which are part of Regular Expression syntax. Please refer to this page for explanation.

    Brackets, ranges and negation:
    character  example                 matches                                           does not match
    [ ], -     [a-z]                   'd', 'Kaib', '432a...'                            'A', '2344', 'AB-2'
    ^          [^a-z]                  'A', 'Abcd', '26'                                 'a', 'abcd'

    Positioning ('anchors'):
    ^          ^T                      'The', 'The boy'                                  'this girl', 'He said. The girl was happy.'
    $          e$                      'e', 'debate'                                     'e.', 'she said'
    .          .                       'a', 'B', 'e48', ' ', 'but...'                    ''
    \b         e\b                     'the', 'and the girl', 'she ate.', 'ice-coffee'   'them', 'the88'
    \B         e\B                     'baked', 'ice-cream', 'the7'                      'the', 'and the girl', 'she ate.'

    Iteration ('quantifiers'):
    ?          ed?e                    'ee', 'ede'                                       'edde', 'eddde', 'edte'
    *          ed*e                    'ee', 'ede', 'edde', 'eddde'                      'edte'
    +          ed+e                    'ede', 'edde', 'eddde'                            'ee', 'eed', 'ete'
    {n}        d{3}                    'ddd', 'dddd', 'ddddd'                            'd', 'dd'
    {n,m}      ab{3,5}a                'abbba', 'abbbba', 'abbbbba'                      'aba', 'abba', 'abbbbbba'

    Others:
    ()         a(bc)*d                 'ad', 'abcd', 'abcbcd'                            'abd', 'abccbd', 'acbd'
    |          ((Dec|Nov)em|Octo)ber   'December', 'November', 'October'                 'September', 'June'
    \          \+                      '2+3'                                             '\'
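
The rows of the tables above can be checked directly from the command line. The sketch below defines check, a hypothetical helper (not a standard command) that feeds a target string to grep -E -q and reports whether the search expression matches anywhere in it. Note that \b and \B are extensions to POSIX regular expressions; GNU grep supports them, but your version of grep may not.

```shell
# check PATTERN STRING: report whether grep -E finds PATTERN anywhere in STRING.
# (check is a hypothetical helper written for this illustration.)
check() {
    if printf '%s\n' "$2" | grep -E -q -- "$1"; then
        printf "match:    '%s' in '%s'\n" "$1" "$2"
    else
        printf "no match: '%s' in '%s'\n" "$1" "$2"
    fi
}

# A few rows from the tables, one target string per call.
check 'ed?e' 'ede'
check 'ed?e' 'edde'
check 'ab{3,5}a' 'abbbbbba'
check '((Dec|Nov)em|Octo)ber' 'September'
check '\+' '2+3'
check 'e\b' 'ice-coffee'
check 'e\B' 'baked'
```

Because grep searches for the pattern anywhere in the line, a target string counts as matching as soon as any part of it fits the search expression.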