
Lab 2

Objectives: managing multiple files; basics of searching text file contents
Reference: http://osxfaq.com/Tutorials/LearningCenter/
Pages covering this lab session: here and here.

Managing multiple files with *

  1. Move into your gutenberg/ directory.

  2. We learned that the command wc -w carroll-alice.txt displays the total number of words in carroll-alice.txt. (It's 26,443 words.) But how do we find out the size of the entire corpus? You can list all the text files as arguments:
    wc -w austen-emma.txt austen-persuasion.txt austen-sense.txt bible-kjv.txt (... and so on.)
    But a better way is to use the handy wildcard notation *:
    wc -w *
    which prints out the following result:
        1483 README
      158167 austen-emma.txt
       83308 austen-persuasion.txt
      118675 austen-sense.txt
      821133 bible-kjv.txt
              ...
       79659 milton-paradise.txt
       20459 shakespeare-caesar.txt
       29605 shakespeare-hamlet.txt
       17741 shakespeare-macbeth.txt
      122070 whitman-leaves.txt
     2136725 total
    
    The wildcard * matches every file name in the directory (except for those starting with a period, such as .bash_history). Because wc receives more than one file, it also prints the total at the bottom.

  3. The problem now is that the README file, which is not a corpus text file, is included in this tally. To exclude it and count only the files ending in .txt:
    wc -w *.txt
    You might want to save that information to a file for future reference (it is probably better to put that file in your home directory, so as not to disturb your corpus directory):
    wc -w *.txt > ~/corpus-size.txt

  4. If you are interested in the size of the Shakespeare subcorpus, you can do:
    wc -w sha*.txt

  5. Now move up to the parent directory, which contains both the abc/ and gutenberg/ directories. What is the size (in number of words) of the two corpora combined? What is the total number of lines?
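
The wildcard patterns in the steps above are expanded by the shell before wc ever runs, so a quick way to check what a pattern will match is to hand it to echo first. Below is a minimal sketch of that idea. It works in a throwaway directory made with mktemp, and the file names (austen-demo.txt, shakespeare-demo.txt, .hidden-demo.txt) and their contents are made up for illustration; they are not part of the Gutenberg corpus.

```shell
# Work in a throwaway directory so the real corpus is never touched.
# Every file name and every line of text below is invented for this sketch.
scratch=$(mktemp -d) || exit 1
cd "$scratch" || exit 1

printf 'one two three\n'     > austen-demo.txt
printf 'four five\n'         > shakespeare-demo.txt
printf 'six\n'               > README
printf 'hidden words here\n' > .hidden-demo.txt

# The shell replaces each pattern with the matching file names;
# echo simply shows what the pattern turned into.
all_files=$(echo *)          # every name except those starting with a period
txt_files=$(echo *.txt)      # only names ending in .txt
sha_files=$(echo sha*.txt)   # only .txt names starting with "sha"

echo "*        -> $all_files"
echo "*.txt    -> $txt_files"
echo "sha*.txt -> $sha_files"

# Given several files, wc -w ends its output with a "total" line.
# tail -n 1 keeps only that line; awk prints its first field, the number.
total_words=$(wc -w *.txt | tail -n 1 | awk '{print $1}')
echo "words in *.txt: $total_words"
```

Previewing a pattern with echo (or ls) before passing it to wc is a cheap way to confirm that README and any hidden files are really left out of the count.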


Searching for words and patterns using grep

  1. grep is a popular Unix command for pattern-based searching. Say you're interested in the word 'cause'. The following command:
    grep cause carroll-alice.txt
    prints out every line in carroll-alice.txt that matches (== contains) "cause". If there are a lot of lines, you can "pipe" the result into less or more so you can scroll through, or use head to just print the first few lines:
    grep what carroll-alice.txt | less
    grep what carroll-alice.txt | head -15

  2. grep provides many useful option switches, including --color. Try:
    grep --color cause carroll-alice.txt
    A list of grep command options and syntax:
    grep pattern file(s) prints out all lines in file(s) that match pattern
    grep -i pattern file(s) does the same, but ignores case (so 'the' and 'The' are both matched)
    grep -w restricts the search to whole words only
    grep -n precedes each line with the line number
    grep -h suppresses the file name that otherwise precedes each line (when searching multiple files)
    grep -l displays a list of files that contain the string (actual lines are not shown)
    grep -v prints the lines that do NOT match pattern
    grep --color prints the matched portion in color (extremely handy!)
    grep -iw --color pattern file(s) '-' options can be strung together; '--' options cannot
    grep "word1 word2" file(s) pattern must be in quotes if it contains a space

  3. The first command below looks up every line containing "of course" (ignoring case) in the Gutenberg corpus and counts those lines. The second one saves the matching lines into a file.
    grep -i "of course" *.txt | wc -l
    grep -i "of course" *.txt > search_result.text   (Why am I using '.text' instead of '.txt'?)

  4. Now for some exercises:
    1. We are interested in thou, the archaic second person singular pronoun. In how many corpus text files does this word occur?
    2. How many lines of this corpus contain this word?
    3. Is that number the same as the total number of occurrences of thou in the corpus? Why not?
    4. Among the lines with thou, how many of them also contain shalt? (hint: use "|")
    5. Among the lines with thou and shalt, how many of them do NOT contain not?

  5. Lastly, it is also possible to get grep to print out additional context lines before and after the matching lines. Use the option -C n, where n is 1, 2, 3, and so on. This command:
    grep --color -C 1 cause carroll-alice.txt
    gets you this result:
    little histories about children who had got burnt, and eaten up by wild
    beasts and other unpleasant things, all because they WOULD not remember
    the simple rules their friends had taught them: such as, that a red-hot
    --
    about two feet high, and was going on shrinking rapidly: she soon found
    out that the cause of this was the fan she was holding, and she dropped
    it hastily, just in time to avoid shrinking away altogether.
    --
    is the driest thing I know. Silence all round, if you please! "William
    the Conqueror, whose cause was favoured by the pope, was soon submitted
    to by the English, who wanted leaders, and had been of late much
    --
    (*snip*)
    Note that the -C switch takes its own argument n; it does not "mesh" with other simple option switches. The syntax:
    grep -C n prints out n lines before and after each matching line
    grep -iw -C n --color pattern file(s) only simple '-' options can be strung together
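
To see what the grep switches above actually change, it can help to try them on a file small enough to check by eye. The following is a minimal sketch along those lines. The two files demo-a.txt and demo-b.txt and their contents are invented for illustration and live in a throwaway mktemp directory. The --color switch is left out only because the results are captured in variables here; on the command line you would normally add it.

```shell
# Throwaway directory with two tiny, invented sample files.
scratch=$(mktemp -d) || exit 1
cd "$scratch" || exit 1

cat > demo-a.txt <<'EOF'
The cause was clear.
Because of the rain, we stayed in.
Of course we did.
Nothing to see here.
EOF

cat > demo-b.txt <<'EOF'
of course not
A lost CAUSE indeed.
EOF

# Plain search: "cause" also matches inside "Because".
# (tr -d ' ' strips the padding that some versions of wc add.)
plain=$(grep cause demo-a.txt | wc -l | tr -d ' ')

# -w: whole words only, so "Because" no longer counts.
whole=$(grep -w cause demo-a.txt | wc -l | tr -d ' ')

# -i and -w strung together, over both files: "CAUSE" now matches too.
nocase=$(grep -iw cause demo-a.txt demo-b.txt | wc -l | tr -d ' ')

# -l: list only the names of files with at least one match.
files=$(grep -il "of course" demo-a.txt demo-b.txt)

# -n: prefix each matching line with its line number.
numbered=$(grep -n cause demo-a.txt | head -n 1)

# -h: drop the file-name prefix that appears when searching several files.
bare=$(grep -ih "of course" demo-a.txt demo-b.txt)

# -C 1: one line of context before and after each matching line.
context=$(grep -C 1 rain demo-a.txt | wc -l | tr -d ' ')

echo "plain=$plain whole=$whole nocase=$nocase context=$context"
echo "files with 'of course':"; echo "$files"
echo "first numbered match: $numbered"
echo "lines without file names:"; echo "$bare"
```

Comparing plain with whole shows why -w matters for word counts: the substring match quietly picks up "Because" as well.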