LING 2050 Special Topics in Linguistics: Corpus Linguistics, University of Pittsburgh


Lab 2
Objectives: managing multiple files; basics of searching text file contents

Reference: http://osxfaq.com/Tutorials/LearningCenter/
Pages covering this lab session: here and here.
Managing multiple files with *

Move into your gutenberg/ directory.

We learned that the command wc -w carroll-alice.txt displays the total number of words contained in carroll-alice.txt. (It's 26,443 words.) But how do we find out the size of the entire corpus? You can specify all text files as arguments:
wc -w austen-emma.txt austen-persuasion.txt austen-sense.txt bible-kjv.txt (... and so on.)
But a better way is to utilize this handy wildcard * notation:
wc -w *
which prints out the following result:

   1483 README
 158167 austen-emma.txt
  83308 austen-persuasion.txt
 118675 austen-sense.txt
 821133 bible-kjv.txt
 ...
  79659 milton-paradise.txt
  20459 shakespeare-caesar.txt
  29605 shakespeare-hamlet.txt
  17741 shakespeare-macbeth.txt
 122070 whitman-leaves.txt
2136725 total

The wildcard * matches every file name in the directory (except for those starting with a period, such as .bash_history). Because wc was given more than one file, it also prints out the total at the bottom.

Now the problem is, the README file, which is not a corpus text file, is included in this tally. To exclude this file and include only the ones ending with .txt:
wc -w *.txt
You might want to save that information into a file for future reference (but probably better to place that file in your home directory, so as not to disturb your corpus directory):
wc -w *.txt > ~/corpus-size.txt

If you are interested in the size of the Shakespeare subcorpus, you can do:
wc -w sha*.txt
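
To see what the wildcard is doing, it can help to try it on a tiny scratch directory first. The sketch below is only an illustration: the directory and its file names (melville-toy.txt, shakespeare-toy.txt, .hidden) are made up for the demo and are not part of the Gutenberg corpus. It uses echo to show the list of file names that the shell hands to a command in place of each pattern.

```shell
# Build a tiny scratch "corpus" (hypothetical file names, not the real one).
demo=$(mktemp -d)
cd "$demo"
printf 'read me first\n' > README
printf 'to be or not to be\n' > shakespeare-toy.txt
printf 'call me ishmael\n' > melville-toy.txt
touch .hidden

# The shell replaces each pattern with the matching file names
# before the command runs, so echo shows what wc would receive.
all=$(echo *)          # every file except those starting with a period
txt=$(echo *.txt)      # only names ending in .txt
sha=$(echo sha*.txt)   # names starting with sha and ending in .txt
echo "$all"
echo "$txt"            # melville-toy.txt shakespeare-toy.txt
echo "$sha"            # shakespeare-toy.txt

# Word counts for the .txt files only, with a total on the last line.
wc -w *.txt
```

Since the expansion happens before wc runs, wc -w *.txt here behaves exactly as if you had typed the two .txt file names yourself.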

Now move up to the parent directory which contains both abc/ and gutenberg/ directories. What is the size (in # of words) of the two corpora combined? The total number of lines?

Searching for words and patterns using grep

grep is a popular Unix command for pattern-based searching. Say you're interested in the word 'cause'. The following command:
grep cause carroll-alice.txt
prints out every line in carroll-alice.txt that matches (== contains) "cause". Note that this includes lines where "cause" is part of a longer word, such as "because". If there are a lot of lines, you can "pipe" the result into less or more so you can scroll through, or use head to just print the first few lines:
grep what carroll-alice.txt | less
grep what carroll-alice.txt | head -15
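
If you want to watch a pipe at work without the full novel, here is a minimal sketch on a made-up file. The file toy.txt and its contents are assumptions for the demo only: it has 20 lines, and only the even-numbered ones contain "what".

```shell
# Create a hypothetical 20-line toy file; even-numbered lines contain "what".
demo=$(mktemp -d)
cd "$demo"
i=1
while [ "$i" -le 20 ]; do
  if [ $((i % 2)) -eq 0 ]; then
    echo "line $i: what is this"
  else
    echo "line $i: nothing to see"
  fi
  i=$((i + 1))
done > toy.txt

# grep finds the matching lines; the pipe hands them to the next command.
matches=$(grep what toy.txt | wc -l)   # count the matching lines
first=$(grep what toy.txt | head -3)   # keep only the first three of them
echo $matches
echo "$first"
```

The command on the right of the pipe never sees toy.txt itself, only the lines that grep printed.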

grep provides many useful option switches, including --color. Try:
grep --color cause carroll-alice.txt
A list of grep command options and syntax:

grep pattern file(s)              prints out all lines in file(s) that match pattern
grep -i pattern file(s)           does the same, but ignores case (so 'the' and 'The' are both matched)
grep -w                           restricts the search to whole words only
grep -n                           precedes each line with its line number
grep -h                           leaves out the file name that otherwise precedes each line (when searching multiple files)
grep -l                           displays a list of files that contain the pattern (actual lines are not shown)
grep -v                           prints the lines that do NOT match pattern
grep --color                      prints the matched portion in color (extremely handy!)
grep -iw --color pattern file(s)  '-' options can be strung together; '--' options cannot
grep "word1 word2" file(s)        pattern must be in quotes if it contains a space
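
The switches above are easier to tell apart on a very small example. The following is a sketch only: the two files one.txt and two.txt and their contents are invented for the demo, chosen so that "the" shows up as a whole word, capitalized, and buried inside longer words.

```shell
# Two hypothetical toy files.
demo=$(mktemp -d)
cd "$demo"
printf 'The cat sat.\nthe theory holds.\nNo match here.\n' > one.txt
printf 'Another line.\nBathe the dog.\n' > two.txt

grep the one.txt                   # case-sensitive: misses "The cat sat."
grep -i the one.txt                # -i also matches "The"
whole=$(grep -w the two.txt)       # -w skips "Another" and "Bathe"
numbered=$(grep -n the one.txt)    # -n adds the line number
files=$(grep -l the *.txt)         # -l lists file names only
nonmatch=$(grep -v the one.txt | wc -l)   # -v keeps the other lines
combined=$(grep -iw the *.txt | wc -l)    # -i and -w strung together
bare=$(grep -h the *.txt)          # -h drops the "one.txt:" style prefix

echo "$whole"        # Bathe the dog.
echo "$numbered"     # 2:the theory holds.
echo "$files"
echo "$bare"
```

Adding --color to any of these highlights the matched portion when the output goes to your terminal.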

The first command below looks up all instances of "of course" in the Gutenberg corpus and counts the number of lines. The second one saves the matching lines into a file.
grep -i "of course" *.txt | wc -l
grep -i "of course" *.txt > search_result.text (Why am I using '.text' instead of '.txt'?)

Now for some exercises:

We are interested in thou, the archaic second person singular pronoun. In how many corpus text files does this word occur?
How many lines of this corpus contain this word?
Is that number the same as thou's total # of occurrences in the corpus? Why not?
Among the lines with thou, how many of them also contain shalt? (hint: use "|")
Among the lines with thou and shalt, how many of them do NOT contain not?
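
As a nudge for the last two questions, here is how several grep commands can be chained with "|" so that each one filters the output of the one before it. This is a sketch on made-up data: the file weather.txt and the words rain, heavy, and not are just stand-ins, so you will still need to adapt it to the corpus yourself.

```shell
# A hypothetical four-line toy file.
demo=$(mktemp -d)
cd "$demo"
printf '%s\n' \
  'heavy rain fell' \
  'rain did not fall' \
  'heavy rain did not stop' \
  'heavy snow fell' > weather.txt

# Lines with "rain" that also contain "heavy".
both=$(grep -w rain weather.txt | grep -w heavy | wc -l)
# Of those, the lines that do NOT contain "not".
without=$(grep -w rain weather.txt | grep -w heavy | grep -vw not | wc -l)

echo $both
echo $without
```

Only the first grep reads the file; every later grep works on whatever lines survived the previous step.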

Lastly, it is also possible to get grep to print out additional context lines before and after the matching lines. Use the option -C n, where n is 1, 2, 3... and so on. This command:
grep --color -C 1 cause carroll-alice.txt
gets you this result:
little histories about children who had got burnt, and eaten up by wild
beasts and other unpleasant things, all because they WOULD not remember
the simple rules their friends had taught them: such as, that a red-hot
--
about two feet high, and was going on shrinking rapidly: she soon found
out that the cause of this was the fan she was holding, and she dropped
it hastily, just in time to avoid shrinking away altogether.
--
is the driest thing I know. Silence all round, if you please! "William
the Conqueror, whose cause was favoured by the pope, was soon submitted
to by the English, who wanted leaders, and had been of late much
--
(*snip*)
Note that the -C switch takes its own argument n; it does not "mesh" with other simple option switches. The syntax:

grep -C n                              prints out n lines before and after each matching line

grep -iw -C n --color pattern file(s)  only simple '-' options can be strung together
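
Here is a minimal sketch of that syntax on a made-up five-line file (toy.txt and its contents are assumptions for the demo). The simple switches -i and -w are strung together, while -C keeps its argument 1 right next to it.

```shell
# A hypothetical five-line toy file with one matching line in the middle.
demo=$(mktemp -d)
cd "$demo"
printf '%s\n' 'one' 'two' 'The CAUSE' 'four' 'five' > toy.txt

# -iw: ignore case, whole words only; -C 1: one line of context on each side.
context=$(grep -iw -C 1 cause toy.txt)
echo "$context"
```

The matching line comes out with one neighbor on each side, so "one" and "five" are left out. You can add --color as well when running the command directly in your terminal.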