LING 2050 Special Topics in Linguistics: Corpus Linguistics, University of Pittsburgh


Lab 2 Take-home Exercise

We are interested in the word 'ought'. How many text files in the Gutenberg corpus contain this word? How many lines in total contain it?

This time, we want to find how many instances of 'ought' are NOT immediately followed by 'to'. How would you achieve this? Is there a potential pitfall to your approach?
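One way to explore the pitfall is to try a candidate pipeline on a tiny file where you can count by hand. The sketch below is only an illustration: the toy file and its three sentences are made up, and the pipeline shown (keep lines with 'ought', then drop lines containing 'ought to') is one possible approach, not necessarily the intended one.

```shell
# Hypothetical toy file; the sentences are invented for this sketch.
tmp=$(mktemp -d)
printf '%s\n' \
  'You ought to go.' \
  'We ought to, yet ought not.' \
  'He ought not.' > "$tmp/toy.txt"

# Candidate approach: lines containing the word 'ought',
# minus every line that contains 'ought to' anywhere.
naive=$(grep -i -w ought "$tmp/toy.txt" | grep -v -i 'ought to' | wc -l | tr -d ' ')

echo "lines kept by the pipeline: $naive"
```

By hand, two instances of 'ought' here are not followed by 'to' (one on the second line, one on the third), yet the pipeline keeps only one line: grep filters whole lines, so the second line is discarded because it also contains 'ought to'. A line that ends in 'ought' with 'to' starting the next line would be miscounted in the other direction.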

How many times does the name 'Ishmael' occur in melville-moby_dick.txt? On which line numbers does it occur?
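When counting, keep in mind that matching lines and matching occurrences are different things. A minimal sketch on a made-up three-line file (the file name and contents are assumptions for illustration, not the actual Moby Dick text):

```shell
# Hypothetical toy file standing in for a corpus text.
tmp=$(mktemp -d)
printf '%s\n' \
  'Call me Ishmael.' \
  'Some years ago.' \
  'Ishmael, said Ishmael.' > "$tmp/toy.txt"

# -c counts matching LINES, not individual occurrences.
line_count=$(grep -c 'Ishmael' "$tmp/toy.txt")

# -o prints each match on its own line, so wc -l counts occurrences.
occ_count=$(grep -o 'Ishmael' "$tmp/toy.txt" | wc -l | tr -d ' ')

# -n prefixes each matching line with its line number; cut keeps just the number.
line_numbers=$(grep -n 'Ishmael' "$tmp/toy.txt" | cut -d: -f1 | tr '\n' ' ')

echo "lines=$line_count occurrences=$occ_count line numbers: $line_numbers"
```

In this toy file the name appears three times but on only two lines, so the two counts differ.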

You want to find out how many lines in the Gutenberg corpus contain both 'sense' and 'leave', and formulate the following command:
grep -i sense *.txt | grep -i leave | wc -l
Now observe what is problematic with this approach. What optional switch should have been used in this case?
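To see the effect in isolation, you can reproduce it on two small files. The sketch below uses invented file contents; the file name whitman-leaves.txt is chosen on the assumption that your copy of the corpus has a similarly named file, and -h is grep's option for suppressing the file name prefix.

```shell
# Hypothetical toy directory; contents are invented for this sketch.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 'plain common sense' > whitman-leaves.txt
printf '%s\n' 'sense enough to leave' 'no sense at all' > other.txt

# With several input files, grep prefixes every hit with "filename:",
# and the second grep searches that prefix along with the line itself.
with_names=$(grep -i sense *.txt | grep -i leave | wc -l | tr -d ' ')

# -h suppresses the file name prefix, so only the line text is searched.
without_names=$(grep -h -i sense *.txt | grep -i leave | wc -l | tr -d ' ')

echo "with file names: $with_names, without: $without_names"
```

Only one line in these files contains both words, but the first pipeline reports two, because 'leave' also matches inside the file name whitman-leaves.txt.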

You want to pull up the lines in which 'joy' and 'tear' occur together. This command achieves it (almost... we'll ignore -w for now):
grep -i joy *.txt | grep -i tear
And then you think: "This only pulls up cases where the two words are on the same line! Better expand the context..." and formulate this command to print out additional lines around the matches:
grep -i -C 2 joy *.txt | grep -i -C 2 tear
Does it give you the intended result? What is problematic with this approach?
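A small experiment can make the behavior easier to inspect than the full corpus output. The following is a sketch on two invented files (a.txt and b.txt are hypothetical names and contents); it stores the output of each stage so you can look at them separately.

```shell
# Hypothetical toy files; contents are invented for this sketch.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 'x1' 'x2' 'x3' 'pure joy' > a.txt
printf '%s\n' 'shed a tear' 'joy came' 'y3' 'y4' > b.txt

# Stage 1: 'joy' matches plus two lines of context from each file.
stage1=$(grep -i -C 2 joy a.txt b.txt)

# Stage 2: the second grep sees only stage 1's output stream, so its
# context lines come from that stream, not from the original files.
stage2=$(printf '%s\n' "$stage1" | grep -i -C 2 tear)

printf '%s\n' "$stage2"
```

Here the 'tear' line is the first line of b.txt, yet the final output also shows 'pure joy' from a.txt as context before it: the second grep treats the concatenated output of the first as one continuous text, so unrelated chunks (and any separator lines grep prints between them) end up as neighbors.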