Lab 2 Take-home Exercise

  1. We are interested in the word 'ought'. How many text files in the Gutenberg corpus contain this word? How many lines in total?

  2. This time, we want to find how many occurrences of 'ought' are NOT immediately followed by 'to'. How would you achieve this? Is there a potential pitfall in your approach?

  3. How many times does the name 'Ishmael' occur in melville-moby_dick.txt? On which line numbers?

  4. You want to find out how many lines in the Gutenberg corpus contain both 'sense' and 'leave', so you formulate the following command:
    grep -i sense *.txt | grep -i leave | wc -l
    Now observe what is problematic with this approach. What optional switch should have been used in this case?

  5. You want to pull up the lines in which 'joy' and 'tear' occur together. This command achieves it (almost... we'll ignore -w for now):
    grep -i joy *.txt | grep -i tear
    And then you think: "This only pulls up cases where the two words are on the same line! Better expand the context..." and formulate this command to print out additional lines around the matches:
    grep -i -C 2 joy *.txt | grep -i -C 2 tear
    Does it give you the intended result? What is problematic with this approach?
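
Before running commands against the full corpus, it can help to try the relevant grep switches on a tiny throwaway corpus where the counts can be checked by eye. The sketch below is only an illustration, not a solution to any question above: the file names a.txt and b.txt and their contents are made up, and a POSIX shell with a standard grep is assumed.

```shell
# Build a tiny throwaway corpus (hypothetical files, invented contents).
tmp=$(mktemp -d)
printf 'You ought to go.\nHe ought\nto stay.\nNothing here.\n' > "$tmp/a.txt"
printf 'We OUGHT not.\nI thought so.\n' > "$tmp/b.txt"

# -l lists only the names of files with at least one match;
# -w restricts matches to whole words, so 'thought' does not count.
files_cs=$(grep -lw ought "$tmp"/*.txt | wc -l | tr -d ' ')

# Adding -i makes the match case-insensitive, so 'OUGHT' counts too.
files_ci=$(grep -liw ought "$tmp"/*.txt | wc -l | tr -d ' ')

# -c counts matching LINES in a file (not individual occurrences).
lines_a=$(grep -cw ought "$tmp/a.txt")

# -n prefixes each matching line with its line number in the file.
numbered=$(grep -nw ought "$tmp/a.txt")

echo "files (case-sensitive):   $files_cs"   # 1
echo "files (case-insensitive): $files_ci"   # 2
echo "matching lines in a.txt:  $lines_a"    # 2
echo "$numbered"

# Clean up the temporary corpus.
rm -r "$tmp"
```

Because the toy files are only a few lines long, every number printed can be confirmed by reading the files directly, which makes it easier to tell whether a command is doing what you expect before trusting its output on the real texts.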