LING 2050 Special Topics in Linguistics: Corpus Linguistics, University of Pittsburgh

Lab 3 Homework Assignment

The goal is to produce a word frequency table file for the entire gutenberg corpus, using both approaches (a. and b.) outlined in the "From one text file to an entire corpus" section of the Lab 3 page.

Try approach a. First compile a tokenized word file for each text file, then combine them to produce the word frequency table file for the entire corpus. You will need the cat command for this (see the command reference sheet). How many word types are found in this corpus? Copy and paste the top 30 word types and their frequencies.

Try approach b. You should be able to obtain the final result (word frequency table file for the entire corpus) from the original corpus files using a single long chain of commands. What is this chain of commands? Is your result the same as what you obtained above? (It should be.)
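As a rough illustration of how the two approaches relate, the sketch below runs both on a toy two-file corpus and checks that the resulting tables match. The tokenizing and lowercasing commands (tr), the sort -nr ranking step, and all file names are assumptions for illustration; the commands given on the Lab 3 page may differ.

```shell
export LC_ALL=C   # byte-wise ranges and sort order, for reproducible output

# Toy stand-in for the corpus (hypothetical file names and contents).
work=$(mktemp -d)
printf 'the cat saw the dog\n' > "$work/a.txt"
printf 'The dog saw a bird\n' > "$work/b.txt"

# Approach a: one tokenized word file per text, then cat the word files.
for f in "$work"/*.txt; do
    tr -sc 'A-Za-z' '\n' < "$f" | tr 'A-Z' 'a-z' > "${f%.txt}.words"
done
cat "$work"/*.words | sort | uniq -c | sort -nr > "$work/freq_a"

# Approach b: a single chain starting from the original text files.
cat "$work"/*.txt | tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z' \
    | sort | uniq -c | sort -nr > "$work/freq_b"

# Each line of the table is one word type, so the line count is the type count.
types=$(wc -l < "$work/freq_a" | tr -d ' ')
echo "word types: $types"
head -n 3 "$work/freq_a"      # use head -n 30 on the real corpus

# cmp -s is silent and succeeds only when the two files are byte-identical.
if cmp -s "$work/freq_a" "$work/freq_b"; then same=yes; else same=no; fi
echo "tables identical: $same"
```

One caveat with the single chain: if a text file does not end in a newline, cat joins its last word to the first word of the next file, which would make the two results differ slightly.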

Examine the four individual commands strung together in the command chain shown in step 5 of the "Putting it all together" section of the Lab 3 page. In particular, pay attention to their relative ordering. Which command can be ordered freely in relation to the others? Which pairs of commands need to be ordered in a particular way?
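One way to explore ordering questions like these is to run two orderings on a small input and compare the output. The sketch below does this for two pairs of commands that commonly appear in word-counting chains (sort with uniq -c, and tokenizing with lowercasing). Whether these are the same four commands as in the Lab 3 chain is an assumption.

```shell
export LC_ALL=C

# Four tokens, two types, deliberately not sorted.
words=$(printf 'b\na\nb\na\n')

# uniq -c only merges adjacent identical lines, so input order matters here.
rows_unsorted=$(printf '%s\n' "$words" | uniq -c | wc -l | tr -d ' ')
rows_sorted=$(printf '%s\n' "$words" | sort | uniq -c | wc -l | tr -d ' ')
echo "table rows without sort: $rows_unsorted, with sort: $rows_sorted"

# Tokenizing and lowercasing, tried in both orders on the same text.
text='The CAT, the cat'
tok_then_lower=$(printf '%s\n' "$text" | tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z')
lower_then_tok=$(printf '%s\n' "$text" | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n')
if [ "$tok_then_lower" = "$lower_then_tok" ]; then
    echo "tokenize/lowercase: same output in either order"
else
    echo "tokenize/lowercase: order changes the output"
fi
```

The same compare-two-orderings pattern can be applied to any other pair of commands in the chain.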

Compile a word frequency table file for every individual text file of the gutenberg corpus. Do the same for the abc corpus. In addition, compile a word frequency table file for the entire abc corpus.
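Producing one table per text file is a natural fit for a shell loop. The following is a minimal sketch on a toy corpus; the directory layout, the .freq extension, and the tr-based tokenizer are hypothetical choices, not something prescribed by the lab.

```shell
export LC_ALL=C

# Toy corpus directory and an output directory (hypothetical names).
work=$(mktemp -d)
mkdir "$work/corpus" "$work/freq"
printf 'to be or not to be\n' > "$work/corpus/one.txt"
printf 'that is the question\n' > "$work/corpus/two.txt"

# One frequency table per text file, named after the source file.
for f in "$work"/corpus/*.txt; do
    name=$(basename "$f" .txt)
    tr -sc 'A-Za-z' '\n' < "$f" | tr 'A-Z' 'a-z' \
        | sort | uniq -c | sort -nr > "$work/freq/$name.freq"
done

ls "$work/freq"
```

Running the same loop over a second corpus only requires pointing the glob pattern at that corpus's directory.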

What is the type-token ratio of each of the two corpora (gutenberg and abc)?

What is the type-token ratio of each individual text file of the two corpora?
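The type-token ratio is the number of distinct word types divided by the total number of tokens. A minimal sketch of computing it per file and for a whole corpus follows; the helper names ttr and tokenize, the toy texts, and the tr-based tokenizer are all assumptions for illustration.

```shell
export LC_ALL=C

work=$(mktemp -d)
printf 'to be or not to be\n' > "$work/one.txt"
printf 'to be is to do\n' > "$work/two.txt"

# Assumed tokenizer: one lowercase word per line.
tokenize() { tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z'; }

# Reads one token per line on stdin and prints types / tokens.
ttr() {
    sort | uniq -c | awk '
        { tokens += $1 }    # column 1 of uniq -c output is the count
        END { if (tokens > 0) printf "%.4f\n", NR / tokens }'
}

# Type-token ratio of each individual file.
for f in "$work"/*.txt; do
    printf '%s\t%s\n' "$(basename "$f")" "$(tokenize < "$f" | ttr)"
done

# Type-token ratio of the corpus as a whole.
corpus_ttr=$(cat "$work"/*.txt | tokenize | ttr)
printf 'whole corpus\t%s\n' "$corpus_ttr"
```

In this toy example the corpus ratio is lower than either file's ratio, because words shared between the files count as a type only once while every occurrence still counts as a token. The corpus ratio is therefore not the average of the per-file ratios.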

For the entire gutenberg corpus, compile a bigram frequency table, a trigram frequency table, and a 4-gram frequency table. What are the 20 most frequent n-grams in each of the three tables?
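An n-gram table can be built from a tokenized word file by emitting every run of n consecutive tokens and then counting as before. The sketch below uses a small awk helper, ngrams, which is a hypothetical name introduced here; the lab may build n-grams differently (for example with tail and paste).

```shell
export LC_ALL=C

work=$(mktemp -d)
printf 'to be or not to be to be\n' \
    | tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z' > "$work/words"

# ngrams N: reads one token per line on stdin, prints one N-gram per line.
ngrams() {
    awk -v n="$1" '
        { w[NR] = $0 }                        # remember the current token
        NR >= n {
            g = w[NR - n + 1]                 # first token of the window
            for (i = NR - n + 2; i <= NR; i++) g = g " " w[i]
            print g
            delete w[NR - n + 1]              # drop the token that left the window
        }'
}

# Frequency tables for bigrams, trigrams, and 4-grams.
for n in 2 3 4; do
    echo "top ${n}-grams:"
    ngrams "$n" < "$work/words" | sort | uniq -c | sort -nr | head -n 20
done
```

When the word file is the concatenation of several texts, some n-grams will span the boundary between two files; that effect is negligible for a large corpus but is worth knowing about.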

An open question. Examine the results so far; in addition, feel free to try out other means of exploration. Report on any one interesting observation you made.