We will continue to explore the two corpora from Homework 1: the Bible and the Jane Austen novels, part of NLTK's Project Gutenberg Selections corpus. Again, the files are:
The King James Version Bible (bible-kjv.txt)
Jane Austen novels (austen-emma.txt, austen-persuasion.txt, austen-sense.txt)
PART 1: Bigram Frequencies of the Bible and Jane Austen Novels [35 points]
In PART 1, we will take a close look at the bigram frequencies of the two corpora. We are interested in which word bigrams are frequently found in each corpus, and also which words are found following the word 'so', and with what probability. Additionally, we will pickle the bigram frequency dictionaries so we can reuse them later. To achieve these goals, complete this TEMPLATE script, which:
imports necessary modules,
opens the text files for the two corpora and reads in the content as text strings,
builds the following objects, b_ for the Bible and a_ for Austen (a sketch of these build steps appears after the list):
b_toks, a_toks: word tokens, all in lowercase
Let's do this right. First build a token list from the original, mixed-case text, and then lowercase each token through a list comprehension. That is: do NOT lowercase your entire text string first and then tokenize. Why avoid the latter? Because word tokenization relies on capitalization; by folding case beforehand we are bound to confuse the tokenizer, as we saw in the previous exercise.
b_tokfd, a_tokfd: word frequency distribution
b_bigrams, a_bigrams: word bigrams, cast as a list
If you concatenated the three Austen novel texts, one unfortunate side effect is that concatenation creates two bigrams that technically shouldn't be there: the last word of novel #1 + the first word of novel #2, and the last word of novel #2 + the first word of novel #3. This would be more of a problem if our corpus consisted of a large number of smaller text files, but here we're talking about only a couple of rogue bigram tokens, so we'll simply ignore them.
b_bigramfd, a_bigramfd: bigram frequency distribution
b_bigramcfd, a_bigramcfd: bigram (w1, w2) conditional frequency distribution ("CFD"), where w1 is construed as the condition and w2 the outcome
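Here is a minimal sketch of these build steps for the Bible side; the Austen side is parallel, except that you read the three novel files and concatenate their contents first. It assumes NLTK is installed and the text files sit in the working directory:

```python
import nltk

# Read in the corpus text. (The Austen side would read the three
# novel files and concatenate their contents the same way.)
with open('bible-kjv.txt', encoding='utf-8') as f:
    b_text = f.read()

# Tokenize the ORIGINAL mixed-case text first, then lowercase
# each token through a list comprehension (see the note above).
b_toks = [t.lower() for t in nltk.word_tokenize(b_text)]

# Word frequency distribution.
b_tokfd = nltk.FreqDist(b_toks)

# Word bigrams, cast as a list (nltk.bigrams returns a generator).
b_bigrams = list(nltk.bigrams(b_toks))

# Bigram frequency distribution: keys are (w1, w2) tuples.
b_bigramfd = nltk.FreqDist(b_bigrams)

# Bigram CFD, with w1 as the condition and w2 the outcome: a list
# of (w1, w2) pairs is exactly what ConditionalFreqDist expects.
b_bigramcfd = nltk.ConditionalFreqDist(b_bigrams)
```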
pickles the two bigram CFDs (conditional frequency distributions) using the highest binary protocol: name the files bible_bigramcfd.pkl and austen_bigramcfd.pkl.
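A sketch of the pickling step, assuming the two CFDs have been built as above:

```python
import pickle

# Save each bigram CFD using the highest binary pickle protocol.
with open('bible_bigramcfd.pkl', 'wb') as f:
    pickle.dump(b_bigramcfd, f, protocol=pickle.HIGHEST_PROTOCOL)
with open('austen_bigramcfd.pkl', 'wb') as f:
    pickle.dump(a_bigramcfd, f, protocol=pickle.HIGHEST_PROTOCOL)
```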
answers the following questions by exploring the objects:
As usual, you should work on these questions in the IDLE shell first. It is MUCH (much!) faster than working on the script side: with scripts, you change one thing in your code, and then... yep, you have to re-run the entire script and wait while everything is rebuilt. Sketches of the relevant shell explorations appear after the question list.
How many word tokens and types are there, for each corpus?
Compare the overall size of the two corpora. Which one is larger?
What are the top 20 most frequent words and their counts, for each corpus?
Make a comparison. Anything noteworthy?
What are the top 20 most frequent word bigrams and their counts, for each corpus?
Make a comparison between the two corpora. What observations can you make?
How many times does the word 'so' occur in each corpus? What is its relative frequency against the corpus size (= total # of tokens)?
Judging by the relative frequency, in which corpus is 'so' more frequently found, and by how much?
In each corpus, what are the top 20 'so'-initial bigrams (bigrams that have the word 'so' as the first word) and their counts?
Do a cross-comparison. What observations can you make? Is the Bible's use of 'so' similar to Austen's?
In the Bible, given the word 'so' as the current word, what is the probability of getting 'much' as the next word? How about in the Jane Austen novels? And how does 'will' fare as the next word? Provide a cross-comparison summary.
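For the count-based questions above, a few shell-side probes might look like this sketch (shown for the Bible objects; swap in the a_ names for Austen):

```python
# Tokens vs. types: a type is a distinct word form.
len(b_toks)               # number of word tokens
len(b_tokfd)              # number of word types

# Top 20 most frequent words and word bigrams, with counts.
b_tokfd.most_common(20)
b_bigramfd.most_common(20)

# Count of 'so' and its relative frequency against the corpus size.
b_tokfd['so']
b_tokfd['so'] / len(b_toks)
```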
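For the 'so' questions, note that indexing the CFD with a condition returns a FreqDist over the outcomes, so the usual FreqDist methods apply. A sketch:

```python
# The 20 most common words following 'so', with their counts.
b_bigramcfd['so'].most_common(20)

# P(next word = 'much' | current word = 'so'): the count of
# ('so', 'much') divided by the total count of 'so' as w1.
b_bigramcfd['so']['much'] / b_bigramcfd['so'].N()

# FreqDist.freq() computes the same relative frequency directly.
b_bigramcfd['so'].freq('much')
b_bigramcfd['so'].freq('will')
```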
PART 2: Bigram Speak [15 points]
Next up, let's use the pickled data for some fun. We will plug the two CFDs into a program called "Bigram Speak". Instructions: