LING 1330/2330 Introduction to Computational Linguistics, University of Pittsburgh

PART 1: Bigram Frequencies of the Bible and Jane Austen Novels [35 points]

Complete this TEMPLATE script so that it:

imports necessary modules,

opens the text files for the two corpora and reads in the content as text strings,

builds the following objects, b_ for the Bible and a_ for Austen:

b_toks, a_toks: word tokens, all in lowercase
Let's do this right. First build a token list from the original, mixed-case text, and then lowercase each token with a list comprehension. That is: do NOT lowercase your entire text string first and then tokenize. Why avoid the latter? Word tokenization relies on capitalization, so folding case beforehand is bound to confuse the tokenizer, as we saw in the previous exercise.
b_tokfd, a_tokfd: word frequency distribution
b_bigrams, a_bigrams: word bigrams, cast as a list
If you concatenated the three Austen novel texts, one unfortunate side effect is that the concatenation creates two bigrams that technically shouldn't be there: the last word of Jane Austen novel #1 + the first word of novel #2, and the last word of novel #2 + the first word of novel #3. This would be more of a problem if our corpus consisted of a large number of small text files, but here we're talking about only a couple of rogue bigram tokens, so we'll simply ignore them.
b_bigramfd, a_bigramfd: bigram frequency distribution
b_bigramcfd, a_bigramcfd: bigram (w1, w2) conditional frequency distribution ("CFD"), where w1 is construed as the condition and w2 the outcome
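As a rough illustration of how these objects relate to each other, here is a minimal sketch that uses only the Python standard library on a toy sentence. Everything in it is an assumption made for illustration: the `tokenize` helper is a hypothetical regex stand-in for whatever tokenizer the template script imports, and `Counter` and `defaultdict(Counter)` stand in for the toolkit's frequency distribution and conditional frequency distribution objects. The unprefixed variable names mirror the b_/a_ names above.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Hypothetical stand-in tokenizer: words (with internal apostrophes)
    or single punctuation marks. Not the tokenizer used in the course."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


text = "So it was. So much for that, so much indeed."

# Tokenize the original mixed-case text FIRST, then lowercase each token.
toks = [t.lower() for t in tokenize(text)]

# Word frequency distribution: token -> count.
tokfd = Counter(toks)

# Word bigrams, cast as a list of (w1, w2) tuples.
bigrams = list(zip(toks, toks[1:]))

# Bigram frequency distribution: (w1, w2) -> count.
bigramfd = Counter(bigrams)

# Bigram conditional frequency distribution: w1 (condition) -> Counter of w2 (outcome).
bigramcfd = defaultdict(Counter)
for w1, w2 in bigrams:
    bigramcfd[w1][w2] += 1

print(tokfd.most_common(3))
print(bigramfd.most_common(3))
print(bigramcfd["so"].most_common())
```

Note that a list of n tokens yields n - 1 bigrams, and that the conditional distribution is just the bigram counts regrouped by their first word.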

pickles the two bigram CFDs (conditional frequency distributions) using the highest binary protocol: name the files bible_bigramcfd.pkl and austen_bigramcfd.pkl.
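The pickling step could look something like the sketch below, which writes a toy object with `pickle.HIGHEST_PROTOCOL` and reads it back to confirm the round trip. The toy `cfd` dictionary is an assumed stand-in for a real bigram CFD, and the temporary directory is used only so the example leaves no files behind; in the homework script the file would be written to the working directory under the names given above.

```python
import os
import pickle
import tempfile
from collections import Counter

# Toy stand-in for a bigram CFD: condition word -> Counter of next words.
cfd = {"so": Counter({"much": 2, "it": 1})}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "bible_bigramcfd.pkl")

    # Pickle files are binary, so open with "wb"; request the highest protocol.
    with open(path, "wb") as f:
        pickle.dump(cfd, f, pickle.HIGHEST_PROTOCOL)

    # Unpickle with "rb" to verify that the object survives the round trip.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == cfd)
```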

answers the following questions by exploring the objects:

How many word tokens and types are there, for each corpus?
Compare the overall size of the two corpora. Which one is larger?
What are the top 20 most frequent words and their counts, for each corpus?
Make a comparison. Anything noteworthy?
What are the top 20 most frequent word bigrams and their counts, for each corpus?
Make a comparison between the two corpora. What observations can you make?
How many times does the word 'so' occur in each corpus? What is its relative frequency in each corpus, measured against the corpus size (= total # of tokens)?
Judging by the relative frequency, in which corpus is 'so' more frequently found, and by how much?
In each corpus, what are the top 20 'so-initial' bigrams (bigrams that have the word so as the first word) and their counts?
Do a cross-comparison. What observations can you make? Is the Bible's use of 'so' similar to Austen's?
In the Bible, given the word 'so' as the current word, what is the probability of getting 'much' as the next word? How about in the Jane Austen novels? How about 'will' -- how does it fare as the next word? Provide a cross-comparison summary.
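The two kinds of numbers asked for in the 'so' questions can be sketched on toy data as follows. This is a minimal illustration using the standard library's `Counter` as an assumed stand-in for the frequency distribution objects; the token list is made up and is not drawn from either corpus. Relative frequency divides the count of 'so' by the total number of tokens, while the next-word probability divides the count of the bigram ('so', w2) by the count of all 'so'-initial bigrams.

```python
from collections import Counter

# Made-up toy token list, already lowercased.
toks = "so much so that so it was so".split()

# Relative frequency of 'so' against corpus size (total number of tokens).
tokfd = Counter(toks)
so_count = tokfd["so"]
rel_freq = so_count / len(toks)

# Counts of words that follow 'so', i.e. the 'so' row of a bigram CFD.
so_followers = Counter(w2 for w1, w2 in zip(toks, toks[1:]) if w1 == "so")

# Top 'so-initial' bigrams and their counts (at most 20).
top_so_bigrams = so_followers.most_common(20)

# P(next word = 'much' | current word = 'so').
total_so_bigrams = sum(so_followers.values())
p_much = so_followers["much"] / total_so_bigrams

# A word never seen after 'so' simply gets probability 0.
p_will = so_followers["will"] / total_so_bigrams

print(so_count, rel_freq)
print(top_so_bigrams)
print(p_much, p_will)
```

In this toy list the final token is 'so', so it starts no bigram: 'so' occurs 4 times but there are only 3 'so'-initial bigrams. In a large corpus the two denominators differ by at most one, but the distinction explains why the conditional probabilities sum to 1 over the bigram counts rather than over the word count.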

PART 2: Bigram Speak [15 points]


Download BigramSpeak.py.
The program won't run as it is. You need to modify it first by doing the following:
1. Plug in one of your bigram CFDs (conditional frequency distributions) by unpickling and loading one of your pickled CFDs, assigning it to the variable w1w2f.
2. Choose the appropriate title for your session by uncommenting one of the provided value assignments for title.
Try out the program. Make sure to try the word 'so', and also the ENTER key. Try out a few different runs to get a sense of how the program works.
Now do the same with the other corpus data.

Save your shell session as a text file (.txt extension). Open it in a text editor, and at the end of the file add your answers to the following questions:

Examine the code closely to understand how it works. What does the interactive portion of the program do? Describe in your own words what it does, step by step.

Copy and paste a passage that the script produced, mostly through random selection, which you found particularly Bible-like. What strikes you?

Do the same with Jane Austen.

Provide a summary of your overall assessment of this program. Include your impressions, observations, and comparisons between Bible Speak and Jane Austen Speak, plus anything else that strikes you.

SUBMIT:

PART 1: The completed bible_austen_bigrams.py script, and its shell-side output saved as a text file bible_austen_bigrams.OUT.txt. You don't need to submit the pickle files.
PART 2: A saved shell session as a text file containing your answers and summary write-up at the end.

Homework Assignment 2: A Duel of Bigrams
