LING 1330/2330 Introduction to Computational Linguistics, University of Pittsburgh

Exercise 4: Austen vs. the ENABLE Word List

Imagine Jane Austen, with a mighty pen in her hand, going up against a group of Scrabble/Words-with-Friends players. In this part, we will compare Austen's words against the ENABLE English word list, working exclusively in the IDLE shell (not in a script!). The goal is to practice list comprehensions as well as n-grams. Follow the steps below.

STEP 1: unpickle the ENABLE list
We already processed the ENABLE English word list (enable1.txt, linked at the bottom) in class and pickled it as words.pkl. In the Python shell, unpickle it as a list called wlist and get ready to explore.

>>> import pickle
>>> f = open('words.pkl', 'rb')
>>> wlist = pickle.load(f)
>>> f.close()
>>> wlist[-10:]
['zymology', 'zymosan', 'zymosans', 'zymoses', 'zymosis', 'zymotic', 'zymurgies', 'zymurgy', 'zyzzyva', 'zyzzyvas']
>>> len(wlist)
172820
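For reference, a pickle round trip looks roughly like the sketch below. The file name toy_words.pkl and the three-word list are made up for illustration; the real words.pkl was built in class from enable1.txt.

```python
import os
import pickle
import tempfile

# Toy stand-in for the ENABLE list (hypothetical data, not the real file)
toy_words = ['aa', 'aah', 'aahed']

# Write into a temporary directory so nothing in your working folder is touched
path = os.path.join(tempfile.mkdtemp(), 'toy_words.pkl')

# Pickle files are binary: open with 'wb' to write and 'rb' to read
with open(path, 'wb') as f:
    pickle.dump(toy_words, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The unpickled object is a list equal to the one we saved
print(restored == toy_words)
```

The 'rb' mode is the part that matters for this step: opening a pickle file in text mode will generally fail.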

STEP 2: process Emma
You have already downloaded Austen's Emma as part of HW1. Read it in and apply the usual text processing steps, building three objects: etoks (a list of word tokens, all in lowercase for this exercise), etypes (an alphabetically sorted word type list), and efreq (word frequency distribution).

>>> import nltk
>>> fname = "C:/Users/narae/Documents/ling1330/gutenberg/austen-emma.txt"
>>> f = open(fname, 'r')
>>> etxt = f.read()
>>> f.close()
>>> etxt[-200:]
'e deficiencies, the wishes,\nthe hopes, the confidence, the predictions of the small band\nof true friends who witnessed the ceremony, were fully answered\nin the perfect happiness of the union.\n\n\nFINIS\n'
>>> etoks = nltk.word_tokenize(etxt.lower())    # lowercase everything
>>> etoks[-20:]
['of', 'true', 'friends', 'who', 'witnessed', 'the', 'ceremony', ',', 'were', 'fully', 'answered', 'in', 'the', 'perfect', 'happiness', 'of', 'the', 'union', '.', 'finis']
>>> len(etoks)
191851
>>> etypes = sorted(set(etoks))
>>> etypes[-10:]
['younger', 'youngest', 'your', 'yours', 'yourself', 'yourself.', 'youth', 'youthful', 'zeal', 'zigzags']
>>> len(etypes)
7914
>>> efreq = nltk.FreqDist(etoks)
>>> efreq['beautiful']
24

STEP 3: list-comprehend Emma
Now, explore the three objects wlist, efreq, and etypes to answer the following questions. Do NOT use a for loop! Every solution must use a LIST COMPREHENSION.

Question 1: Words with prefix and suffix
What words did Jane Austen use that start with 'un' and end in 'able'?
Question 2: Length
How many Emma word types are 15 characters or longer? Exclude hyphenated words.
Question 3: Average word length
What's the average length of all Emma word types?
First, use a list comprehension on etypes to turn each word type into its length. Then, use sum() to add up the lengths, and divide by the number of types.
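The same three moves (map to lengths, sum, divide) can be sketched on a toy list. The words in toy_types are made up for illustration and are not taken from Emma; swap in etypes for the real question.

```python
# Toy word types (hypothetical; not taken from Emma)
toy_types = ['a', 'pen', 'mighty', 'scrabble']

# Turn each word type into its length
lengths = [len(w) for w in toy_types]

# Average = total of the lengths / number of types
avg_len = sum(lengths) / len(toy_types)

print(lengths)   # [1, 3, 6, 8]
print(avg_len)   # 4.5
```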

Question 4: Word frequency
How many Emma word types have a frequency count of 200 or more? How many word types appear only once?
For this, you want to use filtering: [x for x in etypes if ...]. In the if condition, test the frequency count of x, which you can look up as efreq[x].
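Here is a minimal sketch of the filtering pattern on toy data. It uses collections.Counter as a stand-in so it runs without NLTK; nltk.FreqDist is built on Counter, so the toy_freq[x] lookup works the same way as efreq[x]. The tokens and the threshold of 2 are made up for illustration.

```python
from collections import Counter

# Toy tokens (hypothetical; not taken from Emma)
toy_toks = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'cat', 'ran']
toy_types = sorted(set(toy_toks))

# Counter stands in for nltk.FreqDist here
toy_freq = Counter(toy_toks)

# Keep only the types whose frequency meets the threshold
frequent = [x for x in toy_types if toy_freq[x] >= 2]

print(frequent)        # ['cat', 'the']
print(len(frequent))   # 2
```

Wrapping the comprehension in len() turns "which types?" into "how many types?".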

Question 5: Emma words not in wlist
Of the Emma word types, how many of them are not found in our list of ENABLE English words, i.e., wlist?

>>> 'cavatappi' in wlist
False
>>> 'cavatappi' not in wlist
True
>>> 'spaghetti' in wlist
True

Q5 surely takes a long time to process! Let's try a different approach. First create wset by passing wlist through the set() function. Since the original wlist did not contain any duplicates, this set has the same size:
>>> wset = set(wlist)
>>> len(wset)
172820
Now, try the same list comprehension as in Q5, but with wset instead of wlist. The end result should be exactly the same, but did you notice the difference in speed? The lesson here: different data types are optimized for different operations. Sets are optimized for membership tests, while lists are not: a list has to be scanned item by item, whereas a set can look a word up directly. That's why this list comprehension finishes much, much faster. As a matter of fact, the most efficient method is simply to compute the set difference: turn etypes into a set, and then subtract wset. Something to keep in mind!
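The three approaches can be compared on toy data, as in the sketch below. The word lists are made up for illustration; with only a handful of words you will not see a speed difference, but you can confirm that all three give the same words.

```python
# Toy stand-ins for etypes and wlist (hypothetical data)
toy_types = ['apple', 'cavatappi', "n't", 'pen', 'zyzzyva']
toy_wlist = ['apple', 'pen', 'spaghetti', 'zyzzyva']
toy_wset = set(toy_wlist)

# 1. Membership test against a list (slow on large lists)
missing_via_list = [w for w in toy_types if w not in toy_wlist]

# 2. Same comprehension against a set (fast membership test)
missing_via_set = [w for w in toy_types if w not in toy_wset]

# 3. Set difference; the result is a set, so its order is not guaranteed
missing_via_diff = set(toy_types) - toy_wset

print(missing_via_list == missing_via_set)                # True
print(sorted(missing_via_diff) == sorted(missing_via_set))  # True
```

Note that the set difference returns a set, so sort it if you need the words in alphabetical order.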

STEP 4: bigrams in Emma
Let's now try out bigrams. Build two objects: e2grams (a list of word bigrams; nltk.bigrams() returns a generator, so make sure to cast it as a list) and e2gramfd (a frequency distribution of bigrams) as shown below, and then answer the following questions.

>>> e2grams = list(nltk.bigrams(etoks))
>>> e2gramfd = nltk.FreqDist(e2grams)
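To see what these two objects hold, here is a small sketch on a toy sentence. It assumes plain-Python stand-ins so it runs without NLTK: zip() pairs each token with its successor, which yields the same adjacent-pair tuples as nltk.bigrams(), and Counter plays the role of nltk.FreqDist. The tokens are made up for illustration.

```python
from collections import Counter

# Toy tokens (hypothetical; not taken from Emma)
toy_toks = ['the', 'cat', 'saw', 'the', 'cat', 'run']

# Each bigram is a tuple of two adjacent tokens
toy_2grams = list(zip(toy_toks, toy_toks[1:]))

# Frequency distribution keyed by bigram tuples
toy_2gramfd = Counter(toy_2grams)

print(toy_2grams[:3])                 # [('the', 'cat'), ('cat', 'saw'), ('saw', 'the')]
print(toy_2gramfd[('the', 'cat')])    # 2
```

The key point is that each bigram is a tuple, so frequency lookups take a tuple such as ('the', 'cat') rather than a single string.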

Question 6: Bigrams
What are the last 10 bigrams?
Question 7: Bigram top frequency
What are the top 20 most frequent bigrams?
Question 8: Bigram frequency count
How many times does the bigram 'so happy' appear?
Question 9: Word following 'so'
What are the words that follow 'so'? What are their frequency counts? (A for loop would be easier, but see if you can use a list comprehension for this.)

SUBMIT:

Upload: Your saved Python shell session (a text file with .txt extension).
