Exercise 4: Austen vs. the ENABLE Word List

Imagine Jane Austen, with a mighty pen in her hand, going up against a group of Scrabble/Words-with-Friends players. In this part, we will compare Austen's words against the ENABLE English word list, working exclusively in the IDLE shell (not in a script!). The goal is to practice list comprehensions as well as n-grams. Follow the steps below.

STEP 1: unpickle the ENABLE list

We already processed the ENABLE English word list (enable1.txt, linked at the bottom) in class and pickled it as words.pkl. In the Python shell, unpickle it as a list called wlist and get ready to explore.
>>> import pickle
>>> f = open('words.pkl', 'rb')
>>> wlist = pickle.load(f)
>>> f.close()
>>> wlist[-10:]
['zymology', 'zymosan', 'zymosans', 'zymoses', 'zymosis', 'zymotic', 'zymurgies', 
'zymurgy', 'zyzzyva', 'zyzzyvas']
>>> len(wlist)
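
For reference, the pickle round trip can be sketched in a self-contained way. This is only an illustration: the toy list and the temporary file path are hypothetical stand-ins for wlist and words.pkl, and the with block is an alternative to calling f.close() by hand.

```python
import os
import pickle
import tempfile

# Toy stand-in for the ENABLE word list (hypothetical data).
toy_words = ['aa', 'aah', 'zyzzyva']

# Hypothetical file location; the assignment uses words.pkl instead.
path = os.path.join(tempfile.mkdtemp(), 'toy_words.pkl')

# Pickle: write the list in binary mode ('wb').
with open(path, 'wb') as f:
    pickle.dump(toy_words, f)

# Unpickle: read it back in binary mode ('rb').
# The with block closes the file automatically.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == toy_words)
print(len(restored))
```

Note that pickle files must be opened in binary mode for both writing and reading, which is why the session above uses 'rb'.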

STEP 2: process Emma

You have already downloaded Austen's Emma as part of HW1. Read it in and apply the usual text processing steps, building three objects: etoks (a list of word tokens, all in lowercase), etypes (an alphabetically sorted list of word types), and efreq (a word frequency distribution).
>>> fname = "C:/Users/narae/Documents/ling1330/gutenberg/austen-emma.txt"
>>> f = open(fname, 'r')
>>> etxt = f.read()
>>> f.close()
>>> etxt[-200:]
'e deficiencies, the wishes,\nthe hopes, the confidence, the predictions of the 
small band\nof true friends who witnessed the ceremony, were fully answered\nin 
the perfect happiness of the union.\n\n\nFINIS\n'
>>> import nltk
>>> etoks = nltk.word_tokenize(etxt.lower())
>>> etoks[-20:]
['of', 'true', 'friends', 'who', 'witnessed', 'the', 'ceremony', ',', 'were', 
'fully', 'answered', 'in', 'the', 'perfect', 'happiness', 'of', 'the', 'union', 
'.', 'finis']
>>> len(etoks)
>>> etypes = sorted(set(etoks))
>>> etypes[-10:]
['younger', 'youngest', 'your', 'yours', 'yourself', 'yourself.', 'youth', 'youthful', 
'zeal', 'zigzags']
>>> len(etypes)
>>> efreq = nltk.FreqDist(etoks)
>>> efreq['beautiful']
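
The tokens / types / frequency pipeline can be seen at a glance on a toy sentence. The sketch below deliberately swaps in the standard library: str.split() stands in for nltk.word_tokenize, and collections.Counter stands in for nltk.FreqDist (which is itself a Counter subclass). All toy_ names are hypothetical.

```python
from collections import Counter

# Toy text (hypothetical); the assignment reads austen-emma.txt instead.
toy_text = "The hopes the wishes THE confidence"

# Tokens: lowercase first, then split on whitespace
# (a crude stand-in for nltk.word_tokenize).
toy_toks = toy_text.lower().split()

# Types: unique tokens, sorted alphabetically.
toy_types = sorted(set(toy_toks))

# Frequency distribution: token -> count.
toy_freq = Counter(toy_toks)

print(toy_toks)
print(toy_types)
print(toy_freq['the'])
print(toy_freq['beautiful'])  # a missing key counts as 0, no KeyError
```

Unlike word_tokenize, split() does not separate punctuation from words, so it is only suitable for a toy example like this one.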

STEP 3: list-comprehend Emma

Now, explore the three objects wlist, efreq, and etypes to answer the following questions. Do NOT use a for loop! Every solution must use a LIST COMPREHENSION.
  • Question 1: Words with prefix and suffix
    What are the words that start with 'un' and end in 'able'?
  • Question 2: Length
    How many Emma word types are 15 characters or longer? Exclude hyphenated words.
  • Question 3: Average word length
    What's the average length of all Emma word types?
  • Question 4: Word frequency
    How many Emma word types have a frequency count of 200 or more? How many word types appear only once?
  • Question 5: Emma words not in wlist
    Of the Emma word types, how many of them are not found in our list of ENABLE English words, i.e., wlist?
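
The questions above all reduce to a handful of list comprehension patterns: filter, count, average, and membership test. The sketch below shows those patterns on toy data with different conditions from the ones asked, so it is a pattern reference and not a solution. All toy_ names are hypothetical, and Counter again stands in for nltk.FreqDist.

```python
from collections import Counter

# Toy data (hypothetical stand-ins for etoks, etypes, efreq, wlist).
toy_toks = ['emma', 'was', 'so', 'happy', 'so', 'very', 'happy', 'so', 'well-bred']
toy_types = sorted(set(toy_toks))
toy_freq = Counter(toy_toks)
toy_wlist = ['happy', 'so', 'very', 'was']

# Filter on string tests: starts with 's' and ends in 'o'.
s_words = [w for w in toy_types if w.startswith('s') and w.endswith('o')]

# Filter on length, excluding hyphenated words; len() of the result is the count.
long_plain = [w for w in toy_types if len(w) >= 5 and '-' not in w]

# Average: sum the lengths, divide by the number of types.
avg_len = sum([len(w) for w in toy_types]) / len(toy_types)

# Filter on frequency: types occurring at least twice.
frequent = sorted([w for w in toy_freq if toy_freq[w] >= 2])

# Membership: types missing from the word list.
# Converting the list to a set first makes each 'not in' test much faster.
known = set(toy_wlist)
unknown = [w for w in toy_types if w not in known]

print(s_words, long_plain, avg_len, frequent, unknown)
```

The set conversion matters little here, but with a word list and a type list that each hold many thousands of entries, testing membership against a plain list can be noticeably slow.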

STEP 4: bigrams in Emma

Let's now try out bigrams. Build two objects: e2grams (a list of word bigrams; make sure to cast it as a list) and e2gramfd (a frequency distribution of bigrams) as shown below, and then answer the following questions.
>>> e2grams = list(nltk.bigrams(etoks))
>>> e2gramfd = nltk.FreqDist(e2grams)
  • Question 6: Bigrams
    What are the last 10 bigrams?
  • Question 7: Bigram top frequency
    What are the top 20 most frequent bigrams?
  • Question 8: Bigram frequency count
    How many times does the bigram 'so happy' appear?
  • Question 9: Word following 'so'
    What are the words that follow 'so'? What are their frequency counts? (A for loop will be easier; see if you can use a list comprehension instead.)
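
As a rough guide to how the bigram objects behave, here is a sketch on a toy token list. It assumes stand-ins from the standard library: zip() over the list and its one-off shift plays the role of nltk.bigrams, and Counter plays the role of nltk.FreqDist. All toy_ names are hypothetical.

```python
from collections import Counter

# Toy tokens (hypothetical stand-in for etoks).
toy_toks = ['she', 'was', 'so', 'happy', 'and', 'so', 'very', 'happy', 'so', 'happy']

# Bigrams: pair each token with the one after it.
# Like nltk.bigrams, zip() is lazy, hence the list() cast.
toy_2grams = list(zip(toy_toks, toy_toks[1:]))

# Frequency distribution over bigrams; each key is a (w1, w2) tuple.
toy_2gramfd = Counter(toy_2grams)

# Slicing works because toy_2grams is a real list.
print(toy_2grams[-2:])

# Most frequent bigrams, as (bigram, count) pairs.
print(toy_2gramfd.most_common(1))

# Look up a bigram with a tuple key, not the string 'so happy'.
print(toy_2gramfd[('so', 'happy')])

# Words following 'so': unpack each tuple and filter on the first word.
after_so = [w2 for (w1, w2) in toy_2grams if w1 == 'so']
after_so_fd = Counter(after_so)
print(after_so_fd.most_common())
```

Because the keys are tuples, looking up the plain string 'so happy' would simply return 0 instead of raising an error, which is an easy mistake to miss.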

  • Upload: Your saved Python shell session (a text file with a .txt extension).