Imagine Jane Austen, mighty pen in hand, going up against a group of Scrabble/Words-with-Friends players. In this part, we will compare Austen's words against the ENABLE English word list, working exclusively in the IDLE shell (not in a script!). The goal is to practice list comprehension as well as n-grams. Follow the steps below.
STEP 1: unpickle the ENABLE list
We already processed the ENABLE English word list (enable1.txt, linked at the bottom) in class and pickled it as words.pkl. In the Python shell, unpickle it as a list called wlist and get ready to explore.
>>> import pickle
>>> f = open('words.pkl', 'rb')
>>> wlist = pickle.load(f)
>>> f.close()
>>> wlist[-10:]
['zymology', 'zymosan', 'zymosans', 'zymoses', 'zymosis', 'zymotic', 'zymurgies',
'zymurgy', 'zyzzyva', 'zyzzyvas']
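For reference, the round trip between pickling and unpickling can be sketched as follows. This is a minimal sketch, not the code used in class: the three-word list and the file name demo_words.pkl are made up for illustration, and the with statement is used so that each file is closed automatically.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the full ENABLE list
demo_words = ['zymurgy', 'zyzzyva', 'zyzzyvas']

# Write to a throwaway folder so no real file is overwritten
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_words.pkl')

# Pickle: 'wb' opens the file for writing in binary mode
with open(demo_path, 'wb') as f:
    pickle.dump(demo_words, f)

# Unpickle: 'rb' opens the file for reading in binary mode
with open(demo_path, 'rb') as f:
    restored = pickle.load(f)

print(restored == demo_words)   # True
print(type(restored))           # <class 'list'>
```

Pickle files must always be opened in binary mode ('wb' and 'rb'); opening one in plain text mode will raise an error when loading.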
STEP 2: process Emma
You have already downloaded Austen's Emma as part of HW1. Read it in and apply the usual text processing steps, building three objects:
etoks (a list of word tokens, all in lowercase), etypes (an alphabetically sorted word type list), and efreq (word frequency distribution).
>>> fname = "C:/Users/narae/Documents/ling1330/gutenberg/austen-emma.txt"
>>> f = open(fname, 'r')
>>> etxt = f.read()
>>> f.close()
>>> etxt[-200:]
'e deficiencies, the wishes,\nthe hopes, the confidence, the predictions of the
small band\nof true friends who witnessed the ceremony, were fully answered\nin
the perfect happiness of the union.\n\n\nFINIS\n'
>>> import nltk
>>> etoks = nltk.word_tokenize(etxt.lower())
>>> etoks[-20:]
['of', 'true', 'friends', 'who', 'witnessed', 'the', 'ceremony', ',', 'were',
'fully', 'answered', 'in', 'the', 'perfect', 'happiness', 'of', 'the', 'union',
'.', 'finis']
>>> etypes = sorted(set(etoks))
>>> etypes[-10:]
['younger', 'youngest', 'your', 'yours', 'yourself', 'yourself.', 'youth', 'youthful', ...]
>>> efreq = nltk.FreqDist(etoks)
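The relationship between the three objects can be seen on a tiny example. This is only a sketch under stated assumptions: the text is made up, str.split() stands in for nltk.word_tokenize, and collections.Counter stands in for nltk.FreqDist, so that it runs without NLTK installed.

```python
from collections import Counter

demo_txt = "The cat sat. The cat ran."     # made-up text, not from Emma

# Tokens: every running word, lowercased (duplicates kept)
demo_toks = demo_txt.lower().split()
# ['the', 'cat', 'sat.', 'the', 'cat', 'ran.']

# Types: each distinct token once, alphabetically sorted
demo_types = sorted(set(demo_toks))
# ['cat', 'ran.', 'sat.', 'the']

# Frequency distribution: maps each type to its token count
demo_freq = Counter(demo_toks)

print(len(demo_toks), len(demo_types))     # 6 4
print(demo_freq['the'])                    # 2
```

Note that a plain split() leaves punctuation attached ('sat.'), which is one reason the assignment uses a real tokenizer instead.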
STEP 3: list-comprehend Emma
Now, explore the three objects wlist, efreq, and etypes to answer the following questions. Do NOT use a for loop! Every solution must use LIST COMPREHENSION.
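As a reminder of the general pattern, a list comprehension has the shape [EXPRESSION for ITEM in SEQUENCE if CONDITION]. The sketch below shows three common variations on toy data. All of the demo_ names and their contents are hypothetical, and collections.Counter stands in for nltk.FreqDist so that the example runs without NLTK.

```python
from collections import Counter

# Hypothetical miniature versions of wlist, etoks, etypes, efreq
demo_wlist = ['aa', 'cat', 'happy', 'zyzzyva']
demo_toks = ['so', 'happy', 'emma', 'was', 'so', 'happy', 'indeed']
demo_types = sorted(set(demo_toks))
demo_freq = Counter(demo_toks)

# 1. Filter one list by a property of each item
long_words = [w for w in demo_wlist if len(w) >= 5]
print(long_words)      # ['happy', 'zyzzyva']

# 2. Filter one list by membership in another.
#    Converting to a set first makes each 'in' test fast.
demo_wset = set(demo_wlist)
not_listed = [w for w in demo_types if w not in demo_wset]
print(not_listed)      # ['emma', 'indeed', 'so', 'was']

# 3. Build (word, count) pairs, filtering on the count
repeated = [(w, demo_freq[w]) for w in demo_types if demo_freq[w] > 1]
print(repeated)        # [('happy', 2), ('so', 2)]
```

The set conversion in the second pattern matters with a word list as large as ENABLE: testing membership in a list scans the whole list each time, while a set lookup does not.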
STEP 4: bigrams in Emma
Let's now try out bigrams. Build two objects: e2grams (a list of word bigrams; make sure to cast it as a list) and e2gramfd (a frequency distribution of bigrams) as shown below, and then answer the following questions.
>>> e2grams = list(nltk.bigrams(etoks))
>>> e2gramfd = nltk.FreqDist(e2grams)
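To see what these two objects look like, here is a minimal sketch on a made-up token list. It assumes that zip over the list and its one-step shift produces the same adjacent pairs that nltk.bigrams does, and it uses collections.Counter as a stand-in for nltk.FreqDist so that it runs without NLTK.

```python
from collections import Counter

# Hypothetical token list, not taken from Emma
demo_toks = ['she', 'was', 'so', 'happy', 'and', 'so', 'happy', 'and', 'so', 'kind']

# Pair each token with the one that follows it
demo_2grams = list(zip(demo_toks, demo_toks[1:]))
print(len(demo_2grams))       # 9 (always one fewer than the number of tokens)
print(demo_2grams[:3])        # [('she', 'was'), ('was', 'so'), ('so', 'happy')]

# Count the bigrams; each key is a TUPLE of two words
demo_2gramfd = Counter(demo_2grams)
print(demo_2gramfd[('so', 'happy')])    # 2
print(demo_2gramfd[('so', 'kind')])     # 1
print(demo_2gramfd[('kind', 'so')])     # 0 (order matters)
```

Because each bigram is a tuple, it must be looked up as ('so', 'happy'), not as the string 'so happy'.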
- Question 6: Bigrams
What are the last 10 bigrams?
- Question 7: Bigram top frequency
What are the top 20 most frequent bigrams?
- Question 8: Bigram frequency count
How many times does the bigram 'so happy' appear?
- Question 9: Word following 'so'
What are the words that follow 'so'? What are their frequency counts? (A for loop would be easier; see if you can use list comprehension for this instead.)
- Upload: Your saved Python shell session (a text file with .txt extension).