In class, we processed unigram data from Peter Norvig, excerpted from the huge Google Web 1T dataset. Let's have you process the bigram version, count_2w.txt, and explore it. Follow the steps below.
STEP 1: read in count_2w.txt
First, download the file. This is another big file -- make sure you don't end up with a partial download! When you open it, be sure to specify UTF-8 encoding; otherwise the very last line, which contains the non-ASCII word 'über', will not read in correctly. After that, it's business as usual, i.e., reading the file in as a list of lines.
|
>>> f = open('ling1330/count_2w.txt', encoding='utf8')
>>> lines = f.readlines()
>>> f.close()
>>> lines[0]
'0Uplink verified\t523545\n'
>>> lines[1000]
'<S> below\t1974177\n'
>>> lines[5000]
'<S> membros\t140285\n'
>>> len(lines)
286358
>>> lines[200000]
'personality to\t138488\n'
>>> lines[-1]
'über die\t187069\n'
>>>
|
|
STEP 2: build goog2w_list
First up, build goog2w_list as a list of ((w1, w2), count) tuples. Unlike the unigram file we processed in class, this file is ordered alphabetically rather than by frequency, which is why we are calling it _list instead of _rank. Here's the process with a mini version:
|
>>> mini = lines[:10]
>>> mini
['0Uplink verified\t523545\n', '0km to\t116103\n', '1000s of\t939476\n', '100s of\t539389\n',
'100th anniversary\t158621\n', '10am to\t376141\n', '10th and\t183715\n',
'10th anniversary\t242830\n', '10th century\t117755\n', '10th grade\t174046\n']
>>> mini[0]
'0Uplink verified\t523545\n'
>>> mini[0].split()
['0Uplink', 'verified', '523545']
>>> mini_list = []
>>> for m in mini:
... (w1, w2, count) = m.split()
... count = int(count)
... mini_list.append(((w1, w2), count))
...
>>> mini_list
[(('0Uplink', 'verified'), 523545), (('0km', 'to'), 116103), (('1000s', 'of'), 939476),
(('100s', 'of'), 539389), (('100th', 'anniversary'), 158621), (('10am', 'to'), 376141),
(('10th', 'and'), 183715), (('10th', 'anniversary'), 242830), (('10th', 'century'),
117755), (('10th', 'grade'), 174046)]
>>> mini_list[0]
(('0Uplink', 'verified'), 523545)
>>>
|
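Running the same loop over the full lines list then produces goog2w_list. Here is a rough sketch (the length check at the end assumes every line splits cleanly into three fields):
|
>>> goog2w_list = []
>>> for ln in lines:
...     (w1, w2, count) = ln.split()
...     count = int(count)
...     goog2w_list.append(((w1, w2), count))
...
>>> goog2w_list[0]
(('0Uplink', 'verified'), 523545)
>>> len(goog2w_list)
286358
>>>
|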
|
STEP 3: build goog2w_fd
Next, build goog2w_fd as a frequency distribution, implemented as nltk.FreqDist. When finished, it should work as shown below. See today's shell session (posted next to lecture slides) for how to initialize and manually populate a FreqDist object.
|
>>> goog2w_fd[('of', 'the')]
2766332391
>>> goog2w_fd[('so', 'beautiful')]
612472
>>>
|
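If you need a starting point, here is a rough sketch of one way to do it, populating an empty FreqDist from the goog2w_list you built in STEP 2 (the posted shell session is the authoritative reference):
|
>>> import nltk
>>> goog2w_fd = nltk.FreqDist()
>>> for (bigram, count) in goog2w_list:
...     goog2w_fd[bigram] = count
...
>>> goog2w_fd[('of', 'the')]
2766332391
>>>
|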
|
STEP 4: build goog2w_cfd
Next up, let's build goog2w_cfd as a conditional frequency distribution, implemented as nltk.ConditionalFreqDist. As with FreqDist, since we do not have a flat list of all bigram tokens, we have to resort to initializing an empty CFD and then manually populating it. This time it's slightly trickier because of the nested structure of a CFD:
|
>>> goog2w_cfd = nltk.ConditionalFreqDist()
>>> for (w1,w2) in goog2w_fd:
... goog2w_cfd[w1][w2] = goog2w_fd[(w1,w2)]
...
>>> goog2w_cfd['so']['beautiful']
612472
>>>
|
|
STEP 5: explore
Now explore the two data objects to familiarize yourself with the bigram data. Answer the following questions (a few starter calls are sketched after this list):
- What are the top bigrams? Do they look similar or dissimilar to those compiled from the Bible and the Austen corpora?
- What are the top so-initial bigrams? Do they look more or less similar to those found in Jane Austen or the Bible?
- Back to the bigrams needed for computing the probability of the sentence 'She was not afraid.' Are they all found in this data?
- Find a bigram that you think should be represented in this data, and confirm that it is.
- Find a bigram that you think should be represented but is not.
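Starter calls along these lines might be helpful (a sketch only; outputs are omitted, and the cutoffs 20 and 10 are arbitrary):
|
>>> goog2w_fd.most_common(20)          # top bigrams overall
>>> goog2w_cfd['so'].most_common(10)   # top so-initial bigrams
>>> ('she', 'was') in goog2w_fd        # is a bigram present? (mind letter case)
>>> goog2w_cfd['not']['afraid']        # a count of 0 means the bigram is absent
|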
STEP 6: pickle the data
Pickle goog2w_fd as 'goog2w_fd.pkl', and goog2w_cfd as 'goog2w_cfd.pkl'.
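A minimal sketch of the pickling step, assuming goog2w_fd and goog2w_cfd are still in your shell session (adjust the paths if you keep your files elsewhere):
|
>>> import pickle
>>> f = open('goog2w_fd.pkl', 'wb')
>>> pickle.dump(goog2w_fd, f)
>>> f.close()
>>> f = open('goog2w_cfd.pkl', 'wb')
>>> pickle.dump(goog2w_cfd, f)
>>> f.close()
>>>
|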
Save your IDLE shell session as ex5_google_bigrams.txt. Open it in your text editor, clean up messy bits, and then add your answers to the questions above to accompany your relevant code bits.
SUBMIT:
- Upload: Your saved shell file (.txt) and the two saved pickle files: goog2w_fd.pkl and goog2w_cfd.pkl.
|
|
|