In this part, the goal is to build a bigram tagger and test it out. We will use the Brown Corpus and its native tagset. Follow the steps below.
STEP 1: Prepare data sets
There are a total of 57,340 POS-tagged sentences in the Brown Corpus. Among them, assign the first 50,000 to your list of training sentences. Then, assign the remaining sentences to your list of testing sentences. The first of your testing sentences should look like this:
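One way to produce and inspect it (a minimal sketch; the variable names are illustrative, and the Brown corpus is assumed to be installed via nltk.download('brown')):

    import nltk
    from nltk.corpus import brown

    tagged_sents = brown.tagged_sents()   # all 57,340 POS-tagged sentences, native Brown tagset
    train_sents = tagged_sents[:50000]    # first 50,000 sentences = training data
    test_sents = tagged_sents[50000:]     # remaining 7,340 sentences = testing data
    print(test_sents[0])                  # displays the first testing sentence as (word, tag) pairs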
STEP 2: Build a bigram tagger
Following the steps shown in this chapter, build a bigram tagger with two back-off models. The first one on the stack should be a default tagger that assigns 'NN' to every word.
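As a sketch, using the variable names from the STEP 1 snippet, the back-off chain can be set up like this:

    t0 = nltk.DefaultTagger('NN')                      # default tagger: tags every word 'NN'
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)   # unigram tagger, backs off to t0
    t2 = nltk.BigramTagger(train_sents, backoff=t1)    # bigram tagger, backs off to t1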
STEP 3: Evaluate
Evaluate your bigram tagger on the test sentences using .accuracy(). (Note: .evaluate() is the outdated method.) You should get an accuracy score of 0.911. If not, something went wrong: go back and re-build your tagger.
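Continuing the sketch above (t2 is the bigram tagger built in STEP 2):

    print(t2.accuracy(test_sents))   # expect approximately 0.911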
STEP 4: Explore
Now, explore your tagger to answer the questions below.
How big are your training data and testing data? Answer in terms of the total number of words in each.
Note that each data set is a list of sentences. To get the total number of words, you will need to sum the lengths of all the sentences in the list. You can do that with a list comprehension: convert each sentence into its length using len(), and then apply sum(). Alternatively, if you must, you can start with total=0 and for-loop through the list, adding up the length of each sentence.
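A sketch of the list-comprehension approach, again using the illustrative variable names from STEP 1:

    train_size = sum([len(sent) for sent in train_sents])   # total words in the training data
    test_size = sum([len(sent) for sent in test_sents])     # total words in the testing data
    print(train_size, test_size)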
What is the performance of each of the two back-off taggers? How much improvement did you get: (1) going from the default tagger to the unigram tagger, and (2) going from the unigram tagger to the bigram tagger?
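The individual back-off taggers can be scored the same way, assuming t0, t1, and t2 from the STEP 2 sketch:

    print(t0.accuracy(test_sents))   # default tagger alone
    print(t1.accuracy(test_sents))   # unigram tagger with default back-off
    print(t2.accuracy(test_sents))   # full bigram back-off stack, for comparison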
Recall that 'cold' is ambiguous between JJ 'adjective' and NN 'singular noun'. Let's explore the word in the training data. The problem with the training data, though, is that it is a list of tagged sentences, and it is difficult to get at the tagged words, which sit one level below.
To compile tagged-word-level statistics, we will need a flat list of tagged words, without them being organized into sentences. And let's lowercase all the words while we're at it, so we don't have to deal with 'Cold' and 'cold' as separate cases. How to do this? You can use a list comprehension with two for loops to construct the flat list while applying .lower():
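One possible way to build this flat, lowercased list (the name train_words is illustrative):

    train_words = [(w.lower(), pos) for sent in train_sents for (w, pos) in sent]
    print(train_words[:10])   # first ten (word, POS) pairs, all lowercased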
Now, exploring this list of (word, POS) pairs from the training data, answer the questions below.
Which is the more likely tag for 'cold' overall?
When the POS tag of the preceding word (call it POSn-1) is AT, what is the likelihood of 'cold' being a noun? How about it being an adjective?
Build a list of (POSn-1, cold's POS) pairs, and build a ConditionalFreqDist from it. Then you can simply look up cfd['AT'], cfd['JJ'], etc., each of which gives you a FreqDist object. From a FreqDist, you can easily look up a relative frequency value through .freq().
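A sketch of this approach, using nltk.bigrams() over the flat train_words list (this ignores sentence boundaries, which is close enough for rough exploration):

    # (POSn-1, cold's POS) pairs: the tag of the word right before 'cold', paired with cold's own tag
    pairs = [(pos1, pos2)
             for ((w1, pos1), (w2, pos2)) in nltk.bigrams(train_words)
             if w2 == 'cold']
    cfd = nltk.ConditionalFreqDist(pairs)
    print(cfd['AT'].freq('NN'), cfd['AT'].freq('JJ'))   # relative frequencies of NN vs. JJ after AT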
When POSn-1 is JJ, what is the likelihood of 'cold' being a noun? How about it being an adjective?
Can you find any POSn-1 that favors NN over JJ for the following word 'cold'?
Based on what you found, how is your bigram tagger expected to tag 'cold' in the following sentences?
I was very cold.
I had a cold.
I had a severe cold.
January was a cold month.
Verify your prediction by having the tagger actually tag the four sentences. What did you find?
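One way to do this, assuming t2 is your bigram tagger (note that .tag() expects a tokenized sentence):

    for sent in ["I was very cold .", "I had a cold .",
                 "I had a severe cold .", "January was a cold month ."]:
        print(t2.tag(sent.split()))   # .tag() takes a list of word tokens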
Have the tagger tag the following sentences, all of which contain the word 'so':
I failed to do so.
I was happy, but so was my enemy.
So, how was the exam?
The students came in early so they can get good seats.
She failed the exam, so she must take it again.
That was so incredible.
Wow, so incredible.
Examine the tagger's performance on the sentences, focusing on the word 'so'. For each of them, decide if the tagger's output is correct, and explain how the tagger determined the POS tag.
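A sketch for pulling out just the tag assigned to 'so' in each sentence, again assuming t2 is the bigram tagger:

    so_sentences = ["I failed to do so .",
                    "I was happy , but so was my enemy .",
                    "So , how was the exam ?",
                    "The students came in early so they can get good seats .",
                    "She failed the exam , so she must take it again .",
                    "That was so incredible .",
                    "Wow , so incredible ."]
    for s in so_sentences:
        tagged = t2.tag(s.split())
        print([pair for pair in tagged if pair[0].lower() == 'so'])   # just the (so, tag) pair(s)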
Based on what you have observed so far, offer a critique on the bigram tagger. What are its strengths and what are its limitations?
PART 2: Building a Better Tagger [15 points]
There are multiple ways to design a more complex tagger with better performance: the book sections illustrate at least two obvious ways to achieve this. In this part, your task is to improve the bigram tagger we built in PART 1. Make sure to use the same training and testing data you used above, and do not overwrite the original bigram tagger, because you will need it for comparison. First implement your new version of the tagger, test it to make sure the performance has indeed improved, and then answer the following questions.
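For illustration only (not necessarily the intended solution), one possibility is to extend the back-off chain with a trigram tagger while leaving the original t2 untouched:

    t3 = nltk.TrigramTagger(train_sents, backoff=t2)   # trigram tagger backing off to the original bigram tagger
    print(t3.accuracy(test_sents))                     # compare against the bigram tagger's 0.911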
Explain what you did to improve your tagger's performance.
How much performance gain were you able to achieve? Was it as significant as you hoped?
Make up a sentence on which your new POS tagger produces a better result. Explain why the new tagger is more successful with this particular example.
See the hint below for how to use .tag().
Find a sentence from your test data that shows an improved tagging result by your new POS tagger. Explain how your new tagger was more successful in handling it.
You should use the .accuracy() and .tag() methods on the test sentences. The former gives you the accuracy score, while the latter gives you the actual tagged output, which you can scrutinize.
But bear in mind that the two expect different types of input. .tag() takes a single, untagged sentence, i.e., a list of word tokens.
.accuracy(), on the other hand, takes a list of sentences, and they have to be POS-tagged for obvious reasons. Because it can only handle a list of sentences, when dealing with a single sentence you have to turn it into a list of just that one sentence by nesting it in [].
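For example, with the bigram tagger t2 and the test_sents list from PART 1:

    print(t2.tag(['I', 'failed', 'to', 'do', 'so', '.']))   # .tag(): one untagged, tokenized sentence
    print(t2.accuracy([test_sents[0]]))                     # .accuracy(): a list of tagged sentences (here, just one)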
SUBMIT
Two format choices:
MS Word answer sheet: HW7 n-gram tagger.docx plus your saved IDLE session file "HW7_shell.txt".
OR, you may submit a Jupyter Notebook file (.ipynb) if you're comfortable with the format.