The goal of this exercise is to learn how to build a Naive Bayes classifier. The NLTK book has two examples: a name gender classifier and a movie review classifier. In the IDLE shell, complete building the models by trying out the code examples.
The book dives right into model making, but as always, you should start out by exploring your data source. That's what I was doing in today's class: don't skip this fun part!
The book section and my saved shell session are essentially accomplishing the same thing, but (1) the book provides a narrated explanation along the way which is helpful, and (2) my shell session breaks down the process into smaller chunks, so it might be easier for you to understand.
Another difference: my shell session uses two features (first character, last character) while the book section uses just one.
You don't have to replicate BOTH sets of code in your shell. Recommendation: follow my shell session, but refer to the NLTK book for explanation of the process.
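To make the "two features" idea concrete, here is a minimal sketch of a feature extractor like the one in my shell session. The function name `gender_features` follows the book; the key names `first_char`/`last_char` are just my choice here:

```python
def gender_features(name):
    """Two features per name: its first character and its last character."""
    name = name.lower()
    return {'first_char': name[0], 'last_char': name[-1]}

# Example: 'Shrek' -> {'first_char': 's', 'last_char': 'k'}
print(gender_features('Shrek'))

# In the shell session, dictionaries like these -- paired with 'male'/'female'
# labels from the names corpus -- are what the classifier trains on, roughly:
#   featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
#   classifier = nltk.NaiveBayesClassifier.train(featuresets)
```

The book's version returns only the last letter; the extractor above simply adds the first character as a second dictionary entry.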
Movie review sentiment analysis
Try out document classification on movie reviews by following this NLTK Book section. Since we are dealing with positive/negative opinions on movies, this task is a form of sentiment analysis. Details:
Start out by exploring the movie reviews corpus and familiarizing yourself with it, a step the book skips. Find out how big the corpus is, how many reviews there are, and how many of them are positive/negative. Take a look at a positive (or negative) review to get a concrete sense of the content.
Use the usual corpus methods: .fileids(), .words(), .raw(). This particular corpus comes with categories too: .categories() returns 'pos' and 'neg'. You can list file IDs based on categories: movie_reviews.fileids('pos').
You will notice the code in the book is pretty dense with lots of list comprehension. If you find a code block confusing, focus instead on the end result: what the newly built data object looks like, and how it's structured.
That nested list comprehension for building up documents is a head-scratcher. Let me unpack that for you: it is for-looping through the list of categories (a short one of just ['neg', 'pos']), and then for-looping through all file IDs belonging to the category, and then finally creating a tuple of (review tokens, category) which populates the documents list. Go ahead and flash documents[0] in IDLE shell. (If you get a 'Squeezed text...' message, double-clicking it will reveal the content.) You will see that the tuple consists of a pair (x,y) where x is a movie review in its tokenized form, and y is its category 'pos'/'neg'. This data object makes a review (represented as word tokens) and its label more easily accessible for the upcoming feature generation step.
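Here is that same nested comprehension run on a tiny made-up stand-in for the corpus, so you can see the shape of documents without loading anything (the toy reviews below are invented for illustration):

```python
# A toy stand-in for movie_reviews: category -> list of tokenized reviews.
toy_corpus = {
    'neg': [['this', 'movie', 'was', 'awful']],
    'pos': [['a', 'superb', 'and', 'memorable', 'film'],
            ['matt', 'damon', 'was', 'terrific']],
}

# The book's nested comprehension, with corpus method calls swapped
# for dictionary lookups: outer loop over categories, inner loop over
# that category's reviews, producing (tokens, category) tuples.
documents = [(tokens, category)
             for category in ['neg', 'pos']
             for tokens in toy_corpus[category]]

print(documents[0])
# (['this', 'movie', 'was', 'awful'], 'neg')
```

In the real thing, the inner loop runs over `movie_reviews.fileids(category)` and the tokens come from `movie_reviews.words(fileid)`, but the resulting structure is identical: a list of (review tokens, label) pairs.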
Because of random shuffling, your "most informative features" list might not look exactly like what's shown in the book. So, don't be alarmed if Mr. Matt Damon is missing from your list. Don't stop at top 5 features: try 20 or more.
If you're done with what's in the book, it's time to try something new. See how the classifier classifies this short and fake movie review.
>>> myreview = """Mr. Matt Damon was outstanding, fantastic, excellent, wonderfully
subtle, superb, terrific, and memorable in his portrayal of Mulan."""
>>> myreview_toks = nltk.word_tokenize(myreview.lower())   # lowercase, and then tokenize
>>> myreview_toks
['mr.', 'matt', 'damon', 'was', 'outstanding', ',', 'fantastic', ',', 'excellent', ',',
'wonderfully', 'subtle', ',', 'superb', ',', 'terrific', ',', 'and', 'memorable', 'in',
'his', 'portrayal', 'of', 'mulan', '.']
>>> myreview_feats = document_features(myreview_toks)   # generate word feature dictionary
>>> classifier.classify(myreview_feats)                 # classify
??
>>> classifier.prob_classify(myreview_feats).prob('pos')   # probability of 'pos' label
??
>>> classifier.prob_classify(myreview_feats).prob('neg')   # probability of 'neg' label
??
>>>
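If you want a self-contained feel for the difference between .classify() and .prob_classify() outside the full movie-review pipeline, here is a toy run on invented featuresets (the 'contains(...)' key names mimic the book's convention; the training data is made up):

```python
import nltk

# Four tiny hand-made featuresets in the same shape the movie-review
# pipeline produces: {feature_name: True/False} dicts with 'pos'/'neg' labels.
train = [
    ({'contains(superb)': True,  'contains(awful)': False}, 'pos'),
    ({'contains(superb)': True,  'contains(awful)': False}, 'pos'),
    ({'contains(superb)': False, 'contains(awful)': True},  'neg'),
    ({'contains(superb)': False, 'contains(awful)': True},  'neg'),
]
classifier = nltk.NaiveBayesClassifier.train(train)

test_feats = {'contains(superb)': True, 'contains(awful)': False}
print(classifier.classify(test_feats))        # the single best label: 'pos'
dist = classifier.prob_classify(test_feats)   # a probability distribution
print(dist.prob('pos'), dist.prob('neg'))     # the two probabilities sum to 1.0
```

So .classify() hands you one label, while .prob_classify() hands you a distribution you can query per label, which is what the shell session above does with .prob('pos') and .prob('neg').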
I know you're curious about myreview_feats, but printing/flashing it in its entirety is a bad idea. It is a large dictionary with 2,000 dimensions, because the feature generator function creates a True/False entry for each of the top 2,000 words, no matter the size of the input movie review text. So, instead, you should list-ify myreview_feats.items() and then print out slices.
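The list-ify-and-slice trick looks like this on a shrunken stand-in for myreview_feats (10 entries here instead of 2,000; the word lists are invented):

```python
# A stand-in for myreview_feats: a wide True/False dictionary like the one
# document_features() returns, with one entry per top word.
top_words = ['film', 'movie', 'superb', 'awful', 'plot',
             'actor', 'scene', 'boring', 'terrific', 'mulan']
review_toks = ['matt', 'damon', 'was', 'superb', 'and', 'terrific']
feats = {'contains({})'.format(w): (w in review_toks) for w in top_words}

# Instead of printing the whole dictionary, list-ify and slice:
print(list(feats.items())[:3])
# [('contains(film)', False), ('contains(movie)', False), ('contains(superb)', True)]
```

With the real 2,000-entry dictionary, slices like [:20] or [500:520] let you spot-check the features without flooding your shell.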
This time, change "Matt Damon" to "Steven Seagal" (IMDB profile) and see what happens.
SUBMIT:
A saved shell session as a .txt file, edited to clean up messy bits and to include your notes/comments