Go to: LING 1330/2330 home page  

Exercise 6: Boy or Girl? Movie Good or Bad?

The goal of this exercise is to learn how to build a Naive Bayes classifier. The NLTK book provides two examples: a name gender classifier and a movie review classifier. In the IDLE shell, build both models by working through the code examples.

Name gender classifier

Build the name gender classifier described in NLTK book section 1.1, "Gender Identification". (Note: you can skip section 1.2, "Choosing the Right Features".) Additionally, refer to my saved shell session. Details:
  1. The book dives right into model making, but as always, you should start out by exploring your data source. That's what I was doing in today's class: don't skip this fun part!
  2. The book section and my saved shell session accomplish essentially the same thing, but (1) the book provides a helpful narrated explanation along the way, and (2) my shell session breaks the process into smaller chunks, which may be easier to follow.
  3. Another difference: my shell session uses two features (first character, last character) while the book section uses just one.
  4. You don't have to replicate BOTH sets of code in your shell. Recommendation: follow my shell session, but refer to the NLTK book for an explanation of the process.

Movie review sentiment analysis

Try out document classification on movie reviews by following this NLTK Book section. Since we are dealing with positive/negative opinions on movies, this task is a form of sentiment analysis. Details:
  1. Start out by exploring the movie reviews corpus and familiarizing yourself with it, something the book doesn't do. Find out how big the corpus is, how many reviews it contains, and how many of them are positive/negative. Take a look at a positive (or negative) review to get a concrete sense of the content.
  2. You will notice the code in the book is pretty dense, with lots of list comprehensions. If you find a code block confusing, focus instead on the end result: what the newly built data object looks like and how it's structured.
  3. Because of random shuffling, your "most informative features" list might not look exactly like what's shown in the book. So don't be alarmed if Mr. Matt Damon is missing from your list. Don't stop at the top 5 features: try 20 or more.
  4. If you're done with what's in the book, it's time to try something new. See how the classifier classifies this short and fake movie review.
     
    >>> myreview = """Mr. Matt Damon was outstanding, fantastic, excellent, wonderfully 
    subtle, superb, terrific, and memorable in his portrayal of Mulan."""   
    >>> myreview_toks = nltk.word_tokenize(myreview.lower())  # lowercase, and then tokenize
    >>> myreview_toks
    ['mr.', 'matt', 'damon', 'was', 'outstanding', ',', 'fantastic', ',', 'excellent', ',', 
    'wonderfully', 'subtle', ',', 'superb', ',', 'terrific', ',', 'and', 'memorable', 'in', 
    'his', 'portrayal', 'of', 'mulan', '.']
    >>> myreview_feats = document_features(myreview_toks)     # generate word feature dictionary
    >>> classifier.classify(myreview_feats)    # classify
                  ??              
    >>> classifier.prob_classify(myreview_feats).prob('pos')  # probability of 'pos' label
                  ??              
    >>> classifier.prob_classify(myreview_feats).prob('neg')  # probability of 'neg' label
                  ??              
    >>> 
    
  5. This time, change "Matt Damon" to "Steven Seagal" (IMDB profile) and see what happens.


SUBMIT:
  • A saved shell session as a .txt file, edited to clean up messy bits and to include your notes/comments