
Exercise 11: Lexical Resources, WordNet

This exercise is based on the later sections of NLTK Book Ch. 2: Section 4 (Lexical Resources) and Section 5 (WordNet).
For this exercise, let's explore three of NLTK's lexical resources: (1) the stopword list (stopwords), (2) the CMU Pronouncing Dictionary (cmudict), and (3) WordNet (wordnet). For how to access and work with them, refer to the NLTK book sections linked above.
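
For the stopword list in particular, a common pattern is to filter a token stream and count what remains. The following is a minimal sketch of that pattern, not a definitive solution. The function name `top_content_words` and the toy data are made up for illustration and are not part of NLTK. It assumes that tokens should be lowercased before counting and that "alphanumeric" means `str.isalnum()`, which also drops hyphenated tokens.

```python
from collections import Counter

def top_content_words(tokens, stopword_list, n=20):
    """Return the n most frequent alphanumeric, non-stopword tokens (lowercased)."""
    stops = set(stopword_list)  # set membership is much faster than list membership
    counts = Counter(
        w.lower() for w in tokens
        if w.isalnum() and w.lower() not in stops
    )
    return counts.most_common(n)

# Toy stand-ins (hypothetical) for a corpus word list and a stopword list
toy_tokens = ['The', 'jury', 'said', ',', 'the', 'jury',
              'praised', 'the', 'City', '.', 'city']
toy_stops = ['the', 'a', 'of']

result = top_content_words(toy_tokens, toy_stops, n=2)
print(result)
```

With NLTK and its data installed, the same function should accept `brown.words()` and `stopwords.words('english')` (both imported from `nltk.corpus`) in place of the toy lists.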

In the Python shell, solve the following problems:

  • Q1: What are the 20 most frequent words in the Brown corpus that are not stopwords? Count "real" words only: while you are at it, also exclude non-alphanumeric tokens such as symbols and punctuation.
  • Q2: According to the CMU Pronouncing Dictionary, which English words contain both the /ʃ/ and /ʒ/ sounds? How about /θ/ and /ð/? As a starter, here's how to check whether /k/ is part of 'anxious', which has multiple pronunciations:
    >>> from nltk.corpus import cmudict
    >>> prondict = cmudict.dict()
    >>> prondict['anxious']
    [['AE1', 'NG', 'K', 'SH', 'AH0', 'S'], ['AE1', 'NG', 'SH', 'AH0', 'S']]
    >>> 'K' in prondict['anxious'][0]    # /k/ is in the first pronunciation
    True
    >>> 'K' in prondict['anxious'][1]    # it is not in the second
    False
    
  • Q3: According to WordNet, how many distinct senses does 'chair' have? What are the hyponyms of 'chair' in its 'chair.n.01' sense? What is its hypernym, and what is its hyper-hypernym?
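
The membership check in the Q2 starter and the hypernym lookup in Q3 can each be wrapped in a small helper. The sketch below is one possible approach under stated assumptions, not the expected answer. The names `words_with_phones` and `hypernym_chain` are hypothetical. The first helper assumes that "contain both sounds" means both phones occur in the same pronunciation; the second assumes that following only the first hypernym at each level is enough. In the dictionary's ARPAbet notation, /ʃ/ is `SH`, /ʒ/ is `ZH`, /θ/ is `TH`, and /ð/ is `DH`.

```python
def words_with_phones(prondict, phones):
    """Return sorted words with at least one pronunciation containing every phone."""
    wanted = set(phones)
    return sorted(
        word for word, prons in prondict.items()
        if any(wanted <= set(pron) for pron in prons)  # subset test per pronunciation
    )

def hypernym_chain(synset, levels=2):
    """Follow the first hypernym link upward at most `levels` times."""
    chain = []
    for _ in range(levels):
        parents = synset.hypernyms()
        if not parents:          # reached the top of the hierarchy
            break
        synset = parents[0]      # assumption: take the first hypernym only
        chain.append(synset)
    return chain

# Toy dictionary holding only the 'anxious' entry shown in the starter
toy_prondict = {
    'anxious': [['AE1', 'NG', 'K', 'SH', 'AH0', 'S'],
                ['AE1', 'NG', 'SH', 'AH0', 'S']],
}
print(words_with_phones(toy_prondict, ['K', 'SH']))   # ['anxious']
print(words_with_phones(toy_prondict, ['K', 'ZH']))   # []
```

With NLTK and its data installed, `words_with_phones(cmudict.dict(), ['SH', 'ZH'])` would run the search over the full dictionary, and `hypernym_chain(wn.synset('chair.n.01'))` (after `from nltk.corpus import wordnet as wn`) would return the hypernym followed by the hyper-hypernym. The number of senses and the hyponyms come directly from `wn.synsets('chair')` and the synset's `hyponyms()` method.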

When you are done exploring, save your IDLE session as a text file (File > Save As ... > and save as a .txt file). After that, open up the file again and edit it to clearly mark the answers to Q1 through Q3, so I will be able to quickly find them.


SUBMIT:
    Your saved IDLE shell session, edited to mark the answers to Q1 through Q3.