PART 1: Explore PropBank [30 points]
Let's explore the Proposition Bank (PropBank) via NLTK. The J&M chapter featured agree and fall, so we will focus on these two verbs.
agree.01
- First, through propbank.roleset(), find out how PropBank defines the argument roles for this verb.
- How are this verb and its semantic arguments realized in Penn Treebank trees? Answer this question by inspecting annotation instances. For your convenience, I have pre-selected a few choice candidates at these indexes: [3434, 4689, 7360, 8815]. (See the sketch after this list for one way to pull these up.)
- Based on your observations, write up a short summary. Make sure to address the relevant points using key terms (e.g., the numbered arguments such as Arg0/Arg1 and the syntactic constituents they map onto).
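Below is a minimal sketch of how these lookups might go, using NLTK's documented propbank API (propbank.roleset() and propbank.instances()); index 3434 is one of the pre-selected candidates above:

from nltk.corpus import propbank

# How PropBank defines the numbered arguments of agree.01
agree_rs = propbank.roleset('agree.01')
for role in agree_rs.findall('roles/role'):
    print(role.attrib['n'], role.attrib['descr'])

# Pull up one of the pre-selected annotation instances
pb_instances = propbank.instances()
inst = pb_instances[3434]
print(inst.roleset, inst.fileid, inst.sentnum)
print(' '.join(inst.tree.leaves()))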
fall.01
Do the same, but this time I am not providing a pre-selected instance set. That means it's up to you to explore the instances and pick out a few good candidates to dive into. Some tips:
- Use the first 9353 instances (indexes 0-9352) only. Because only the first 1/10 of the Penn Treebank is included in NLTK, the rest will not map to a syntactic tree.
- There are 53 total instances of 'fall.01' in this range.
- This is a code snippet that I used for screening short sentences with agree:
from nltk.corpus import propbank

pb_instances = propbank.instances()
# Screen for short sentences (30 leaves or fewer) whose predicate is agree.01
for (n, i) in enumerate(pb_instances[:9353]):
    if i.roleset == 'agree.01' and len(i.tree.leaves()) <= 30:
        print(i)
        print(n, i.roleset, i.fileid, i.sentnum)
        print(' '.join(i.tree.leaves()))
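Once you have settled on a candidate, you can map the predicate and its labeled arguments onto the Penn Treebank tree. A sketch, assuming pb_instances was loaded as above (select() is the documented way to resolve a PropBank tree pointer against the instance's tree; the index 3434 here is just illustrative):

inst = pb_instances[3434]   # substitute the index of a candidate you picked
# The predicate pointer resolves to the verb's position in the tree
print('predicate:', ' '.join(inst.predicate.select(inst.tree).leaves()))
# Each argument is a (tree pointer, argument label) pair
for (pointer, argid) in inst.arguments:
    print(argid, ' '.join(pointer.select(inst.tree).leaves()))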
PART 2: Try out Word Embeddings [20 points]
Let's try out the pre-trained GloVe word vectors from Stanford. It will be interesting to see how they measure up against the Google vectors I demonstrated in class. GloVe model files come in a different format and require a different loading procedure: you can see how it's done in the bottom section of this tutorial.
Preparation:
- You will need to install the gensim library first. If you have the python.org distribution, use pip3 on the command line, following the same procedure you used previously to install nltk. If you have Anaconda Python, install through Anaconda Navigator. Beware -- gensim can take a LONG time to install (it took me 25 minutes).
- You can download GloVe word vector files from the project home page.
- There are four pre-trained vectors available. Try the one based on Wikipedia and Gigaword 5 corpora, named 'glove.6B.zip'.
- This is a big file -- make sure you have a good wifi connection and your computer has enough space.
- All in all, you will need about 3.5GB of free space on your hard drive.
- Some browsers (Firefox, Edge) will kick up a security warning and refuse to download the file unless you click an option to proceed. Others (Chrome) might flat-out refuse, in which case try a different browser.
Exploration:
- In the tutorial, the loaded vector object is called model. Name it glove_vecs instead, so it is consistent with the naming convention I used in my in-class demo.
- The tutorial was written some years ago for a supercomputing environment where everyone shared the same setup and disk space, so you will need to adapt it and fill in the gaps to make the code work on your machine. First, the conversion step shown in Cell [17] is no longer necessary; skip it. Instead, start with Cell [18], updated as below:
from gensim.models import KeyedVectors
# Point this at wherever you unzipped the GloVe files on your machine
filename = 'd:/Lab/word_vectors/glove/glove.6B.100d.txt'
# no_header=True because GloVe text files lack the word2vec header line (gensim 4.x)
glove_vecs = KeyedVectors.load_word2vec_format(filename, binary=False, no_header=True)
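After loading, a quick sanity check confirms the vectors are in place (key_to_index and vector_size are standard gensim 4.x KeyedVectors attributes; 'frog' is just an illustrative word):

print(len(glove_vecs.key_to_index))   # vocabulary size (400,000 for glove.6B)
print(glove_vecs.vector_size)         # 100 for the 100d file
print(glove_vecs['frog'][:5])         # first five dimensions of one word vector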
- Try out what I demonstrated in class with the Google vectors. Then, try new things and make one interesting discovery of your own. (A few starter queries are sketched after this list.)
- You are encouraged (but not required) to look beyond what's on the tutorial and my in-class demo. There are numerous outstanding tutorials on word vectors out there!
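If you want a starting point, these are the standard gensim KeyedVectors queries (the words here are illustrative picks, not required ones):

# Nearest neighbors by cosine similarity
print(glove_vecs.most_similar('frog', topn=5))
# Pairwise similarity between two words
print(glove_vecs.similarity('coffee', 'tea'))
# The classic analogy test: king - man + woman = ?
print(glove_vecs.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
# Find the odd one out
print(glove_vecs.doesnt_match(['breakfast', 'lunch', 'dinner', 'keyboard']))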
SUBMIT:
- PART 1: If you are working entirely in the shell, submit a saved shell transcript, edited for brevity and annotated with your comments/analysis. If you are working mainly in a Python script, upload your script AND the saved output file.
- PART 2: The same!