Exploring the PropBank in NLTK
- This Python interface is called Jupyter Notebook. If you like what you see, take LING 1340/2340 Data Science for Linguists!
- J&M Ch.21 Semantic Role Labeling: https://web.stanford.edu/~jurafsky/slp3/21.pdf
- NLTK PropBank how-to: https://www.nltk.org/howto/propbank.html
In [1]:
import nltk
from nltk.corpus import treebank
In [2]:
# Does anyone remember the very first PTB sentence?
treebank.parsed_sents()[0]
Out[2]:
- JNB users: in-line tree drawing won't work unless you have the svgling package installed. With older nltk, you must have ghostscript installed and added to PATH. See this Stack Overflow thread.
In [3]:
from nltk.corpus import propbank
# Very first propbank annotation ("instance") is on 'join', from the Pierre Vinken tree!
# The values look cryptic, but details later
print(propbank.instances()[0])
wsj_0001.mrg 0 8 gold join.01 vf--a 0:2-ARG0 7:0-ARGM-MOD 8:0-rel 9:1-ARG1 11:1-ARGM-PRD 15:1-ARGM-TMP
In [4]:
pb_instances = propbank.instances()
print(pb_instances[42]) # say
print(pb_instances[103]) # rise
wsj_0003.mrg 15 19 gold say.01 vp--a 1:2*20:0-ARG1 19:0-rel 21:1-ARG0
wsj_0004.mrg 8 16 gold rise.01 vp--a 0:2-ARG1 13:1-ARGM-DIS 16:0-rel 17:1-ARG4-to 20:1-ARG3-from
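The printed instances are whitespace-separated fields, so the layout can be sketched without NLTK at all. The snippet below is a minimal illustration, not NLTK's own parser: `ParsedInstance` and `parse_instance_line` are hypothetical names made up for this sketch, and the field names simply mirror the attributes used later in this notebook (`fileid`, `sentnum`, `wordnum`, `tagger`, `roleset`, `inflection`, `arguments`).

```python
from typing import List, NamedTuple, Tuple


class ParsedInstance(NamedTuple):
    """Hypothetical container for one printed PropBank instance line."""
    fileid: str
    sentnum: int
    wordnum: int
    tagger: str
    roleset: str
    inflection: str
    arguments: List[Tuple[str, str]]  # (location, label); 'rel' marks the predicate


def parse_instance_line(line: str) -> ParsedInstance:
    """Split one printed instance line into its named fields."""
    fileid, sentnum, wordnum, tagger, roleset, inflection, *args = line.split()
    # Split on the first hyphen only, so labels such as 'ARG4-to' stay intact.
    arguments = [tuple(arg.split("-", 1)) for arg in args]
    return ParsedInstance(fileid, int(sentnum), int(wordnum), tagger,
                          roleset, inflection, arguments)


line = ("wsj_0004.mrg 8 16 gold rise.01 vp--a 0:2-ARG1 13:1-ARGM-DIS "
        "16:0-rel 17:1-ARG4-to 20:1-ARG3-from")
inst = parse_instance_line(line)
print(inst.fileid, inst.sentnum, inst.wordnum, inst.roleset)
for location, label in inst.arguments:
    print(location, label)
```

Each location is a `word:height` pair. The `*` in the say.01 line (`1:2*20:0`) joins more than one location under a single label; this sketch keeps such a location as one string rather than interpreting it.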
In [5]:
len(pb_instances)
Out[5]:
112917
In [6]:
len(propbank.verbs())
Out[6]:
3257
In [7]:
print(propbank.verbs()[:20])
print(propbank.verbs()[-20:])
['abandon', 'abate', 'abdicate', 'abet', 'abide', 'abolish', 'abort', 'abound', 'abridge', 'absolve', 'absorb', 'abstain', 'abuse', 'accede', 'accelerate', 'accept', 'access', 'acclaim', 'accommodate', 'accompany']
['wrap', 'wreak', 'wreck', 'wrench', 'wrest', 'wrestle', 'wriggle', 'wring', 'write', 'writhe', 'wrong', 'yank', 'yell', 'yelp', 'yield', 'zap', 'zero', 'zip', 'zone', 'zoom']
- Lots of verbs are represented! Let's look at the verb frame ("roleset") for join.01:
In [8]:
propbank.roleset('join.01')
Out[8]:
<Element 'roleset' at 0x000001CFCC23EC00>
In [9]:
for role in propbank.roleset('join.01').findall('roles/role'):
    print(role.attrib['n'], role.attrib['descr'])
0 agent, entity doing the tying
1 patient, thing(s) being tied
2 instrument, string
In [10]:
rise01 = propbank.roleset('rise.01')
for role in rise01.findall("roles/role"):
    print(role.attrib['n'], role.attrib['descr'])
1 Logical subject, patient, thing rising
2 EXT, amount risen
3 start point
4 end point
M medium
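The roleset object is an ordinary XML element, which is why `findall('roles/role')` and `.attrib` work on it. The sketch below rebuilds a tiny stand-in with the standard library so the lookup can be tried without the corpus. The XML string is hand-written for illustration: only the `n` and `descr` values are taken from the output above, and the rest of the real frame file is omitted.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the rise.01 roleset (assumption: real frame
# files carry more attributes and child elements than shown here).
ROLESET_XML = """
<roleset id="rise.01">
  <roles>
    <role n="1" descr="Logical subject, patient, thing rising"/>
    <role n="2" descr="EXT, amount risen"/>
    <role n="3" descr="start point"/>
    <role n="4" descr="end point"/>
    <role n="M" descr="medium"/>
  </roles>
</roleset>
"""

roleset = ET.fromstring(ROLESET_XML)

# Map each role number to its description, e.g. to gloss ARG3/ARG4 labels.
role_descriptions = {
    role.attrib["n"]: role.attrib["descr"]
    for role in roleset.findall("roles/role")
}

for n, descr in role_descriptions.items():
    print(f"ARG{n}: {descr}")
```

With such a mapping, the `ARG4-to` and `ARG3-from` labels on the rise.01 instance can be read as the end point and start point of the rise.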
Back to Mr. Vinken and his joining activity
In [11]:
inst0 = pb_instances[0]
print(inst0)
print(inst0.fileid, inst0.sentnum, inst0.wordnum, inst0.tagger)
wsj_0001.mrg 0 8 gold join.01 vf--a 0:2-ARG0 7:0-ARGM-MOD 8:0-rel 9:1-ARG1 11:1-ARGM-PRD 15:1-ARGM-TMP
wsj_0001.mrg 0 8 gold
In [12]:
print(dir(inst0))
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_get_tree', 'arguments', 'baseform', 'fileid', 'inflection', 'parse', 'parse_corpus', 'predicate', 'predid', 'roleset', 'sensenumber', 'sentnum', 'tagger', 'tree', 'wordnum']
In [13]:
inst0.tree
# Entire sentence tree can be directly called from the instance!
Out[13]:
In [14]:
inst0.roleset
Out[14]:
'join.01'
In [15]:
# detailed inflection information on the verb
infl = inst0.inflection
infl.form, infl.tense, infl.aspect, infl.person, infl.voice
Out[15]:
('v', 'f', '-', '-', 'a')
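The five one-letter codes can be spelled out with a small lookup table. This is a sketch under an assumption: the code-to-name mapping below follows my reading of the constants on NLTK's `PropbankInflection` class, so check it against your installed version. `FIELDS` and `decode_inflection` are hypothetical names used only here.

```python
# One (field name, code table) pair per character position in the code string.
FIELDS = (
    ("form", {"i": "infinitive", "g": "gerund", "p": "participle", "v": "finite"}),
    ("tense", {"f": "future", "p": "past", "n": "present"}),
    ("aspect", {"p": "perfect", "o": "progressive", "b": "perfect and progressive"}),
    ("person", {"3": "third person"}),
    ("voice", {"a": "active", "p": "passive"}),
)


def decode_inflection(code):
    """Turn a 5-character inflection code such as 'vf--a' into readable names."""
    if len(code) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {code!r}")
    decoded = {}
    for char, (field, table) in zip(code, FIELDS):
        # '-' means the field is not marked for this verb.
        decoded[field] = None if char == "-" else table[char]
    return decoded


print(decode_inflection("vf--a"))  # the 'join' instance
print(decode_inflection("vp--a"))  # the 'say' and 'rise' instances
```

Read this way, `vf--a` is a finite, future, active verb, which fits "will join"; `vp--a` on the say.01 and rise.01 instances is finite, past, active.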
In [16]:
# predicate points to the 9th word/tag (join/VB), with height 0
inst0.predicate
Out[16]:
PropbankTreePointer(8, 0)
In [17]:
# subtree for the predicate
print(inst0.predicate.select(inst0.tree))
(VB join)
In [18]:
# there are 2 essential (numbered) Args and 3 modifier Args.
# each is (location, label)
# Arg0's location is (0,2): it starts with the 1st (=index 0) word/tag combo (Pierre/NNP) and then
# goes up 2 tree levels to NP-SBJ. This NP-SBJ subtree is essentially Arg0.
# ARGM-PRD: secondary predication
inst0.arguments
Out[18]:
((PropbankTreePointer(0, 2), 'ARG0'), (PropbankTreePointer(7, 0), 'ARGM-MOD'), (PropbankTreePointer(9, 1), 'ARG1'), (PropbankTreePointer(11, 1), 'ARGM-PRD'), (PropbankTreePointer(15, 1), 'ARGM-TMP'))
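To make the (word index, height) idea concrete, here is a pure-Python sketch that resolves such pointers on a hand-typed copy of the Pierre Vinken parse. It does not use NLTK: the nested-tuple tree and the helpers `leaf_chains`, `select`, and `words` are hypothetical stand-ins for illustration, and the real `PropbankTreePointer.select` works on an NLTK tree instead. As far as I can tell, trace leaves also count as words in real Treebank trees; this sentence has none, so the sketch ignores them.

```python
# A node is (label, child, child, ...); a preterminal is (tag, "word").
TREE = (
    "S",
    ("NP-SBJ",
        ("NP", ("NNP", "Pierre"), ("NNP", "Vinken")),
        (",", ","),
        ("ADJP", ("NP", ("CD", "61"), ("NNS", "years")), ("JJ", "old")),
        (",", ",")),
    ("VP",
        ("MD", "will"),
        ("VP",
            ("VB", "join"),
            ("NP", ("DT", "the"), ("NN", "board")),
            ("PP-CLR", ("IN", "as"),
                ("NP", ("DT", "a"), ("JJ", "nonexecutive"), ("NN", "director"))),
            ("NP-TMP", ("NNP", "Nov."), ("CD", "29")))),
    (".", "."),
)


def leaf_chains(node, ancestors=()):
    """Yield, for each word left to right, its chain of ancestor nodes (root first)."""
    for child in node[1:]:
        if isinstance(child, str):
            yield ancestors + (node,)
        else:
            yield from leaf_chains(child, ancestors + (node,))


def select(tree, wordnum, height):
    """Return the subtree `height` levels above the preterminal of word `wordnum`."""
    chain = list(leaf_chains(tree))[wordnum]
    # chain[-1] is the word's preterminal (height 0); each step left goes one level up.
    return chain[-1 - height]


def words(node):
    """Collect the words under a node, left to right."""
    collected = []
    for child in node[1:]:
        collected.extend([child] if isinstance(child, str) else words(child))
    return collected


# The pointers from the instance line, including the 'rel' (predicate) entry.
POINTERS = [(0, 2, "ARG0"), (7, 0, "ARGM-MOD"), (8, 0, "rel"),
            (9, 1, "ARG1"), (11, 1, "ARGM-PRD"), (15, 1, "ARGM-TMP")]

for wordnum, height, label in POINTERS:
    subtree = select(TREE, wordnum, height)
    print(f"{label}: [{subtree[0]}] {' '.join(words(subtree))}")
```

Height 0 lands on the word's own tag node, so `(8, 0)` is `(VB join)`, while `(0, 2)` climbs from `(NNP Pierre)` through `NP` to `NP-SBJ`.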
In [19]:
# print out the subtree portion of each argument
for (argloc, argid) in inst0.arguments:
    print(argid + ":")
    print(argloc.select(inst0.tree))
    # argloc.select(inst0.tree).draw()   # Nope, too many windows to chase down
    print()
ARG0:
(NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))

ARGM-MOD:
(MD will)

ARG1:
(NP (DT the) (NN board))

ARGM-PRD:
(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))

ARGM-TMP:
(NP-TMP (NNP Nov.) (CD 29))