Exploring the PropBank in NLTK

  • This Python interface is called Jupyter Notebook. If you like what you see, take LING 1340/2340 Data Science for Linguists!
  • J&M Ch.21 Semantic Role Labeling: https://web.stanford.edu/~jurafsky/slp3/21.pdf
  • NLTK PropBank how-to: https://www.nltk.org/howto/propbank.html
In [1]:
import nltk
from nltk.corpus import treebank
In [2]:
# Does anyone remember the very first PTB sentence? 
treebank.parsed_sents()[0]
Out[2]:
[Image: tree diagram of the first Penn Treebank sentence]
  • JNB users: in-line tree drawing won't work unless you have the svgling package installed. With older versions of NLTK, you must have Ghostscript installed and added to PATH. See this Stack Overflow post.
In [3]:
from nltk.corpus import propbank

# The very first PropBank annotation ("instance") is on 'join', from the Pierre Vinken tree!
# The values look cryptic; details come later
print(propbank.instances()[0])
wsj_0001.mrg 0 8 gold join.01 vf--a 0:2-ARG0 7:0-ARGM-MOD 8:0-rel 9:1-ARG1 11:1-ARGM-PRD 15:1-ARGM-TMP
In [4]:
pb_instances = propbank.instances()

print(pb_instances[42])     # say
print(pb_instances[103])    # rise
wsj_0003.mrg 15 19 gold say.01 vp--a 1:2*20:0-ARG1 19:0-rel 21:1-ARG0
wsj_0004.mrg 8 16 gold rise.01 vp--a 0:2-ARG1 13:1-ARGM-DIS 16:0-rel 17:1-ARG4-to 20:1-ARG3-from
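The instance lines printed so far all share one layout: file ID, sentence number, word number, tagger, roleset, inflection code, then a list of `location-LABEL` items. As a rough illustration of how to read them, here is a minimal sketch that splits such a line by hand. It does not use NLTK; `parse_instance`, `parse_location`, `Pointer` and `Argument` are hypothetical names made up for this example. It also assumes that `*` and `,` both join several pointers into one location, and it flattens them into a single list, which drops the distinction NLTK keeps between chained and split pointers.

```python
from typing import NamedTuple


class Pointer(NamedTuple):
    wordnum: int  # index of a leaf in the sentence tree
    height: int   # levels to climb above that leaf's POS node


class Argument(NamedTuple):
    pointers: list  # more than one Pointer means a multi-part location
    label: str      # e.g. "ARG0", "ARGM-TMP", "ARG4-to", or "rel"


def parse_location(loc):
    """Turn '0:2' or '1:2*20:0' into a list of Pointer tuples."""
    # Simplification: treat "*" and "," alike and flatten the pieces.
    pieces = loc.replace(",", "*").split("*")
    return [Pointer(*map(int, piece.split(":"))) for piece in pieces]


def parse_instance(line):
    """Split one PropBank annotation line into named fields."""
    fileid, sentnum, wordnum, tagger, roleset, inflection, *args = line.split()
    parsed_args = []
    for arg in args:
        # Split at the first hyphen only, so "ARGM-DIS" and "ARG4-to" stay whole.
        loc, label = arg.split("-", 1)
        parsed_args.append(Argument(parse_location(loc), label))
    return {
        "fileid": fileid,
        "sentnum": int(sentnum),
        "wordnum": int(wordnum),
        "tagger": tagger,
        "roleset": roleset,
        "inflection": inflection,
        "arguments": parsed_args,
    }


say = parse_instance(
    "wsj_0003.mrg 15 19 gold say.01 vp--a 1:2*20:0-ARG1 19:0-rel 21:1-ARG0"
)
for arg in say["arguments"]:
    print(arg.label, arg.pointers)
```

In practice the attributes of the NLTK instance objects give you the same information already parsed; the sketch is only meant to make the printed string less cryptic.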
In [5]:
len(pb_instances)
Out[5]:
112917
In [6]:
len(propbank.verbs())
Out[6]:
3257
In [7]:
print(propbank.verbs()[:20])
print(propbank.verbs()[-20:])
['abandon', 'abate', 'abdicate', 'abet', 'abide', 'abolish', 'abort', 'abound', 'abridge', 'absolve', 'absorb', 'abstain', 'abuse', 'accede', 'accelerate', 'accept', 'access', 'acclaim', 'accommodate', 'accompany']
['wrap', 'wreak', 'wreck', 'wrench', 'wrest', 'wrestle', 'wriggle', 'wring', 'write', 'writhe', 'wrong', 'yank', 'yell', 'yelp', 'yield', 'zap', 'zero', 'zip', 'zone', 'zoom']
  • Lots of verbs are represented! Let's look at the verb frame ("roleset") for join.01:
In [8]:
propbank.roleset('join.01')
Out[8]:
<Element 'roleset' at 0x000001CFCC23EC00>
In [9]:
for role in propbank.roleset('join.01').findall('roles/role'):
    print(role.attrib['n'], role.attrib['descr'])
0 agent, entity doing the tying
1 patient, thing(s) being tied
2 instrument, string
In [10]:
rise01 = propbank.roleset('rise.01')

for role in rise01.findall("roles/role"):
    print(role.attrib['n'], role.attrib['descr'])
1 Logical subject, patient, thing rising
2 EXT, amount risen
3 start point
4 end point
M medium
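
The roleset objects above are XML elements, which is why `findall('roles/role')` and `.attrib` work on them. The sketch below shows the same lookup on a small hand-written XML fragment, so it runs without the corpus. The fragment is an approximation typed in from the rise.01 output above, and real frame files carry more attributes and child elements. `role_table` and `describe` are hypothetical helper names. The idea is to connect a numbered label on an instance (such as `ARG4-to`) to its description in the frame; ARGM modifiers are deliberately not looked up, since their meaning comes from the suffix (TMP, DIS, ...) rather than from the frame.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one frame entry (an assumption, not the real file).
FRAME_XML = """
<roleset id="rise.01">
  <roles>
    <role n="1" descr="Logical subject, patient, thing rising"/>
    <role n="2" descr="EXT, amount risen"/>
    <role n="3" descr="start point"/>
    <role n="4" descr="end point"/>
    <role n="M" descr="medium"/>
  </roles>
</roleset>
"""


def role_table(roleset):
    """Map each role's 'n' attribute to its description."""
    return {
        role.attrib["n"]: role.attrib["descr"]
        for role in roleset.findall("roles/role")
    }


def describe(label, table):
    """Describe a numbered label like 'ARG4-to'; return None for anything else."""
    core = label.split("-")[0]  # 'ARG4-to' -> 'ARG4'
    number = core[3:] if core.startswith("ARG") else ""
    return table.get(number) if number.isdigit() else None


rise01_demo = ET.fromstring(FRAME_XML)
roles = role_table(rise01_demo)
for label in ["ARG1", "ARG4-to", "ARG3-from", "ARGM-DIS"]:
    print(label, "->", describe(label, roles))
```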

Back to Mr. Vinken and his joining activity

In [11]:
inst0 = pb_instances[0]

print(inst0)
print(inst0.fileid, inst0.sentnum, inst0.wordnum, inst0.tagger)
wsj_0001.mrg 0 8 gold join.01 vf--a 0:2-ARG0 7:0-ARGM-MOD 8:0-rel 9:1-ARG1 11:1-ARGM-PRD 15:1-ARGM-TMP
wsj_0001.mrg 0 8 gold
In [12]:
print(dir(inst0))
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_get_tree', 'arguments', 'baseform', 'fileid', 'inflection', 'parse', 'parse_corpus', 'predicate', 'predid', 'roleset', 'sensenumber', 'sentnum', 'tagger', 'tree', 'wordnum']
In [13]:
inst0.tree
# The entire sentence tree can be accessed directly from the instance!
Out[13]:
[Image: tree diagram of the sentence containing this instance]
In [14]:
inst0.roleset
Out[14]:
'join.01'
In [15]:
# detailed inflection information on the verb
infl = inst0.inflection
infl.form, infl.tense, infl.aspect, infl.person, infl.voice
Out[15]:
('v', 'f', '-', '-', 'a')
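The five characters of the inflection code stand for form, tense, aspect, person and voice, in that order. Below is a minimal sketch of a decoder. The code tables reflect my understanding of the PropBank conventions that NLTK's inflection class follows, so treat them as an assumption to verify against the NLTK source; `decode_inflection` is a hypothetical helper, not part of NLTK. A hyphen means the feature is not marked and is returned as `None`.

```python
# Code tables (assumed; check against nltk.corpus.reader.propbank).
FORM = {"i": "infinitive", "g": "gerund", "p": "participle", "v": "finite"}
TENSE = {"f": "future", "p": "past", "n": "present"}
ASPECT = {"p": "perfect", "o": "progressive", "b": "perfect and progressive"}
PERSON = {"3": "third person"}
VOICE = {"a": "active", "p": "passive"}

FIELDS = (
    ("form", FORM),
    ("tense", TENSE),
    ("aspect", ASPECT),
    ("person", PERSON),
    ("voice", VOICE),
)


def decode_inflection(code):
    """Expand a 5-character code such as 'vf--a' into readable values."""
    if len(code) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {code!r}")
    # "-" (or any unlisted character) is not in the tables, so it maps to None.
    return {name: table.get(char) for (name, table), char in zip(FIELDS, code)}


print(decode_inflection("vf--a"))  # the code on the 'join' instance
print(decode_inflection("vp--a"))  # the code on the 'say' and 'rise' instances
```

Read this way, `vf--a` fits "will join" (finite, future, active) and `vp--a` fits a simple past-tense verb.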
In [16]:
# predicate points to the 9th word/tag (join/VB), with height 0
inst0.predicate
Out[16]:
PropbankTreePointer(8, 0)
In [17]:
# subtree for the predicate
print(inst0.predicate.select(inst0.tree))
(VB join)
In [18]:
# there are 2 essential (numbered) Args and 3 modifier Args. 
# each is (location, label)
# Arg0's location is (0,2): it starts with the 1st (=index 0) word/tag combo (Pierre/NNP) and then
#     goes up 2 tree levels to NP-SBJ. This NP-SBJ subtree is essentially Arg0.
# ARGM-PRD: secondary predication
inst0.arguments
Out[18]:
((PropbankTreePointer(0, 2), 'ARG0'),
 (PropbankTreePointer(7, 0), 'ARGM-MOD'),
 (PropbankTreePointer(9, 1), 'ARG1'),
 (PropbankTreePointer(11, 1), 'ARGM-PRD'),
 (PropbankTreePointer(15, 1), 'ARGM-TMP'))
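To make the (word number, height) idea concrete, here is a minimal pure-Python sketch of how such a pointer can be resolved against a tree. It is not NLTK's implementation: `parse_brackets` and `select` are hypothetical helpers, and the tree is a nested list rather than an NLTK `Tree`. The bracketed sentence is typed in by hand from the subtrees shown in this notebook, with the enclosing S and VP nodes filled in as an assumption. Height 0 is taken to mean the POS node directly above the word, which matches the `(VB join)` result for the predicate. This sentence contains no trace elements; in sentences that do, traces count as leaves as well.

```python
import re

SENTENCE = """
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,)
           (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))
   (VP (MD will)
       (VP (VB join) (NP (DT the) (NN board))
           (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
           (NP-TMP (NNP Nov.) (CD 29))))
   (. .))
"""


def parse_brackets(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                  # consume "("
            node = [tokens[pos]]      # node label
            pos += 1
            while tokens[pos] != ")":
                node.append(read())
            pos += 1                  # consume ")"
            return node
        word = tokens[pos]            # a leaf (word)
        pos += 1
        return word

    return read()


def select(tree, wordnum, height):
    """Return the node `height` levels above the POS node of leaf `wordnum`."""
    count = 0

    def walk(node, ancestors):
        nonlocal count
        for child in node[1:]:
            if isinstance(child, str):           # reached a leaf
                if count == wordnum:
                    chain = ancestors + [node]   # root ... POS node
                    if height >= len(chain):
                        raise IndexError("height climbs above the root")
                    return chain[len(chain) - 1 - height]
                count += 1
            else:
                found = walk(child, ancestors + [node])
                if found is not None:
                    return found
        return None                              # leaf not in this subtree

    return walk(tree, [])


tree = parse_brackets(SENTENCE)
print(select(tree, 8, 0))     # the predicate pointer (8, 0)
print(select(tree, 0, 2)[0])  # label of the node that pointer (0, 2) reaches
```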
In [19]:
# print out the subtree portion of each argument 
for (argloc, argid) in inst0.arguments:
    print(argid +":")
    print(argloc.select(inst0.tree))
    # argloc.select(inst0.tree).draw()  # Nope, too many windows to chase down
    print()
ARG0:
(NP-SBJ
  (NP (NNP Pierre) (NNP Vinken))
  (, ,)
  (ADJP (NP (CD 61) (NNS years)) (JJ old))
  (, ,))

ARGM-MOD:
(MD will)

ARG1:
(NP (DT the) (NN board))

ARGM-PRD:
(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))

ARGM-TMP:
(NP-TMP (NNP Nov.) (CD 29))
