Stanza by Stanford NLP: a New Library to Rule them All?¶

PyLing, April 9 2025, Na-Rae Han (naraehan@pitt.edu)

  • Surprise: for real-life language processing tasks, people don't use NLTK's POS tagger or syntactic parser
  • We use off-the-shelf NLP suites coupled with pre-trained language models
  • They commonly implement a pipeline, which takes care of essential text processing steps including tokenization, lemmatization, POS tagging, syntactic parsing, and named-entity recognition. They also tend to adopt the latest neural network architectures.
  • We'll take a look at Stanza, and also spaCy

Stanza: A Python NLP Package for Many Human Languages¶

  • By Stanford NLP Group
  • Project home: https://stanfordnlp.github.io/stanza/
  • GitHub repo: https://github.com/stanfordnlp/stanza
  • Pre-trained language models are posted on Hugging Face.
    • English model: https://huggingface.co/stanfordnlp/stanza-en
    • Korean model: https://huggingface.co/stanfordnlp/stanza-ko
    • Each language's model file is named default.zip; you can download and unzip it manually (see the sketch below for pointing Stanza at a custom folder).
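If you grab default.zip by hand (or simply want the models outside your home directory), you can point both the downloader and the pipeline at a custom folder. A minimal sketch, not run in this notebook; the folder path is made up, and model_dir / dir are the keyword names as I understand the current API, so double-check the Stanza docs:

import stanza
stanza.download('en', model_dir='D:/stanza_models')   # fetches and unpacks the English models there
nlp = stanza.Pipeline('en', dir='D:/stanza_models')    # builds the pipeline from that folder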
In [1]:
%pprint
Pretty printing has been turned OFF
In [2]:
# Install stanza for yourself. I've already done this, so commented out.  
#    This installs the package in my home directory ~/.local/lib/python3.11/site-packages

# !pip install stanza --user
In [3]:
import stanza
In [4]:
# Download the English language model via stanza.download('en')
# This creates a new folder ~/stanza_resources and downloads the model files into it
# I've already downloaded the model, so commented out:

# stanza.download('en')
In [5]:
# Build an English NLP pipeline
nlp = stanza.Pipeline('en')
# By default, it checks for a newer language model. Use download_method=None to skip the check
2025-04-10 13:22:45 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …
2025-04-10 13:22:45 INFO: Downloaded file to C:\Users\narae\stanza_resources\resources.json
2025-04-10 13:22:47 INFO: Loading these models for language: en (English):
============================================
| Processor    | Package                   |
--------------------------------------------
| tokenize     | combined                  |
| mwt          | combined                  |
| pos          | combined_charlm           |
| lemma        | combined_nocharlm         |
| constituency | ptb3-revised_charlm       |
| depparse     | combined_charlm           |
| sentiment    | sstplus_charlm            |
| ner          | ontonotes-ww-multi_charlm |
============================================

2025-04-10 13:22:47 INFO: Using device: cpu
2025-04-10 13:22:47 INFO: Loading: tokenize
C:\Program Files\Python311\Lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
2025-04-10 13:22:47 INFO: Loading: mwt
2025-04-10 13:22:47 INFO: Loading: pos
2025-04-10 13:22:48 INFO: Loading: lemma
2025-04-10 13:22:48 INFO: Loading: constituency
2025-04-10 13:22:48 INFO: Loading: depparse
2025-04-10 13:22:48 INFO: Loading: sentiment
2025-04-10 13:22:49 INFO: Loading: ner
2025-04-10 13:22:50 INFO: Done loading processors!
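As the comment above notes, download_method=None skips the update check and reuses the models already on disk. You can also load just the processors you need, which makes both loading and processing faster. A sketch, not run here:

# Reuse the downloaded models; load only the tokenizer, MWT expander, POS tagger, and lemmatizer
nlp_light = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma', download_method=None)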
In [6]:
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
In [7]:
# Printing out in CoNLL-U format. 
# Confession: I have no idea why it works the way it does. 
#   (Presumably Document implements __format__, with 'C' as its CoNLL-U format spec.)
print("{:C}".format(doc))
# text = Barack Obama was born in Hawaii.
# sent_id = 0
# constituency = (ROOT (S (NP (NNP Barack) (NNP Obama)) (VP (VBD was) (VP (VBN born) (PP (IN in) (NP (NNP Hawaii))))) (. .)))
# sentiment = 1
1	Barack	Barack	PROPN	NNP	Number=Sing	4	nsubj:pass	_	start_char=0|end_char=6|ner=B-PERSON
2	Obama	Obama	PROPN	NNP	Number=Sing	1	flat	_	start_char=7|end_char=12|ner=E-PERSON
3	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	4	aux:pass	_	start_char=13|end_char=16|ner=O
4	born	bear	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	0	root	_	start_char=17|end_char=21|ner=O
5	in	in	ADP	IN	_	6	case	_	start_char=22|end_char=24|ner=O
6	Hawaii	Hawaii	PROPN	NNP	Number=Sing	4	obl	_	start_char=25|end_char=31|ner=S-GPE|SpaceAfter=No
7	.	.	PUNCT	.	_	4	punct	_	start_char=31|end_char=32|ner=O

# text = He was elected president in 2008.
# sent_id = 1
# constituency = (ROOT (S (NP (PRP He)) (VP (VBD was) (VP (VBN elected) (S (NP (NN president))) (PP (IN in) (NP (CD 2008))))) (. .)))
# sentiment = 1
1	He	he	PRON	PRP	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	3	nsubj:pass	_	start_char=33|end_char=35|ner=O
2	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	3	aux:pass	_	start_char=36|end_char=39|ner=O
3	elected	elect	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	0	root	_	start_char=40|end_char=47|ner=O
4	president	president	NOUN	NN	Number=Sing	3	xcomp	_	start_char=48|end_char=57|ner=O
5	in	in	ADP	IN	_	6	case	_	start_char=58|end_char=60|ner=O
6	2008	2008	NUM	CD	NumForm=Digit|NumType=Card	3	obl	_	start_char=61|end_char=65|ner=S-DATE|SpaceAfter=No
7	.	.	PUNCT	.	_	3	punct	_	start_char=65|end_char=66|ner=O|SpaceAfter=No
In [8]:
print("{:C}".format(nlp("NLP is my favorite thing in the world. You have thousand years to live.")))
# text = NLP is my favorite thing in the world.
# sent_id = 0
# constituency = (ROOT (S (NP (NNP NLP)) (VP (VBZ is) (NP (NP (PRP$ my) (JJ favorite) (NN thing)) (PP (IN in) (NP (DT the) (NN world))))) (. .)))
# sentiment = 2
1	NLP	NLP	PROPN	NNP	Number=Sing	5	nsubj	_	start_char=0|end_char=3|ner=O
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	cop	_	start_char=4|end_char=6|ner=O
3	my	my	PRON	PRP$	Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs	5	nmod:poss	_	start_char=7|end_char=9|ner=O
4	favorite	favorite	ADJ	JJ	Degree=Pos	5	amod	_	start_char=10|end_char=18|ner=O
5	thing	thing	NOUN	NN	Number=Sing	0	root	_	start_char=19|end_char=24|ner=O
6	in	in	ADP	IN	_	8	case	_	start_char=25|end_char=27|ner=O
7	the	the	DET	DT	Definite=Def|PronType=Art	8	det	_	start_char=28|end_char=31|ner=O
8	world	world	NOUN	NN	Number=Sing	5	nmod	_	start_char=32|end_char=37|ner=O|SpaceAfter=No
9	.	.	PUNCT	.	_	5	punct	_	start_char=37|end_char=38|ner=O

# text = You have thousand years to live.
# sent_id = 1
# constituency = (ROOT (S (NP (PRP You)) (VP (VBP have) (NP (NP (CD thousand) (NNS years)) (SBAR (S (VP (TO to) (VP (VB live))))))) (. .)))
# sentiment = 1
1	You	you	PRON	PRP	Case=Nom|Person=2|PronType=Prs	2	nsubj	_	start_char=39|end_char=42|ner=O
2	have	have	VERB	VBP	Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin	0	root	_	start_char=43|end_char=47|ner=O
3	thousand	thousand	NUM	CD	NumForm=Word|NumType=Card	4	nummod	_	start_char=48|end_char=56|ner=B-DATE
4	years	year	NOUN	NNS	Number=Plur	2	obj	_	start_char=57|end_char=62|ner=E-DATE
5	to	to	PART	TO	_	6	mark	_	start_char=63|end_char=65|ner=O
6	live	live	VERB	VB	VerbForm=Inf	4	acl	_	start_char=66|end_char=70|ner=O|SpaceAfter=No
7	.	.	PUNCT	.	_	2	punct	_	start_char=70|end_char=71|ner=O|SpaceAfter=No
In [9]:
# What can you do with a document object?
dir(doc)
Out[9]:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_attach_coref_mentions', '_coref', '_count_words', '_ents', '_lang', '_num_tokens', '_num_words', '_process_sentences', '_sentences', '_text', 'add_property', 'build_ents', 'coref', 'entities', 'ents', 'from_serialized', 'get', 'get_mwt_expansions', 'iter_tokens', 'iter_words', 'lang', 'mark_whitespace', 'num_tokens', 'num_words', 'reindex_sentences', 'sentence_comments', 'sentences', 'set', 'set_mwt_expansions', 'sort_features', 'text', 'to_dict', 'to_serialized']
In [10]:
# What can you do with a sentence object?
dir(doc.sentences[0])
Out[10]:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_comments', '_constituency', '_dependencies', '_doc', '_doc_id', '_empty_words', '_enhanced_dependencies', '_ents', '_index', '_process_tokens', '_sent_id', '_sentiment', '_text', '_tokens', '_words', 'add_comment', 'add_property', 'build_dependencies', 'build_ents', 'build_fake_dependencies', 'comments', 'constituency', 'dependencies', 'dependencies_string', 'doc', 'doc_id', 'empty_words', 'entities', 'ents', 'has_enhanced_dependencies', 'id', 'index', 'print_dependencies', 'print_tokens', 'print_words', 'rebuild_dependencies', 'sent_id', 'sentiment', 'text', 'to_dict', 'tokens', 'tokens_string', 'words', 'words_string']

Tips for finding your way around a new library and its data objects¶

  • Use dir() and help(), which reveal the methods available for a particular data object (see the quick example below)
  • Also... look up the documentation! Stanza's reference page on its data objects is here: https://stanfordnlp.github.io/stanza/data_objects.html
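For instance, once dir() turns up a promising name, help() prints its docstring:

help(doc.iter_words)    # docstring for the iter_words method that dir(doc) listed above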
In [11]:
doc.sentences[0].text
Out[11]:
'Barack Obama was born in Hawaii.'
In [12]:
doc.sentences[0].constituency
Out[12]:
(ROOT (S (NP (NNP Barack) (NNP Obama)) (VP (VBD was) (VP (VBN born) (PP (IN in) (NP (NNP Hawaii))))) (. .)))
In [13]:
dir(doc.sentences[0].constituency)
Out[13]:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'add_property', 'all_leaves_are_preterminals', 'children', 'count_unary_depth', 'depth', 'get_common_words', 'get_compound_constituents', 'get_constituent_counts', 'get_rare_words', 'get_root_labels', 'get_unique_constituent_labels', 'get_unique_tags', 'get_unique_words', 'is_leaf', 'is_preterminal', 'label', 'leaf_labels', 'pretty_print', 'prune_none', 'remap_constituent_labels', 'remap_words', 'replace_tags', 'replace_words', 'reverse', 'simplify_labels', 'visit_preorder', 'write_treebank', 'yield_preterminals']
In [14]:
doc.sentences[0].constituency.__str__()
Out[14]:
'(ROOT (S (NP (NNP Barack) (NNP Obama)) (VP (VBD was) (VP (VBN born) (PP (IN in) (NP (NNP Hawaii))))) (. .)))'
In [15]:
import nltk
tree1 = nltk.Tree.fromstring(doc.sentences[0].constituency.__str__())
tree1
Out[15]:
(in-line drawing of the constituency tree for sentence 1)
  • JNB users: in-line tree drawing won't work unless you have the svgling package installed.
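If you'd rather not install svgling, an ASCII rendering works anywhere; nltk's Tree has a pretty_print() method (and Stanza's constituency object lists one too, per the dir() output above):

# !pip install svgling --user    # enables the in-line SVG tree drawing
tree1.pretty_print()             # ASCII-art fallback via nltk.Tree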
In [16]:
doc.sentences[0].print_dependencies()
('Barack', 4, 'nsubj:pass')
('Obama', 1, 'flat')
('was', 4, 'aux:pass')
('born', 0, 'root')
('in', 6, 'case')
('Hawaii', 4, 'obl')
('.', 4, 'punct')
In [17]:
# visualizing dependency trees! 
from stanza.utils.visualization.dependency_visualization import visualize_doc
visualize_doc(doc, 'en')
(rendered dependency arcs for the two sentences, showing each token with its POS tag and arc labels such as nsubj:pass, flat, aux:pass, case, obl, and xcomp)

Note: When this JNB is uploaded to GitHub, the dependency arc labels ("nsubj:pass", "case", etc.) don't show up. Boo!

In [18]:
doc.sentences[0].words[:3]   # first 3 words
Out[18]:
[{
  "id": 1,
  "text": "Barack",
  "lemma": "Barack",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 4,
  "deprel": "nsubj:pass",
  "start_char": 0,
  "end_char": 6
}, {
  "id": 2,
  "text": "Obama",
  "lemma": "Obama",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 1,
  "deprel": "flat",
  "start_char": 7,
  "end_char": 12
}, {
  "id": 3,
  "text": "was",
  "lemma": "be",
  "upos": "AUX",
  "xpos": "VBD",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
  "head": 4,
  "deprel": "aux:pass",
  "start_char": 13,
  "end_char": 16
}]
In [19]:
for w in doc.sentences[0].words:
    print(w.id, w.text, w.lemma, w.upos, w.xpos, w.feats, w.head, w.deprel, sep="\t")
1	Barack	Barack	PROPN	NNP	Number=Sing	4	nsubj:pass
2	Obama	Obama	PROPN	NNP	Number=Sing	1	flat
3	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	4	aux:pass
4	born	bear	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	0	root
5	in	in	ADP	IN	None	6	case
6	Hawaii	Hawaii	PROPN	NNP	Number=Sing	4	obl
7	.	.	PUNCT	.	None	4	punct

Careful: words have IDs, which start at 1. Their list index, however, starts at 0. So when switching back and forth between list index and word ID, you have to be mindful of the +1/-1 offset.

For example, "Hawaii" has a word ID of 6, but its list index is 5. Its head points to "born", whose ID is 4 but whose index is 3.

NER (Named-entity recognition)!¶

  • https://stanfordnlp.github.io/stanza/ner.html
In [20]:
doc.entities
Out[20]:
[{
  "text": "Barack Obama",
  "type": "PERSON",
  "start_char": 0,
  "end_char": 12
}, {
  "text": "Hawaii",
  "type": "GPE",
  "start_char": 25,
  "end_char": 31
}, {
  "text": "2008",
  "type": "DATE",
  "start_char": 61,
  "end_char": 65
}]
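doc.ents is a shorter alias for the same list (both names appear in dir(doc) above), so iterating over the entities is straightforward:

for ent in doc.ents:
    print(ent.text, '->', ent.type)
# Barack Obama -> PERSON
# Hawaii -> GPE
# 2008 -> DATE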

Let's try... Korean!¶

  • https://huggingface.co/stanfordnlp/stanza-ko
In [21]:
nlp_ko = stanza.Pipeline('ko')
doc_ko = nlp_ko("오늘 날씨가 아주 좋군요.")
2025-04-10 13:22:59 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …
2025-04-10 13:22:59 INFO: Downloaded file to C:\Users\narae\stanza_resources\resources.json
2025-04-10 13:23:00 INFO: Loading these models for language: ko (Korean):
==============================
| Processor | Package        |
------------------------------
| tokenize  | kaist          |
| pos       | kaist_nocharlm |
| lemma     | kaist_nocharlm |
| depparse  | kaist_nocharlm |
==============================

2025-04-10 13:23:00 INFO: Using device: cpu
2025-04-10 13:23:00 INFO: Loading: tokenize
2025-04-10 13:23:00 INFO: Loading: pos
2025-04-10 13:23:01 INFO: Loading: lemma
2025-04-10 13:23:02 INFO: Loading: depparse
2025-04-10 13:23:02 INFO: Done loading processors!
In [22]:
doc_ko.sentences[0].print_dependencies()
('오늘', 2, 'compound')
('날씨가', 4, 'nsubj')
('아주', 4, 'advmod')
('좋군요', 0, 'root')
('.', 4, 'punct')
In [23]:
# How is morphology represented?
for w in doc_ko.sentences[0].words:
    print(w.id, w.text, w.lemma, w.upos, w.xpos, w.feats, w.head, w.deprel, sep="\t")
1	오늘	오늘	NOUN	ncn	None	2	compound
2	날씨가	날씨+가	NOUN	ncn+jcs	None	4	nsubj
3	아주	아주	ADV	mag	None	4	advmod
4	좋군요	좋+군요	ADJ	paa+ef	None	0	root
5	.	.	PUNCT	sf	None	4	punct
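The kaist model appears to pack the morpheme segmentation into the lemma field (joined by '+'), with matching tags in xpos, so the two can be zipped together. A small sketch based on the output above:

# Pair each morpheme with its kaist-style tag
for w in doc_ko.sentences[0].words:
    print(w.text, list(zip(w.lemma.split('+'), w.xpos.split('+'))))
# among others:
# 날씨가 [('날씨', 'ncn'), ('가', 'jcs')]
# 좋군요 [('좋', 'paa'), ('군요', 'ef')]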

Back to English: Yelp Reviews, 10K only¶

In [45]:
import pandas as pd

filename = 'review_10k.json'
df = pd.read_json(filename, lines=True, encoding='utf-8')
print(df.head(5))
                review_id                 user_id             business_id  \
0  YS5GiNt7SeaofYv4Ms_WMQ  6U0PY5tSp2kiFFBG-2conA  6M747U8wDVZcRZHqIYB_iQ   
1  7IfTa9YkgpUl2uxpPH3eng  8RrH2pgJJ2PPGZwYuzJnRg  DpmTxK7_GPfiyLAM6LSL7Q   
2  cNmwd1HdI6o075iqqqTduQ  ZrMA70Hq6mkJT-1l0xnwPQ  Ckzuf2-coSImCKlYwKRTNw   
3  I5Qlw24u74kCEtxxFAUJDg  1J4UgfT5v5rwYLCX6fCbYQ  3L-ezs0VANOtmdJDlbha1Q   
4  SQgplD2UFkduTfjgdgaYqg  KHgWURbB5TNHNYu72memYg  v_1d4aNN7bjiQ7aF3S0hgA   

   stars  useful  funny  cool  \
0      1       2      0     0   
1      5       0      0     0   
2      3       1      1     1   
3      5       1      0     0   
4      5       0      0     0   

                                                text                date  
0  Very confusing entrance. The putting in your n... 2015-11-30 20:37:04  
1  Absolutely delicious vegan donuts - not just o... 2018-09-13 20:02:55  
2  I had such high hopes! I will say the location... 2016-05-13 17:10:57  
3  Busy when I went. Worth the wait. The best Mex... 2013-06-15 03:25:33  
4  Have been wanting to check this place out for ... 2018-07-29 16:06:48  
In [25]:
df = df[['stars', 'text']]
df.head()
Out[25]:
   stars                                               text
0      1  Very confusing entrance. The putting in your n...
1      5  Absolutely delicious vegan donuts - not just o...
2      3  I had such high hopes! I will say the location...
3      5  Busy when I went. Worth the wait. The best Mex...
4      5  Have been wanting to check this place out for ...
In [26]:
df.text[0]
Out[26]:
'Very confusing entrance. The putting in your name line and the your table is ready line are just muddled together. Called ahead to see if they accept reservation, they don\'t,  but was told to tell the hostess what game we wanted to see and they would make sure we were in that section. When we arrived the hostess, Natalie, asked our party size and then when we said we had called ahead as we were told to do but before we had a chance she cut us off with "We don\'t do reservations" we continued on with our request. When we were seated we had to ask to have the game we wanted put on a screen near us. They did so. The food was okay. A bit pricey but you are in a tourist mecca. My wife ordered a "Wedge" salad what she got was a salad with a few lettuce leaves definately no wedge. I had the Cuban Meatloaf sandwich it was alright. All in all if you want to watch a game and pay alot for a beer (6.50 a pint for domestic) this is your place.'
In [27]:
df.text[3]
Out[27]:
"Busy when I went. Worth the wait. The best Mexican food I've had in Portland by far. Food made with love and great flavor!!"
In [28]:
doc1 = nlp(df.text[3])
In [29]:
for w in doc1.iter_words():
    print(w.id, w.text, w.lemma, w.upos, w.xpos, w.feats, w.head, w.deprel, sep="\t")
1	Busy	busy	ADJ	JJ	Degree=Pos	0	root
2	when	when	ADV	WRB	PronType=Int	4	advmod
3	I	I	PRON	PRP	Case=Nom|Number=Sing|Person=1|PronType=Prs	4	nsubj
4	went	go	VERB	VBD	Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin	1	advcl
5	.	.	PUNCT	.	None	1	punct
1	Worth	worth	ADJ	JJ	Degree=Pos	0	root
2	the	the	DET	DT	Definite=Def|PronType=Art	3	det
3	wait	wait	NOUN	NN	Number=Sing	1	obj
4	.	.	PUNCT	.	None	1	punct
1	The	the	DET	DT	Definite=Def|PronType=Art	4	det
2	best	good	ADJ	JJS	Degree=Sup	4	amod
3	Mexican	Mexican	ADJ	JJ	Degree=Pos	4	amod
4	food	food	NOUN	NN	Number=Sing	0	root
5	I	I	PRON	PRP	Case=Nom|Number=Sing|Person=1|PronType=Prs	7	nsubj
6	've	have	AUX	VBP	Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	7	aux
7	had	have	VERB	VBN	Tense=Past|VerbForm=Part	4	acl:relcl
8	in	in	ADP	IN	None	9	case
9	Portland	Portland	PROPN	NNP	Number=Sing	7	obl
10	by	by	ADP	IN	None	11	case
11	far	far	ADV	RB	Degree=Pos	7	obl
12	.	.	PUNCT	.	None	4	punct
1	Food	food	NOUN	NN	Number=Sing	0	root
2	made	make	VERB	VBN	Tense=Past|VerbForm=Part|Voice=Pass	1	acl
3	with	with	ADP	IN	None	4	case
4	love	love	NOUN	NN	Number=Sing	2	obl
5	and	and	CCONJ	CC	None	7	cc
6	great	great	ADJ	JJ	Degree=Pos	7	amod
7	flavor	flavor	NOUN	NN	Number=Sing	4	conj
8	!!	!!	PUNCT	.	None	1	punct

Adjective + HEAD pairs?¶

  • Let's print adjectives and their heads.
  • An adjective's head can be a noun ("happy girl"), the ROOT ("is happy"), or another adjective when in a conjunction.
In [30]:
for sent in doc1.sentences:
    print(sent.text)
    for w in sent.words:
        if w.upos=='ADJ': 
            # note how headword is looked up through ID and index:
            headtext = sent.words[w.head-1].text if w.head > 0 else "ROOT"
            print(" ", w.id, w.text, w.upos, str(w.head)+":"+headtext, w.deprel, sep="\t")
    print()
Busy when I went.
 	1	Busy	ADJ	0:ROOT	root

Worth the wait.
 	1	Worth	ADJ	0:ROOT	root

The best Mexican food I've had in Portland by far.
 	2	best	ADJ	4:food	amod
 	3	Mexican	ADJ	4:food	amod

Food made with love and great flavor!!
 	6	great	ADJ	7:flavor	amod

Sentiment analysis!¶

  • https://stanfordnlp.github.io/stanza/sentiment.html
In [31]:
for i, sent in enumerate(doc1.sentences):
    print(sent.text)
    print("%d -> %d" % (i, sent.sentiment))
Busy when I went.
0 -> 1
Worth the wait.
1 -> 1
The best Mexican food I've had in Portland by far.
2 -> 2
Food made with love and great flavor!!
3 -> 2
In [32]:
senti_scores = [sent.sentiment for sent in doc1.sentences]
print(senti_scores)    # 1s and 2s
import numpy as np
np.mean(senti_scores)    # this was a 5-star review
[1, 1, 2, 2]
Out[32]:
1.5
In [33]:
doc2 = nlp(df.text[0])
for i, sent in enumerate(doc2.sentences):
    print(sent.text)
    print("%d -> %d" % (i, sent.sentiment))

senti_scores2 = [sent.sentiment for sent in doc2.sentences]
np.mean(senti_scores2)   # this was a 1-star review
Very confusing entrance.
0 -> 0
The putting in your name line and the your table is ready line are just muddled together.
1 -> 0
Called ahead to see if they accept reservation, they don't,  but was told to tell the hostess what game we wanted to see and they would make sure we were in that section.
2 -> 1
When we arrived the hostess, Natalie, asked our party size and then when we said we had called ahead as we were told to do but before we had a chance she cut us off with "We don't do reservations" we continued on with our request.
3 -> 0
When we were seated we had to ask to have the game we wanted put on a screen near us.
4 -> 1
They did so.
5 -> 1
The food was okay.
6 -> 1
A bit pricey but you are in a tourist mecca.
7 -> 1
My wife ordered a "Wedge" salad what she got was a salad with a few lettuce leaves definately no wedge.
8 -> 1
I had the Cuban Meatloaf sandwich it was alright.
9 -> 1
All in all if you want to watch a game and pay alot for a beer (6.50 a pint for domestic) this is your place.
10 -> 1
Out[33]:
0.7272727272727273

What adjectives and adverbs are common in Yelp reviews?¶

  • Let's process all 10K reviews and find out!
  • This will take a long time. We'll use the tqdm library to keep an eye on the progress.
In [ ]:
# progress bar! 
from tqdm import tqdm
tqdm.pandas(desc='DataFrame Operation')
review_docs = df.text.progress_apply(nlp)

## Well -- tqdm reports 2-5 sec per review, estimates 7-12 hours of total processing time. YIKES! 
## So I had to abort this run. 
In [48]:
# So let's just do... 500 reviews. How long does it take? 
%time review_docs = df[:500].text.apply(nlp)
len(review_docs) 
CPU times: total: 2h 20min 27s
Wall time: 25min 10s
Out[48]:
500
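25 minutes for 500 reviews is rough. Since the analysis below only needs POS tags, a stripped-down pipeline that drops the constituency, dependency, sentiment, and NER processors should be substantially faster (a sketch, not timed here; a GPU would also help, since the log above shows this run used the CPU):

# POS-only pipeline: skips the slow parsing/NER/sentiment models
nlp_pos = stanza.Pipeline('en', processors='tokenize,mwt,pos', download_method=None)
review_docs_pos = df[:500].text.apply(nlp_pos)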
In [49]:
dir(review_docs[0])
Out[49]:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_attach_coref_mentions', '_coref', '_count_words', '_ents', '_lang', '_num_tokens', '_num_words', '_process_sentences', '_sentences', '_text', 'add_property', 'build_ents', 'coref', 'entities', 'ents', 'from_serialized', 'get', 'get_mwt_expansions', 'iter_tokens', 'iter_words', 'lang', 'mark_whitespace', 'num_tokens', 'num_words', 'reindex_sentences', 'sentence_comments', 'sentences', 'set', 'set_mwt_expansions', 'sort_features', 'text', 'to_dict', 'to_serialized']
In [50]:
# (Condition, Outcome) format for every word in our favorite review
[(word.upos, word.text) for word in review_docs[3].iter_words()]
Out[50]:
[('ADJ', 'Busy'), ('ADV', 'when'), ('PRON', 'I'), ('VERB', 'went'), ('PUNCT', '.'), ('ADJ', 'Worth'), ('DET', 'the'), ('NOUN', 'wait'), ('PUNCT', '.'), ('DET', 'The'), ('ADJ', 'best'), ('ADJ', 'Mexican'), ('NOUN', 'food'), ('PRON', 'I'), ('AUX', "'ve"), ('VERB', 'had'), ('ADP', 'in'), ('PROPN', 'Portland'), ('ADP', 'by'), ('ADV', 'far'), ('PUNCT', '.'), ('NOUN', 'Food'), ('VERB', 'made'), ('ADP', 'with'), ('NOUN', 'love'), ('CCONJ', 'and'), ('ADJ', 'great'), ('NOUN', 'flavor'), ('PUNCT', '!!')]
In [51]:
# For all 500 review docs, a nested list comprehension that flattens:
pos_word = [(word.upos, word.text.lower()) for mydoc in review_docs for word in mydoc.iter_words()]
pos_word[:10]
Out[51]:
[('ADV', 'very'), ('ADJ', 'confusing'), ('NOUN', 'entrance'), ('PUNCT', '.'), ('DET', 'the'), ('VERB', 'putting'), ('ADP', 'in'), ('PRON', 'your'), ('NOUN', 'name'), ('NOUN', 'line')]
In [52]:
len(pos_word)
Out[52]:
66238
In [53]:
pos_word_CFD = nltk.ConditionalFreqDist(pos_word)
pos_word_CFD['ADJ'].most_common(50)
Out[53]:
[('good', 266), ('great', 161), ('other', 92), ('best', 92), ('nice', 88), ('more', 76), ('first', 69), ('new', 67), ('little', 67), ('delicious', 63), ('friendly', 63), ('amazing', 61), ('hot', 49), ('bad', 47), ('sure', 46), ('few', 45), ('same', 45), ('better', 44), ('last', 44), ('small', 41), ('many', 40), ('clean', 39), ('tasty', 38), ('different', 38), ('fresh', 38), ('perfect', 37), ('full', 36), ('worth', 36), ('only', 35), ('next', 34), ('big', 32), ('sweet', 31), ('free', 30), ('happy', 30), ('most', 29), ('awesome', 29), ('long', 28), ('wonderful', 27), ('old', 26), ('busy', 25), ('cool', 25), ('high', 24), ('hard', 24), ('else', 24), ('much', 23), ('worst', 23), ('less', 23), ('large', 23), ('decent', 23), ('several', 22)]
In [54]:
pos_word_CFD['ADJ'].freq('good')
Out[54]:
0.050236071765816806
In [55]:
pos_word_CFD['ADV'].most_common(50)
Out[55]:
[('so', 244), ('very', 218), ('when', 179), ('just', 175), ('here', 164), ('really', 124), ('back', 113), ('also', 99), ('never', 87), ('even', 84), ('how', 84), ('then', 83), ('there', 83), ('too', 79), ('always', 77), ('only', 77), ('again', 71), ('well', 67), ('as', 63), ('now', 58), ('still', 58), ('definitely', 57), ('ever', 51), ('pretty', 45), ('however', 42), ('about', 38), ('where', 36), ('highly', 32), ('in', 31), ('away', 30), ('super', 29), ('almost', 28), ('much', 27), ('actually', 26), ('more', 25), ('over', 25), ('maybe', 24), ('later', 23), ('probably', 22), ('once', 22), ('finally', 22), ('right', 21), ('quite', 21), ('all', 21), ('why', 20), ('though', 20), ('inside', 19), ('usually', 19), ('first', 19), ('soon', 19)]

Caution: Using Pre-Built NLP Suites¶

  • These tools have popularized NLP among general CS, data-science, and "analytics" crowds.
  • Those users tend to just plug their text into these suites and use the output for their own goals... However, they often do not:
    • try to understand how the tools are built: what algorithms are used, and what language data the models were trained on
    • pay attention to the linguistic representations at the various levels: POS, dependency syntax, etc.
    • critically evaluate where the models succeed and where they fall short!

We're trained linguists; we can and should do better.¶

  • If you incorporate these tools into your project (you should!), make sure to engage with them, evaluate the output with a critical eye, and consider whether they are a good fit for your particular language data: how accurate they are, where they succeed, and where they fall short. This is important!