Na-Rae Han (naraehan@pitt.edu), 5/30/2017, CMU DH Summer Workshop
Jupyter tips:
More on https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/
print("hello, world!")
greet is a variable name assigned to a string value; note the absence of quotation marks around the variable name.
greet = "Hello, world!"
greet + " I come in peace."
greet.upper()
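Strings come with many more built-in methods; a couple worth trying (a quick sketch):
greet.lower()                         # all lowercase: 'hello, world!'
greet.replace("world", "Jupyter")     # substitution: 'Hello, Jupyter!'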
len() returns the length of a string in the number of characters.
len(greet)
You can use +, -, * and / with numbers.
num1 = 5678
num2 = 3.141592
result = num1 / num2
print(num1, "divided by", str(num2), "is", result)
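Only / is shown above; the other operators work the same way (a quick sketch):
num1 + num2    # addition: 5681.141592
num1 - num2    # subtraction
num1 * num2    # multiplication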
Lists are created with square brackets [ ], with elements separated by commas. Lists can have strings, numbers, and more. Use len() to get the size of a list and in to see if an element is in a list.
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)
'mauve' not in li
# Try [0], [2], [-1], [3:5], [3:], [:5]
li[0]
for x in li:
    print(x, len(x))
print("Done!")
List comprehensions let you apply an operation such as .upper(), len(), or +'ish' to every element of a list, optionally filtering as you go.
[x for x in li if x.endswith('e')]
[x+'ish' for x in li]
[len(x) for x in li]
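You can also filter and transform in a single comprehension (a small sketch):
[x.upper() for x in li if x.endswith('e')]   # ['BLUE', 'WHITE']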
# Dictionaries map keys to values; here, names to ages
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Homer']
len(di)
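Dictionaries can also be updated and looped over; a minimal sketch:
di['Maggie'] = 1          # add a new key-value pair
for name in di:
    print(name, di[name])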
NLTK is an external module; you can start using it after importing it.
nltk.word_tokenize() is a handy tokenizing function, one of the many functions NLTK provides.
It turns a text (a single string) into a list of tokenized words.
import nltk
nltk.word_tokenize(greet)
help(nltk.word_tokenize)
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)
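Note how the tokenizer splits contractions and punctuation into separate tokens; for example (a quick sketch):
nltk.word_tokenize("Don't panic!")   # ['Do', "n't", 'panic', '!']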
nltk.FreqDist() is another useful NLTK function: it builds a frequency count from a list of tokens.
# The first "Rose" is capitalized. How to lowercase it?
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)
freq = nltk.FreqDist(toks)
freq
freq.most_common(3)
freq['rose']
len(freq)
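To answer the question above, one way is to lowercase the string before tokenizing (a minimal sketch):
toks_lower = nltk.word_tokenize(sent.lower())
freq_lower = nltk.FreqDist(toks_lower)
freq_lower['rose']    # now counts the capitalized 'Rose' too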
myfile = 'C:/Users/zoso/Desktop/inaugural/1789-Washington.txt' # Mac users should leave out C:
wtxt = open(myfile).read()
print(wtxt)
len(wtxt) # Number of characters in text
'fellow citizens'.lower() in wtxt.lower() # phrase as a substring
'Americans' in wtxt
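Beyond a yes/no membership test, str.count() tallies how many times a substring occurs (a quick sketch):
wtxt.lower().count('fellow')    # number of occurrences of 'fellow'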
# Turn off/on pretty printing (prints too many lines)
%pprint
# Tokenize text
nltk.word_tokenize(wtxt)
wtokens = nltk.word_tokenize(wtxt)
len(wtokens) # Number of words in text
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['citizens']
wfreq['the']
len(wfreq) # Number of unique words in text
wfreq.most_common(40) # 40 most common words
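FreqDist has other handy methods; for instance, .hapaxes() lists words that occur only once (a sketch):
wfreq.hapaxes()[:20]    # first 20 words occurring exactly once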
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!'] # Assuming every sentence ends with ., ! or ?
print(sentcount)
# Tokens include symbols and punctuation. First 50 tokens:
wtokens[:50]
wtokens_nosym = [t for t in wtokens if t.isalnum()] # alpha-numeric tokens only
len(wtokens_nosym)
# First 50 tokens, alpha-numeric tokens only:
wtokens_nosym[:50]
len(wtokens_nosym)/sentcount # Average sentence length in number of words
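The same ingredients give average word length (a quick sketch):
sum(len(t) for t in wtokens_nosym)/len(wtokens_nosym)   # average word length in characters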
[w for w in wfreq if len(w) >= 13] # all 13+ character words
long = [w for w in wfreq if len(w) >= 13]
for w in long:
    print(w, len(w), wfreq[w]) # long words tend to be less frequent
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/zoso/Desktop/inaugural' # Mac users should leave out C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt') # all files ending in 'txt'
# .txt file names as file IDs
inaug.fileids()
# NLTK automatically tokenizes the corpus. First 50 words:
print(inaug.words()[:50])
# You can also specify individual file ID. First 50 words from Obama 2009:
print(inaug.words('2009-Obama.txt')[:50])
# NLTK automatically segments sentences too, which are accessed through .sents()
print(inaug.sents('2009-Obama.txt')[0]) # first sentence
print(inaug.sents('2009-Obama.txt')[1]) # 2nd sentence
# How long are these speeches in terms of word and sentence count?
print('Washington 1789:', len(inaug.words('1789-Washington.txt')), len(inaug.sents('1789-Washington.txt')))
print('Obama 2009:', len(inaug.words('2009-Obama.txt')), len(inaug.sents('2009-Obama.txt')))
# for-loop through file IDs and print out word count.
for f in inaug.fileids():
    print(len(inaug.words(f)), f)
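The same loop pattern works for any per-file statistic; for example, average sentence length per speech (a sketch):
for f in inaug.fileids():
    print(len(inaug.words(f)) / len(inaug.sents(f)), f)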
# Corpus size in number of words
print(len(inaug.words()))
# Building word frequency distribution for the entire corpus
inaug_freq = nltk.FreqDist(inaug.words())
inaug_freq.most_common(100)
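The top of that list is dominated by function words and punctuation. One way to focus on content words, assuming the NLTK stopwords corpus has been downloaded (e.g., via nltk.download('stopwords')):
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
content_words = [w.lower() for w in inaug.words() if w.isalnum() and w.lower() not in stops]
inaug_content_freq = nltk.FreqDist(content_words)
inaug_content_freq.most_common(20)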
Take a Python course!