Na-Rae Han (naraehan@pitt.edu), 2/17/2017, Pitt Library Workshop

# Preparation

Jupyter tips:

• Shift+ENTER to run the cell and move to the next cell
• Alt+ENTER to run the cell and create a new cell below

## The very basics

### First code

• Printing a string, using print().
In [ ]:
print("hello, world!")


### The string type

• String type objects are enclosed in quotation marks.
• + is a concatenation operator.
• Below, greet is a variable name assigned to a string value; note the absence of quotation marks.
In [ ]:
greet = "Hello, world!"
greet + " I come in peace."

• String methods such as .upper() and .lower() return a transformed copy of a string.
In [ ]:
greet.upper()
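.lower() works the same way. Note that string methods return a new string; the original `greet` is unchanged:

```python
greet = "Hello, world!"
print(greet.lower())   # hello, world!
print(greet)           # original is unchanged: Hello, world!
```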

• len() returns the length of a string, in number of characters.
In [ ]:
len(greet)


### Numbers

• Integers and floats are written without quotes.
• You can use algebraic operations such as +, -, * and / with numbers.
In [ ]:
num1 = 5678
num2 = 3.141592
result = num1 / num2
print(result)


### Lists

• Lists are enclosed in [ ], with elements separated by commas. Lists can hold strings, numbers, and more.
• As with strings, you can use len() to get the size of a list.
• As with strings, you can use in to test whether an element is in a list.
In [ ]:
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
len(li)

In [ ]:
'blue' in li


### for loop

• Using a for loop, you can iterate through a list, applying the same set of operations to each element.
• As with conditionals, the embedded code block is marked by indentation.
In [ ]:
for x in li:
    print(x, len(x))
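Since conditionals work the same way, here is a minimal sketch combining for with if, using the same li list (a toy filter, just for illustration):

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
for x in li:
    if len(x) > 4:     # only print the longer color names
        print(x)       # prints: green, black, white
```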


### List comprehension

• List comprehension builds a new list from an existing list.
• You can filter for certain elements, and you can apply a transformation in the process.
• Try: .upper(), len(), +'ish'
In [ ]:
[x for x in li if x.endswith('e')]

In [ ]:
[x+'ish' for x in li]

In [ ]:
[len(x) for x in li]
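The two pieces can also be combined: filter and transform in a single comprehension (a sketch with the same li list):

```python
li = ['red', 'blue', 'green', 'black', 'white', 'pink']
# keep only words ending in 'e', then upper-case them
[x.upper() for x in li if x.endswith('e')]   # ['BLUE', 'WHITE']
```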


### Dictionaries

• Dictionaries hold key:value mappings.
• len() on a dictionary returns the number of keys.
In [ ]:
di = {'Homer':35, 'Marge':35, 'Bart':10, 'Lisa':8}
di['Bart']

In [ ]:
len(di)
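Beyond look-up, dictionaries can be updated and looped over; a small sketch with the same di (the 'Maggie' entry is added here just for illustration):

```python
di = {'Homer': 35, 'Marge': 35, 'Bart': 10, 'Lisa': 8}
di['Maggie'] = 1           # add a new key:value pair
for name in di:            # looping over a dictionary visits its keys
    print(name, di[name])
```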


## Using NLTK

• NLTK is an external module; you can start using it after importing it.

• nltk.word_tokenize() is a handy tokenizing function, one of the many functions NLTK provides.

• It turns a text (a single string) into a list of tokenized words.

In [ ]:
import nltk

In [ ]:
nltk.word_tokenize(greet)

In [ ]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)

• nltk.FreqDist() is another useful NLTK function.
• It builds a frequency dictionary from a list.
In [ ]:
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)

In [ ]:
freq = nltk.FreqDist(toks)
freq

In [ ]:
freq.most_common(3)

In [ ]:
freq['rose']

In [ ]:
len(freq)


## Reading in a text file

• open(filename).read() reads in the content of a text file as a single string.
In [ ]:
myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt'  # Mac users should leave out C:
wtxt = open(myfile).read()
print(wtxt)

In [ ]:
len(wtxt)     # Number of characters in text

In [ ]:
'fellow citizens' in wtxt
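If the inaugural file is not at hand, the same open(...).read() pattern can be tried on a small file created on the fly (the name sample.txt is just for illustration):

```python
with open('sample.txt', 'w') as f:     # create a tiny text file
    f.write("Fellow citizens, hello.")

txt = open('sample.txt').read()        # read it back as a single string
print(len(txt))                        # number of characters: 23
print('citizens' in txt)               # substring test, as above: True
```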


## Tokenize text, compile frequency count

In [ ]:
nltk.word_tokenize(wtxt)

In [ ]:
wtokens = nltk.word_tokenize(wtxt)
len(wtokens)     # Number of words in text

In [ ]:
wfreq = nltk.FreqDist(wtokens)
wfreq['citizens']

In [ ]:
len(wfreq)      # Number of unique words in text

In [ ]:
wfreq.most_common(40)     # 40 most common words


## Average sentence length, frequency of long words

In [ ]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!']  # Assuming every sentence ends with ., ! or ?
sentcount

In [ ]:
len(wtokens)/sentcount     # Average sentence length in number of words
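The same arithmetic on a small hand-made token list, to make the calculation concrete (toy data, not the inaugural text):

```python
toks = ['I', 'came', '.', 'I', 'saw', '.', 'I', 'conquered', '!']
sents = toks.count('.') + toks.count('?') + toks.count('!')   # 3 sentences
print(len(toks) / sents)   # 3.0 tokens per sentence (punctuation counted as tokens)
```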

In [ ]:
[w for w in wfreq if len(w) >= 13]       # all 13+ character words

In [ ]:
long = [w for w in wfreq if len(w) >= 13]
for w in long:
    print(w, len(w), wfreq[w])               # long words tend to be less frequent


## What next?

Take a Python course!