George Washington wrote the first of his two inaugural speeches in 1789. Barack Obama wrote his in 2009. How do the intervening 220 years affect their language? We will explore these two historic texts in this homework.
Your job is to complete two Python scripts: (A) textproc.py, and (B) HW3.inaugural.TEMPLATE.py. The former is a Python module that includes many essential functions for text processing, for example getToks() for tokenization. The latter is your main script, and it calls various functions in the textproc.py module to process the two inaugural speech text files. Additionally, you will be answering some questions regarding the two speeches: write them up in a separate document (.txt, MS Word, or a .pdf file).
Download the two speech text files (Mac users: make sure to download them as the source text files) and the two script files, and save them in the same directory. Then follow the instructions below to complete the assignment.
Part A: Complete textproc.py and Learn the Functions
The goal here is two-fold: (1) complete the module, and (2) familiarize yourself with the workings of the various functions so you can comfortably use them in Part B.
Complete the getRelFreq() function, marked with [1].
Try out the main() function by making the edits marked with [2] and running the script. You should be getting this shell output.
Learn how the individual functions work by examining how they are called in the main() function. You may also experiment with the functions immediately following the execution of textproc.py.
Note that the objects talefreq, taletop10 etc. are not available after the script is run, because they are local to the main() function. However, the tale string and the functions are available, so you can rebuild the data objects by tracing the steps in main(). See this shell output for illustration.
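The scoping behavior described above can be seen in a toy script. This is only a sketch: count_tokens() and talecount are hypothetical names made up for illustration, not the actual contents of textproc.py.

```python
# Module-level names stay available in the shell after the script runs.
tale = "It was the best of times, it was the worst of times."

def count_tokens(text):
    """Hypothetical helper: counts whitespace-separated chunks."""
    return len(text.split())

def main():
    # talecount is local to main(); it disappears when main() returns.
    talecount = count_tokens(tale)
    return talecount

main()

try:
    talecount
    visible = True
except NameError:
    visible = False

# The local object can be rebuilt from the module-level pieces.
rebuilt = count_tokens(tale)
print(visible)
print(rebuilt)
```

The same idea applies after running textproc.py: repeat the relevant lines of main() at the shell prompt to recreate whichever object you want to inspect.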
Part B: Complete HW3.inaugural.TEMPLATE.py and Answer Questions
Now you are ready to explore the two speeches and address some linguistically motivated questions. You will find the answers by completing HW3.inaugural.TEMPLATE.py, using the functions in textproc.py. Have your script calculate and write out the relevant data points, and then write down your answers and analysis in a separate document.
Question 1: Initial Impressions
First, take a quick look through the two speeches (Washington, Obama) and form an impression. Are there any differences that are immediately noticeable?
Question 2: Text Length
Whose speech is longer: Washington's or Obama's? How long are the speeches?
The length of a text is measured as the total number of word tokens it contains. Individual symbols also count as tokens.
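As a rough illustration of this definition, the sketch below counts tokens in a one-sentence sample. The simple_tokens() function is a hypothetical regex-based stand-in; the real getToks() in textproc.py may split text differently.

```python
import re

def simple_tokens(text):
    """Hypothetical stand-in for getToks(): lowercased words plus single symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

sample = "We the people, in order to form a more perfect union."
tokens = simple_tokens(sample)

# Text length = number of tokens, with ',' and '.' each counted as one token.
token_count = len(tokens)
print(token_count)
```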
Question 3: Vocabulary Diversity
Whose speech has more diverse vocabulary? Vocabulary diversity can be represented by TTR (Type-Token Ratio). Have your script write out both the type count and the TTR.
Type-Token Ratio can be obtained by dividing the total type count by the total token count.
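A minimal sketch of the calculation on a made-up token list (not taken from either speech). Wrapping one operand in float() is a precaution: under Python 2, dividing two integers would truncate the result to 0.

```python
# Toy token list for illustration only.
tokens = ["the", "people", "and", "the", "government", "of", "the", "people", "."]

# Types are the distinct tokens; a set removes the repeats.
type_count = len(set(tokens))
token_count = len(tokens)

# TTR = type count / token count.
ttr = float(type_count) / token_count
print(type_count)
print(ttr)
```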
Question 4: Sentence Length
Who uses longer sentences -- Washington or Obama? Have your script write out both the sentence count and the average sentence length.
A sentence's length is the number of word tokens (symbols included) in it. You may assume that '.', '!' and '?' always mark the end of a sentence.
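Under that assumption, one possible approach is to count the end-of-sentence tokens and divide the total token count by that number. The sketch below uses a toy token list and assumes the text ends with a sentence-final mark.

```python
# Toy token list for illustration only.
tokens = ["we", "can", ".", "yes", ",", "we", "can", "!", "can", "we", "?"]

END_MARKS = {".", "!", "?"}

# Each end mark closes exactly one sentence.
sentence_count = sum(1 for tok in tokens if tok in END_MARKS)

# Average sentence length in tokens, symbols included.
avg_sentence_len = float(len(tokens)) / sentence_count
print(sentence_count)
print(avg_sentence_len)
```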
Question 5: Word Length
Who uses longer words -- Washington or Obama? What are their average word lengths? Exclude symbols when calculating these numbers. The string method isalnum() returns False for a symbol token:
>>> ':'.isalnum()
False
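Building on the isalnum() check, here is a small sketch of filtering out symbols before averaging. The token list is made up; note that isalnum() also rejects tokens that mix letters with symbols, so how the tokenizer handles forms such as contractions will affect the result.

```python
# Toy token list for illustration only.
tokens = ["fellow", "-", "citizens", ":", "i", "am", "here", "."]

# Keep only tokens made up entirely of letters and digits.
words = [tok for tok in tokens if tok.isalnum()]

# Average word length = total characters / number of words.
avg_word_len = float(sum(len(w) for w in words)) / len(words)
print(words)
print(avg_word_len)
```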
Question 6: Top 20 Washington Words
What are the top 20 most frequent words in Washington's speech? Have your script write them out, in descending order: the word, its frequency count, and its relative frequency, separated by a tab.
The write-out should look something like:
' 2871 0.0811200271248
, 2418 0.0683205244123
the 1642 0.0463946654611
. 988 0.0279159132007
and 872 0.0246383363472
to 729 0.020597875226
a 632 0.0178571428571
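One way to produce lines in this shape is sketched below. It uses collections.Counter from the standard library as a stand-in for the frequency functions in textproc.py, whose exact signatures are not shown here, and it prints the rows instead of writing them to the output file.

```python
from collections import Counter

# Toy token list for illustration only.
tokens = "the cat and the dog and the bird .".split()
counts = Counter(tokens)
total = len(tokens)

rows = []
# most_common(n) returns (word, count) pairs in descending order of count.
for word, count in counts.most_common(2):
    rel_freq = float(count) / total
    rows.append("%s\t%d\t%s" % (word, count, rel_freq))

print("\n".join(rows))
```

For the assignment you would ask for 20 entries and send each row to your output file rather than to the screen.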
Question 7: Top 20 Obama Words
Do the same for Obama's speech. How does this list compare with the one from Washington's?
Question 8: Frequent Words Found in One Speech Only
What are the frequent words that are found in only one speech? That is: what are the frequent words in Washington's speech that do not occur in Obama's at all, and vice versa? Have your script print out the top 10 words in each case, followed by their Washington count and their Obama count, separated by a tab.
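The filtering step can be sketched as follows, again with collections.Counter standing in for the module's frequency dictionaries and with two made-up word sequences in place of the speeches.

```python
from collections import Counter

# Toy word sequences for illustration only.
wash = Counter("the duties of the station and the duties of office".split())
obama = Counter("the work of the nation and the work ahead".split())

# Walk each list from most to least frequent, keeping words absent from the other.
only_wash = [(w, c) for w, c in wash.most_common() if w not in obama]
only_obama = [(w, c) for w, c in obama.most_common() if w not in wash]

# A Counter returns 0 for a missing word, so the other speech's count is always printable.
for word, count in only_wash[:10]:
    print("%s\t%d\t%d" % (word, count, obama[word]))
for word, count in only_obama[:10]:
    print("%s\t%d\t%d" % (word, wash[word], count))
```

If your frequency objects are plain dictionaries instead, use the .get(word, 0) method to obtain the zero count.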
Question 9: Top 20 Favored by Washington
Of those words used by both, which were used by Washington much more than by Obama? Give the top 20 such words along with the degree of preference (calculated as the difference in their relative frequencies in the two texts). Are there any observations you can make?
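A sketch of the preference calculation, using two small dictionaries of invented relative frequencies (these numbers are not from the speeches). Words are restricted to those present in both dictionaries, then ranked by the difference.

```python
# Invented relative frequencies for illustration only.
wash_rel = {"the": 0.08, "of": 0.05, "and": 0.03, "my": 0.02}
obama_rel = {"the": 0.06, "of": 0.02, "and": 0.05, "our": 0.03}

# Only words used in both texts qualify.
shared = set(wash_rel) & set(obama_rel)

# Positive values mean Washington favored the word; negative, Obama.
pref = {w: wash_rel[w] - obama_rel[w] for w in shared}

# Largest positive difference first.
favored_by_wash = sorted(pref, key=pref.get, reverse=True)
for word in favored_by_wash:
    print("%s\t%s" % (word, pref[word]))
```

Sorting the same differences in ascending order would put the words favored by the other speaker at the top.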
Question 10: Top 20 Favored by Obama
Do the same for Obama's speech. Anything noteworthy?
Question 11: Your Own Question
Anything else you want to investigate? Pick your own question and find the answer.
Question 12: Comparison Summary
In your own words, summarize the findings from Q1-Q11. Feel free to interpret the results and provide your own insights.
When you are done, upload the four files:
textproc.py
HW3.inaugural.YOUR-LAST-NAME.py
Your output file HW3.inaugural.OUT.txt
A document (.txt, .docx, or .pdf) containing your answers