Let's practice text processing with NLTK. You will be essentially duplicating what we did with the Gettysburg Address in class today.
- Instead of the Gettysburg Address as input, use O. Henry's The Gift of the Magi, found here.
- The file should be downloaded (DO NOT copy-and-paste!! It is a bad practice to copy-paste text file contents!!) and placed in your usual Python script directory.
Instructions and tips:
- Name your new script file process_gift.py.
- Reference the input text file using its full file and directory path.
- Comment your code! (Required on all assignments going forward.)
- Use the shell as usual, but this text is much longer than the Gettysburg Address, so take care not to flash large objects to the shell.
- Have your script print out the following:
  - How many word tokens there are,
  - How many word types there are (word types are the set of unique words),
  - Top 20 most frequent words and their counts (use a for loop on the .most_common() output),
  - Words that are at least 10 characters long and their counts.
  - [BONUS] Words 10+ characters long that occur at least twice, sorted from most frequent to least
- Format your output exactly like this (produced using tale.txt), down to the verbiage, spaces, blank lines, line breaks and placement of quotes.
- Additionally, save your shell output as a text file for submission, named process_gift_out.txt. This is an example of a saved shell session: note that this is NOT a script. See this FAQ entry for how. It will save your entire IDLE shell history, with warts and errors and all, which is fine (remember: shell-side exploration is encouraged), as long as the last part has your latest script output.
- The input file begins with the title and the author: "The Gift of the Magi by O. Henry". If you want, exclude this part when processing the text, but you are not obligated to do so.
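As a rough illustration of the counts listed above, here is a minimal sketch that runs on a tiny made-up sentence rather than the story itself. It rests on several assumptions: it uses collections.Counter and str.split() as stand-ins so that it runs without any downloads, whereas your script should use the NLTK tokenizer and frequency distribution from class (nltk.FreqDist is a subclass of Counter, so .most_common() behaves the same way). The variable names and print wording are hypothetical and do not reproduce the required output format.

```python
from collections import Counter

# Made-up stand-in text. In your script, read the downloaded file
# instead, using its full file and directory path.
sample = ("the gift was small but the gift was wonderful "
          "and the giver was extraordinarily generous")

# Stand-in tokenizer: plain whitespace split (use NLTK in your script).
tokens = sample.split()

# Counter maps each word type to its count, much like nltk.FreqDist.
freq = Counter(tokens)

# Print sizes and short slices rather than whole objects, so that
# nothing large gets flashed to the shell.
print("Word tokens:", len(tokens))
print("Word types:", len(freq))
print("First five tokens:", tokens[:5])

# Loop over .most_common() output, which is a list of (word, count) pairs.
for word, count in freq.most_common(2):
    print(word, count)

# Words at least 10 characters long, with their counts.
long_words = {w: c for w, c in freq.items() if len(w) >= 10}
print("Long words:", long_words)
```

With the real text, swapping 2 for 20 in .most_common() gives the top-20 list; the BONUS item needs an additional count filter and a sort, which are left to you.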
SUBMIT:
- Upload: the process_gift.py script and the saved shell file process_gift_out.txt.
Remember to include in your scripts a comment line at the very top containing your name, Pitt email and date, e.g.:
# Na-Rae Han, naxxxhan@pitt.edu, September 10, 2024