
Exercise 3: Processing O. Henry

Let's practice text processing with NLTK. You will essentially be repeating what we did with the Gettysburg Address in class today.
  1. Instead of the Gettysburg Address, use O. Henry's The Gift of the Magi as input, found here.
  2. Download the file (DO NOT copy-and-paste!! Copy-pasting text file contents is bad practice!!) and place it in your usual Python script directory.
Instructions and tips:
  • Name your new script file process_gift.py.
  • Reference the input text file using its full file and directory path.
  • Comment your code! (Required on all assignments going forward.)
  • Use the shell as usual, but this text is much longer than the Gettysburg Address, so take care not to print large objects in full to the shell.
  • Have your script print out the following:
    1. How many word tokens there are,
    2. How many word types there are (the word types are the set of unique words),
    3. Top 20 most frequent words and their counts (use a for loop over the .most_common() output),
    4. Words that are at least 10 characters long and their counts.
    5. [BONUS] Words at least 10 characters long that occur at least twice, sorted from most frequent to least frequent.
  • Format your output exactly like this (produced using tale.txt), down to the verbiage, spaces, blank lines, line breaks and placement of quotes.
  • Additionally, save your shell output as a text file for submission, named process_gift_out.txt. This is an example of a saved shell session: note that this is NOT a script. See this FAQ entry for how. It will save your entire IDLE shell history, with warts and errors and all, which is fine (remember: shell-side exploration is encouraged), as long as the last part has your latest script output.
  • The input file begins with the title and the author: "The Gift of the Magi by O. Henry". If you want, exclude this part when processing the text, but you are not obligated to do so.

  • Upload: the process_gift.py script and the saved shell file process_gift_out.txt.

    Remember to include in your scripts a comment line at the very top containing your name, Pitt email and date, e.g.: # Na-Rae Han, naxxxhan@pitt.edu, September 10, 2023
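
To make the token/type distinction and the frequency steps concrete, here is a minimal sketch on a tiny made-up sentence. Everything in it is an assumption for illustration: the sample text, the variable names, the regex used as a stand-in tokenizer, and collections.Counter standing in for the NLTK frequency distribution we used in class (Counter also provides .most_common()). The length cutoff is 5 instead of 10 because the sample is so short, and the print lines do not follow the required output format.

```python
# Illustrative sketch only: a tiny inline sample, NOT the assignment text.
import re
from collections import Counter

sample = "The magi brought gifts. The gifts were wise, and the magi were wise."

# Stand-in tokenizer (assumption): lowercase, then keep runs of letters/apostrophes.
# In your script, use the NLTK tokenizer from class instead.
tokens = re.findall(r"[a-z']+", sample.lower())

# Frequency counts: maps each word type to how many times it occurs.
freq = Counter(tokens)

print("Word tokens:", len(tokens))  # every occurrence counts
print("Word types:", len(freq))     # each distinct word counts once

# Most frequent words: loop over (word, count) pairs from .most_common().
for word, count in freq.most_common(3):
    print(word, count)

# Long words and their counts (cutoff of 5 here; the assignment asks for 10).
long_words = {w: c for w, c in freq.items() if len(w) >= 5}

# Long words occurring at least twice, sorted from most to least frequent.
repeated = sorted(
    ((w, c) for w, c in long_words.items() if c >= 2),
    key=lambda pair: pair[1],
    reverse=True,
)
print(long_words)
print(repeated)
```

Note that len(tokens) and len(freq) are single numbers, so they are safe to print; the token list itself is the kind of large object you should avoid printing in full for the real text.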