In this part, you will be reviewing two spell checkers of your choice.
Select your favorite spell checker and try out at least three misspellings of a word of your choice. Note the correction(s) suggested/made by the spell checker.
In terms of edit distance, how far off can your misspellings be and still have the
correct spelling suggested?
Select your second favorite spell checker and compare the two. For each of
your misspellings, which spell checker ranks the correct spelling higher?
With the second checker, can the misspellings be farther off, or must they be closer, for the correct spelling to still be suggested?
Write up your findings. Be sure to properly document them: include screenshots if they help (but remember that they are not a substitute for your written analysis). Also, there's no need to write a lengthy paper -- 1-2 pages of written text (excluding screenshots and tables) is all I ask.
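If you want to measure how far off each misspelling is rather than eyeball it, the sketch below computes plain Levenshtein edit distance (insertions, deletions, and substitutions each cost 1). The function name and the example words are illustrative choices, not part of the assignment, and the spell checkers you test may well use a different measure internally (for example, one that counts a swap of two adjacent letters as a single edit).

```python
def edit_distance(source, target):
    """Levenshtein distance between two strings.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target.
    """
    # prev[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # turning i characters into an empty string: i deletions
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(
                prev[j] + 1,         # delete s_ch
                curr[j - 1] + 1,     # insert t_ch
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


# Illustrative misspellings of "receive" (pick your own word).
for attempt in ["receeve", "recieve", "rcv"]:
    print(attempt, edit_distance(attempt, "receive"))
```

NLTK also ships an nltk.edit_distance() function, which you can use to check the numbers you get here.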
PART 2: the Bible vs. Jane Austen Novels [25 points]
Are we ready to handle some seriously large texts? Yes we are! In this part, you will process two corpora: the Bible and the Jane Austen novels, using NLTK's text processing functions we reviewed in class.
The text files are included in the Project Gutenberg Selections corpus, available from the NLTK Corpora
page here: https://www.nltk.org/nltk_data/. Download the zipped archive (gutenberg.zip), unzip it, and place
the "gutenberg" directory under your script directory. In this assignment, we will be working with these two groups of text as two corpora:
The King James Version Bible (bible-kjv.txt)
Jane Austen novels (austen-emma.txt, austen-persuasion.txt, austen-sense.txt)
(Note: do NOT use NLTK's corpus loading tools such as PlaintextCorpusReader, which we will learn later. Also: do NOT access the two corpora through the pre-loaded nltk.corpus.gutenberg. The point of this homework is to treat these corpora as you would any random collection of texts you encounter on the Internet.)
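Reading a file without NLTK's corpus tools comes down to Python's built-in open(). Here is a minimal sketch; the helper name read_text is made up for illustration, and the encoding value is an assumption that you may need to change if Python raises a UnicodeDecodeError on one of the files.

```python
import os


def read_text(path, encoding="utf-8"):
    """Return the whole content of a text file as one string.

    The with-statement closes the file automatically, even if
    reading fails part-way through.
    """
    with open(path, encoding=encoding) as f:
        return f.read()


# Assumes the unzipped "gutenberg" directory sits next to this script.
bible_path = os.path.join("gutenberg", "bible-kjv.txt")
if os.path.exists(bible_path):
    bible_text = read_text(bible_path)
    print(len(bible_text), "characters read from", bible_path)
else:
    print("Could not find", bible_path)
```

Building the path with os.path.join() keeps the script working on both Windows and Mac/Linux.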
Your job is to write a script that processes the two corpora for some basic stats:
opens (and later closes) the Bible text file, reads in the string content,
builds a list of individual sentences,
prints out how many sentences there are,
builds a flat tokenized word list and the type list,
prints the token and the type counts of this corpus,
builds a frequency count dictionary of words,
prints the top 50 word types and their counts,
and repeats the above for the Jane Austen corpus.
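The counting steps above can be sketched as follows, assuming you already have the sentence list and the flat token list in hand. The names corpus_stats and print_stats and the toy data are illustrative only, and collections.Counter is used here as a stand-in for whichever frequency tool you prefer (nltk.FreqDist offers the same most_common() method). Note that tokens are counted exactly as they appear, so "The" and "the" are separate types unless you decide to lowercase.

```python
from collections import Counter


def corpus_stats(sents, tokens, top_n=50):
    """Summarize a corpus from its sentence list and flat token list."""
    freq = Counter(tokens)   # word -> count dictionary
    types = sorted(freq)     # each distinct token exactly once
    return {
        "sentences": len(sents),
        "tokens": len(tokens),
        "types": len(types),
        "top": freq.most_common(top_n),
    }


def print_stats(name, stats):
    """Print the counts and the most frequent word types."""
    print(name, "sentences:", stats["sentences"])
    print(name, "tokens:", stats["tokens"], "types:", stats["types"])
    for word, count in stats["top"]:
        print(word, count)


# With NLTK and its sentence tokenizer models installed, the two
# lists would come from the file's string content, e.g.:
#   sents = nltk.sent_tokenize(text)
#   tokens = nltk.word_tokenize(text)

# Toy data so the sketch runs on its own.
toy_sents = ["The cat sat.", "The cat ran."]
toy_tokens = ["The", "cat", "sat", ".", "The", "cat", "ran", "."]
print_stats("Toy", corpus_stats(toy_sents, toy_tokens, top_n=3))
```

Because the function takes plain lists, the same call works unchanged for the Austen corpus once its sentences and tokens are built.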
There are three Austen text files, so how do you extract data from all of them? There are two ways to go:
Read the text content of each file as a string, and then concatenate the three (already pretty big!) strings into one gigantic string. Then, pass it to nltk.sent_tokenize() and nltk.word_tokenize().
Alternatively, you can process each file separately for sentence/word tokens, and then merge the three token lists into one using +. In a different context (say, when you need to compare between Austen novels), this approach is the way to go.
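The two approaches can be sketched side by side as below. This is only an illustration under a few assumptions: the function names are made up, the tokenizer is passed in as an argument (so you can hand in nltk.word_tokenize or nltk.sent_tokenize), and the strings are joined with a newline rather than glued directly together so that the last word of one novel does not run into the first word of the next.

```python
import os

AUSTEN_FILES = ["austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt"]


def tokenize_concatenated(texts, tokenize):
    """Approach 1: join the strings first, then tokenize once."""
    combined = "\n".join(texts)
    return tokenize(combined)


def tokenize_separately(texts, tokenize):
    """Approach 2: tokenize each string, then merge the lists."""
    merged = []
    for text in texts:
        merged += tokenize(text)  # same result as merged = merged + ...
    return merged


# Runs only if the Austen files are where the assignment puts them.
paths = [os.path.join("gutenberg", name) for name in AUSTEN_FILES]
if all(os.path.exists(p) for p in paths):
    import nltk  # needs NLTK's tokenizer models to be downloaded

    texts = []
    for p in paths:
        # The encoding is an assumption; adjust it if reading fails.
        with open(p, encoding="utf-8") as f:
            texts.append(f.read())
    austen_tokens = tokenize_concatenated(texts, nltk.word_tokenize)
    austen_sents = tokenize_concatenated(texts, nltk.sent_tokenize)
    print(len(austen_sents), "sentences,", len(austen_tokens), "tokens")
```

With the second function you can also keep the three per-novel lists around before merging, which is what makes that approach handy for comparing the novels with each other.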
Finally, make one observation about the two corpora. It could involve some new code of your own not included above, or it could be based off of A.--F. above. Have your script print out your observation enclosed in """...""".
These are seriously big text files, so take care not to crash your shell.
Your script file should be named bible_austen.py. I am not providing a template script file here: it is up to you to structure your script any way you see fit. But take care to make it readable and well-structured. And importantly, put comments in your script that explain what a block of code achieves. That helps my grading and your future code maintenance.
Your script will print out to shell: save this shell output as a text file for submission, named bible_austen_out.txt. See this FAQ entry for how.
Upload a word-processed file for PART 1; bible_austen.py (script) and bible_austen_out.txt (your shell output saved as a text file) for PART 2.
Remember to include in your scripts a comment line at the very top containing your name, Pitt email, and date, e.g.: # Na-Rae Han, firstname.lastname@example.org, September 10, 2023