LING 1901 Fundamentals of Text Processing for Linguists, University of Pittsburgh

Homework Assignment #4
Comparative Analysis of Two Corpora: EFL Writing by Bulgarian and Japanese Students

Between Bulgarian and Japanese college students, which group writes English at a more advanced level? Of the many measurements that can serve as indicators of writing quality, we will try our hand at two: (1) average sentence length, and (2) vocabulary level. In this assignment, you will work with 60 English essays written by Japanese and Bulgarian students, excerpted from ICLE2 (the International Corpus of Learner English, version 2).
Your job involves completing two Python scripts: (A) HW4.brown_vocab_rank.py, and (B) HW4.process_ICLE2.py. For (A), you will induce a vocabulary frequency ranking from a word frequency list extracted from the Brown Corpus. For (B), you will process the learner corpus for the two metrics above, using the functions in your textproc.py module. Additionally, you will answer some questions; write up your answers in a separate document (.txt, MS Word, or .pdf).

Part A: Vocabulary Ranking from The Brown Corpus
What is the most common English word? You guessed it: it's the. How about happy? In decreasing order of frequency, what do you think its rank is? How about make and fabricate? The answers are 1,098th (happy), 132nd (make), and 26,158th (fabricate), according to the Brown Corpus, the first million-word electronic corpus, compiled in the 1960s from texts published in 1961. In this corpus, happy occurs 103 times, make 805 times, and fabricate only once.
We will use this corpus as the basis of our English vocabulary ranking. First, download these two files:

brown_freq.txt: This is a pre-compiled list of raw frequency counts extracted from Brown.
HW4.brown_vocab_rank.TEMPLATE.py: This is a template script.
Then, complete the template script to carry out the following steps:

STEP 1
Read in the brown_freq.txt file. Process it to build a frequency dictionary called brown_freq.
When you first read in the frequency counts, they are of the string type. Make sure to convert the numbers to the integer type.
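The reading and conversion in STEP 1 might be sketched as follows. This is only an illustration: it assumes each line of brown_freq.txt holds a word and its count separated by whitespace, which you should check against the actual file, and the helper name parse_freq_lines is made up for this example. The code is written to run under both Python 2 and Python 3.

```python
def parse_freq_lines(lines):
    """Build a {word: count} dictionary from lines such as 'make 805'.

    ASSUMPTION: each non-empty line is '<word> <count>' separated by
    whitespace. Check the real layout of brown_freq.txt before reusing this.
    """
    freq = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue                      # skip blank or malformed lines
        word, count = fields
        freq[word] = int(count)           # counts arrive as strings
    return freq


# Toy input standing in for the file contents (counts quoted in Part A)
sample_lines = ["happy 103\n", "make 805\n", "fabricate 1\n", "\n"]
sample_freq = parse_freq_lines(sample_lines)

# Integer addition, not string concatenation
print(sample_freq["make"] + sample_freq["fabricate"])   # 806
```

With the real file, the lines would come from an open file object rather than a hand-written list.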

STEP 2
Process brown_freq to obtain a dictionary of vocabulary ranks named brown_vocab_rank. To do this, complete a custom function named getRank() according to the given specification.
The following illustrates how a ranking can be induced. Note that 'fox' and 'dog' have the same count, so they should share the same rank. As a result, two words sit at rank #2 and rank #3 is left vacant. The second for loop takes care of that.
>>> foo = {'dog':15, 'cat':24, 'cow':3, 'fox':15, 'pig':1}
>>> foo
{'pig': 1, 'fox': 15, 'dog': 15, 'cow': 3, 'cat': 24}
>>> ordered = sorted(foo.keys(), key=foo.get, reverse=True)
>>> ordered
['cat', 'fox', 'dog', 'cow', 'pig']
>>> range(len(ordered))
[0, 1, 2, 3, 4]
>>> rank = {}
>>> for i in range(len(ordered)):
        rank[ordered[i]] = i+1

>>> rank
{'pig': 5, 'fox': 2, 'dog': 3, 'cow': 4, 'cat': 1}
>>> for i in range(1, len(ordered)):
        if foo[ordered[i]] == foo[ordered[i-1]]:
            rank[ordered[i]] = rank[ordered[i-1]]

>>> rank
{'pig': 5, 'fox': 2, 'dog': 2, 'cow': 4, 'cat': 1}
>>>

>>> for w in sorted(brown_vocab_rank, key=brown_vocab_rank.get)[:30]:
        print w, '\t', brown_vocab_rank[w]

the     1
,       2
.       3
of      4
and     5
'       6
to      7
a       8
in      9
`       10
that    11
is      12
was     13
he      14
for     15
it      16
-       17
with    18
as      19
his     20
on      21
be      22
s       23
i       24
;       25
at      26
by      27
this    28
had     29
?       30
>>>
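The two-pass illustration above can also be folded into a single function. The following is a sketch only, not the required solution: get_rank_sketch is a made-up name, and the specification of getRank() in the template takes precedence wherever the two differ. It assigns ranks in one pass, reusing the previous word's rank whenever the counts tie.

```python
def get_rank_sketch(freq):
    """Return {word: rank}; words with tied counts share the best rank.

    Hypothetical stand-in for getRank(), written for illustration.
    """
    ordered = sorted(freq, key=freq.get, reverse=True)
    rank = {}
    for i, word in enumerate(ordered):
        if i > 0 and freq[word] == freq[ordered[i - 1]]:
            rank[word] = rank[ordered[i - 1]]   # tie: reuse the previous word's rank
        else:
            rank[word] = i + 1                  # otherwise rank = position in the ordering
    return rank


foo = {'dog': 15, 'cat': 24, 'cow': 3, 'fox': 15, 'pig': 1}
foo_rank = get_rank_sketch(foo)
print(sorted(foo_rank.items()))
# [('cat', 1), ('cow', 4), ('dog', 2), ('fox', 2), ('pig', 5)]
```

Because each tied word copies the rank of the word just before it, a run of three or more tied words also ends up sharing a single rank.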

STEP 3
Pickle brown_vocab_rank into a file. Use the binary protocol.
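As a rough sketch of pickling with a binary protocol, the example below writes a toy dictionary and reads it back. The file name and the temporary directory are placeholders chosen for this illustration; use your own file name and location for the real ranking.

```python
import os
import pickle
import tempfile

toy_rank = {'cat': 1, 'dog': 2, 'fox': 2, 'cow': 4, 'pig': 5}

# Hypothetical file name inside a temporary directory
pickle_path = os.path.join(tempfile.mkdtemp(), 'toy_rank.pkl')

# 'wb' opens the file in binary mode; -1 selects the highest
# (binary) pickle protocol available in the running Python.
with open(pickle_path, 'wb') as f:
    pickle.dump(toy_rank, f, -1)

# Reading the file back also requires binary mode ('rb').
with open(pickle_path, 'rb') as f:
    restored = pickle.load(f)

print(restored == toy_rank)   # True
```

The same 'rb' pattern applies when the ranking is unpickled again in Part B.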
Now, answer the following questions about the Brown vocabulary ranking. You should do so by exploring brown_freq and brown_vocab_rank in the IDLE shell immediately after running your script.

Q1: What are the ranks of teacher and student?
Q2: Find a word that fits each of the following rank ranges: 100~200, 500~1000, 3000~5000, and 10000~20000.
Q3: How many word types are found in Brown?
Q4: How many word types occur only once in Brown?
Use a list comprehension on brown_freq.
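For instance, on a toy frequency dictionary (made up for this illustration, not taken from Brown), counting types and filtering with a list comprehension might look like this:

```python
toy_freq = {'dog': 15, 'cat': 24, 'cow': 3, 'fox': 15, 'pig': 1, 'emu': 1}

# Word types: one per dictionary key
n_types = len(toy_freq)

# Types that occur exactly once, collected with a list comprehension
once = [w for w in toy_freq if toy_freq[w] == 1]

print(n_types)        # 6
print(sorted(once))   # ['emu', 'pig']
```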

Q5: Find a legitimate English word that is NOT in the vocabulary ranking. For such unattested words, what would be the suitable rank to give them?
If there are 10,000 word types in the Brown Corpus, any word that does not occur in the corpus should be ranked right behind all types that do, which means it should rank at 10,001.

Q6: What is the average vocabulary rank of the sentence 'I am very tired.'?

>>> brown_vocab_rank['i']
24
>>> brown_vocab_rank['am']
430
>>> brown_vocab_rank['very']
133
>>> brown_vocab_rank['tired']
2370
>>> brown_vocab_rank['.']
3
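The average itself is the sum of the token ranks divided by the number of tokens. A minimal sketch on toy data follows; average_rank is a hypothetical helper, and the toy ranking is not from Brown. Note that the lookups above use lowercase keys such as 'i', so tokens are assumed to be lowercased before lookup.

```python
def average_rank(tokens, rank):
    """Mean rank of tokens; every token must be a key of rank."""
    total = sum(rank[t] for t in tokens)
    return float(total) / len(tokens)     # float() avoids integer division in Python 2


toy_rank = {'cat': 1, 'dog': 2, 'fox': 2, 'cow': 4, 'pig': 5}
toy_sentence = ['cat', 'dog', 'pig', 'cow']

print(average_rank(toy_sentence, toy_rank))   # (1 + 2 + 5 + 4) / 4 = 3.0
```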

Q7: How about 'I am utterly exhausted.' this time?

Part B: Process the Learner Corpus
In this part, process the 60 essay files by Bulgarian and Japanese students to gain some insight into the writing quality of their English. First, download these two files:

ICLE2.zip: This is a zipped archive of the 60 essay files.
HW4.process_ICLE2.TEMPLATE.py: This is a template script file.

Then, complete the template script to accomplish the following steps:

STEP 1
Unpickle the Brown vocabulary ranking as a dictionary named vrank.
STEP 2
Read in the two sets of text and build the two tokenized corpora.
STEP 3
Calculate and print out the average sentence length of the Bulgarian essays and the Japanese essays.
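One possible way to compute an average sentence length is sketched below. It assumes the corpus is a flat list of tokens in which punctuation marks are separate tokens, and that '.', '!' and '?' each close a sentence. Both are assumptions made for this illustration; the functions in your textproc.py may count sentences or tokens differently, and the helper name is made up.

```python
SENTENCE_END = {'.', '!', '?'}   # ASSUMPTION: these tokens close a sentence


def average_sentence_length(tokens):
    """Tokens per sentence, counting each sentence-final token as a boundary."""
    n_sentences = sum(1 for t in tokens if t in SENTENCE_END)
    if n_sentences == 0:
        return float(len(tokens))         # treat unpunctuated text as one sentence
    return float(len(tokens)) / n_sentences


toy_toks = ['it', 'was', 'a', 'long', 'day', '.', 'i', 'am', 'very', 'tired', '.']
print(average_sentence_length(toy_toks))   # 11 tokens / 2 sentences = 5.5
```

This version counts punctuation marks as tokens. Whichever convention you choose, apply the same one to both essay sets so that the two averages are comparable.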
STEP 4
Transform the two token lists into lists of vocabulary ranks. Any word that is not in the Brown ranking dictionary should receive notfoundrank as its rank; you already determined the value of notfoundrank in Q5 above.
The following is an example of a very short corpus (toks) and its corresponding vocabulary ranks (ranks).
>>> toks
['it', 'was', 'a', 'long', 'day', '.', 'i', 'am', 'very', 'tired', '.']
>>> vrank['it']
16
>>> vrank['was']
13
>>> ranks
[16, 13, 8, 120, 139, 3, 24, 430, 133, 2370, 3]
>>>

>>> foo = {'pig': 5, 'fox': 2, 'dog': 2, 'cow': 4, 'cat': 1}
>>> notfoundrank = 1000000000000
>>> wds = ['dog', 'cow', 'horse']
>>> for w in wds:
        if w in foo:        # w is in the dictionary foo
            print w, foo[w]
        else:               # w is not in foo
            print w, notfoundrank

dog 2
cow 4
horse 1000000000000
>>>
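The if/else test above can also be written more compactly with dict.get, which returns a default value when the key is missing. The sketch below uses toy data, with names prefixed toy_ invented for the illustration. It sets notfoundrank following the reasoning given under Q5: one more than the number of word types in the ranking.

```python
toy_vrank = {'cat': 1, 'dog': 2, 'fox': 2, 'cow': 4, 'pig': 5}

# Following Q5: an unattested word ranks right behind every attested type.
notfoundrank = len(toy_vrank) + 1

toy_toks = ['dog', 'cow', 'horse']

# dict.get(key, default) returns the default when the key is missing,
# which folds the if/else test into a single expression.
toy_ranks = [toy_vrank.get(w, notfoundrank) for w in toy_toks]

print(toy_ranks)   # [2, 4, 6]
```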

STEP 5
Calculate and print out the average vocabulary ranks for the two sets of essays.
Now, answer the following questions about the two corpora.

Q8: What is the average sentence length of the Bulgarian essays? How about the Japanese essays?
Q9: What is the average vocabulary rank of the Bulgarian essays? How about the Japanese essays?
Q10: From the two questions above, what would you conclude about the English writing levels of the two student groups?

When you are done, upload these four files:

HW4.brown_vocab_rank.YOUR-LAST-NAME.py
HW4.process_ICLE2.YOUR-LAST-NAME.py
Your pickle file containing the Brown vocabulary ranking
A document (.txt, .docx, or .pdf) containing your answers