Homework Assignment #4

Comparative Analysis of Two Corpora: EFL Writing by Bulgarian and Japanese Students
Between Bulgarian and Japanese college students, which group writes English at a more advanced level? Of the many measurements that can serve as indicators of writing quality, we will try our hand at two metrics: (1) the average sentence length, and (2) the level of their vocabulary. In this assignment, you will work with 60 English essays written by Japanese and Bulgarian students, excerpted from the ICLE2 (International Corpus of Learner English v2) corpus.

Your job involves completing two Python scripts: (A) HW4.brown_vocab_rank.py, and (B) HW4.process_ICLE2.py. For (A), you will induce a vocabulary frequency ranking from a word frequency list extracted from the Brown Corpus. For (B), you will process the learner corpus for the two metrics above, using the functions in your textproc.py module. Additionally, you will answer some questions: write your answers up in a separate document (.txt, MS Word, or .pdf).

Part A: Vocabulary Ranking from The Brown Corpus

What is the most common English word? You guessed it: it's 'the'. How about 'happy'? In decreasing order of frequency, what do you think its rank is? How about 'make' and 'fabricate'? The answers are: 1,098th ('happy'), 132nd ('make'), and 26,158th ('fabricate'), according to the Brown Corpus, the very first electronic corpus of English, consisting of about 1 million words and compiled in the 1960s. In this corpus, 'happy' occurred 103 times, 'make' 805 times, and 'fabricate' only once.

We will use this corpus as the basis of our English vocabulary ranking. First, download these two files:

Then, complete the template script to carry out the following steps:

  • STEP 1
    Read in the brown_freq.txt file. Process it to build a frequency dictionary called brown_freq.
  • STEP 2
    Process brown_freq to obtain a dictionary of vocabulary ranks named brown_vocab_rank. To do this, complete a custom function named getRank() according to the given specification.
  • STEP 3
    Pickle brown_vocab_rank into a file. Use the binary protocol.
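
The three steps above can be sketched roughly as follows. This is only an illustration, not the template's solution: the "word count" line layout of the input, the function name rank_by_frequency (a hypothetical stand-in for getRank()), the alphabetical tie-breaking, and the pickle file name are all assumptions. The three counts come from the introduction above, so the toy ranks printed here are not the real Brown ranks.

```python
import io
import os
import pickle
import tempfile

# Assumed layout: one "word count" pair per line. The real brown_freq.txt
# may be formatted differently, so adjust the parsing to match it.
sample = io.StringIO("happy 103\nmake 805\nfabricate 1\n")

# STEP 1: build the frequency dictionary.
brown_freq = {}
for line in sample:
    word, count = line.split()
    brown_freq[word] = int(count)

def rank_by_frequency(freq):
    """Map each word to its 1-based rank in decreasing order of frequency.

    Hypothetical stand-in for getRank(). Ties are broken alphabetically
    here, which the template's specification may or may not call for.
    """
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

# STEP 2: derive the ranking from the frequencies.
brown_vocab_rank = rank_by_frequency(brown_freq)

# STEP 3: pickle the ranking. The file is opened in binary mode ("wb").
# A temporary directory is used only to keep this sketch self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "brown_vocab_rank.pkl")  # assumed file name
    with open(path, "wb") as f:
        pickle.dump(brown_vocab_rank, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(brown_vocab_rank)
```

With only three words, 'make' comes out first, 'happy' second, and 'fabricate' third. How to rank words that share the same count is a design choice, so check the getRank() specification in the template before settling on one.
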
Now, answer the following questions about the Brown vocabulary ranking. You should do so by exploring brown_freq and brown_vocab_rank in the IDLE shell immediately after running your script.
  • Q1: What are the ranks of 'teacher' and 'student'?
  • Q2: Find a word that fits each of the following rank ranges: 100~200, 500~1000, 3000~5000, and 10000~20000.
  • Q3: How many word types are found in Brown?
  • Q4: How many word types occur only once in Brown?
  • Q5: Find a legitimate English word that is NOT in the vocabulary ranking. For such unattested words, what would be the suitable rank to give them?
  • Q6: What is the average vocabulary rank of the sentence 'I am very tired.'?
  • Q7: How about 'I am utterly exhausted.' this time?
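
For Q6 and Q7, the average vocabulary rank of a sentence is the mean of the ranks of its words. The sketch below shows the arithmetic on a toy ranking with made-up values. The function name average_rank, the lowercasing, the dropping of punctuation, and the placeholder value for unattested words are all assumptions, so the real answers have to come from your own brown_vocab_rank.

```python
import re

# Toy ranking with made-up values, for illustration only.
toy_rank = {"i": 1, "am": 2, "very": 3, "tired": 4}
NOT_FOUND_RANK = 10  # hypothetical placeholder for words missing from the ranking

def average_rank(sentence, rank, not_found_rank):
    """Return the mean rank of the words in a sentence (hypothetical helper).

    Words are lowercased and punctuation is dropped, which is an assumption
    about how the ranking dictionary was built.
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    ranks = [rank.get(tok, not_found_rank) for tok in tokens]
    return sum(ranks) / len(ranks)

# (1 + 2 + 3 + 4) / 4
avg_tired = average_rank("I am very tired.", toy_rank, NOT_FOUND_RANK)
# (1 + 2 + 10 + 10) / 4, since 'utterly' and 'exhausted' are not in the toy ranking
avg_exhausted = average_rank("I am utterly exhausted.", toy_rank, NOT_FOUND_RANK)

print(avg_tired, avg_exhausted)
```

Rarer words have larger rank numbers, so a sentence with less common vocabulary gets a higher average.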

Part B: Process the Learner Corpus

In this part, process the 60 essay files by Bulgarian and Japanese students to gain some insight into the writing quality of their English. First, download these two files:

Then, complete the template script to accomplish the following steps:

  • STEP 1
    Unpickle the Brown vocabulary ranking as a dictionary named vrank.
  • STEP 2
    Read in the two sets of text and build the two tokenized corpora.
  • STEP 3
    Calculate and print out the average sentence length of the Bulgarian essays and the Japanese essays.
  • STEP 4
    Transform the two token lists into lists of vocabulary ranks. For any word that is not in the Brown ranking dictionary, assign notfoundrank as its rank; you already determined its value in Q5 above.
  • STEP 5
    Calculate and print out the average vocabulary ranks for the two sets of essays.
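
STEPS 3 through 5 boil down to two averages, which can be sketched as below. Your script should use the functions in your textproc.py module. The regex-based helpers here (split_sentences, tokenize, average_sentence_length, to_ranks) are hypothetical substitutes so the example runs on its own, and the essays, the ranking, and the notfoundrank value are made-up toy data.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def average_sentence_length(texts):
    """Return the mean number of tokens per sentence across all texts."""
    sentences = [s for t in texts for s in split_sentences(t)]
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    return total_tokens / len(sentences)

def to_ranks(tokens, vrank, notfoundrank):
    """Replace each token with its rank, falling back to notfoundrank."""
    return [vrank.get(tok, notfoundrank) for tok in tokens]

# Toy stand-ins for one group's essays and for the unpickled ranking.
essays = ["I like dogs. They are fun!", "Cats sleep."]
vrank = {"i": 1, "like": 2, "dogs": 3, "they": 4, "are": 5}
notfoundrank = 100  # made-up value; use the one you chose in Q5

# STEP 3: 8 tokens over 3 sentences.
avg_len = average_sentence_length(essays)

# STEPS 4 and 5: ranks are 1, 2, 3, 4, 5, 100, 100, 100.
tokens = [tok for essay in essays for tok in tokenize(essay)]
ranks = to_ranks(tokens, vrank, notfoundrank)
avg_rank = sum(ranks) / len(ranks)

print(avg_len, avg_rank)
```

Note how strongly the three unattested words pull the average rank upward. The value chosen for notfoundrank therefore has a direct effect on the comparison in Q9.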
Now, answer the following questions about the two corpora.
  • Q8: What is the average sentence length of the Bulgarian essays? How about the Japanese essays?
  • Q9: What is the average vocabulary rank of the Bulgarian essays? How about the Japanese essays?
  • Q10: From the two questions above, what would you conclude about the English writing levels of the two student groups?

When you are done, upload these four files:

  • HW4.brown_vocab_rank.YOUR-LAST-NAME.py
  • HW4.process_ICLE2.YOUR-LAST-NAME.py
  • Your pickle file containing the Brown vocabulary ranking
  • A document (.txt, .docx, or .pdf) containing your answers