Homework Assignment #4
Comparative Analysis of Two Corpora: EFL Writing by Bulgarian and Japanese Students
Between Bulgarian and Japanese college students, which group writes English at a more advanced level? Among the many measures that can serve as indicators of writing quality, we will try our hand at two: (1) average sentence length, and (2) vocabulary level. In this assignment, you will work with 60 English essays written by Japanese and Bulgarian students, excerpted from ICLE2 (the International Corpus of Learner English, v2).
Your job involves completing two Python scripts: (A) HW4.brown_vocab_rank.py, and (B) HW4.process_ICLE2.py. For (A), you will induce a vocabulary frequency ranking from a word frequency list extracted from the Brown Corpus. For (B), you will process the learner corpus for the two metrics above, using the functions in your textproc.py module. Additionally, you will answer some questions: write up your answers in a separate document (.txt, MS Word, or .pdf file).
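As a rough illustration of metric (1), here is a minimal sketch of an average-sentence-length computation. It does not use textproc.py: the function name average_sentence_length, the naive rule that a sentence ends at ., ! or ? followed by whitespace, and the word pattern are all assumptions made for this example, and your own module may define sentences and words differently.

```python
import re

def average_sentence_length(text):
    """Return the mean number of word tokens per sentence (naive splitting)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    # Assumption: a word is a run of letters, optionally with one internal apostrophe.
    word_counts = [len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", s)) for s in sentences]
    return sum(word_counts) / len(sentences)

sample = "I like English. It is hard, but I study every day!"
print(average_sentence_length(sample))
```

A splitter this simple will miscount around abbreviations such as "e.g." or "Mr.", so treat its output as an approximation.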
Part A: Vocabulary Ranking from the Brown Corpus
What is the most common English word? You guessed it: it's the. How about happy? In decreasing order of frequency, what do you think its rank is? How about make and fabricate? The answers are: 1,098th (happy), 132nd (make), and 26,158th (fabricate), according to the Brown Corpus, the very first electronic corpus, consisting of 1 million words and compiled in the 1960s. In this corpus, happy occurred 103 times, make 805 times, and fabricate only once.
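To make the idea of a frequency rank concrete, the sketch below turns a word-to-count mapping into a word-to-rank mapping. The function name frequency_ranks and the alphabetical tie-breaking are assumptions for illustration; the toy dictionary holds only the three counts quoted above, so the ranks it produces are ranks within that toy list, not the Brown Corpus ranks.

```python
def frequency_ranks(freq):
    """Map each word to its 1-based rank in decreasing order of frequency."""
    # Assumption: ties are broken alphabetically so the result is deterministic.
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

# Toy input: the three counts quoted in the text, not the full frequency list.
toy_freq = {"happy": 103, "make": 805, "fabricate": 1}
ranks = frequency_ranks(toy_freq)
print(ranks)
```

Words that occur equally often (for instance, the many words that occur only once) could also be given a shared rank; which convention to follow depends on the assignment template.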
We will use this corpus as the basis of our English vocabulary ranking. First, download these two files:
Part B: Process the Learner Corpus
In this part, process the 60 essay files by Bulgarian and Japanese students to gain some insight into the writing quality of their English. First, download these two files:
Then, complete the template script to accomplish the following steps:
When you are done, upload these four files: