Homework Assignment 3: Bulgarian vs. Japanese EFL Writing
Comparative Analysis of Two Learner Corpora
Between Bulgarian and Japanese college students, which group writes English at a more advanced level? Let's explore the question with real data: 60 English essays written by Japanese and Bulgarian students, excerpted from ICLE2 (International Corpus of Learner English v2). Among the many measurements that can serve as indicators of writing quality, we will try our hand at these metrics:
Syntactic complexity: Are the sentences simple and short, or are they complex and long? → Can be measured through average sentence length
Lexical diversity: Are there more of the same words repeated throughout, or do the essays feature more diverse vocabulary?
→ Can be measured through type-token ratio (with caveat!)
Vocabulary level: Do the essays use more of the common, everyday words, or do they use more sophisticated and technical words?
→ a. Can be measured through average word length (common words tend to be shorter), and
→ b. Can be measured against published lists of the most frequent English words
You will notice that items 1, 2 and 3a can be measured corpus-internally. Item 3b will have to rely on an external, pre-compiled list of common English words. We have such a resource handy: Peter Norvig's list of the 1/3 million most frequent words and their counts will do nicely.
First, a bit of background. In SLA (second language acquisition) literature, vocabulary levels are commonly grouped into frequency bands. For example, Lextutor's VocabProfile identifies each word in a submitted passage as a 1k type, a 2k type, or an off-list type, the idea being that 1k words are among the top 1,000 most frequent English words, 2k words belong to the next 1,000 most frequent, and so on. Rather than talking about a particular word being ranked, say, 1,302nd in frequency, we can talk about it being in the 2k band, which holds a certain intuitive appeal.
PART 1: Building the Vocabulary Band Dictionary
So: let's get to it. Starting from your Google list of (word, count) tuples (we called it goog1w_rank in class), build a new dictionary named goog_kband, where each key is a word and its value is the word's k-band. Specifics:
Ranks 1-1,000 should have the k-band value of 1, and
ranks 1,001-2,000 should have the k-band value of 2, etc.
We will, however, limit ourselves to 20 such bands: all words beyond the rank of 20,000 should be excluded from this dictionary.
Here is sample Python shell code. Don't be shy about consulting it, but do give it a try on your own first:
First, remember that Python indexing starts at 0: a rank of 1,000 means index 999. The k-band value can then be calculated this way:
>>> goog_kband = {} # initialize an empty dict
>>> for i in range(20000):
        (word, count) = goog1w_rank[i]
        kband = int(i/1000 + 1)
        goog_kband[word] = kband
>>> goog1w_rank[999] # 1000th, should be kband 1
('entry', 80717798)
>>> goog1w_rank[1000] # 1001st, should be kband 2
('stay', 80694073)
>>> goog_kband['entry'] # yep
1
>>> goog_kband['stay'] # and yep
2
>>> len(goog_kband) # there are 20k of them... good
20000
>>> goog1w_rank[19999] # the last one at rank 20000. what a strange word...
('bizjournalshire', 1602219)
>>> goog_kband['bizjournalshire'] # and it is indeed kband 20.
20
When it is ready, explore it along with the two original Google data objects (goog1w_rank and goog1w_fd) in the IDLE shell, and work through the questions below.
What are the ranks of teacher and student? What are their k-bands?
Unfortunately, none of the three data objects is optimized for rank lookup. One thing you can do:
>>> foo = list('tiger')
>>> foo
['t', 'i', 'g', 'e', 'r']
>>> list(enumerate(foo)) # enumerate() returns (index, item) tuples
[(0, 't'), (1, 'i'), (2, 'g'), (3, 'e'), (4, 'r')]
>>> for (index, (word, count)) in enumerate(goog1w_rank[:5]): # trying out on first 5
        print(index + 1, word, count)
1 the 23135851162
2 of 13151942776
3 and 12997637966
4 to 12136980858
5 a 9081174698
>>> for (index, (word, count)) in enumerate(goog1w_rank):
        if word == 'linguistics':
                print(index + 1, word, count)
11453 linguistics 4077716
>>>
But that's pretty cumbersome. A better solution involves creating yet another data object, a dictionary, which maps a word to its frequency ranking, like so:
>>> goog1w_rankdict = dict()
>>> for (index, (word, count)) in enumerate(goog1w_rank):
        rank = index + 1
        goog1w_rankdict[word] = rank
>>> goog1w_rankdict['and'] # can now look up a word's rank directly
3
>>> goog1w_rankdict['linguistics']
11453
>>>
Find a word that fits each of the 20 k-bands. Do their bands align with your own intuition?
What are some examples of English words not found in the top 20k range?
What is the average vocabulary band of the sentence 'I am very tired'?
How about 'I am utterly exhausted' this time?
When you are done, pickle goog_kband so it can be used in PART 2. Then, save your IDLE shell session as hw3_vocab_band_shell.txt. Open it up in your text editor, clean up messy parts, and then add your answers to the questions above to accompany your relevant code bits.
PART 2: Bulgarian & Japanese Learner Corpora [45 points]
We are now ready to get up close with the 60 essay files by Bulgarian and Japanese students. Download the template script and the zipped archive of the corpus:
ICLE2_bu_ja.zip: This is a zipped archive of the 60 essay files. It is posted on the HW3 submission link on Canvas.
A note about this homework: you will essentially be composing two documents: (1) a word-processed file that serves as a written report on the whole investigation, and (2) a Python script. Because of this setup, you should not include your observations in your script. Instead, write them up in the word-processed document while referencing findings from the code output.
You may adopt proper data-science methods for this part:
Submit a Jupyter Notebook file (.ipynb). Follow the structure of the template, and add your written analysis as markdown cells.
Using the pandas library, compute per-essay stats and then aggregate (a sketch follows this list). Maybe even some nifty plots.
These are COMPLETELY OPTIONAL of course, only if you have time and interest. I promise I won't judge you for choosing to stick to what's required; this is not the only class you're taking after all!
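If you do go the pandas route, here is a minimal sketch of the per-essay-then-aggregate idea. It assumes bu is a PlaintextCorpusReader loaded as in step [B] below; the column names are my own choice:

import pandas as pd

rows = []
for fid in bu.fileids():
    n_toks = len(bu.words(fid))
    n_sents = len(bu.sents(fid))
    rows.append({'essay': fid, 'tokens': n_toks,
                 'sents': n_sents, 'avg_sent_len': n_toks / n_sents})
bu_df = pd.DataFrame(rows)
print(bu_df.describe())                        # aggregate stats over the per-essay rows
bu_df.plot.bar(x='essay', y='avg_sent_len')    # a nifty plot (requires matplotlib)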
First order of business: take an old-fashioned look at some of the student essays, using your favorite text file reader. What are your first impressions? Write them down in your report document.
Next, it's Python time. The script is structured as follows:
[A] Preparation: import libraries and unpickle data files.
[B] Load the two corpora using NLTK's PlaintextCorpusReader. Print out some basic specs.
The Bulgarian text files all start with 'B', and the Japanese ones with 'J'. They all end in 'txt'. Your corpus-reading patterns should therefore be 'B.*txt' and 'J.*txt'. (A sketch follows this list.)
[C] Build the usual data objects, based on all lower-case tokens.
[D] Compute measurements for writing quality (more below), and print out the results.
[E] Print out unigram and bigram frequencies (more below).
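For steps [B] and [C], a minimal sketch of what the loading and token-building code might look like. The corpus_root path is a placeholder for wherever you unzipped the archive; bu, ja, bu_toks and ja_toks follow the naming used elsewhere in this handout:

from nltk.corpus import PlaintextCorpusReader

corpus_root = 'ICLE2_bu_ja'                          # adjust to your unzipped folder
bu = PlaintextCorpusReader(corpus_root, 'B.*txt')    # the Bulgarian essays
ja = PlaintextCorpusReader(corpus_root, 'J.*txt')    # the Japanese essays
print(len(bu.fileids()), 'Bulgarian files;', len(ja.fileids()), 'Japanese files')

bu_toks = [t.lower() for t in bu.words()]            # all lower-case tokens, as in [C]
ja_toks = [t.lower() for t in ja.words()]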
Here are more details on [D], the various measurements intended to help us assess the writing quality of the two learner groups.
Average essay length. What is the average length of the Bulgarian essays? How about the Japanese ones?
Don't do this the hard way! You don't need per-essay calculations at all. Suppose your corpus has 1,000 words and contains 8 essays. Then the average essay length is 1,000 divided by 8 = 125 words.
Average sentence length. What is the average sentence length of the Bulgarian writings? How about the Japanese?
Again, don't do it the hard way! You don't need per-sentence calculations. Suppose your corpus has 1,000 words and a total of 80 sentences. Then your average sentence length is just 1,000 divided by 80 = 12.5 words, as simple as that.
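In code, both of these averages boil down to a single division each. A sketch for the Bulgarian corpus, assuming the bu reader from the sketch above (repeat for ja):

avg_essay_len = len(bu.words()) / len(bu.fileids())   # total tokens / number of essays
avg_sent_len = len(bu.words()) / len(bu.sents())      # total tokens / number of sentences
print('Bulgarian: %.1f words per essay, %.1f words per sentence'
      % (avg_essay_len, avg_sent_len))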
Lexical diversity. Which group uses more diverse vocabulary? Find the type-token ratio of the two corpora.
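A sketch, using the lower-cased token lists from [C]. (Remember the caveat from earlier: TTR is sensitive to corpus size, so report the token counts alongside the ratios.)

bu_ttr = len(set(bu_toks)) / len(bu_toks)   # types over tokens
ja_ttr = len(set(ja_toks)) / len(ja_toks)
print('Bulgarian TTR: %.4f over %d tokens' % (bu_ttr, len(bu_toks)))
print('Japanese TTR: %.4f over %d tokens' % (ja_ttr, len(ja_toks)))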
Average word length. Which group uses longer words -- Bulgarian or Japanese? Find the answer through the average word-token length, measured in number of characters. Exclude tokens that are symbols or punctuation from your calculation. Note that this should be calculated over tokens, not types.
How to exclude symbols and punctuation from tokens? A handy method here is .isalnum() (is alpha-numeric):
>>> foo = nltk.word_tokenize('Hello, world!')
>>> foo
['Hello', ',', 'world', '!']
>>> [t for t in foo if t.isalnum()]
['Hello', 'world']
>>> bu_toks_nosym = [t for t in bu_toks if t.isalnum()]
>>> len(bu_toks)
17326
>>> len(bu_toks_nosym)
15409 # lost about 2k tokens
>>> set(bu_toks).difference(set(bu_toks_nosym)) # what's lost
{"'", '"', '",', '".', '........', '!', ').', '.,', '/', '(', ',,', '),',
'.....', ')', '.', '..', '--', '..."', '-', '...', "'.", ':', '."', ';',
',', '[', '?', '?"', '%'}
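With the symbol-free token list in hand, the average token length is then a couple of lines. A sketch:

bu_toks_nosym = [t for t in bu_toks if t.isalnum()]   # symbols and punctuation dropped
bu_avg_wlen = sum(len(t) for t in bu_toks_nosym) / len(bu_toks_nosym)
print('Bulgarian average word length: %.3f characters' % bu_avg_wlen)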
Average vocabulary band. Compute, for each group, the average vocabulary band of the words used.
Again, we need to calculate this on a per-token basis, not per-type: 'I am a platypus, I really really am' has a lower average vocabulary band than 'I am really a platypus', even though the two sentences share exactly the same word types.
But crucially, exclude from the calculation any words that are not found in the 20 bands. That is, if a text consists of 6 tokens whose bands are [2, 8, 13, 8, not-in, 17], then the average k-band should be calculated as (2+8+13+8+17) divided by 5 (= 9.6), not by 6. Think about it: if we divided by 6, the "not-in" word would in effect be given a vocabulary band of 0, which is not right. Essentially, we are treating these out-of-band words as if they were not there at all.
You may ask: why exclude 21+ band words at all? The answer is that 21+ band words in learner-produced texts are more likely to be misspellings, personal names, and other oddities than genuinely advanced vocabulary.
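Putting the two rules together (per-token, with out-of-band words skipped), here is a sketch using the goog_kband dictionary you pickled in PART 1 and unpickle in [A]. Tokens not found in goog_kband, punctuation included, simply never enter the list:

bu_bands = [goog_kband[t] for t in bu_toks if t in goog_kband]   # out-of-band tokens skipped
bu_avg_band = sum(bu_bands) / len(bu_bands)   # divide by len(bu_bands), NOT len(bu_toks)
print('Bulgarian average vocabulary band: %.3f' % bu_avg_band)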
% of 11+ band word types. Of all word types found in each corpus, what % comes from bands 11-20? What are some example words?
Again, we do not count words from bands 21+, but the base of the division this time should still be *all* types. So, if a corpus had 150 word types where 75 are from bands 1-10, 45 from bands 11-20, and 30 are out-of-band, then the % is calculated as 45/150 * 100 = 30%.
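A sketch; here I take "all word types" to be the types of the symbol-free token list, and use .get() with a default of 0 so that out-of-band words fail the 11-or-above test:

bu_types = set(bu_toks_nosym)                                  # the base: ALL word types
bu_hi = [w for w in bu_types if goog_kband.get(w, 0) >= 11]    # bands 11-20 only
print('Bulgarian %% of 11+ band types: %.2f' % (len(bu_hi) / len(bu_types) * 100))
print('some examples:', sorted(bu_hi)[:10])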
Now for more details on [E], the unigram and bigram frequencies. Address the following (a code sketch for printing the frequency lists follows):
Compare the top frequencies across the two learner groups. Are there any noticeable differences in the overall rankings and/or make-up? What could these differences suggest about the groups' writing levels?
Beyond comparing the top most frequent n-grams, how else could you use n-gram statistics for the purpose of assessing EFL/ESL writing quality? Could large-scale, native-corpus-sourced n-gram frequency lists such as the Norvig/Google bigram lists and the COCA n-gram lists be useful, and in what way?
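For the top-frequency comparison, NLTK's FreqDist does the heavy lifting. A sketch (bu_unifd may well already exist among your "usual data objects" from [C]):

import nltk

bu_unifd = nltk.FreqDist(bu_toks)                  # unigram frequencies
bu_bifd = nltk.FreqDist(nltk.bigrams(bu_toks))     # bigram frequencies
print('Bulgarian top 20 unigrams:', bu_unifd.most_common(20))
print('Bulgarian top 20 bigrams:', bu_bifd.most_common(20))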
In your written report, citing these findings (cite, don't just paste in screenshots!), compose a comparison summary of the English writing quality of the Bulgarian and Japanese college students. Include your assessment of how well these metrics capture the two groups' writing levels.
SUBMIT:
PART 1: hw3_vocab_band_shell.txt and goog_kband.pkl
PART 2: (1) Your word-processed document containing a written report, (2) ICLE_efl_writing.py, (3) ICLE_efl_writing.OUT.txt (script output saved as a .txt file).