Homework 2: Process the ETS Corpus

This homework carries a total of 50 points (revised from 60; see the note below).
(9/18/2017) NEW: I am essentially breaking up this homework into two submissions. For Thursday 9/21, submit a DRAFT version (50 points). Code should be working but can be in rough shape, likewise with the presentation component. For Tuesday 9/26, submit a final, polished version of the homework as To-Do #6.
For this homework, we will explore and process the ETS Corpus of Non-Native Written English, using Python's numpy and pandas libraries, which we recently learned.

Data set

  • The corpus was published by the LDC (Linguistic Data Consortium). I have acquired a license on behalf of all Pitt faculty and students. I am distributing the data through the privately held "Licensed-Data-Sets" repo of our class GitHub organization.
  • If you haven't already, clone the repo to create your local copy. There's no need to fork.
  • You should keep the data files in their original directory location and in their original form.

Repo, files and folders

  1. Start by forking this HW2-Repo GitHub repository. Your fork will automatically be set to a private repo.
  2. Clone it onto your laptop. Unlike with the "Licensed-Data-Sets" repo, you SHOULD NOT DIRECTLY CLONE MY ORIGINAL REPO. FORK FIRST, and then CLONE YOUR FORK.
  3. In the directory, you will find a folder with your name, say narae/. Your code, as a Jupyter notebook file (xxx.ipynb), should go into that directory. *After* you have created your Jupyter notebook file, you can go ahead and delete the dummy file your_script_here.txt. (Remember to use git rm.)
  4. You can name the notebook file whatever you want, but make sure to use underscores (_) instead of spaces.
  5. As noted in the section above, your corpus data files should remain in their original directory location. Do not move or copy them into this repo.
  6. The repo is already configured, via the .gitignore file in the root, to ignore your Jupyter checkpoint directory .ipynb_checkpoints.
  7. While working on your code, you should be frequently committing and pushing to your fork. I haven't been enforcing this, but now that we are more comfortable with git, I will be sure to check it this time.

Goals

The first goal of this work is Basic Data Processing, which involves processing CSV files and text files. You should achieve the following (a rough code sketch of steps 1-5 appears after this list):
  1. Start with index.csv and build a DataFrame named ets_df.
  2. Process the three additional CSV files, which split the data into training, testing, and development sets.
  3. Augment the ets_df DataFrame with an additional column named 'Split' containing three appropriate values: 'TS' for testing, 'TR' for training, and 'DV' for development.
  4. Make sure your ets_df meets the following specifications:
    • Use 'Filename' as the row index.
    • Have 'L1' as a column name instead of 'Language'; 'L1' is the more specific term.
  5. Additionally, produce two dictionary-type objects:
    • prompts, which has 'P1', 'P2', ... 'P8' as keys and the prompt text strings as values,
    • responses, which has response file names '88.txt', '278.txt', etc. as keys and tokenized word lists as values.
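To make these steps concrete, here is a minimal pandas/NLTK sketch. The split CSV file names, the text-file paths, and the assumption that every CSV shares a 'Filename' column are guesses about the corpus layout, so adjust them to the actual files:

```python
import glob
import os

import nltk
import pandas as pd

# nltk.download('punkt')  # word_tokenize needs the punkt model once

# Step 1: read the master index.
ets_df = pd.read_csv('index.csv')

# Steps 2-3: mark each row's split using the three additional CSVs.
# These file names are hypothetical.
splits = {'TR': 'index-training.csv',
          'TS': 'index-testing.csv',
          'DV': 'index-development.csv'}
for code, fname in splits.items():
    split_df = pd.read_csv(fname)
    ets_df.loc[ets_df['Filename'].isin(split_df['Filename']), 'Split'] = code

# Step 4: 'Filename' as the row index, 'Language' renamed to 'L1'.
ets_df = ets_df.set_index('Filename').rename(columns={'Language': 'L1'})

# Step 5a -- prompts: 'P1' ... 'P8' -> prompt text string.
prompts = {}
for path in glob.glob('data/text/prompts/P*.txt'):       # hypothetical path
    key = os.path.splitext(os.path.basename(path))[0]    # e.g., 'P1'
    with open(path, encoding='utf-8') as f:
        prompts[key] = f.read()

# Step 5b -- responses: '88.txt', '278.txt', ... -> tokenized word list.
responses = {}
for path in glob.glob('data/text/responses/original/*.txt'):  # hypothetical
    with open(path, encoding='utf-8') as f:
        responses[os.path.basename(path)] = nltk.word_tokenize(f.read())
```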
The second goal is Exploratory Data Analysis (EDA).
  1. Read up on the documentation to gain an understanding of the data set. There is a README file, a PDF document, and the LDC publication page. What is the purpose of this data, what sort of information is included, and how is it organized?
  2. Then, explore the data to confirm the content. For example, the PDF document contains tables illustrating the make-up of the data and various data points. Don't take their word for it! You should find a way to confirm and demonstrate these data points through your code.
  3. Visualization: try out at least one plot or graph (see the snippet after this list).
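For instance, here are a couple of quick checks in this spirit, assuming the ets_df built above; whether these particular breakdowns match the tables in the PDF is exactly what you would verify:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Confirm a documentation table: cross-tabulate L1 against the data split.
print(pd.crosstab(ets_df['L1'], ets_df['Split']))

# One possible plot: the number of responses per L1, as a bar chart.
ets_df['L1'].value_counts().plot.bar(title='Responses per L1')
plt.tight_layout()
plt.show()
```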
The third and last goal is Linguistic Analysis. In particular, we want to highlight quantitative differences among three learner groups: low, medium, and high levels of English proficiency. First, produce three sub-corpora corresponding to the three groups, and explore the following (a few of these measurements are sketched in code below):
  1. Text length: Do the groups as a whole write longer or shorter responses?
    → Can be measured through average response length in number of words
  2. Syntactic complexity: Are the sentences simple and short, or are they complex and long?
    → Can be measured through average sentence length
  3. Lexical diversity: Are there more of the same words repeated throughout, or do the essays feature more diverse vocabulary?
    → Can be measured through type-token ratio (with caveat!)
  4. Vocabulary level: Do the essays use more of the common, everyday words, or do they use more sophisticated and technical words?
    → a. Can be measured through average word length (common words tend to be shorter), and
    → b. Can be measured against published lists of the most frequent English words.
There are five measurements in total (counting 4a and 4b separately). Choose three if you are relatively new to programming; experienced programmers should work on four or all five. If you haven't taken LING 1330 "Introduction to Computational Linguistics", you might want to learn about NLTK's core text processing functions from these slides. Additionally, if you are going for 4b, which is the most involved task, you might want to consult this homework or this one.
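As a starting point, here is a minimal sketch of measurements 1, 3, and 4a, reusing the responses dictionary and ets_df from the basic processing step. The column name 'Score Level' and the level labels 'low'/'medium'/'high' are assumptions; check the actual data first. And keep the caveat in mind: raw type-token ratio is sensitive to text length, so compare groups accordingly.

```python
# Measurements 1 (text length), 3 (type-token ratio), and 4a (average
# word length), computed per proficiency group. 'Score Level' and the
# 'low'/'medium'/'high' labels are assumptions about the actual data.

def group_measures(filenames):
    token_lists = [responses[f] for f in filenames]
    tokens = [w for toks in token_lists for w in toks]
    avg_response_len = len(tokens) / len(token_lists)         # measurement 1
    ttr = len(set(w.lower() for w in tokens)) / len(tokens)   # measurement 3
    avg_word_len = sum(len(w) for w in tokens) / len(tokens)  # measurement 4a
    return avg_response_len, ttr, avg_word_len

for level in ['low', 'medium', 'high']:
    files = ets_df[ets_df['Score Level'] == level].index
    print(level, group_measures(files))
```

Measurement 2 would additionally call for sentence splitting (e.g., nltk.sent_tokenize on the raw response text), and 4b a published frequency list to check tokens against.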

Your report, aka Jupyter Notebook file

  • At the top of your Jupyter Notebook should be a markdown cell with your name, email and date.
  • The second cell, another markdown cell, should contain a brief summary of the data set. This is just a starting point though -- probing and exploring the data set should be done throughout your report.
  • Don't forget to make use of markdown cells for organization, explanation and notes. Use comments as you see fit. Remember: your Jupyter Notebook should be much more than a Python script: you should treat it as a written project report with embedded Python code.
  • Your report should have at least these three main sections. See the "Goals" section above for contents.
    1. Basic data processing
    2. Exploratory data analysis (EDA)
    3. Linguistic analysis

Submission

  1. When you think you have the final version of your notebook, "Restart & Run All" it one last time, so that the output is orderly and tidy. Save the notebook.
  2. Push the HW2 repo to your own GitHub fork one last time. Check your GitHub fork to make sure your file is there and everything looks OK.
  3. Finally, create a pull request for me.
** This is NOT a rolling submission. The original repo is private, meaning only our class members have read access. Your fork will automatically become private, for your access only. I will wait until the assignment deadline to merge in your pull requests, which means your homework will essentially be visible only to you until I process your pull request.