
Homework 3: Machine Learning with the ETS Corpus

This homework carries a total of 60 points.
For this homework, we will apply machine learning methods to the ETS Corpus of Non-Native Written English data, using Python's sklearn library.

Data set

  • You should be fairly familiar with this corpus by now.
  • It was published by the LDC (Linguistic Data Consortium). I have acquired a license on behalf of all Pitt faculty and students. I am distributing the data through the privately held "Licensed-Data-Sets" repo of our class GitHub organization.
  • You should already have cloned the repo as your local repository. There's no need to fork.
  • You should keep the data files in their original directory location and in their original form.

Repo, files and folders

  1. Start by forking this HW3-Repo GitHub repository. Your fork will automatically be set to a private repo.
  2. Clone it onto your laptop.
  3. In the directory, you will find a folder with your name, say narae/. Your code, as a Jupyter notebook file (xxx.ipynb), should go into that directory. *After* you have created your Jupyter notebook file, you can go ahead and delete the dummy file your_script_here.txt. (Remember to use git rm.)
  4. You can name the notebook file whatever you want, but make sure to use underscore _ instead of spaces.
  5. As noted in the section above, your corpus data files should remain in their original directory location. Do not move or copy them into this repo.
  6. The repo is already configured, via the .gitignore file in the root, to ignore your Jupyter checkpoint directory .ipynb_checkpoints.
  7. While working on your code, you should be committing and pushing to your fork frequently. I haven't been enforcing this, but now that we are more comfortable with git, I will be sure to check it this time. (This means: don't work somewhere else and just copy over a finished Jupyter Notebook file.)

Goals

Let's build machine learning models that do the following:
  1. L1 identification: Given a response, predict the first language (L1) of the writer
  2. Topic (= prompt) identification: Given a response, predict the prompt for which it was written
  3. Proficiency level classification: Given a response, predict the proficiency level of the writer. (Change of plan: see below.)
The three tasks have a lot in common, but they also come with some fundamental differences that require special attention:
  1. L1 identification: Remember that the data set itself came with its own training-testing split. Use this split in your evaluation.
  2. Topic (= prompt) identification: This task is pretty similar to 1. above, with one caveat: the topic distribution is not even.
  3. Proficiency level classification: The targets ('low', 'intermediate', 'high') are not truly discrete categories! The question then becomes: regression models or classification models? (Change of plan: see below.)
Evaluation:
  1. L1 identification: Use the designated test data.
  2. Topic (= prompt) identification: Use your own training-testing split, in 80-20 ratio. Use random seed 0.
  3. Proficiency level classification: Use 10-fold cross-validation. Use random seed 0. (Change of plan: see below.) A sketch of both seeded setups follows this list.
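For concreteness, here is a minimal sketch of the two seeded setups. The toy texts, labels, and variable names are illustrative stand-ins only, not part of the starter code:

```python
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the real responses and their labels.
texts = ["a sample response", "another response here",
         "yet another response", "one more response"] * 5
labels = ["P1", "P2", "P1", "P2"] * 5

# Task 2: an 80-20 training-testing split, seeded with 0.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# Task 3 (as originally planned): 10-fold cross-validation, seeded with 0.
X = CountVectorizer().fit_transform(texts)
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(MultinomialNB(), X, labels, cv=folds)
print(scores.mean())
```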
Plots:
  • Confusion matrices in seaborn's "heat map" format are an obvious choice. Use them (a minimal sketch follows this list).
  • In addition, you should try at least one additional visualization method.
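For the heat map, something like the following minimal sketch works. The gold and predicted labels here are made up for illustration, not the corpus's actual L1 codes:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Made-up gold and predicted labels; substitute your classifier's output.
y_true = ["ARA", "FRA", "KOR", "ARA", "FRA", "KOR", "ARA", "FRA"]
y_pred = ["ARA", "FRA", "ARA", "ARA", "KOR", "KOR", "ARA", "FRA"]

label_names = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=label_names)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=label_names, yticklabels=label_names)
plt.xlabel("Predicted")
plt.ylabel("Gold")
plt.show()
```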
Machine learning algorithms:
  • Everyone should try Naive Bayes for Task 1 and Task 2 (a minimal pipeline sketch follows this list).
  • Advanced programmers: try at least one additional machine learning method.
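A minimal sketch of a bag-of-words Naive Bayes pipeline, with toy data standing in for the real responses (the L1 codes here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real responses and their L1 labels.
train_texts = ["I am agree with this statement", "In my opinion it is true",
               "This essay discuss the topic", "Firstly I want to say"]
train_labels = ["FRA", "KOR", "FRA", "KOR"]

# CountVectorizer builds the bag-of-words features; MultinomialNB classifies.
nb_clf = make_pipeline(CountVectorizer(), MultinomialNB())
nb_clf.fit(train_texts, train_labels)
print(nb_clf.predict(["I am agree with the topic"]))
```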

Features:

  • The "bag-of-words" approach is the most basic one for Task 1 and Task 2. Everyone should start with it.
  • Advanced programmers: try and see if you can leverage different features. Text-based features that go beyond individual words could be interesting (a sketch follows this list). How about features that are not based on text: for example, will L1 be a helpful predictor for the proficiency level or the prompt?
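For features beyond single words, sklearn's vectorizers make n-gram features a one-line change. A quick sketch with toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a toy response.", "Here is another toy response."]

# Word unigrams plus bigrams:
word_ngrams = CountVectorizer(ngram_range=(1, 2))
print(word_ngrams.fit_transform(docs).shape)

# Character n-grams (within word boundaries), which can capture
# spelling patterns characteristic of particular L1s:
char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
print(char_ngrams.fit_transform(docs).shape)
```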
NEW: Starter code for Task 3 (Proficiency Level Prediction)
I decided to provide a starter script for Task 3. You will find it in the Homework 3 Repo. Here's what you should do:
  1. Copy over the code into your directory. Rename it as HW3_ETS_Score_Prediction_YOURNAME.ipynb.
  2. Run it. You will see that it requires a pickled dataframe file. Figure it out (a hint follows this list).
  3. Modify the code and poke around as you see fit.
  4. Finally, edit the last Markdown cell to answer the 6 questions.
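If the pickle step has you stuck, note that pandas can load a pickled dataframe directly. A minimal sketch, with a hypothetical file name; the starter script will tell you which file it actually expects:

```python
import pandas as pd

# "some_dataframe.pkl" is a hypothetical name; check the starter
# script to see which pickle file it actually loads.
df = pd.read_pickle("some_dataframe.pkl")
print(df.head())
```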

Your report, aka Jupyter Notebook file

  • Your own Jupyter Notebook file should address Task 1 and Task 2.
  • Task 3 is submitted as your own version of the starter script.
  • And the usual:
    • At the top of your Jupyter Notebook should be a markdown cell with your name, email and date.
    • The second cell, another markdown cell, should contain a brief summary of the data set.
    • Don't forget to make use of markdown cells for organization, explanation and notes. Clearly mark sections, and use comments as you see fit. Remember: your Jupyter Notebook should be much more than a Python script; treat it as a written project report with embedded Python code.

Submission

  1. When you think you have the final version of your notebook, "Restart & Run All" it one last time, so the output is all orderly and tidy. Save the file.
  2. Push the HW3 repo to your own GitHub fork one last time. Check your GitHub fork to make sure your file is there and everything looks OK.
  3. Finally, create a pull request for me.
** This is NOT a rolling submission. The original repo is private, meaning only our class members have read access. Your fork will automatically be private, for your access only. I will wait until the assignment deadline to merge in your pull requests, which means your homework will essentially be visible only to you until I process your pull request.