LING 1340/2340 Data Science for Linguists, University of Pittsburgh

Homework 1: Explore Linguistic Data
This homework carries a total of 50 points.

For this homework, let's have you explore a linguistic data set using the Python skills you already have.

Data set

You can process any publicly available linguistic data set of your choice.
You already took a look at two data sets for To-do 1: you may pick one of them, or you may choose something else entirely.
Many of you are familiar with NLTK's corpora. Do not use NLTK's pre-loaded corpora such as nltk.corpus.brown and nltk.corpus.inaugural.
You may, however, download a zipped archive through this page and process the files as you would any other corpus. You might find the section 1.9 Loading your own Corpus useful.

GitHub dry run with To-do 1

Let's have you practice forking before getting on with the homework.
If you haven't already, fork the Class-Practice-Repo we used in class. Then, clone it onto your laptop.
Inside, you will find the todo1/ directory.
Find the text file you previously submitted for To-do 1, and copy it into that directory. Make sure it's named something like todo1_yourname.txt, so it won't conflict with another student's file.
Skip git init, because cloning effectively initializes the directory as a git repo. Do the usual local git routine, which ends with committing. Push to your own GitHub fork.
Then, create a pull request for me. I'll respond as soon as I can!

Repo, files and folders

Start by forking this HW1-Repo GitHub repository.
Once you have your own fork in your GitHub account, clone it onto your laptop. DO NOT DIRECTLY CLONE MY ORIGINAL REPO WITHOUT FORKING FIRST.
In the directory, you will find a folder with your name, say narae/. Your code, as a Jupyter notebook file (xxx.ipynb), should go into that directory.
There is a file named your_script_here.txt. You can delete it. I put it there simply because git does not track empty directories, so without it your folder would not have shown up in the repo.
You can name the notebook file whatever you want, but remember -- no spaces; use underscore _ instead.
In your personal directory, create a new directory called data, all in lowercase. All your data files should go into this directory. (No need to change data file names to remove spaces; in fact, do not modify the files at all.)
In your Python code, use relative paths when referencing data files. That is, use open("data/corpusfile1.txt") instead of the full path.
While working on your code, you should be frequently committing and pushing to your fork.
The repo is already configured, via the .gitignore file in the root, to ignore any files under data/ directories. (Oops. File was left out. See below.) Therefore, only your Jupyter notebook file (and any other file in the same folder level) will be synced up.
My bad! Somehow the .gitignore file was missing from the HW1-Repo directory. It is there now, but if you forked earlier you can easily create one on the command line, from the root of your repo. Please see this screenshot. Basically you should execute (you can copy and paste):
echo "*/data/**" > .gitignore
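The relative-path advice above can be sketched as follows. This is only a minimal illustration: the helper name read_data_file is made up for this example, corpusfile1.txt is the same placeholder file name used above, and UTF-8 encoding is an assumption that may not hold for your data.

```python
from pathlib import Path

# Relative to the folder the notebook runs in (your personal directory),
# so the same code works on any machine with the same folder layout.
DATA_DIR = Path("data")


def read_data_file(filename, data_dir=DATA_DIR):
    """Return the full text of one file under the data directory.

    Hypothetical helper for illustration; UTF-8 is an assumption.
    """
    with open(data_dir / filename, encoding="utf-8") as f:
        return f.read()


# In the notebook, with a placeholder file name:
# text = read_data_file("corpusfile1.txt")
```

Because the path is built from "data" rather than from something like a full path into your home folder, the notebook keeps working for anyone who clones your fork and drops the same data files into their own data directory.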

Your code

Your Python code should be written as a Jupyter Notebook. If you are not familiar with it, watch the Lynda.com tutorial.
At the top of your Jupyter notebook should be a markdown cell with the following information:

Your name, email and date
Info on your data set. The name, author(s), download URL, etc. Basically what you reported back in To-do 1.

The second cell, another markdown cell, should contain a self-assessment:

A summary of what your code does and how you addressed your "discovery" question.
A future wish: something that you would have liked to be able to do with this data set but do not know how at the moment.

Your code should achieve the following:

Open the data files and read in the data.
Print out some representative snippets of the data. Don't flash the entire thing -- just enough to get a sense of the content.
Print out some basic stats, such as the total number of data points. For corpora, this could be the number of text files, sentences, word tokens, etc.
Additionally, you should make one discovery. Pick a question you think you can address with reasonable effort, and explore the data for an answer.
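If you are not sure where to start, the first three steps above might look roughly like the sketch below for a corpus of plain-text files. Everything here is an assumption for illustration: the function names are made up, the files are assumed to be UTF-8 .txt files sitting directly under data/, and the whitespace tokenizer is deliberately crude. Your own data set may well need something different.

```python
from collections import Counter
from pathlib import Path


def load_corpus(data_dir="data", pattern="*.txt"):
    """Read every matching file under data_dir into a {filename: text} dict."""
    texts = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts


def tokenize(text):
    """Crude tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def describe(texts, max_files=3, snippet_chars=200, top_n=10):
    """Print short snippets, basic counts, and the most frequent tokens."""
    # Show only the beginning of the first few files, not the entire thing.
    for name in list(texts)[:max_files]:
        print(f"--- {name} ---")
        print(texts[name][:snippet_chars])

    # Basic stats over the whole corpus.
    all_tokens = [tok for text in texts.values() for tok in tokenize(text)]
    counts = Counter(all_tokens)
    print("files:", len(texts))
    print("word tokens:", len(all_tokens))
    print("word types:", len(counts))
    print("most common:", counts.most_common(top_n))
    return counts


# In the notebook:
# counts = describe(load_corpus())
```

The returned Counter is a convenient starting point for a discovery question, for example comparing how often a particular word shows up across files.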

Try and see if you can find a way to utilize the upcoming NumPy library in your data processing. Aggregation functions such as numpy.sum() and numpy.mean() are easy choices. This part is optional.
Don't forget to make use of markdown cells for organization, explanation and notes. Use comments as you see fit.
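For the optional NumPy part, here is a minimal sketch of the two aggregation functions mentioned above. The numbers are made-up per-file token counts used purely for illustration; in your notebook they would come from your own data.

```python
import numpy as np

# Hypothetical token counts, one per text file (illustration only).
tokens_per_file = np.array([1200, 950, 1430, 880])

total_tokens = np.sum(tokens_per_file)   # sum over all files
mean_tokens = np.mean(tokens_per_file)   # average file length in tokens

print("total tokens:", total_tokens)
print("mean tokens per file:", mean_tokens)
```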

Submission

When you think you have the final version of your script, "Restart & Run All" your Jupyter notebook one last time, so the output is all orderly and tidy. Save the notebook.
You should also save an HTML version of your notebook: File -> Download as -> HTML (.html). Place it in the same directory as your .ipynb file.
Push the HW1 repo to your own GitHub fork one last time. Check your GitHub fork to make sure both your files are there and everything looks OK.
Finally, create a pull request for me.
** This is a form of rolling submission. First, the original repo and everyone's forks are public. Second, I will not wait until the deadline to start processing pull requests and merging in your contributions. That means you will be able to view other students' submissions before the deadline. You should feel free to do so.