CMU Digital Humanities Summer Workshop Presentation

Go to: Na-Rae Han's home page

Corpus Processing in Python: a Primer
CMU Digital Humanities Summer Workshop, 5/16 -- 5/19/2016

Overview
Corpus methods are increasingly popular in linguistics and research in modern languages. You may have used corpus-exploration software such as AntConc and WordSmith. But how hard is it to do it without help from such specialized software? As it turns out, it's not too difficult to pick up the very basics of text processing using a programming language that is rapidly becoming a household name: Python, along with NLTK (Natural Language ToolKit), a suite of computational linguistic tools. In this hands-on tutorial, we will learn the absolute minimum that will get you started with text and corpus processing.

Task: Bulgarian vs. Japanese EFL Writing
Between Bulgarian and Japanese college students, which group writes English at a more advanced level? Let's explore the question with real data: 20 English essays written by Japanese and Bulgarian students, excerpted from the ICLE2 (International Corpus of Learner English v2) corpus. Among the many measurements that can serve as indicators of writing quality, we will try our hand at these metrics:

1. Syntactic complexity: Are the sentences simple and short, or are they complex and long?
   → Can be measured through average sentence length
2. Lexical diversity: Are there more of the same words repeated throughout, or do the essays feature more diverse vocabulary?
   → Can be measured through type-token ratio (caveat: corpus size)
3. Vocabulary level: Do the essays use more of the common, everyday words, or do they use more sophisticated and technical words?
   → a. Can be measured through average word length (common words tend to be shorter), and
   → b. Can be measured against published lists of top most frequent English words
You will notice that items 1, 2 and 3a can be measured corpus-internally. Item 3b will have to rely on an external, pre-compiled list of English words, such as The Academic Word List (Coxhead, 2000) or The First 4,000 Words (Graves, Sales and Ruda, 2008).
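To make the four measurements concrete, here is a minimal sketch that uses only the Python standard library. The regular-expression tokenizer and sentence splitter are rough stand-ins (NLTK's tokenizers would be more careful), the function names are made up for illustration, and the `common` set is a tiny hypothetical placeholder for a real published word list.

```python
import re

def sentences(text):
    """Split at ., ! or ? followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def words(text):
    """Lowercased word tokens: runs of letters, allowing internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def avg_sentence_length(text):
    """Item 1: mean number of word tokens per sentence."""
    sents = sentences(text)
    return len(words(text)) / len(sents) if sents else 0.0

def type_token_ratio(text):
    """Item 2: number of distinct words divided by number of word tokens."""
    toks = words(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def avg_word_length(text):
    """Item 3a: mean number of characters per word token."""
    toks = words(text)
    return sum(len(t) for t in toks) / len(toks) if toks else 0.0

def share_in_list(text, word_list):
    """Item 3b: proportion of word tokens that appear in an external word list."""
    toks = words(text)
    return sum(1 for t in toks if t in word_list) / len(toks) if toks else 0.0

sample = "The cat sat on the mat. The dog saw the cat!"
common = {"the", "on", "a", "of"}  # hypothetical stand-in for a published list

print("Average sentence length:", avg_sentence_length(sample))
print("Type-token ratio:", round(type_token_ratio(sample), 3))
print("Average word length:", round(avg_word_length(sample), 3))
print("Share of common words:", round(share_in_list(sample, common), 3))
```

On the corpus-size caveat: type-token ratio tends to drop as a text gets longer, because frequent words keep recurring. Compare samples of similar length, or truncate each text to the same number of tokens first.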

PART 1: Setting up Shop with Python3 and Jupyter Notebook

In order to save time and effort that normally go into setting up individual computing environments, we will use an online version of Jupyter Notebook pre-configured with Python3 and NLTK. You should:

Log into the workshop's Jupyter Notebook server (password required)
Create your own directory (e.g., name it something like "JohnSmithFiles")
Now, move into your own directory. That is where you should keep all your files, including the corpus files and your Python notebook file. Let's first create a Python notebook file:

On the right, click on "New", then choose "Python 3". It opens a new Python notebook window.
The document is initially named "Untitled". Click on "Untitled", and re-name it "Basics".
In the box, type in:
print("hello, world!")
Then press the "play" triangle button. Congratulations -- you just executed your first Python code.
Let's go ahead and set up the corpus. Steps:

Download the ICLE2 sample zip file onto your laptop computer. (Password protected; linked on the right)
Unzip the .zip archive. Examine the content.
Inside your own Jupyter Notebook folder, create a new folder through the "New" button on the right.
Initially, it comes up as "Untitled Folder". Tick the checkbox, and click the "Rename" button. Rename it to "corpus".
Double-click the "corpus" folder to move into it.
Time to upload the corpus files. Click the "Upload" button on the right.
Navigate into your own local (that is, on your laptop) folder where you previously unzipped the ICLE2 sample files. Select all 20 text files plus the README file.
Each file shows up with a blue "Upload" button. Go ahead and click on each and every one of them.
Your corpus directory is now ready.
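A quick way to confirm that the upload worked is to list the folder from a notebook cell. This is a minimal sketch: `list_corpus_files` is a made-up helper name, and the relative path "corpus" assumes the notebook sits in your own directory, next to the "corpus" folder created above.

```python
from pathlib import Path

def list_corpus_files(corpus_dir="corpus"):
    """Return the sorted names of the regular files directly inside corpus_dir."""
    return sorted(p.name for p in Path(corpus_dir).iterdir() if p.is_file())

# Only runs when a "corpus" folder exists next to the notebook (an assumption).
if Path("corpus").is_dir():
    names = list_corpus_files("corpus")
    print(len(names), "files found")
    for name in names:
        print(name)
```

If all uploads succeeded, the count should be 21: the 20 essays plus the README file.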

PART 2: Python & NLTK Basics

In this part, we will learn the absolute basics of Python and NLTK. The lesson is conducted in Jupyter Notebook.

PART 3: Corpus Processing

In this part, we will learn how to process an archive of text files, aka a corpus. We will be composing a separate Jupyter notebook.
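As a preview of what processing a folder of text files can look like, here is a minimal sketch using only the standard library. The helper names, the "*.txt" pattern, and the UTF-8 encoding are all assumptions for illustration; adjust them to match the actual sample files. NLTK also provides corpus reader classes, such as PlaintextCorpusReader, that serve a similar purpose.

```python
from pathlib import Path

def load_corpus(corpus_dir="corpus", pattern="*.txt", encoding="utf-8"):
    """Map each matching file name to the full text of that file."""
    texts = {}
    for path in sorted(Path(corpus_dir).glob(pattern)):
        texts[path.name] = path.read_text(encoding=encoding)
    return texts

def token_counts(texts):
    """Map each file name to a crude, whitespace-delimited token count."""
    return {name: len(text.split()) for name, text in texts.items()}

# Only runs when a "corpus" folder exists next to the notebook (an assumption).
if Path("corpus").is_dir():
    essays = load_corpus("corpus")
    for name, count in token_counts(essays).items():
        print(name, count)
```

Once every essay is in a dictionary like this, any per-text measurement can be applied file by file and then averaged within each group of writers.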

What's Next?

Save your work

When your Jupyter notebook is finished, re-run it one last time: "Kernel" --> "Restart & Run All"
Save your notebooks: "File" --> "Download as". Multiple options:

Notebook (.ipynb): Not recommended unless you already have Jupyter installed on your own laptop.
Python (.py): Once you install Python 3 on your laptop, you will be able to run this file as a script. But beware: when you run the script, only explicit print() commands will produce visible output.
HTML (.html): RECOMMENDED. The downloaded file is a web page, so you won't be able to run it as a Python script as is. But the HTML file has all the information, and you should be able to re-trace each step in the Python IDLE shell environment, once you have your own laptop configured.

Close your notebook: "File" --> "Close and Halt"
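The warning about the Python (.py) download is easy to see with a small example. This sketch uses a made-up one-sentence `essay`; the point is only the difference between a bare expression and an explicit print().

```python
essay = "Corpus methods are increasingly popular."
tokens = essay.split()

# In a notebook cell, a bare expression on the last line is echoed as output.
# In a .py script, the same line is evaluated and its value is silently discarded.
len(tokens)

# An explicit print() produces visible output in both settings.
print(len(tokens))
```

Before downloading a notebook as a script, it helps to wrap every result you care about in print().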
Install Python 3 / NLTK on your laptop
You will need to have Python 3 and NLTK on your own machine.

Python 3 installation how-to, for Windows
Python 3 installation how-to, for Mac
NLTK 3.0 installation and data download instructions in this FAQ
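After installing, a short check like the following can confirm that both pieces are in place. It is a minimal sketch: the helper names are made up, and it only tests whether the `nltk` package can be found, not whether any NLTK data packages have been downloaded.

```python
import importlib.util
import sys

def python3_ready():
    """True when the running interpreter is Python 3."""
    return sys.version_info.major == 3

def nltk_installed():
    """True when the nltk package can be found by this interpreter."""
    return importlib.util.find_spec("nltk") is not None

print("Python version:", sys.version.split()[0])
print("Running Python 3:", python3_ready())

if nltk_installed():
    import nltk
    print("NLTK version:", nltk.__version__)
else:
    print("NLTK not found; see the installation instructions above.")
```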
Install Jupyter Notebook on your laptop
Additionally, if you liked working in Jupyter, you may want to install it on your own machine. See this instruction page.
Learn more

Take LING 1330/2330 Introduction to Computational Linguistics. It has both undergraduate and graduate sections and will be offered every spring.
Take a Python intro course, such as CS 0008. CMU likely has a similar course.
Take free MOOCs online. Coursera, edX, and Udacity all have great Python courses.
Join PyLing (Pitt Python Linguistics Group). Email Na-Rae if interested.

DH Workshop links:

CMU DH Workshop Home
Jupyter Python server

This workshop session:

Corpus download

Python help:

Python 3 Notes
FAQ
Text samples
LING 1330/2330 Introduction to Computational Linguistics