LING 2050 Special Topics in Linguistics: Corpus Linguistics, University of Pittsburgh


Lab 7
Objectives: installing Python and NLTK; navigating through corpora using NLTK

References:

NLTK Home
Download
Getting Started
NLTK Book Table of Contents
Chapter 2: Accessing Text Corpora and Lexical Resources

Overview

The Natural Language Toolkit (NLTK) is a versatile open-source Python package designed for research, development, and instruction in natural language processing. It supports a wide range of NLP tasks, from simple text and corpus processing operations to more complex ones such as part-of-speech tagging, parsing, and text categorization, many of which rely on machine learning methods. Learning NLTK in its entirety would mean covering much of the landscape of present-day NLP, which would require a whole semester devoted to the subject; in our lab sessions, we will focus on the essential operations needed for corpus processing and handling.

Download and Install Python and NLTK

First order of business: download and install Python and NLTK. Download links and instructions are on this page. There are three components:

Python
PyYAML (a Python module for parsing YAML, a data serialization format)
NLTK

Windows users: download and install the three components in the given order.

Mac users: first download the three components, then follow the instructions below.

Install Python from the dmg image.
From within your Downloads directory in Finder, double-click the PyYAML-3.09.tar.gz file to unpack it into a directory named PyYAML-3.09.
Open a *new* terminal. Move into your PyYAML-3.09 directory, which is likely to be located at: ~/Downloads/PyYAML-3.09
Type in the following: python setup.py install. Supply your password if prompted.
Try installing NLTK using the dmg image. This is likely to fail, however.
If it does fail, move into the NLTK install directory: /tmp/nltk-installer
Type in: sudo python setup.py install
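
A quick way to confirm that the installation worked, on either platform, is to try importing the two modules. The sketch below is not part of the original lab instructions: the helper name check_modules is made up for illustration, it assumes PyYAML is imported under the name yaml, and it is written for a Python 3 interpreter (the lab itself names Python 2.6, where small adjustments may be needed).

```python
import importlib


def check_modules(module_names=("yaml", "nltk")):
    """Try to import each module; map its name to a version string, or None if the import fails."""
    status = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            status[name] = None  # not installed, or not visible to this interpreter
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            status[name] = getattr(module, "__version__", "version unknown")
    return status


for name, version in check_modules().items():
    print(name, "->", version if version is not None else "NOT FOUND")
```

If either module is reported as NOT FOUND, the most likely cause is that it was installed for a different Python than the one running the check.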

Getting Started: Loading NLTK within IDLE

We will be running Python and NLTK via IDLE (Python's Integrated Development Environment). It is located in:
(Windows) Start -> All Programs -> Python 2.6 -> IDLE
(Mac) Finder -> Applications -> Python 2.6 -> IDLE

Now load NLTK by typing import nltk, as seen in the image above.

You will notice that the Up/Down arrows do not scroll through your command history the way they did in your bash terminal. That is because IDLE does not bind the arrow keys to the command history by default. Instead, use Alt-p (Windows) or Ctrl-p (Mac) to recall the previously typed command, and Alt-n (Windows) or Ctrl-n (Mac) for the next one. (See this page.)
But of course, some tech-savvy folks have written a solution: follow the steps below to enable the Up/Down arrows. (I was able to get this patch to work on Windows only. Mac users: please hang tight and use Ctrl-p/n in the meantime.)

First, exit IDLE.
Download Terminal.py from this page, and save it in C:\Python26\Lib\idlelib.
Double-click Terminal.py to execute the script.
Open the config-extensions.def file in the same directory and append the following lines:
[Terminal]
enable=1
enable_shell=1
enable_editor=0

Now open up IDLE. You should be able to use the Up/Down arrows to scroll through your command history.
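
The step of appending the [Terminal] lines can also be done from a script rather than by hand. The following is only a sketch: the function name enable_terminal_extension is hypothetical, it detects the section with a plain substring test, and it has not been tried against IDLE's real file, so make a backup copy of config-extensions.def before pointing it there.

```python
TERMINAL_SECTION = "[Terminal]\nenable=1\nenable_shell=1\nenable_editor=0\n"


def enable_terminal_extension(def_path):
    """Append the [Terminal] section to the given file unless it is already present.

    Returns True if the file was changed, False if the section was already there.
    """
    with open(def_path) as handle:
        existing = handle.read()
    if "[Terminal]" in existing:
        return False
    with open(def_path, "a") as handle:
        # Start on a fresh line and leave a blank line before the new section.
        if existing and not existing.endswith("\n"):
            handle.write("\n")
        handle.write("\n" + TERMINAL_SECTION)
    return True


# Example call, using the Windows location given in the steps above:
#   enable_terminal_extension(r"C:\Python26\Lib\idlelib\config-extensions.def")
```

Running the function a second time leaves the file untouched, so it is safe to repeat if you are unsure whether the edit was already made.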

Downloading NLTK Corpora
We will now download a collection of corpora to be used within NLTK. The corpora are also available for manual download on this page. (Look familiar? It should!)

Executing nltk.download() brings up an interface window for downloading corpus packages (see the screenshot).
Download and install all-corpora, which installs all corpora listed on the page, including our gutenberg and abc corpora.
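
nltk.download() also accepts a package identifier, which skips the interface window. The sketch below wraps that call. The helper name download_packages and its optional downloader argument are made up for illustration (the argument exists only so the helper can be tried without a network connection), and the identifiers gutenberg, abc, and all-corpora are assumed to match the names shown in the downloader window.

```python
def download_packages(package_ids=("gutenberg", "abc"), downloader=None):
    """Fetch each NLTK data package by identifier; return {identifier: True/False}."""
    if downloader is None:
        import nltk  # imported here so a missing NLTK install is reported at call time
        downloader = nltk.download
    results = {}
    for package_id in package_ids:
        # nltk.download returns a true value when the package is installed or already up to date.
        results[package_id] = bool(downloader(package_id))
    return results


# Typical calls from the IDLE prompt (these need a network connection):
#   download_packages()                  # just the two corpora named in this lab
#   download_packages(["all-corpora"])   # every corpus, as in the step above
```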

Navigating Through Corpora
We will now try navigating through some of the essential corpora provided with NLTK, while learning the basic commands associated with various aspects of corpus structure. In this part of the lab, we will work through the following sections of Chapter 2: Accessing Text Corpora and Lexical Resources.

Gutenberg corpus
Brown corpus
Reuters corpus
Annotated text corpora
Text corpus structure
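
The corpus readers covered in these sections share a small set of methods: fileids() lists the files in a corpus, and raw(), words(), and sents() return a file's characters, word tokens, and sentences. As a rough sketch of how they fit together, the hypothetical helper corpus_summary below tabulates the size of every file in a reader. It assumes the corpus data has already been downloaded; for plain-text corpora such as Gutenberg, sents() may additionally need NLTK's Punkt sentence tokenizer data.

```python
def corpus_summary(corpus):
    """Return one (fileid, n_chars, n_words, n_sents) tuple per file in a corpus reader."""
    rows = []
    for fileid in corpus.fileids():
        n_chars = len(corpus.raw(fileid))    # raw text, counted in characters
        n_words = len(corpus.words(fileid))  # word tokens, punctuation included
        n_sents = len(corpus.sents(fileid))  # sentences, each a list of tokens
        rows.append((fileid, n_chars, n_words, n_sents))
    return rows


# Usage from the IDLE prompt, once the corpora are installed:
#   from nltk.corpus import gutenberg
#   for row in corpus_summary(gutenberg):
#       print(row)
```

The helper works with any object that offers these four methods, so the same call can be tried on the other corpora in the list.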

Processing Your Own Corpora
NLTK lets you load your own corpora and use all of its corpus-processing functions on them. This section provides instructions on how to load your own corpus.

We will use a tiny sample corpus, contained in: ELI_essays.zip.
Download the corpus and unzip it into a location of your choice. Note the location.
From your Desktop environment, open up the corpus text files and examine their content. They are just regular text files!
Following these instructions closely, load the corpus.
Now find out the corpus size (in number of tokens) and the total number of sentences. (Hint)
Now try looking up concordance lines involving any word of your choice. (Hint)
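
One possible way to approach the last three steps is sketched below, assuming the linked instructions use NLTK's PlaintextCorpusReader (the reader the NLTK book uses for loading your own corpus). The helper names, the .txt file pattern, and the example path are assumptions for illustration, not details taken from the lab materials.

```python
def load_plaintext_corpus(corpus_root, file_pattern=r".*\.txt"):
    """Build a corpus reader over the files in corpus_root whose names match file_pattern."""
    from nltk.corpus import PlaintextCorpusReader  # needs NLTK itself, but no downloaded data
    return PlaintextCorpusReader(corpus_root, file_pattern)


def show_concordance(corpus, word, lines=10):
    """Print up to `lines` concordance lines for `word` across the whole corpus."""
    from nltk.text import Text
    Text(corpus.words()).concordance(word, lines=lines)


# Usage from the IDLE prompt (replace the path with wherever you unzipped the corpus):
#   essays = load_plaintext_corpus("/path/to/ELI_essays")   # hypothetical location
#   essays.fileids()         # the files that matched the pattern
#   len(essays.words())      # corpus size in tokens
#   len(essays.sents())      # number of sentences; needs the Punkt tokenizer data
#   show_concordance(essays, "the")
```

The token count includes punctuation marks, since the reader's default word tokenizer treats them as separate tokens.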