The Natural Language Toolkit (NLTK) is a powerful and versatile Python package designed for research, development, and instruction in natural language processing. It supports a wide range of NLP tasks, from simple text and corpus processing operations to more complex ones such as part-of-speech tagging, parsing, and text categorization, many of which rely on machine learning methods. It is an open-source Python module. Learning NLTK in its entirety would mean covering much of the landscape of present-day NLP, which would require a whole semester devoted to the subject; in our lab sessions, we will focus on the essential operations needed for corpus processing and handling.
- First order of business: download and install Python and NLTK. Download links and instructions are on this page. There are three components:
- Python
- PyYAML (a Python module for parsing YAML, a data serialization format)
- NLTK
- Windows users: download and install the three components in the given order.
- Mac users: first download the three components and follow instructions below.
- Install python from the dmg image.
- From within your Download directory inside Finder, double click PyYAML-3.09.tar.gz file to unzip it into a directory named PyYAML-3.09.
- Open a *new* terminal. Move into your PyYAML-3.09 directory, which is likely to be located at: ~/Downloads/PyYAML-3.09
- Type in the following: python setup.py install. Supply password if prompted.
- Try installing NLTK using the dmg image. This is likely to fail, however.
- Now move into the NLTK install directory: /tmp/nltk-installer
- Type in: sudo python setup.py install
- We will be running Python and NLTK via IDLE (Python's Integrated Development Environment). It is located in:
(Windows) Start -> All Programs -> Python 2.6 -> IDLE
(Mac) Finder -> Applications -> Python 2.6 -> IDLE
- Now load NLTK by typing import nltk, as seen in the image above.
- You will notice that the Up/Down arrows do not scroll through your command history the way they did in your bash terminal. That is because IDLE does not bind the arrow keys to command history by default. Instead, use Alt-p (Windows) or Ctrl-p (Mac) to recall the previously typed command, and Alt-n (Windows) or Ctrl-n (Mac) for the next one. (See this page.)
But of course, some tech-savvy folks have written a solution: follow the steps below to enable Up/Down arrows. (For some reason, I was able to get this patch to work on Windows only. Mac users: please hang tight and use Ctrl-p/n in the meantime.)
Downloading NLTK Corpora
We will now download a collection of corpora to be used within NLTK. The corpora are also available for manual download on this page. (Look familiar? It should!)
- Executing nltk.download() brings up an interface window for downloading corpus packages (see the screenshot).
- Download and install all-corpora, which installs all corpora listed on the page, including our gutenberg and abc corpora.
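The same download can also be done without the interface window by passing a package name to nltk.download(). The sketch below is one possible way to do that, not the only one: it assumes a working NLTK install and an internet connection, and the helper name ensure_corpus is made up for illustration. It checks whether a corpus is already on disk and downloads it only if it is missing.

```python
import nltk


def ensure_corpus(name):
    """Return True if corpus `name` is available, downloading it if needed.

    `ensure_corpus` is a hypothetical helper, not part of NLTK.
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find("corpora/" + name)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return nltk.download(name)
```

Calling ensure_corpus("gutenberg") and ensure_corpus("abc") would fetch just the two corpora named above, while nltk.download("all-corpora") fetches the whole collection in one go.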
Navigating Through Corpora
We will now try navigating through some of the essential corpora provided with NLTK, while learning the basic commands associated with various aspects of corpus structure. Here we will cover selected sections of Chapter 2: Accessing Text Corpora and Lexical Resources.
- Gutenberg corpus
- Brown corpus
- Reuters corpus
- Annotated text corpora
- Text corpus structure
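As a rough preview of the kind of commands involved, the sketch below lists a few files from the Gutenberg corpus with their character and token counts, then peeks at the categories and tagged words of the Brown corpus. It assumes the corpora have already been downloaded; the helper name corpus_overview and the limit of three files are choices made here for illustration only.

```python
from nltk.corpus import brown, gutenberg


def corpus_overview(corpus, limit=3):
    """Return (fileid, number of characters, number of tokens) per file.

    `corpus_overview` is a hypothetical helper, not part of NLTK.
    """
    rows = []
    for fileid in corpus.fileids()[:limit]:
        n_chars = len(corpus.raw(fileid))    # raw text as one string
        n_tokens = len(corpus.words(fileid))  # words and punctuation tokens
        rows.append((fileid, n_chars, n_tokens))
    return rows


try:
    for row in corpus_overview(gutenberg):
        print(row)
    # Brown is organized by category and carries part-of-speech tags.
    print(brown.categories()[:5])
    print(brown.tagged_words(categories="news")[:5])
except LookupError:
    # NLTK raises LookupError when corpus data is not installed.
    print("Corpus data not found; run nltk.download() first.")
```

The same three methods (fileids, raw, words) are available on most NLTK corpus readers, which is why the helper takes the corpus as an argument rather than hard-coding one.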
Processing Your Own Corpora
NLTK lets you load your own corpora and use all corpus-processing functionalities on them. This section provides instructions on how to load your own corpus.
- We will use a tiny sample corpus, contained in: ELI_essays.zip.
- Download the corpus and unzip it into a location of your choice. Note the location.
- From your Desktop environment, open up the corpus text files and examine their content. They are just regular text files!
- Following these instructions closely, load the corpus.
- Now find out the corpus size (in number of tokens) and the total number of sentences. (Hint)
- Now try looking up concordance lines involving any word of your choice. (Hint)
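The steps above can be sketched roughly as follows, assuming Python 3 and a recent NLTK. This is an illustration under stated assumptions rather than the linked instructions themselves: the two short texts are made-up stand-ins for the essay files (in practice, root would be the folder where you unzipped ELI_essays.zip), and the regular-expression sentence splitter is a simplification used here so that no extra tokenizer data needs to be downloaded. Leaving out the sent_tokenizer argument makes NLTK fall back on its own sentence tokenizer instead.

```python
import os
import tempfile

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Made-up stand-ins for the essay files; not the real ELI_essays content.
SAMPLE_FILES = {
    "essay1.txt": "I like school. School is fun!\n",
    "essay2.txt": "My school is big. I walk to school every day.\n",
}


def load_corpus(root):
    """Load every .txt file under `root` as one plain-text corpus."""
    # Assumption: a naive splitter that breaks after ., ! or ? plus whitespace.
    splitter = RegexpTokenizer(r"(?<=[.!?])\s+", gaps=True)
    return PlaintextCorpusReader(root, r".*\.txt", sent_tokenizer=splitter)


with tempfile.TemporaryDirectory() as root:
    # Write the sample files; with a real corpus this step is the unzip.
    for name, content in SAMPLE_FILES.items():
        with open(os.path.join(root, name), "w") as handle:
            handle.write(content)

    corpus = load_corpus(root)
    file_ids = corpus.fileids()
    n_tokens = len(corpus.words())  # corpus size in tokens
    n_sents = len(corpus.sents())   # total number of sentences
    print("files:", file_ids)
    print("tokens:", n_tokens)
    print("sentences:", n_sents)

    # Concordance lines for a word of your choice.
    nltk.Text(corpus.words()).concordance("school")
```

Note that words() counts punctuation marks as tokens, so the token count is higher than a plain word count.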