Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit
Steven Bird, Ewan Klein, and Edward Loper
O'Reilly Media, 2009 | Sellers and prices
0. Preface (extras)
Preface, Audience, Emphasis, What You Will Learn, Organization, Why Python?, Software Requirements, Natural Language Toolkit (NLTK), For Instructors, Conventions Used in This Book, Using Code Examples, Acknowledgments, About the Authors, Royalties
1. Language Processing and Python (extras)
1 Language Processing and Python, 1.1 Computing with Language: Texts and Words, Getting Started with Python, Getting Started with NLTK, Searching Text, Counting Vocabulary, 1.2 A Closer Look at Python: Texts as Lists of Words, Lists, Indexing Lists, Variables, Strings, 1.3 Computing with Language: Simple Statistics, Frequency Distributions, Fine-grained Selection of Words, Collocations and Bigrams, Counting Other Things, 1.4 Back to Python: Making Decisions and Taking Control, Conditionals, Operating on Every Element, Nested Code Blocks, Looping with Conditions, 1.5 Automatic Natural Language Understanding, Word Sense Disambiguation, Pronoun Resolution, Generating Language Output, Machine Translation, Spoken Dialog Systems, Textual Entailment, Limitations of NLP, 1.6 Summary, 1.7 Further Reading, 1.8 Exercises
2. Accessing Text Corpora and Lexical Resources (extras)
2 Accessing Text Corpora and Lexical Resources, 2.1 Accessing Text Corpora, Gutenberg Corpus, Web and Chat Text, Brown Corpus, Reuters Corpus, Inaugural Address Corpus, Annotated Text Corpora, Corpora in Other Languages, Text Corpus Structure, Loading your own Corpus, 2.2 Conditional Frequency Distributions, Conditions and Events, Counting Words by Genre, Plotting and Tabulating Distributions, Generating Random Text with Bigrams, 2.3 More Python: Reusing Code, Creating Programs with a Text Editor, Functions, Modules, 2.4 Lexical Resources, Wordlist Corpora, A Pronouncing Dictionary, Comparative Wordlists, Shoebox and Toolbox Lexicons, 2.5 WordNet, Senses and Synonyms, The WordNet Hierarchy, More Lexical Relations, Semantic Similarity, 2.6 Summary, 2.7 Further Reading, 2.8 Exercises
3. Processing Raw Text
3 Processing Raw Text, 3.1 Accessing Text from the Web and from Disk, Electronic Books, Dealing with HTML, Processing Search Engine Results, Processing RSS Feeds, Reading Local Files, Extracting Text from PDF, MSWord and other Binary Formats, Capturing User Input, The NLP Pipeline, Basic Operations with Strings, Printing Strings, Accessing Individual Characters, Accessing Substrings, More operations on strings, The Difference between Lists and Strings, 3.3 Text Processing with Unicode, What is Unicode?, Extracting encoded text from files, Using your local encoding in Python, Using Basic Meta-Characters, Ranges and Closures, Extracting Word Pieces, Doing More with Word Pieces, Finding Word Stems, Searching Tokenized Text, 3.6 Normalizing Text, Stemmers, Lemmatization, 3.7 Regular Expressions for Tokenizing Text, Simple Approaches to Tokenization, NLTK's Regular Expression Tokenizer, Further Issues with Tokenization, 3.8 Segmentation, Sentence Segmentation, Word Segmentation, From Lists to Strings, Strings and Formats, Lining Things Up, Writing Results to a File, Text Wrapping, 3.10 Summary, 3.11 Further Reading, 3.12 Exercises
4. Writing Structured Programs (extras)
4 Writing Structured Programs, 4.1 Back to the Basics, Assignment, Equality, Conditionals, 4.2 Sequences, Operating on Sequence Types, Combining Different Sequence Types, Generator Expressions, 4.3 Questions of Style, Python Coding Style, Procedural vs Declarative Style, Some Legitimate Uses for Counters, 4.4 Functions: The Foundation of Structured Programming, Function Inputs and Outputs, Parameter Passing, Variable Scope, Checking Parameter Types, Functional Decomposition, Documenting Functions, 4.5 Doing More with Functions, Functions as Arguments, Accumulative Functions, Higher-Order Functions, Named Arguments, 4.6 Program Development, Structure of a Python Module, Multi-Module Programs, Sources of Error, Debugging Techniques, Defensive Programming, 4.7 Algorithm Design, Recursion, Space-Time Tradeoffs, Dynamic Programming, 4.8 A Sample of Python Libraries, Matplotlib, NetworkX, csv, NumPy, Other Python Libraries, 4.9 Summary, 4.10 Further Reading, 4.11 Exercises
5. Categorizing and Tagging Words
5 Categorizing and Tagging Words, 5.1 Using a Tagger, 5.2 Tagged Corpora, Representing Tagged Tokens, Reading Tagged Corpora, A Simplified Part-of-Speech Tagset, Nouns, Verbs, Adjectives and Adverbs, Unsimplified Tags, Exploring Tagged Corpora, 5.3 Mapping Words to Properties Using Python Dictionaries, Indexing Lists vs Dictionaries, Dictionaries in Python, Defining Dictionaries, Default Dictionaries, Incrementally Updating a Dictionary, Complex Keys and Values, Inverting a Dictionary, 5.4 Automatic Tagging, The Default Tagger, The Regular Expression Tagger, The Lookup Tagger, Evaluation, 5.5 N-Gram Tagging, Unigram Tagging, Separating the Training and Testing Data, General N-Gram Tagging, Combining Taggers, Tagging Unknown Words, Storing Taggers, Performance Limitations, 5.6 Transformation-Based Tagging, 5.7 How to Determine the Category of a Word, Morphological Clues, Syntactic Clues, Semantic Clues, New Words, Morphology in Part of Speech Tagsets, 5.8 Summary, 5.9 Further Reading, 5.10 Exercises
6. Learning to Classify Text (extras)
6 Learning to Classify Text, 6.1 Supervised Classification, Gender Identification, Choosing The Right Features, Document Classification, Part-of-Speech Tagging, Exploiting Context, Sequence Classification, Other Methods for Sequence Classification, 6.2 Further Examples of Supervised Classification, Sentence Segmentation, Identifying Dialogue Act Types, Recognizing Textual Entailment, Scaling Up to Large Datasets, 6.3 Evaluation, The Test Set, Accuracy, Precision and Recall, Confusion Matrices, Cross-Validation, 6.4 Decision Trees, Entropy and Information Gain, 6.5 Naive Bayes Classifiers, Underlying Probabilistic Model, Zero Counts and Smoothing, Non-Binary Features, The Naivete of Independence, The Cause of Double-Counting, 6.6 Maximum Entropy Classifiers, The Maximum Entropy Model, Maximizing Entropy, Generative vs Conditional Classifiers, 6.7 Modeling Linguistic Patterns, What do models tell us?, 6.8 Summary, 6.9 Further Reading, 6.10 Exercises
7. Extracting Information from Text
7 Extracting Information from Text, 7.1 Information Extraction, Information Extraction Architecture, Noun Phrase Chunking, Tag Patterns, Chunking with Regular Expressions, Exploring Text Corpora, Chinking, Representing Chunks: Tags vs Trees, 7.3 Developing and Evaluating Chunkers, Reading IOB Format and the CoNLL 2000 Corpus, Simple Evaluation and Baselines, Training Classifier-Based Chunkers, 7.4 Recursion in Linguistic Structure, Building Nested Structure with Cascaded Chunkers, Trees, Tree Traversal, 7.5 Named Entity Recognition, 7.7 Summary, 7.8 Further Reading, 7.9 Exercises
8. Analyzing Sentence Structure (extras)
8 Analyzing Sentence Structure, 8.1 Some Grammatical Dilemmas, Linguistic Data and Unlimited Possibilities, Ubiquitous Ambiguity, 8.2 What's the Use of Syntax?, Beyond n-grams, 8.3 Context Free Grammar, A Simple Grammar, Writing Your Own Grammars, Recursion in Syntactic Structure, 8.4 Parsing With Context Free Grammar, Recursive Descent Parsing, Shift-Reduce Parsing, The Left-Corner Parser, Well-Formed Substring Tables, 8.5 Dependencies and Dependency Grammar, Valency and the Lexicon, Scaling Up, 8.6 Grammar Development, Treebanks and Grammars, Pernicious Ambiguity, Weighted Grammar, 8.7 Summary, 8.8 Further Reading, 8.9 Exercises
9. Building Feature Based Grammars
9 Building Feature Based Grammars, 9.1 Grammatical Features, Syntactic Agreement, Using Attributes and Constraints, Terminology, Subsumption and Unification, 9.3 Extending a Feature based Grammar, Subcategorization, Heads Revisited, Auxiliary Verbs and Inversion, Unbounded Dependency Constructions, Case and Gender in German, 9.4 Summary, 9.5 Further Reading, 9.6 Exercises
10. Analyzing the Meaning of Sentences (extras)
10 Analyzing the Meaning of Sentences, 10.1 Natural Language Understanding, Querying a Database, Natural Language, Semantics and Logic, 10.3 First-Order Logic, Syntax, First Order Theorem Proving, Summarizing the Language of First Order Logic, Truth in Model, Individual Variables and Assignments, Quantification, Quantifier Scope Ambiguity, Model Building, 10.4 The Semantics of English Sentences, Compositional Semantics in Feature-Based Grammar, The λ-Calculus, Quantified NPs, Transitive Verbs, Quantifier Ambiguity Revisited, 10.5 Discourse Semantics, Discourse Representation Theory, Discourse Processing, 10.6 Summary, 10.7 Further Reading, 10.8 Exercises
11. Managing Linguistic Data
11 Managing Linguistic Data, 11.1 Corpus Structure: a Case Study, The Structure of TIMIT, Notable Design Features, Fundamental Data Types, Three Corpus Creation Scenarios, Quality Control, Curation vs Evolution, 11.3 Acquiring Data, Obtaining Data from the Web, Obtaining Data from Word Processor Files, Obtaining Data from Spreadsheets and Databases, Converting Data Formats, Deciding Which Layers of Annotation to Include, Standards and Tools, Special Considerations when Working with Endangered Languages, 11.4 Working with XML, Using XML for Linguistic Structures, The Role of XML, The ElementTree Interface, Using ElementTree for Accessing Toolbox Data, Formatting Entries, 11.5 Working with Toolbox Data, Adding a Field to Each Entry, Validating a Toolbox Lexicon, 11.6 Describing Language Resources using OLAC Metadata, What is Metadata?, OLAC: Open Language Archives Community, Disseminating Language Resources, 11.7 Summary, 11.8 Further Reading, 11.9 Exercises
12. Afterword: Facing the Language Challenge
Afterword: The Language Challenge, Language Processing vs Symbol Processing, Contemporary Philosophical Divides, NLTK Roadmap, Envoi...
Bibliography
Term Index
Errata (corrected here, and in the second printing of the book (January 2010))
Translations: Book (jp), Prefácio (pt), Przedmowa (pl)
Reviews: LanguageLog, Amazon.com, Slashdot.org, Dr Dobbs
Interested in translating this book? Please read our Translator's Guide.
This book is made available under the terms of the Creative Commons Attribution Noncommercial No-Derivative-Works 3.0 US License.
Please post any questions about the materials to the nltk-users mailing list. Please report any errors on the issue tracker. Note that the "extras" sections are not part of the published book, and will continue to be expanded.