Go to: LING 1340/2340 home page  

Learning Resources by Topic

This is not just a link dump. These resources are carefully curated textbook stand-ins, and you are fully expected to learn from them! There are multiple types:
  1. Online tutorials. Watch, practice, and learn. I pre-screened these and narrowed them down to the most essential and relevant content, so you can stop wondering if you should learn the whole thing!
  2. Articles. Read them -- they will be referenced in lectures and used in classroom discussions.
  3. Books and book chapters. Python Data Science Handbook aligns neatly with our data science focus and doubles as a reference book. Parts of the NLTK Book will also be referenced.
  4. Software installation links. Download and install on your machine.
  5. Bookmark pages. These are lists of useful links compiled by someone else, which often contain pointers to data sets or resources. Explore them and use them as needed; you should become familiar with what's on them.
  6. References -- for looking things up.

Linguistic Data, Open Access, Data Publishing

  • Linguistics Data Repositories [link]
  • Linguistic Linked Open Data [link]
  • Linguistic Data Consortium (LDC) [link]
  • Data Management Plans for Linguistic Research, Workshop at 2017 LSA Summer Institute [link] [slides]
  • Justin Kitzes. (2018) The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research. [link]
  • D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh [link]
  • Copyright and Intellectual Property Toolkit by Lauren Collister [link]
  • TEI: A Gentle Introduction to XML [link]
  • json.org: Introducing JSON [link], JSON example (vs. XML) [link]
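
To make the JSON vs. XML comparison concrete, here is a minimal sketch that parses the same record from both formats using only Python's standard library. The record itself (a single token with a lemma and a part-of-speech tag) and its field names are made up for illustration; they do not come from any of the resources listed above.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: one token with its lemma and part-of-speech tag.
json_text = '{"token": {"form": "dogs", "lemma": "dog", "pos": "NNS"}}'
xml_text = '<token pos="NNS"><form>dogs</form><lemma>dog</lemma></token>'

# JSON parses directly into Python dicts, lists, strings, and numbers.
json_record = json.loads(json_text)["token"]

# XML parses into an element tree; values sit in attributes and child text,
# so we collect them into a dict by hand.
root = ET.fromstring(xml_text)
xml_record = {
    "form": root.findtext("form"),
    "lemma": root.findtext("lemma"),
    "pos": root.get("pos"),
}

# Both routes recover the same information.
print(json_record == xml_record)
```

Note the design difference: JSON maps onto Python data structures with no extra work, while XML leaves it to you to decide which information lives in attributes and which in child elements.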

Corpus Linguistics

  • Stefan Th. Gries and John Newman. (2013) Creating and using corpora. In Robert J. Podesva and Devyani Sharma (Eds.), Research Methods in Linguistics. [link] [draft copy]
  • NLTK Book Ch.2 Accessing Text Corpora and Lexical Resources [chapter]
  • NLTK Book Ch.11 Managing Linguistic Data [chapter]
  • NLTK Corpora Index [link] [GitHub repo]
  • FSNLP Ch.4 Corpus-Based Work Links [link]
  • Corpus-based Linguistics Links [link]
  • Corpus Resource Database (CoRD) [link]

Data Processing Fundamentals: Python's numpy, pandas, and visualization libraries

  • Python Data Science Handbook. (2016) O'Reilly Media [book]
  • (DataCamp) Introduction to Python for Data Science, Ch.4 NumPy [tutorial]
  • (DataCamp) Intermediate Python for Data Science. Focus on Matplotlib, Numpy & Pandas. [tutorial]
  • (DataCamp) pandas Foundations [tutorial]
  • (DataCamp) Manipulating DataFrames with pandas [tutorial]
  • Visualization: pandas 0.20.3 documentation [link]
  • Chris Albon's Notes on ML & AI, "Data Wrangling" [link]
  • 19 Essential Snippets in Pandas [link]

Data Mining & Machine Learning

  • Twitter text mining tutorials: [The Code Way], [Adil Moujahid], [Marco Bonzanini]
  • Scrapy tutorial [link]
  • Mapping the United Swears of America by Jack Grieve [link]
  • Python Data Science Handbook. (2016) O'Reilly Media [book]
  • Topic Modeling with Scikit Learn [link]
  • (DataCamp) Supervised Learning with scikit-learn [tutorial]
  • (DataCamp) Unsupervised Learning in Python [tutorial]
  • (DataCamp) NLP Fundamentals in Python [tutorial]

Big Data Essentials

  • How to "Big Data" with Python [link]
  • Learn Big Data Analytics using Top YouTube Videos, TED Talks & other resources [link]
  • spaCy: Industrial-Strength Natural Language Processing in Python [link]
  • CRC: Center for Research Computing at Pitt [link]

Linguistic Annotation, Ontology, and Knowledge Engineering

  • NLTK Book Ch.11 Managing Linguistic Data [chapter]
  • WordNet: a lexical database for English [link]
  • Handbook of Linguistic Annotation [link]

Speech Data

  • ELAN: create complex annotations on video and audio resources [link]
  • Praat: doing phonetics by computer [link]


The resources below focus more on software tools.

Git and GitHub

  • Git download & installation [link]
  • Software Carpentry Lesson: Version Control with Git [tutorial]
  • How to get started with Git and GitHub [YouTube]
  • git - the simple guide [link]
  • Tutorials: Become a git guru. (Uses Bitbucket instead of GitHub; ignore the parts on SVN.) [link]
  • Na-Rae's Git and GitHub Tips for class


  • GitHub Guides: Mastering Markdown [link]
  • Chrome browser Markdown Viewer extension [link]

Anaconda and Jupyter Notebook

  • Anaconda Python download & installation: use version 3.6. [link]
  • Lynda.com Tutorial: Introduction to Jupyter Notebook, basics, Markdown, How to Launch. (Skip "mathematical typesetting" video.) [tutorial]
  • Jupyter Notebook Tutorial: The Definitive Guide on DataCamp (more advanced) [link]

Command-line, Bash and Unix Tools

  • Software Carpentry Lesson: The Unix Shell [tutorial]
  • Thirty Useful Unix Commands [PDF]
  • Unix for Poets (in 2016) by Christopher Manning [PDF]

Text Editor

  • Atom [link] recommended for all systems.
  • Also good: Notepad++ [link] (Windows only), BBEdit [link] (Mac only) and Sublime Text [link] (all platforms).
  • On the command-line side, nano is easiest to use. It is already on Macs; Windows users will need to install it. (I will give you instructions.)

Related Topics

The topics below are not among the focus areas of this course, but parts of them will be relevant. They are provided for reference.

Natural Language Processing & NLTK

NLP-related topics will come up frequently throughout this course, and you are expected to pick them up as needed. For in-depth learning, refer to the LING 1330/2330 course page.

Python References