LING 1340/2340 Data Science for Linguists

Fall 2017, University of Pittsburgh

Meetings: Tue & Thu 4pm - 5:15pm
Classroom: A202 Langley Hall
Instructor: Na-Rae Han
Pitt ID & Google ID: naraehan
Office hours: Mon & Wed 2:30-4pm and by appointment, CL G17

Data science is a fast-growing professional and academic discipline that is highly interdisciplinary in nature. Its practice centers on domain expertise; accordingly, this course introduces linguistics majors to core methods and practices in data science as they pertain to linguistic inquiry. Students will first learn the fundamentals of structuring, manipulating and sharing various forms of linguistic data. They will then receive hands-on training in practical aspects of data processing, including handling large quantities of text data (“big data”) and creating statistical language models through machine learning, and get acquainted with the emerging field of knowledge engineering and ontology. Additionally, they will have a chance to apply data-intensive methods to a term project of their choice. Upon successful completion of this course, students will be able to (1) identify the best methods for representing and analyzing linguistic data for a given purpose, (2) transform and process linguistic data in large volumes, and (3) understand how statistics-driven text analytics and machine learning methods operate.

The course assumes that the students have an introductory knowledge of linguistics as well as basic competency in Python, a general-purpose programming language. The prerequisites therefore are:

  1. LING 1000 “Introduction to Linguistics”, AND
  2. An introductory Python course, which can be one of:
    LING 1330/2330 “Intro to Computational Linguistics”, CS 0008 “Intro to Programming with Python”, CS 0155 “Data Witchcraft”

Knowledge of statistics is highly recommended but not required. Students without a statistics background will need to pick up the basics quickly as they come up.

Python Data Science Handbook (2016, O'Reilly Media) is probably the closest thing to a textbook we will have. It will, however, be used more as a reference book. The scope of this course goes beyond core data science skills; for the topics the book does not cover, articles and other materials will be assigned as needed. Throughout, we will be using various resources available on the web: see this Learning Resources page for a list.

Required Software
We will be using Python 3, specifically Continuum's Anaconda distribution. It ships with Jupyter Notebook and Spyder as its main IDEs, and we will be using them extensively. Another key piece of software is Git, which enables version control and collaboration. In addition, we will learn Unix tools and the Bash shell; a text editor is also required.

  • Anaconda Python, version 3.6 (included IDEs: Jupyter Notebook, Spyder)
  • Git (NOTE: install the command-line version of git. We will NOT be using GitHub's desktop GUI.)
  • Bash shell and Unix tools (Mac & Linux users already have them as part of their OS; Windows users get them via Git-Bash, which is installed along with git.)
  • Text editor. Atom recommended for all systems. Also good: Notepad++ (Windows only), BBEdit (Mac only) and Sublime Text (all platforms).
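
Once everything in the list above is installed, a quick check along the following lines may help confirm that the tools are reachable. This is only a minimal sketch, not an official course script: the function name `check_environment` and the tool list are hypothetical, and it assumes the tools are invoked as `git`, `bash` and `jupyter` from your command line. On Windows, some of these may only be visible from Git-Bash or the Anaconda Prompt.

```python
import shutil
import sys

# Command names assumed for the tools listed above (an assumption about
# how they appear on your PATH, not something the course page specifies).
REQUIRED_TOOLS = ["git", "bash", "jupyter"]


def check_environment(tools=REQUIRED_TOOLS):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    # The course uses Python 3, so confirm the running interpreter's major version.
    results = {"python3": sys.version_info.major == 3}
    for tool in tools:
        # shutil.which returns the executable's full path, or None if not found.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Saving this as a file and running it with Anaconda's Python prints one line per tool; anything reported as MISSING is either not installed or not on the PATH of the shell you launched it from.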

Required Hardware
The software applications above must install and run on your own personal laptop, which you are expected to bring to every class meeting. Your laptop should run one of these operating systems: Mac OS X (10.6 or later), Windows (7, 8.1 or 10), or Linux (any distribution). Mobile and cloud-based operating systems are not supported: iPads and Chromebooks are not suitable platforms for this class.

Course Requirements, Grading and Policies
Please read the Course Policies page.

Course Schedule
Please see the Course Schedule page.