LING 1340/2340 Data Science for Linguists, University of Pittsburgh


Term Project Guidelines

Overview
Individual students will work on a project of their own choice and design over the course of the semester, culminating in a class presentation followed by a final project delivery. The goal of this project is to make a linguistic discovery through application of data-intensive methods.
Components
A project consists of three main components: data, analysis, and presentation.
A. Data

Start with found data. Many linguistics research projects begin with a targeted data collection effort -- field work, surveys, elicitation, human subjects, and more. But the underlying assumption of data science is that data exists in the wild, and it is up to a data scientist to harness it. True to this assumption, we will have you start with data that is found in the wild, be it published data sets, corpora, or social media streams.
Add value. You should not, however, be content with data as it is packaged and presented to you. In many cases, your data will need a lot of work -- sourcing, cleaning up, and reorganizing. In other cases, you may be dealing with published data that's more or less ready for analysis. You are, then, expected to add value: augmenting, annotating and leveraging multiple data sets are all potential avenues.
Follow best data practices. Throughout this semester, we will be learning about best data practices, both emerging and firmly established in data science circles. Make sure your own data efforts and the output are in compliance.

B. Analysis

Linguistic analysis. You will have designed your data with a research question in mind. Your data should make a suitable empirical basis for your linguistic inquiry; your research question should be properly motivated and addressed in a theoretically and methodologically sound manner. Your interpretations of the findings should likewise be rigorously supported by your data. Even with meticulous preparation, however, your data may not prove fruitful ground for your original research question in the end. Pivoting is therefore allowed up to a certain point. Whether or not this move is ultimately successful, the reasons for pivoting and/or for the failure of the original research agenda must be thoroughly probed and documented: this sort of outcome is part and parcel of research grounded in real-life data, and it provides valuable insight.
Computational methods. In your linguistic analysis, you are expected to employ various computational methods including natural language processing, statistics, machine learning, topic modeling and more. Proper techniques should be used in accordance with your research question and the specifics of your data. At the same time, you should demonstrate mastery of these techniques by justifying your choice of computational methods and thoroughly evaluating the outcome, rather than blindly applying them and accepting the returned output. As with linguistic analysis, failed experimentation should not be brushed aside, but rather receive proper investigation and documentation, as this is all part of the discovery process.

C. Presentation

This component encompasses all audience-facing aspects of your project, which include but are not limited to:
Proper use of GitHub as a project-hosting and publication platform.

Overall documentation.

Structure, readability and organization of your Python code in the form of Jupyter notebooks.

Visualization through graphs and plots.

Your oral presentation, scheduled in the last two weeks of class.

Your final report: language, content, clarity, precision, organization, citation, etc.

Weight distribution. Ideally, a project will have the three components in perfect balance: a total of say 180 points will be equally split between data/analysis/presentation as 60-60-60. In reality, everyone's project will be different: some will have ambitious and challenging data curation plans, while others might wish to focus their efforts on extensive use of advanced computational methods. To accommodate this, a limited amount of trade-off is provisioned between the "data" and the "analysis" components: more data-focused projects therefore may have up to 70-50-60 distribution with more data-side contribution, while projects heavily focused on analysis are allowed to go easier on data-related efforts, with up to 50-70-60 split.
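
The trade-off described above comes down to simple arithmetic, which can be sketched in a few lines. This is only an illustration: the function name `is_valid_split` is hypothetical, and the limits encode my reading of the paragraph above (presentation stays at 60, data and analysis each fall between 50 and 70, and the three parts total 180), not an official rubric.

```python
def is_valid_split(data, analysis, presentation=60, total=180):
    """Return True if a data/analysis/presentation split fits the stated limits.

    Assumptions (a reading of the guideline, not an official rubric):
    presentation is fixed at 60, data and analysis each fall in 50..70,
    and the three parts add up to 180.
    """
    return (
        presentation == 60
        and 50 <= data <= 70
        and 50 <= analysis <= 70
        and data + analysis + presentation == total
    )


# Try the balanced split, the two extremes named above, and one that goes too far.
for split in [(60, 60), (70, 50), (50, 70), (80, 40)]:
    print(split, is_valid_split(*split))
```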

Submission

Your project should be initiated and developed in the form of a GitHub-hosted public repository. The final deliverables should include:

A README document and a LICENSE document accompanying your GitHub repository.
A written report containing a summary of your data and linguistic analysis. Anywhere between 5 and 10 pages, of which a minimum of 3 pages must be devoted to written descriptions (not including charts, graphs, examples, tables, etc.).
Your data.
Python scripts, in Jupyter Notebook form, that you created and used to process, explore and analyze the data.
Slides or other materials you used for your in-class presentation.

Milestones

The term project carries a total of 400 points, which you accrue over the course of the semester by meeting several structured milestones. Refer to the Schedule page for the dates.

Milestone | Points | Distribution (Data Ⓓ; Analysis Ⓐ; Presentation ⓟ) | What
1 Project ideas | 20 | ⒹⒹⒶⒶ | Send instructor 1-2 project ideas.
2 Project plan | 20 | ⒹⒶⓟⓟ | Finalize project plan, create a GitHub project repository.
3 1st progress report | 40 | ⒹⒹⒹⒹⒹⒹⓟⓟ | Focus on data curation, report progress.
4 2nd progress report | 40 | ⒹⒹⒹⒹⒶⒶⓟⓟ | Continue with data curation, attempt analysis.
5 3rd progress report | 40 | ⒹⒹⒶⒶⒶⒶⓟⓟ | Data-side effort should be mostly done; ramp up analysis.
6 Project presentation | 60 | ⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ | Oral presentation of your work in classroom.
7 Final project submission | 180 | ⒹⒹⒹⒹⒹⒹⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ ⒹⒹⒹⒹⒹⒹⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ | Turn in final project in the form of a GitHub repository.

*** More detail will follow as each milestone approaches. ***

Project ideas
You should come up with one or two project ideas. Include these details:

Working title.
A brief summary.
The DATA portion. Example points you should address: What will your data look like? What sorts of data sourcing and cleaning up effort will be involved? Do you have a sense of the overall data size you should be aiming for? Do you have an existing data source in mind that you can start with, and if so, what are the URLs or references?
The ANALYSIS portion. Example points you should address: What is your end goal? What linguistic analysis do you have in mind? Any hypothesis you will be testing? Are you planning to do any predictive analysis (machine learning, classification, etc.), and using what methods?
The PRESENTATION portion. I don't expect a whole lot of variability on this, but describe anything noteworthy you have in mind.
Submission: In the project_ideas directory of 'Class-Practice-Repo', create a markdown-formatted text file named project_ideas_YOURNAME.md. Commit, push to your fork, and create a pull request for Na-Rae.

Project plan
Launch your project as a GitHub repository and publish a project plan.

Create a repository within the Data-Science-for-Linguists organization.
Include the following files:

README.md: Include your name, project title, and a brief summary here.
LICENSE.md: You will eventually need to specify a license for your project. Create it now as a placeholder.
project_plan.md: This is your project plan. Start with your project ideas document and polish it up.
progress_report.md: This is where you will log your progress, newest entry on top. Add your first entry.

Submission: Your project repo counts as your submission.

1st Progress report
For the 1st progress report, focus on your data. This milestone consists of 30 data points and 10 presentation points. Goals:

Attempt and mostly complete the data acquisition process.
Start cleaning and reorganizing your data, and make some headway.
By now, you should have concrete ideas on the "data end game": what your data's final form will be like, the target total size, format, etc.
Devise a couple of options regarding the "sharability" of your data.
As for the progress report itself, it should contain the following:

A Python script in the form of a Jupyter notebook.

Clearly document each step of your data processing pipeline.
Bullet points have their uses, but I would also like to see some written summaries and explanations.
Remember: your Jupyter Notebook file is also your presentation. Make it easy for me and your classmates to understand what you are doing. Show your data and your processes.
Compile some stats on your data: size and makeup are the very basics.
Include a markdown cell where you outline a couple of options (or a single option, if you are fairly sure) regarding the "sharing plan" for your data. You should plan out how much you will be sharing with (1) the world and (2) our class. Make sure to include reasoning behind each.

Some form of your data. If all of your data is currently stored in a git-ignored data/ directory, make samples available in the data_samples/ directory.
progress_report.md: Keep logging your progress whenever you make some, in between milestone due dates. This one is more for your own record, and it does not have to be very detailed: I thought it would be nice to have a place where you can record your regular progress in a format that's roomier than git commit messages.
Above are the minimum requirements, but do feel free to impose additional organization as you see fit. This is your project after all! But when you do so, make sure you provide explanation somewhere.
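
As a starting point for the "size and makeup" statistics mentioned above, here is a minimal sketch using only the standard library. It assumes your data is a folder of plain-text files and that whitespace tokenization is good enough for a first pass. Neither is required by this assignment, and the function name `corpus_stats` is made up for illustration, so adapt the pattern and the tokenizer to your own data.

```python
from collections import Counter
from pathlib import Path


def corpus_stats(data_dir, pattern="*.txt"):
    """Compile basic size statistics for plain-text files under data_dir."""
    token_counts = Counter()
    n_files = 0
    for path in sorted(Path(data_dir).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        # Naive lowercased whitespace tokenization: an assumption,
        # swap in a real tokenizer that suits your data.
        token_counts.update(text.lower().split())
        n_files += 1
    return {
        "files": n_files,
        "tokens": sum(token_counts.values()),
        "types": len(token_counts),
        "top_words": token_counts.most_common(5),
    }


# Example call (hypothetical folder name):
# corpus_stats("data")
```

The returned dictionary can be printed directly in a notebook cell, or turned into a small table alongside your written summary.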

Submission: Your project repo counts as your submission.

2nd Progress report
For the 2nd progress report, you ease up your focus on data and start working on analysis. This milestone consists of 20 data points, 10 analysis points and 10 presentation points. Goals:

Complete the data acquisition process.
Be mostly done with cleaning and reorganizing your data. It should be more or less in its final form.
The overall format, shape and size of your data should be known at this point. Document them.
Finalize the license for your data and project, and get your data into a sharable form.
Start bringing the analysis part into your project. In particular, your manipulation of data should be shaped by the linguistic analysis you are after.
As for the progress report itself, it should contain the following:

A Python script in the form of a Jupyter notebook. You have three options:

EXISTING: the existing script file which was part of your 1st progress report. You continue to update and add to it.
NEW REPLACEMENT: a whole new script file that replaces the earlier one. The script you submitted earlier as part of the 1st progress report is now regarded as initial exploration and is no longer part of your work pipeline.
NEW CONTINUING: a new script file that's part of a pipeline. The earlier script you submitted for the 1st progress report accomplishes PART 1 of your work pipeline, and this new file is PART 2 that picks up where PART 1 left off.
At the top of your script, specify which type it is so I will have a sense of how the script fits in your project.
Your data: include your data in a designated folder.

If including data in its entirety, make sure it's within your right to do so. Present a justification in LICENSE_notes.md.
If you are including samples, make sure it's within fair use. Document your sampling method and justification in LICENSE_notes.md.
If you are including derived data, again provide justification.

Two license-related documents.

LICENSE.md is a binding licensing document, so it is intended as audience-facing. This is where you lay out your licensing terms for your future visitors wanting to use your data and code. You may adopt popular, existing licensing standards: revisit Lauren Collister's materials, and consult these documents.
LICENSE_notes.md is your report on the licensing aspect of your project. Give background information and include the motivation and justification behind your licensing terms. Describe the process through which you arrived at your licensing decisions.

progress_report.md: Keep logging your progress whenever you make some, in between milestone due dates. Again, each entry does not have to coincide with a progress report due date. Whenever you make progress, log it here.
project_plan.md. Many of you made substantial changes to your original project plan. Update the document to reflect the current project direction.
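
If you go the sample route described above, a fixed random seed makes the sampling method easy to document in LICENSE_notes.md and easy to reproduce. The following is a minimal sketch, not a prescribed method: the function name `copy_sample`, the sample size, and the seed value are all assumptions made for illustration, and whether a given sample falls within fair use is still for you to judge.

```python
import random
import shutil
from pathlib import Path


def copy_sample(src_dir, dest_dir, k=10, seed=1340):
    """Copy a reproducible random sample of up to k files from src_dir to dest_dir."""
    # Sort first so the sample does not depend on file-system listing order.
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    rng = random.Random(seed)  # fixed seed -> same sample on every run
    chosen = rng.sample(files, min(k, len(files)))
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy2(path, dest / path.name)
    return sorted(p.name for p in chosen)


# Example call (hypothetical folder names):
# copy_sample("data", "data_samples", k=10)
```

Recording the seed and the value of `k` in LICENSE_notes.md is enough for someone else to regenerate the same sample from the full data.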
Now, a tricky part for those of you who have ever pushed data files to your GitHub repo that you have now decided shouldn't be shared. The bottom line: you should abandon your repo and make a FRESH NEW ONE. Details:

A git repo is essentially a time machine. Even after you delete a file, it's not entirely gone from git's history and can easily be retrieved.
There IS A WAY to reach deep into your git commit history and completely scrub offending files from the entire history of your git. That way, it's as if the file was never committed. If you want, you can Google it and try it out.
But that's advanced stuff, and we are not here to become a git master. So, I am allowing you to start anew, for *one last time*.
Don't delete your old repo just yet: we'll keep it around for a couple of weeks. Make it private, and append 'DROPPED' to the name of the repo to clearly mark it as such. And then start a new one. In your new repo, explain in progress_report.md that you've done this.
Submission: Your project repo counts as your submission.

NOTE: After 'submission', don't hold yourself back from pushing more updates and changes thinking you should freeze the repo until I post the grade. There's no need: I have access to your repo at every stage it moves through.

3rd Progress report
For the 3rd and last progress report, you should focus on analysis. This milestone consists of 10 data points, 20 analysis points and 10 presentation points. Goals:

Wrap up your data-side effort: your data is in its final form with clear documentation.
The license for your data and project is all ready, and your data is in its ready-to-share form.
Make headway in the analysis part of your project. You should have some preliminary findings that are either sufficiently close to what you set out to investigate, or at least meaningful enough in their own right that they point to immediate next steps.
As for the progress report itself, it should contain the following:

Your repo: let's tidy it up a bit.

Give your project repo a DESCRIPTIVE name that indicates what your project is about, not something like "Narae 1340 project". Alicia ('Discourse Analysis ART Corpus') and Paige ('2016 Election Project') have the right idea.

A new repo name means a new GitHub remote address. Don't forget to update your local git's remote setting!
You will also need to update the link on your Visitor's Log file.

Make sure your own name shows up below your repo title as part of the subtitle. Again, Alicia and Paige have the right idea.
Update your README.md file:

Make sure the file has a nice blurb that introduces your project.
Also, put a link to your Visitor's Log file on it.

A Python script in the form of a Jupyter notebook. Like last time, you have three options:

EXISTING: the existing script file which was part of your 1st progress report. You continue to update and add to it.
NEW REPLACEMENT: a whole new script file that replaces the earlier one. The script you submitted earlier as part of the 1st progress report is now regarded as initial exploration and is no longer part of your work pipeline.
NEW CONTINUING: a new script file that's part of a pipeline. The new file is PART 3 that picks up where PART 2 left off, etc.
At the top of your script, specify which type it is so I will have a sense of how the script fits in your project.
Your data and license: include your data in a designated folder. Many of you did not get the two files right last time: LICENSE.md and LICENSE_notes.md. Make sure they are in good shape this time around.
progress_report.md: As always, keep logging your progress whenever you make some, in between milestone due dates. Again, each entry does not have to coincide with a progress report due date. Whenever you make progress, log it here.
project_plan.md. Make sure this document is up to date.
Submission: Your project repo counts as your submission.

Presentation guidelines
Format
Your slot is 23 minutes long: 19 minutes for the presentation portion plus 4 minutes provisioned for questions. Prepare PowerPoint, PDF, or any other visual aids. You may go over some of your GitHub repo contents, but if you choose to do so it should be clear that you are following a premeditated plan, not just ad-libbing. Rehearse and time your presentation! It shouldn't be too long or too short.
Content
Your project, of course! But unlike your project itself, which dives right into data, your presentation should start by motivating and contextualizing your project topic. That means supplying background information, research questions, theoretical foundations, related literature, and so on. And be sure to show your data and findings through visualization. All the nifty plots and charts you have been generating can be saved as image files, which you can incorporate into your slides.
Evaluation
The presentation component of your project carries a total of 60 points. Your presentation will be evaluated based on the following: preparation, accuracy and depth of content, originality, engagement with audience, and delivery. If you are presenting in Week 14, I will take into consideration that your analysis part may be slightly less developed.

Final project submission guidelines
You've worked hard through many project milestones, and it's time to prepare your project for final submission. Unlike the three progress reports where the focus was firmly on the process, the final submission should highlight the results and your interpretation of them. The process should still get a fair and clear illustration, but you should prune out from your production code any "branches" of trial and error that led to a dead end. (You are encouraged to move any old code bits into a designated subfolder.) All in all, your GitHub repo should present a coherent picture of your project, from start to finish.

Your repo: files and folders
Below are the required files with fixed file names:

LICENSE.md
LICENSE_notes.md
project_plan.md
progress_report.md
README.md (Details below)
final_report.md (Details below)
In addition, you should have:

A designated folder for your data, and data files inside
A designated folder for figures, graphs and other image files (Details below)
Jupyter Notebook files and their Markdown versions (Details below)
Your presentation slides, saved in PDF format
Lastly, some of you might have extra files and directories serving some purpose. Perhaps a folder containing some old code that is no longer relevant, or something like that. Make sure to explain what these are in your README.md document.

README.md
Revamp your README document and give it a proper structure. This document is what greets your visitors, so its goal should be to give them a proper orientation. It should include:

Front matter: the title of your project, your name, email, date
A brief description of your project
A brief description of the "found" data set you started your project with. Include a web link (if any) along with proper attribution.
A "directory" of your repo. Have a bullet-point list of the files and folders along with a one-line description of what they are. Make them into clickable links so your visitors can easily navigate to the files/folders. Do NOT list individual data files or image files under a subfolder.
A link to your visitor's log.
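
The "directory" item above is just a bullet list of Markdown links, which you can type by hand or generate. Below is a minimal sketch of the generated version. The function name `directory_listing` and the example entries are hypothetical, and it assumes paths are relative to the top of the repo and contain no spaces.

```python
def directory_listing(entries):
    """Format (path, description) pairs as a Markdown bullet list of links."""
    lines = []
    for path, description in entries:
        # [text](target) is standard Markdown link syntax; a relative target
        # resolves against the location of README.md in the repo.
        lines.append(f"- [{path}]({path}): {description}")
    return "\n".join(lines)


# Hypothetical entries; replace with the files and folders in your own repo.
entries = [
    ("final_report.md", "Final report"),
    ("data/", "Data files"),
    ("images/", "Figures and plots"),
]
print(directory_listing(entries))
```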

Images folder and files
Your final_report.md file will need figures and graphs for illustration.

Have a folder, named images or something, where all image files should go.
Plots can easily be exported. See here. Alternatively, they are automatically exported as .png files during Jupyter Notebook --> Markdown conversion.
With other types of objects, say trees, see if you can export them as an image file. As a last resort, you may use screenshots, but **only as a very last resort**.
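
For plots, one common route is matplotlib's `savefig`. The sketch below assumes your plots are made with matplotlib and that your image folder is called `images/`. The data values and the file name are made up purely for illustration.

```python
from pathlib import Path

import matplotlib.pyplot as plt

# Hypothetical folder and file names; adjust to your repo layout.
image_dir = Path("images")
image_dir.mkdir(exist_ok=True)
out_path = image_dir / "word_length_counts.png"

# Made-up toy data, only to have something to plot.
word_lengths = [1, 2, 3, 4, 5, 6]
counts = [12, 40, 55, 38, 21, 9]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(word_lengths, counts)
ax.set_xlabel("Word length (characters)")
ax.set_ylabel("Count")
ax.set_title("Toy example: word length distribution")

# dpi and bbox_inches are optional; they give a crisper, tightly cropped image.
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```

The saved file can then be displayed in final_report.md with the usual Markdown image syntax, for example `![Word length distribution](images/word_length_counts.png)`.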

Your code: Jupyter Notebook
The same usual guidelines for your Jupyter Notebook files continue to apply: your code should work correctly while walking the audience through the whole process. This time around, however, your code should be in a streamlined form: you should prune your code of any unsuccessful bits and experiments that have since been abandoned. In other words, your code files should demonstrate your project in a lean and coherent manner. Some important points:

For many of you, breaking down your code into multiple Jupyter Notebook files will make organizational sense. For example, the first notebook could focus on data clean-up effort, and the second one takes from there and conducts data analysis, and so forth.
Still, your Jupyter Notebook file will get long and unwieldy. At the top, include a "Table of contents" that provides handy shortcuts to various subsections. See this screenshot for how it's done.
Your code may produce interim outputs (such as saved pickle files). If you decide against sharing them, make sure to exclude them from your GitHub repo via .gitignore.
Make sure to "Restart & Run All" your Jupyter Notebook file before pushing to your GitHub repo! You want all cell outputs to be tidy.
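
A table of contents can be written by hand in a Markdown cell, but for long notebooks it may be easier to generate a first draft from the headings. The following is a rough sketch under stated assumptions, not necessarily the method shown in the screenshot: the function name `notebook_toc` is hypothetical, it assumes the notebook is a standard .ipynb JSON file, and it assumes Jupyter's convention of building heading anchors by replacing spaces with hyphens. Anchor rules differ between renderers, so click through the links after pasting them in.

```python
import json


def notebook_toc(notebook_json):
    """Build Markdown table-of-contents lines from a notebook's headings."""
    nb = json.loads(notebook_json)
    lines = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # code cells may contain '#' comments; skip them
        source = cell.get("source", "")
        if isinstance(source, list):  # cell text is often stored as a list of lines
            source = "".join(source)
        for line in source.splitlines():
            if not line.startswith("#"):
                continue
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            if not title:
                continue
            # Assumption: anchors are the heading text with spaces -> hyphens.
            anchor = title.replace(" ", "-")
            lines.append(f"{'  ' * (level - 1)}- [{title}](#{anchor})")
    return "\n".join(lines)


# Example use (hypothetical file name):
# with open("data_cleanup.ipynb", encoding="utf-8") as f:
#     print(notebook_toc(f.read()))
```

Paste the printed lines into a Markdown cell at the top of the notebook. Note that a line starting with `#` inside a fenced code sample in a Markdown cell would be mistaken for a heading by this sketch.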

Your code: Markdown version
Here's Na-Rae the student's Jupyter Notebook file. Note that the section links do not work. Although GitHub does an admirable job of rendering Jupyter Notebook files, it breaks HTML anchors which makes it impossible for us to refer to relevant bits of code in our reports. So, we will go to the trouble of exporting our Jupyter Notebook to Markdown, which preserves section anchors. See Na-Rae the Student's Markdown version of Python code, and her final report which is now able to refer back to relevant sections of code. Yep, it's absolutely worth the trouble! Instructions:

Save your Jupyter Notebook files as Markdown (.md) files. If your code produces plots, a .zip file will be created containing the Markdown file along with the plot images as .png files.
We don't want numerous image files littering the top level of the repo, so move them into a designated image folder, say images/. You will then have to edit your Markdown file to add the 'images/' path in front of each image file name.
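
Editing every image path by hand gets tedious when a notebook has many plots. Here is a minimal sketch of doing it with a regular expression. It assumes the exported Markdown refers to images with the standard `![alt](file.png)` syntax and that your image folder is `images/`; the function name `add_image_prefix` and the sample file name are made up for illustration.

```python
import re

# Matches Markdown image links of the form ![alt](target), with no title part.
IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")


def add_image_prefix(markdown_text, prefix="images/"):
    """Prepend prefix to relative image paths in Markdown image links."""
    def fix(match):
        opening, target, closing = match.groups()
        # Leave web URLs, absolute paths and already-prefixed paths untouched.
        if target.startswith(("http://", "https://", "/", prefix)):
            return match.group(0)
        return f"{opening}{prefix}{target}{closing}"

    return IMAGE_LINK.sub(fix, markdown_text)


sample = "Some text\n\n![png](output_3_0.png)\n"
print(add_image_prefix(sample))
```

To apply it to a real file, read the exported .md file into a string, pass it through `add_image_prefix`, and write the result back. Running it twice is harmless, since paths that already start with `images/` are skipped.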

final_report.md
Think of this as a usual "final report" that is in a markdown format instead of MS Word. Details:

Shoot for around 1,500 words, excluding references if any. That's the length of a short paper of about 4-5 pages. (Remember to use the wc command!)
Use headers and clearly mark your sections.
Use visualization! Display figures you had saved as external image files. Example here.
Link to the relevant sections of your code. The Markdown version of your code should be easily reference-able, because section headers are automatically turned into anchors. See this example.
What should you include in this report? Revisit the overall description of the term project at the top of this page.
Have a paragraph devoted to the overall history and process of your project, warts and all. Document setbacks, false starts, and other difficulties you experienced.
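
If you would rather check the word count from inside Python than with `wc`, the sketch below does roughly the same thing while leaving out a references section. It counts whitespace-separated words the way `wc -w` does, so Markdown markup is counted too and the result is an estimate. The function name `count_words` is hypothetical, and the code assumes your references sit under a heading whose text is exactly "References".

```python
def count_words(markdown_text, stop_heading="references"):
    """Count whitespace-separated words, stopping at a references heading."""
    total = 0
    for line in markdown_text.splitlines():
        stripped = line.strip()
        # Assumption: references live under a heading like "## References".
        is_heading = stripped.startswith("#")
        if is_heading and stripped.lstrip("#").strip().lower() == stop_heading:
            break
        total += len(stripped.split())
    return total


# Example use (file name as required by the guidelines):
# with open("final_report.md", encoding="utf-8") as f:
#     print(count_words(f.read()))
```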

Submission: Your entire project repo is your submission. Make sure everything is in order and looks good.