
Homework 4: Wrangling Big Data on a Supercomputer

This homework carries a total of 50 points.
Let's switch to a more proper venue for the Yelp data -- the CRC's h2p supercomputer cluster. You will reprise what you did on your own personal computer for To-Do #11, but this time give it a proper supercomputing treatment.

Part 1: To-Do #11 Redux

  1. Log into your CRC account.
  2. In your home directory, create a new directory called hw4_yelp. This is where you will save all files relating to this assignment.
  3. All Yelp data files have already been loaded to the server. You can find them in /ihome/ling1340-2017f/shared_data/yelp_dataset.
  4. Create the Python script file process_reviews.py in your hw4_yelp directory. (If you need a reminder of the general shape of such a script, a minimal sketch appears at the end of this list.)
  5. Before you run the script, you should switch into an interactive session. Execute crc-interactive.py -s.
  6. See which Python versions are available: module spider python
  7. Load Anaconda's python3: module load python/anaconda3.5-4.2.0
  8. Verify you are now using the correct python version: which python
  9. You are ready to run the python script! python process_reviews.py /long/path/to/review.json
  10. Hope that didn't take too long. Now run it again, but this time do two things differently:
    • Time the whole process to find out how long it takes. Simply put time in front of the whole thing.
    • Redirect the output to a file named yelp_out.txt. Remember you can do this through >.
    • So, the new command will look like: time python process_reviews.py filename > yelp_out.txt.
  11. When it's done, look through yelp_out.txt using less to verify the output.
  12. Exit from the interactive shell by executing exit. You should now be back in your login shell.
  13. How are we doing as a group? This was the report generated by crc-usage.pl ling1340-2017f as of 11/9/2017:
    Account:                                              ling1340-2017f
    Total SUs:                                                     10000
    Proposal End:                                               11/02/18
    --------------------------------------------------------------------
    Cluster:                                                         smp
    --------------------------------------------------------------------
                      User        SUs (CPU Hours)       Percent of Total
    ---------------------- ---------------------- ----------------------
             Cluster Total                      1                 0.0124
                    als333                      0                 0.0000
                     awr14                      0                 0.0036
                     ben25                      0                 0.0000
                     blh82                      0                 0.0000
                     cjl71                      0                 0.0000
                     daz53                      0                 0.0013
                     juffs                      0                 0.0000
                    kak275                      0                 0.0000
                     ktl14                      0                 0.0000
                     mmj32                      0                 0.0000
                  naraehan                      0                 0.0000
                      nhl3                      0                 0.0028
                     peh40                      0                 0.0039
                     rwc27                      0                 0.0007
    --------------------------------------------------------------------
    
  14. Find out where we stand now by running the same crc-usage.pl command. Did we make any dent? Did you contribute?
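
As promised in step 4, here is a minimal sketch of what such a processing script might look like. It is an illustration only: it just tallies reviews by star rating, and your actual process_reviews.py should do whatever To-Do #11 asked for. The 'stars' field name assumes the JSON-lines format of the Yelp review file, where each line is one review object.

    # process_reviews.py -- a minimal sketch only; put your To-Do #11 logic here.
    # Assumes a JSON-lines file: one review object per line, with a 'stars' field.
    # Usage: python process_reviews.py /long/path/to/review.json
    import sys
    import json
    from collections import Counter

    star_counts = Counter()

    with open(sys.argv[1], encoding='utf-8') as f:
        for line in f:
            review = json.loads(line)
            star_counts[review['stars']] += 1

    print('Total reviews:', sum(star_counts.values()))
    for stars in sorted(star_counts):
        print(stars, 'stars:', star_counts[stars])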

Part 2: Experiment

You should experiment. Note that we have used up only 1 of our 10,000 available SUs (Service Units, i.e., CPU hours). Let's see how much we can use up. You are free to do whatever you want, but some suggestions are included below.
  • Write another Python script that does something else, potentially more complex.
  • Jupyter Notebook is not available on the cluster, so we'll have to make do with boring old Python scripts.
  • Pretty much all the Python libraries we need should already be on the cluster, including pandas, sklearn, and nltk. If you find something that's missing, let me know and I'll ask Barry.
  • Suggestion 1. How about average review lengths for 1-, 2-, 3-, 4-, and 5-star reviews? Are longer reviews more positive?
  • Suggestion 2. What is the distribution of 'horrible' vs. 'scrumptious' across 1- through 5-star reviews? (A rough sketch covering Suggestions 1 and 2 appears after this list.)
  • Suggestion 3. How about some machine learning -- perhaps a classifier that tells 1-star reviews from 5-star reviews? (A sketch appears after this list.) But if you are going that route, the default configuration for an interactive session might not be sufficient.
    • Watch your memory usage. By default, a single-core session can use up to 20GB of RAM. If your process fails because it runs out of memory, start the interactive session with more memory; for example, crc-interactive.py -s -b 40 requests 40GB.
    • Also, if your job takes longer than the default 1 hour, request more time when you start your session. For example, crc-interactive.py -s -t 3 requests 3 hours.
  • The "FOO" principle applies even on a supercomputer: start small. You don't want to find out your script has a bug 30 minutes into a big job! Create a FOO.json in your hw4_yelp directory as a mini version of review.json, maybe containing the first 10,000 lines, and write a script that works on that file. Then, when you are sure your code works and does what it is supposed to do, unleash it on the full review.json file. You can even do this on your laptop and then simply copy over your script.
  • But how do you copy a file over to your CRC account anyway? You do it through scp (secure copy). The syntax is
    scp your_local_file pittid@h2p.crc.pitt.edu:~
    This copies your local file into your home directory on h2p.
  • Are you using the Atom editor? I hear that you can edit a file on a remote server. Perhaps an add-on package is needed. I suggest you google "remote edit" and find out!
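
For Suggestions 1 and 2, a rough sketch along the following lines could be a starting point. It is an illustration only: the script name is made up, the 'stars' and 'text' fields assume the review file's JSON-lines format, and the naive whitespace tokenization will miss things like 'horrible.' with trailing punctuation (nltk would do better).

    # review_stats.py -- a rough sketch for Suggestions 1 and 2, not a required solution.
    # Usage: python review_stats.py FOO.json   (try the small sample first)
    import sys
    import json
    from collections import defaultdict

    counts = defaultdict(int)        # star rating -> number of reviews
    total_words = defaultdict(int)   # star rating -> total number of word tokens
    hits = defaultdict(lambda: defaultdict(int))   # star rating -> target word counts
    targets = {'horrible', 'scrumptious'}

    with open(sys.argv[1], encoding='utf-8') as f:
        for line in f:
            review = json.loads(line)
            stars = review['stars']
            words = review['text'].lower().split()   # naive tokenization
            counts[stars] += 1
            total_words[stars] += len(words)
            for w in words:
                if w in targets:
                    hits[stars][w] += 1

    for stars in sorted(counts):
        avg = total_words[stars] / counts[stars]
        print('{} stars: {} reviews, avg length {:.1f} words, '
              "'horrible' {}, 'scrumptious' {}".format(
                  stars, counts[stars], avg,
                  hits[stars]['horrible'], hits[stars]['scrumptious']))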
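
For Suggestion 3, a bare-bones bag-of-words classifier might look something like this sketch. The file name, features, and model choice are only examples, and on the full review.json you will likely need the larger memory and time limits mentioned above.

    # star_classifier.py -- a sketch for Suggestion 3: telling 1-star from 5-star reviews.
    # Usage: python star_classifier.py FOO.json   (start with the small sample!)
    import sys
    import json
    import random

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    texts, labels = [], []
    with open(sys.argv[1], encoding='utf-8') as f:
        for line in f:
            review = json.loads(line)
            if review['stars'] in (1, 5):      # keep only the extremes
                texts.append(review['text'])
                labels.append(review['stars'])

    # Shuffle so the train/test split isn't biased by the file's ordering
    random.seed(0)
    order = list(range(len(texts)))
    random.shuffle(order)
    texts = [texts[i] for i in order]
    labels = [labels[i] for i in order]

    # Bag-of-words features; cap the vocabulary size to keep memory in check
    X = CountVectorizer(max_features=20000).fit_transform(texts)

    cut = int(0.8 * X.shape[0])                # 80% train, 20% test
    clf = LogisticRegression()
    clf.fit(X[:cut], labels[:cut])

    predictions = clf.predict(X[cut:])
    accuracy = sum(p == y for p, y in zip(predictions, labels[cut:])) / (X.shape[0] - cut)
    print('1-star vs. 5-star accuracy on held-out reviews: {:.3f}'.format(accuracy))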
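
Finally, carving out the FOO.json sample takes only a few lines (the file names simply follow the suggestion above):

    # make_foo.py -- save the first 10,000 lines of review.json as FOO.json
    # Usage: python make_foo.py /long/path/to/review.json
    import sys

    with open(sys.argv[1], encoding='utf-8') as infile, \
            open('FOO.json', 'w', encoding='utf-8') as outfile:
        for i, line in enumerate(infile):
            if i >= 10000:       # stop after the first 10,000 reviews
                break
            outfile.write(line)
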
SUBMISSION: I have read access to your hw4_yelp directory, so that's your submission. It should include process_reviews.py and yelp_out.txt. For your experiment portion, include a brief explanation in a readme.txt file.