Go to: LING 1330/2330 home page  

Exercise 7: Steve Jobs vs. Regex Redux

An answer sheet (MS Word doc) is provided for this exercise: Ex7 Regex Redux.docx. For most questions, you will be entering two things: (1) a screenshot of your IDLE shell showing relevant code bits and output, and (2) your written answer or comment accompanying the code. See this example to get a sense of what's expected. You'll be submitting your saved IDLE shell session too, so make sure to save it.

We'll pick right back up where we left off with HW5! Start out by copy-pasting the first 5 paragraphs of the Steve Jobs Wikipedia article, then import Python's re module and get ready to use it:

 
>>> import re
>>> jobs = """Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an
American businessman, inventor, and investor best known for co-founding 
...
Jobs holds over 450 patents in total.[4]"""    
>>> re.findall(r'\d+', jobs)
['24', '1955', '5', '2011', '1970', '1980', '1955', '1972', '1974', '1976',
 '1979', '1983', '1984', '1985', '1985', '1986', '3', '1995', '28', '1997', 
 '1999', '2002', '3', '2003', '2011', '2022', '141', '450', '4']
  1. Using re.sub(), replace all instances of capitalized words ('Jobs', 'Apple', etc.) with 'BUELLER'.
  2. Using re.findall(), formulate a regular expression that matches all multi-word proper noun phrases ("Steve Wozniak", "The Walt Disney Company", etc.). They can be identified as a sequence of capitalized words. You may exclude unconventional capitalization patterns such as 'iPad'.
  3. Then, using re.sub(), replace all those matching instances with "<MULTIWORD-PNP>". You just performed a rudimentary form of a Named Entity Recognition (NER) task!
  4. Do the same, but let's preserve the proper NP itself this time. Substitute each multi-word proper NP with itself sandwiched between the opening tag <MULTIWORD-PNP> and the closing tag </MULTIWORD-PNP>. So, for example, "Steven Paul Jobs" should be replaced with <MULTIWORD-PNP>Steven Paul Jobs</MULTIWORD-PNP>.
  5. Try your own regex substitution operation. State what your goal/intention is, formulate your regex and execute the task, and then state how successful your operation was.
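The tag-wrapping idea in item 4 relies on back-references in the replacement string. As a minimal sketch on a deliberately different task (wrapping digit sequences, not proper nouns), the following shows two equivalent ways to reuse the matched text. The string `toy` and the `<NUM>` tag are made up for illustration and are not part of the exercise.

```python
import re

# Hypothetical toy text, unrelated to the Jobs article.
toy = "Call 555 before 1999 ends"

# Option A: capture the match in group 1 and refer back to it with \1.
tagged_group = re.sub(r'(\d+)', r'<NUM>\1</NUM>', toy)

# Option B: no explicit group; \g<0> stands for the entire match.
tagged_whole = re.sub(r'\d+', r'<NUM>\g<0></NUM>', toy)

print(tagged_group)
print(tagged_whole)
```

Both calls should produce the same string. Note the raw-string prefix `r` on the replacement: without it, `\1` would be read as an escape character rather than a group reference.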
Now, let's investigate what sort of word types are used in the Jobs article, using the re.search() method this time. First, tokenize the text and create an alphabetically sorted list of word types:
 
>>> import nltk
>>> jobs_wtypes = sorted(set(nltk.word_tokenize(jobs)))
>>> len(jobs_wtypes)
271
>>> jobs_wtypes[:20]
["''", "'s", '(', ')', ',', '.', '141', '1955', '1970s', '1972', '1974',
'1976', '1979', '1980s', '1983', '1984', '1985', '1986', '1995', '1997']
>>> jobs_wtypes[-20:]
['to', 'took', 'total', 'traveled', 'tumor', 'tumor-related', 'unsuccessful', 
'user', 'vector', 'verge', 'visual', 'was', 'wealth', 'which', 'with', 
'withdrawing', 'won', 'worked', 'year', '–']
  1. Suppose you want to find out which word types start with a capital letter. Using re.search() as a filtering condition in list comprehension, create a list of all capitalized word types. See today's lecture slides for how to combine list comprehension with re.search().
  2. Which word types have five or more "vowel" characters in them? (example: 'microcomputers') Again, list-comprehend through jobs_wtypes for matching entries. Use of re.I switch recommended.
  3. Which word types start and end with the same letter? 'died' would be an example. You will want to use group capture syntax, and also the re.I switch. (Note: we'll ignore single-letter words like 'a'.)
  4. Which word types have 4 or more consecutive "consonant" characters? Example: 'abstract'. Rather than listing out all 20+ consonant letters, use the negated set notation [^...]. A negated set matches any character not listed in it, so in addition to the usual 5 "vowel" letters and 'y', you'll also need to exclude certain additional characters.
  5. Try your own regex search operation on jobs_wtypes. State what your goal/intention is, formulate your regex and execute the task, and then state how successful the outcome was.
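The three techniques named in the list above (re.search() as a list-comprehension filter, group capture with a back-reference, and a negated set) can be sketched on a small made-up list. The list `toy_wtypes` and the three searches are illustrative assumptions, chosen so that they do not answer any of the questions directly.

```python
import re

# Hypothetical stand-in for jobs_wtypes; not taken from the article.
toy_wtypes = ['1955', 'Aaron', 'Apple', 'book', 'co-founder', 'iPad', 'see', 'the']

# re.search() returns a match object or None, so it works as a filter.
with_digit = [w for w in toy_wtypes if re.search(r'\d', w)]

# Group capture: \1 must repeat whatever group 1 matched.
# With re.I the repeat may differ in case, so 'Aa' in 'Aaron' counts.
doubled = [w for w in toy_wtypes if re.search(r'([a-z])\1', w, re.I)]

# Negated set: [^aeiou] matches ANY character that is not listed,
# which includes digits and punctuation, not only consonant letters.
no_vowels = [w for w in toy_wtypes if re.search(r'^[^aeiou]+$', w, re.I)]

print(with_digit)
print(doubled)
print(no_vowels)
```

The last search is worth a close look: the only "vowel-free" entry it finds is a number, which is a reminder to inspect what a negated set actually lets through.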


SUBMIT:
  • MS Word answer sheet: "Ex7 Regex Redux.docx" (template linked on top)
  • Your saved IDLE shell (you can edit to trim errors, etc.): "Ex7_shell.txt"
As usual, the SOLUTION document will be revealed in a post-submission link. Even if you feel good about your work, plan on studying the key document carefully. When teaching regular expressions, the hardest part is getting my students to see (1) that their expressions are incorrect and (2) WHY they are incorrect. They often assume, wrongly, that their expressions are good simply because they produce the correct number of matches with the given text.
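To make the point about match counts concrete, here is a small sketch with two made-up patterns and two made-up sentences (none of them from the solution key). On the first sentence the two patterns return identical results, which could suggest they are interchangeable; the second sentence shows that they are not.

```python
import re

# Two hypothetical attempts at "capitalized word".
loose = r'[A-Z][a-z]+'        # no word boundaries
strict = r'\b[A-Z][a-z]+\b'   # whole words only

text_a = "Steve Jobs founded Apple"
text_b = "Jobs unveiled the iPhone and NeXT"

# Same matches on text_a, so counting alone cannot tell them apart.
loose_a = re.findall(loose, text_a)
strict_a = re.findall(strict, text_a)

# On text_b the loose pattern picks up word fragments.
loose_b = re.findall(loose, text_b)
strict_b = re.findall(strict, text_b)

print(loose_a, strict_a)
print(loose_b, strict_b)
```

Comparing the returned lists themselves, and trying a pattern on text it was not tuned for, tends to expose problems that a matching count hides.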