LING 1330/2330 Introduction to Computational Linguistics, University of Pittsburgh


Exercise 7: Regular Expressions vs. Steve Jobs

The goal of this exercise is to give you practice writing regular expressions on regex101. We will use Steve Jobs's Wikipedia entry as our text. Copy the first four paragraphs above the gray Contents box ('Steven Paul Jobs ... Presidential Medal of Freedom.'), and paste them into the text window of regex101. First, some notes on using the site:

See the screenshot here.
Choose the Python flavor of regex.
Adjust the font size so that all of the text fits in the window. That way, you can quickly scan your matches without having to scroll. (In Chrome, pressing Control and the minus key zooms out.)
The match count is displayed in the top right corner. Don't rely on it alone: you must visually inspect your matches to make sure that they are legitimate, that no true matches are left out, and that the match boundaries are correct.
Leave the gm flags turned on, and do not use the "ignore case" flag for this exercise.
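For those curious how these site settings carry over to Python itself, here is a rough sketch using the standard re module. The two-line sample text is made up for illustration and is not the Wikipedia text. The "g" flag corresponds loosely to asking for all matches (re.findall or re.finditer), and the "m" flag to re.MULTILINE.

```python
import re

# Made-up two-line sample (an assumption for illustration only).
text = "Apple was founded in 1976.\napple pie is unrelated."

# "m" flag ~ re.MULTILINE: ^ and $ also match at each line break.
# "g" flag ~ collecting every match, which re.findall does.
all_lines = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(all_lines)       # ['Apple', 'apple']

# Without MULTILINE, ^ anchors only at the very start of the text.
first_only = re.findall(r"^\w+", text)
print(first_only)      # ['Apple']

# No "ignore case" flag: matching stays case-sensitive.
case_sensitive = re.findall(r"apple", text)
print(case_sensitive)  # ['apple']
```

You do not need Python for this exercise; regex101 is enough.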
Additional instructions:

Write down your answers in a document file. Word doc and .txt documents are both fine. Feel free to include screenshots, which will be good for your reference.
Is your match count different from mine? Perhaps you are wrong, or perhaps the Wikipedia page has been edited in the meantime. Check this revision history page; my solution is based on Wham2001's edit made on October 14 at 18:53. Base your answers on this version.

Ten years after Jobs's death, there's still a feverish war raging on Wikipedia to claim his soul, as evidenced by the very frequent edits his page suffers. Some edits are downright trollish... why won't these people let us practice regex in peace! At any rate, do use the direct link to the correct version above.

Even if you feel good about your work, plan on studying the posted SOLUTION document carefully. When teaching regular expressions, the hardest part is getting my students to see (1) that their expressions are incorrect and (2) WHY they are incorrect. A lot of them assume their expressions are good just because they produce the correct number of matches with the given text. Oftentimes they are wrong!
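To see why a correct match count proves little, consider the following minimal sketch. The task (find the word cat), both sample sentences, and the helper name spans are made up for illustration and have nothing to do with the Jobs text. Two patterns agree on the count for one text, yet only one of them is actually right.

```python
import re

def spans(pattern, text):
    """Return (start, end, matched_text) for each match, for manual inspection."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

practice = "The cat sat. A cat ran."      # made-up practice text
unseen = "The cat saw a caterpillar."     # made-up text the pattern was never tried on

loose = r"cat"        # matches the letters c-a-t anywhere
strict = r"\bcat\b"   # matches cat only as a whole word

# Same count on the practice text...
print(len(spans(loose, practice)), len(spans(strict, practice)))  # 2 2

# ...but the loose pattern also fires inside 'caterpillar'.
print(spans(loose, unseen))    # [(4, 7, 'cat'), (14, 17, 'cat')]
print(spans(strict, unseen))   # [(4, 7, 'cat')]
```

Printing the spans is the Python counterpart of visually inspecting the highlighted matches on regex101.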
About composing regular expressions:

With a small text, it is possible to write a trivial regular expression that simply lists all alternative forms separated by the "|" operator. DO NOT DO THIS. For example, for Q5, you can easily write /iMac|iTunes|iPod|iPhone|iPad/ and be done with it -- but the point is to write a more compact and elegant regular expression that captures the structure shared by the target strings.
On the flip side, do not over-compact your expression. For example, /have|has|had/ can be further compacted to /ha(ve|s|d)/, but then you sacrifice readability for a small gain. You should try to strike a happy medium.
When it comes to finding words, make sure to match whole words. When finding words ending in ing, don't just match the ing part! Your regex should match entire words such as seeking.
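The points above can be tried out in Python with the sketch below. The sentence is made up, and the suffix -ed stands in for -ing so that nothing here answers a question for you. One Python-specific wrinkle worth knowing: re.findall returns the contents of a capturing group rather than the whole match, so the compacted /ha(ve|s|d)/ needs a non-capturing group (?:...) to behave like /have|has|had/. On regex101 you see the full match either way.

```python
import re

# Made-up sentence (an assumption for illustration only).
text = "They have walked home; she has talked; he had waited."

# Matching only the suffix tells you little about the words themselves.
suffix_only = re.findall(r"ed", text)
print(suffix_only)    # ['ed', 'ed', 'ed']

# Matching the whole word: word characters, then 'ed', then a word boundary.
whole_words = re.findall(r"\b\w+ed\b", text)
print(whole_words)    # ['walked', 'talked', 'waited']

# Plain alternation versus the compacted form.
flat = re.findall(r"have|has|had", text)
print(flat)           # ['have', 'has', 'had']

# With a capturing group, re.findall returns only the group contents.
captured = re.findall(r"ha(ve|s|d)", text)
print(captured)       # ['ve', 's', 'd']

# A non-capturing group restores the full matches.
compact = re.findall(r"ha(?:ve|s|d)", text)
print(compact)        # ['have', 'has', 'had']
```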

Hope that was enough instructions! Let's get to it. Write regular expressions matching the descriptions below.

Years. (18 matches)
Dates. (2 matches)
The word computer and its variations (capitalized and plural; exclude microcomputer but allow matching within hyphenated words such as computer-animated). (7 matches)
All-capital words. (11 matches)
Apple's product names, starting with 'i': iPod, iTunes, iPad, etc. (6 matches)
Words ending in -ing. (14 matches)
A word ending in -ly and the following word. (8 matches)
Words with possessive 's: Jobs's, etc. (5 matches)
Words that have x or X in them. (8 matches)
The indefinite articles a and an. (13 matches. Do NOT match the surrounding spaces.)
A quoted word or phrase, including the surrounding quotation marks ("). (1 match)
Words that are 12 characters or longer. (Exclude hyphenated words; 9 matches)
Hyphenated words. (10 matches)
A parenthetical in a pair of round brackets (). Include the brackets so that it matches (CEO) as a whole. (3 matches)
This one is in fact tricky to get right. You might want to look up the "greedy matching behavior" of regular expressions. Or, see the KEY document for an explanation.

'the ... of' constructions. Allow '...' to be multiple words, up to 4. Intervening words may be accompanied by punctuation. (13 matches; 14 if including 'The ...')
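On the "greedy matching behavior" mentioned for the round-brackets question: the sketch below illustrates the general idea with angle brackets and a made-up sentence, so it is a hint rather than an answer. By default, quantifiers such as + and * grab as much text as they can; adding ? makes them stop as early as possible, and a negated character class is another common way to keep a match from running past the closing delimiter.

```python
import re

# Made-up sample with two angle-bracketed items (an assumption for illustration).
text = "Press <Ctrl> and <C> together."

# Greedy: .+ runs to the LAST '>' on the line, swallowing both items.
greedy = re.findall(r"<.+>", text)
print(greedy)    # ['<Ctrl> and <C>']

# Lazy: .+? stops at the FIRST '>' it can reach.
lazy = re.findall(r"<.+?>", text)
print(lazy)      # ['<Ctrl>', '<C>']

# Negated class: anything except angle brackets between the delimiters.
negated = re.findall(r"<[^<>]+>", text)
print(negated)   # ['<Ctrl>', '<C>']
```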

SUBMIT:

A document (MS Word doc, PDF, .txt, etc.) containing your answers.