LING 2050 Special Topics in Linguistics: Corpus Linguistics, University of Pittsburgh


Lab 5
Objectives: more on Regular Expressions; simple perl scripts

Reference:

Kenneth Church, "Unix for Poets" [pdf]. Please note that some of the syntax in this document is deprecated: tail -n +2 should be used instead of tail +2.
Regular Expressions - User guide

Overview

We learn about: extended regular expressions, perl syntax, and shell scripting for looping through files.

Regular Expressions, continued

Some character classes are defined according to the POSIX standard, some of which we have used before. The following are the most commonly used and useful ones. Please see this page for detailed explanation.
POSIX Character Classes:

character    explanation
[:digit:]    matches any single digit (0-9)
[:alnum:]    matches any alphanumeric character (0-9, A-Z, a-z)
[:alpha:]    matches any alphabetic character (A-Z, a-z)
[:upper:]    matches any uppercase alphabetic character (A-Z)
[:lower:]    matches any lowercase alphabetic character (a-z)
[:blank:]    matches SPACE and TAB
[:punct:]    matches punctuation symbols: . , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ ` { } | ~

Note that these are always used inside an additional set of square brackets, e.g., [[:digit:]] or [[:digit:][:alpha:]], to match a single character. By contrast, [[:digit:]][[:alpha:]] consists of two bracket expressions in a row, so it matches a string of two characters: a digit followed by a letter. Both patterns find a match somewhere in the line below, so grep prints the line in both cases:
$ echo 'this is 20th century...!' | grep -E '[[:digit:][:alpha:]]'
this is 20th century...!

$ echo 'this is 20th century...!' | grep -E '[[:digit:]][[:alpha:]]'
this is 20th century...!
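
Since grep prints the whole line either way, the two outputs above look identical. To see which characters each pattern actually matched, one option is grep's -o switch, which prints only the matched portions, one per line (it is available in GNU and BSD grep, though not required by POSIX). A small sketch on the same sentence; the variable names are only for illustration:

```shell
sentence='this is 20th century...!'

# One bracket expression: each match is a single character that is a digit OR a letter.
# The + groups consecutive matching characters into whole tokens.
tokens=$(echo "$sentence" | grep -oE '[[:digit:][:alpha:]]+')
echo "$tokens"

# Two bracket expressions in a row: a digit immediately followed by a letter.
pair=$(echo "$sentence" | grep -oE '[[:digit:]][[:alpha:]]')
echo "$pair"
```

The first command prints this, is, 20th and century on separate lines; the second prints only 0t, the one place where a digit is directly followed by a letter.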

In addition to the standard regular expression syntax, many programming languages (PHP, Perl, Python, Java, etc.) support the following common extensions. Note that the lowercase version (e.g., \d) and the uppercase version (\D) work as a pair; the latter is the complement of the former. See this page for more information.
Character Class Abbreviations:

character   explanation                                                     matches                                      does not match
\d          matches any single digit (0-9)                                  '1', '23', '23rd'                            'twenty'
\D          matches any character NOT in the range 0-9                      'Amy', 'twenty', '23rd', '0.25', '1,000'     '1', '23'
\s          matches any whitespace character (space, tab, etc.)             ' ', 'this hat', 'hat '                      'hat'
\S          matches any character that is non-whitespace                    'this hat', 'hat'                            ' '
\w          matches any alphanumeric character (0-9, A-Z, a-z) or _         'Amy', 'a', 'A', '12', '23rd'                '...', '!', ' '
\W          matches everything else (punctuation, symbol, whitespace)       'my hat', 'yes!', '3/4'                      '23rd', 'Amy'

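You might have noticed that sed and grep do not reliably accept these abbreviations: the GNU versions understand some of them (such as \w and \s) as extensions, but not \d, and other versions may accept none of them. So, why bother learning these? The answer is that you can use all of them in perl, and later, AntConc. First, let's learn the basics of perl...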

Doing Everything with perl

We are first going to write an extremely simple perl script. Open up a text file named greetings.pl in pico, by typing:
pico greetings.pl
and type in the following line:
print "hello world\n";
Save and exit using Ctrl-X. Now execute the perl script:
$ perl greetings.pl
hello world

In the case above, we executed a perl script saved in a separate file (greetings.pl). Since the code itself is extremely simple, we do not have to rely on a script file at all; using the -e switch, the code can be supplied from the command-line:
$ perl -e 'print "hello pretty\n";'
hello pretty
Another useful switch is -n, which loosely translates to "do something for every line of the standard input". Therefore, the following command simply prints out each and every line of the input file.
$ perl -ne 'print;' austen-emma.2gram | more
emma by
by jane
jane austen
austen 1816
1816 volume
volume i
i chapter
chapter i
i emma
emma woodhouse

In perl, regex patterns are enclosed in / /. You can use perl -ne 'print if /PATTERN/;' and perl -ne 'print unless /PATTERN/;' to simulate grep and grep -v, respectively:
$ perl -ne 'print if /^thou\t/;' gutenberg.2gram | more
thou the
thou mayest
thou shalt
thou eatest
thou shalt
thou 3
thou wast
thou eaten

$ perl -ne 'print unless /[[:alpha:]]+\s[[:alpha:]]+/;' gutenberg.2gram | more
the 8th
the 23rd
the 28th
sept 28th
the 24th
the 7th
of 10
10 000
000 l
the 10

Perl also provides a sed-like syntax that lets you edit your text stream on the fly. perl -ne 's/PATTERN/REPLACEMENT/g; print;' achieves just that. Therefore, this good old sed command:
sed 's/Emma/Juliet/g; s/Mr\. Knightley/Romeo/g' austen-emma.txt
is the same as:
perl -ne 's/Emma/Juliet/g; s/Mr\. Knightley/Romeo/g; print;' austen-emma.txt
(The dot after Mr is escaped so that it matches a literal period rather than any character.)

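And, of course, perl provides a tr syntax as well. The following two commands are therefore equivalent:
tr '[A-Z]' '[a-z]' < austen-emma.txt
perl -ne 'tr/[A-Z]/[a-z]/; print' austen-emma.txt
Strictly speaking, neither tr needs the square brackets: A-Z on its own already denotes the range. They are harmless here only because [ and ] are simply mapped to themselves.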

This means we can tokenize a file using perl alone! Remember, this is how we did tokenization before:
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | grep -v -E '^$' | more
which can now be done using only perl:
cat austen-emma.txt | perl -ne 'tr/[A-Z]/[a-z]/; s/[[:punct:]]/ /g; s/ /\n/g; print;' | perl -ne 'print unless /^$/;' | more
Note that the last perl command cannot simply be merged into the first one; it has to run as a separate command after the pipe. What could be the reason behind this?

Repeating Commands on Multiple Files

For your homework, you bravely and diligently applied the same command to every text file. With some basic bash shell scripting, all these repetitions can be neatly packed into one round of command execution. CAUTION: this looping syntax is extremely powerful -- it would be a good idea to back up your data, and/or do trial runs.

Starting from the "austen-emma.words" tokenized word file, this is how you would obtain a word frequency file:
cat austen-emma.words | sort | uniq -c | sort -nr > austen-emma.words.freq
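
To see what each stage of this pipeline contributes, here is the same chain on a tiny made-up word list (standing in for a real .words file, one token per line):

```shell
# A tiny stand-in for a .words file
words=$(printf 'the\ncat\nthe\ndog\nthe\ncat\n')

# sort groups identical tokens together, uniq -c counts each group,
# and sort -nr orders the counts from highest to lowest
freq=$(echo "$words" | sort | uniq -c | sort -nr)
echo "$freq"

# The most frequent token sits on the first line, in the second column
top=$(echo "$freq" | head -n 1 | awk '{print $2}')
echo "$top"
```

The frequency list comes out as 3 the, 2 cat, 1 dog (uniq -c pads the counts with leading spaces), and the top token is the.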

Note that the output file name has the input file name built into it. We can designate this portion as a variable and formulate a loop syntax so it applies to every file in the directory that ends with .words:
$ for myfile in *.words
> do
> cat $myfile | sort | uniq -c | sort -nr > $myfile.freq
> echo $myfile finished.
> done
The words for, in, do, and done are part of the bash shell scripting syntax. myfile is the variable; it is first named without the $ prefix (in the for line) and then referenced with the prefix everywhere else. The echo command is not essential, but it provides handy feedback. When you press RETURN at the end of each line, your shell recognizes that your command is not complete and prompts with > until done is typed in. Note that the > before $myfile.freq is an output redirection, not a prompt.
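
Following the caution above about trial runs, here is a sketch of how a loop can be rehearsed before it is run for real. It works in a scratch directory on made-up .words files (the directory and file names are hypothetical), first echoing the command it would run and then running it. Putting "$myfile" in double quotes is a precaution, so that a file name containing a space is still treated as one name.

```shell
# Work in a scratch directory with made-up .words files so nothing real is touched
workdir=$(mktemp -d)
printf 'b\na\nb\n' > "$workdir/sample one.words"
printf 'x\ny\nx\nx\n' > "$workdir/sample2.words"
cd "$workdir" || exit 1

# Trial run: print each command instead of executing it
for myfile in *.words
do
    echo "would run: sort \"$myfile\" | uniq -c | sort -nr > \"$myfile.freq\""
done

# Real run: same loop, now actually writing the .freq files
for myfile in *.words
do
    sort "$myfile" | uniq -c | sort -nr > "$myfile.freq"
    echo "$myfile finished."
done

ls
```

After the real run, the directory contains a .freq file next to each .words file.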