Overview
We learn about: extended regular expressions, perl syntax, and shell scripting for looping through files.
Regular Expressions, continued
- Some character classes are defined by the POSIX standard; we have used some of them before. The following are the most commonly used and useful ones. Please see this page for a detailed explanation.
POSIX Character Classes:
character | explanation |
[:digit:] | matches any single digit (0-9) |
[:alnum:] | matches any alphanumeric character (0-9, A-Z, a-z) |
[:alpha:] | matches any alphabetic character (A-Z, a-z) |
[:upper:] | matches any uppercase alphabetic character (A-Z) |
[:lower:] | matches any lowercase alphabetic character (a-z) |
[:blank:] | matches SPACE and TAB |
[:punct:] | matches any punctuation symbol: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ |
Note that these are always used inside an additional set of square brackets, e.g., [[:digit:]] or [[:digit:][:alpha:]], to match a single character. By contrast, [[:digit:]][[:alpha:]], which places two bracket expressions side by side, matches a string of two characters: a digit followed by a letter. Both commands below print the whole line, because grep prints every line that contains a match:
$ echo 'this is 20th century...!' | grep -E '[[:digit:][:alpha:]]'
this is 20th century...!
$ echo 'this is 20th century...!' | grep -E '[[:digit:]][[:alpha:]]'
this is 20th century...!
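Since both commands above print the same line, the difference between the two patterns is easier to see when grep prints only the matched text. The sketch below assumes your grep supports the -o switch (GNU and BSD grep do); the shorter first input string is made up for illustration.

```shell
# -o prints each match on its own line instead of the whole input line.
# One bracket expression: every single digit or letter is a separate match.
echo 'in 20th' | grep -oE '[[:digit:][:alpha:]]'

# Two bracket expressions in a row: a digit immediately followed by a letter.
echo 'this is 20th century...!' | grep -oE '[[:digit:]][[:alpha:]]'
```

The first command prints six one-character matches (i, n, 2, 0, t, h), while the second prints only 0t, the one place where a digit is directly followed by a letter.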
- In addition to the standard regular expression syntax, many programming languages (PHP, Perl, Python, Java, etc.) support the following common extensions. Note that a lowercase abbreviation '\x' and its uppercase counterpart '\X' work as a pair; the latter is the complement of the former. See this page for more information.
Character Class Abbreviations:
character | explanation | matches | does not match |
\d | matches any single digit (0-9) | '1', '23', '23rd' | 'twenty' |
\D | matches any character NOT in the range 0-9 | 'Amy', 'twenty', '23rd', '0.25', '1,000' | '1', '23' |
\s | matches any whitespace character (space, tab, etc.) | ' ', 'this hat', 'hat ' | 'hat' |
\S | matches any character that is non-whitespace | 'this hat', 'hat' | ' ' |
\w | matches any alphanumeric character or underscore ('0-9', 'A-Z', 'a-z', '_') | 'Amy', 'a', 'A', '12', '23rd' | '...', '!', ' ' |
\W | matches everything else (punctuation, symbol, whitespace) | 'my hat', 'yes!', '3/4' | '23rd', 'Amy' |
You might have noticed that sed and grep do not accept all of these abbreviations (the GNU versions recognize \w and \s, for example, but not \d). So why bother learning them? The answer is that you can use them in perl and, later, AntConc. First, let's learn the basics of perl...
Doing Everything with perl
- We are first going to write an extremely simple perl script. Open a text file named greetings.pl in pico by typing:
pico greetings.pl
and type in the following line:
print "hello world\n";
Save and exit using Ctrl-X. Now execute the perl script:
$ perl greetings.pl
hello world
- In the case above, we executed a perl script saved in a separate file (greetings.pl). Since the code itself is extremely simple, we do not have to rely on a script file at all; using the -e switch, the code can be supplied on the command line:
$ perl -e 'print "hello pretty\n";'
hello pretty
Another useful switch is -n, which loosely translates to "do something for every line of the input" (the files named on the command line, or standard input if none are given). The following command therefore simply prints out every line of the input file.
$ perl -ne 'print;' austen-emma.2gram | more
emma by
by jane
jane austen
austen 1816
1816 volume
volume i
i chapter
chapter i
i emma
emma woodhouse
- In perl, regex patterns are enclosed in / /. You can use perl -ne 'print if /PATTERN/;' and perl -ne 'print unless /PATTERN/;' to simulate grep and grep -v, respectively:
$ perl -ne 'print if /^thou\t/;' gutenberg.2gram | more
thou the
thou mayest
thou shalt
thou eatest
thou shalt
thou 3
thou wast
thou eaten
$ perl -ne 'print unless /[[:alpha:]]+\s[[:alpha:]]+/;' gutenberg.2gram | more
the 8th
the 23rd
the 28th
sept 28th
the 24th
the 7th
of 10
10 000
000 l
the 10
- Perl also provides a sed-like syntax that lets you edit your text stream on the fly. perl -ne 's/PATTERN/REPLACEMENT/g; print;' achieves just that. Therefore, this good old sed command:
sed 's/Emma/Juliet/g; s/Mr. Knightley/Romeo/g' austen-emma.txt
is the same as:
perl -ne 's/Emma/Juliet/g; s/Mr. Knightley/Romeo/g; print;' austen-emma.txt
- And, of course, perl provides a tr syntax as well. The following two commands are therefore equivalent:
tr '[A-Z]' '[a-z]' < austen-emma.txt
perl -ne 'tr/[A-Z]/[a-z]/; print' austen-emma.txt
- This means we could tokenize a file using perl alone! Remember, this is how we did tokenization before:
tr '[A-Z]' '[a-z]' < austen-emma.txt | sed -r 's/[[:punct:]]/ /g' | sed -r 's/ /\n/g' | grep -v -E '^$' | more
which can now be done using only perl:
cat austen-emma.txt | perl -ne 'tr/[A-Z]/[a-z]/; s/[[:punct:]]/ /g; s/ /\n/g; print;' | perl -ne 'print unless /^$/;' | more
Note that the last perl command cannot be folded into the first one; it has to run as a separate command after the pipe. What could be the reason behind this?
Repeating Commands on Multiple Files
- For your homework, you bravely and diligently applied the same command to every text file. With some basic bash shell scripting, all these repetitions can be neatly packed into a single command. CAUTION: this looping syntax is extremely powerful -- it would be a good idea to back up your data, and/or do trial runs.
- Starting from the "austen-emma.words" tokenized word file, this is how you would obtain a word frequency file:
cat austen-emma.words | sort | uniq -c | sort -nr > austen-emma.words.freq
Note that the output file name has the input file name built into it. We can designate this portion as a variable and formulate a loop syntax so it applies to every file in the directory that ends with .words:
$ for myfile in *.words
> do
> cat "$myfile" | sort | uniq -c | sort -nr > "$myfile.freq"
> echo "$myfile finished."
> done
The words for, in, do, and done are part of the bash shell scripting syntax. myfile is the variable; it is first named without the $ prefix and then used with one everywhere else. The echo command is not essential, but it provides handy feedback. When you press RETURN at the end of a line, your shell recognizes that the command is not complete and prompts with > until done is typed in.
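Following the caution about trial runs, one way to rehearse a loop is to echo each command instead of executing it, and to work on throwaway copies. The sketch below does both in a scratch directory created with mktemp; the file names one.words and two.words and their contents are made up for illustration.

```shell
# Work in a scratch directory with two tiny made-up word files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b\na\nb\n' > one.words
printf 'c\n' > two.words

# Trial run: print each command instead of executing it.
for myfile in *.words
do
    echo "would run: sort $myfile | uniq -c | sort -nr > $myfile.freq"
done

# Real run, with the variable quoted so that file names
# containing spaces are still treated as a single name.
for myfile in *.words
do
    sort "$myfile" | uniq -c | sort -nr > "$myfile.freq"
    echo "$myfile finished."
done
```

Once the echoed commands look right, the real loop can be run with more confidence. Here sort reads the file directly, which does the same job as cat piped into sort.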