Lesson 7a - Regular Expressions
Summary:

What are regular expressions?
How do I construct a regular expression?
What can I do with regular expressions?

Regular expressions (or 'regexes') are a way of testing or manipulating
strings based on the content of the string. Using a regular expression is
somewhat like using a 'Find' or 'Search and Replace' function in a text
editor. Regular expressions can be used in two main contexts: as an operator
in an if statement (to see if a string matches a given regular
expression), or to modify a string by replacing parts according to the regex.

Basic Regex Structure

Regular expressions are strings bounded by / characters. Certain
characters take on special meaning within a regular expression, as do certain
escaped (e.g. \s) characters. The simplest regular expression is just a
string of alphanumeric text, like /some sample text/. This regex will match
'some sample text' occurring anywhere inside the string it is tested against.
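
As a minimal sketch of a literal match, the following tests two strings
against the sample regex. The test strings and the @results array are made up
for illustration and are not part of the lesson.

```perl
use strict;
use warnings;

# Illustrative strings, not from the lesson.
my @tests = (
    'Here is some sample text for you.',
    'Here is Some Sample Text for you.',
);

my @results;
foreach my $t (@tests) {
    # A plain-text regex matches anywhere inside the string, and it is
    # case-sensitive unless told otherwise.
    if ($t =~ /some sample text/) {
        push @results, 'match';
    } else {
        push @results, 'no match';
    }
}

print join(', ', @results), "\n";   # prints: match, no match
```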

Simple Regex Special Characters

 \    the next character should be taken literally and not as a special
      character
 ^    the beginning of the line
 $    the end of the line
 .    matches any character except newline (\n)
 |    matches either the expression before the | OR the sequence after.
 []   matches the class defined within the brackets (see below)
 ()   used to group terms (useful with $1, $2, ... and |) (see below)

^ and $ are 'placeholders' which allow the regex to test for the
position of the expression relative to the string. /^And/ will match the word
And occurring at the beginning of the string.
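
A short sketch of the placeholders, the escape character, and | in use
might look like this. The strings and variable names are assumptions chosen
for illustration.

```perl
use strict;
use warnings;

# Illustrative string, not from the lesson.
my $s = 'And then there were none';

my $starts_with_and  = ($s =~ /^And/)  ? 1 : 0;   # 'And' at the very start
my $ends_with_none   = ($s =~ /none$/) ? 1 : 0;   # 'none' at the very end
my $starts_with_none = ($s =~ /^none/) ? 1 : 0;   # 'none' is not at the start

# An escaped $ is a literal dollar sign rather than the end-of-line marker.
my $has_price = ('Costs $5' =~ /\$5/) ? 1 : 0;

# | accepts either alternative.
my $is_pet = ('my dog' =~ /cat|dog/) ? 1 : 0;

print "$starts_with_and $ends_with_none $starts_with_none $has_price $is_pet\n";
# prints: 1 1 0 1 1
```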

[] will match any of the characters listed inside the square
brackets. If you wanted to match any of 'a', 'e', 'i', 'o', or 'u', you could
use /[aeiou]/. Note that this will not match uppercase vowels. They need to
be included separately, like /[aeiouAEIOU]/. Within [], it's possible to
define a sequential range of characters by using a -. For example, to match
any alphanumeric character, use /[A-Za-z0-9]/. If the list of characters
begins with ^, the class will match any character *not* defined in the list.
Note that a class matches only a *single* character: one occurrence of any
character specified by the class.
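
The three kinds of class described above can be tried out with a sketch
like the following. The word list and the @lines array are illustrative only.

```perl
use strict;
use warnings;

# Illustrative words, not from the lesson.
my @words = ('sky', 'Apple', 'URL', 'a-b');

my @lines;
foreach my $w (@words) {
    my $lower = ($w =~ /[aeiou]/)       ? 'yes' : 'no';  # lowercase vowels only
    my $any   = ($w =~ /[aeiouAEIOU]/)  ? 'yes' : 'no';  # vowels of either case
    my $other = ($w =~ /[^A-Za-z0-9]/)  ? 'yes' : 'no';  # anything not alphanumeric
    push @lines, $w . ': lower=' . $lower . ' any=' . $any . ' other=' . $other;
}

print "$_\n" foreach @lines;
# prints:
# sky: lower=no any=no other=no
# Apple: lower=yes any=yes other=no
# URL: lower=no any=yes other=no
# a-b: lower=yes any=yes other=yes
```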

() is used to subdivide expressions into groups, just as a numerical
equation may need to be subdivided for clarity, functionality, or debugging
purposes. As well, the scalars $1, $2, $3, etc. get filled with the actual
strings matching the patterns specified within sets of brackets. Bracket sets
are numbered starting from the left and in the order of the opening bracket.
This can be quite useful when dealing with manipulative regexes.

Condensed Character Classes and Other Escaped Characters

There are some escaped characters (e.g. \w) that can be used as
shorthand for commonly used character classes. As well, there are some other
escaped characters useful for determining relative positions of strings within
regexes.

 \w   short for [A-Za-z0-9_]
 \W   short for [^\w]
 \s   matches any whitespace character (tabs, newlines, spaces, etc.)
 \S   short for [^\s]
 \d   short for [0-9]
 \D   short for [^\d]
 \b   matches a 'word boundary', i.e. an imaginary spot between two adjacent
      \w and \W characters. The ends of a string count as \Ws. Note that
      this does not apply within []s, where \b means a backspace character.
 \B   matches anywhere \b doesn't.
 \A   like ^ (see below)
 \Z   like $ (see below)
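
To see how bracket sets are numbered and how the shorthand classes behave,
a small sketch along these lines may help. All of the sample strings and the
@out array are assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustrative data, not from the lesson.
my @out;

# Groups are numbered by the position of their opening bracket, so the
# outer group is $1 and the groups nested inside it come next.
if ('2024-05-04' =~ /((\d+)-(\d+))-(\d+)/) {
    push @out, "1=$1 2=$2 3=$3 4=$4";
}

# \w+ grabs a run of word characters, \s+ the whitespace after it.
if ('hello_world   42' =~ /(\w+)\s+(\d+)/) {
    push @out, "word=$1 number=$2";
}

# \b only matches at the edge of a word, so 'cat' buried inside a longer
# word is ignored. The ends of the string count as edges too.
foreach my $s ('the cat sat', 'concatenate', 'cat') {
    push @out, $s . ': ' . (($s =~ /\bcat\b/) ? 'yes' : 'no');
}

print "$_\n" foreach @out;
# prints:
# 1=2024-05 2=2024 3=05 4=04
# word=hello_world number=42
# the cat sat: yes
# concatenate: no
# cat: yes
```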

\A and \Z match the absolute beginning and absolute end of a
string, respectively. In short, ^ and $ can also match at the start and end
of each line inside the string when the m modifier is used, whereas \A and \Z
only ever match at the ends of the whole string. The difference is explained
in full by the Perl documentation at
http://language.perl.com/newdocs/pod/perlre.html.

Quantification Modifiers

The following characters and sequences can be used as 'quantifiers'
to allow matching a character, a character class, or a () grouped expression
more than once. All should be used immediately following the expression to be
quantified.

 *       Match 0 or more times
 +       Match 1 or more times
 ?       Match 1 or 0 times
 {n}     Match exactly n times
 {n,}    Match at least n times
 {n,m}   Match between n and m times

For example, /.*/ matches any sequence of characters not including
newlines. /\w{1,5}/ matches a run of one to five 'word' characters (on its
own it will also match the first five characters of a longer word).

Comparative Regexes

To use a regular expression in an if statement, the =~ operator is
used in the same way == or eq would be used for simple comparisons. The
difference is that instead of a numerical value or a string being on the
right-hand side of the operator, a regex is used instead. Example:
-------------------------------------
@strings = (
    'I am very tired.',
    'We are tired.',
    'I feel quite tired.',
    'You look really tired.',
);
$really = '';
foreach $x (@strings) {
    if ($x =~ /(very|quite) tired/) {
        print 'You ' . $really . 'need to go to bed.' . "\n";
        # Right here, $1 contains either 'very' or 'quite' depending
        # on which one matched.
        $really .= 'really ';
    }
}
-------------------------------------

This should print out 'You need to go to bed.' and then 'You really
need to go to bed.' If more strings matching the regex were added to @strings,
more lines would be printed, each with an additional 'really'.
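
The quantifiers from the previous section work inside an if statement in
exactly the same way. Here is a rough sketch, using made-up input strings and
labels, of ?, {n} and {n,} in comparative regexes:

```perl
use strict;
use warnings;

# Illustrative strings, not from the lesson.
my @inputs = ('colour', 'color', 'colouur', 'AB-1234', 'AB-12', 'goooal');

my @matched;
foreach my $s (@inputs) {
    if ($s =~ /^colou?r$/) {
        # ? : the u may appear once or not at all
        push @matched, $s . ' (optional u)';
    } elsif ($s =~ /^[A-Z]{2}-\d{4}$/) {
        # {n} : exactly two capital letters, then exactly four digits
        push @matched, $s . ' (part number)';
    } elsif ($s =~ /^go{2,}al$/) {
        # {n,} : at least two o's in a row
        push @matched, $s . ' (long goal)';
    }
}

print "$_\n" foreach @matched;
# prints:
# colour (optional u)
# color (optional u)
# AB-1234 (part number)
# goooal (long goal)
```

'colouur' and 'AB-12' print nothing, because each has too many or too few
repetitions for its pattern.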

Manipulative Regexes

If comparative regular expressions are like using the Find command
in a text editor, manipulative regular expressions are like using the Search
and Replace tool. They allow you to take the text matched by a regular
expression and substitute new text, possibly based on the old, in its place.
Instead of just the single expression, the manipulative syntax uses two
expressions back to back, with an s added to the front. The simplest form
looks like this:
-------------------------------------
$x = 'Fifteen red apples.';
$x =~ s/red/green/;
# $x is now 'Fifteen green apples.'
-------------------------------------

Note that the same operator is used as in a comparative context, but
there are now two expressions, separated by the middle /. Note that this will
only replace the first occurrence of the matched string. To replace all
occurrences, add a g modifier to the end of the regex, like this:
-------------------------------------
$x = $y = 'How much wood could a woodchuck chuck if a woodchuck could chuck wood?';
$x =~ s/wood/w00d/;
$y =~ s/wood/w00d/g;
# $x is now 'How much w00d could a woodchuck [...]', whereas $y is 'How much
# w00d could a w00dchuck [...]' (and so on through the string).
-------------------------------------

To use portions of the matched string in the expression to replace
with, use brackets and the $1, etc. variables mentioned earlier. For example,
to switch the positions of two numbers separated by a dash:
-------------------------------------
$x = '4028-12039';
$x =~ s/(\d+)-(\d+)/$2-$1/;
# $x is now '12039-4028'.
-------------------------------------
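
The same bracket technique combines with the g modifier. As a sketch,
assuming a made-up list of names in 'Surname, Given' form:

```perl
use strict;
use warnings;

# Illustrative data, not from the lesson.
my $names = 'Smith, John; Doe, Jane';

# Capture the surname and the given name, then write them back in the
# opposite order. Without g, only the first pair is rewritten.
my $first_only = $names;
$first_only =~ s/(\w+), (\w+)/$2 $1/;

# With g, every pair in the string is rewritten.
my $all = $names;
$all =~ s/(\w+), (\w+)/$2 $1/g;

print "$first_only\n";   # prints: John Smith; Doe, Jane
print "$all\n";          # prints: John Smith; Jane Doe
```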

You can use variables inside either expression of a regex, whether
comparative or manipulative. In fact, regular expressions are parsed much
like double-quoted strings, so variables are interpolated and escaped
characters (like \n) can be used. Function calls are not interpolated
directly; to use one in the replacement part, add the e modifier, which makes
Perl evaluate the replacement as code. More information on the use of Perl
regular expressions can be found at
http://language.perl.com/newdocs/pod/perlre.html.
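
A brief sketch of variables inside a regex follows. The variable names and
text are invented for illustration. The last substitution uses Perl's standard
e modifier, which evaluates the replacement as code so that a function such as
uc() can be applied to the matched text.

```perl
use strict;
use warnings;

# Illustrative names and data, not from the lesson.
my $old  = 'red';
my $new  = 'green';
my $text = 'Fifteen red apples and two red pears.';

# A variable inside a comparative regex is replaced by its value first.
my $found = ($text =~ /two $old/) ? 1 : 0;

# Variables are interpolated into both halves of a substitution.
my $recoloured = $text;
$recoloured =~ s/$old/$new/g;

# With e, the replacement is run as Perl code rather than used as text.
my $shouted = $text;
$shouted =~ s/(apples|pears)/uc($1)/ge;

print "$found\n";        # prints: 1
print "$recoloured\n";   # prints: Fifteen green apples and two green pears.
print "$shouted\n";      # prints: Fifteen red APPLES and two red PEARS.
```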

