Summary:
What are regular expressions?
How do I construct a regular expression?
What can I do with regular expressions?
Regular expressions (or 'regexes') are a way of testing or manipulating
strings based on the content of the string. Using a regular expression is
somewhat like using a 'Find' or 'Search and Replace' function in a text
editor. Regular expressions can be used in two main contexts: as an operator
in an `if` statement (to see if a string matches a given regular
expression), or to modify a string by replacing parts according to the regex.
Basic Regex Structure
Regular expressions are strings bounded by `/` characters. Certain
characters take on special meaning within a regular expression, as do certain
escaped (e.g. `\s`) characters. The simplest regular expression is just a
string of alphanumeric text, like `/some sample text/`. This regex will match
'some sample text' occurring anywhere inside the string it is tested against.
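As a minimal sketch of that behaviour, the snippet below tests a few strings against the literal regex using the `=~` operator (covered under Comparative Regexes). The `@candidates` list and its contents are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, made up for this illustration.
my @candidates = (
    'Here is some sample text to read.',
    'Some Sample Text',     # differs in case, so it does not match
    'some sample text',
);

my @matched;
foreach my $s (@candidates) {
    # The regex matches anywhere inside $s, not only against the whole string.
    push @matched, $s if $s =~ /some sample text/;
}

print scalar(@matched), " of ", scalar(@candidates), " strings matched.\n";
```

Only the first and third strings match, because plain text in a regex is case-sensitive.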
Simple Regex Special Characters
\ | the next character should be taken literally and not as a special character |
^ | the beginning of the line |
$ | the end of the line |
. | matches any character except newline (\n) |
| | matches either the expression before the `|` OR the sequence after. |
[] | matches the class defined within the brackets (see below) |
() | used to group terms (useful with `$1`,`$2`,... and `|`) (see below) |
`^` and `$` are 'placeholders' which allow the regex to test for the
position of the expression relative to the string. `/^And/` will match the word
`And` occurring at the beginning of the string.
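A short sketch of the two placeholders in use; the sample string `$line` is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string.
my $line = 'And then there were none';

# ^ ties the pattern to the start of the string, $ to the end.
my $starts_with_and = ($line =~ /^And/)  ? 1 : 0;   # 'And' is at the start
my $ends_with_and   = ($line =~ /And$/)  ? 1 : 0;   # 'And' is not at the end
my $ends_with_none  = ($line =~ /none$/) ? 1 : 0;   # 'none' is at the end

print "starts with And: $starts_with_and\n";
print "ends with And:   $ends_with_and\n";
print "ends with none:  $ends_with_none\n";
```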
`[]` will match any of the characters listed inside the square
brackets. If you wanted to match any of 'a', 'e', 'i', 'o', or 'u', you could
use `/[aeiou]/`. Note that this will not match uppercase vowels. They need to be
included separately, like `/[aeiouAEIOU]/`. Within `[]`, it's possible to
define a sequential list of characters by using a `-`. For example, to match
any alphanumeric character, use `/[A-Za-z0-9]/`. If the list of characters
begins with `^`, the class will match any character *not* defined in the list.
Note that these regexes match *any single* occurrence of any character
specified by the class.
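The three kinds of class described above (a plain list, a list with both cases, and a negated range) can be sketched as follows. The word list and the `%report` hash are hypothetical names used only for this illustration.

```perl
use strict;
use warnings;

my %report;
foreach my $w ('sky', 'tree', 'APPLE', 'x9') {
    my $lower_vowel = ($w =~ /[aeiou]/)      ? 1 : 0;   # any lowercase vowel
    my $any_vowel   = ($w =~ /[aeiouAEIOU]/) ? 1 : 0;   # either case
    my $non_letter  = ($w =~ /[^A-Za-z]/)    ? 1 : 0;   # anything that is not a letter
    $report{$w} = "$lower_vowel,$any_vowel,$non_letter";
    print "$w: $report{$w}\n";
}
```

Each test succeeds as soon as one character in the string belongs to the class, which is why 'x9' passes the negated test on the strength of its single digit.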
`()` is used to subdivide expressions into groups, much as a numerical
equation is subdivided with parentheses for clarity, functionality, or
debugging purposes. As well, the scalars `$1`, `$2`, `$3`, etc. get filled with
the actual strings matching the patterns specified within each set of
parentheses. Sets are numbered from the left, in the order of their opening
parenthesis. This can be quite useful when dealing with manipulative regexes.
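The numbering rule is easiest to see with nested groups. This is a sketch only: the date string and the variable names are assumptions, and the `+` quantifier ("one or more") is covered under Quantification Modifiers.

```perl
use strict;
use warnings;

# Hypothetical input string.
my $date = '2024-03-15';

my ($outer, $year, $month, $day);
if ($date =~ /(([0-9]+)-([0-9]+))-([0-9]+)/) {
    # Groups are numbered by the position of their opening parenthesis,
    # so the outer group is $1 even though it closes after $2 and $3.
    ($outer, $year, $month, $day) = ($1, $2, $3, $4);
}

print "outer=$outer year=$year month=$month day=$day\n";
```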
Condensed Character Classes and Other Escaped Characters
There are some escaped characters (e.g. `\x`) that can be used as
shorthand for commonly used character classes. As well, there are some other
escaped characters useful for determining relative positions of strings within
regexes.
\w | short for [A-Za-z0-9_] |
\W | short for [^\w] |
\s | matches any whitespace character (tabs, newlines, spaces, etc.) |
\S | short for [^\s] |
\d | short for [0-9] |
\D | short for [^\d] |
\b | matches a 'word boundary', i.e. an imaginary spot between adjacent `\w` and `\W` characters. The ends of a string count as `\W`s. Note that this does not apply within `[]`s. |
\B | matches anywhere \b doesn't. |
\A | like ^ (see below) |
\Z | like $ (see below) |
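`\A` and `\Z` match the absolute beginning and absolute end of a
string, respectively. They differ from `^`/`$` only when the `m` (multi-line)
modifier is used: `^` and `$` then also match at the start and end of each
line within the string, while `\A` and `\Z` do not. The full details are in
the Perl documentation at
http://language.perl.com/newdocs/pod/perlre.html.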
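The following sketch illustrates `\b` and the difference between `^` and `\A`. It relies on Perl's `m` (multi-line) modifier, under which `^` also matches just after each newline. The sample strings and variable names are assumptions.

```perl
use strict;
use warnings;

# Hypothetical two-line string.
my $text = "first line\nsecond line\n";

my $caret_plain = ($text =~ /^second/)    ? 1 : 0;  # ^ only matches at the very start
my $caret_multi = ($text =~ /^second/m)   ? 1 : 0;  # with m, ^ matches after a newline too
my $abs_start   = ($text =~ /\Asecond/m)  ? 1 : 0;  # \A ignores m
my $abs_end     = ($text =~ /line\Z/)     ? 1 : 0;  # \Z allows one trailing newline

# \b requires a word boundary on each side of 'cat'.
my $inside_word = ('concatenate' =~ /\bcat\b/) ? 1 : 0;
my $whole_word  = ('the cat sat' =~ /\bcat\b/) ? 1 : 0;

print "$caret_plain $caret_multi $abs_start $abs_end $inside_word $whole_word\n";
```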
Quantification Modifiers
The following characters and sequences can be used as 'quantifiers' to
allow matching a character, a character class, or a `()` grouped
expression more than once. All should be used immediately following the
expression to be quantified.
* | Match 0 or more times |
+ | Match 1 or more times |
? | Match 1 or 0 times |
{x} | Match exactly x times |
{x,} | Match at least x times |
{x,y} | Match between x and y times |
For example, `/.*/` matches any sequence of characters (including an
empty one) not including newlines. `/\w{1,5}/` matches a run of one to five
'word' characters.
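A few of the quantifiers can be sketched as below. The test strings and variable names are made up for illustration. Note that an unanchored `/\w{1,5}/` still succeeds on a longer word by matching its first five characters; `^` and `$` are needed to limit the length of the whole string.

```perl
use strict;
use warnings;

# ? makes the preceding 'u' optional, so both spellings match.
my $optional_u = ('color' =~ /^colou?r$/ && 'colour' =~ /^colou?r$/) ? 1 : 0;

# {3} demands exactly three digits between the anchors.
my $three_digits = ('123'  =~ /^[0-9]{3}$/) ? 1 : 0;
my $four_digits  = ('1234' =~ /^[0-9]{3}$/) ? 1 : 0;

# * accepts zero occurrences, + needs at least one.
my $star_on_empty = ('' =~ /^a*$/) ? 1 : 0;
my $plus_on_empty = ('' =~ /^a+$/) ? 1 : 0;

# Unanchored, {1,5} grabs as many word characters as it can, up to five.
my ($chunk)    = 'extraordinary' =~ /(\w{1,5})/;
my $short_word = ('extraordinary' =~ /^\w{1,5}$/) ? 1 : 0;

print "$optional_u $three_digits $four_digits $star_on_empty $plus_on_empty $chunk $short_word\n";
```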
Comparative Regexes
To use a regular expression in an `if` statement, the `=~` operator is
used in the same way `==` or `eq` would be used for simple comparisons. The
difference is that instead of a numerical value or a string being on the
right-hand side of the operator, a regex is used instead. Example:
-------------------------------------
@strings = (
    'I am very tired.',
    'We are tired.',
    'I feel quite tired.',
    'You look really tired.',
);
$rList = '';    # grows by one 'really ' after each match
foreach $x (@strings) {
    if ($x =~ /(very|quite) tired/) {
        print 'You ' . $rList . 'need to go to bed.' . "\n";
        # Right here, $1 contains either 'very' or 'quite' depending
        # on which one matched.
        $rList .= 'really ';
    }
}
-------------------------------------
This should print out 'You need to go to bed.' and then 'You really
need to go to bed.' If more strings matching the regex were added to @strings,
more lines would be printed, each with an additional 'really'.
Manipulative Regexes
If comparative regular expressions are like using the `Find` command
in a text editor, manipulative regular expressions are like using the `Search
and Replace` tool. They allow you to take the text matched by a regular
expression and substitute new text, possibly based on the old, in its place.
Instead of just the single expression, the manipulative syntax uses two
expressions back to back, with an `s` added to the front. The simplest form
looks like this:
-------------------------------------
$x = 'Fifteen red apples.';
$x =~ s/red/green/;
# $x is now 'Fifteen green apples.'
-------------------------------------
Note that the same operator is used as in a comparative context, but
there are now two expressions, separated by the middle `/`. This form only
replaces the first occurrence of the matched string. To replace all
occurrences, add a `g` modifier to the end of the regex, like this:
-------------------------------------
$x = $y = 'How much wood could a woodchuck chuck if a woodchuck could chuck wood?';
$x =~ s/wood/w00d/;
$y =~ s/wood/w00d/g;
# $x is now 'How much w00d could a woodchuck [...]', whereas $y is 'How much
# w00d could a w00dchuck [...]' (and so on through the string).
-------------------------------------
To use portions of the matched string in the replacement expression,
use parentheses and the `$1`, etc. variables mentioned earlier. For example,
to switch the positions of two numbers separated by a dash:
-------------------------------------
$x = '4028-12039';
$x =~ s/(\d+)-(\d+)/$2-$1/;
# $x is now '12039-4028'.
-------------------------------------
You can use variables inside either expression of a regex, whether
comparative or manipulative, because regular expressions are parsed much like
double-quoted strings; escaped characters (like `\n`) can be used as well.
Function calls are not interpolated directly; to compute the replacement text
with Perl code, add an `e` modifier to a manipulative regex. More information
on the use of Perl regular expressions can be found at
http://language.perl.com/newdocs/pod/perlre.html.
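A rough sketch of variables inside a regex follows. It uses two standard Perl features not covered above: the `e` modifier, which evaluates the replacement as Perl code, and `\Q...\E`, which makes an interpolated variable match literally. All strings and variable names are assumptions.

```perl
use strict;
use warnings;

# A variable interpolated into both the pattern and the replacement.
my $fruit = 'apples';
my $text  = 'Fifteen red apples and two red pears.';
$text =~ s/red $fruit/green $fruit/;

# With the e modifier the replacement is evaluated as Perl code.
my $total = 'Total: 21 units';
$total =~ s/(\d+)/$1 * 2/e;

# An interpolated variable is read as a regex, so '.' matches any character
# unless the variable is wrapped in \Q...\E.
my $needle     = 'a.c';
my $as_regex   = ('abc' =~ /$needle/)     ? 1 : 0;
my $as_literal = ('abc' =~ /\Q$needle\E/) ? 1 : 0;

print "$text\n$total\n$as_regex $as_literal\n";
```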