Informatica: Regular Expressions I (Perl)

Scope

This post discusses regular expressions (regex) as implemented in Perl 5.8.7. Changes in later versions are outside the scope of this document. Do point out any anomaly in the post so that I can correct it or publish an erratum.

Introduction

A regular expression is a string that describes a pattern, representing a set of strings without listing them all. Regexes are put to use in many areas of computing. A familiar relative is the use of wildcards when searching for files and directories.

Glossary

Here's a set of symbols and terms that will be part of this post's lingo.

$literalName --> Represents a scalar in Perl, which holds a single value regardless of whether it is a number or a string.

@literalName --> Represents a list/array data type in Perl.

$_ --> Better known as the "default input and pattern matching space", this is the default global variable. It is populated by constructs such as foreach when no loop variable is specified.

@languageList = ("Perl", "C++", "Java");  # parentheses build a list; braces would build a hash reference
foreach (@languageList) {
    print "Language : ".$_."\n";          # $_ holds the current element
}

Let's begin the action with simple string matches.

Word matching

We'll start with a piece of Perl code and then analyze what's going on.

"Rajan Karol" =~ /Karol/;

In the above snippet we are searching for the pattern 'Karol' in the string. Here the string to match against is "Rajan Karol" and the pattern is specified inside the default delimiters // as /Karol/. The operator =~ binds the string to the pattern and returns true if a match is found, otherwise false.

We could replace the string literal with a variable. A variant uses the operator !~ for a negative test:

$string = "Rajan Karol";
print "No match\n" if $string !~ /Karol/;

If the match is made against the default variable $_, we can omit the variable and the =~ operator altogether. To negate such a match, use unless (or prefix the match with !) in place of !~.

$_ = "Rajan Karol";
print "No match\n" unless /Karol/; # prints 'No match' only if 'Karol' is not found in the default variable
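
To see the value a match returns, here is a minimal sketch that wraps the test in a small helper. The name describe_match and the second sample string are made up for illustration; they are not part of the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper: reports whether 'Karol' occurs in the given string.
sub describe_match {
    my ($string) = @_;
    return $string =~ /Karol/ ? "match" : "no match";
}

print describe_match("Rajan Karol"), "\n";   # match
print describe_match("Rajan Kumar"), "\n";   # no match

# The same test against the default variable $_, set by the loop.
for ("Rajan Karol", "Rajan Kumar") {
    print "$_: ", (/Karol/ ? "found" : "not found"), "\n";
}
```

The binding operator has higher precedence than ?:, so the match is evaluated first and its true/false result picks the string.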

The default delimiter // can be replaced by another delimiter by prefixing it with m. For example, to search for / in a Unix-like file path we can use a different delimiter as follows.

"/usr/bin/java" =~ m!/java!; # match, delimited by '!'
"/usr/bin/java" =~ m{/java}; # match, delimited by '{}'
"/usr/bin/java" =~ m"/java"; # match, delimited by '"'

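As a rough comparison, the sketch below runs the same path search twice, once with the default delimiter and once with m{}. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $path = "/usr/bin/java";

# With the default delimiter, every / inside the pattern needs a backslash.
my $escaped   = $path =~ /\/bin\/java/ ? 1 : 0;

# With m{...} the same pattern can be written as it appears in the path.
my $alternate = $path =~ m{/bin/java}  ? 1 : 0;

print "escaped: $escaped, alternate: $alternate\n";   # escaped: 1, alternate: 1
```

Both forms match the same text; the alternate delimiter just reads better when the pattern is full of slashes.
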
Some special characters, called metacharacters, are reserved for use in regex notation. The metacharacters are as follows: {} [] () ^ $ . | * + ? \

A metacharacter can be matched literally by putting a backslash before it. In fact, a forward slash must also be backslashed in order to be matched, because it delimits the regex.

"Language C++" =~ /C++/; # regex compile error in Perl 5.8 (nested quantifiers)
"Language C++" =~ /C\+\+/; # matches, as each + is escaped
"The open interval [0,1)." =~ /[0,1)./;     # regex compile error (unmatched [)
"The open interval [0,1)." =~ /\[0,1\)\./;  # matches

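When the text to be matched literally sits in a variable, backslashing by hand is not practical. A possible shortcut, sketched here with made-up variable names, is Perl's \Q...\E escape or the built-in quotemeta function, both of which backslash every non-word character.

```perl
use strict;
use warnings;

my $text    = "The open interval [0,1).";
my $literal = "[0,1).";              # text we want to match as-is

# \Q...\E escapes the metacharacters of the interpolated variable.
my $found = $text =~ /\Q$literal\E/ ? 1 : 0;

# quotemeta() returns the escaped string itself.
my $escaped = quotemeta($literal);

print "found: $found\n";             # found: 1
print "escaped: $escaped\n";         # escaped: \[0\,1\)\.
```

Note that quotemeta also escapes the comma, which is harmless: a backslashed non-word character always stands for itself.
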
In addition to the metacharacters, some non-printable ASCII characters are represented by escape sequences. Common examples are \t, \n and \r. Any character can also be written as an octal escape sequence, e.g. \07, or a hexadecimal escape sequence, e.g. \xAA.
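
A small sketch of escape sequences at work, assuming a made-up tab-separated record that ends in CR LF:

```perl
use strict;
use warnings;

my $record = "name\tvalue\r\n";    # tab-separated, ends with CR LF

my $has_tab  = $record =~ /\t/    ? 1 : 0;    # 1
my $has_crlf = $record =~ /\r\n$/ ? 1 : 0;    # 1

# \07 (octal) and \x07 (hexadecimal) both denote the bell character.
my $has_bell = "ding\x07" =~ /\07/ ? 1 : 0;   # 1

print "$has_tab $has_crlf $has_bell\n";       # 1 1 1
```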

Where to find a match in the string

One can specify where in the string the pattern must match. This is done with the anchor metacharacters ^ and $ and the word anchor metacharacters \b and \B.
^ – matches at the beginning of the string.
$ – matches at the end of the string, or before a newline at the end of the string.
\b – matches at a word boundary, that is, between a word character and a non-word character (\w\W or \W\w).
\B – matches at any position that is not a word boundary.

So if we presume the default variable $_ to be "Matching patterns in string\n":

/^Match/; # look for 'Match' at the start of the string
/string$/; # look for 'string' at the end of the string
/^Matching patterns in string$/; # complete string match
"" =~ /^$/; # ^$ matches an empty string
/\bpat/; # 'pat' at the start of a word
/ing\b/; # 'ing' at the end of a word
/\Bpat/; # 'pat' not at the start of a word (no match in this string)

Note that using both ^ and $ forces a complete string match, apart from one optional trailing newline, which $ tolerates.
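
To check the anchors against the sample string, here is a rough sketch that stores each pattern with qr// and records whether it matches. The labels and the %result hash are invented for this illustration.

```perl
use strict;
use warnings;

my $text = "Matching patterns in string\n";

# Each entry pairs a label with a precompiled pattern (qr//).
my @checks = (
    [ 'starts with Match',   qr/^Match/                       ],   # yes
    [ 'ends with string',    qr/string$/                      ],   # yes
    [ 'word starting pat',   qr/\bpat/                        ],   # yes
    [ 'pat inside a word',   qr/\Bpat/                        ],   # no
    [ 'complete string',     qr/^Matching patterns in string$/ ],  # yes
);

my %result;
for my $check (@checks) {
    my ($label, $re) = @$check;
    $result{$label} = $text =~ $re ? 'yes' : 'no';
    print "$label: $result{$label}\n";
}
```

The complete-string pattern still matches although $text ends in a newline, because $ is allowed to match just before a final newline.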

Matching against a set

A character class comes to our rescue when we want to match any one of a set of possible characters, rather than a single fixed character, at a particular point in the regex. Character classes are denoted by brackets [...] with the set of candidate characters inside, or by their corresponding abbreviated names.
\d is a digit and represents [0-9]
\s is a whitespace character and represents [\ \t\r\n\f]
\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
\D is a negated \d; it represents any character but a digit [^0-9]
\S is a negated \s; it represents any non-whitespace character [^\s]
\W is a negated \w; it represents any non-word character [^\w]
The period '.' matches any character but "\n"

/[cb]ol[td]/; # matches colt bolt cold bold
"cat" =~ /[atc]/; # matches 'c': the earliest position in the string holding any member of the class wins
/[rR][aA][jJ]/; # matches case-insensitive versions of 'raj'
/raj/i; # uses the 'i' modifier to achieve the same effect

The i appended after the closing delimiter, as in //i, is a modifier for the matching operation and stands for case-insensitive.
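
The classes above can be tried out with a short sketch like the following. The word list and the sample strings are made up for the purpose.

```perl
use strict;
use warnings;

# Keep only the words that contain c or b, then 'ol', then t or d.
my @hits = grep { /[cb]ol[td]/ } qw(colt bolt cold bold gold hold);
print "@hits\n";                              # colt bolt cold bold

# The abbreviated classes behave like their bracketed equivalents.
my $digit = "Room 42" =~ /\d/    ? 1 : 0;     # 1: '4' is a digit
my $same  = "Room 42" =~ /[0-9]/ ? 1 : 0;     # 1: the same test written out
my $space = "Room42"  =~ /\s/    ? 1 : 0;     # 0: no whitespace present

# The i modifier ignores case for the whole pattern.
my $ci = "RAJ" =~ /raj/i ? 1 : 0;             # 1

print "$digit $same $space $ci\n";            # 1 1 0 1
```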

Character classes also have special characters, but the sets of ordinary and special characters inside a character class differ from those outside it. The special characters for a character class are - ] \ ^ $, and each is matched literally by escaping it.
- is the range operator in a character class. A '-' at the beginning or end of the class acts as an ordinary character.
] marks the end of a character class.
$ introduces a scalar variable, whose value is interpolated into the class.
\ starts an escape sequence.
^ in the first position of a character class denotes a negated character class, which matches any character but those in the brackets.

$x = 'bcr';
/[\]c]at/;        # matches ']at' or 'cat'
/[$x]at/;         # matches 'bat', 'cat', or 'rat'
/[\$x]at/;        # $ is escaped so matches '$at'
                  # or 'xat'
/[\\$x]at/;       # \ is escaped so matches '\at',
                  # 'bat', 'cat', or 'rat'
/[0-9a-fA-F]/;    # matches a hexadecimal digit
/[^a]at/;         # doesn't match 'aat' or 'at',
                  # but matches all others: 'bat',
                  # 'cat', '0at', '%at', etc.
/[^0-9]/;         # matches a non-numeric character
/[a^]at/;         # matches 'aat' or '^at'; here '^'
                  # is ordinary
/\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
/[\d\s]/;         # matches any digit or whitespace
                  # character
/end\./;          # matches 'end.'
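
A few of these can be verified with a sketch along the following lines. The helper looks_like_time and the sample strings are invented here, and the helper adds ^ and $ to the time pattern so that the whole string has to be in hh:mm:ss form.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole string looks like hh:mm:ss.
sub looks_like_time {
    my ($s) = @_;
    return $s =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0;
}

my $x = 'bcr';
my $rat    = "rat" =~ /[$x]at/  ? 1 : 0;   # 1: $x interpolates to b, c, r
my $mat    = "mat" =~ /[$x]at/  ? 1 : 0;   # 0: m is not in the class
my $dollar = '$at' =~ /[\$x]at/ ? 1 : 0;   # 1: the escaped $ is literal
my $hyphen = "a-b" =~ /[xy-]/   ? 1 : 0;   # 1: a trailing - is literal
my $negate = "aat" =~ /[^a]at/  ? 1 : 0;   # 0: only 'a' precedes 'at'

print "$rat $mat $dollar $hyphen $negate\n";   # 1 0 1 1 0
print looks_like_time("12:34:56"), "\n";       # 1
print looks_like_time("1:2:3"), "\n";          # 0
```

This only checks the shape of the string, so something like 99:99:99 would pass as well.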

Thursday, April 10, 2008