Scope
This post discusses regular expressions as implemented in Perl 5.8.7. Changes in later versions are outside the scope of this document. Do point out any anomaly in the post so that I can correct it or publish an erratum.
Introduction
A regular expression (regex) is a string that describes a pattern, representing a set of strings without listing them all. Regexes are put to use in several areas of computing. A familiar example of the idea is searching for files and directories using wildcards.
Glossary
Here's a set of symbols and terms used throughout this post.
$literalName --> Represents a scalar in Perl, which can hold a single value, whether a number or a string.
@literalName --> Represents an array (list) in Perl.
$_ --> Better known as the "default input and pattern-matching space", this is the default global variable. It is typically populated during looping when no loop variable is specified.
@languageList = ("Perl", "C++", "Java");
foreach (@languageList) {
print "Language : ".$_."\n";
}
Word matching
We'll start with a piece of Perl code and then analyze what's going on.
"Rajan Karol" =~ /Karol/;
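The =~ operator binds the string on its left to the regex on its right, and the expression is true if the regex matches anywhere in the string. We could replace the string literal with a variable. A variant uses the !~ operator for negative testing, like this: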
$string = "Rajan Karol";
print "No match\n" if $string !~ /Karol/;
$_ = "Rajan Karol";
print "No match\n" unless /Karol/; # prints "No match" if 'Karol' is not found in the default variable $_
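The binding operators are easiest to get a feel for in a small script. Here's a minimal sketch of the ideas so far; the helper name contains_karol is made up for illustration and isn't part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# returns 1 if 'Karol' occurs anywhere in $text, 0 otherwise.
sub contains_karol {
    my ($text) = @_;
    return $text =~ /Karol/ ? 1 : 0;
}

# =~ is true on a match; !~ is true when the regex does not match.
print "match\n"    if "Rajan Karol" =~ /Karol/;
print "no match\n" if "Rajan Kumar" !~ /Karol/;

# Inside foreach with no loop variable, each element lands in $_,
# and a bare /.../ is tested against $_.
foreach ("Rajan Karol", "Rajan Kumar") {
    print "$_ contains Karol\n" if /Karol/;
}
```

Run as is, this should print "match", "no match", and then a single "contains Karol" line for the first name only.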
"/usr/bin/java" =~ m!/java!; # match, delimited by '!'
"/usr/bin/java" =~ m{/java}; # match, delimited by '{}'
"/usr/bin/java" =~ m"/java"; # match, delimited by '"'
A metacharacter can be matched by putting a backslash before it. In fact, a forward slash must also be backslashed in order to be matched whenever it is the regex delimiter, which is why the examples above pick other delimiters with m.
"Language C++" =~ /C++/; # flagged as syntax error.
"Language C++" =~ /C\+\+/; # matches as + is escaped
"The open interval [0,1)." =~ /[0,1)./; # syntax error: unmatched ')'
"The open interval [0,1)." =~ /\[0,1\)\./; # matches
Where to find a match in the string
One can specify where in the string the pattern must match. This is done with the anchor metacharacters ^ and $ and the word anchor metacharacters \b and \B.
^ – matches at the beginning of the string.
$ – matches at the end of the string, or before a newline at the end of the string.
\b – matches at a word boundary, that is, between a word character and a non-word character (\w\W or \W\w).
\B – matches anywhere that is not a word boundary.
So if we presume the default variable $_ to be "Matching patterns in string\n":
/^Match/; # look for ‘Match’ at the start of string
/string$/; # look for ‘string’ at the end of string
/^Matching patterns in string$/; # complete string match
"" =~ /^$/; # ^$ matches an empty string
/\bpat/; # words starting with ‘pat’
/ing\b/; # words ending in ‘ing’
/\Bpat/; # ‘pat’ not at the start of a word, as in ‘compatible’
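To check these anchors yourself, here's a small sketch that runs several of them against the same sample string. The helper name hit is an assumption made for this example; qr// is standard Perl and simply compiles a regex so it can be passed around.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# returns "yes" if the compiled regex $re matches $text, else "no".
sub hit {
    my ($text, $re) = @_;
    return $text =~ $re ? "yes" : "no";
}

my $text = "Matching patterns in string\n";

print '^Match    : ', hit($text, qr/^Match/),    "\n";   # yes
print 'string$   : ', hit($text, qr/string$/),   "\n";   # yes, $ allows the trailing newline
print '^patterns : ', hit($text, qr/^patterns/), "\n";   # no, not at the start
print '\bpat     : ', hit($text, qr/\bpat/),     "\n";   # yes, 'patterns' starts a word
print '\Bpat     : ', hit($text, qr/\Bpat/),     "\n";   # no, 'pat' only occurs at a word start
print 'ing\b     : ', hit($text, qr/ing\b/),     "\n";   # yes, 'Matching' and 'string' end in 'ing'
```

Swapping in a string such as "compatible" flips the \Bpat line to yes, because there 'pat' sits in the middle of a word.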
Matching against a set
A character class comes to our rescue when we want to match any one of a set of possible characters, rather than a single fixed character, at a particular point in the regex. Character classes are denoted by brackets [...] with the set of candidate characters inside, or by their abbreviated names:
\d is a digit and represents [0-9]; it matches a single digit.
\s is a whitespace character and represents [\ \t\r\n\f]; it matches a single whitespace character.
\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
\D is a negated \d; it represents any character but a digit [^0-9]
\S is a negated \s; it represents any non-whitespace character [^\s]
\W is a negated \w; it represents any non-word character [^\w]
The period '.' matches any character but "\n"
/[cb]ol[td]/; # matches colt bolt cold bold
"cat" =~ /[atc]/; # matches 'c', the earliest position in the string where any member of the class occurs
/[rR][aA][jJ]/ # matches case insensitive versions of Raj
/raj/i # uses the 'i' modifier to achieve the same effect
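Here's a short sketch that puts the abbreviated classes to work by classifying single characters. The describe sub is a made-up name for this example, and it assumes each input is exactly one character long.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# classifies a single character using the abbreviated classes.
sub describe {
    my ($ch) = @_;
    return "digit"      if $ch =~ /^\d$/;
    return "whitespace" if $ch =~ /^\s$/;
    return "word"       if $ch =~ /^\w$/;   # letters and '_' (digits were caught above)
    return "other";
}

for my $ch ("7", " ", "_", "x", "%") {
    print "'$ch' is a ", describe($ch), " character\n";
}

# A bracketed class picks one character per position.
for my $word (qw(colt bold cola)) {
    print "$word: ", ($word =~ /[cb]ol[td]/ ? "match" : "no match"), "\n";
}

# The 'i' modifier ignores case.
print "case-insensitive match\n" if "RAJ" =~ /raj/i;
```

Note the order of the tests in describe: \d is checked before \w because every digit is also a word character.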
Character classes also have special characters, but the sets of ordinary and special characters inside a character class differ from those outside it. The special characters for a character class are - ] \ ^ $, and each is matched literally by escaping it.
- is the range operator in a character class. A '-' at the beginning or end of the class acts as an ordinary character.
] marks the end of a character class.
$ introduces a scalar variable, whose value is interpolated into the class.
\ starts an escape sequence.
^ in the first position of a character class denotes a negated character class, which matches any character but those in the brackets.
$x = 'bcr';
/[\]c]at/; # matches ']at' or 'cat'
/[$x]at/; # matches 'bat', 'cat', or 'rat'
/[\$x]at/; # $ is escaped, so matches '$at'
# or 'xat'
/[\\$x]at/; # \ is escaped, so matches '\at',
# 'bat', 'cat', or 'rat'
/[0-9a-fA-F]/; # matches a hexadecimal digit
/[^a]at/; # doesn't match 'aat' or 'at',
# but matches all others: 'bat',
# 'cat', '0at', '%at', etc.
/[^0-9]/; # matches a non-numeric character
/[a^]at/; # matches 'aat' or '^at'; here '^'
# is ordinary
/\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
/[\d\s]/; # matches any digit or whitespace
# character
/end\./; # matches 'end.'
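Ranges and anchors combine nicely for simple validation. The sketch below tightens the hh:mm:ss pattern a little and also shows a variable being interpolated into a class. The sub name looks_like_time is invented for this example, and the check is deliberately loose: it still accepts hours 24 to 29, since ruling those out needs features this post hasn't covered.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# returns 1 if $text has the shape hh:mm:ss, 0 otherwise.
# Ranges limit the first digit of each field; ^ and $ anchor the whole string.
sub looks_like_time {
    my ($text) = @_;
    return $text =~ /^[0-2]\d:[0-5]\d:[0-5]\d$/ ? 1 : 0;
}

for my $candidate ("09:15:42", "9:15:42", "12:75:00") {
    print "$candidate: ", (looks_like_time($candidate) ? "ok" : "rejected"), "\n";
}

# A scalar inside the brackets is interpolated before matching,
# so [$x] behaves like [bcr] here.
my $x = 'bcr';
for my $word (qw(bat cat rat hat)) {
    print "$word: ", ($word =~ /[$x]at/ ? "match" : "no match"), "\n";
}
```

Only the first candidate time should come out as ok, and 'hat' should be the one word reported as no match.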