Regexes and Pattern Matching
A regular expression is a string containing a combination of normal characters and special metacharacters or metasequences. The normal characters match themselves. Metacharacters and metasequences are characters or sequences of characters that represent ideas such as quantity, locations, or types of characters.
Two rules for matches
-
The earliest (leftmost) match
Regular expressions are applied to the input starting at the first character and proceeding toward the last. As soon as the regular expression engine finds a match, it returns.
-
Standard quantifiers
Quantifiers specify how many times something can be repeated. The standard quantifiers attempt to match as many times as possible. They settle for less than the maximum only if this is necessary for the success of the match.
Regular expression engines have subtle differences based on their type. There are two classes of engines:
Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA). DFAs are faster but lack many of the features of an NFA, such as capturing, lookaround, and non-greedy quantifiers. In the NFA world there are two types: Traditional and POSIX.
- DFA engines
-
DFAs compare each character of the input string to the regular expression, keeping track of all matches in progress. Since each character is examined at most once, the DFA engine is the fastest. One additional rule to remember with DFAs is that the alternation metasequence is greedy. When more than one option in an alternation (tooth|toothbresh) matches, the longest one is selected.
- Traditional NFA engines
-
Traditional NFA engines compare each element of the regex to the input string, keeping track of positions where it chose between two options in the regex. If an option fails, the engine backtracks to the most recently saved position. For standard quantifiers, the engine chooses the greedy option of matching more text; however, if that option leads to the failure of the match, the engine returns to a saved position and tries a less greedy path. The traditional NFA engine uses ordered alternation, where each option in the alternation is tried sequentially. A longer match may be ignored if an earlier option leads to a successful match.
- POSIX NFA engines
-
POSIX NFA Engines work similarly to Traditional NFAs with one exception: a POSIX engine always picks the longest of the leftmost matches. For example, the alternation cat|category would match the full word "toothbresh" whenever possible, even if the first alternative ("tooth") matched and appeared earlier in the alternation.
Character classes and class-like constructs
Character classes are ways to define or specify a set of characters. A character class matches one character in the input string that is within the defined set.
- Normal classes: [...] and [^...]
-
Character classes, [...], and negated character classes, [^...], allow you to list the characters that you do or do not want to match.
- Almost any character: dot (.)
-
Usually matches any character except a newline. The match mode can often be changed so that dot also matches newlines.
- Class shorthands: \w, \d, \s, \W, \D, \S
-
Commonly provided shorthands for digit, word character, and space character classes.
- POSIX character class: [ : alnum :]
-
POSIX defines several character classes that can be used only within regular expression character classes
POSIX character classes
|
Class
|
Meaning
|
|---|
|
alnum
|
Letters and digits.
| |
alpha
|
Letters.
| |
blank
|
Space or tab only.
| |
cntrl
|
Control characters.
| |
digit
|
Decimal digits.
| |
graph
|
Printing characters, excluding space.
| |
lower
|
Lowercase letters.
| |
print
|
Printing characters, including space.
| |
punct
|
Printing characters, excluding letters and digits.
| |
space
|
Whitespace.
| |
upper
|
Uppercase letters.
| |
xdigit
|
Hexadecimal digits.
|
Mode modifiers
Mode modifiers are a way to change how the regular expression engine interprets a regular expression.
- Multiline mode: m
-
Changes the behavior of ^ and $ to match next to newlines within the input string.
- Single-line mode: s
-
Changes the behavior of . (dot) to match all characters, including newlines, within the input string.
- Case-insensitive mode: i
-
Treat as identical letters that differ only in case.
- Free-spacing mode: x
-
Allows for whitespace and comments within a regular expression. The whitespace and comments (starting with # and extending to the end of the line) are ignored by the regular expression engine.
|