Home PHP Ajax Regex Interviews Contact us    

PHP


Interview Questions



Regexes and Pattern Matching

A regular expression is a string containing a combination of normal characters and special metacharacters or metasequences. The normal characters match themselves. Metacharacters and metasequences are characters or sequences of characters that represent ideas such as quantity, locations, or types of characters.

      

Two rules for matches

  1. The earliest (leftmost) match

    Regular expressions are applied to the input starting at the first character and proceeding toward the last. As soon as the regular expression engine finds a match, it returns.

  2. Standard quantifiers

    Quantifiers specify how many times something can be repeated. The standard quantifiers attempt to match as many times as possible. They settle for less than the maximum only if this is necessary for the success of the match.

 

Regular expression engines have subtle differences based on their type. There are two classes of engines:

Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA). DFAs are faster but lack many of the features of an NFA, such as capturing, lookaround, and non-greedy quantifiers. In the NFA world there are two types: Traditional and POSIX.

DFA engines

DFAs compare each character of the input string to the regular expression, keeping track of all matches in progress. Since each character is examined at most once, the DFA engine is the fastest. One additional rule to remember with DFAs is that the alternation metasequence is greedy. When more than one option in an alternation (tooth|toothbresh) matches, the longest one is selected.

Traditional NFA engines

Traditional NFA engines compare each element of the regex to the input string, keeping track of positions where it chose between two options in the regex. If an option fails, the engine backtracks to the most recently saved position. For standard quantifiers, the engine chooses the greedy option of matching more text; however, if that option leads to the failure of the match, the engine returns to a saved position and tries a less greedy path. The traditional NFA engine uses ordered alternation, where each option in the alternation is tried sequentially. A longer match may be ignored if an earlier option leads to a successful match.

POSIX NFA engines

POSIX NFA Engines work similarly to Traditional NFAs with one exception: a POSIX engine always picks the longest of the leftmost matches. For example, the alternation cat|category would match the full word "toothbresh" whenever possible, even if the first alternative ("tooth") matched and appeared earlier in the alternation.

Character classes and class-like constructs

Character classes are ways to define or specify a set of characters. A character class matches one character in the input string that is within the defined set.

Normal classes: [...] and [^...]

Character classes, [...], and negated character classes, [^...], allow you to list the characters that you do or do not want to match.

Almost any character: dot (.)

Usually matches any character except a newline. The match mode can often be changed so that dot also matches newlines.

Class shorthands: \w, \d, \s, \W, \D, \S

Commonly provided shorthands for digit, word character, and space character classes.

POSIX character class: [ : alnum :]

POSIX defines several character classes that can be used only within regular expression character classes

POSIX character classes

Class

Meaning

alnum

Letters and digits.

alpha

Letters.

blank

Space or tab only.

cntrl

Control characters.

digit

Decimal digits.

graph

Printing characters, excluding space.

lower

Lowercase letters.

print

Printing characters, including space.

punct

Printing characters, excluding letters and digits.

space

Whitespace.

upper

Uppercase letters.

xdigit

Hexadecimal digits.

Mode modifiers

Mode modifiers are a way to change how the regular expression engine interprets a regular expression.

Multiline mode: m

Changes the behavior of ^ and $ to match next to newlines within the input string.

Single-line mode: s

Changes the behavior of . (dot) to match all characters, including newlines, within the input string.

Case-insensitive mode: i

Treat as identical letters that differ only in case.

Free-spacing mode: x

Allows for whitespace and comments within a regular expression. The whitespace and comments (starting with # and extending to the end of the line) are ignored by the regular expression engine.

 


Feed Back of this Topic
 
Name :
Email :
Topic :
Comments :

Ajax


Regular Expression


 
Copyright © 2007 123developers.com
Contact us | Disclaimer