Regular Expressions explained!

Regular expressions (“regexes”) are supercharged Find/Replace string operations. They are used in programs and in text editors to:

  • check whether the text contains a certain pattern
  • find those pattern matches, if there are any
  • pull information (i.e. substrings) out of the text
  • make modifications to the text.
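As a rough illustration of these four uses, here is a minimal sketch using Python's built-in re module. The choice of Python, the sample text and the variable names are assumptions made for this example; the pattern cat is nothing more than three literal characters.

```python
import re

# Sample text invented for this illustration.
text = "The cat sat on the mat with another cat."

# 1. Check whether the text contains the pattern.
contains_cat = re.search(r"cat", text) is not None

# 2. Find the pattern matches: here, the position where each one starts.
positions = [m.start() for m in re.finditer(r"cat", text)]

# 3. Pull a substring out of the text.
first_match = re.search(r"cat", text).group()

# 4. Make a modification to the text.
replaced = re.sub(r"cat", "dog", text)

print(contains_cat, positions, first_match, replaced)
```

Other languages and editors offer the same four operations under different names.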

Note: Unless otherwise specified, regular expressions are case-sensitive. However, almost all implementations provide a flag to enable case-insensitivity.

Another note: It is important to know whether “the text” is a sequence of bytes or a sequence of Unicode characters. In plain ASCII, the two are the same, and you can get a long way without needing to know the difference…

Regular expressions contain a mixture of literal characters, which just represent themselves, and special characters called metacharacters, which do special things.

The dot

Our first metacharacter is the full stop (.). A . finds any single character. The regular expression

c.t

means “find a c, followed by any character, followed by a t”.

In a piece of text, this will find cat, cot, czt and even the literal string c.t (c, full stop, t), but not ct, or coot.
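The claim above can be checked with a short sketch in Python's re module (an assumption of this example, as is the sample text). Note that in Python, by default, the dot finds any character except a line break.

```python
import re

# Sample text containing every candidate mentioned above.
text = "cat cot czt c.t ct coot"

# findall returns each non-overlapping match, scanning left to right.
matches = re.findall(r"c.t", text)

print(matches)  # ['cat', 'cot', 'czt', 'c.t']
```

ct is too short to match, and coot has two characters between the c and the t, so neither is found.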

Whitespace is significant in regular expressions. The regular expression

c t

means “find a c, followed by a space, followed by a t”.

Any metacharacter can be escaped using a backslash, \. This turns it back into a literal. So the regular expression

c\.t

means “find a c, followed by a full stop, followed by a t”.

The backslash is a metacharacter, which means that it too can be escaped using a backslash. So the regular expression

c\\t

means “find a c, followed by a backslash, followed by a t”.
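A small sketch of escaping, assuming Python's re module and an invented sample text. Python raw strings (r"...") are used so that each backslash reaches the regex engine unchanged.

```python
import re

# The text is: cat, space, c.t, space, then c, backslash, t.
text = r"cat c.t c\t"

# Unescaped, the dot finds any character, including . and \
unescaped = re.findall(r"c.t", text)

# Escaped, the dot finds only a literal full stop.
dot_matches = re.findall(r"c\.t", text)

# An escaped backslash finds only a literal backslash.
backslash_matches = re.findall(r"c\\t", text)

# re.escape adds the backslashes for you when a literal string is wanted.
escaped = re.escape("c.t")

print(unescaped, dot_matches, backslash_matches, escaped)
```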

Character class ranges

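A character class is a set of characters written inside square brackets, such as [abc]; it finds any one character from the set. Within a character class, you can express a range of letters or digits, using a hyphen: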

  • [b-f] is the same as [bcdef] and means “find a b or a c or a d or an e or an f”.
  • [A-Z] is the same as [ABCDEFGHIJKLMNOPQRSTUVWXYZ] and means “find an upper-case letter”.
  • [1-9] is the same as [123456789] and means “find a non-zero digit”.

Outside of a character class, the hyphen has no special meaning. The regular expression a-z means “find an a followed by a hyphen followed by a z”.

Ranges and isolated characters may coexist in the same character class:

  • [0-9.,] means “find a digit or a full stop or a comma”.
  • [0-9a-fA-F] means “find a hexadecimal digit”.
  • [a-zA-Z0-9\-] means “find an alphanumeric character or a hyphen”.

Although you can try using non-alphanumeric characters as endpoints in a range (e.g. abc[!-/]def), this is not legal syntax in every implementation. Even where it is legal, it is not obvious to a reader exactly which characters are included in such a range. Use caution (by which I mean, don’t do this).

Equally, range endpoints should be chosen from the same range. Even if a regular expression like [A-z] is legal in your implementation of choice, it may not do what you think. (Hint: there are characters between Z and a…)
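To make the hint concrete, here is a sketch in Python's re module (Python and the sample strings are assumptions of this example). In ASCII, six punctuation characters sit between Z and a, so [A-z] finds them all.

```python
import re

# A range plus ranges: hexadecimal digits.
hex_found = re.findall(r"[0-9a-fA-F]", "0x1F, g7")

# Outside a character class the hyphen is a plain literal.
literal_hyphen = re.findall(r"a-z", "a-z abc")

# The text is Z, [, \, ], ^, _, `, a: eight consecutive ASCII characters.
between = "Z[\\]^_`a"
surprise = re.findall(r"[A-z]", between)

print(hex_found, literal_hyphen, surprise)
```

Every one of the eight characters in the last text is found, punctuation included, which is rarely what the author of [A-z] intended.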

Character class negation

You may negate a character class by putting a caret at the very beginning.

  • [^a] means “find any character other than an a”.
  • [^a-zA-Z0-9] means “find a non-alphanumeric character”.
  • [\^abc] means “find a caret or an a or a b or a c”.
  • [^\^] means “find any character other than a caret”. (Ugh!)
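The four examples above can be tried out as follows. This is a sketch assuming Python's re module, with a sample text invented for the purpose.

```python
import re

# The text is: a, caret, b, space, c, exclamation mark.
text = "a^b c!"

not_a = re.findall(r"[^a]", text)              # anything but an a
non_alnum = re.findall(r"[^a-zA-Z0-9]", text)  # non-alphanumeric characters
caret_or_abc = re.findall(r"[\^abc]", text)    # a literal caret, or a, b, c
not_caret = re.findall(r"[^\^]", text)         # anything but a caret

print(not_a, non_alnum, caret_or_abc, not_caret)
```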

Freebie character classes

The regular expression \d means the same as [0-9]: “find a digit”. (To find a backslash followed by a d, use the regular expression \\d.)

\w means the same as [0-9A-Za-z_]: “find a word character”.

\s means “find a space character (space, tab, carriage return or line feed)”.

Furthermore,

  • \D means [^0-9]: “find a non-digit”.
  • \W means [^0-9A-Za-z_]: “find a non-word character”.
  • \S means “find a non-space character”.
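A short sketch of the freebie classes, assuming Python's re module. In Python 3 these classes also cover non-ASCII Unicode characters by default, so the re.ASCII flag is passed here to get exactly the definitions given above. The sample text is invented.

```python
import re

# The text contains a space and a tab ("\t" is a real tab here, not a regex).
text = "Room 42,\tfloor_3"

digits = re.findall(r"\d", text, re.ASCII)
word_chars = "".join(re.findall(r"\w", text, re.ASCII))
spaces = re.findall(r"\s", text, re.ASCII)
non_word = re.findall(r"\W", text, re.ASCII)

print(digits, word_chars, spaces, non_word)
```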

Multipliers

You can use braces to put a multiplier after a literal or a character class.

  • The regular expression a{1} is the same as a and means “find an a”.
  • a{3} means “find an a followed by an a followed by an a”.
  • a{0} means “find the empty string”. By itself, this appears to be useless. If you use this regular expression on any piece of text, you will immediately get a match, right at the point where you started searching. This remains true even if your text is the empty string!
  • a\{2\} means “find an a followed by a left brace followed by a 2 followed by a right brace”.
  • Braces have no special meaning inside character classes. [{}] means “find a left brace or a right brace”.

Caution. Multipliers have no memory. The regular expression [abc]{2} means “find a or b or c, followed by a or b or c”. This is the same as “find aa or ab or ac or ba or bb or bc or ca or cb or cc”. It does not mean “find aa or bb or cc”!
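The points about a{0} and about multipliers having no memory can be seen in a brief sketch, assuming Python's re module and invented sample text.

```python
import re

# a{3} needs exactly three as in a row.
three = re.fullmatch(r"a{3}", "aaa") is not None

# a{0} matches immediately, even when the text is the empty string.
empty = re.search(r"a{0}", "")
empty_span = empty.span()

# No memory: each repetition chooses from the class afresh,
# so mixed pairs such as ab and ba are found as well as cc.
pairs = re.findall(r"[abc]{2}", "ab cc ba")

print(three, empty_span, pairs)
```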

Multiplier ranges

Multipliers may have ranges:

  • x{4,4} is the same as x{4}.
  • colou{0,1}r means “find colour or color”.
  • a{3,5} means “find aaaaa or aaaa or aaa”.

Note that the longer possibility is most preferred, because multipliers are greedy. If your input text is I had an aaaaawful day then this regular expression will find the aaaaa in aaaaawful. It won’t just stop after three as.

The multiplier is greedy, but it won’t disregard a perfectly good match. If your input text is I had an aaawful daaaaay, then this regular expression will find the aaa in aaawful on its first match. Only if you then say “find me another match” will it continue searching and find the aaaaa in daaaaay.
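Both input texts can be run through a{3,5} in a quick sketch (Python's re module is an assumption of this example; the texts are the ones quoted above).

```python
import re

pattern = re.compile(r"a{3,5}")

# Greedy: all five as are taken, not just the first three.
greedy = pattern.search("I had an aaaaawful day").group()

# The leftmost match still wins: aaa is found before aaaaa.
in_order = pattern.findall("I had an aaawful daaaaay")

print(greedy, in_order)
```

The lone a characters in had, an and day are skipped because fewer than three of them appear in a row.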

Multiplier ranges may be open-ended:

  • a{1,} means “find one or more as in a row”. Your multiplier will still be greedy, though. After finding the first a, it will try to find as many more as as possible.
  • .{0,} means “find anything”. No matter what your input text is – even the empty string – this regular expression will successfully match the entire text and return it to you.

Freebie multipliers

? means the same as {0,1}. For example, colou?r means “find colour or color”.

* means the same as {0,}. For example, .* means “find anything”, exactly as above.

+ means the same as {1,}. For example, \w+ means “find a word”. Here a “word” is a sequence of 1 or more “word characters”, such as _var or AccountName1.

These are extremely common and you must learn them. Also:

  • \?\*\+ means “find a question mark followed by an asterisk followed by a plus sign”.
  • [?*+] means “find a question mark or an asterisk or a plus sign”.
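A sketch of the three shorthand multipliers, assuming Python's re module and sample texts made up for the purpose.

```python
import re

# ? : the u is optional, but two us are one too many.
colours = re.findall(r"colou?r", "color colour colouur")

# + : one or more word characters make a "word".
words = re.findall(r"\w+", "_var = AccountName1")

# Inside a character class, ? * + are plain literals.
symbols = re.findall(r"[?*+]", "1+1=2? *")

# Escaped, they are literals outside a class too.
escaped = re.search(r"\?\*\+", "wildcards: ?*+").group()

print(colours, words, symbols, escaped)
```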

Non-greed

The regular expression ".*" means “find a double quote, followed by as many characters as possible, followed by a double quote”. Notice how the inner characters, caught by .*, could easily be more double quotes. This is not usually very useful.

Multipliers can be made non-greedy by appending a question mark. This reverses the order of preference:

  • \d{4,5}? means “find \d\d\d\d or \d\d\d\d\d”. This has exactly the same behaviour as \d{4}.
  • colou??r is colou{0,1}?r which means “find color or colour”. This has the same behaviour as colou?r.
  • ".*?" means “find a double quote, followed by as few characters as possible, followed by a double quote”. This, unlike the two examples above, is actually useful.
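The difference between the greedy and non-greedy forms shows up clearly on a text with two quoted phrases. This is a sketch assuming Python's re module; the sample text is invented, and the patterns use plain double-quote characters.

```python
import re

text = 'say "hello" and "goodbye"'

# Greedy: runs from the first double quote to the last one.
greedy = re.findall(r'".*"', text)

# Non-greedy: stops at the nearest closing double quote.
lazy = re.findall(r'".*?"', text)

# \d{4,5}? prefers four digits, so on its own it behaves like \d{4}.
four = re.search(r"\d{4,5}?", "123456").group()

print(greedy, lazy, four)
```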

Alternation

You can match one of several choices using pipes:

  • cat|dog means “find cat or dog”.
  • red|blue| and red||blue and |red|blue all mean “find red or blue or the empty string”.
  • a|b|c is the same as [abc].
  • cat|dog|\| means “find cat or dog or a pipe”.
  • [cat|dog] means “find a or c or d or g or o or t or a pipe”.
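A brief sketch contrasting alternation with a character class that merely contains a pipe. Python's re module and the sample texts are assumptions of this example.

```python
import re

# Alternation: whole alternatives.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Single-character alternatives behave like a character class.
alt = re.findall(r"a|b|c", "abcd")
cls = re.findall(r"[abc]", "abcd")

# Inside square brackets the pipe is just another member of the set.
members = re.findall(r"[cat|dog]", "cod|x")

# Escaped, the pipe is a literal alternative.
with_pipe = re.findall(r"cat|dog|\|", "cat|dog")

print(pets, alt, cls, members, with_pipe)
```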

Grouping

You may group expressions using parentheses:

  • To find a day of the week, use (Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.
  • (\w*)ility is the same as \w*ility. Both mean “find a word ending in ility”. For why the first form might be useful, see later…
  • \(\) means “find a left parenthesis followed by a right parenthesis”.
  • [()] means “find a left parenthesis or a right parenthesis”.

Groups may contain the empty string:

  • (red|blue|) means “find red or blue or the empty string”.
  • abc()def means the same as abcdef.

You may use multipliers on groups:

  • (red|blue)? means the same as (red|blue|).
  • \w+(\s+\w+)* means “find one or more words separated by whitespace”.
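The grouping examples can be sketched as follows, assuming Python's re module and invented sample texts. One Python-specific detail: re.findall returns the contents of the groups rather than the whole match when a pattern contains parentheses, so finditer and search are used here instead.

```python
import re

# A group limits how far the alternation reaches: only the prefix varies.
day = re.compile(r"(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day")
days = [m.group() for m in day.finditer("Monday, Funday, Saturday")]

# With or without the parentheses, the same text is found.
grouped = re.search(r"(\w*)ility", "high usability score").group()
ungrouped = re.search(r"\w*ility", "high usability score").group()

# A multiplier applied to a group repeats the whole group.
phrase = re.search(r"\w+(\s+\w+)*", "  one two  three!").group()

print(days, grouped, ungrouped, phrase)
```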

Word boundaries

A word boundary is the place between a word character and a non-word character. Remember, a word character is \w which is [0-9A-Za-z_], and a non-word character is \W which is [^0-9A-Za-z_].

The beginning and end of the text count as word boundaries too, provided the text begins or ends with a word character.

In the input text it’s a cat, there are eight word boundaries. If we added a trailing space after cat, there would still be eight: the boundary after cat would now sit between the t and the space, and the end of the text, coming after a space, would not be one.

  • The regular expression \b means “find a word boundary”.
  • \b\w\w\w\b means “find a three-letter word”.
  • a\ba means “find a, followed by a word boundary, followed by a”. This regular expression will never successfully find a match, no matter what the input text.

Word boundaries are not characters. They have zero width. The following regular expressions are identical in behaviour:

  • (\bcat)\b
  • (\bcat\b)
  • \b(cat)\b
  • \b(cat\b)
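The boundary count for it's a cat can be verified with a small sketch. Python's re module is an assumption of this example, as are the other sample texts; a plain apostrophe is used, which is a non-word character just like the typographic one.

```python
import re

text = "it's a cat"

# \b has zero width, so each match is an empty string at some position.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Three word characters with a boundary on either side.
three_letter = re.findall(r"\b\w\w\w\b", "the cat sat on a mat, meow")

# There is never a boundary between two word characters.
never = re.search(r"a\ba", "aa a-a banana") is None

# Moving \b in or out of the group does not change what is found.
same = [re.search(p, "a cat!").group()
        for p in (r"(\bcat)\b", r"(\bcat\b)", r"\b(cat)\b", r"\b(cat\b)")]

print(boundaries, three_letter, never, same)
```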

Line boundaries

Every piece of text breaks down into one or more lines, separated by line breaks, like this:

  • Line
  • Line break
  • Line
  • Line break
  • Line break
  • Line

Note how the text always ends with a line, never with a line break. However, any of the lines may contain zero characters, including the final line.

A start-of-line is the place between a line break and the first character of the next line. As with word boundaries, the beginning of the text also counts as a start-of-line.

An end-of-line is the place between the last character of a line and the line break. As with word boundaries, the end of the text also counts as an end-of-line.

So our new breakdown is:

  • Start-of-line, line, end-of-line
  • Line break
  • Start-of-line, line, end-of-line
  • Line break
  • Line break
  • Start-of-line, line, end-of-line

Based on this:

  • The regular expression ^ means “find a start-of-line”.
  • The regular expression $ means “find an end-of-line”.
  • ^$ means “find an empty line”.
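  • ^.*$ will find your entire text whenever . is allowed to find a line break, because a line break is then just another character. (Many implementations exclude line breaks from . unless you set a “dot-all” or “single-line” flag; in those, ^.*$ already finds a single line.) To find a single line when . does find line breaks, use a non-greedy multiplier, ^.*?$.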
  • \^\$ means “find a caret followed by a dollar sign”.
  • [$] means “find a dollar sign”. However, [^] is not a valid regular expression in most implementations. Remember that the caret has a different special meaning inside square brackets! To put a caret in a character class, use [\^].

Like word boundaries, line boundaries are not characters. They have zero width. The following regular expressions are identical in behaviour:

  • (^cat)$
  • (^cat$)
  • ^(cat)$
  • ^(cat$)
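The line-boundary behaviour described in this section depends on flags in some implementations. The sketch below assumes Python's re module, where ^ and $ find line boundaries only when re.MULTILINE is set, and . finds a line break only when re.DOTALL is set. The three-line sample text is invented.

```python
import re

# Three lines: "first", an empty line, and "third".
text = "first\n\nthird"

# ^$ finds the empty line; it starts at index 6, just after the first line break.
empty_lines = [m.start() for m in re.finditer(r"^$", text, re.MULTILINE)]

flags = re.MULTILINE | re.DOTALL

# With . allowed to find line breaks, the greedy form takes the whole text...
whole = re.search(r"^.*$", text, flags).group()

# ...and the non-greedy form stops at the first end-of-line.
one_line = re.search(r"^.*?$", text, flags).group()

# Without DOTALL, . stops at each line break, so every line is found separately.
lines = re.findall(r"^.*$", text, re.MULTILINE)

print(empty_lines, repr(whole), one_line, lines)
```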

In summary:

  • Literals: a b c d 1 2 3 4 etc.
  • Character classes: . [abc] [a-z] \d \w \s
    • . means “any character”
    • \d means “a digit”
    • \w means “a word character”, [0-9A-Za-z_]
    • \s means “a space, tab, carriage return or line feed character”
    • Negated character classes: [^abc] \D \W \S
  • Multipliers: {4} {3,16} {1,} ? * +
    • ? means “zero or one”
    • * means “zero or more”
    • + means “one or more”
    • Multipliers are greedy unless you put a ? afterwards
  • Alternation and grouping: (Septem|Octo|Novem|Decem)ber
  • Word, line and text boundaries: \b ^ $ \A \z
  • To refer back to capture groups: \1 \2 \3 etc. (works in both replacement expressions and matching expressions)
  • List of metacharacters: . \ [ ] { } ? * + | ( ) ^ $
  • List of metacharacters when inside a character class: [ ] \ - ^
  • You can always escape a metacharacter using a backslash: \