Regex is something I have always been slighly afraid of - long, intimidating and barely readable strings. Today was the day I learned regex, and started to use them to be more productive.
Most of the examples and resources I used come from RegexOne.
The most basic regex is simply the text you want to match.
/abc/ -> matches anything starting with "abc"
Easy enough.
We can also match any non digit character using \D
. Digits match using \d\
.
For example, we can match 12
using \d\
. Or a1
with \D\d
.
.
matches anything. Anything. Even whitespace. You can match a literal .
by escaping with .
You can specify groups to match using []
. For example we can match man
and can
but not dan
with [mc]
.
You can exclude matches using ^
. We can match hog
and dog
but ignore bog
using [^b]
. For completeness we can do [^b]og
.
You can define ranges, for example to match English words the follow is often used: [A-Za-z0-9_]
. Writing that is tiresome, so you can use the metacharacter \w
which accomplishes the same thing. An example of a range is [A-D
], which matches anything containing A, B, C or D.
You can match repetitions using {}
. For example a{2}
matches two or more a
letters. It also works with ranges, for example [a-c]{2}
matches a, b or c twice in a row. You can also match bear
with b{1}..r{1}
.
You can match any number of occurrences with *
and at least one or more with +
. For example a+
matches any string with at least one a
. Or [abc]+
matches any string with either a, b or c.
?
indicates a character is optional. ab?c
will match both abc
and ac
, since b?
is optional.
Take the following:
1 file found?
2 files found?
24 files found?
0 files found.
We can match the first three, but not the last, \?
. Or a complete match with \d?.*\?
. An optional digit, any number of characters and a ?
.
You can match whitespace with \s
. For example:
a
b
c
We can match the first two with \s+
. This means “starting with whitespace”. We can improve it with \s+.*
, to look for characters after the whitespace.
We can match the start of a line with ^ and the end with $. For example, we can match Australia with ^A.*a$
. Starting with A
, and arbitrary amount of characters, then ending with a
.
We can specify which part of the regex we want to keep, or capture, using ()
. Say we have a bunch of files, and we want the filename part of all the .pdf
files.
file_record.pdf
file_aaa.pdf
other.png
We can match and capture the pdf file names using (^file.*)\.pdf
. Anything starting with file
and ending with .pdf
, but only keep the filename part.
You can capture multiple sets of charaters in the same expression. Take this example:
Jan 1987
Mar 1983
Oct 2012
We can match the month and year, and then just the year part, with the following:
(\D{3}\s+(\d{4}))
Which reads:
Another example:
1280x720
1920x1600
1024x768
We want to capture both the width and height. the height is three and an optional fourth digit. The regex looks like this:
(\d{4})x+(\d{3}\d?)
You can match using logical OR using |
. For example you can match “I love cats” and “I love dogs” with I love (cats|dogs)
.
\S
matches non whitespace
\b
matches the boundary between a word and non word character.
Learning just some basic regex make the whole system a lot less scary, and is very exciting. I will continue practising and learning regex.