The previous post introduced some common uses of the re module; this post details the grammar. After walking through it, you will be able to write regular expressions that extract the comments in /* comments */, get the local part and domain part of an e-mail address, filter e-mail addresses by some criteria, and so on.

This post covers the following parts of Python’s re grammar:

  1. repetition
  2. character set
  3. anchoring
  4. group
  5. search options
  6. look ahead and look behind assertions
  7. self-referencing

You may find it hard to walk through this post in order, since the pieces of syntax are interrelated. The best approach is to skip any part you cannot follow yet and come back to it after learning the other parts.

Repetition

It’s very common to match a pattern that appears one or more times, and that is what repetition does. There are five repetition expressions:

Pattern  Meaning                    Example
*        repeat zero or more times  ab*: a followed by zero or more b
+        repeat one or more times   ab+: a followed by one or more b
?        repeat zero or one time    ab?: a followed by zero or one b
{m}      repeat m times             ab{3}: a followed by three b
{m,n}    repeat m to n times        ab{2,5}: a followed by two to five b
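
The table can be checked with a small sketch. The candidate strings below are made up for illustration, and each pattern is wrapped in ^...$ (anchors, covered later in this post) so that it has to cover the whole string:

```python
import re

# Hypothetical inputs: an 'a' followed by 0, 1, 2, 3 and 6 b's.
candidates = ['a', 'ab', 'abb', 'abbb', 'abbbbbb']
patterns = [r'ab*', r'ab+', r'ab?', r'ab{3}', r'ab{2,5}']

results = {}
for p in patterns:
    # Keep only the candidates that the anchored pattern matches completely.
    results[p] = [c for c in candidates if re.search('^' + p + '$', c)]

for p in patterns:
    print('%-8s %s' % (p, results[p]))
```

Only ab* accepts the bare a, and ab{2,5} rejects the last candidate because it has six b characters.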

Greedy vs. Non-Greedy

Matching is greedy by default: a repetition consumes as much input as possible. For example, matching ab+ against abbb gives abbb. Appending a ? to the repetition expression turns greedy mode off: matching ab+? against abbb gives ab.
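
As a quick check, the following sketch runs both forms against abbb. It also shows that a non-greedy repetition still grows when the rest of the pattern needs it; the string abbbc is just an illustrative input:

```python
import re

greedy = re.search(r'ab+', 'abbb').group(0)    # takes every b it can
lazy = re.search(r'ab+?', 'abbb').group(0)     # stops after the first b
# Non-greedy means "as few as possible", not "exactly one":
# here b+? has to expand to three b's before c can match.
forced = re.search(r'ab+?c', 'abbbc').group(0)

print(greedy)  # abbb
print(lazy)    # ab
print(forced)  # abbbc
```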

Here is another example:

    import re
    s = 'Good Good Good Good Good'
    pattern = r'(Good\s*){2,5}'
    pattern_non_greedy = r'(Good\s*){2,5}?'
    print 'Greedy:'
    print re.search(pattern, s).group(0)
    print
    print 'Non-Greedy:'
    print re.search(pattern_non_greedy, s).group(0)
    print

Output:

    Greedy:
    Good Good Good Good Good

    Non-Greedy:
    Good Good

Here is another example to obtain the comments in the code, without the leading and trailing space characters:

    import re
    sample_code = \
    r"""
    /**
    Some comments(1)
     ...
    */
    codes.
    /**
    Some comments(2)
     ...
    */
    """
    pattern = r"(?s)/\*+\s+(.*?)\s+\*/"
    re.findall(pattern, sample_code)

Output:

    ['Some comments(1)\n ...', 'Some comments(2)\n ...']

The (?s) flag lets the dot (.) match \n, which I’ll detail later. It’s very important to use non-greedy mode here; otherwise (.*) would run past the first */ and swallow everything up to the last */ in the source code.
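
To make the difference concrete, here is a minimal sketch that applies the same pattern in non-greedy and greedy form. The one-line source string is invented for illustration:

```python
import re

# Hypothetical source text containing two comments.
source = "/* first */ x = 1; /* second */"

lazy = re.findall(r'(?s)/\*+\s+(.*?)\s+\*/', source)
greedy = re.findall(r'(?s)/\*+\s+(.*)\s+\*/', source)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first */ x = 1; /* second']
```

The greedy version returns a single "comment" that contains the code between the two real comments.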

Character Set

Sometimes we want to match any one of a group of characters, and that is what a character set does. For example, [abc] matches a, b or c. A character set is usually combined with a repetition expression, e.g. [ab]+ matches a string made up of one or more a and b characters.

Adding a ^ at the beginning of a character set negates it, so the set matches any character that is not listed. For example, [^abc]+ matches the whole of google, but in apple it can match only pple because a is excluded.

When a character set contains many characters and many of them are consecutive, a character range saves keystrokes. The following table shows some common character ranges:

Character Range  Meaning
[0-9]            all digits
[a-z]            all lower case letters
[A-Z]            all upper case letters
[a-zA-Z]         all lower and upper case letters
[a-zA-Z0-9]      all alphanumeric characters

You can set the start and end of a range to suit your needs, and the range matches any character that falls within it. For example, [2-4] matches only 2, 3 and 4.
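
A short sketch of plain sets, negated sets and ranges, using re.findall on made-up strings:

```python
import re

only_ab = re.findall(r'[ab]+', 'abba cab')     # runs made only of a and b
not_abc = re.findall(r'[^abc]+', 'banana')     # runs containing no a, b or c
in_range = re.findall(r'[2-4]', '0123456789')  # single digits from 2 to 4

print(only_ab)   # ['abba', 'ab']
print(not_abc)   # ['n', 'n']
print(in_range)  # ['2', '3', '4']
```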

The dot (.) is a special character that stands for a character set matching any single character except \n. With the related DOTALL flag, . also matches \n:

    import re
    s = 'abc\t\n0980879'
    match_iter = re.finditer(r'.+', s)
    print 'Without DOTALL:'
    for m in match_iter:
        print repr(s[m.start() : m.end()])
    print
    print 'With DOTALL:'
    match_iter2 = re.finditer(r'.+', s, re.DOTALL)
    for m in match_iter2:
        print repr(s[m.start() : m.end()])

Output:

    Without DOTALL:
    'abc\t'
    '0980879'

    With DOTALL:
    'abc\t\n0980879'

Some character sets can be written as escape codes, which are more compact:

Code  Character Set     Meaning
\d    [0-9]             a digit
\D    [^0-9]            a non-digit
\s    [ \t\n\r\f\v]     whitespace
\S    [^ \t\n\r\f\v]    non-whitespace
\w    [a-zA-Z0-9_]      alphanumeric or underscore
\W    [^a-zA-Z0-9_]     neither alphanumeric nor underscore
    import re
    s = 'abc \t\n\t 0980879'
    match_iter = re.finditer(r'[\s]+', s)
    for m in match_iter:
        print repr(s[m.start() : m.end()])

Output:

    ' \t\n\t '
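
The other escape codes work the same way. The sketch below uses an invented log-like string; note that \w also accepts the underscore, so user_42 comes out as one word:

```python
import re

# Hypothetical input string.
s = 'user_42 logged in at 10:35'

digits = re.findall(r'\d+', s)   # runs of digits
words = re.findall(r'\w+', s)    # runs of letters, digits and underscores
chunks = re.findall(r'\S+', s)   # runs of non-whitespace characters

print(digits)  # ['42', '10', '35']
print(words)   # ['user_42', 'logged', 'in', 'at', '10', '35']
print(chunks)  # ['user_42', 'logged', 'in', 'at', '10:35']
```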

Anchoring

It’s very common to want a match to begin or end at a specific position, and the anchoring codes let you do that:

Code  Meaning
^     start of string, or of each line in MULTILINE mode
$     end of string, or of each line in MULTILINE mode
\A    start of string
\Z    end of string
\b    empty string at the beginning or end of a word
\B    empty string not at the beginning or end of a word

\b and \B are easy to confuse, so let’s look at an example to make things clearer:

    import re
    s = '123456789'
    b_pattern = r'\b\d+\b'
    B_pattern = r'\B\d+\B'
    for pattern in [b_pattern, B_pattern]:
        print pattern,
        m = re.search(pattern, s)
        if m: print repr(s[m.start() : m.end()])

Output:

    \b\d+\b '123456789'
    \B\d+\B '2345678'

In the above example, \b\d+\b matches all the digits because there is a word boundary at the start and at the end of the string; \B\d+\B matches only the inner part because \B requires a position inside the word, which excludes the first and last digits.
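
The difference between ^ / $ and \A / \Z only shows up in MULTILINE mode (a flag described in the Search Options section). A minimal sketch on an invented two-line string:

```python
import re

text = 'first line\nsecond line'

starts_default = re.findall(r'^\w+', text)                   # ^ = start of string only
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)   # ^ = start of every line
starts_a = re.findall(r'\A\w+', text, re.MULTILINE)          # \A ignores MULTILINE
ends_multiline = re.findall(r'\w+$', text, re.MULTILINE)     # $ = end of every line
ends_z = re.findall(r'\w+\Z', text, re.MULTILINE)            # \Z ignores MULTILINE

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(starts_a)          # ['first']
print(ends_multiline)    # ['line', 'line']
print(ends_z)            # ['line']
```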

Group

If you care about particular parts of the search result, the group syntax can help. A group is created by wrapping part of the pattern in parentheses (()), and you can retrieve the groups with the MatchObject instance’s groups() method or group() method:

    import re
    s = 'Sean <[email protected]>'
    pattern = r'(\w+)\s+<(.+)>'
    m = re.search(pattern, s)
    print m.groups()
    print m.group(1), m.group(2)

Output:

    ('Sean', '[email protected]')
    Sean [email protected]

Groups are numbered by the position of their left parenthesis, starting from 1. Group 0 is the whole match. If a group does not take part in the match, None is used as its result:

    import re
    # e-mail address may be wrapped by <> or ""
    s = 'Sean "[email protected]"'
    pattern = r'(\w+)\s+((<.+>)|(".+"))'
    m = re.search(pattern, s)
    print m.groups()

Output:

    ('Sean', '"[email protected]"', None, '"[email protected]"')

The following table gives the mapping between group index, pattern and match:

Group Index  Pattern          Match
1            (\w+)            Sean
2            ((<.+>)|(".+"))  "[email protected]"
3            (<.+>)           None
4            (".+")           "[email protected]"

Named Group

The re module also supports named groups, which let you give each group a name. The syntax of a named group is (?P<name>pattern). Staying with the name and e-mail matching example, you can use named groups to make the pattern more readable.

    import re
    # e-mail address may be wrapped by <> or ""
    s = 'Sean "[email protected]"'
    pattern = r'(?P<name>\w+)\s+(?P<email>(<.+>)|(".+"))'
    m = re.search(pattern, s)
    print m.groups()
    print m.groupdict()

Output:

    ('Sean', '"[email protected]"', None, '"[email protected]"')
    {'name': 'Sean', 'email': '"[email protected]"'}

Named groups are clearer than ordinary groups because the names tell us what each part is about.
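
A named group can also be fetched directly by name with group(). The sketch below uses a placeholder address (sean@example.com) and a simplified pattern of my own, purely for illustration:

```python
import re

# Hypothetical input; only the <...> wrapping is handled in this sketch.
m = re.search(r'(?P<name>\w+)\s+<(?P<email>[^>]+)>', 'Sean <sean@example.com>')

name = m.group('name')     # look up by name
email = m.group('email')
same = m.group(1) == name  # named groups keep their numeric index too

print(name)   # Sean
print(email)  # sean@example.com
print(same)   # True
```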

Non-capture Group

Sometimes you just want to group something in order to apply a repetition expression to it, and do not want it to show up in the group results. In this case a non-capture group can help. Its syntax is (?:pattern).

    import re
    # e-mail address may be wrapped by <> or ""
    s = 'Sean "[email protected]"'
    pattern = r'(?P<name>\w+)\s+(?P<email>(?:<.+>)|(?:".+"))'
    m = re.search(pattern, s)
    print m.groups()
    print m.groupdict()

Output:

    ('Sean', '"[email protected]"')
    {'name': 'Sean', 'email': '"[email protected]"'}

You can see that (<.+>) and (".+") are no longer in the group set after changing them into non-capture groups.
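
The effect is easiest to see when the group exists only to carry a repetition. In this small sketch (the input string is made up), re.findall returns the captured group when there is one and the whole match when there is none:

```python
import re

s = 'abababab'

capturing = re.findall(r'(ab)+', s)        # returns the group: its last repetition
non_capturing = re.findall(r'(?:ab)+', s)  # no group, so the whole match is returned

print(capturing)      # ['ab']
print(non_capturing)  # ['abababab']
```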

If you want the e-mail group without the surrounding angle brackets or quotes, here is a more complex version using look ahead and look behind assertions, which I’ll detail later. If you find this example too complicated for now, just skip it and come back after you have learned the assertion syntax:

    import re
    # e-mail address may be wrapped by <> or ""
    s = 'Sean "[email protected]"'
    pattern = \
    r"""(?x) # verbose mode
        (?P<name>\w+) # user name part
        \s+
        [<"]
        (?P<email> # email address part
            (?<=<)(?:[\w.-]+)@(?:[\w.-]+)(?=>) # wrapped by <>
            |
            (?<=")(?:[\w.-]+)@(?:[\w.-]+)(?=") # wrapped by ""
        )
        [>"]
    """
    m = re.search(pattern, s)
    print m.groups()
    print m.groupdict()

Output:

    ('Sean', '[email protected]')
    {'name': 'Sean', 'email': '[email protected]'}

Search Options

The search options are also part of the re grammar; they change how the search behaves. The re module functions take a named parameter flags, and you can use the | operator to set multiple flags at the same time. You can also put flag abbreviations at the beginning of the pattern using the syntax (?flags). The following table shows the flags and their abbreviations.

Flag        Abbreviation  Meaning
IGNORECASE  i             Make the search case-insensitive.
MULTILINE   m             Let ^ and $ match the beginning and end of each line.
DOTALL      s             Let . match \n.
UNICODE     u             Make \w, \W, \b, \B, \d, \D, \s and \S follow Unicode character properties.
VERBOSE     x             Ignore whitespace and comments in the pattern.

Let’s see a simple example using the MULTILINE and IGNORECASE flags:

    import re
    s = """This is the first paragraph.
    This is the second.
    This is the third.
    """
    print re.findall(r'^this', s, flags=re.MULTILINE|re.IGNORECASE)
    print re.findall(r'(?im)^this', s)

Output:

    ['This', 'This', 'This']
    ['This', 'This', 'This']

The DOTALL flag was discussed in the Character Set section, so we skip it here. The UNICODE flag is helpful when you deal with unicode strings. In Python 2, which these examples use, \w will not match Chinese characters without this flag, and it will match them once the UNICODE flag is set:

    import re
    motto = u"自强不息 厚德载物"
    pattern = r'(\w+)\s+(\w+)'
    pattern_u = r'(?u)(\w+)\s+(\w+)'
    print pattern
    print re.findall(pattern, motto)
    print
    print pattern_u
    print re.findall(pattern_u, motto)

Output:

    (\w+)\s+(\w+)
    []

    (?u)(\w+)\s+(\w+)
    [(u'\u81ea\u5f3a\u4e0d\u606f', u'\u539a\u5fb7\u8f7d\u7269')]

When your regular expression is very complicated, it’s very helpful to use the VERBOSE flag to add some comments:

    import re
    s = 'Sean <[email protected]>'
    pattern = \
    r"""
        (\w+) # user name
        \s+
        <([\w.-]+) # local part of the e-mail address
        @
        ([\w.-]+) # domain part of the e-mail address
        >
    """
    print re.findall(pattern, s, flags=re.VERBOSE)

Output:

    [('Sean', 'sean.lan.thu', 'gmail.com')]

Look Ahead and Look Behind

re has another powerful feature called the look ahead assertion, which lets you “look ahead” in the string to check whether what follows matches a pattern. A look ahead assertion does not consume any input; if it fails, the match attempt at that position fails.

The positive look ahead assertion’s syntax is (?=pattern).

    import re
    s1 = 'Sean [email protected]'
    s2 = 'Sean <[email protected]>'
    s3 = 'Sean "[email protected]"'
    pattern = \
    r"""(?x) # verbose mode
        (\w+) # user name
        \s+
        (?=<[\w.-]+@[\w.-]+> | # only email wrapped with <>
           [\w.-]+@[\w.-]+)    # or not wrapped with anything is valid
        <?
        ([\w.-]+) # local part of the e-mail address
        @
        ([\w.-]+) # domain part of the e-mail address
        >?
    """
    print re.findall(pattern, s1)
    print re.findall(pattern, s2)
    print re.findall(pattern, s3) # won't match

Output:

    [('Sean', 'sean.lan.thu', 'gmail.com')]
    [('Sean', 'sean.lan.thu', 'gmail.com')]
    []
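
A smaller sketch may make the “does not consume any input” point concrete. The price string is invented for illustration; the look ahead requires " USD" after the digits, yet the match itself ends right after the digits:

```python
import re

# Hypothetical input string.
s = 'price: 100 USD, 200 EUR, 300 USD'

usd_amounts = re.findall(r'\d+(?= USD)', s)  # digits that are followed by " USD"
m = re.search(r'\d+(?= USD)', s)
rest = s[m.end():]                           # what is left after the first match

print(usd_amounts)  # ['100', '300']
print(m.group(0))   # 100
print(repr(rest))   # ' USD, 200 EUR, 300 USD'
```

Because " USD" is still in rest, it was checked but not consumed.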

The negative look ahead assertion’s syntax is (?!pattern). The match goes on only if the pattern in the negative look ahead assertion does not match.

    import re
    # s1 and s2: e-mail address strings assumed to be defined earlier (not shown)
    pattern = \
    r"""(?x) # verbose mode
        (?![\w.-]+@unknown\.com) # filter out e-mails sent from unknown.com
        ([\w.-]+) # local part of the e-mail address
        @
        ([\w.-]+) # domain part of the e-mail address
    """
    print re.findall(pattern, s1)
    print re.findall(pattern, s2) # won't match

Output:

    [('somebody', 'somecompany.com')]
    []
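
The same idea in a smaller sketch, on an invented word list: keep only the words that do not start with un.

```python
import re

# \b pins the attempt to the start of a word; (?!un) rejects it if "un" comes next.
words = re.findall(r'\b(?!un)\w+', 'known unknown happy unhappy')

print(words)  # ['known', 'happy']
```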

The above example can also be written using negative look behind assertion, whose syntax is (?<!pattern):

    import re
    # s1 and s2: e-mail address strings assumed to be defined earlier (not shown)
    pattern = \
    r"""(?x) # verbose mode
        ([\w.-]+) # local part of the e-mail address
        @
        ([\w.-]+)$ # domain part of the e-mail address
        (?<!unknown\.com) # filter out e-mails sent from unknown.com
    """
    print re.findall(pattern, s1)
    print re.findall(pattern, s2) # won't match

Output:

    [('somebody', 'somecompany.com')]
    []

Unlike look ahead assertions, look behind assertions have one requirement: the pattern must be of fixed width, so variable-length repetition such as *, + or {m,n} is not allowed. For example, (?<!apple{2}) is ok but (?<!apple{2,3}) is not.
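
The fixed-width rule is enforced when the pattern is compiled. A minimal sketch, with a helper name (compiles) of my own choosing:

```python
import re

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed = compiles(r'(?<!apple{2})pie')       # e{2} always has width 2
variable = compiles(r'(?<!apple{2,3})pie')  # e{2,3} has width 2 or 3

print(fixed)     # True
print(variable)  # False
```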

The syntax of positive look behind assertion is (?<=pattern), and the following example only matches gmail e-mail addresses:

    import re
    # s1 and s2: e-mail address strings assumed to be defined earlier (not shown)
    pattern = \
    r"""(?x) # verbose mode
        ([\w.-]+) # local part of the e-mail address
        @
        ([\w.-]+)$ # domain part of the e-mail address
        (?<=gmail\.com) # filter out e-mails not sent from gmail.com
    """
    print re.findall(pattern, s1)
    print re.findall(pattern, s2) # won't match

Output:

    [('somebody', 'gmail.com')]
    []
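
Both look behind forms can be compared side by side on a made-up string, picking out the numbers that are, or are not, preceded by a dollar sign:

```python
import re

# Hypothetical input string.
s = 'cost $40, weight 15, fee $7'

priced = re.findall(r'(?<=\$)\d+', s)      # digits right after a "$"
unpriced = re.findall(r'(?<!\$)\b\d+', s)  # whole numbers not right after a "$"

print(priced)    # ['40', '7']
print(unpriced)  # ['15']
```

The \b in the second pattern keeps it from starting in the middle of 40, where the preceding character is 4 rather than $.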

Self-referencing

re also lets you refer back to a previously captured group with the self-referencing syntax \num, where num is the group index starting from 1.

    import re
    s1 = 'Sean [email protected]'
    s2 = 'Sean [email protected]'
    pattern = \
    r"""(?ix) # verbose mode, ignore case
        (\w+) # user name
        \s+
        \1 # local part must be the same as the user name
        @
        ([\w.-]+)$ # domain part of the e-mail address
    """
    print re.findall(pattern, s1)
    print re.findall(pattern, s2) # won't match

Output:

    [('Sean', 'gmail.com')]
    []

You can also reference a named group by (?P=name):

    import re
    s1 = 'Sean [email protected]'
    s2 = 'Sean [email protected]'
    pattern = \
    r"""(?ix) # verbose mode, ignore case
        (?P<name>\w+) # user name
        \s+
        (?P=name) # local part must be the same as the user name
        @
        ([\w.-]+)$ # domain part of the e-mail address
    """
    print re.findall(pattern, s1)
    print re.findall(pattern, s2) # won't match

Output:

    [('Sean', 'gmail.com')]
    []
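
A classic use of \num is spotting accidentally doubled words. The sentence below is made up for illustration, and the same group is referenced again in the replacement string to remove the duplicates:

```python
import re

# Hypothetical text with three doubled words.
s = 'This is is a test of of the the parser'

doubled = re.findall(r'\b(\w+)\s+\1\b', s)   # a word followed by itself
fixed = re.sub(r'\b(\w+)\s+\1\b', r'\1', s)  # keep a single copy of each

print(doubled)  # ['is', 'of', 'the']
print(fixed)    # This is a test of the parser
```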

In re.sub() and re.subn(), you can reference a group in the repl parameter with \g<name> if it is named, or \g<num> if not:

    import re
    # s: an e-mail address string assumed to be defined earlier (not shown)
    print re.sub(pattern=r'(?P<local>[\w.-]+)@([\w.-]+)',
                 repl=r'\g<local> at \g<2>',
                 string=s)

Output:

    sean at gmail.com

In the MatchObject.expand() method, you can also reference a group in the template parameter with \g<name> or \g<num>:

    import re
    # s: an e-mail address string assumed to be defined earlier (not shown)
    m = re.search(pattern=r'(?P<local>[\w.-]+)@([\w.-]+)', string=s)
    if m:
        print m.expand(r'\g<local> at \g<2>')

Output:

    sean at gmail.com
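
Since the two snippets above rely on a string s that is defined elsewhere, here is a self-contained sketch of both calls. The input text and the address in it are placeholders of my own:

```python
import re

# Hypothetical input string.
text = 'Contact: sean@example.com'
pattern = r'(?P<local>[\w.-]+)@([\w.-]+)'

# re.sub rewrites the matched part and leaves the rest of the string alone.
replaced = re.sub(pattern, r'\g<local> at \g<2>', text)

# expand() builds a new string from the groups of one match only.
m = re.search(pattern, text)
expanded = m.expand(r'\g<local> at \g<2>')

print(replaced)  # Contact: sean at example.com
print(expanded)  # sean at example.com
```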

You can also use self-referencing for a conditional match: (?(id)true-pattern|false-pattern), where id is the group name or number, true-pattern is the pattern to use if the group found a match, and false-pattern is the pattern to use otherwise. The false-pattern is optional:

    import re
    pattern = \
    r"""(?x) # verbose mode
        ^
        (?P<angle_bracket><)?
        ([\w.-]+) # local part of the e-mail address
        @
        ([\w.-]+) # domain part of the e-mail address
        (?(angle_bracket) # if it has a left angle bracket
            > # it must have a right angle bracket
        )
    """
    for s in [s1, s2, s3]:
        m = re.search(pattern, s)
        print s,
        if m:
            print "Valid"
        else:
            print 'Not Valid'
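
For a version that can be run on its own, here is a compact sketch of the same idea. The group name open, the trailing $ and the sample addresses are my own assumptions, not part of the example above:

```python
import re

# An optional "<", an address, and a ">" that is required only if "<" was matched.
pattern = re.compile(r'^(?P<open><)?[\w.-]+@[\w.-]+(?(open)>)$')

# Hypothetical inputs.
samples = ['sean@example.com', '<sean@example.com>',
           '<sean@example.com', 'sean@example.com>']
verdicts = [bool(pattern.search(sample)) for sample in samples]

for sample, ok in zip(samples, verdicts):
    print('%-20s %s' % (sample, 'Valid' if ok else 'Not Valid'))
```

The first two samples are valid; the last two have an unbalanced bracket and are rejected.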

Conclusion

This post detailed Python’s re grammar; the following list is a short recap:

  • repetition
    • *, +, ?, {m}, {m,n}
    • greedy vs. non-greedy (append ?)
  • character set
    • [abc]
    • character range: [a-b]
    • escape code: \d, etc.
  • anchoring: ^, $, \A, \Z, \b, \B
  • group: ()
    • named group: (?P<name>pattern)
    • non-capture group: (?:pattern)
  • search options: i, m, s, u, x
  • look ahead and look behind assertions
    • positive look ahead: (?=pattern)
    • negative look ahead: (?!pattern)
    • positive look behind: (?<=pattern)
    • negative look behind: (?<!pattern)
  • self-referencing
    • \num, (?P=name)
    • \g<num>, \g<name>

There is usually more than one regular expression that solves the same problem, and they differ in length, readability and efficiency. It takes a lot of practice to write beautiful, efficient and extensible regular expressions. After walking through this post, you should try to figure out how regular expressions are compiled and executed, and the best way to do that is to implement a regular expression engine yourself.

You can download the jupyter notebook version from here.