This post is a quick guide to common uses of Python’s re module. The module mainly consists of some handy module-level functions (search, match, and so on), two useful classes (RegexObject for matching and MatchObject for the result), and several constant flags. This post won’t detail the full syntax of regular expressions; instead, it focuses on how to use the functions and classes re offers to get things done.

The first big section introduces the module functions, starting with search. Within it, MatchObject is introduced to show how to get at the result. After the search-related functions come the functions that perform some modification (they generate a new string or list and do not modify the original): sub and subn for substitution, and split for splitting a string by a pattern.

The RegexObject has methods with capabilities similar to the module functions. I’ll discuss them with examples after the module functions.

Module Function Usage

Module functions fall into two groups: one for searching, the other for modifying.

search, match, findall and finditer are the four search-related functions, and they all accept three arguments: pattern, string and flags.

  • pattern: the regular expression, usually written as a raw string, e.g., r'\d+' for one or more digits.
  • string: the string to be searched.
  • flags (optional): flags that control matching behavior. Multiple flags can be combined with the | operator, e.g., re.I|re.M ignores case and lets ^ and $ match at the beginning and end of each line. For the full list of flags, please refer to the Python documentation.
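
As a small illustration of combining flags, the sketch below runs the same search with and without re.I|re.M. The sample text and variable names are made up for this example, and print() is called as a function so that the snippet also runs on Python 3.

```python
import re

text = 'first line\nSecond Line'  # made-up sample text

# Without flags: ^ only matches at the very start of the string and
# matching is case-sensitive, so nothing is found.
no_flags = re.search(r'^second', text)

# With re.I | re.M: case is ignored and ^ also matches after each newline.
with_flags = re.search(r'^second', text, re.I | re.M)

print(no_flags)            # None
print(with_flags.group())  # Second
```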

Both search and match return a MatchObject if a match is found, and None otherwise. However, match only matches at the beginning of the string, while search scans through the string for the first position where the pattern matches:

import re
# example from python doc
print re.match("c", "abcdef")   # No match
print re.search("c", "abcdef")  # Find a match

None
<_sre.SRE_Match object at 0x1119be920>

You can restrict search to matching only at the beginning of the string with ^:

print re.search("^c", "abcdef")  # No match

None

However, even in MULTILINE mode, re.match() still only matches at the beginning of the whole string:

print re.search("^L", "Sean\nLan", re.M)  # Find a match
print re.match("^L", "Sean\nLan", re.M)   # No match

<_sre.SRE_Match object at 0x1119bea58>
None

Class MatchObject

A MatchObject always has the boolean value True, since None is returned when no match is found. It’s recommended to check whether a match was found before doing anything with the result:

s = 'Love Vanilla.'
m = re.search(r'\w+', s)
if m:
    print 'Find a match:', s[m.start(): m.end()]

Find a match: Love

A MatchObject instance has the following methods:

  • expand(template) performs backslash substitution on template. Numeric backreferences (e.g., \1, \2) and named backreferences (e.g., \g<name>) are replaced by the contents of the corresponding group.
  • group([group1, ...]) returns one or more groups of the match. With no arguments, the whole match is returned. With a single argument, the result is a single string; with multiple arguments, the corresponding groups are returned as a tuple. group(0) is the whole match, and group(1), group(2), etc. are the subgroups.
  • groups() returns a tuple of all the subgroups of the match.
  • groupdict() returns a dict of all the named subgroups, keyed by group name.
  • start([group]) and end([group]) return the indices of the start and end of the substring matched by group. group defaults to 0, which means the whole match.
  • span([group]) returns the tuple (m.start(group), m.end(group)) for a MatchObject m.

The following example exercises all of the above methods:

sample_str = 'Hello, Sean.'
sample_pattern = r'(\w+),\s+(?P<name>\w+).'
match = re.search(sample_pattern, sample_str)
if match:
    print 'expand():\t', match.expand(r'\2 is the name. Hi, \g<name>!')
    print 'group():\t', match.group()
    print 'group(0):\t', match.group(0)
    print 'group(1):\t', match.group(1)
    print 'group(0, 1):\t', match.group(0, 1)
    print 'groups():\t', match.groups()
    print 'groupdict():\t', match.groupdict()
    print 'whole match: \t', match.string[match.start():match.end()]
    print 'group(2): \t', match.string[match.start(2):match.end(2)]
    print 'span(2):\t', match.span(2)

expand(): Sean is the name. Hi, Sean!
group(): Hello, Sean.
group(0): Hello, Sean.
group(1): Hello
group(0, 1): ('Hello, Sean.', 'Hello')
groups(): ('Hello', 'Sean')
groupdict(): {'name': 'Sean'}
whole match: Hello, Sean.
group(2): Sean
span(2): (7, 11)

finditer returns an iterator yielding MatchObject instances over all non-overlapping matches, while findall returns a list of all the matching substrings in string. If the pattern contains a single group, findall returns a list of that group’s matches; if it contains more than one group, it returns a list of tuples, one tuple per match.

sample_str = 'Stay hungry. Stay foolish.'
sample_pattern = r'(\w+)\s+(\w+).'
print 're.findall:'
print re.findall(sample_pattern, sample_str)
print
print 're.finditer:'
for m in re.finditer(sample_pattern, sample_str):
    print m.groups()

re.findall:
[('Stay', 'hungry'), ('Stay', 'foolish')]

re.finditer:
('Stay', 'hungry')
('Stay', 'foolish')
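
The shape of the list returned by findall depends on how many groups the pattern has. The sketch below shows the three cases on the same sentence; the variable names are only for illustration, and print() is called as a function so that the snippet also runs on Python 3.

```python
import re

s = 'Stay hungry. Stay foolish.'

no_group = re.findall(r'\w+\s+\w+', s)        # no groups: whole matches
one_group = re.findall(r'\w+\s+(\w+)', s)     # one group: that group's text
two_groups = re.findall(r'(\w+)\s+(\w+)', s)  # several groups: tuples

print(no_group)    # ['Stay hungry', 'Stay foolish']
print(one_group)   # ['hungry', 'foolish']
print(two_groups)  # [('Stay', 'hungry'), ('Stay', 'foolish')]
```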

Replace and Split

Both sub and subn take five parameters: pattern, repl, string, count and flags, of which pattern, string and flags have the same meaning as in search. repl can be a string or a function. If it is a string, any backslash escapes in it are processed, e.g., \n is converted to a single newline character, and backreferences are replaced with the text matched by the corresponding group:

sample_str = 'Hello, Sean.'
sample_pattern = r'(\w+),\s+(?P<name>\w+).'
sample_repl = r'\g<name>, \1!'
print re.sub(sample_pattern, sample_repl, sample_str)

Sean, Hello!

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

# example from python doc
def dashrepl(matchobj):
    if matchobj.group(0) == '-': return ' '
    else: return '-'
print re.sub(r'-{1,2}', dashrepl, 'pro--gram-files')

pro-gram files

The count argument sets the maximum number of substitutions, and defaults to 0 (meaning all matches are replaced):

re.sub(r'\w+', 'Bar', 'Foo Foo, Foo Foo Again!', count=3)

'Bar Bar, Bar Foo Again!'

subn performs the same operation as sub, but returns a tuple (new_string, number_of_subs_made) instead:

re.subn(r'\w+', 'Bar', 'Foo Foo, Foo Foo Again!', count=3)

('Bar Bar, Bar Foo Again!', 3)
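
Since subn returns a tuple, it is often convenient to unpack it right away. A minimal sketch, with a made-up input string and variable names chosen only for this example (print() is called as a function so that it also runs on Python 3):

```python
import re

messy = 'too    many   spaces'  # made-up sample text

# Unpack the (new_string, number_of_subs_made) tuple directly.
cleaned, n_subs = re.subn(r'\s+', ' ', messy)

print(cleaned)  # too many spaces
print(n_subs)   # 2
```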

split splits string by the occurrences of pattern. It has four parameters: pattern, string, maxsplit and flags. All of them are the same as in search except maxsplit, which sets the maximum number of splits. Once it is reached, the remainder of the string is returned as the last element of the list. maxsplit defaults to 0, which means there is no limit on the number of splits.

print re.split(r'\s+', 'Long long ago, go go go.')
print re.split(r'\s+', 'Long long ago, go go go.', maxsplit=2)

['Long', 'long', 'ago,', 'go', 'go', 'go.']
['Long', 'long', 'ago, go go go.']

Compiled RegexObject Usage

re.compile accepts two arguments, pattern and flags, and returns a RegexObject. The object can be reused, which is more efficient when the same pattern is used many times.

RegexObject offers the methods search, match, findall, finditer, sub, subn and split, which mirror the module functions. However, since re.compile has already fixed the pattern and flags, those two are not among the method parameters; you can retrieve them from the object’s pattern and flags attributes.
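
Here is a minimal sketch of compiling once and reusing the object, then reading back its pattern and flags. The sample lines are made up. On Python 3 the flags value may include extra bits beyond the ones you passed (such as re.UNICODE for str patterns), so the sketch checks a single flag with a bitwise AND instead of comparing the whole number.

```python
import re

# Compile once, reuse many times; the sample lines are made up.
regex = re.compile(r'\d+', re.I)

lines = ['order 12', 'no digits here', 'rooms 7 and 8']
found = [regex.findall(line) for line in lines]

print(found)                     # [['12'], [], ['7', '8']]
print(regex.pattern)             # \d+
print(bool(regex.flags & re.I))  # True
```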

The parameters for search, match, findall and finditer are string, pos (optional) and endpos (optional), for example search(string[, pos[, endpos]]). pos and endpos give the range of the string to be searched.

regex = re.compile(r'\d+')
regex.findall('0123456789', pos=3, endpos=5)

['34']

Note: This is not equivalent to first slicing the string with string[pos:endpos] and then searching. ^ still refers to the start of the original string.

regex = re.compile(r'^\d+')
s = '0123456789'
print regex.findall(s[3:])
print regex.findall(s, pos=3)

['3456789']
[]
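
The same distinction shows up with the match method of a compiled pattern: match anchors at pos, but ^ inside the pattern still refers to the real start of the string. A small sketch, with a made-up input string and hypothetical variable names (print() is called as a function so that it also runs on Python 3):

```python
import re

s = 'abc123'  # made-up sample text

# match() on a compiled pattern anchors at pos rather than at index 0.
plain = re.compile(r'\d+')
m = plain.match(s, 3)
print(m.group())  # 123

# ^ still means the real start of the string, so this finds nothing.
anchored = re.compile(r'^\d+')
result = anchored.match(s, 3)
print(result)     # None
```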

The sub and subn methods are similar to the module functions sub and subn, except that the pattern and flags parameters are removed:

regex = re.compile(r'\d+')
print regex.sub('num', '123, 456, 789')
print regex.sub('num', '123, 456, 789', count=2)
print regex.subn('num', '123, 456, 789', count=2)

num, num, num
num, num, 789
('num, num, 789', 2)

split only has string and maxsplit parameters now:

regex = re.compile(r'\s+')
print regex.split('I love python.')
print regex.split('I love python.', maxsplit=1)

['I', 'love', 'python.']
['I', 'love python.']

Conclusion

This post covers several common uses of Python’s re module, including the module functions and the compiled RegexObject. Regular expressions are very powerful and go far beyond what can be covered in a single post. Practice makes perfect, so the best way to learn is to use re in your own projects. The Jupyter notebook version is also available.