String nightmares: A brief tour into the world of regular expressions

So, we have arrived at the low point of the Python sessions, namely string matching and regular expressions. This is dangerous and frustrating territory, since regular expressions are almost a language of their own. Therefore, I will not even try to be comprehensive here. Instead, we will play around with the vocabulary a little bit so that you can get a feel for when regular expressions are helpful and what you can do with them.

1 Principal functions for matching

We have already worked with some string methods, and we now turn to matching strings with regular expressions. A regular expression defines a string pattern that we would like to match in a given source string. Before we can start properly, we need a string to work with. This time we take something more famous than a lowly xkcd poem.

alice = '''If I had a world of my own, everything would be nonsense. 
        Nothing would be what it is, because everything would be what it isn't. 
        And contrary wise, what is, it wouldn't be. 
        And what it wouldn't be, it would. You see?'''

1.1 Match at the beginning of a string with match

The easiest method for string matching is the match function from the re module which we will import now.

import re

It checks whether a string starts with a specific pattern. In this case our pattern will just be If and the string alice will be our source.

match_result_1 = re.match('If', alice)

In this case we have passed the pattern If directly as an argument. If we work on more complex tasks, we can also first compile a pattern. The following code does the same thing as the one above.

my_pattern = re.compile('If') 
match_result_2 = my_pattern.match(alice)
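
Compiling pays off mostly when the same pattern is applied many times. A minimal sketch of that idea, assuming a few made-up sample lines (the names starts_with_if, lines and hits are only for illustration):

```python
import re

# Compile the pattern once, then reuse it on several strings.
# (The three sample lines are made up for this illustration.)
starts_with_if = re.compile('If')
lines = ['If I had a world of my own', 'Nothing would be what it is', 'If only']

# match returns None for lines that do not start with 'If'.
hits = [line for line in lines if starts_with_if.match(line)]
print(hits)   # ['If I had a world of my own', 'If only']
```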

So far, it might not be obvious what the difference between plain strings and regular expressions is. Bear with me for the moment; we will come to that in the next section. First, we will take a look at some other useful functions.
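
The detour that follows prints a search_result_1 that is not defined anywhere above. Here is a minimal sketch of how it was presumably created, assuming a call to re.search with the pattern 'world' (an assumption, but one that is consistent with the span (11, 16) shown in the detour). Unlike match, search scans the whole string and returns the first match wherever it occurs. To keep the sketch short, it uses only the first sentence of alice.

```python
import re

# Shortened to the first sentence of alice; the offsets of 'world' are unchanged.
alice_start = 'If I had a world of my own, everything would be nonsense.'

# Assumption: this is the kind of call behind search_result_1.
search_result_1 = re.search('world', alice_start)
print(search_result_1.span())   # (11, 16)

# match only looks at the beginning of the string, so the same pattern fails there.
match_at_start = re.match('world', alice_start)
print(match_at_start)   # None
```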

A brief detour: match objects

You might have noticed that we have not looked at the results returned by the match and search functions. This is because they return strange objects. Let’s take a look.

print(match_result_1)
<re.Match object; span=(0, 2), match='If'>
print(search_result_1)
<re.Match object; span=(11, 16), match='world'>

As you can see, the functions return match objects, which give you the offset range of the match found in the source string as well as the matched text itself. You can access them separately. Note that the end offset is exclusive, so the span can be used directly as a slice.

search_result_1.span()
(11, 16)
alice[11:16]
'world'
search_result_1.group()
'world'
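
If you need the two offsets individually, match objects also offer start() and end(). A small sketch on a shortened sample string (the variable names are only for illustration):

```python
import re

sentence = 'If I had a world of my own'   # shortened sample string
m = re.search('world', sentence)

# start() and end() are the two halves of span(); end() is exclusive,
# so slicing with them reproduces group() exactly.
start, end = m.start(), m.end()
print(start, end)            # 11 16
print(sentence[start:end])   # world
```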

Now, what is returned if no match is found? Let’s find out.

search_result_2 = re.search('supercalifragilistic', alice)
print(search_result_2)
None

The function returned None, which makes sense because there is no match. We run into a problem, though, when we call the group method on it.

search_result_2.group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2366/549355169.py in <module>
----> 1 search_result_2.group()

AttributeError: 'NoneType' object has no attribute 'group'

How can you prevent Python from throwing an exception every time no match is found and you call the group method? You can use the fact that None evaluates to False in a boolean context and wrap the call in a conditional.

if search_result_2: 
    print(search_result_2.group())
if search_result_1: 
    print(search_result_1.group())
world
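
Two variations on the same idea, as a sketch rather than a recommendation: from Python 3.8 on, the walrus operator lets you search and test in one step, and a small helper can return a default value instead of raising. The helper first_match is hypothetical and not part of the re module.

```python
import re

text = 'If I had a world of my own'   # shortened sample string

# Python 3.8+: bind the result and test it in a single expression.
if (m := re.search('world', text)):
    found = m.group()
else:
    found = None
print(found)   # world

# Hypothetical helper: return the matched text, or a default if nothing matches.
def first_match(pattern, source, default=''):
    result = re.search(pattern, source)
    return result.group() if result else default

missing = first_match('supercalifragilistic', text, default='no match')
print(missing)   # no match
```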

OK, now that we have settled this topic we go back to our matching functions.

1.3 List of matches with findall

The findall function returns a list of all non-overlapping matches.

findall_result_1 = re.findall('it', alice)
print(findall_result_1)
['it', 'it', 'it', 'it', 'it']

If there is no match, an empty list is returned.

findall_result_2 = re.findall('frägellägel', alice)
print(findall_result_2)
[]
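
findall only gives you the matched text. If you also need the positions, the finditer function from the same module yields one match object per match. A short sketch on a made-up string, which also shows what "non-overlapping" means in practice:

```python
import re

text = 'what it is, it is'   # made-up sample string

# finditer yields match objects, so span() is available for every match.
spans = [m.span() for m in re.finditer('it', text)]
print(spans)   # [(5, 7), (12, 14)]

# Non-overlapping: after matching the first 'aa', the search continues behind it,
# so only one 'aa' is found in 'aaa'.
overlap_demo = re.findall('aa', 'aaa')
print(overlap_demo)   # ['aa']
```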

1.4 Split at pattern with split

The split function allows you to split a source string at the matches and returns a list of the resulting pieces.

split_result_1 = re.split('it', alice)
split_result_1
['If I had a world of my own, everything would be nonsense. \n        Nothing would be what ',
 ' is, because everything would be what ',
 " isn't. \n        And contrary wise, what is, ",
 " wouldn't be. \n        And what ",
 " wouldn't be, ",
 ' would. You see?']

If no match is found, a list with one element, the original source string, will be returned.

split_result_2 = re.split('smoogle', alice)
split_result_2
["If I had a world of my own, everything would be nonsense. \n        Nothing would be what it is, because everything would be what it isn't. \n        And contrary wise, what is, it wouldn't be. \n        And what it wouldn't be, it would. You see?"]
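
If you only want to cut at the first few matches, split accepts an optional maxsplit argument. A minimal sketch on a short made-up string:

```python
import re

text = 'what it is, it is'   # made-up sample string

# Split at every match of the pattern.
all_pieces = re.split('it', text)
print(all_pieces)   # ['what ', ' is, ', ' is']

# Split at the first match only; the rest of the string stays in one piece.
two_pieces = re.split('it', text, maxsplit=1)
print(two_pieces)   # ['what ', ' is, it is']
```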

1.5 Replace matches in a string with sub

Sometimes, you might want to replace all substrings which match a certain pattern with another string. You can do this with the sub function. It returns a new string with the requested replacements.

re.sub('i', 'ü', alice)
"If I had a world of my own, everythüng would be nonsense. \n        Nothüng would be what üt üs, because everythüng would be what üt üsn't. \n        And contrary wüse, what üs, üt wouldn't be. \n        And what üt wouldn't be, üt would. You see?"

These are all already neat functions, but they become truly powerful when we combine them with regular expressions, which we will turn to next.
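
One aside on sub before we move on, sketched on a short made-up string: the optional count argument limits the number of replacements, and the related subn function also reports how many replacements were made.

```python
import re

text = 'what it is, it is'   # made-up sample string

# Replace only the first 'i'.
first_only = re.sub('i', 'ü', text, count=1)
print(first_only)   # what üt is, it is

# subn returns the new string together with the number of replacements.
new_text, n_replaced = re.subn('i', 'ü', text)
print(new_text, n_replaced)   # what üt üs, üt üs 4
```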

2 Creating patterns

2.1 The basics

So far, none of this seems too intimidating. But we are also just starting out. Note that we can pass not only plain strings but also more complex patterns to the functions above. Let’s say we want to find all substrings consisting of a w followed by any other character. We can do this by adding a '.'.

re.findall('w.', alice)
['wo', 'wn', 'wo', 'wo', 'wh', 'wo', 'wh', 'wi', 'wh', 'wo', 'wh', 'wo', 'wo']

Cool, right? We have a bunch of those basic operators:

  • .: any character except \n,

  • *: preceding character can appear any number of times (including zero times),

  • ?: preceding character is optional.

In the following, we do some examples.

# an arbitrary character + 'u'
source = "Humpty Dumpty"
re.findall('.u', source)
['Hu', 'Du']
# a 'u' optionally preceded by an 'H'
re.findall('H?u', source)
['Hu', 'u']
# sequences of one or more 'e'
source = 'Tweedle Dee and Tweedle Dum'
re.findall('ee*', source)
['ee', 'e', 'ee', 'ee', 'e']
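
Two details from the list above are easy to miss, so here is a small sketch with made-up strings: the '.' really does stop at a line break (unless you pass the re.DOTALL flag, which is not covered further here), and '*' is satisfied by zero repetitions.

```python
import re

# '.' stops at a line break unless the re.DOTALL flag is passed.
no_flag = re.findall('a.c', 'abc a\nc')
with_flag = re.findall('a.c', 'abc a\nc', re.DOTALL)
print(no_flag)     # ['abc']
print(with_flag)   # ['abc', 'a\nc']

# '*' also allows zero repetitions, so a bare 'a' matches 'ab*'.
zero_or_more = re.findall('ab*', 'a ab abb')
print(zero_or_more)   # ['a', 'ab', 'abb']
```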

You can already see how powerful and horribly ugly these things can become. Let’s take it up a notch.

2.2 Special characters

Apart from the usual characters, you can use a number of special characters:

  • \d: a single digit

  • \D: a single non-digit

  • \w: an alphanumeric character (digits, letters or underscore)

  • \W: a non-alphanumeric character

  • \s: a whitespace character

  • \S: a non-whitespace character

  • \b: a word boundary

  • \B: a non-word boundary

I know, whoever came up with this should burn in a special kind of hell. Still, let’s try to work with them. I am afraid we cannot use Alice here, since she’s not complicated enough. You might be happy though!

# split the address into its parts
# (the r prefix makes a raw string, so the backslashes reach the re module untouched)
address = 'Langstrasse 81, 8004'
# postal code
print(re.findall(r'\d\d\d\d', address))
# house number
print(re.findall(r'\d\d,', address))
# street
print(re.findall(r'\w\w*\s', address))
['8004']
['81,']
['Langstrasse ']
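
The word-boundary characters deserve a small sketch of their own, because they show why patterns with backslashes are best written as raw strings (with an r prefix): in a normal Python string, '\b' is a backspace character. The sample text is made up for illustration.

```python
import re

# In a normal string Python itself turns '\b' into a single backspace character,
# so the re module never sees the word-boundary escape.
print(len('\b'), len(r'\b'))   # 1 2

text = 'cat concat 42 cats'   # made-up sample string
at_word_start = re.findall(r'\bcat', text)   # 'cat' and the start of 'cats'
inside_word = re.findall(r'\Bcat', text)     # only the 'cat' inside 'concat'
digits = re.findall(r'\d', text)
words = re.split(r'\s', text)
print(at_word_start)   # ['cat', 'cat']
print(inside_word)     # ['cat']
print(digits)          # ['4', '2']
print(words)           # ['cat', 'concat', '42', 'cats']
```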

Sometimes we need a longer pattern to locate something but only want part of the match. For example, we might want my house number without the comma. Using parentheses, we can organize regular expressions into capturing groups.

my_match = re.search(r'(\d\d),', address)
print(my_match.group(0))
81,

Calling the group method with 0 gives you the whole match. Calling it with 1 gives you the content of the first capturing group, which is the part we are interested in.

print(my_match.group(1))
81
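
A pattern can contain several capturing groups, numbered from left to right. A minimal sketch that uses only the vocabulary introduced so far; the groups method returns all captured parts at once, and findall returns just the group content when the pattern contains exactly one group.

```python
import re

address = 'Langstrasse 81, 8004'

# Three groups: street, house number, postal code.
m = re.search(r'(\w\w*)\s(\d\d),\s(\d\d\d\d)', address)
parts = m.groups()
print(parts)        # ('Langstrasse', '81', '8004')
print(m.group(3))   # 8004

# With a single group, findall returns the group content, not the whole match.
house_numbers = re.findall(r'(\d\d),', address)
print(house_numbers)   # ['81']
```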

2.3 Pattern specifiers

Admittedly, these patterns are not super-elegant yet. We need more specifiers:

  • (expr): expr, as a capturing group

  • expr1|expr2: expr1 or expr2

  • ^: start of source string

  • $: end of source string

  • expr?: zero or one of expr

  • expr*: zero or more of expr, as many as possible

  • expr*?: zero or more of expr, as few as possible

  • expr+: one or more of expr, as many as possible

  • expr+?: one or more of expr, as few as possible

  • expr{m}: m consecutive expr

  • expr{m,n}: m to n consecutive expr, as many as possible

  • expr{m,n}?: m to n consecutive expr, as few as possible

  • [abc]: a, b, or c

  • [^abc]: any character except a, b, or c

  • expr(?=next): expr if followed by next

  • expr(?!next): expr if not followed by next

  • (?<=prev)expr: expr if preceded by prev

  • (?<!prev)expr: expr if not preceded by prev.

This is why I think this chapter of our course is aptly named. You will not learn this quickly. But let’s go through some examples.

# choice between two expressions
source = "Humpty Dumpty"
re.findall('Humpty|Dumpty', source)
['Humpty', 'Dumpty']
# the same choice written as a character class
re.findall('[HD]umpty', source)
['Humpty', 'Dumpty']
# look for 'Dumpty' at the beginning of the string
re.findall('^Dumpty', source)
[]
# look for 'Dumpty' at the end of the string
re.findall('Dumpty$', source)
['Dumpty']
source = 'Tweedle Dee'
# find sequences of one or more 'e' character, as many as possible
re.findall('e+', source)
['ee', 'e', 'ee']
# find sequences of one or more 'e' character, as few as possible
re.findall('e+?', source)
['e', 'e', 'e', 'e', 'e']
# find sequences of one or two 'e' characters 
re.findall('e{1,2}', source)
['ee', 'e', 'ee']
# find sequences of two 'e' characters
re.findall('e{2}', source)
['ee', 'ee']
# alternative with a group; findall then returns the content of the group
re.findall('(ee){1}', source)
['ee', 'ee']
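
Some specifiers from the list have not shown up in the examples yet. Here is a sketch with made-up strings covering greedy versus lazy matching, the lookbehinds, and a negated character class. It also shows one pitfall: whitespace inside a pattern is matched literally, so a quantifier written as {1, 2} with a space is not treated as a quantifier at all.

```python
import re

# Greedy '.*' runs to the last '</b>', lazy '.*?' stops at the first one.
html = '<b>Tweedle</b> and <b>Dee</b>'   # made-up sample string
greedy = re.findall(r'<b>.*</b>', html)
lazy = re.findall(r'<b>.*?</b>', html)
print(greedy)   # ['<b>Tweedle</b> and <b>Dee</b>']
print(lazy)     # ['<b>Tweedle</b>', '<b>Dee</b>']

# Lookbehind: digits preceded by '$', and whole numbers not preceded by '$'.
prices = 'cost: $40, weight: 12'         # made-up sample string
after_dollar = re.findall(r'(?<=\$)\d+', prices)
not_after_dollar = re.findall(r'(?<!\$)\b\d+', prices)
print(after_dollar)       # ['40']
print(not_after_dollar)   # ['12']

# Negated class: runs of characters that are neither vowels nor whitespace.
consonants = re.findall(r'[^aeiou\s]+', 'Tweedle Dee')
print(consonants)   # ['Tw', 'dl', 'D']

# A space inside the braces turns the quantifier into literal text.
with_space = re.findall('e{1, 2}', 'Tweedle Dee')
print(with_space)   # []
```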

Let’s finally do the address thing again:

# split the address into its parts
address = 'Langstrasse 81, 8004'
# postal code
print(re.findall(r'\d{4}$', address))
# house number
print(re.findall(r'\d{2}(?=,)', address))
# street
print(re.findall(r'^\w+(?=\s)', address))
['8004']
['81']
['Langstrasse']
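
Instead of three separate calls, the whole address can also be taken apart with one pattern. A minimal sketch, assuming the address always has the shape "street number, postal code"; it uses Python's named groups, (?P<name>expr), which are not in the list above, and the group names are made up for this example.

```python
import re

address = 'Langstrasse 81, 8004'

# One pattern for the whole address; each part gets its own named group.
address_pattern = re.compile(
    r'^(?P<street>\w+)\s(?P<number>\d+),\s(?P<postal_code>\d{4})$'
)

m = address_pattern.match(address)
# Guard against None in case the address does not have the assumed shape.
parts = m.groupdict() if m else {}
print(parts)   # {'street': 'Langstrasse', 'number': '81', 'postal_code': '8004'}
```

Named groups cost a few extra characters in the pattern, but the result reads much better than group(1), group(2) and group(3).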