CC-BY
Fabian M. Suchanek
Regular Expressions
26
Motivation
Assume that we want to find person names in a text:
2
What would be a simple first solution?
Elvis Presley
loved records by other country singers such as
Roy Acuff
.
He married
Priscilla Beaulieu
and had a child,
Lisa Marie Presley
.
Regular Expressions
Regular expressions (regexes) can detect strings of specific forms in text:
3
Regular expressions have applications in the detection of named entities in text, syntax checking,
and search in text.
They are a standard tool that is implemented in all reasonable text editors and operating systems.
Elvis Presley
loved records by other country singers such as
Roy Acuff
.
He married
Priscilla Beaulieu
and had a child,
Lisa Marie Presley
.
Regex:
[A-Z][a-z]+ [A-Z][a-z]+
Result:
Elvis Presley, Roy Acuff,
Priscilla Beaulieu, Lisa Marie, Marie Presley
capital letter followed by lowercase letters,
space, and again capital and lowercase
Def: Alphabet, Word
An
alphabet
is a set of symbols.
We will use as alphabet always implicitly the set of all unicode
characters.
4
A={a,b,c,d,e,f,0,1,2,3,4,5,6,7,8,9}
A={0,...9,a...,z,A...Z,!,?,...}
A
word
over an alphabet A is a sequence of symbols from A.
Since we use the alphabet of unicode characters, words are just strings.
hello!, 42, 3.141592, Elvis Presley
also with spaces!
Def: Language
A
language
over an alphabet S is a set of words over S.
5
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
6
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
7
ca*
a*
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
8
ca*
a*
(ab)*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
9
aa* = a+
ca*
a*
(ab)*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
“a+” is defined as a shorthand notation for “aa*”
Def: Language
A language over an alphabet S is a set of words over S.
10
a+ | b+
aa* = a+
ca*
a*
(ab)*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
“a+” is defined as a shorthand notation for “aa*”
Def: Language
A language over an alphabet S is a set of words over S.
11
aa* = a+
ca*
a*
(ab)*
(a | b)+
a+ | b+
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
“a+” is defined as a shorthand notation for “aa*”
Def: Language
A language over an alphabet S is a set of words over S.
12
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
(ab)*
(a | b)+
a+ | b+
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
“a+” is defined as a shorthand notation for “aa*”
Def: Language
A language over an alphabet S is a set of words over S.
13
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
(ab)*
(a | b)+
a+ | b+
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
“a+” is defined as a shorthand notation for “aa*”
Def: Language
A language over an alphabet S is a set of words over S.
14
[A-Z][a-z]+ [A-Z][a-z]+
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
(ab)*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
“a+” is defined as a shorthand notation for “aa*”
Regular expressions
A regular expression (regex) describes a language.
15
The language of a regular expression
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
[A-Z][a-z]+ [A-Z][a-z]+
L(E |F) = L(E) ∪ L(F)
L(x) = {x} for symbols x
L1 = {Elvis Presley, Priscilla Beaulieu...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, b, ba, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Regular expressions
A
regular expression
(regex) over an alphabet S is one of the following:
• the empty string
• an element of S
• a string of the form XY
• a string of the form (X|Y)
• a string of the form (X)*
where X and Y are regexes
A regex is associated
to a language:
16
its language
the regular
expression
(0 is a special regex that cannot appear as part of other regexes. It represents the empty language and does not match any string.)
->formal-grammars
L(E |F) = L(E) ∪ L(F)
L(x) = {x} for symbols x
Real-world example
17
Example from Wipolo via Dhouha Bouamor
Regular expressions cheat sheet
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
18
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Task: Regexes
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
19
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Define regexes for
• numbers
• phone numbers
Try it out!
• HTML tags
• Names of the form “Dr. Blah Blub”
Helpful Web page
20
https://regex101.com/
-> Python
Named regexes
When using regular expressions in a program, it is a good idea to name them:
day =
"[1-9][0-9]?"
month =
"(Jan|Feb|...)"
year =
"[12][0-9]{3}"
separator =
"(/|-)"
date = day+separator+month+separator+year
pattern = date+" to "+date
But: Human beings, who are almost unique in having the ability to learn from the experience
of others, are also remarkable for their apparent disinclination to do so. (Douglas Adams)
21
>groups
-> Python
Regex groups
22
A regex group is a sequence of the form (...) in a regex.
1st group
Groups are numbered by the order of their opening paranthesis.
on (([A-Z][a-z]+) ([0-9]+), ([0-9]+))
Regex groups
23
A regex group is a sequence of the form (...) in a regex.
Groups are numbered by the order of their opening paranthesis.
on (([A-Z][a-z]+) ([0-9]+), ([0-9]+))
1st group
2nd group
3rd group
4th group
Regex groups
Elvis Presley was born on January 8, 1935.
1st group: January 8th, 1935
2nd group: January
3rd group: 8
4th group: 1935
Try it out!
24
on (([A-Z][a-z]+) ([0-9]+), ([0-9]+))
Regexes in programming
Regex
Simplified regex
Finite State Machine
Matcher
Elvis recorded 120 songs.
your programming
language
25
[0-2]+
(0|1|2)(0|1|2)*
you
0
1
2
ε
Example: Regexes in Python
import re
pattern = re.compile("[0-2]+")
match = pattern.search(line)
if match!=None:
print(match.group())
26
DNA
DNAAA
Summary: Regular Expressions
A regular expression is a kind of “code” that matches a (possibly infinite)
set of strings.
Regexes are a standard facility in programming languages, and can be used, e.g.,
for information extraction.
27
[A-Z][a-z]+ matches {Elvis, Presley, Priscilla, ...}
->Formal-grammars
->Fact-extraction
->Nerc