CC-BY
Fabian M. Suchanek
Regular Expressions
26
Regular Expressions
Regular expressions can detect strings of specific forms in text:
2
Regular expression: L[a-z]+
Text: The purpose of life is to Live, to Love, to Learn.
Finds: “Live”, “Love”, “Learn”
Regular expressions have applications in the detection of named entities in text, syntax checking,
and search in text.
They are a standard tool that is implemented in all reasonable text editors and operating systems.
Def: Alphabet, Word
An
alphabet
is a set of symbols.
We will use as alphabet always implicitly the set of all unicode
characters.
3
A={a,b,c,d,e,f,0,1,2,3,4,5,6,7,8,9}
A={0,...9,a...,z,A...Z,!,?,...}
A
word
over an alphabet A is a sequence of symbols from A.
Since we use the alphabet of unicode characters, words are just strings.
hello!, 42, 3.141592, Andrei Sakharov
also with spaces!
Def: Language
A
language
over an alphabet S is a set of words over S.
4
Set of strings
we want to describe
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = {ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
5
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = {ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
6
ca*
a*
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
7
ca*
a*
(ab)*
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
8
aa* = a+
ca*
a*
(ab)*
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
9
a+ | b+
aa* = a+
ca*
a*
(ab)*
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
10
aa* = a+
ca*
a*
(ab)*
(a | b)+
a+ | b+
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
11
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
(ab)*
(a | b)+
a+ | b+
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
12
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
(ab)*
(a | b)+
a+ | b+
L1 = {Andrei Sakharov, Shirin Ebadi, Liu Xiaobo, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
13
[A-Z][a-z]+ [A-Z][a-z]+
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
(ab)*
L1 = {Andrei Sakharov, Shirin Ebadi, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Regular expressions
A regular expression (regex) describes a language.
14
The language of a regular expression
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
[A-Z][a-z]+ [A-Z][a-z]+
L(E |F) = L(E) ∪ L(F)
L(x) = {x} for symbols x
L1 = {Andrei Sakharov, Shirin Ebadi, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Regular expressions
A
regular expression
(regex) over an alphabet S is one of the following:
• the empty string
• an element of S
• a string of the form XY
• a string of the form (X|Y)
• a string of the form (X)*
where X and Y are regexes
A regex is associated
to a language:
15
its language
the regular
expression
(0 is a special regex that cannot appear as part of other regexes. It represents the empty language and does not match any string.)
->formal-grammars
L(E |F) = L(E) ∪ L(F)
L(x) = {x} for symbols x
Real-world example
16
Example from Wipolo via Dhouha Bouamor
Regular expressions cheat sheet
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
17
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Task: Regexes
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
18
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Define regexes for
• numbers
• phone numbers
Try it out!
• HTML tags
• Names of the form “Dr. Blah Blub”
Helpful Web page
19
https://regex101.com/
-> Python
Named regexes
When using regular expressions in a program, it is a good idea to name them:
day =
"[123]?[0-9]"
month =
"(Jan|Feb|...)"
year =
"[12][0-9]{3}"
separator =
"(/|-)"
pattern = day+separator+month+separator+year
But: Human beings, who are almost unique in having the ability to learn from the experience
of others, are also remarkable for their apparent disinclination to do so. (Douglas Adams)
20
>groups
-> Python
Regex groups
21
A regex group is a sequence of the form (...) in a regex.
1st group
Groups are numbered by the order of their opening paranthesis.
no (([a-z]+) is below ([a-z]+))
Regex groups
22
A regex group is a sequence of the form (...) in a regex.
1st group
Groups are numbered by the order of their opening paranthesis.
no (([a-z]+) is below ([a-z]+))
2nd group
3rd group
Regex groups
No idea is above scrutiny, and no person is below dignity.
1st group: person is below dignity
2nd group: person
3rd group: dignity
Try it out!
23
no (([a-z]+) is below ([a-z]+))
Maajid Nawaz
Regexes in programming
Regex
Simplified regex
Finite State Machine
Matcher
It’s in my DNA, DNAAA !
your programming
language
24
DN(A)+
DNA(A)*
D
N
A
A
you
Example: Regexes in Python
import re
pattern = re.compile("DN(A)+")
match = pattern.search(line)
if match!=None:
print(match.group())
25
DNA
DNAAA
Summary: Regular Expressions
A regular expression is a kind of “code” that matches a (possibly infinite)
set of strings.
Regexes are a standard facility in programming languages, and can be used, e.g.,
for information extraction.
26
[A-Z][a-z]+ matches {Sakharov, Ebadi, Xiaobo, ...}
->Formal-grammars
->Fact-extraction
->Nerc