Fabian M. Suchanek
Regular Expressions
26
Def: Alphabet, Word
An
alphabet
is a set of symbols.
We will use as alphabet always implicitly the set of all unicode
characters.
2
A={a,b,c,d,e,f,0,1,2,3,4,5,6,7,8,9}
A={0,...9,a...,z,A...Z,!,?,...}
A
word
over an alphabet A is a sequence of symbols from A.
Since we use the alphabet of unicode characters, words are just strings.
hello!, 42, 3.141592, Douglas and Sally
also with spaces!
Def: Language
A
language
over an alphabet S is a set of words over S.
3
Set of strings
we want to describe
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = {ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
4
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = {ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
5
ca*
a*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
6
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
7
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
8
a+ | b+
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
9
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
(a | b)+
a+ | b+
Def: Language
A language over an alphabet S is a set of words over S.
10
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
(a | b)+
a+ | b+
Def: Language
A language over an alphabet S is a set of words over S.
11
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
(a | b)+
a+ | b+
Def: Language
A language over an alphabet S is a set of words over S.
12
[A-Z][a-z]+ [A-Z][a-z]+
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Regular expressions
A regular expression (regex) describes a language.
13
The language of a regular expression
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
[A-Z][a-z]+ [A-Z][a-z]+
L1 = {Arthur Dent, Ford Prefect, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
*
for symbols
Def: Regular expressions
A
regular expression
(regex) over an alphabet
is one of the following:
• the empty string
• an element of
• a string of the form XY
• a string of the form (X|Y)
• a string of the form (X)*
where X and Y are regexes
A regex is associated
to a language:
14
its language
the regular
expression
(0 is a special regex that cannot appear as part of other regexes. It represents the empty language and does not match any string.)
*
for symbols
->formal-grammars
Real-world example
15
Example from Wipolo via Dhouha Bouamor
Regular expressions cheat sheet
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
16
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Task: Regexes
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
17
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Define regexes for
• numbers
• phone numbers
Try it
• HTML tags
• Names of the form “Dr. Blah Blub”
Helpful Web page
18
https://regex101.com/
Named regexes
When using regular expressions in a
program, it is a good idea to name them:
String digits = "[0-9]+";
String separator = "( |-)";
String pattern = digits+separator+digits;
But: Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable for
their apparent disinclination to do so. (Douglas Adams)
19
Regex groups
A regex group is a sequence of the form (...) in a regex.
the answer to ([a-z]+), the (([a-z]+) and ([a-z]+))
20
1st group
Regex groups
21
4th group
2nd group
3rd group
A regex group is a sequence of the form (...) in a regex.
the answer to ([a-z]+), the (([a-z]+) and ([a-z]+))
1st group
Groups are numbered by the order of their opening paranthesis.
Regex groups
He found the answer to life, the universe, and everything
1st group: life
2nd group: universe and everything
3rd group: universe
4th group: everything
Try it out!
22
the answer to ([a-z]+), the (([a-z]+), and ([a-z]+))
Find it yourself
Regexes in programming
Regex
Simplified regex
Finite State Machine
Matcher
His favorite numbers
are 42, 4200, and 19.
your
programming
language
23
42(0)+
420(0)*
4
2
0
0
you
>java
Example: Regexes in Java
Pattern pattern = Pattern.compile("42(0)+");
Matcher matcher = pattern.matcher("His favorite numbers are...");
while(matcher.find())
System.out.println(matcher.group());
24
4200
His favorite numbers
are 42, 4200, and 19.
Example: Regexes in Python
import re
pattern = re.compile("42(0)+")
match = pattern.search(line)
if match!=None:
print(match.group())
25
4200
His favorite numbers
are 42, 4200, and 19.
Summary: Regular Expressions
A regular expression is a kind of “code” that matches a (possibly infinite)
set of strings.
Regexes are a standard facility in programming languages, and can be
used, e.g., for information extraction.
26
[A-Z][a-z]+ matches {Elvis, Macron, Qwerty, ...}
->Formal-grammars