Fabian M. Suchanek
Character Encodings
Semantic IE
You are here
Source Selection and Preparation
Entity Recognition
Entity Disambiguation
singer
Fact Extraction
Reasoning
Instance Extraction
singer Elvis
Scripts
Google Translate. No warranty.
Thanks for all the fish
(“Simplified” Chinese)
(Arabic)
(Hebrew)
(Thai)
(Latin)
(Korean)
How to map characters to bytes?
Characters: 100,000 different characters from 90 scripts (y, Д, a, A, ß, €, é, ...)
Bytes: 1 byte = 8 bits = numbers 0..255
Mapping: ?
Def: Character encoding
A character encoding (also: char encoding) is an injective mapping from characters to (sequences of) bytes.
characters -> bytes
A -> 65
B -> 66
Д -> 99, 99, 2
ü -> 99, 42
...
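The definition can be sketched with a toy encoding. The table below reuses the illustrative values from this slide (they are not the mappings of any real encoding), and the helper names `is_injective` and `encode` are assumptions made for this example.

```python
# Toy character encoding with the illustrative values from the slide.
# These are NOT the mappings of a real encoding.
TOY_ENCODING = {
    "A": bytes([65]),
    "B": bytes([66]),
    "Д": bytes([99, 99, 2]),
    "ü": bytes([99, 42]),
}


def is_injective(mapping):
    """Return True if no two characters share the same byte sequence."""
    return len(set(mapping.values())) == len(mapping)


def encode(text, mapping=TOY_ENCODING):
    """Concatenate the byte sequences of the individual characters."""
    return b"".join(mapping[ch] for ch in text)


print(is_injective(TOY_ENCODING))  # True
print(list(encode("AüB")))         # [65, 99, 42, 66]
```

A mapping that sent both A and B to 65 would not be injective, and the original text could no longer be recovered from the bytes.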
Def: ASCII encoding
26 uppercase letters + 26 lowercase letters + punctuation ≈ 100 characters
The ASCII encoding is a particular character encoding that maps certain chars to single bytes, and ignores the others.
A -> 65
B -> 66
C -> 67
...
ü -> (not mapped)
Д -> (not mapped)
Disadvantage: works only for English
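A small sketch of this behaviour, using Python's built-in "ascii" codec. The helper name `try_ascii` is an assumption made for this example.

```python
def try_ascii(ch):
    """Return the ASCII byte value of ch, or None if ASCII does not map it."""
    try:
        return ch.encode("ascii")[0]
    except UnicodeEncodeError:
        return None


for ch in "ABCüД":
    print(ch, "->", try_ascii(ch))
# A -> 65, B -> 66, C -> 67, ü -> None, Д -> None
```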
Def: Code pages
A code page is a character encoding that maps script-specific characters to single bytes.
(0-127 are usually mapped as in ASCII)
Example (View -> Encoding)
Greek code page:
A -> 65
B -> 66
...
(Greek letter) -> 224
Western code page:
A -> 65
B -> 66
...
à -> 224
Disadvantages:
• We have to know the code page
• We cannot mix scripts
• We cannot represent more than 256 characters
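The first disadvantage can be made concrete: the same bytes decode to different text depending on the code page. This sketch assumes Python's "cp1252" codec as the Western code page and "iso8859_7" as the Greek one; the slide does not say which concrete code pages it has in mind.

```python
raw = bytes([65, 66, 224])

western = raw.decode("cp1252")     # Windows Western European code page
greek = raw.decode("iso8859_7")    # ISO Greek code page

print(western)  # ABà
print(greek)    # AB followed by a Greek letter

# Bytes 0-127 agree in both code pages, byte 224 does not.
print(western[:2] == greek[:2], western[2] == greek[2])  # True False
```

Without knowing the code page, a reader of these three bytes cannot tell which of the two texts was meant.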
Def: HTML entities
HTML entities are a particular character encoding where particular strings (as defined by W3C) represent characters.
Example
à -> &agrave;
ü -> &uuml;
ß -> &szlig;
...
(These are sequences of bytes if encoded in ASCII)
List
Advantage: Works in all browsers
Disadvantage: Very clumsy
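A minimal sketch of the idea with Python's standard `html` module. The helper name `encode_entities` is an assumption made for this example, and falling back to a numeric reference such as `&#1044;` for characters without a named entity is a choice of this sketch.

```python
import html
from html.entities import codepoint2name


def encode_entities(text):
    """Replace every non-ASCII character by an HTML entity."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(ch)                      # plain ASCII stays as it is
        elif ord(ch) in codepoint2name:
            parts.append(f"&{codepoint2name[ord(ch)]};")
        else:
            parts.append(f"&#{ord(ch)};")         # numeric fallback
    return "".join(parts)


encoded = encode_entities("àüß")
print(encoded)                 # &agrave;&uuml;&szlig;
print(encoded.encode("ascii")) # the result is pure ASCII
print(html.unescape(encoded))  # àüß
```

The three characters grow to 21 ASCII bytes, which illustrates why the slide calls this encoding clumsy.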
Def: Unicode
Unicode is a character encoding that maps each character to a number (its code point), i.e., to 4 bytes.
Advantage: Maps all known characters
Disadvantage: Takes much space
A -> 65 -> 0, 0, 0, 65
B -> 66 -> 0, 0, 0, 66
(Characters 0-127 are as in ASCII)
(other character) -> 1001 -> 0, 0, 3, 233
(other character) -> 2001 -> 0, 0, 7, 209
(not the real mappings)
...
Example1 Example2
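The real mappings can be inspected in Python: `ord` returns the number assigned to a character, and writing that number as 4 big-endian bytes gives the same result as Python's "utf-32-be" codec. The helper name `four_bytes` is an assumption made for this example.

```python
def four_bytes(ch):
    """Return the number of ch written as 4 bytes (most significant first)."""
    return list(ord(ch).to_bytes(4, "big"))


for ch in "ABД":
    print(ch, "->", ord(ch), "->", four_bytes(ch))
# A -> 65 -> [0, 0, 0, 65]
# B -> 66 -> [0, 0, 0, 66]
# Д -> 1044 -> [0, 0, 4, 20]
```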
Def: UTF-8
UTF-8 is a particular character encoding that maps Unicode characters to sequences of bytes of different lengths.
A -> 65
B -> 66
(other character) -> 128, 42
(other character) -> 128, 128, 32
(not the real mappings)
UTF-8: Chars 0-0x7F
Unicode chars 0-0x7F are mapped like in ASCII (i.e., to a single byte).
A -> 65
B -> 66
...
a -> 97
b -> 98
...
$ -> 36
! -> 33
...
Advantages:
• Compatibility with ASCII and code pages
• Space efficiency for English docs
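The compatibility claim can be checked directly: text that uses only chars 0-0x7F produces the same bytes in ASCII and in UTF-8. The helper name `same_as_ascii` is an assumption made for this example.

```python
def same_as_ascii(text):
    """True if text encodes to identical bytes in ASCII and in UTF-8."""
    return text.encode("utf-8") == text.encode("ascii")


for ch in "ABab$!":
    print(ch, "->", list(ch.encode("utf-8")))
# A -> [65], B -> [66], a -> [97], b -> [98], $ -> [36], ! -> [33]

print(same_as_ascii("Thanks for all the fish"))  # True
```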
UTF-8: Chars 0x80-0x7FF
Unicode chars 0x80-0x7FF (11 bits) are mapped to two bytes as follows:
xxxxxxxxxxx (11 bits)
-> 2 bytes: 110xxxxx 10xxxxxx
Unicode 0x80-0x7FF are Greek, Arabic, Hebrew etc.
UTF-8: Chars 0x80-0x7FF Example
ç = 0xE7 (Unicode) = 00011100111 (11 bit representation)
-> 2 bytes: 110 00011   10 100111
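The two-byte pattern can be written out with bit operations. This is a sketch for illustration (in practice one would simply call `str.encode("utf-8")`); the function name `encode_two_bytes` is an assumption made for this example.

```python
def encode_two_bytes(code_point):
    """Encode a code point in 0x80-0x7FF as the two bytes 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point outside 0x80-0x7FF")
    first = 0b11000000 | (code_point >> 6)          # 110 + upper 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # 10 + lower 6 bits
    return bytes([first, second])


encoded = encode_two_bytes(ord("ç"))
print([format(b, "08b") for b in encoded])  # ['11000011', '10100111']
print(encoded == "ç".encode("utf-8"))       # True
```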
UTF-8: Chars 0x80-0x7FF Example
Example: Encoding “façade”
f -> 0x66 -> 0 1100110
a -> 0x61 -> 0 1100001
ç -> 0xE7 -> 110 00011   10 100111
a -> 0x61 -> 0 1100001
...
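The same table can be reproduced with Python's built-in UTF-8 codec. The helper name `utf8_bits` is an assumption made for this example.

```python
def utf8_bits(text):
    """Return (character, list of its UTF-8 bytes as bit strings) pairs."""
    return [(ch, [format(b, "08b") for b in ch.encode("utf-8")]) for ch in text]


for ch, bits in utf8_bits("façade"):
    print(ch, "->", hex(ord(ch)), "->", " ".join(bits))
# f -> 0x66 -> 01100110
# a -> 0x61 -> 01100001
# ç -> 0xe7 -> 11000011 10100111
# ...
```

The six characters of the word take seven bytes, because only ç needs two.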
UTF-8: Chars 0x800-0xFFFF
Unicode chars 0x800-0xFFFF (16 bits) are mapped to three bytes as follows:
xxxxxxxxxxxxxxxx (16 bits)
-> 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
This character range concerns mainly Chinese.
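The three-byte pattern can be sketched in the same way as the two-byte one. The function name `encode_three_bytes` is an assumption made for this example, and the sketch does not reject the reserved surrogate range 0xD800-0xDFFF, which a real encoder would.

```python
def encode_three_bytes(code_point):
    """Encode a 16-bit code point as 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("code point outside 0x800-0xFFFF")
    first = 0b11100000 | (code_point >> 12)                # 1110 + upper 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b111111)   # 10 + middle 6 bits
    third = 0b10000000 | (code_point & 0b111111)           # 10 + lower 6 bits
    return bytes([first, second, third])


encoded = encode_three_bytes(ord("中"))       # 中 is U+4E2D
print([format(b, "08b") for b in encoded])    # ['11100100', '10111000', '10101101']
print(encoded == "中".encode("utf-8"))        # True
```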
Decoding UTF-8
Bytes: 0 1100110 | 0 1100001 | 110 00011 | 10 100111 | 0 1100001 ...
Chars: f | a | ç | a ...
• if the byte starts with 0 (0xxxxxxx) => it’s a “normal” ASCII character
• if the byte starts with 110 (110xxxxx) => it’s an “extended” char, 1 byte follows
• if the byte starts with 1110 (1110xxxx) => it’s a “Chinese” char, 2 bytes follow
• if the byte starts with 10 (10xxxxxx) => it’s a follower byte, you messed it up!
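The four rules translate into a small decoder. This is a sketch under simplifying assumptions: it handles only the three cases listed on the slide (so no 4-byte sequences), it does not reject overlong encodings, and the function name `decode_utf8` is made up for this example. In practice one would call `bytes.decode("utf-8")`.

```python
def decode_utf8(data):
    """Decode 1-, 2- and 3-byte UTF-8 sequences by looking at the marker bits."""
    chars = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte >> 7 == 0b0:            # 0xxxxxxx: "normal" ASCII character
            code_point, followers = byte, 0
        elif byte >> 5 == 0b110:        # 110xxxxx: 1 byte follows
            code_point, followers = byte & 0b11111, 1
        elif byte >> 4 == 0b1110:       # 1110xxxx: 2 bytes follow
            code_point, followers = byte & 0b1111, 2
        else:                           # follower byte (or unsupported marker)
            raise ValueError(f"unexpected byte {byte:#04x} at position {i}")
        for offset in range(1, followers + 1):
            if i + offset >= len(data) or data[i + offset] >> 6 != 0b10:
                raise ValueError(f"missing follower byte after position {i}")
            # Append the 6 payload bits of the follower byte.
            code_point = (code_point << 6) | (data[i + offset] & 0b111111)
        chars.append(chr(code_point))
        i += followers + 1
    return "".join(chars)


print(decode_utf8(bytes([0x66, 0x61, 0xC3, 0xA7, 0x61])))  # faça
```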
Summary: UTF-8
UTF-8 maps Unicode chars to 1-4 bytes (chars 0-65535 to 1-3 bytes).
Advantages:
• common Western chars are only 1 byte
• backwards compatibility with ASCII
• stream readability (follower bytes cannot be confused with marker bytes)
• sorting compliance
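Sorting compliance means that comparing the encoded bytes gives the same order as comparing the Unicode numbers of the characters. A quick check, with an arbitrary word list chosen for this example (Python compares strings by code point):

```python
words = ["zebra", "façade", "Äpfel", "中文", "apple", "€uro"]

by_code_points = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_points == by_utf8_bytes)  # True
```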
UTF-16
UTF-8 is inefficient if the text contains many characters between 0x0800 and 0xFFFF: It needs 3 bytes.
For this reason, UTF-16 has been proposed. It encodes every Unicode character in either 16 or 32 bits.
Advantages:
• less space consumption in the range 0x0800 - 0xFFFF: 2 bytes as opposed to 3 in UTF-8.
Disadvantages:
• more space consumption in the range 0x0000 - 0x007F: 2 bytes instead of 1 in UTF-8.
• not backwards compatible with ASCII
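The space trade-off can be measured per character. The sketch uses Python's "utf-16-be" codec so that no byte order mark is counted; the helper name `sizes` and the sample characters are assumptions made for this example.

```python
def sizes(ch):
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in ["A", "ç", "中", "😀"]:
    print(ch, sizes(ch))
# A (1, 2)
# ç (2, 2)
# 中 (3, 2)
# 😀 (4, 4)
```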
Example: Char encodings in Python
Reading:
with open("text.txt", encoding="utf-8") as file:
    for line in file:
        print(line)
If encoding is omitted, this uses the default encoding of the operating system, which might be different from UTF-8!
Writing:
with open("text.txt", "w", encoding="utf-8") as file:
    file.write("Bună ziua!")
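What goes wrong with a mismatched encoding can be sketched as follows. The example writes a temporary file in UTF-8 and reads it back once as UTF-8 and once as "cp1252", which stands in here for a non-UTF-8 operating system default (an assumption of this sketch).

```python
import os
import tempfile

text = "Bună ziua!"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "text.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)
    with open(path, encoding="utf-8") as file:
        right = file.read()     # same encoding as used for writing
    with open(path, encoding="cp1252") as file:
        wrong = file.read()     # each byte is read as one Western character

print(right == text)            # True
print(wrong == text)            # False: the two bytes of ă became two characters
print(len(text), len(wrong))    # 10 11
```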
Example: Char encodings in Java
import java.io.*;

File f = new File(...);
InputStream s = new FileInputStream(f);         // raw bytes: 0 1100110, 0 1100001, 110 00011, 10 100111, ...
Reader r = new InputStreamReader(s, "UTF-8");   // decoded chars: f, a, ç, ...
Summary: Char encodings
• ASCII: only English chars
• Code pages: one page per script
• HTML entities: work in browsers
• Unicode: maps all chars
• UTF-8: maps chars to variable # bytes
In most applications, UTF-8 is the encoding of choice.