File Formats for Digital Preservation
Fabian M. Suchanek
based on “Best File Formats for Archiving”
Pre-Digital Storage
How old is this?
Code Of Hammurabi
Pre-Digital Storage
And this?
St Cuthbert Gospel
Pre-Digital Storage
And this?
Can you still read it?
US Declaration of Independence
Digital Storage
Floppy Disk
And this?
Can you still read it?
Digital Storage
CD-ROM
And this?
Can you still read it?
Digital Media may be irrecoverable
[Graph (hypothetical, log scale): # documents over time (Antiquity, Middle Ages, Modern times), with a “Digital Dark Age” marked]
This poses problems
• for historians
• for society (legal problems)
• for the individual
The original footage of the 1969 moon landing was lost. Only low-quality copies remain.
Wikipedia / Apollo 11 missing tapes
Other example: Wired Magazine 2017-01-19
[Photos: my grandfather in his twenties; me in my twenties]
Digital Preservation
There are (at least) 3 sources of obsolescence:
1) Digital media decays
2) Digital media becomes obsolete
3) File format becomes obsolete
Life expectancy of media
[Chart: life expectancy of storage media] Estimates differ widely. This one is by Crashplan.com.
Orthogonal dangers:
• hazards (fire, theft, loss, mishandling)
• media no longer supported
Opinion of the Digital Preservation Workshop: “Media technology changes so rapidly that high longevity media is likely to be threatened by obsolescence before its useful life is over”
Digital media becoming obsolete
DB Workshop
See also: Museum of Obsolete Media
Today:
• Optical storage
• USB keys
• cloud services
Example: Best Buy stops selling CDs.
The best solution appears to be to always copy the data from the old medium to the new one.
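Copying from an old medium to a new one is only useful if the copy is complete. A minimal sketch of such a migration step, assuming both media are mounted as ordinary folders: the function names `sha256_of`, `migrate_file` and `demo_migration` and the file `letter.txt` are hypothetical, chosen for illustration only.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, target_dir: Path) -> Path:
    """Copy `source` into `target_dir` and verify the copy bit for bit."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copy2 also keeps timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch for {source}")
    return target


def demo_migration() -> bool:
    """Migrate one small file between two temporary folders."""
    with tempfile.TemporaryDirectory() as tmp:
        old_medium = Path(tmp) / "old_medium"  # stands in for the old medium
        new_medium = Path(tmp) / "new_medium"  # stands in for the new medium
        old_medium.mkdir()
        original = old_medium / "letter.txt"
        original.write_bytes(b"Dear future reader")
        copy = migrate_file(original, new_medium)
        return copy.read_bytes() == b"Dear future reader"
```

The checksum comparison only shows that the copy equals what could still be read from the old medium. It cannot detect data that had already decayed before the copy.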
File Formats can become obsolete
The [British] National Archives, which holds 900 years of written material,
has more than 580 terabytes of data — the equivalent of 580,000
encyclopaedias — in older file formats that are no longer commercially
available. [BBC: Warning of data ticking time bomb, 2007-07-03]
Established File Formats
A file format is considered established if
• it has been around for a sufficiently long time
• it is supported by several vendors (and not just by a single company)
• it is platform-independent (works on Windows, Mac, Linux, mobile)
Examples:
• MP3 for audio (since 1993)
• JPG for images (since 1992)
• PDF for documents (since 1993)
Example: Flash
Flash is a software suite by Adobe for the production of animations, browser games, rich Internet applications, desktop applications, mobile applications and mobile games. Its main file formats are
• FLA: the main file format of Flash projects
• SWF (Shockwave Flash): a file format for multimedia and ActionScript
• FLV: the main file format for Flash videos
Flash
• has been around since 2000
• can be played in most desktop browsers
• is thus platform-independent
80% of Web users interacted with Flash at least once a day in 2014. [Chromium.org]
=> very established
BUT: Flash was abandoned
Flash has security problems and was superseded by HTML5 capabilities. [Adobe News]
Not everybody has noticed...
You may also still have Flash videos on your computer!
FLV is a container format, so you might be able to recover the content losslessly.
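To find out whether an old file really is an FLV container, one can inspect its first bytes. The sketch below assumes the header layout of the published FLV specification as I understand it: the signature "FLV", one version byte, one flags byte (0x04 = audio, 0x01 = video) and a 4-byte big-endian header size. The function name `parse_flv_header` and the sample bytes are hypothetical. Recovering the audio and video streams themselves would need a demuxer, which is not shown here.

```python
import struct


def parse_flv_header(data: bytes) -> dict:
    """Parse the 9-byte FLV header; raise ValueError if it is not FLV."""
    if len(data) < 9 or data[:3] != b"FLV":
        raise ValueError("not an FLV file")
    # One version byte, one flags byte, then the header size (big-endian).
    version, flags, header_size = struct.unpack(">BBI", data[3:9])
    return {
        "version": version,
        "has_audio": bool(flags & 0x04),
        "has_video": bool(flags & 0x01),
        "header_size": header_size,
    }


# Hand-made sample header: version 1, audio and video present, size 9.
sample_header = b"FLV" + bytes([1, 0x05]) + (9).to_bytes(4, "big")
info = parse_flv_header(sample_header)
```

For a real file, the same function could be called on the first 9 bytes read in binary mode.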
Established & not abandoned
The rest of this lecture is concerned with file formats
(1) that are established and
(2) that show no signs of abandonment.
File Formats
A file format is a standard way that information is encoded for storage in a computer file [Wikipedia].
ñé*0NvÉyO9Rqann£=MºlQE
é>jÇc!é^hüXAu6K¥ndTP2K
iÉpBJ1ïe»ûkWwmrò¥ok~ñ7F
...
Data as a sequence of bytes stored in a file (shown above).
The file format defines the translation between the data and these bytes.
File Extension
The file extension is the part of the file name after the last dot. It identifies the file format.
ñé*0NvÉyO9Rqann£=MºlQE
é>jÇc!é^hüXAu6K¥ndTP2K
iÉpBJ1ïe»ûkWwmrò¥ok~ñ7F
...
Data as a sequence of bytes stored in a file (shown above). File extension: JPG
Text documents:
• DOCX
• ODT
• ...
Images:
• JPG
• PNG
• ....
Audio:
• MP3
• OGG
• ...
...
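The definition above can be sketched in a few lines. The lookup table `CATEGORY_BY_EXTENSION` and the helper names are hypothetical. The table only contains the examples listed above. Note that the extension is merely a naming convention: renaming a file does not change its format.

```python
from pathlib import Path

# Hypothetical lookup table built from the examples in the lists above.
CATEGORY_BY_EXTENSION = {
    "docx": "text document",
    "odt": "text document",
    "jpg": "image",
    "png": "image",
    "mp3": "audio",
    "ogg": "audio",
}


def file_extension(file_name: str) -> str:
    """Return the part after the last dot, lower-cased ('' if no dot)."""
    return Path(file_name).suffix.lstrip(".").lower()


def guess_category(file_name: str) -> str:
    """Guess the type of data from the extension alone."""
    return CATEGORY_BY_EXTENSION.get(file_extension(file_name), "unknown")
```

For example, `file_extension("holiday.2019.JPG")` yields `"jpg"`, because only the part after the last dot counts.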
Types of Data
• Introduction
• Images
• Audio
• Video
• Office Documents
• Summary
SVG
Scalable Vector Graphics is a file format intended for vector images (= images that consist of simple geometric shapes). File extension: SVG
SVG describes the shapes in XML (a human-readable format).
Data stored in file (simplified):
<circle cx="30" cy="30" r="10" stroke="blue" />
<line x1="15" x2="15"...
...
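A complete SVG file needs a little more than the simplified excerpt: an enclosing `<svg>` element with the SVG namespace, quoted attribute values, and `cx`/`cy` for the center of a circle. The sketch below builds such a file as a string and checks that it is well-formed XML. The concrete coordinates, sizes and colors are made up for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NAMESPACE = "http://www.w3.org/2000/svg"

# A small but complete SVG document: one circle and one vertical line.
svg_text = f"""<svg xmlns="{SVG_NAMESPACE}" width="60" height="60">
  <circle cx="30" cy="30" r="10" stroke="blue" fill="none" />
  <line x1="15" y1="10" x2="15" y2="50" stroke="black" />
</svg>
"""

# Parsing succeeds only if the text is well-formed XML.
root = ET.fromstring(svg_text)
# Strip the namespace prefix "{...}" from each child tag.
shapes = [child.tag.split("}")[1] for child in root]
```

Saving `svg_text` to a file with the extension SVG and opening it in a browser should display the two shapes.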
SVG is for geometric shapes
Data stored in file:
<man look=left nose=big />
<tie style=old color=...
This does NOT WORK!
SVG is great for geometric shapes, but NOT for more complex images.
SVG has been around since 1999 and can be displayed in all browsers => very established
PNG
Portable Network Graphics is a file format intended for raster images (= images that consist of pixels). File extension: PNG
Data stored in file (simplified):
WWWWWWBBBBB
WWWBBBBBBBB
...
The file stores the color of every pixel. The data is then compressed.
PNG Details
PNG files start with “0x89 PNG 0x0D 0x0A 0x1A 0x0A”, i.e., if a DOS CRLF were transformed into a Linux LF or vice versa, we would notice.
PNG files can define their colors in a palette.
Data stored in file (simplified):
Palette:
0 = darkgray
1 = lightgray
2 = light brown
...
22222220000011111111
....
There are also standard palettes (most notably red/green/blue).
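The signature check can be tried out directly. The sketch below uses the 8 signature bytes followed by dummy bytes instead of a real image. The names `looks_like_png`, `intact` and `damaged` are hypothetical.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # 0x89 P N G 0x0D 0x0A 0x1A 0x0A


def looks_like_png(data: bytes) -> bool:
    """Check whether the data starts with the 8-byte PNG signature."""
    return data.startswith(PNG_SIGNATURE)


# A stand-in for a real file: the signature followed by dummy bytes.
intact = PNG_SIGNATURE + b"dummy chunk data"
# Simulate a text-mode transfer that turns every CRLF into LF.
damaged = intact.replace(b"\r\n", b"\n")
```

Here `looks_like_png(intact)` is true and `looks_like_png(damaged)` is false: the line-ending conversion has altered the signature, so the damage is noticed before any pixel is decoded.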
PNG can interlace the data, so that the image shows in low resolution when it has been transferred partially.
[Figure: data stored in file with interlacing (simplified)]
There are more filtering steps. Finally, the data is compressed using the same algorithm as ZIP (DEFLATE).
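The effect of the final compression step can be sketched with Python's `zlib` module, which implements DEFLATE. This is not a PNG encoder: the pixel rows below are the letters from the simplified example (one byte per pixel), repeated to form a hypothetical image, and the filtering steps are left out.

```python
import zlib

# Hypothetical 11-pixel-wide image: W = white, B = black, one byte each.
rows = [b"WWWWWWBBBBB", b"WWWBBBBBBBB"] * 50
raw = b"".join(rows)

# Compress with DEFLATE at the highest compression level.
compressed = zlib.compress(raw, level=9)
# Decompressing gives back exactly the original bytes (lossless).
restored = zlib.decompress(compressed)
```

Because the rows are highly repetitive, `compressed` is much shorter than `raw`, and `restored` is identical to `raw`. Photos contain far less repetition, so they compress much less well.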
PNG Summary
PNG is great for
• scanning photos
• screenshots
...but not so great for
• geometric shapes (use SVG)
PNG has existed since 1997, can be displayed in any image software and in any browser, and is the most widely used lossless image format on the Web => very established
PNG Competitors
Compared to GIF,
• PNG supports partial transparency (alpha channel)
• PNG supports 16 million colors
• PNG does not support animation
Compared to TIFF,
• PNG is more widely supported
• PNG does not support multi-page images (and many other features)
• PNG does not support the CMYK color model
CMYK Color model
[Figure © Mississippi State University: the CMYK color model is used in printing, the RGB color model is used on the screen]
Resolution
The resolution of an image is the number of pixels in each dimension, for example 1500 pixels by 2500 pixels.
For paper, the resolution is often given in dots per inch (DPI), for example: 600 pixels in 1 inch (= 2.54 cm) => 600 DPI
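The relation between paper size, DPI and pixel count is a simple multiplication. A small sketch, with the hypothetical helper names `cm_to_inch` and `pixels_for_scan`:

```python
from typing import Tuple

CM_PER_INCH = 2.54


def cm_to_inch(length_cm: float) -> float:
    """Convert centimetres to inches."""
    return length_cm / CM_PER_INCH


def pixels_for_scan(width_inch: float, height_inch: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a scan of the given paper size at the given DPI."""
    return round(width_inch * dpi), round(height_inch * dpi)
```

For example, scanning a 6 inch by 4 inch photo at 600 DPI gives `pixels_for_scan(6, 4, 600)`, i.e. 3600 by 2400 pixels.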
Choosing the Resolution
[Figure: a human eye looking at an image. One eye cell can distinguish 31.5 arc seconds => 6000 pixels over the height of the image]
If you stand at least as far away from the image as the image
is high, the image does not need more than 6000 pixels vertically.
(A higher resolution is needed for closer distances, zooming,
post‐processing, etc.) The resolution scales linearly with the distance:
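The figure of 6000 pixels can be checked with a little trigonometry. The sketch assumes that the viewer looks at the center of the image, so that the image height spans an angle of 2·atan(height / (2·distance)), and that the eye resolves 31.5 arc seconds as stated above. The function name `pixels_needed` is hypothetical.

```python
import math

EYE_RESOLUTION_ARCSEC = 31.5  # figure quoted above


def pixels_needed(image_height: float, viewing_distance: float) -> float:
    """Pixels along the image height that the eye can still tell apart.

    Both arguments must use the same unit (e.g. metres).
    """
    angle_rad = 2 * math.atan(image_height / (2 * viewing_distance))
    angle_arcsec = math.degrees(angle_rad) * 3600
    return angle_arcsec / EYE_RESOLUTION_ARCSEC
```

With the distance equal to the image height, `pixels_needed(1.0, 1.0)` gives roughly 6070, which matches the rounded value of 6000. At half that distance, `pixels_needed(1.0, 0.5)` is noticeably larger.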
The problem with PNG
A typical smartphone picture nowadays has a resolution of 3000 × 4000 pixels.
That’s about 20 megabytes per picture as PNG!
(If you scan a photo at 600 DPI, you get 10MB-20MB)
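A rough way to see where such sizes come from is to compute the uncompressed size first. The sketch assumes 3 bytes per pixel (8 bits each for red, green and blue, no transparency) and 1 megabyte = 1,000,000 bytes. The function name `raw_image_megabytes` is hypothetical.

```python
def raw_image_megabytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> float:
    """Size of the uncompressed pixel data in megabytes (10^6 bytes)."""
    return width_px * height_px * bytes_per_pixel / 1_000_000


smartphone_raw_mb = raw_image_megabytes(3000, 4000)
```

Under these assumptions, `smartphone_raw_mb` is 36.0. Lossless compression shrinks this, but how much depends on the image content, so a file of around 20 megabytes is plausible for a detailed photo.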