Best File Formats for Archiving
This guide compares common file formats for the purpose of digitally archiving your data. It also discusses how to choose a resolution for images, and how to choose a sampling rate and a bit rate for MP3 audio files.

Disclaimer: All of the below is provided as my personal opinion only, without guarantee for completeness or correctness. The current version of this text is 2017-06-06.

 

Why archive your data?

We do not often realize it, but much of our life is nowadays digital.

We often take it for granted that this data will live on. After all, the paper diaries from our childhood still live on. Old books and documents survive for centuries. You can still see the original US Declaration of Independence (more than 200 years old), the Gutenberg Bible (more than 500 years old), and even the Cuthbert Gospel (1300 years old). Scriptures carved in stone survive for millennia (the Code of Hammurabi is 3700 years old).

Digital data does not survive in this way. There is first the problem of the storage medium. Physical storage devices change roughly every 10 years: It used to be floppy disks, then it was CDs, then DVDs, then flash drives (USB sticks), and now the cloud. Whenever a new technology comes up, support for the older technologies fades out. Today, hardly any computer still has a floppy drive. Furthermore, the devices themselves have a life span of about 10 years, and often less: hard drives, for example, typically last around 5 years before they start becoming faulty. After that time, they may lose their data. We may think that the cloud is the solution. However, cloud companies, likewise, may cease to exist. The only way to keep your data alive despite these changes is to constantly copy it from the older technology to the newer one.

Then there is the problem of file formats: Have you ever tried to open a “WRI” file on your computer? This was once a popular document format. Today, few programs can read such a file. One day, such files may become inaccessible for the average user. The same fate may strike today's MP3, DOCX, or JPG files one day. This leads to what has been called the Digital dark age: the inability to read historical electronic documents and multimedia, because they have been recorded in an obsolete and obscure file format. To prevent this from happening, you have to choose your archiving formats wisely. This guide will help you.

Types of File Formats

Open vs Proprietary File Formats

Plain Text File Formats

There is one group of file formats that is completely unproblematic for archiving: plain text formats. These are all file formats that you can open with a text editor such as Notepad, VI, or TextEdit, and that are human-readable. These include TXT-files, code files, LaTeX files, CSV files, TSV files, and the like. There is no particular software needed to read them. They are thus completely safe for archiving.

Two caveats remain: The first is the character encoding. There used to be a variety of encodings, but thankfully the world has now settled on UTF-8. It is the dominant encoding on the Web nowadays, it is the recommended setting for emails, it has been around since 1993, it's backwards compatible with ASCII, and it's space-efficient for Western characters. Thus, if you write Western characters, UTF-8 is the way to go. Make sure your text editor is set to UTF-8. Unfortunately, Windows does not use UTF-8 by default in Notepad or WordPad. Hence, it is very cumbersome to use UTF-8 text files on Windows. Furthermore, Notepad does not display non-Windows line-breaks correctly. Therefore, I recommend using Notepad++, which is one of the most popular text editors on Windows. To display UTF-8 text files in Firefox, you have to press F10, then click View -> Encoding -> Unicode.
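If you are not sure whether an old text file is already in UTF-8, you can check this with a few lines of Python. The following is only a minimal sketch: the function names are my own, and the default source encoding "cp1252" (the classic Western Windows encoding) is an assumption that you have to adapt to your files.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> None:
    """Rewrite a text file as UTF-8.

    source_encoding is an assumption about how the file was originally
    saved; decoding raises UnicodeDecodeError if that guess cannot be right.
    """
    raw = path.read_bytes()
    if is_valid_utf8(raw):
        return  # already UTF-8 (plain ASCII counts as UTF-8, too)
    path.write_bytes(raw.decode(source_encoding).encode("utf-8"))
```

Keep a copy of the original file before converting, because a wrong guess for the source encoding produces garbled characters.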

The second caveat is for files of code: The code itself will survive in a plain text file, but the compiler may not. Thus, you may find yourself with code that you can no longer run.

Proprietary File Formats

A “proprietary” file format is a format that is developed by one particular company. Examples are Microsoft Office documents (DOCX, XLSX, and the like), or Adobe Flash movies (SWF). These formats come in various flavors:
  1. Un-documented formats have no public documentation. Thus, nobody can easily write software that reads these files. The files can be read only by the software of the company that created them. Examples are the archive format RAR, the Corel drawing format CDR, or Microsoft's WMA audio format.
  2. Documented file formats have a documentation. However, this documentation might not be available for free. For example, the standards of the International Organization for Standardization (ISO) are not free.
  3. For other documented file formats, the documentation is available for free. Thus, people can write software to process these files. However, these file formats may be encumbered by software licenses, patents, or intellectual property rights. Thus, whoever writes software to read these formats may have to pay the company. This is the case for MP3 and HEVC.
  4. Some formats were supposedly free of known patent issues, but then some other company started claiming intellectual property rights in retrospect. This has been the case for JPG.
  5. Some formats are known to be free of patent issues, either because claims to intellectual property have been rejected, or because the company has renounced its claims. Still, the company may implement the format slightly differently from what has been documented, or may decide to change the format in the future. Therefore, the documents often display differently in software from other developers.
  6. Some formats are standardized. This means that they have been submitted to a standards organization, which has documented the format. This makes the format a bit more resilient to unilateral change.

Still, proprietary file formats are de facto under the control of a single company. Software by other vendors often does a worse job at displaying them. So if you don't have the main company's software, if it is not available for your operating system, or if the company stops producing it, you lose access to your data.

Open File Formats

In view of the shortcomings of proprietary file formats, people have developed “open” file formats. These come again in a variety of flavors. The main criteria of an open format are:
  1. The format is fully documented and publicly available.
  2. The format is free from copyright restrictions, intellectual property claims, or restrictive licenses.
  3. The further development of the format is decided by a vendor-independent standards organization or a community (e.g., in the form of an open source development community).

Theoretically, anybody can write software to read open file formats. This reduces the dependency on a single company. Furthermore, an open standard is developed by a standards organization or a community, with the goal to make the format as general and thought-through as possible. Examples for open file formats are the open office formats (the document format ODT, the spreadsheet format ODS, etc.), the Web formats (the image format SVG, the document format HTML, etc.), and a number of other formats (such as the document format PDF or the image format PNG).

Quasi-Open File Formats

Open file formats are free from royalty claims. They are developed by a vendor-independent standards organization or a community. There are file formats that do not exactly fulfil these requirements, but that are equivalent to open file formats for all practical purposes.

Some of these file formats are not developed by a vendor-independent standards organization or a community. However, their development has practically stabilized. They do not change any more. Thus, the formats are equivalent to open formats for practical purposes. This includes formats such as the archiving format ZIP.

Some formats are under intellectual property claims, but these are disputed or unenforced. Thus, most common users are unaware of these claims, or choose to ignore them. Such formats include the audio format MP3, and the movie format MP4.

I call these formats “quasi-open file formats”. From a user's point of view, they are roughly equivalent to the open file formats.

Recommendation

Generally, the following file formats are more suitable for archiving purposes:
  1. Plain text formats
  2. Open formats
  3. Quasi-open formats.
I group these together as “non-proprietary formats”.

Some open formats have a hard time catching on, because the proprietary competitor formats are being developed and pushed by big companies. At the same time, the boundary between open and proprietary formats is nowadays fuzzy: the big companies often standardize their formats publicly, they promise to abstain from patent claims, and they support also the open formats. Vice versa, due to the ubiquitousness of the proprietary formats, open software can usually read the proprietary formats, and even the software of one company can often read the proprietary formats of the other companies. Some proprietary formats are so standardized and so well-established (most notably the Microsoft Office formats) that they will most likely remain supported by software for the years to come.

We will discuss the best file formats for each type of data below.

Maturity

Established File Formats

When you choose a file format, you want to make sure that it stands the test of time. It is hard to predict whether a file format will still be around in the future, but the following can be indicators:
  1. The file format has been around for a sufficiently long time.
  2. The file format is supported by several vendors (and not just by a single company).
  3. The file format is platform-independent, i.e., it enjoys support on Windows machines, Macs, and Unix-based systems.

I call file formats that meet these criteria “established”. They are the main focus of this guide. Established file formats are, e.g., the image format JPG, the audio format MP3, or the document format HTML. Some proprietary formats are also quite established, in particular the Microsoft Office formats (DOCX, XLSX, etc.).

Open Browser Formats

There is a special class of file formats that I call “Open browser formats”. These are open file formats, i.e., they are developed by a community or an association of several parties. Their main strong point is that they can be displayed by the major Web browsers. No other software is needed to read the files. Since every major operating system nowadays has several Web browsers, these file formats are platform-independent and vendor-independent.

Some of these open browser formats are relatively new, and thus not yet established. These include the audio format Opus or the movie format WebM. These formats are supported by important players such as Google, Mozilla, Wikipedia, or Apple. Hence, software for editing these formats has also already been written. All of this may indicate that the formats have good chances of becoming established in the future. That said, the implementation by the major browsers is no guarantee for the future. The browsers have stopped support for SVG fonts. OGG+Theora was first recommended by the W3C, and the recommendation was then retracted.

Recommendation

Generally, one should use only established file formats for archiving.

Idealists may also consider the Open Browser Formats. We will discuss the best file formats for each type of data below.

Lossy vs Lossless File Formats

Lossless file formats

A lossless file format stores the data exactly as it was originally produced or obtained. The majority of file formats are lossless. An office document, e.g., will store the text exactly as you typed it. But think of audio file formats: Some frequency combinations are inaudible to the human ear. So should we really store them in the file? If we remove them, the file becomes around 10 times smaller. That is what lossy file formats do. Lossless formats, in contrast, keep every detail, even if it is imperceptible. The choice between lossy and lossless formats applies mainly to image, video, and audio data.

Common lossless formats include PNG for images, and FLAC for audio data. Lossless formats are interesting for archiving for two main reasons: First, they allow future use of the data for applications that were not originally envisaged. In the example of the audio file, a DJ may want to artificially slow down the record, and mix it with other audio files. Data losses that were once imperceptible will then suddenly be striking. The second argument for lossless formats is that we can never be sure how long file formats will persist in the future. One day, we may be obliged to convert our files into a newer format. If the newer format is also lossy, then the little losses will add up, ultimately degrading the quality of the file.

Thus, lossless formats are generally to be preferred for archiving purposes.

Resolution of lossless file formats

Lossless file formats keep every detail of the data, even if it is imperceptible to humans. This applies in particular to audio files, video files, and images. However, even lossless file formats cannot mirror reality completely. This is because reality is analog, and file formats are digital. To see this, think of the sine waveform produced by a sound. A vinyl record of that sound will contain an engraving of exactly that waveform. A computer cannot do that: it has to digitize the sound wave, i.e., to break it down into small steps. Even the best digital recorders have to do that.

The same is true for digital cameras: They can only mirror reality with the number of pixels they have. Anything that is smaller than one of their pixels cannot be captured.

Thus, even lossless file formats can mirror reality only up to a certain degree of precision. I call this degree the “resolution”. The higher this resolution, the better reality is captured — and the larger the file will be. Generally, one aims at a resolution that is so high that the human cannot distinguish the recording from reality. Lossless formats are then lossless in this sense.
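To make the idea of digitizing concrete, here is a small Python sketch that breaks a sine wave down into steps. It is an illustration only: the function name and its parameters are my own, and real recorders are more sophisticated. The sampling rate determines how often the wave is measured, and the bit depth determines how many different values each measurement can take.

```python
import math


def digitize_sine(frequency_hz: float, sampling_rate_hz: int,
                  bit_depth: int, num_samples: int) -> list:
    """Sample a sine wave and round each sample to a signed integer step."""
    max_step = 2 ** (bit_depth - 1) - 1  # e.g. 32767 for 16 bits
    samples = []
    for n in range(num_samples):
        t = n / sampling_rate_hz  # time of the n-th sample in seconds
        value = math.sin(2 * math.pi * frequency_hz * t)  # analog value in [-1, 1]
        samples.append(round(value * max_step))  # nearest available digital step
    return samples


# A very coarse resolution: 4 samples per cycle, 4 bits per sample.
coarse = digitize_sine(440.0, 4 * 440, bit_depth=4, num_samples=8)

# The resolution of an audio CD: 44,100 samples per second, 16 bits per sample.
fine = digitize_sine(440.0, 44100, bit_depth=16, num_samples=100)
```

The coarse version only ever hits the zero crossings and the peaks of the wave, so everything in between is lost. The fine version follows the curve in much smaller steps.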

In some cases, the source has already been digitized. Think of a CD that you want to rip to your computer. In these cases, a lossless copy of the CD to your hard drive is completely lossless with respect to that source.

Lossy file formats

As we have seen, lossless file formats mirror the input as closely as possible, up to a certain resolution. A lossy file format loses even more data: it throws away details of the data that can hardly be perceived by a human anyway. For example, the audio format MP3 is a lossy file format: it removes frequency combinations that humans cannot perceive. This results in smaller file sizes. The image format JPG does the same thing for images: It throws away details in a picture that humans are unable or unlikely to perceive. These file formats typically let the user choose the compression ratio, i.e., the amount of detail that is thrown away. Higher compression ratios produce smaller files and throw away more details.

As we have argued before, lossy file formats are generally less adequate for archiving. The only point to be made in favor of lossy file formats is their smaller size.

That said, it does not make sense to convert lossy file formats into lossless ones. Data that has been removed will never come back anyway. Thus, if the primary form of the file you have at hand is a lossy file format, you can just keep it the way it is.

Vector file formats

Lossless file formats can mirror reality up to a certain resolution. They cannot, e.g., mirror a sine waveform in infinite detail. In the same way, a digital camera cannot take a perfect picture of a circle. There will always be pixels when we zoom in. But what if we knew that the picture should contain a circle? Couldn't we just simply tell the computer “It's a circle”?

It turns out that this is possible to some degree. It is not possible when taking pictures of nature. But it is possible when we do drawings on the computer — e.g., for a slide presentation or in a drawing program. We draw a circle, and tell the machine “it's a circle”. The file then stores “it's a circle”, and the next time we open the file, the machine draws a circle. Since the machine knows it's a circle, we can zoom in infinitely without pixels ever appearing. This is what the vector image format SVG does.
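As an illustration, the following Python sketch produces such an “it's a circle” file. The function name and the default fill color are my own choices; the circle element with its attributes cx, cy, and r is standard SVG.

```python
def circle_svg(radius: int, fill: str = "steelblue") -> str:
    """Return a minimal SVG document that describes one circle."""
    size = 2 * radius
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">\n'
        f'  <circle cx="{radius}" cy="{radius}" r="{radius}" fill="{fill}"/>\n'
        "</svg>\n"
    )


# Write the description to a file that any major browser can display.
with open("circle.svg", "w", encoding="utf-8") as f:
    f.write(circle_svg(50))
```

The resulting file is plain text of fewer than 200 bytes, no matter how large the circle is displayed on the screen.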

The same thing can be achieved to some degree with music. If we know that a piano plays a certain sequence of notes, then there is no need to digitize the wave form. We can just tell the machine “A piano plays this sequence of notes”. This is what the audio format MIDI stores.

I will call these formats “Vector formats”. Compared to lossy and lossless formats, vector formats are ideal for archiving. First, they keep every detail. Second, they usually produce considerably smaller files than the lossy or lossless formats. However, vector formats can be used only when the image or sound is described explicitly.

Recommendation

For audio and image material, we have the choice between lossless formats, lossy formats, and vector formats. Generally, vector formats are the best formats for archiving, because they mirror the data exactly. However, they can only work if the underlying data is vectorized. If that is not the case, lossless formats are the way to go. Their resolution should be chosen so high that a human cannot perceive the difference to the original. If you have lossy formats lying around, it does not make sense to convert them to lossless formats, because the quality will not increase. You can just keep them. In particular, it does not make sense to convert lossy file formats to other lossy file formats, because this will only amplify the losses.

We will discuss the best file formats for each type of data below.

Read-only file formats

Read-only file formats

For most file formats, there is software that can edit and modify the files. Yet, for some file formats, this is not the case: either because the format does not allow it, or because there is no such software, or because editing the file would result in loss of information. Take for example PDF documents. It is very hard to edit the text of a PDF document. You can add comments, and you can fill forms, but you cannot easily change the text of a PDF document. Thus, PDF is not modifiable in practice. I call such file formats “read-only”.

Other read-only file formats are the lossy file formats, such as MP3 or JPG. There exists software to edit such files. However, each modification aggravates the loss of data. If you repeatedly edit a JPG file, the picture will ultimately suffer.

Recommendation

For archiving, read-only file formats should be avoided. This is because they disallow not just the modification of the data, but also the transformation of the data into another file format. The latter may become necessary if the file format becomes obsolete one day. In such a case, read-only file formats can result in a loss of data.

The lossy file formats are all read-only in our sense. None of the other file formats discussed here is read-only, unless this is explicitly mentioned.

Recommended File Formats

Documents

Microsoft Office Documents

Microsoft is the leading producer of office software. The Microsoft Office suite contains the software called Word for documents, the software Excel for spreadsheets, and the software PowerPoint for slide presentations. The file formats are DOCX, XLSX, and PPTX, respectively. Microsoft products and file formats are well established and are the de facto standard in the office world.

Microsoft formats are definitely established. However, they are also proprietary.

Open Office Formats

The Open Office project started with the goal to provide an open alternative to Microsoft products. Its file formats are called “Open Office formats”: ODT for documents, ODS for spreadsheets, and ODP for slide presentations. Confusingly, Microsoft has decided to call its own proprietary formats “Office Open XML”. Furthermore, the original OpenOffice software has been discontinued. It lives on in the LibreOffice project.

All that said, Open Office software and file formats are established and open. They are thus the way to go if you want to create archivable documents. The LibreOffice software to work with such documents can be downloaded for free.

HTML for Documents

I will now discuss some less frequent choices of document file formats. We start with using HTML for documents.

HTML is a file format for text with layout. It is an open format, developed by the World Wide Web Consortium. It has been around for more than 20 years, and it can be displayed on nearly any device with a display. The format can thus be considered sufficiently established. Conveniently, the main office software suites support exporting documents to HTML. They also support editing HTML documents. Thus, in principle, HTML would be the ideal file format for documents.

There are two caveats: First, editing support for HTML documents is not always perfect. An HTML document edited in LibreOffice will not always display in the same way in Microsoft Office and vice versa. Second, there is no universally accepted way to integrate embedded material (such as images) into such documents. Common ways are:

  1. A link to the external image, leaving it where it is (LibreOffice). The disadvantage is that the link is not obvious. Deleting or moving the external image, or copying the HTML document without copying the external image will destroy the document.
  2. A separate folder, which contains copies of all embedded material. This is what happens when you click “Save complete Web page” in your browser. It is a well-supported mechanism across all browsers. The disadvantage is that the HTML file and the embedded material are physically separated. This bears the risk of deleting, renaming, copying, sharing, or moving one without the other.
  3. A bundle, i.e., a single folder that holds all embedded material as well as the main file (called index.html). The disadvantage is that one has a folder instead of a file.
  4. MHTML, which is basically a file that contains a sequence of files, much like MIME Emails. The disadvantage is that native support remains limited to Internet Explorer and Opera. Firefox and Safari require an extension.
  5. The Mozilla Archive Format (MAF) of Firefox — basically a ZIP file with the markup and images, with metadata saved as RDF. The disadvantage is that no other browser besides Firefox supports this format.
  6. Data URIs, i.e., the embedding of the image straight into the HTML file in base64 encoding. The disadvantage is that the HTML document becomes hard to edit by hand. Also, the office software typically does not allow this option when exporting. Thus, the file requires post-processing to embed the images in this way.
  7. Printing the entire file to PDF, which is an altogether different file format. The disadvantage is that PDF files are essentially read-only.

For these reasons, HTML has led a niche existence as a document format. Still, geeky idealists can use it to write text documents. Personally, I use HTML to write most of my documents, doing the markup by hand. I use bundles to group the embedded material. I also use HTML as an archiving format for documents. In that case, I opt for Data URIs, because they produce a neat single file.
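The post-processing step for Data URIs can be automated. Below is a minimal Python sketch of how this might look. The function names are hypothetical, and the regular expression is a simplification that only handles img tags whose src attribute is written in double quotes. A robust tool would use a real HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(image_path: Path) -> str:
    """Encode one image file as a base64 data URI."""
    mime_type, _ = mimetypes.guess_type(image_path.name)
    if mime_type is None:
        raise ValueError(f"Cannot guess the MIME type of {image_path}")
    encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> references with data URIs."""

    def replace(match: re.Match) -> str:
        src = match.group(2)
        if src.startswith(("data:", "http:", "https:")):
            return match.group(0)  # leave remote or already inlined images alone
        return match.group(1) + to_data_uri(base_dir / src) + match.group(3)

    # Simplified pattern: an img tag with a double-quoted src attribute.
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace, html)
```

A typical call would read the exported HTML file, pass its text and its folder to inline_images, and write the result to a new file, so that the original export stays untouched.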

PDF for Documents

PDF is a file format for documents that was originally developed by Adobe. The best-known software to display PDF documents is Adobe's own Acrobat Reader. PDF has come a long way since its inception. Nowadays, it is no longer under the control of Adobe, but under the control of a standards organization (in which Adobe is a mere member). Thus, it is an open file format. Furthermore, software to display PDF documents is ubiquitous: PDF documents can be displayed on all major operating systems (often natively), and in all major browsers. It is the de facto standard for sharing printable material. Thus, it is an established file format.

The main drawback of PDF is that it's hard to edit. There are not many software tools that allow users to modify PDF documents. Sure, you can add comments, you can fill out forms, and you can sign them, but you cannot easily change the text of a PDF document. In fact, you often cannot even properly copy/paste from a PDF document. The reason for all this is that PDF was designed as a page layout language. It essentially tells the printer where to place certain text elements — with no respect for the actual flow of text. PDF is essentially read-only.

Thus, while PDF is certainly a great choice for archiving unmodifiable documents, it is a poor choice for archiving anything that you wish to modify or transform in the future.

LaTeX + PDF for Documents

In the academic world, it is customary to produce documents with LaTeX. You basically write a plain text document, and sprinkle in little magic commands to define the layout. For example, you'd write “This is \textbf{Lisa}” to have “Lisa” appear in bold. There is also software that helps you write such code. When you're done, you save the file as a TEX file. A compiler then turns the TEX file into a PDF file. This compiler is established and open. PDF files, in turn, can be displayed in any major browser and on any major operating system. Furthermore, the underlying LaTeX code can always be edited. Thus, LaTeX is not read-only. Hence, it has all the advantages of PDF without PDF's main drawback.

Where is the catch? The catch is that it is a pain to write LaTeX documents. Of course, seasoned LaTeX users will rave about how great the layout of LaTeX is, and that may even be true. At the same time, you are sure to spend as much time on Stack Overflow (searching for the correct way to say things in LaTeX) as you spend actually writing the text. Take something as simple as inserting a blank line in your text. There are several pages of discussion about the best way to do that. The easiest way to do that (with 5 backslashes: “\\\ \\”) is not even mentioned. Thus, I personally discourage LaTeX for everyday use.

Recommendation

For documents, you have the choice: For geeks, I also mention the following:

If you have documents in any other file format, it may make sense to convert them to one of the above. One way to do that is to open the file (by double-clicking it), and then to choose “Save as” or “Export”. Then you can pick your target file format. If you want to automate the process, install LibreOffice. Open a terminal and type

    soffice --convert-to targetFormat inputFile --outdir folder --headless

On Mac, it's

    /Applications/LibreOffice.app/Contents/MacOS/soffice --convert-to targetFormat inputFile --outdir folder --headless

The target format can be, e.g., “html” or “odt”. If you choose HTML, be warned that LibreOffice produces very verbose HTML documents with external resources. You may want to clean up the HTML files and inline the resources into Data URIs. In any case, you should keep the original files.
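If you have a whole folder of old documents, you can wrap the soffice command in a small script. The following Python sketch is one way to do it. The function names are my own, and it assumes that the soffice program is on your search path (on a Mac, you would pass the full path given above instead).

```python
import subprocess
from pathlib import Path


def build_convert_command(input_file: Path, target_format: str,
                          out_dir: Path, soffice: str = "soffice") -> list:
    """Build the soffice command line for converting one file."""
    return [soffice, "--headless", "--convert-to", target_format,
            "--outdir", str(out_dir), str(input_file)]


def convert_all(source_dir: Path, pattern: str,
                target_format: str, out_dir: Path) -> None:
    """Convert every file in source_dir that matches pattern, one by one."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for input_file in sorted(source_dir.glob(pattern)):
        command = build_convert_command(input_file, target_format, out_dir)
        subprocess.run(command, check=True)  # stop at the first failure


# Example call (the folder names are placeholders):
# convert_all(Path("old-documents"), "*.doc", "odt", Path("converted"))
```

The converted files go to a separate folder, so the original files stay where they are.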

Slide Presentations

Open Office and Microsoft Office

For slide presentations, you have again basically the choice between the Microsoft file format PPTX and the Open Office file format ODP. The same discussion as for documents applies: Both file formats are established. However, only the Open Office format is open. The accompanying software, LibreOffice, can be downloaded for free.

SVG for Presentations

One geeky alternative to the Open Office and Microsoft Office presentation file formats is SVG. SVG is a file format for images. It is a vector format and thus lossless. Like HTML, it is an open format, developed by the World Wide Web Consortium. It has been around for more than 15 years, and it can be displayed by nearly all browsers. The format can thus be considered sufficiently established.

There are at least two tools that allow creating SVG slideshows: an SVG Open Tool and my own tool PowerLine. Presentations done with PowerLine can be displayed in any browser without additional software needed.

Thus, the display of such presentations is no problem at all. The creation of such presentations, in contrast, has to rely on amateur software. Therefore, SVG has not really caught on for slide show presentations. In other words, it is open, lossless, and established, but read-only.

Beamer

LaTeX+PDF is a very popular combination for writing documents in the academic world. Unfortunately, it is very painful to use. Interestingly, it is possible to experience this pain also when creating slide presentations. A widely used package for this purpose (but not the only one) is Beamer. Beamer presentations have a beautiful layout, but are often based mainly on bulleted lists. Anything else is more difficult to do in Beamer.

Beamer presentations have the same advantages and disadvantages as LaTeX+PDF. They are established and open, but painful to use.

Recommendation

For slide presentations, you have the choice: For geeks, I also mention the following:

If you have presentations in any other file format, it may make sense to convert them to one of the above. Proceed as for documents.

Spreadsheets

Open Office and Microsoft Office

For spreadsheets, you have again basically the choice between the Microsoft file format XLSX and the Open Office file format ODS. The same discussion as for documents applies: Both file formats are established. However, only the Open Office format is open.

HTML for Spreadsheets

HTML is a file format for text with layout. It is an open format. It is also established, most notably because it can be displayed on nearly any device with a display. HTML can in principle also be used to store spreadsheets. All major office applications support exporting spreadsheets to HTML. Some also allow editing them. Thus, HTML seems like the perfect format for spreadsheets.

The problem is that a simple HTML export of a spreadsheet keeps the cell values, but loses the formulas that were used to compute them (at least in Libre Office). Thus, the spreadsheet becomes effectively read-only for anything that goes beyond simple values in a table. For archiving purposes, this may be sufficient, but for everyday use, it is not.

An alternative is to use my tool Spreadshit. It is a spreadsheet program that runs as JavaScript inside an HTML document. These HTML documents are self-contained and safe for archiving. However, the tool is amateur software, and thus not ready for heavy use.

Recommendation

For spreadsheets, you have a very limited choice:

If you have spreadsheets in any other file format, it may make sense to convert them to one of the above. Proceed as for documents.

Audio Formats

FLAC

FLAC is a lossless audio format. It is also an open format, which distinguishes it from the proprietary formats ATRAC (Sony), ALAC (Apple; sometimes with file extension M4A), SACD (Sony and Philips), and Windows Media Audio Lossless (Microsoft). FLAC can be played natively on Windows machines and in all major browsers. Apple products (iOS, Mac, Safari) do not natively support FLAC — quite possibly because Apple has its own lossless audio format, ALAC. However, players for FLAC can be easily found also for Apple systems. FLAC is thus an established file format. This distinguishes it from the less well supported formats “Monkey's Audio” (APE), WavPack, TTA, MPEG-4 SLS, and SHN. Finally, FLAC compresses the data, thus making the files smaller without losing information. This distinguishes FLAC from non-compressing file formats such as WAV, AIFF, AU or raw PCM. FLAC is thus the primary choice for archiving audio data.

FLAC can encode different sampling rates (“resolutions”): higher sampling rates are more truthful to the original, but produce larger file sizes. Based on what humans can hear, the standard sampling rate for everyday use is commonly 44,100 Hz. This is the sampling rate that is used on Audio CDs. Some vendors advertise higher sampling rates, most notably with the DVD Audio format or the competing Super Audio CD format. However, blind tests have shown that humans cannot hear the difference between a sampling rate of 44,100 Hz and anything higher (except at very loud volume). Thus, there is no need to go beyond a sampling rate of 44,100 Hz. Vice versa, given that disk space is cheap nowadays, there is also no reason to go below that rate either.

Professionals may choose a higher sampling rate if they plan to edit the audio material later on, e.g., by slowing it down, or transposing it. However, it does not make sense to rip an audio CD at a sampling rate higher than 44,100 Hz. The result can never be of better quality than the original.

MP3

FLAC is a great choice for archiving audio material, because it is lossless. However, it also requires a lot of space. For this reason, people have developed lossy audio formats. These cut away little details in the audio material that cannot be heard by humans. By far the most popular lossy audio file format is MP3. That makes MP3 one of the most established file formats for audio.

There are ongoing disputes about whether MP3 is free of patents or not. Technicolor maintains that vendors of software that processes MP3 files have to pay a license fee. However, these disputes matter little to everyday users, and so MP3 can be considered a quasi-open format. This distinguishes it from proprietary formats such as WMA.

MP3 can encode the data at different sampling rates. As I have argued before, a sampling rate of 44,100 Hz is a good choice, and this is indeed the default sampling rate. On top of that, MP3 supports different bit rates. A higher bit rate means a lower compression ratio and a more truthful encoding, at the expense of larger files. Common values for the bit rate are 128, 192, 256, and 320 kbit/s. Blind tests have shown that even trained ears cannot distinguish a bit rate of 256 kbit/s from the original. Therefore, there is no need to go beyond a bit rate of 256 kbit/s. At the same time, there is also no need to go below that bit rate, because disk space is cheap nowadays. Thus, 256 kbit/s is a good choice.
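To get a feeling for what these bit rates mean in terms of disk space, here is a small sketch that converts a bit rate into megabytes per minute of audio. The function name is my own choice for illustration. The figure for uncompressed CD audio assumes 16-bit stereo samples, which is the usual Audio CD layout but is not spelled out in the text above.

```python
def megabytes_per_minute(kbit_per_s: float) -> float:
    """Size of one minute of audio at the given bit rate, in MB (10^6 bytes)."""
    bits = kbit_per_s * 1000 * 60      # bits in one minute
    return bits / 8 / 1_000_000        # bits -> bytes -> megabytes


# Assumption: uncompressed Audio CD data = 44,100 samples/s x 16 bits x 2 channels
CD_KBIT_PER_S = 44_100 * 16 * 2 / 1000

for rate in (128, 192, 256, 320, CD_KBIT_PER_S):
    print(f"{rate:7.1f} kbit/s -> {megabytes_per_minute(rate):5.2f} MB per minute")
```

At 256 kbit/s, one minute of audio takes about 1.9 MB, compared to roughly 10.6 MB for the uncompressed CD data.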

You can see the sampling rate and the bit rate of an MP3 file on a Mac by opening a terminal and typing
  afinfo your-file.mp3
On Linux, use
  file your-file.mp3
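If neither tool is at hand, the two values can also be read directly from the file, because every MP3 frame starts with a 4-byte header that stores them. The following is a minimal sketch, not a full MP3 parser: the function names are hypothetical, it only handles MPEG-1 Layer III (the usual case for MP3 files), and it reports the first frame only, so for files with a variable bit rate the result is not representative.

```python
import struct

# Lookup tables for MPEG-1 Layer III only (assumption: the usual MP3 case).
BITRATES_KBIT = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]


def parse_frame_header(header: bytes):
    """Return (sampling rate in Hz, bit rate in kbit/s) for a 4-byte header, or None."""
    if len(header) < 4:
        return None
    (word,) = struct.unpack(">I", header[:4])
    if word >> 21 != 0x7FF:            # the 11 sync bits must all be set
        return None
    if (word >> 19) & 0b11 != 0b11:    # version bits: 11 = MPEG-1
        return None
    if (word >> 17) & 0b11 != 0b01:    # layer bits: 01 = Layer III
        return None
    bitrate = BITRATES_KBIT[(word >> 12) & 0b1111]
    sample_rate = SAMPLE_RATES_HZ[(word >> 10) & 0b11]
    if bitrate is None or sample_rate is None:
        return None
    return sample_rate, bitrate


def first_frame_info(path: str):
    """Find the first plausible frame header, skipping a leading ID3v2 tag if present."""
    with open(path, "rb") as f:
        data = f.read()
    start = 0
    if len(data) >= 10 and data[:3] == b"ID3":
        # The tag size is stored in four bytes with 7 significant bits each.
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        start = 10 + size
    for i in range(start, len(data) - 3):
        info = parse_frame_header(data[i:i + 4])
        if info is not None:
            return info
    return None


# A header as it appears in a 44,100 Hz, 256 kbit/s file
print(parse_frame_header(bytes([0xFF, 0xFB, 0xD0, 0x00])))
```

For a real file, call first_frame_info("your-file.mp3"). Since the scan accepts the first byte pattern that looks like a header, it can be fooled by unusual files; afinfo and file remain the more reliable choice.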

MP4+AAC

The AAC format was designed as a successor to MP3. It is a lossy format. Technically, AAC is a codec that has to live in a container file. The most common container for this purpose is MP4. The file extension is then “mp4”. Since this extension is also used for MP4 videos, the extension is sometimes changed to “m4a”. I will refer to the combination of MP4 with AAC as MP4+AAC. This format has been around since the early 2000's, and it enjoys widespread support on all major platforms and all major software implementations. It can thus be considered established.

MP4+AAC is developed by a standards consortium. Unfortunately, it is encumbered by patent restrictions. However, the format is free for consumers, and it can thus be considered quasi-open.

With all of this, there is no particular reason to choose MP4+AAC over MP3.

OGG+Opus

The Vorbis project of the Xiph.Org Foundation set out to create a new lossy audio format that would definitively be open and free from patents. The successor of Vorbis is called Opus. Opus is just a “codec”, i.e., a way to encode audio data. It is not an actual file format. The encoding has to live in what is called a container file format. There are different container file formats that can contain Opus, most notably OGG, Matroska, and WebM. Conversely, these containers can contain other encodings than just Opus. However, OGG is the most frequent choice for Opus, and hence the format is known as “OGG+Opus”. Common file endings are OGG and OGA.

Opus is an open and lossy audio format. It can be played in the browsers Firefox, Chrome, and Opera, and in a number of other programs. Most notably, Wikipedia encourages the use of Opus. Google uses Opus in its video format WebM+VP9. The format thus falls under the open browser formats. It is established to some degree, but it remains much less ubiquitous than MP3.

MIDI

MIDI is a file format for audio data. It is not an actual recording, but a sequence of instructions. You can imagine it as a note sheet, together with the information which instrument plays which lines. When you open the file, the computer will play the lines like an orchestra would. Thus, MIDI is a vector format in our sense. This makes MIDI files lossless and very small. The format is developed by the MIDI Manufacturers Association, which makes it an open standard for all practical purposes. MIDI files date back to the 1980's and they are very popular in the digital instrument community. There is software support in one way or the other on all major operating systems. Thus, the format can be considered reasonably established in our sense.

All of this said, MIDI cannot be used to record audio data. It can only be used with explicitly “vectorized” types of sounds. In particular, it cannot replace a recording of a human playing the piano, let alone of an orchestra playing a piece of music. This is because MIDI cannot express the variations in force, distance, perfection, and volume that characterize a piece of music played by humans.

Recommendation

For audio data, you have the following choices:

If you have audio files in any other file format, you'd better convert them into one of the above. One way to do that is with the free software FFmpeg. Once installed, open a terminal and type
  ffmpeg -i filename.old filename.new
Here, old is the old file extension (e.g., wma) and new is the new one (e.g., mp3). If you want MP3 with 256 kbit/s (as I suggest), add the option “-b:a 256k” before the output file name. In any case, you should keep the original files.
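If you have many files to convert, the FFmpeg call can be wrapped in a few lines of Python. This is a sketch under assumptions: the function names are made up for illustration, FFmpeg has to be installed and reachable on the PATH, and the output file is simply placed next to the input file with the extension mp3.

```python
import subprocess
from pathlib import Path


def mp3_command(source: str, bitrate: str = "256k") -> list[str]:
    """Build the FFmpeg call that converts `source` to an MP3 file next to it."""
    target = Path(source).with_suffix(".mp3")
    # Output options such as -b:a go between the input file and the output file.
    return ["ffmpeg", "-i", source, "-b:a", bitrate, str(target)]


def convert_to_mp3(source: str, bitrate: str = "256k") -> None:
    """Run the conversion (assumption: the ffmpeg executable is on the PATH)."""
    subprocess.run(mp3_command(source, bitrate), check=True)


# Show the command without running it
print(" ".join(mp3_command("song.wma")))
```

Calling convert_to_mp3 for every file in a folder converts a whole collection. The original files are left untouched, as long as they are not MP3 files themselves.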

Video

Container formats

There are two different choices to make when encoding video:
  1. The “codec”, i.e., the way in which the video or audio is encoded. An example of a popular video codec is AVC.
  2. The “container”, i.e., the actual file format that contains the codec.

A container can contain several codecs at the same time — for example one for the video data, one for the accompanying audio, and one for the subtitles. Popular container formats are:

Resolution

Just like images, videos have a resolution. In principle, the guidelines for image resolution apply to video as well. In practice, however, file size is the limiting factor. Common resolutions are 320x240 (for mobile devices), 1920x1080 (1080p Full HD), 4096x2160 (4K Digital cinema, iPhone), 7680x4320 (UHD, 8K, maximum on YouTube), and anything in between.

In addition to a spatial resolution, video also has a temporal resolution: the number of pictures (or “frames”) per second. Common values are between 24 (as used in cinema) and 30.
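To see why file size is the limiting factor, it helps to compute how much data an uncompressed video would produce. The sketch below multiplies the spatial resolution by the temporal resolution. It assumes 3 bytes per pixel (8 bits each for red, green, and blue), which is my simplification and not a property of any particular video format.

```python
def raw_megabytes_per_second(width: int, height: int, fps: float,
                             bytes_per_pixel: int = 3) -> float:
    """Uncompressed video data rate in MB/s (assumption: 3 bytes per pixel)."""
    return width * height * bytes_per_pixel * fps / 1_000_000


resolutions = {"320x240": (320, 240), "1920x1080": (1920, 1080), "4096x2160": (4096, 2160)}
for name, (width, height) in resolutions.items():
    rate = raw_megabytes_per_second(width, height, fps=30)
    print(f"{name:>10} at 30 frames/s: {rate:8.1f} MB/s uncompressed")
```

Full HD at 30 frames per second already amounts to about 187 MB per second before compression, which is why lossy codecs are the norm for video.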

MP4+AVC

One of the most popular video codecs nowadays is “MPEG-4 Part 10 (H.264)”. It is also known, equally bulky, as “H.264/MPEG-4 AVC”. This is a lossy encoding for video data. With a compression rate set to 0, it is also lossless, but this is less common because it consumes an extraordinary amount of space. AVC is a proprietary encoding, encumbered by patent litigations. However, the format is free to use for end-users. In any case, this discussion has had little impact on common users, and AVC is nowadays the de facto standard for movies. Thus, it can be considered quasi-open.

The codec typically lives in an MP4 container, and I will call that combination MP4+AVC. The accompanying audio is usually AAC, a lossy quasi-open audio format that was developed as a successor to MP3. This combination is one of the most established file formats for video data.

HEVC

HEVC is a lossy video codec that is developed with the goal to replace AVC. Most notably, its compression rate is higher. HEVC is not a free format: it uses a number of patents, and thus the use of HEVC requires the payment of royalties to their owners — although probably not by the end users. This cost has curbed the acceptance of the standard, most notably on the Web. No major browser supports the format. Nevertheless, Microsoft Windows and Apple's operating systems support HEVC out of the box.

The standard is thus proprietary, lossy, and established — but much less established than the ubiquitous AVC.

WebM+VP9

VP9 is a video codec pushed by Google. The format is lossy, but can also store lossless video if the compression rate is set to zero. It was designed as an open alternative to MP4+AVC. VP9 lives primarily in the WebM container, and thus we will refer to that file format as WebM+VP9. The accompanying audio is usually encoded in Opus. The WebM format is developed by the Alliance for Open Media, an organization in which Google plays a key part.

WebM+VP9 is relatively young, but it is supported by all major browsers (except Safari). It is most notably used by YouTube, and it is backed by Google and Wikipedia. That said, the successor to VP9, called AV1, is already in the making. Thus, the entire family of formats is far from being established. It falls under the open browser formats.

MOV

MOV is a lossy file format for videos, i.e., it defines both a container and a codec. It was originally developed by Apple for its Quicktime software. It was thus proprietary. In the meantime, MOV has become the basis for the MP4 container, which is the standard for movies nowadays.

Flash

Flash is a software suite by Adobe for the production of animations, browser games, rich Internet applications, desktop applications, mobile applications and mobile games. It comes with several file formats of its own. Flash is proprietary software. It can be played in all major browsers via a plug-in from Adobe. It used to be ubiquitous on the Web, and was thus established in our sense. However, recently the tide has turned against Flash: People criticise the dependence on a single vendor (Adobe), a number of security flaws of Flash, as well as the possibility of tracking users with the help of so-called Flash cookies.

For all of these reasons, the Web community (and Google in particular) has been pushing for alternative file formats. Hence, Flash is nowadays on its way out.

MPG

MPG (or MPEG) is a lossy video file format. It defines both a container and a codec. The older variant of MPG is known as MPEG-1, and the newer one as MPEG-2. Due to its age, all known patents have expired, and the format is nowadays open. Today, MPG is the most widely compatible lossy audio/video format in the world. However, it cannot be played on a Mac without additional software. Thus, it is reasonably well established, but not fully so.

In any case, MPG has since been superseded by MP4+AVC, which offers higher video quality. Hence, preference should generally be given to MP4+AVC.

Recommendation

For video data, there are no popular lossless formats for common users — mainly because the file size is prohibitively large for today's standards. Thus, you basically have the choice between the following options:

If you have video files in any other file format, you'd better convert them into one of the above. One way to do that is with the free software VLC player. In File->Convert, you can convert any video file to any other video file. Click “customize” to choose the correct Video codec, Audio codec, and container format. If you want to automate the process, you can use FFmpeg. Once installed, open a terminal. To convert to MP4+AVC, type
  ffmpeg -i filename.old -crf 10 filename.mp4
Here, old is the old file extension (e.g., avi). The option “-crf 10” enforces a low compression rate. It has to come after the input file, because FFmpeg applies options to the next file named on the command line. To convert to WebM, use
  ffmpeg -i filename.old -c:v libvpx-vp9 -crf 10 -b:v 0 -c:a libopus filename.webm
The option “-c:v libvpx-vp9” tells the converter to use VP9 (instead of the older VP8). The option “-crf 10” enforces a low compression rate. The option “-b:v 0” has to be set together with the option “-crf”. In any case, you should keep the original files.
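To automate the conversion for many files, the two FFmpeg calls can be generated from Python. This is a sketch with hypothetical function names. It assumes that FFmpeg is installed with the encoders libx264, libvpx-vp9, and libopus. Selecting libx264 explicitly is my addition; the text above leaves the choice of the AVC encoder to FFmpeg's default.

```python
import subprocess
from pathlib import Path


def mp4_avc_command(source: str, crf: int = 10) -> list[str]:
    """FFmpeg call for MP4+AVC; output options follow the input file."""
    target = Path(source).with_suffix(".mp4")
    # Assumption: libx264 is named explicitly instead of relying on the default.
    return ["ffmpeg", "-i", source, "-c:v", "libx264", "-crf", str(crf), str(target)]


def webm_vp9_command(source: str, crf: int = 10) -> list[str]:
    """FFmpeg call for WebM+VP9 with Opus audio; -b:v 0 accompanies -crf."""
    target = Path(source).with_suffix(".webm")
    return ["ffmpeg", "-i", source, "-c:v", "libvpx-vp9", "-crf", str(crf),
            "-b:v", "0", "-c:a", "libopus", str(target)]


def convert(command: list[str]) -> None:
    """Run one of the commands above (assumption: ffmpeg is on the PATH)."""
    subprocess.run(command, check=True)


# Show the commands without running them
print(" ".join(mp4_avc_command("holiday.avi")))
print(" ".join(webm_vp9_command("holiday.avi")))
```

Passing the arguments as a list avoids problems with spaces in file names, which are common in personal video collections.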

Images

SVG

SVG is a file format for images. It is a vector format and thus lossless. Like HTML, it is an open format, developed by the World Wide Web Consortium. It has been around since the early 2000s, and it can be displayed by nearly all browsers. The format can thus be considered sufficiently established. It has superseded older vector formats such as CGM.

SVG is generally the way to go if you have images in vector form.

PNG

PNG is a lossless image file format. It is supported by all major browsers, and can be displayed and edited on all major operating systems. It is the most widely used lossless image compression format on the Internet. It is thus a very established format. Furthermore, it is an open file format, developed by the PNG Working Group. This distinguishes it from the proprietary (and more space consuming) BMP and GIF formats, as well as from the vendor-dependent RAW format.

PNG is thus the format of choice for non-vectorized lossless images.

TIFF

TIFF is a container format for images. It can contain lossy and lossless image encodings. Most often, however, it is used as a lossless image file format. It is widely used by graphic artists, in the publishing industry, and by photographers. It is supported by a wide range of software, and is thus very established. The format was developed by Adobe, and it is thus a proprietary file format. Adobe holds the copyright on the TIFF specification. However, there are no known intellectual property litigations. Also, the format stems from the 1980s. Thus the format can be considered quasi-open.

TIFF has a number of features that PNG does not support. That makes TIFF more powerful, but also more difficult to fully implement.

Image Resolution

PNG and TIFF are lossless image formats. Still, since they are not vector formats, they can mirror reality only up to a certain resolution. The resolution that you would want depends on how you want to use your image: A photo that you hold in your hand can have a smaller resolution than a poster on your wall.

There is a lot to discuss here, trading off resolution with file size in different use cases. However, to cut all of this short, here is a simple rule of thumb: If your image has 6000 pixels from top to bottom, you're completely safe.

Let's see why I'm saying this. The underlying assumption is that you are always at least as far from the picture as the picture is high. Consider a sheet of paper. It's 30cm high, and you generally do not hold it closer to your nose than 30cm. Consider a smart phone: It's 10cm high, and you generally do not hold it closer to your nose than 10cm. Consider a poster. It's 1m high, and you generally stand 1m away from it when you look at it. Consider an advertising board. It can be 3m high if it's on the wall of a high-rise building — but then you generally stand at least 3m away from it. Now if the picture has a height of x, and if your distance to the picture is at least x, then the picture spans a vertical angle of your field of view of arctan((x/2)/x)*2=53 degrees. Now each of your eye cells covers an angle of 31.5 arc seconds. This means that the eye can distinguish 6057 pixels top-to-bottom in your image. This holds independently of the scaling: As long as you're standing at least as far away from the picture as the print-out is high, you cannot distinguish more than 6000 pixels.
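The calculation can be retraced in a few lines. The sketch below uses the 31.5 arc seconds per eye cell from the text. The function name is made up for illustration. Without rounding the angle to 53 degrees, the result is about 6072 instead of 6057, which does not change the rule of thumb.

```python
import math


def distinguishable_pixels(height: float, distance: float,
                           arcsec_per_cell: float = 31.5) -> float:
    """Number of pixels the eye can tell apart from top to bottom of a picture."""
    # Vertical angle that the picture spans in the field of view, in degrees
    angle_deg = math.degrees(2 * math.atan((height / 2) / distance))
    # One degree has 3600 arc seconds
    return angle_deg * 3600 / arcsec_per_cell


for height in (0.1, 0.3, 1.0, 3.0):   # phone, sheet of paper, poster, billboard (in m)
    pixels = distinguishable_pixels(height, distance=height)
    print(f"height {height} m, distance {height} m: {pixels:.0f} pixels")
```

All four cases give the same number, because only the ratio of height to distance enters the formula.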

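[Figure: Formulas for the required resolutions]
In practice, you are usually even farther away from the picture than the height of the picture. Then you only need a proportion of the 6000 pixels. The number of pixels scales linearly with the ratio of picture height to viewing distance: vertical pixels = 6000 × height / distance. The required Dots per Inch (DPI) follow as 6000 / distance × 2.54 (with the distance in cm), and the mega pixel count as (6 × height / distance)² × 1.5 (assuming a picture that is 1.5 times as wide as it is high). Examples: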
  1. If you have a sheet of paper that's 30cm high and you hold it at 30cm from your nose, you need a scanning resolution of at most 6000/30cm×2.54cm ≈ 500 DPI.
  2. If you want a poster that's 100cm high, but expect people to hold their nose at 50cm from it, you need a print resolution of 300 DPI, and 12000 vertical pixels.
  3. If you hold a picture of 10cm height 30cm away from your nose, you need (1/3×6)²×1.5 = 6 Mega Pixels.
  4. If you want a Retina display that is 20cm high, with your nose 40cm away, you need 3000 pixels vertically (the MacBook Pro has 2500).
Higher resolutions are reasonable only if you plan to zoom into parts of the picture, or if you want to transform the picture in some way.
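The examples above can be reproduced with three small functions. This is a sketch with hypothetical function names. It assumes the linear scaling described above, lengths in cm, and, for the mega pixel count, a picture that is 1.5 times as wide as it is high (as the factor 1.5 in the third example suggests).

```python
CM_PER_INCH = 2.54
SAFE_PIXELS = 6000   # rule of thumb: enough when the distance equals the picture height


def vertical_pixels(height_cm: float, distance_cm: float) -> float:
    """Required pixels from top to bottom, scaled linearly with height/distance."""
    return SAFE_PIXELS * height_cm / distance_cm


def dots_per_inch(distance_cm: float) -> float:
    """Required scan or print resolution for a given viewing distance."""
    return SAFE_PIXELS / distance_cm * CM_PER_INCH


def megapixels(height_cm: float, distance_cm: float, aspect: float = 1.5) -> float:
    """Required mega pixels (assumption: width = aspect x height)."""
    thousands = vertical_pixels(height_cm, distance_cm) / 1000
    return thousands ** 2 * aspect


print(dots_per_inch(30))            # sheet of paper held at 30 cm
print(vertical_pixels(100, 50))     # poster of 100 cm seen from 50 cm
print(megapixels(10, 30))           # picture of 10 cm seen from 30 cm
print(vertical_pixels(20, 40))      # display of 20 cm seen from 40 cm
```

The exact values for the first two examples are 508 and 304.8 DPI; the text rounds them to 500 and 300.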

JPG

PNG and TIFF are lossless image formats. They are more truthful to reality, but consume more disk space. The most popular lossy image format is JPG (also: JPEG). It is the most common format for photographic images on the Web. The format has been around since the 1990s, it is supported on all major operating systems, it can be read by all major browsers, and it is the default format for digital cameras. It is thus one of the most established formats of all. This distinguishes it from Google's WebP format, which has not yet proven to be better than JPG.

There are a number of patent issues surrounding the JPEG format, but these are irrelevant for all practical use cases by common users. The format is standardized by the Joint Photographic Experts Group. It is thus an open standard.

Being a lossy image format, JPG allows the choice of a compression ratio. The higher the compression, the smaller the file, and the less truthful the image. At very high compression ratios, artifacts start popping up in the image. The lowest compression ratio is thus the safest choice for archiving purposes. Some programs allow the user to choose the “image quality”, which is simply the inverse of the compression ratio (highest image quality = lowest compression ratio).

RAW

Most digital cameras produce JPG images nowadays. JPG is a lossy file format. Some cameras allow getting hold of the original, lossless version of the picture. The file format is called RAW — even though it is not a single file format. Rather, each camera vendor has their own proprietary file format for RAW images.

Thus, while RAW images are lossless, they are also not good for everyday use. They have to be converted to an established format — usually JPG, TIFF, or PNG.

Recommendation

For image files, you have the following options:

If you have images in any other format, it makes sense to convert them to one of the above. One way to do that is to open the file (by double-clicking it), and then to choose “save as” or “export”. Then choose a target file format. In any case, you should keep the original files.

Compression formats

Compression

A compression file format is a file format that allows making a file consume less disk space (without losing information). An archiving file format is a file format that allows combining several files into a single file. Many file formats combine both actions, and so the process itself is known as “archiving”, “compression”, or “packing”.

There are numerous file formats for archiving, for compression, and for both. They differ in many features, and in particular in their compression ratio. Since the difference in compression ratio depends on the content you compress, we concentrate on the other differences here.
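The dependence on the content is easy to demonstrate. The sketch below compresses a highly repetitive text and the same amount of random bytes with Python's zlib module, which implements the DEFLATE method that is also commonly used inside ZIP and GZIP files. The function name and the sample data are chosen for illustration only.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


repetitive = b"to be or not to be, " * 5000
random_bytes = os.urandom(len(repetitive))

repetitive_ratio = compression_ratio(repetitive)
random_ratio = compression_ratio(random_bytes)
print(f"repetitive text: {repetitive_ratio:.3f}")
print(f"random bytes:    {random_ratio:.3f}")

# Compression is lossless: decompressing gives back the exact original.
round_trip_ok = zlib.decompress(zlib.compress(repetitive)) == repetitive
print(round_trip_ok)
```

The repetitive text shrinks to a tiny fraction of its size, whereas the random bytes do not shrink at all. Files that are already compressed, such as MP3 or JPG, behave much like the random bytes.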

ZIP

One of the most common archiving and compression formats is ZIP. ZIP was first published in 1989, and it is supported natively by all major operating systems. It is thus one of the most established file formats of all. It is so prevalent that “to zip” has come to mean “to compress with ZIP”.

Technically, ZIP is a proprietary file format, because it is developed by PKWARE. However, there are no known license issues, and the format is so ubiquitous that it can be considered quasi-open.

BZIP2

BZIP2 is an open file compression format. It was designed to be more space-efficient than ZIP — much like a plethora of other compression formats. BZIP2 can store only one file per archive. It also requires the installation of extra software on non-Linux machines. Thus, it falls behind ZIP in maturity.

TAR+GZIP

TAR is a pure archiving format. It stores several files in a single file without compressing them. The archive file is then often compressed using GZIP. This yields files with the suffix “.tar.gz”. Both TAR and GZIP are open file formats and extremely popular on Linux systems. The GZIP format is also used in HTTP compression on the Internet.

That said, the format is a bit cumbersome to use. On Linux systems, the magic formula to uncompress a TAR+GZIP file is
  tar -xzf file
On non-Linux systems, additional software is required. Thus, TAR+GZIP is not established in our sense.
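Where the tar command is not available, Python's standard library can pack and unpack TAR+GZIP files as well. The following sketch uses the tarfile module; the function names and the sample folder are made up for illustration. It should only be used on archives from a trusted source, since unpacking writes whatever paths the archive contains.

```python
import tarfile
import tempfile
from pathlib import Path


def pack(folder: str, archive: str) -> None:
    """Store a folder in a gzip-compressed TAR file."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(folder, arcname=Path(folder).name)


def unpack(archive: str, destination: str) -> None:
    """Counterpart of `tar -xzf`: extract everything into `destination`."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination)


# Round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "letters"
    source.mkdir()
    (source / "hello.txt").write_text("Hello, future!")

    archive = Path(tmp) / "letters.tar.gz"
    pack(str(source), str(archive))

    restored = Path(tmp) / "restored"
    restored.mkdir()
    unpack(str(archive), str(restored))
    restored_text = (restored / "letters" / "hello.txt").read_text()

print(restored_text)
```

Recent Python versions may print a warning that recommends passing an extraction filter to extractall; for archives that you created yourself, the plain call shown here is sufficient.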

RAR

RAR is a proprietary file format for archiving and compression. The format is widely used and supported, and can thus be considered established.

At the same time, the format is not open. RAR files can be created only with the commercial software WinRAR, RAR, and other software that has written permission from the creator of RAR. Thus, RAR falls behind ZIP for archiving purposes.

7Z

7Z is an open file format for compression and archiving. It is not as ubiquitous and well-supported as ZIP, and thus falls behind ZIP in terms of maturity. One of the advantages of 7Z is that it allows encryption with the AES-256 standard. AES is the most widely used symmetric encryption method nowadays, and AES-256 is the state-of-the-art variant of it.

Recommendation

ZIP is established and quasi-open. All other formats are either less established or less open. Thus, there is generally no reason to deviate from ZIP. The only interesting alternative in my view is 7Z. It is less established, but truly open, and it supports AES-256 encryption.

If you have archive files in any other format that you care about, it may make sense to convert them to ZIP (or 7Z): Unpack them, and then re-pack them to ZIP. You do not need to keep the originals.
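For TAR+GZIP archives, the unpacking and re-packing can be done in one step with Python's standard library. This is a minimal sketch with a hypothetical function name. It copies regular files only, and it does not preserve time stamps or file permissions, which you may or may not care about.

```python
import io
import os
import tarfile
import tempfile
import zipfile


def repack_to_zip(tar_gz_path: str, zip_path: str) -> int:
    """Copy every regular file from a .tar.gz archive into a new ZIP file."""
    count = 0
    with tarfile.open(tar_gz_path, "r:gz") as tar, \
            zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for member in tar:
            if not member.isfile():
                continue                      # skip folders, links, etc.
            data = tar.extractfile(member).read()
            zf.writestr(member.name, data)    # note: the time stamp is not carried over
            count += 1
    return count


# Round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "old.tar.gz")
    new = os.path.join(tmp, "new.zip")
    payload = b"Dear diary"
    with tarfile.open(old, "w:gz") as tar:
        info = tarfile.TarInfo("diary.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    copied = repack_to_zip(old, new)
    with zipfile.ZipFile(new) as zf:
        names = zf.namelist()
        restored = zf.read("diary.txt")

print(copied, names, restored)
```

Before deleting an original archive, it is worth checking that the new ZIP file opens and contains the expected number of files.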