Long term data archival

2023-05-25

Tags: #computers #backup #archival #storage #bitrot

             ___________________
            |,--------.         |
            || backup |         |
            |`--------'         [
            |        .-.        |
            |       |   |       |
            |        `-' o      |
            |        .-.        |
            |        : :        |
            |        :_;        |
            |_______.___._______|

A few words about my research on making data archives that will stand the test of time.

Goal: Archive data for long-term storage.

Requirements:

- Durable storage.

- Resiliency to bit rot.

- No need for special rooms or conditions to store media.

- Easy retrieval; no need to wait hours to restore data.

- Encryption.

- Indexed archives for easy reference.

Selecting storage media

Common media choices are:

💿 Optical storage (with one notable exception) is very unreliable and usually develops read errors within a few years. The exception is M-DISC, which we'll get to in a bit. There's also the Syylex Glass Master Disc, but it's ridiculously expensive ($1000 per disc).

🖴 Flash storage and solid-state drives need to be powered up occasionally to "refresh" their bits, or they lose data. Plus, cheap consumer-grade SSDs are notoriously unreliable. Avoid. HDDs are not reliable either (they are susceptible to sudden bad sectors) and, if you're unlucky, may not even spin up after lying dormant for a few years. Avoid as well.

📼 Tapes (LTO) are slow to retrieve data from, and need expensive equipment and special storage conditions (low humidity, climate control, etc.) to be reliable for long-term storage.

💾 Floppies. Gotta love them for nostalgia, but no.

Avoid other obscure media. Chances are, the hardware you'll need to read them will be obsolete and very difficult to find in a few decades' time.

So, what to choose?

M-DISC (unless you have huge datasets, in which case tape is the only realistic option). From Wikipedia:

M-DISC's design is intended to provide archival media longevity. M-Disc claims that properly stored M-DISC DVD recordings will last up to 1000 years. The patents protecting the M-DISC technology assert that the data layer is a glassy carbon material that is substantially inert to oxidation and has a melting point of 200–1,000 °C (392–1,832 °F). M-Discs are readable by most regular DVD players made after 2005 and Blu-Ray & BDXL disc drives and writable by most made after 2011.

Accelerated aging tests suggest that M-DISCs are more durable than even the best-quality alternatives, but whether they'll last 50 or 500 years remains to be seen. Other advantages:

- No need for specific equipment to read. DVD and BluRay drives will probably be here for a long time.

- No need for a special storage environment: stash the disc in a drawer and forget about it.

- No need to purchase special equipment to write. A good quality writer is recommended nevertheless; I got a Toshiba USB3 M-DISC writer at around $200 a few years ago.

M-DISC comes in both DVD and Blu-ray formats. I chose the latter at 25GB capacity, which is decent. If you have huge storage requirements, you should reconsider LTO instead.

Backup procedure

Steps:

1. Encryption: Create a VeraCrypt volume and put your data in it.

2. Recovering from corruption: Generate recovery data alongside the volume file so that it can be repaired if it gets corrupted.

3. Indexing: Make sure you know where's what.

4. Persistence: Burn the final files to the disc.

1. Encryption

I use VeraCrypt. It's easy to use, uses solid crypto, and is cross-platform: it runs on Windows, macOS, Linux and OpenBSD (which is what I use).

To create a new VeraCrypt volume:

# veracrypt --text --create enc.vc --volume-type=normal \
  --size=<file_size_in_bytes> --filesystem=fat --encryption=aes \
  --hash=SHA-512 --random-source=/dev/urandom --keyfiles='' --pim='0'

To mount it:

# veracrypt --pim='0' --keyfiles='' ./enc.vc /mnt/enc

To unmount it:

# veracrypt --dismount /mnt/enc
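
The `--size` value passed when creating the volume has to leave room on the disc for the Par2 recovery data and the index. Here is a rough budget sketch. The disc capacity and the 2% safety margin are my assumptions, not measured values, so check your own media before relying on them.

```shell
#!/bin/sh
# Assumption: nominal capacity of a single-layer 25GB BD-R, in bytes.
disc_bytes=25025314816
# Assumption: keep 2% free for the disc filesystem, the index and Par2 headers.
margin_pct=2
# Par2 redundancy used in the next step (5% of the volume size).
par2_pct=5

# Space we allow ourselves to fill.
usable=$(( disc_bytes * (100 - margin_pct) / 100 ))
# Largest volume such that volume + par2_pct% of volume still fits.
volume=$(( usable * 100 / (100 + par2_pct) ))
# Round down to a whole MiB to get a tidy number.
volume=$(( volume / 1048576 * 1048576 ))
# Approximate size of the recovery data for that volume.
recovery=$(( volume * par2_pct / 100 ))

echo "usable bytes:   $usable"
echo "volume --size:  $volume"
echo "par2 recovery:  $recovery"
```

The printed volume size is what would go into `--size=<file_size_in_bytes>`. Raise the margin if your burning software reports less free space than expected.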

2. Recovering from corruption

To ensure we can recover our data in case of errors, we'll use Parchive (Par2).

Once the volume is populated and dismounted, create a Par2 archive with 5% redundancy and a single recovery file:

# par2 create -r5 -n1 -a enc.vc.par2 enc.vc

To verify the volume against the Par2 archive:

# par2 verify enc.vc.par2

In case of errors, repair:

# par2 repair enc.vc.par2
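
If you want to see a repair actually work before trusting it with real data, you can rehearse on a throwaway file. The sketch below assumes par2cmdline is installed as `par2`. The scratch file names, the 1 MiB size and the 100 damaged bytes are arbitrary choices of mine, and nothing here touches enc.vc.

```shell
#!/bin/sh
# Rehearse a Par2 repair on a scratch file in a temporary directory.
status=skipped
if command -v par2 >/dev/null 2>&1; then
    work=$(mktemp -d)
    (
        cd "$work" || exit 1
        # 1 MiB of random data stands in for the real volume.
        dd if=/dev/urandom of=scratch.bin bs=1024 count=1024 2>/dev/null
        cp scratch.bin scratch.orig
        # Same options as for the real volume: 5% redundancy, one recovery file.
        par2 create -q -r5 -n1 -a scratch.bin.par2 scratch.bin >/dev/null
        # Simulate bit rot: overwrite 100 bytes in the middle of the file.
        dd if=/dev/zero of=scratch.bin bs=1 seek=500000 count=100 \
            conv=notrunc 2>/dev/null
        # Verification is expected to fail now; give up if it still passes.
        par2 verify -q scratch.bin.par2 >/dev/null && exit 1
        # Repair, then compare with the pristine copy byte for byte.
        par2 repair -q scratch.bin.par2 >/dev/null || exit 1
        cmp -s scratch.bin scratch.orig
    ) && status=ok || status=failed
    rm -rf "$work"
fi
echo "Par2 rehearsal: $status"
```

Damage larger than the recovery data (here, more than roughly 5% of the file) cannot be repaired, so increase the redundancy if you expect worse.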

3. Indexing

To create an encrypted list of files included in the backup:

# find . | gpg --armor --cipher-algo AES-256 --symmetric >./files.txt.asc

To see all files included in the backup:

# gpg -d ./files.txt.asc
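
A sorted listing with paths relative to the volume root is easier to search, and to diff between discs, than raw `find` output. Here is a small sketch. `make_index` is a hypothetical helper name of mine, and /mnt/enc in the usage comment assumes the mount point used earlier.

```shell
#!/bin/sh
# make_index DIR: print every regular file under DIR, relative to DIR, sorted.
# (Hypothetical helper, not part of any tool mentioned in this post.)
make_index() {
    # Run in a subshell so the caller's working directory is left alone.
    # LC_ALL=C keeps the sort order identical across machines.
    ( cd "$1" && find . -type f | LC_ALL=C sort )
}

# Usage (not run here; gpg will prompt for a passphrase):
#   make_index /mnt/enc | gpg --armor --cipher-algo AES-256 --symmetric >./files.txt.asc
```

Because the function changes directory only inside a subshell, the encrypted index ends up next to enc.vc rather than inside the mounted volume.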

4. Persistence: Burning files to disc

After following the steps above, you'll have a set of four files: the volume, the two Par2 files and the encrypted index. Burn them to the disc using Brasero or your favourite optical disc burning software. The first time you do this, before you stash the disc away, I suggest you run through the procedure in reverse to make sure you can decrypt and restore the files correctly.
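
One way to start that check is to confirm that the burned files are byte-identical to the ones you staged. Here is a minimal sketch. `verify_copy` is a hypothetical helper of mine, and both directory paths in the usage comment are placeholders for wherever you staged the files and mounted the disc.

```shell
#!/bin/sh
# verify_copy SRC DISC: byte-compare each regular file in SRC with the file
# of the same name in DISC. Returns non-zero if any file differs or is missing.
# (Hypothetical helper, not part of any tool mentioned in this post.)
verify_copy() {
    src=$1
    disc=$2
    rc=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        if cmp -s "$f" "$disc/$name"; then
            echo "OK       $name"
        else
            echo "MISMATCH $name"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (placeholder paths):
#   verify_copy "$HOME/staging" /mnt/disc
```

If everything matches, continue with `par2 verify` on the disc copy, then mount the volume and open a few files.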

=> http://archive.org/details/lne-syylex-glass-dvd-accelerated-aging-report

=> https://en.wikipedia.org/wiki/Parchive

=> https://github.com/Parchive/par2cmdline

=> https://www.veracrypt.fr/en/Home.html