Building An Open-source Personal Backup System

If you’re reading this, you probably have data that you can’t afford to lose.

I procrastinated on implementing automated backups for years, until my laptop started failing a few months ago. Fortunately, I didn’t lose any data, but it was the wake-up call I needed.

Here’s how I did it.

Requirements

A good backup system should have the following:

  1. Cross-platform support: I needed to back up files on my Windows laptop as well as a Synology DS918+ NAS. This means I needed something that would work on both Windows and Linux.

  2. Automated: backups are only effective if they happen automatically in the background.

  3. 3-2-1 Rule: backups should follow the 3-2-1 rule.

    • 3 copies of data: a primary copy and 2 backups
    • Copies should be stored on at least 2 different devices
    • 1 copy of the data is off-site (on the cloud or in a different physical location)
  4. Ransomware-proof: ransomware is becoming increasingly common. This means that backup copies should not be mounted onto the filesystem, as that leaves them vulnerable to being encrypted as well.

  5. Cloud storage support: backing up to the cloud is an easy way of having an off-site copy of data.

  6. Encryption: encryption protects the confidentiality of your backups (and, with authenticated encryption, their integrity). This is especially important when backing up to the cloud.

  7. Data deduplication: avoid storing multiple copies of the same data. This works by splitting data into smaller blocks, so that if two files are mostly similar, only the blocks that differ between them need to be stored. A nice side effect is efficient incremental backups, since only the parts of a file that changed since the last backup need to be stored (a rough illustration follows this list).
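To make this concrete, here is a rough sketch of block-level deduplication using fixed-size 4 MB blocks and throwaway file names; real backup tools use content-defined chunking rather than fixed offsets, but the principle is the same:

# Split two mostly-identical files into fixed-size blocks and hash each block.
# large-file.bin is a placeholder for any reasonably large file.
cp large-file.bin copy.bin
echo "small change" >> copy.bin        # append a small change to the copy
split -b 4M large-file.bin blocks-a-
split -b 4M copy.bin blocks-b-
# Count how often each block hash appears; a count of 2 means the block is
# shared between both files and would only need to be stored once.
sha256sum blocks-a-* blocks-b-* | sort | uniq -c -w 64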

What Backups Are Not

Backups are often conflated with cloud synchronization services. Services like Dropbox and Google Drive market themselves as backup solutions, but they are designed to keep files synced across multiple devices. This also means that accidental deletions, or files overwritten by ransomware, get propagated to all your other devices. Although some cloud synchronization services offer a form of versioning, retention of old versions is usually limited to a short period such as 30 days.

RAID often gets mistaken for backups. RAID protects against hard drive failure. However, it doesn’t do anything to prevent bit rot, file corruption, accidental deletions, ransomware or a power surge frying the RAID array.

Backups store multiple versions of files so you can go back in time. This allows restoration of individual files to undo mistakes or entire folders to recover from ransomware.

Components

I used the following for automated backups:

  1. A NAS that acts as a backup server running on the local network.

    I got a Synology DS918+ for this purpose because it supports RAID and uses the Btrfs filesystem, which protects against bit rot.

  2. Backblaze B2 for cloud storage¹. B2 offers an API which is supported by many backup tools.

    You can technically use any cloud storage service for this. I chose B2 for its simplicity, its track record (they’re storing over an exabyte of data) and its affordability. B2’s pricing is competitive: storage costs US$0.005 / GB / month (500 GB works out to US$2.50 / month), downloads cost US$0.01 / GB with the first GB of the day free, and uploads are free.

  3. Kopia for cross-platform, open-source backups. Kopia supports B2 (among numerous other cloud storage backends), compression, encryption and data deduplication.

    Notably, Kopia supports lock-free deduplication: it does not need to hold a global lock on the backup destination to prevent concurrent backups from other clients, so multiple devices can write to the same destination at the same time. As far as I’m aware, Duplicacy is the only other free tool that supports this feature (a generic two-client sketch follows this list).

    Kopia has an optional GUI, but I’ll be using the command line.
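As a rough sketch of what lock-free means in practice (the hostname, key file and paths are the same placeholders used later in this post), two machines pointed at the same repository can snapshot concurrently without coordinating with each other:

# On machine A
kopia repository connect sftp --host=sftp.mydomain.com --keyfile=/path/to/ssh/keyfile --username some_user --path=backups
kopia snapshot create ~/Documents

# On machine B, at the same time - no global lock on the repository is needed
kopia repository connect sftp --host=sftp.mydomain.com --keyfile=/path/to/ssh/keyfile --username some_user --path=backups
kopia snapshot create ~/Projects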

Why The NAS?

Clients will be backing up to the NAS via Kopia. The NAS also runs Kopia to back up its contents to B2 for an extra layer of redundancy.

You may be wondering if the NAS is redundant, given the use of cloud storage for an off-site backup. Why not have clients back up to B2 directly instead?

Retrieving data off the cloud can be expensive due to download fees. Hence, having a server for primary backups running on the local network avoids these download fees. Data transfers on the local network are also much faster.

More importantly, this means that only the server needs to have the credentials needed to access backups on B2. If clients are infected by ransomware, the backups on B2 are safe as clients don’t have a direct connection to B2.

Step 1: Setting Up SFTP

The most convenient way for an instance of Kopia running on a client to send backups to the server is via SFTP. Hence, the NAS needs to expose an SFTP server for Kopia to connect to.

I used SFTPGo as my SFTP server because it is open-source, looks well-maintained and supports deployment via Docker.

See this blog post for more details on setting up SFTPGo with Docker, especially if you’re also running it on a Synology NAS.
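As a minimal sketch (the image is the official drakkan/sftpgo one; the host paths and port mappings are assumptions you should adapt to your NAS), running SFTPGo under Docker looks something like this:

# SFTPGo container: port 2022 for SFTP, port 8080 for the web admin UI.
# The host-side volume paths below are placeholders for persistent folders on the NAS.
docker run -d --name sftpgo \
  -p 2022:2022 \
  -p 8080:8080 \
  -v /volume1/docker/sftpgo/data:/srv/sftpgo \
  -v /volume1/docker/sftpgo/config:/var/lib/sftpgo \
  drakkan/sftpgo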

Step 2: Automating Client Backups to The Server

Once the server exposes an SFTP connection, create a user account on SFTPGo for each client being backed up.

To configure Kopia on client machines:

  1. Create an SFTP repository on the server to store backups:

    kopia repository create sftp --host=sftp.mydomain.com --keyfile=/path/to/ssh/keyfile --username some_user --path=backups

    You will be prompted for a mandatory password that will be used to encrypt backups.

  2. Once the repository has been created, tell Kopia to use it. This is done with the repository connect command - replace create above with connect. For example:

    kopia repository connect sftp --host=sftp.mydomain.com --keyfile=/path/to/ssh/keyfile --username some_user --path=backups
  3. Once Kopia is connected to the SFTP repository, use the snapshot create command to back up data (snapshots are like git commits). Multiple paths may be specified as a convenience to back up multiple folders:

    kopia snapshot create ~/Projects ~/Documents ~/media

    This may take a while the first time depending on how much data you have. Subsequent snapshots will be much faster, as only the changes made since the last snapshot need to be uploaded.

    Kopia supports policies to configure things like exclusions, retention and compression (see the examples after this list).

    You’ll want to put these backup commands in a script and schedule it to run periodically; more on that later.
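For reference, here are a few example policy commands; the paths and retention numbers are illustrative, so adjust them to your own needs:

# Enable zstd compression for a directory
kopia policy set ~/Projects --compression=zstd

# Exclude build artefacts and temporary files
kopia policy set ~/Projects --add-ignore node_modules --add-ignore '*.tmp'

# Keep the last 10 snapshots plus hourly/daily/weekly/monthly history (global default)
kopia policy set --global --keep-latest 10 --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --keep-monthly 12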

Step 3: Offsite Backups to B2

Once clients are sending backups to the server, the final step is to upload those backups to B2 for an offsite copy.

  1. Create a B2 account. Go to settings and take note of the application key ID displayed on that page.

  2. Install the B2 command line tool.

  3. Link the B2 CLI tool to your account: b2 authorize_account <key id from step 1>

  4. Create a bucket to store files: this can be done via the command line tool or the B2 web-based GUI.

  5. Create an application key for Kopia to connect to your B2 account. Use the CLI tool to do this, as it allows specifying more granular permissions than the GUI does:

    b2 create-key --bucket my_bucket kopia listBuckets,listFiles,readFiles,writeFiles,deleteFiles

  6. Install Kopia on the server.

  7. On the server, have Kopia connect directly to the local folder holding the repository that clients have been backing up into over SFTP:

    kopia repository connect filesystem --path '/path/of/backups'
  8. With the application key from step 5, use the following command to create a mirror of the repository to B2:

    kopia repository sync-to b2 --bucket=your-bucket-name --key-id=key-id --key=some-key --parallel 8

    Once you’ve tested that this works, set up a cronjob to automate this.
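For example, a crontab entry along these lines (the schedule and script path are placeholders) runs the mirror every night:

# Mirror the repository to B2 at 03:00 every day
0 3 * * * /volume1/scripts/kopia-b2-sync.sh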

Step 4: Restoring Data

It is always a good idea to restore some data to ensure that the backups work properly.

Use the snapshot list command to view the list of snapshots:

$ kopia snapshot list
Dickson@T480:d:\documents\projects
  2022-10-20 20:49:37 +08 ka1bfbdcb5396e50b66b3a172dca45986 221.2 MB dr-xr-xr-x files:360 dirs:18 (hourly-5..18,daily-2..3)
  + 13 identical snapshots until 2022-10-22 14:49:39 +08
  2022-10-22 15:49:41 +08 k29c536bbaf591c9f79319ecca9a7eef2 221.2 MB dr-xr-xr-x files:360 dirs:18 (latest-1..10,hourly-1..4,daily-1,weekly-1,monthly-1,annual-1)
  + 10 identical snapshots until 2022-10-22 18:19:39 +08

Individual snapshots can be mounted onto the filesystem to examine their contents or restore from them:

$ mkdir /tmp/mnt
$ kopia mount k29c536bbaf591c9f79319ecca9a7eef2 /tmp/mnt &
$ ls -l /tmp/mnt/
total 119992
<snip>
$ umount /tmp/mnt
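Alternatively, a snapshot can be restored directly to a folder without mounting it, using kopia snapshot restore (the snapshot ID is the one listed above; the target path is just an example):

$ kopia snapshot restore k29c536bbaf591c9f79319ecca9a7eef2 /tmp/restore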

Monitoring Automated Backups

If a backup or the mirror to B2 fails, I want to be notified. I use healthchecks.io for this purpose.

I also wanted cronjobs to skip a run if another instance was already in progress, and to save logs for troubleshooting.

Here’s a snippet of the backup script I run as a cronjob, with those features.

#!/bin/bash
# set -e

scriptname=$(basename "$0")
lock="/var/run/${scriptname}"
exec 200>"$lock"
flock -n 200 || exit 1

NOW=$(date "+%F %H-%M-%S")
LOG_PATH="/logs/backup/${NOW}.log"
failure=false

# Redirect a copy of all output to a log file.
# See https://stackoverflow.com/questions/3173131/redirect-copy-of-stdout-to-log-file-from-within-bash-script-itself
# Redirect stdout ( > ) into a named pipe ( >() ) running "tee"
exec > >(tee -i "${LOG_PATH}")
# Also redirect stderr into stdout so error messages end up in the log too
exec 2>&1

echo "Starting backup at $NOW"

# ping healthchecks.io when the job starts
# this initiates a timer which can be used to keep track of how long this script took to run
curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/.../start

# back up contents
if ! kopia snapshot create --parallel 8 ~/media ~/projects; then
  echo "Failed to create snapshots"
  failure=true
fi

if [ "$failure" = true ]
then
  echo "Errors encountered during backup, completed on $(date "+%F %H-%M-%S")"
  # ping healthchecks.io to signal failure
  curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/.../fail
  exit 1
else
  echo "Backup completed successfully on $(date "+%F %H-%M-%S")"
  # ping healthchecks.io to signal success
  curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/...
  exit 0
fi
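Finally, schedule the script itself; on Linux this is just another crontab entry (the path and schedule below are placeholders), while Windows clients can use Task Scheduler to run their backup script instead:

# Snapshot every hour; the flock call in the script prevents overlapping runs
0 * * * * /volume1/scripts/backup.sh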

Footnotes

  1. Backblaze has an Unlimited Backup offering for consumer use, but it doesn’t support Linux.