Very Cheap Backups

Fri, Aug 21, 2020. Tags: Tech.

I used to not back up my data. I know, I know, I was just asking for trouble. When I came to my senses (fortunately before losing anything important) I started looking for the cheapest backup solution possible. Here’s what I came up with.1

The gold standard for backups is “3–2–1”: three copies of your data, two of which are on different local devices, and one of which is offsite. Don’t settle for less, because you don’t have to.

The first copy is easy – it’s the one you’ve already got on your hard drive. One down, two to go.

The second copy is local, but on a different device. The benefit of having a second local copy is that if your hard drive blows up, you can get your data back immediately, without having to download anything. It also means you don’t fully depend on your remote copy.

The conventional way to make a second copy is to use an external hard drive, and you can get a decent one of those for $50. But we’re going for a dramatically cheap backup solution here, so how about we consider a different option: USB drives.

We gotta put that USB drive somewhere though, so let’s buy a Raspberry Pi. It’ll be our cheap little backup server. If we’re just backing up one computer, we could plug the USB straight into it, but this way we can back up multiple machines using the same setup. A Pi and a USB drive together will cost you a good $35, but you’ve probably got a USB drive lying around already, and maybe you already have a Pi too. So, optimistically, we’re at $0 so far.

Start the Pi and mount your USB drive at /opt/usb1. Copy your computer’s public ssh key into /home/pi/.ssh/authorized_keys on the Pi and make sure you can ssh into it from your local machine. Make a folder on the Pi at /opt/usb1/backups/, and run chown -R pi:pi /opt/usb1/ so that the pi user can write to it.
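
Concretely, that setup might look something like this (the device name /dev/sda1 and the Pi’s address 192.168.1.5 are just examples; substitute your own):

# On the Pi: mount the drive and let the pi user write to it
pi@localhost ~ $ sudo mkdir -p /opt/usb1
pi@localhost ~ $ sudo mount /dev/sda1 /opt/usb1
pi@localhost ~ $ sudo chown -R pi:pi /opt/usb1/
pi@localhost ~ $ mkdir /opt/usb1/backups

# On the computer you're backing up: set up passwordless ssh to the Pi
you@desktop ~ $ ssh-copy-id pi@192.168.1.5
you@desktop ~ $ ssh pi@192.168.1.5 'echo it works'

Here’s the script we’re going to use to back up our stuff.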

snapshot.zsh:

#!/usr/bin/zsh

set -euo pipefail

# Config
OPT=-aPLhz
SRC=(/home/yourname/Documents/
     # add any other directories you want to back up here
     /home/yourname/Pictures/)
DEST=/opt/usb1/backups/
REMOTEUSER=pi
REMOTEHOST=192.168.1.5  # supposing this is the Pi's IP address
LAST=last
DATE=$(date "+%Y-%b-%d_%T")

# Make sure passwordless ssh is set up with the remote host,
# and that $REMOTEUSER has permission to write to $DEST

# Run rsync to create snapshot
# (--link-dest is resolved on the Pi itself, so it takes a plain path, not a user@host: prefix)
rsync $OPT \
    --link-dest=${DEST}last \
    $SRC $REMOTEUSER@$REMOTEHOST:$DEST/$DATE

# Update the symlink to the latest backup
ssh -TT $REMOTEUSER@$REMOTEHOST <<EOF
ln -sfn $DATE $DEST/$LAST
EOF

Notice that this is a zsh script. I find zsh nicer to script in than bash. Just sudo apt install zsh and you’ve got it. I say apt — use whichever package manager comes with your system. I’ll use apt in my examples here.

Run ./snapshot.zsh on your computer and, if you’ve set everything up right, you should see all your data zap itself over to your Raspberry Pi. Nicely done.

Here’s how this script works: the first time you run it, it copies all your data into a folder named with the date and time, and creates a symbolic link to that folder called last. last always points to your most recent backup. Every subsequent time you run it, the script finds the difference between the current state of your files and your most recent backup, and uploads only the new data. If you change 10% of a file, the 90% that stays the same is copied over from the previous backup, instead of being sent over the wire. If you don’t change the file at all (here’s the cool part) it hardlinks the file from your last backup, so that it doesn’t consume any extra disk space. That’s the magic of rsync, baby! This way, your remote drive winds up looking like this:

/opt/usb1/backups:

2020-May-24_13:00:00/
2020-May-26_07:21:22/
2020-May-26_13:52:31/
2020-May-27_13:00:00/
2020-May-27_18:48:47/
2020-May-28_13:40:33/
2020-May-29_13:00:00/
2020-May-30_13:00:00/
2020-May-31_13:00:00/
last/

Every one of those folders is a snapshot of your data at that exact point in time. This is (fun fact) exactly how Mac’s Time Machine backups work. If your USB drive starts to get too full, just delete some of the old snapshots.
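
If you want to convince yourself that the hardlinking is real, pick a file you haven’t changed and compare its inode number across two snapshots on the Pi (the snapshot and file names here are made up):

pi@localhost ~ $ ls -li /opt/usb1/backups/2020-May-30_13:00:00/notes.txt /opt/usb1/backups/2020-May-31_13:00:00/notes.txt

If the inode number (the first column) is the same in both lines, the two snapshots are sharing a single copy of that file on disk.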

So that’s the second local copy done. What about the third one?

Well, in the spirit of frugality, I use the cheapest cloud storage I can find: Backblaze B2. B2 is S3-compatible object storage at $0.005/GB/month. Storing 10 GB for a year will cost you 60 cents.

I like B2 because it’s cheap even when you’re not buying storage in bulk. B2 works out to $60/TB/year, which is good, but not unheard-of. I don’t have a whole terabyte of valuable data though. I only have a few gigs, and B2 lets me store that for basically nothing.

That being said, I’m not married to Backblaze and you shouldn’t be either. Vendor lock-in is not fun. So these scripts will work with any S3-compatible object storage API.

Speaking of which, here are the scripts I use to back my data up to the cloud. Put them on the Raspberry Pi.
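
Assuming you write the three scripts below on your own machine first, getting them onto the Pi is one scp away (the IP address is an example, as before):

you@desktop ~ $ ssh pi@192.168.1.5 mkdir -p /home/pi/scripts
you@desktop ~ $ scp cloud_backup.zsh s3_upload.py s3_cycle_versions.py pi@192.168.1.5:/home/pi/scripts/
you@desktop ~ $ ssh pi@192.168.1.5 'chmod +x /home/pi/scripts/*.zsh /home/pi/scripts/*.py'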

This one goes at /home/pi/scripts/cloud_backup.zsh.

cloud_backup.zsh:

#!/usr/bin/zsh

set -euo pipefail

TARBALL=
ENCRYPTED=
TMP_DIR=

# wait for the other processes to finish,
# then remove all the temporary files
function cleanup {
  sleep 2
  if [[ -n $TARBALL && -e $TARBALL ]]; then
    rm -rf $TARBALL
  fi

  if [[ -n $ENCRYPTED && -e $ENCRYPTED ]]; then
    rm -rf $ENCRYPTED
  fi

  if [[ -n $TMP_DIR && -e $TMP_DIR ]]; then
    rmdir $TMP_DIR
  fi
}

function main() {
  printf "=== BACKUP FOR $(date) ===\n"

  # Config vars
  source $BACKUP_CONFIG
  source $S3_CONFIG

  if ! [[ -e $TARGET ]]; then
    echo "Target $TARGET does not exist"
    exit 1
  fi
  # readlink -f resolves the "last" symlink to an absolute path, so tar works from any directory
  REAL_TARGET=$(readlink -f $TARGET)

  # Tar your latest backup
  TMP_DIR=$(mktemp -dp $TMP_DIR_AREA tmp.backup.XXXX)/
  TARBALL=$TMP_DIR$NAME.tar.gz
  printf "tar starts at $(date +%T)\n"
  tar -czvf $TARBALL $REAL_TARGET
  printf "tar done at $(date +%T)\n"

  # Encrypt the tarball.
  # Make sure pyAesCrypt is installed:
  # $ sudo apt install python3-pip
  # $ pip3 install pyAesCrypt
  ENCRYPTED=$TARBALL.aes
  printf "encrypt starts at $(date +%T)\n"
  $HOME/.local/bin/pyAesCrypt -p $PASSWORD -e $TARBALL -o $ENCRYPTED 2>&1 >/dev/null
  printf "encrypt done at $(date +%T)\n"

  # Upload the encrypted tarball
  REMOTE_NAME=$FOLDER$(basename $ENCRYPTED)
  printf "upload starts at $(date +%T)\n"
  $SCRIPT_DIR/s3_upload.py $BUCKET $ENCRYPTED $REMOTE_NAME
  printf "upload done at $(date +%T)\n"

  # Delete any stale backups in the cloud, to reduce the bill
  printf "cycling starts at $(date +%T)\n"
  $SCRIPT_DIR/s3_cycle_versions.py $BUCKET $REMOTE_NAME $COPIES
  printf "cycling starts at $(date +%T)\n"

  cleanup
}

trap "cleanup &" SIGINT EXIT ERR
main

It tars & compresses your most recent backup, encrypts it using pyAesCrypt, and uploads it to B2 (or any other cloud storage which is compatible with the S3 API). Make sure you install pyAesCrypt on the Pi before running the script:

pi@localhost ~ $ sudo apt install python3-pip
pi@localhost ~ $ pip3 install pyAesCrypt
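
While we’re at it, restoring from one of these cloud backups is just the reverse: download the .aes file (the B2 web console works fine for this), decrypt it, and untar it. Something like the following, assuming pyAesCrypt’s -d flag mirrors the -e usage in the script above:

pi@localhost ~ $ ~/.local/bin/pyAesCrypt -p "example_password" -d desktop-backup.tar.gz.aes -o desktop-backup.tar.gz
pi@localhost ~ $ tar -xzvf desktop-backup.tar.gz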

This next one belongs at /home/pi/scripts/s3_upload.py.

s3_upload.py:

#!/usr/bin/env python3

import os
import sys

# make sure boto3 is installed
# $ sudo apt install python3-pip
# $ pip3 install boto3
import boto3

endpoint_url = os.environ.get("S3_ENDPOINT")
aws_access_key_id = os.environ.get("S3_KEY_ID")
aws_secret_access_key = os.environ.get("S3_KEY_SECRET")

def main():
    global endpoint_url
    global aws_access_key_id
    global aws_secret_access_key

    session = boto3.session.Session()
    client = session.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )
    if len(sys.argv) != 4:
        print("Usage: ./s3_api_upload.py BUCKET SRC DEST")
        exit()

    bucket = sys.argv[1]
    src = sys.argv[2]
    dest = sys.argv[3]

    print(f"Uploading {src} to {bucket}:{dest} ...")
    client.upload_file(
        src,
        bucket,
        dest,
    )
    print("Done!")

if __name__ == "__main__":
    main()

This one’s simple enough; it handles the actual file uploading. It depends on AWS’s boto3 Python API, so install that using pip3 install boto3. If you want to isolate these Python dependencies in a virtualenv, that’s probably smart, but I haven’t.
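
If you want to test the upload script on its own, source s3.conf first so the S3_* variables (described below) are exported, then give it a bucket, a local file, and a destination key. The names here are just the ones from the example configs:

pi@localhost ~ $ source /home/pi/scripts/s3.conf
pi@localhost ~ $ /home/pi/scripts/s3_upload.py myname-backups /tmp/test.txt backups-folder/test.txt
Uploading /tmp/test.txt to myname-backups:backups-folder/test.txt ...
Done!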

The last script goes at /home/pi/scripts/s3_cycle_versions.py:

s3_cycle_versions.py:

#!/usr/bin/env python3

import os
import sys

# make sure boto3 is installed
# $ sudo apt install python3-pip
# $ pip3 install boto3
import boto3

endpoint_url = os.environ.get("S3_ENDPOINT")
aws_access_key_id = os.environ.get("S3_KEY_ID")
aws_secret_access_key = os.environ.get("S3_KEY_SECRET")


def main():
    global endpoint_url
    global aws_access_key_id
    global aws_secret_access_key

    session = boto3.session.Session()
    client = session.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )
    if len(sys.argv) != 4:
        print("Usage: ./s3_cycle_versions.py BUCKET KEY NUMBER_OF_COPIES")
        exit()

    bucket = sys.argv[1]
    key = sys.argv[2]
    copies = int(sys.argv[3])

    versions = client.list_object_versions(Bucket=bucket)

    # Collect all versions of each file together
    byVersion = {}
    for i in versions["Versions"]:
        if i["Key"] not in byVersion:
            byVersion[i["Key"]] = []
        byVersion[i["Key"]].append(i)

    # get & sort the one we're looking for
    fileVersions = sorted(
        byVersion[key], key=lambda x: x["LastModified"], reverse=True)

    # delete the old ones
    while len(fileVersions) > copies:
        currentVersionId = fileVersions[-1]["VersionId"]
        client.delete_object(Bucket=bucket, Key=key,
                             VersionId=currentVersionId)
        fileVersions.pop()

    print("Remaining Copies:")
    for f in fileVersions:
        print("{} from {}".format(f["Key"], f["LastModified"]))


if __name__ == "__main__":
    main()

This one needs a bit of explanation. B2 and other S3-compatible object storage systems let you keep multiple versions of a file: every time you upload a new backup under the same name, you’re adding a new version of the same object. You can have these versions expire after a certain amount of time, but I didn’t want to do that, so I wrote this script instead, which preserves the three most recent versions of a file and removes the rest. That way, if I stop making backups for some reason, my old backups won’t quietly expire out from under me.
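
If you ever want to run it by hand, it takes the bucket, the object key, and the number of versions to keep (again, source s3.conf first; the names below come from the example configs):

pi@localhost ~ $ source /home/pi/scripts/s3.conf
pi@localhost ~ $ /home/pi/scripts/s3_cycle_versions.py myname-backups backups-folder/desktop-backup.tar.gz.aes 3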

There are two config files you need to run these scripts – backup.conf and s3.conf.

backup.conf holds information about your backup:

# Example backup config file

# Ensure that there is no trailing slash for $TARGET,
# and that there *is* a trailing slash for $FOLDER (unless it's empty)

NAME=desktop-backup                # name of the backup file
PASSWORD="example_password"        # encryption password
TARGET=/opt/usb1/backups/last      # location of your most recent backup
TMP_DIR_AREA=/tmp/                 # location to put temporary files
SCRIPT_DIR=/home/pi/scripts/       # name of the directory holding the scripts
BUCKET=myname-backups              # bucket to upload to
FOLDER=backups-folder/             # folder in the bucket to upload to
COPIES=3                           # number of versions to keep in the cloud

s3.conf holds the authentication information to upload to the cloud:

# Example s3 config file

# Make sure you `export` all of these values

export S3_ENDPOINT="https://s3.us-west-002.backblazeb2.com"
export S3_KEY_ID="20349example34bf6caca789"
export S3_KEY_SECRET="SLDKFexampleJD98798KJGHIUH9879"

Log into your cloud storage account and create a bucket called myname-backups (or whatever – just make sure it’s the same as the one in the config file). The $COPIES parameter in backup.conf is set to 3, which tells the script to keep three versions of your data in the cloud – the most recent copy, plus the last two versions. Set it to whatever you want, depending on whether you want to maximize safety or minimize spending.

To copy your backup to the cloud, run this:

pi@localhost ~ $ export BACKUP_CONFIG=/home/pi/scripts/backup.conf
pi@localhost ~ $ export S3_CONFIG=/home/pi/scripts/s3.conf
pi@localhost ~ $ /home/pi/scripts/cloud_backup.zsh >> /home/pi/backups.log

Dump those three lines into a script, run it once a week as a cron job for the pi user, and you’ve got your third copy of the data in the cloud.
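
For example, with a little wrapper script at /home/pi/scripts/run_cloud_backup.zsh (the path and schedule are just suggestions):

#!/usr/bin/zsh
# Wrapper so cron gets the right environment; make it executable with chmod +x
export BACKUP_CONFIG=/home/pi/scripts/backup.conf
export S3_CONFIG=/home/pi/scripts/s3.conf
/home/pi/scripts/cloud_backup.zsh >> /home/pi/backups.log 2>&1

And a weekly entry in crontab -e for the pi user:

# Every Sunday at 03:00
0 3 * * 0 /home/pi/scripts/run_cloud_backup.zsh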

That’s it! 3–2–1 — three copies, two on different local devices, one off-site. With nothing more than a USB stick, a Raspberry Pi, and a few cents a year.

So far, we’ve only talked about using a USB drive as a medium to store your backups. USB drives are a fairly cheap solution, but they’re not the most reliable thing in the world. If you want something really robust, get a network-attached storage device and set up a RAID array.2 But on this blog, cheap never precludes robust: let’s get a second USB drive and set up a fake RAID.

Plug both USBs into the Pi, mount them at /opt/usb1 and /opt/usb2 (or wherever), and copy the following into a file called mirror_usb.zsh.

mirror_usb.zsh:

#!/usr/bin/zsh

set -euo pipefail

DRIVE1=/opt/usb1/backups/
DRIVE2=/opt/usb2/backups/

# For every snapshot (just the directory name, not the full path)
for f in $(find $DRIVE1 -maxdepth 1 -mindepth 1 -type d -printf '%f\n'); do
    # If it's in both drives
    if ls $DRIVE2 | grep -Fx $f >/dev/null; then
        # Compare both copies
        if ! diff -qr $DRIVE1/$f $DRIVE2/$f >/dev/null; then
            # And if they differ, that's bit rot
            touch $DRIVE1/CORRUPTED
            exit 1
        fi
    fi
done
# Then copy over any new snapshots and update the symlink "last"
rsync -aH --delete $DRIVE1/ $DRIVE2/

Add mirror_usb.zsh to the pi user’s crontab, run it once per day, and voila: your second USB drive has become an automated mirror of the first, one that will detect and flag any data decay.
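
The crontab entry might look like this (path and time of day are up to you):

# Mirror the two USB drives every night at 04:00
0 4 * * * /home/pi/scripts/mirror_usb.zsh >> /home/pi/mirror.log 2>&1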

If you want to get really fancy, you can set up a little Flask server on your Pi to detect the CORRUPTED file. You could have it also keep track of the most recent backup from your computer, and the last time a backup was uploaded to the cloud, and put this information in a little web dashboard, but this is all beyond the scope of this article. I just wanted to share some handy shell scripts.


  1. Caveat: I use a fair amount of shell scripting in here. If you use Windows, you’ll have to use WSL or something to run them. ↩︎

  2. Full disclosure: I actually did get a used NAS instead of using a USB stick. ↩︎