We’ve been promoting off-site backups to our clients for years for security and disaster recovery purposes. It’s awesome – any IT consultant will tell you the same.
For a long time we had been using a provider that offered access through rsync, which was awesome for us. It would work from our Linux, Mac and Windows machines. Many providers have cross-platform solutions (but some don’t) and it’s a requirement for us.
Well, our hosting provider crashed and dropped 32GiB of our data. Ouch. Our configuration on their system (access via SSH with authorized RSA keys) was lost as well. Ouch. Also, the media has been talking to us constantly about the dangers of these off-site companies failing, and we’ve heard of a few ourselves.
Amazon has a very solid product with S3 – 99.99% availability and 99.9% monthly uptime SLA.
Not perfect – but perfect is not possible and four-nines is very good.
But S3 doesn’t function like a normal file system: it only operates on whole files at a time, so tools like rsync cannot work with S3 directly. Here I’ll talk about a few other solutions and what Edoceo does for our S3 backup plan.
Other S3 Tools
There are tools like s3 or s3fox that provide visual user interfaces to S3; they are very handy but not suitable for automated backups. s3sync is one option but operates on whole files; s3fs is slightly more efficient and works with rsync.
At first we decided to just use s3fs, but we have five hosts to back up and didn’t want to share our AWS credentials across all of these systems. Our preferred method is to have the systems talk to another host (EC2) over SSH and authenticate using RSA keys, so the AWS credentials never have to live on multiple hosts. Using EC2 also lets us do loads more high-level communications.
Our backup plan looks like this (each hostname is abbreviated to a single uppercase letter):
- C: Run EC2 Image
- C, H, O, L: Rsync to EC2 image over SSH
- C: Stop EC2 Image
Run EC2 Image starts our own modified EC2 image that’s primed and ready for us to archive to. It is pre-configured to accept rsa keys from C, H, O and L, and it provides direct access to storage in EBS which snapshots to S3.
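The start/stop bracket around the backups can be sketched in a few lines. This is only an illustration under assumptions: it uses the call names of today's boto3 EC2 client (start_instances, get_waiter, stop_instances), which may not be the tooling we actually used. It also assumes an EBS-backed instance that can be stopped rather than terminated. The function name and the do_backups callback are hypothetical.

```python
def run_backup_window(ec2, instance_id, do_backups):
    """Start the backup instance, run the backups, then always stop it.

    `ec2` is assumed to be a boto3 EC2 client (or anything with the same
    three methods); `do_backups` is a hypothetical callable that triggers
    the per-host rsync runs.
    """
    ec2.start_instances(InstanceIds=[instance_id])
    # Block until EC2 reports the instance as running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    try:
        return do_backups()
    finally:
        # Runs even if do_backups raises, so the instance is never left on.
        ec2.stop_instances(InstanceIds=[instance_id])


# Possible usage (the instance ID is a placeholder):
#   import boto3
#   run_backup_window(boto3.client("ec2"), "i-0123456789abcdef0", my_backups)
```

Putting the stop call in a finally block is the design choice that matters here: a failed backup should not leave us paying for an idle instance.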
C, H, O and L all run rsync over SSH to our EC2 image and store their data in the EBS volume mounted at /mnt/[hostname], so we have four EBS mounts, one for each server we archive. The data each host pushes to EC2 is encrypted before it leaves that host’s file system and is transferred over an encrypted channel. It’s so easy, why wouldn’t we?
Each host is responsible for its own encrypted rdiff-backup archive. We did this so that the bits stored at Amazon are encrypted per host. If Amazon is hacked our data is still encrypted (AES128), and if someone breaks one archive the others are not at risk. Something to consider is that Amazon is becoming a big exposed target holding loads of possibly sensitive data for a large number of organisations. Big targets are attractive to malicious internet users and you should be cautious out there.
Our hosts are given a fixed amount of time to execute their backups; for us it’s 3.6h. If they are not done by then they have to wait until tomorrow. Using rsync allows us to stop in the middle of a backup and not have to start all over the next day. Our typical change set is less than 1GiB and most archive steps take under 15 minutes, so this fits. It also keeps our EC2 costs down: our image only runs 3.6h/d, so the monthly instance cost is reduced by more than 80% compared with running around the clock.
Storage in EBS is cheap ($0.10/allocated GiB), and when our image is not running our data stays in EBS under Amazon’s 99.[5-9]% availability SLA. However, being the paranoid sort, we also send EBS snapshots to S3 (99.99% SLA) nightly, stored at $0.10/GiB used. Cheap.
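As a rough illustration of the push step, here is how the rsync command line for one host could be assembled. The function name, the "backup" user, the key path and the EC2 host name are all assumptions made for the example; only the /mnt/[hostname] target layout comes from the setup described above.

```python
def rsync_push_command(local_dir, hostname, remote_host, key_file, user="backup"):
    """Build the rsync-over-SSH command one host would run.

    All parameter values are illustrative; the remote path follows the
    /mnt/[hostname] layout described in the text.
    """
    return [
        "rsync",
        "--archive",             # preserve permissions, times and symlinks
        "--partial",             # keep partly transferred files for the next run
        "-e", f"ssh -i {key_file}",  # authenticate with the host's RSA key
        local_dir.rstrip("/") + "/",  # trailing slash: copy contents, not the dir
        f"{user}@{remote_host}:/mnt/{hostname}/",
    ]


# Example: host "C" pushing its encrypted archive (all values hypothetical).
cmd = rsync_push_command("/var/crypt/raw", "C", "ec2.example.com",
                         "/root/.ssh/backup_rsa")
```

The list form can be handed straight to subprocess.run without going through a shell.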
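One way to enforce such a window is to give each command only the time that is left and let it be killed when the window closes. The sketch below is an assumption about how that could look, not our actual scripts; the function names and the returned status strings are made up for the example.

```python
import subprocess
import time

WINDOW_SECONDS = 12960  # 3.6 hours, the window quoted in the text


def seconds_left(window_start, now, window=WINDOW_SECONDS):
    """Seconds remaining in the backup window, never negative."""
    return max(0.0, window - (now - window_start))


def run_within_window(cmd, window_start, clock=time.monotonic):
    """Run cmd, but give up once the backup window has closed."""
    budget = seconds_left(window_start, clock())
    if budget <= 0:
        return "skipped"        # window already over, try again tomorrow
    try:
        # subprocess.run kills the child if the timeout expires.
        subprocess.run(cmd, check=True, timeout=budget)
    except subprocess.TimeoutExpired:
        return "timed_out"      # the next run continues the transfer
    return "finished"
```

Because rsync only sends what is missing on the remote side, an interrupted run costs us the remainder of that night's transfer and nothing more.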
Sample Cost Breakdown
Fixed(ish - will grow over time):
- m1.small EC2 Image - 3.6h/d * 30d = $3.00 (0.10 * 4 * 30)
- EBS Storage 5GiB/mo * 4 Volumes = $2.00 (0.10 * 5 * 4)
- S3 Storage 20GiB/mo * 1 Bucket = $3.00 (0.15 * 20)
Variable (based on transfer):
- EC2 Bytes In (estimate 5GiB/mo) = $0.50 (0.10 * 5)
- EC2 Bytes Out (est <= 10TiB) = $0.17 (0.17 * 1)
- S3 Bytes I/O (<=10TiB) = $0.17 (0.17 * 2)
- S3 Request PCPL (<=10,000) = $0.10 (0.01 * 10)
- S3 Request GET (<=20,000) = $0.02 (0.01 * 2)
Total: $7.96
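A breakdown like this is easy to re-check by multiplying out the factors given in parentheses. The sketch below does only that: the labels and factors are copied from the lines above, and the helper names are invented for the example. It makes no claim about current AWS pricing.

```python
from decimal import Decimal

# (label, factors) copied from the formulas in parentheses above.
LINE_ITEMS = [
    ("EC2 image",        ["0.10", "4", "30"]),
    ("EBS storage",      ["0.10", "5", "4"]),
    ("S3 storage",       ["0.15", "20"]),
    ("EC2 bytes in",     ["0.10", "5"]),
    ("EC2 bytes out",    ["0.17", "1"]),
    ("S3 bytes I/O",     ["0.17", "2"]),
    ("S3 requests PCPL", ["0.01", "10"]),
    ("S3 requests GET",  ["0.01", "2"]),
]


def line_cost(factors):
    """Multiply the factors of one line item using exact decimal maths."""
    cost = Decimal("1")
    for factor in factors:
        cost *= Decimal(factor)
    return cost


def monthly_total(items):
    """Sum the recomputed cost of every line item."""
    return sum((line_cost(factors) for _, factors in items), Decimal("0"))


for label, factors in LINE_ITEMS:
    print(f"{label:18} {line_cost(factors)}")
print(f"{'total':18} {monthly_total(LINE_ITEMS)}")
```

Multiplied out this way, the EC2 formula (0.10 * 4 * 30) comes to 12.00 rather than the 3.00 printed, and the S3 I/O formula (0.17 * 2) comes to 0.34 rather than 0.17. The printed amounts and the total are worth rechecking against your own usage before you budget from them.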
Our host-specific backup looks like this:
- Mount encrypted filesystem (encfs) to /var/crypt/rdiff-backup
- rdiff-backup the necessary live filesystem stuff to /var/crypt/rdiff-backup (keeping 20
- Unmount encrypted filesystem
- rsync encrypted rdiff-backup archive to ec2
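The four steps above can be sketched as an ordered list of commands. This is an illustration under assumptions rather than our actual script. The raw encfs directory, the per-source subdirectory naming and the remote address are hypothetical. Password handling for encfs and rdiff-backup retention settings are deliberately left out.

```python
def host_backup_steps(source_dirs, raw_dir, hostname, remote,
                      mount_dir="/var/crypt/rdiff-backup"):
    """Return the commands for one host's backup, in execution order.

    raw_dir is the (assumed) directory holding the encrypted encfs data;
    mount_dir is where its decrypted view is mounted during the backup.
    """
    steps = [["encfs", raw_dir, mount_dir]]  # 1. mount the encrypted filesystem
    for src in source_dirs:
        # 2. one rdiff-backup archive per source directory (naming is an assumption)
        name = src.strip("/").replace("/", "_") or "root"
        steps.append(["rdiff-backup", src, f"{mount_dir}/{name}"])
    steps.append(["fusermount", "-u", mount_dir])  # 3. unmount the encfs view
    steps.append([                                 # 4. push only the encrypted bits
        "rsync", "--archive", "--partial", "-e", "ssh",
        raw_dir.rstrip("/") + "/", f"{remote}:/mnt/{hostname}/",
    ])
    return steps


# Example for host "C" (paths and remote address are placeholders).
steps = host_backup_steps(["/etc", "/home"], "/var/crypt/raw", "C",
                          "backup@ec2.example.com")
```

Note that rsync reads from the raw encfs directory, not the mount point, so only ciphertext ever leaves the host.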
As this EC2 image is running Gentoo (our favourite) we can also extend it with things like OpenVPN and Samba so that Windows hosts can directly connect and browse shares stored in EBS. For that, we'd have to keep the instance available for more of the day than it is now.
Amazon EC2+EBS+S3 and a little elbow-grease-overhead can create an incredibly flexible and inexpensive backup option. Our monthly costs from an alternate provider