Peeking into the world of backups
I recently spent some time poring over Amazon Glacier as a candidate for use as a generic backup service for data I’d prefer to not lose anytime soon. The low cost of storage coupled with the anticipated durability typical of other AWS offerings had me believing I had probably found the perfect service to trust with my backups. What I learned during the course of my introduction to the service turned out to be interesting and – more importantly – eye-opening.
I mention Amazon Glacier throughout this post as it is the service I looked into. It is possible – maybe even likely – that what I learned also applies to other services that perform data archiving.
Amazon Glacier as a generic backup solution?
Anyone looking for a backup service for personal use or within a small business might be surprised to discover a solution like Glacier may not be for them. This isn’t necessarily the impression one gets by reading through the overview page. Right from the beginning, the first line of text on the page seems to encourage readers to think of the service as a backup solution:
Amazon Glacier is an extremely low-cost storage service that provides secure and durable storage for data archiving and backup.
To the uninitiated, we seem to be off to a good start. The rest of the page gives some situations in which the service is worth using: offsite enterprise information archiving, archiving media assets, archiving research and scientific data, digital preservation, and magnetic tape replacement. While the given list of scenarios is clearly targeted at enterprise needs, there is nothing written to preclude the use of Glacier as a more generic backup solution for the simpler needs of an individual or small business. Herein lies a possible trap: with no up-front indication to the contrary, one can make the mistake of believing the service is perfectly capable of serving the needs of a generic backup solution even though the marketing is aimed at the enterprise.
So what’s the catch?
Data backup is not data archival
Glacier, and other services offering data archival, are designed for use by businesses with data which is critical to hold on to. Such a service is not meant to serve as storage for data that would be missed if lost, but rather for critical data that must not be lost, at all costs. The difference between the words backup and archive may not be immediately apparent to the everyday citizen, but it turns out that in the world of IT there is a widely understood distinction.
A backup is nothing more than a copy of data, kept for the purpose of being able to recover the original data in the event it is modified to an undesired state, accidentally deleted, or lost. A backup might be stored on a USB flash drive, a second hard disk on the same computer, a separate server on the network, or another easily accessible location. A backup provides some peace of mind knowing we can most likely recover from a localized user error, software problem, or hardware failure.
An archive is not so different; in fact, it is little more than a backup with a pessimistic outlook on potential risk. Unlike a “normal” backup, which is merely desirable to have, an archive is for data whose modification, deletion, or loss is unacceptable. Archival protects essential data from disasters a business may face, such as burglary, hurricanes, or fire. Many companies keep backups in the same building where the data was created as a first line of defense – but what happens if the office building burns to the ground? Having an archive of all mission-critical data means the business can be reassembled rather than dismantled. Another case is a business that must retain data to comply with regulatory requirements – often for a period of many years. It is not uncommon to archive every piece of email correspondence so that nothing can be tampered with or lost in the event of legal proceedings years after the communication took place.
So why not use a service like Glacier for generic backups?
Data archival performed seriously is not cheap. Before the dawn of archival solutions provided by services like Glacier, enterprises had little choice but to build up their own immense infrastructure to ensure data could not possibly be permanently lost. Assembling a system and process designed to practically guarantee the safeguarding of data is an astronomical undertaking, typically costing thousands upon thousands of dollars. The burden of this financial cost had to be paid up front and then maintained indefinitely, whether or not the safety net provided by the archive would ever be used – even once – over the course of the business’s entire lifetime.
Enter hosted archival services. At least in the case of Glacier, a business does not pay the astronomical up-front costs to store data that may never need to be retrieved. Glacier’s price for storing the data is a small fraction of the financial burden which would otherwise be taken on by having to build out the full infrastructure. The service rests on the premise that most customers will rarely – if ever – need to download data they have archived. Should data retrieval become necessary due to a catastrophic event, the cost may be high, but it is not burdensome if it means the difference between the resurrection of the business and its demise.
It’s time for a car analogy. Archiving data with a service like Glacier is like buying car insurance with total loss protection. A small insurance premium is paid monthly to protect against the possibility of needing to replace the car; likewise, a small storage fee is paid to the company archiving the data to protect against the possibility of needing to retrieve that data. If the car is destroyed or stolen, a larger deductible is paid for the replacement of the vehicle; likewise, in the event the archived data must be retrieved, a larger fee is paid to the archival company for providing a copy of the data. Now imagine car insurance didn’t exist, yet you were required to have replacement cars ready to take the place of the original at a moment’s notice in the event of an accident. Multiple new cars would have to be purchased up front, even though the original car may never actually need to be replaced. This is akin to the previous era of data archiving in which the full financial burden of the infrastructure had to be paid up front to protect against the possibility of disaster. Which insurance policy sounds better?
The danger of Glacier for the uninitiated
Back to Amazon Glacier. Before writing this article, I was not aware of the definition of data archiving as opposed to data backups. I initially believed I was going to be using Glacier for my non-critical backups. I had read the service’s overview page and it seemed to be the right service for me. I wasn’t an enterprise customer, but the description of how my data would be safe sounded just great. I then thoroughly read the entire pricing page to make sure I understood the costs associated with the service. Every other AWS service I have dealt with has straightforward pricing, and so it seemed for Glacier. The pricing tables held reasonable breakdowns, and here’s a screenshot showing the supposed cost of data retrievals (note the image is blown up here – on the page it is in the smaller fine print):
I’ve highlighted in yellow the parts that made me believe the costs were minimal. The first 5% is free, with further retrieval billed at $0.01 per GB – perfect. Alas! If it sounds too good to be true, it probably is. I initially failed to notice the dodgy marketing words “starting at” prefixing the “$0.01 per GB” figure. Worse than that, I neglected to click the little “Learn more” link – which leads to this section buried in the FAQ. This single FAQ answer lays out the true costs of Glacier, and it’s not easy to understand… at all. The formula used to determine the cost of data retrievals is complicated, to the point that the speed at which you retrieve data matters greatly. Frankly, I’ve read the formula description multiple times and still do not understand how one would calculate the cost of a retrieval.
In the end, my credit card was spared: a friend of mine started using Glacier before I did. He had the same understanding of the pricing structure as I did, and performed his first retrieval of 45 GB. The cost came to just under $85 – nowhere remotely close to the $5.85 we would have expected based on the details present on the pricing page [retrieval fee: 45 GB × $0.01 = $0.45; outbound bandwidth: 45 GB × $0.12 = $5.40].
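For anyone who wants to repeat our back-of-the-envelope math, here is a minimal Python sketch of the estimate we made. It uses only the two per-GB rates quoted in this post ($0.01 retrieval, $0.12 outbound bandwidth); the function and constant names are my own invention. To be clear, this reproduces the naive reading of the pricing page, not Glacier’s actual retrieval formula, which is exactly why it lands so far from the real bill.

```python
from decimal import Decimal

# Rates as quoted in this post; treat them as assumptions, not current pricing.
RETRIEVAL_RATE_PER_GB = Decimal("0.01")   # the "starting at" retrieval figure
BANDWIDTH_RATE_PER_GB = Decimal("0.12")   # outbound data transfer


def naive_retrieval_estimate(size_gb):
    """Return (retrieval_fee, bandwidth_fee, total) in dollars.

    This is the naive flat-rate estimate only. It ignores the
    retrieval-speed component described in the FAQ.
    """
    size = Decimal(str(size_gb))
    retrieval_fee = size * RETRIEVAL_RATE_PER_GB
    bandwidth_fee = size * BANDWIDTH_RATE_PER_GB
    return retrieval_fee, bandwidth_fee, retrieval_fee + bandwidth_fee


retrieval, bandwidth, total = naive_retrieval_estimate(45)
print(f"retrieval ${retrieval:.2f} + bandwidth ${bandwidth:.2f} = ${total:.2f}")
# prints: retrieval $0.45 + bandwidth $5.40 = $5.85
```

Decimal is used instead of floats so the dollar amounts add up exactly.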
Apparently the formula used to calculate retrieval fees – using the speed of the retrieval and all – is fairly standard across data archiving services. Now that I understand the design principle behind long-term, hands-off data archiving as opposed to simple backups, it makes sense to me that retrievals would be costly. For everyone who is not an IT expert with archiving experience, hopefully Amazon will include more information about the retrieval costs on the pricing page instead of leaving it hidden in the FAQ with only a link in the fine print to lead us there.
For brave souls: the retrieval cost FAQ entry explains how it is possible to greatly reduce the large retrieval cost by downloading your data over a period of days or weeks. This does not fit the “I need it now” consumer mentality of backup retrievals, but if you are a small business or patient individual you might get away with using Glacier or a similar competitor without the risk of incurring enormous fees – if you can figure out the retrieval cost formula.
For the rest of us: remember the difference between a backup and archiving. Steer clear of archiving services unless you understand their purpose and associated costs. Stick to providers offering cheap backup solutions targeted at the consumer market.