Tag Archives: Backups

High Availability vs. Fault Tolerance

Many times, terms like “High Availability” and “Fault Tolerance” get thrown around as though they were the same thing. In fact, the term “fault tolerant” can mean different things to different people - and much like the terms “portal” or “cloud,” it’s important to be clear about exactly what someone means by the term “fault tolerant.”

As part of our continuing efforts to guide you through the jargon jungle, we would like to discuss redundancy, fault tolerance, failover, and high availability, and we’d like to add one more term: continuous availability.

Our friends at Marathon Technologies shared the following graphic, which shows how IDC classifies the levels of availability:

The Availability Pyramid (graphic of IDC’s availability levels)

Redundancy is simply a way of saying that you are duplicating critical components in an attempt to eliminate single points of failure. Multiple power supplies, hot-plug disk drive arrays, multi-pathing with additional switches, and even duplicate servers are all part of building redundant systems.

Unfortunately, there are some failures, particularly if we’re talking about server hardware, that can take a system down regardless of how much you’ve tried to make it redundant. You can build a server with redundant hot-plug power supplies and redundant hot-plug disk drives, and still have the system go down if the motherboard fails - not likely, but still possible. And if it does happen, the server is down. That’s why IDC classifies this as “Availability Level 1” (“AL1” on the graphic)…just one level above no protection at all.

The next step up is some kind of failover solution. If a server experiences a catastrophic failure, the workloads are “failed over” to a system that is capable of supporting them. Depending on those workloads, and on what kind of failover solution you have, that process can take anywhere from minutes to hours. If you’re at “AL2,” and you’ve replicated your data using, say, SAN replication or some kind of server-to-server replication, it could take a considerable amount of time to actually get things running again. If your servers are virtualized, with multiple virtualization hosts running against a shared storage repository, you may be able to configure your virtualization infrastructure to automatically restart a critical workload on a surviving host if the host it was running on experiences a catastrophic failure - meaning that your critical system is back up and on-line in the amount of time it takes the system to reboot, typically 5 to 10 minutes.

If you’re using clustering technology, your cluster may be able to fail over in a matter of seconds (“AL3” on the graphic). Microsoft server clustering is a classic example of this. Of course, it means that your application has to be cluster-aware, you have to be running Windows Enterprise Edition, and you may have to purchase multiple licenses for your application as well. And managing a cluster is not trivial, particularly when you’ve fixed whatever failed and it’s time to unwind all the stuff that happened when you failed over. And your application is still unavailable during whatever interval of time is required for the cluster to detect the failure and complete the failover process.

You could argue that a failover of 5 minutes or less equals a highly available system, and indeed there are probably many cases where you wouldn’t need anything better than that. But it is not truly fault tolerant. It’s probably not good enough if you are, say, running a security application that’s controlling the smart-card access to secured areas in an airport, or a video surveillance system that is sufficiently critical that you can’t afford to have a 5-minute gap in your video record, or a process control system where a five-minute halt means you’ve lost the integrity of your work in process and potentially have to discard thousands of dollars worth of raw material - and lose thousands more in productivity while you clean out your assembly line and restart it.

That brings us to the concept of continuous availability. This is the highest level of availability, and what we consider to be true fault tolerance. Instead of simply failing workloads over, this level allows for continuous processing without disruption of access to those workloads. Since there is no disruption in service, there is no data loss, no loss of productivity, and no waiting for your systems to restart your workloads.

So all this leads to the question of what your business needs.

Do you have applications that are critical to your organization? If those applications go down how long could you afford to be without access to them? If those applications go down how much data can you afford to lose? 5 minutes? An hour? And, most importantly, what does it cost you if that application is unavailable for a period of time? Do you know, or can you calculate it?

This is another way to ask what the requirements are for your “RTO” (“Recovery Time Objective” - i.e., how long, when a system goes down, do you have before you must be back up) and “RPO” (“Recovery Point Objective” - i.e., when you do get the system back up, how much data it is OK to have lost in the process). We’ve discussed these concepts in previous posts. These are questions that only you can answer, and the answers are significantly different depending on your business model. If you’re a small business, and your accounting server goes down, and all it means is that you have to wait until tomorrow to enter today’s transactions, it’s a far different situation from a major bank that is processing millions of dollars in credit card transactions.
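To make those questions concrete, here is a back-of-the-envelope sketch of the downtime-cost side of the RTO conversation. Every figure in it (revenue per hour, headcount, wages) is a hypothetical placeholder - the point is the arithmetic, not the numbers, so plug in your own.

```python
# Back-of-the-envelope downtime cost estimate. All figures are hypothetical
# placeholders - substitute your own business's numbers.

def downtime_cost(outage_hours, revenue_per_hour, idle_staff, loaded_hourly_wage):
    """Rough cost of an outage: lost revenue plus the cost of idled staff."""
    return outage_hours * (revenue_per_hour + idle_staff * loaded_hourly_wage)

# Example: a 4-hour outage for a business doing $500/hour in sales,
# with 6 employees idled at a $35/hour loaded wage.
cost = downtime_cost(outage_hours=4, revenue_per_hour=500,
                     idle_staff=6, loaded_hourly_wage=35)
print(f"Estimated cost of a 4-hour outage: ${cost:,.2f}")
```

Even a rough number like this gives you something to weigh against the price of moving up an availability level.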

If you can satisfy your business needs by deploying one of the lower levels of availability, great! Just don’t settle for an AL1 or even an AL3 solution if what your business truly demands is continuous availability.

A Better Way to Backup Your Data

Moose Logic has been building and supporting networks for a long time. And during most of that time we’ve had a real love-hate relationship with most of the backup technologies we’ve implemented and/or recommended.

Tape backups - although they are arguably the best technology for long-term archival storage - are a pain to manage. Tapes wear out. Tape drives get dirty. People just don’t do test restores as often as they should. As a result, all too often, the first time you realize that you’ve got a problem with your backups is when you have a data loss, try to restore from your backups, and find out that they’re no good.

Add to that the astronomical growth in storage capacity, meaning that all the data you need to back up often won’t fit on one tape any more. So, unless you have someone working the night shift who can swap out the tape when it gets full, you’re faced with…

  • Buying multiple tape drives, which typically means you’re going to spend more on your backup software. And if your servers are virtualized, where are you going to install those tape drives?
  • Buying a tape library (a.k.a. autoloader), which can also get expensive.
  • Changing the tape when you come in the next morning, which means that your network performance suffers because you’re trying to finish the backup job(s) while people are trying to get work done.

Then there’s the issue of getting a copy of your data out of the building. Typically, that’s done by having multiple sets of tapes, and a designated employee who takes one set home every Friday and brings the other set in. If s/he remembers. Or isn’t sick or on vacation.

Backing up to external hard drives is a reasonable alternative for some. It solves the capacity issue in most cases. But over the years, we’ve seen reliability issues with some manufacturers’ units. We’ve uncovered nagging little issues like some units that don’t automatically come back on line after a power interruption. And they’re not necessarily the best for long-term archival storage, unless you keep them powered on - or at least power them on once in a while - because hard disks that just sit for long periods of time may develop issues with the lubrication in their bearings and not want to spin back up.

But we’ve finally found an approach that we really, really like. One that, as one of our engineers said in an internal email thread, we actually enjoy managing. In fact, we like it so much we built a backup appliance around it. It’s Microsoft’s System Center Data Protection Manager (SCDPM).

In this installment of the Moose Logic Video Series, our own Scott Gorcester gives you a quick overview of SCDPM 2010:

For more detail on how it works, check out the description of our MooseSentry™ backup appliance.

Why You Need Good Backups

A few days ago, in the post entitled “Seven things you need to do to keep your data safe,” we were talking primarily about some simple things that individuals can do to protect their data, even if (or especially if) they’re not IT professionals. In this post, we’re talking to you, Mr. Small Business Owner.

You might think that it’s intuitively obvious why you would need good backups, but according to an HP White Paper I recently discovered (which you should definitely download and read), as many as 40% of small and medium-sized businesses don’t back up their data at all.

The White Paper is entitled Impact on U.S. Small Business of Natural and Man-Made Disasters. What kinds of disasters are we talking about? The White Paper cites statistics from a presentation to the 2007 National Hurricane Conference in New Orleans by Robert P. Hartwig of the Insurance Information Institute. According to Hartwig, over the 20-year period of 1986 through 2005, catastrophic losses broke down like this:

  • Hurricanes and tropical storms - 47.5%
  • Tornado losses - 24.5%
  • Winter storms - 7.8%
  • Terrorism - 7.7%
  • Earthquakes and other geologic events - 6.7%
  • Wind/hail/flood - 2.8%
  • Fire - 2.3%
  • Civil disorders, water damage, and utility services disruption - less than 1%

If you’re in Moose Logic’s back yard here in the great State of Washington, you probably went down that list and told yourself, with a sigh of relief, that you didn’t have to worry about almost three-quarters of the disasters, because we typically don’t have to deal with hurricanes and tornadoes. But you might be surprised, as I was, to learn that we are nevertheless in the top twenty states in terms of the number of major disasters, with 40 disasters declared in the period of 1955 - 2007. We’re tied with West Virginia for 15th place.

Sometimes, disasters come at you from completely unexpected directions. Witness the “Great Chicago Flood” of 1992. Quoting from the White Paper:

In 1899 the city of Chicago started work on a series of interconnecting tunnels located approximately forty feet beneath street level. This series of tunnels ran below the Chicago River and underneath the Chicago business district, known as The Loop. The tunnels housed a series of railroad tracks that were used to haul coal and to remove ashes from the many office buildings in the downtown area. The underground system fell into disuse in the 1940s, was officially abandoned in 1959, and the tunnels were largely forgotten until April 13th, 1992.

Rehabilitation work on the Kinzie Street bridge crossing the Chicago River required new pilings, and a work crew apparently drove one of those pilings through the roof of one of those long-abandoned tunnels. The water flooded the basements of Loop office buildings and retail stores and an underground shopping district. More than 250 million gallons of water quickly began flooding the basements and electrical controls of over 300 buildings throughout the downtown area. At its height, some buildings had 40 feet of water in their lower levels. Recovery efforts lasted for over four weeks and, according to the City of Chicago, cost businesses and residents an estimated $1.95 billion. Some buildings remained closed for weeks. In those buildings were hundreds of small and medium businesses suddenly cut off from their data and records and all that it took to conduct business. The underground flood of Chicago proved to be one of the worst business disasters ever.

Or how about the disaster that hit Tessco Technologies, outside of Baltimore, in October of 2002? A faulty fire hydrant outside its Hunt Valley data center failed, and “several hundred thousand gallons of water blasted through a concrete wall leaving the company’s primary data center under several feet of water and left some 1400 hard drives and 400 SAN disks soaking wet and caked with mud and debris.”

How could you have possibly seen those coming?

And as if these disasters aren’t bad enough, other studies show that as much as 50% of data loss is caused by user error - and we all have users!

One problem, of course, as we’ve observed before, is that it’s difficult to build an ROI justification around the bad thing that didn’t happen. Unforeseen disasters are, well, unforeseen. There’s no guarantee that the big investment you make in backup and disaster recovery planning is going to give you any return in the next 12 - 24 months. It’s only going to pay off if, God forbid, you actually have a disaster to recover from. So it’s no surprise that, when a business owner is faced with the choice between making that investment and making some other kind of business investment that will have a higher likelihood of a short-term payback (or perhaps taking that dream vacation that the spouse has been bugging you about for the last five years), the backup / disaster recovery expenditure drops, once again, to the bottom of the priority list.

One solution is to shift your perspective, and view the expense as insurance. Heck, if it helps you can even take out a lease to cover the cost - then you can pretend the lease payment is an insurance premium! You wouldn’t run your business without business liability insurance - because without it you could literally lose everything. You shouldn’t run your business without a solid backup and disaster-recovery plan, either, and for precisely the same reason.

Please. Download the HP White Paper, read it, then work through the following exercise:

  • List all of the things that you can imagine that would possibly have an impact on your business. I mean everything - from the obvious things like flood, fire, and earthquake, to less obvious things, like a police action that restricts access to the building your office is in, or the pandemic that everyone keeps telling us is just around the corner.
  • For each item on your list, make your best judgment call, on a scale of 1 to 3, of
    • How likely it is to happen, and
    • How severely it would affect your business if it did happen.

You now have the beginnings of a priority list. The items that you rated “3” in both columns (meaning not likely to happen, and not likely to have a severe effect on your business even if they did) you can push to the bottom of the priority list. The items that you rated “1” in both columns need to be addressed yesterday. The others fall somewhere in between, and you’re going to have to use your best judgment in how to prioritize them - but at least you now have some rationale behind your decisions.
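If a spreadsheet feels like overkill, the exercise above can be sketched in a few lines of code. The scenarios and ratings below are purely illustrative examples, not recommendations; the convention follows the article’s: 1 = most likely / most severe, 3 = least likely / least severe, so a lower total score means a higher priority.

```python
# Sketch of the likelihood/impact exercise described above.
# Ratings: 1 = most likely / most severe, 3 = least likely / least severe.
# Scenarios and ratings are illustrative placeholders only.

risks = [
    # (scenario, likelihood 1-3, impact 1-3)
    ("Fire", 2, 1),
    ("Flood", 3, 1),
    ("User deletes critical files", 1, 2),
    ("Building access restricted by police action", 3, 3),
]

# Sort so the items rated 1-and-1 float to the top of the priority list;
# ties on total score are broken by likelihood.
for scenario, likelihood, impact in sorted(risks, key=lambda r: (r[1] + r[2], r[1])):
    print(f"{scenario}: likelihood={likelihood}, impact={impact}, score={likelihood + impact}")
```

The output is your starting priority list - the top entries are the ones to address first.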

The one thing you can’t afford to do is to keep putting it off. Hope is not a strategy, nor is it a DR plan.

Seven things you need to do to keep your data safe

Jeremy Moskowitz recently posted a great article entitled Backup Tips for the 21st Century: Backup procedures so easy, your Mom could (and should) do it. This is not directed at IT managers or anyone else who has to manage a business network, although there are certainly some common themes, which we’ll talk about a bit later. Rather, the article is targeted at the average home user – you know, those people who are always asking you to help them with some kind of computer problem, because you “know about computers.”

I’d strongly recommend that you click over and read his entire article, and share it with as many people as possible, because he goes into detail on why you should be doing each of these things. [Editor's note: Unfortunately, Jeremy's article is no longer available.] But just to give you a little taste of it, here are the seven things:

  1. Get an online backup service (e.g., Carbonite.com, Mozy.com, etc.)
  2. Get a full-disk backup program
  3. Back up to an external USB drive (in fact, get two or three – they’re cheap)
  4. Don’t keep all your backups in your house
  5. Rotate between at least two, possibly three USB drives
  6. Keep copies of your original disks, downloadables, keycodes, and drivers
  7. Test your restore procedure

Although he feels strongly that you should do all seven in order to be absolutely safe, he also points out that just doing one of them will make you better off than most people – who don’t do anything at all! (And if you only do one, he suggests #3.)

Why should people do these things? Because, in Jeremy’s words, “DISK DRIVES ALWAYS FAIL. ALWAYS. It’s a guarantee. Even the newest ones with no moving parts. They all fail. Eventually.” And he’s right. The only question is when. I’ve seen drives fail within days of being installed (not many, but some), and drives last for years. But eventually, they will wear out. When they do, the data on them is toast, so you’d better either have a backup or have deep pockets to pay someone who specializes in forensic data recovery, and who may or may not be able to recover your most precious data from the dead drive no matter how much you’re willing to pay.

So, how does this translate to sound business practice? Allow me to paraphrase his seven points, and combine a couple of them:

  • Make sure you’re getting a copy of your data out of the building. Use an on-line service, stream data to a repository at a branch office, or just take a copy home every Friday. But do something to get a copy out of the building.
  • Your backup strategy should encompass both machine images and file/folder based backups. If you lose an entire system, it’s a lot faster to restore from an image than to reinstall the OS from scratch and then restore the data files. On the other hand, if all you need is a single file, or a single email message or mailbox, you don’t want to have to restore an entire image just to get that one thing you need.
  • What he said about disks failing goes double (at least) for tapes. Tapes are far less reliable than hard disks. Their capacity is limited. They wear out quickly. The drives get dirty and are subject to a variety of mechanical problems. Unless you’ve either got an expensive autoloader or a night operator to swap tapes in the middle of the night, if your tape fills up you either cancel the job when you come in the next morning, or you finish the backup during working hours and live with the performance hit of doing that while users are trying to work. That’s why we believe so strongly in disk-to-disk backups.
  • Keep copies of your original disks, downloadables, keycodes, and drivers. (Not much I can add to that point.)
  • Test your restore procedure. (Not much I can add to that either.) If you don’t ever do a test restore, you only think you’re getting good backups. And if you’re not, you won’t know about it until you have a catastrophic failure and find out that your data is gone forever.
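One way to make that last point routine rather than heroic is to automate the comparison. The sketch below checks a restored directory against the original by comparing checksums; the function names and directory layout are hypothetical, and a real test restore should also exercise your actual restore tooling, not just the file comparison.

```python
# Minimal sketch of verifying a test restore: compare checksums of the
# original files against their restored copies. Directory paths are
# hypothetical placeholders.
import hashlib
from pathlib import Path

def file_sha256(path):
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(source_dir, restore_dir):
    """Return a list of source files that are missing or differ after a restore."""
    problems = []
    for src in Path(source_dir).rglob("*"):
        if src.is_file():
            restored = Path(restore_dir) / src.relative_to(source_dir)
            if not restored.is_file() or file_sha256(src) != file_sha256(restored):
                problems.append(str(src))
    return problems
```

An empty result means the restored copies match the originals byte for byte; anything else is exactly the kind of problem you want to find before a real disaster does.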

That’s all for today - while you go read Jeremy’s post in full, I’m going to swing by the local office superstore and pick up a couple more USB hard drives…