Category Archives: Troubleshooting And How-tos

Scott’s Book Arrived!

We are pleased to announce that Scott’s books have arrived! ‘The Business Owner’s Essential Guide to I.T.’ is 217 pages packed full of pertinent information.

For those of you who pre-purchased your books, thank you! Your books have already been signed and shipped, so you should receive them shortly. We hope you enjoy them as much as Scott enjoyed writing for you.

If you haven’t purchased your copy, click here. When you purchase a signed copy from us, all proceeds will be donated to the WA chapter of Mothers Against Drunk Driving (MADD).

XenServer Host Is In Emergency Mode

It’s 8 pm on a Sunday evening, and I get a panicked call from a customer because he cannot connect to his XenServer™ hosts via the XenCenter™ management tool. As near as he can tell, however, all of the hosted virtual machines are up and running and in a healthy state. He has tried to point the XenCenter management tool at another member of the XenServer pool, but without success.

So what happened and how do you fix it?

This situation can happen for several reasons but generally it happens when there are only two servers in the XenServer pool, and the pool master suddenly fails. In essence, what happens is the surviving server (let’s just call it the “slave”) can no longer see its peer, the pool master, so it assumes it has been stranded and goes into emergency mode to protect its own VMs. There are other ways this can happen (an incorrectly configured pool with HA turned on for example), but this is the most common reason that I have personally experienced.

Depending upon the situation, you may not be able to ping the master server because it is actually down, or you may be able to ping the server but find it in an inconsistent, “locked up” state in which it cannot answer requests. If you are able to connect to the console of the master server, either directly with a monitor, keyboard, and mouse (the old-fashioned way) or through a remote management interface (DRAC, iLO, ILOM, etc.), the server may appear to be running, but you may not be able to do anything with it.

At this point you may be thinking, “This is no big deal - just reboot the machine and it will be fine.” If you are lucky that may actually solve the problem, but in many cases it will not. What you might see is that after the master reboots you will be able to connect to the master but you will not see the slave. Or it may be that your master is truly broken and you are not able to simply reboot it due to a system or hardware failure. But, of course, you’ve still got to get your pool online and working again regardless.

During this period of time, if you try to use a tool such as PuTTY to connect to the slave via its management interface, you may not be able to connect to it either. If you try to ping the slave on the management interface you may not get any replies. But if you connect to the console of the slave (again, either the physical console or via a remote management interface) you will probably see that the machine is running. If you look at xsconsole, however, it will appear that the management interface is gone, because there will be no IP address showing. By now you’ll probably be scratching your head, because the strange thing is that all the VMs are running.

So at this point your master appears to be down, or at least impaired, you’ve got no management interface on the slave, your pool is broken and you cannot manage the VMs. So what do you do?

Well, if this happens to you and your VMs are still up and running the first thing you should do is take a deep breath, because more than likely it is not as bad as you might think. XenServer is a robust platform and if the infrastructure is built correctly (and I’m going to quote a customer), “you can really slam the things around and they still work”.

After you take a deep breath and let it out slowly, from the console of the slave server, you will need to access the command line and start by typing:

xe host-is-in-emergency-mode

If the server returns an answer of “True” then you’ve confirmed that the server has gone into emergency mode in order to protect itself and the VMs running on it. (If the server returns an answer of “False” then you can stop reading, because the rest of this post isn’t going to help you.)

Assuming you receive the answer of “True”, the slave server is in emergency mode because it cannot see a master, either because the master is actually down or because the management interfaces are not working. Therefore, the next step is to promote the slave to master to get it out of emergency mode. We do this by typing:

xe pool-emergency-transition-to-master

At this point the slave server should take over as the pool master and the management interface should be available again. Now if you type the xe host-is-in-emergency-mode command again you should get an answer of “False”.
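
If you would rather script the check-and-promote sequence than type it by hand, a minimal sketch might look like the following. It uses only the two xe commands shown above. The function names are made up for illustration, and it lowercases the answer in case your version of xe prints “true” rather than “True”. Treat it as a starting point under those assumptions, not as a tested recovery tool.

```shell
#!/bin/sh
# Sketch only: meant to be run on the console of the surviving pool member.
# The function names are hypothetical; the xe subcommands are the two used in this post.

# Succeeds (exit status 0) if this host reports that it is in emergency mode.
in_emergency_mode() {
    # Lowercase the answer so "True" and "true" are treated alike.
    answer=$(xe host-is-in-emergency-mode | tr '[:upper:]' '[:lower:]')
    [ "$answer" = "true" ]
}

# Promote this host to pool master, but only if it is in emergency mode.
promote_if_stranded() {
    if in_emergency_mode; then
        echo "Host is in emergency mode; promoting it to pool master."
        xe pool-emergency-transition-to-master || return 1
        # Re-check: the host should no longer report emergency mode.
        if in_emergency_mode; then
            echo "Host still reports emergency mode." >&2
            return 1
        fi
        echo "Host is now the pool master."
    else
        echo "Host is not in emergency mode; nothing to do."
    fi
}
```

After loading these definitions into a shell on the slave, running promote_if_stranded performs the same two steps described above and refuses to promote a host that is not actually stranded.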

Now, open XenCenter again. It will first try to connect to the server that was the master, but after it times out it will then attempt to connect to the new master server. Be patient, because eventually it will connect (it may take several seconds) and you will again see your pool and be able to manage your VMs. If some of the VMs are down because they were on the server that failed, you’ll be able to start them on the remaining server (assuming you have shared backend storage and sufficient processor and memory resources).

Now what about the master if it has totally failed? What do I do after I’ve fixed, say, a hardware problem in order to return it to my pool?

If the following two conditions are true:

  1. You are using shared storage so that your VMs are not stored on the XenServer local drives, and
  2. You have built your XenServers with HBAs (Fibre Channel or iSCSI) rather than using Open iSCSI, which means the connectivity information to your backend SAN will be stored within the HBA,

…then it may be much simpler and quicker just to reload the XenServer operating system. (If you do not have shared backend storage, which means your VMs are on local storage, DO NOT DO THIS.) I can rebuild my XenServers from scratch in about 20 to 30 minutes and have them back in the pool and running.

If either of those two conditions is not true then, depending upon your situation, recovery may be significantly more difficult. It could be as simple as resetting your Open iSCSI settings and connecting back to your SAN (still easy but takes more time to accomplish) or it could be as painful as rebuilding your VMs because you lost your server drives. (OUCH!)

Real-world example: I recently had a NIC fail on the motherboard of my master server. Of course, since the NIC was on the motherboard, the whole motherboard had to be replaced, which significantly changed the hardware configuration for that server.

In this case, when I brought that XenServer back online it still had all the information about the old NICs showing in XenCenter, plus it had all the new NICs from the new hardware. Yes, I could have used some PIF forget commands to remove the NICs that no longer existed and then reconfigured everything, but that would have taken me a bit of time to straighten out. Since I had iSCSI HBAs attached to a DataCore SAN (great product, by the way) for shared storage, all I did was reload XenServer on that machine, modify the multipath-enabled.conf file (that is a different blog topic for another day), and rejoin the server to the pool. Because the HBAs already had all the iSCSI information saved in the card, the storage automatically reconnected all the LUNs, the network interfaces took the configuration of the pool, and I was back online and running in less than 30 minutes.
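
For comparison, the PIF cleanup route that was skipped here might be sketched roughly like this. The function names and the APPLY switch are hypothetical, and the sketch assumes that xe pif-list accepts a host-uuid filter with a params selector and that xe pif-forget takes the uuid of the PIF; check the xe help output on your own version before relying on it. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only. Function names and the APPLY variable are hypothetical.

# List the PIFs (physical interfaces) that XenServer has recorded for a host.
# $1 is the UUID of the host.
list_host_pifs() {
    xe pif-list host-uuid="$1" params=uuid,device,MAC
}

# Forget each PIF UUID passed as an argument.
# Dry run by default; set APPLY=1 to actually run xe pif-forget.
forget_pifs() {
    for pif in "$@"; do
        if [ "${APPLY:-0}" = "1" ]; then
            xe pif-forget uuid="$pif"
        else
            echo "would run: xe pif-forget uuid=$pif"
        fi
    done
}
```

You would first run list_host_pifs with the host's UUID, pick out the entries that belong to NICs that no longer exist, and then pass only those PIF UUIDs to forget_pifs. Keeping the dry run as the default makes it harder to forget a live interface by accident.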

After you repair the machine that failed and get it back online, you may want it to once again be the master server. To do this, type:

xe host-list

You will get a list of available servers with their UUIDs. Record the UUID of the server that you want to designate as the new master and then type:

xe pool-designate-new-master host-uuid=[the uuid of the host you want]

After you type this your pool will again disappear from XenCenter, but after about 20 – 30 seconds (be patient) it will reappear with the new server as the master. Your pool should now be healthy, and you should again be able to manage servers as normal.
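
The lookup and the designation can also be combined into a small helper. This is only a sketch: the function names are hypothetical, and it assumes that the name-label filter, the params selector, and the --minimal flag are available in your version of xe.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical.

# Print the UUID of the host whose name-label matches $1.
# With --minimal, xe prints just the value (comma-separated if several match).
host_uuid_by_name() {
    xe host-list name-label="$1" params=uuid --minimal
}

# Designate the named host as the new pool master.
designate_master_by_name() {
    uuid=$(host_uuid_by_name "$1")
    case "$uuid" in
        "")
            echo "No host named '$1' found in this pool." >&2
            return 1 ;;
        *,*)
            echo "More than one host is named '$1'; use the UUID instead." >&2
            return 1 ;;
    esac
    echo "Designating $1 ($uuid) as the new pool master."
    xe pool-designate-new-master host-uuid="$uuid"
}
```

For example, designate_master_by_name xen01 (where xen01 is a made-up host name) would look up the UUID and issue the same pool-designate-new-master command shown above. Expect the same brief disappearance of the pool from XenCenter afterward.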

Ingram-Micro Cloud Summit 2014

On Monday afternoon, I walked by the beautiful three-story atrium and into the conference center attached to the Westin Diplomat Hotel in Hollywood, FL. It was torturous. After experiencing a March in Seattle that had 3x the normal amount of rain, I was so excited to see the beautiful blue sky and feel the 70-degree temperatures. And it was all just a few feet away from me as I walked down the long hallway to the Conference Center.

Minutes later, I headed into my first session, titled “Effective Executive Leadership Skills” and led by Gary Beechum of SPC International. If you haven’t met Gary, you really should. He’s no-nonsense, direct, inspirational and articulate. He often references his time in the military and even uses some of the tools he picked up while in the Army in his presentation. I definitely learned some things to bring back to our Leadership Team. One of the best parts of his presentation was the 14 Traits of Leaders.

At the reception that followed our classroom sessions I met a ton of new people. Many were from across the country and wanted to work with a firm like VirtualQube, and some wanted to partner with us to deliver new bundles to customers. Our story really resonated with the attendees. There are a number of MSPs looking for a white-labeled cloud offering, and people would actually overhear my conversation and ask me for a card. I think one of the great benefits of this conference was that, because it was focused on “cloud,” the MSPs attending already had some idea of how they were going to deliver cloud services. Many had come to the conclusion that they would rather hire a solid cloud vendor than reinvent the wheel and build their own hardware. Our story was like music to their ears. And we’ve even written about it recently here.

All-in-all, the first day of the conference has been so valuable that I’m excited not only for the rest of the conference, but for working more closely with Ingram Micro over the coming months.

Karl Burns

First Look at Citrix Access Gateway 5.0

At the recent Synergy Berlin conference, Citrix announced Access Gateway 5.0. We have confirmed that, as of now, 5.0 is available for download from the Citrix download site - both as an update for the CAG 2010 hardware appliance, and in Access Gateway VPX (virtual appliance) format. (Note: you will need a “mycitrix” account to download the software.)

One of the things I really like about 5.0 is that it now supports running two 2010 appliances in an active/passive HA configuration with automatic failover. This was a serious shortcoming of the original CAG appliance.

In earlier versions, if you were using the Access Gateway as a general-purpose SSL VPN, you could configure HA of a sort within the Access Gateway client plug-in, by defining primary and secondary Access Gateways for the client to connect to. However, if you were simply running the Access Gateway in “CSG replacement” mode to connect to a XenApp farm without requiring your users to first establish an SSL VPN connection, you had no ability to provide automatic failover unless you had some kind of network load balancing device in front of multiple Access Gateway appliances. That meant, of course, that to avoid having the load balancing device become a single point of failure, you had to have some kind of HA functionality there as well. By the time you were done, the price tag had climbed to a level that just didn’t make sense for some smaller deployments.

NOTE: This specifically applies to the 2010 appliance. The CAG Enterprise models, because they are built on the NetScaler hardware platform, have always supported operation as HA pairs with automatic failover. Of course, a CAG MPX 5500 also carries a $9,000 list price, compared to $3,500 for a CAG 2010.

Now, with the release of 5.0, you can purchase two 2010 appliances (which, at $7,000 list for the pair, will cost you less than a single MPX 5500), and run them as an active/passive HA pair. Thank you very much, Citrix CAG team!

Here are a couple of videos from Citrix TV. The first deals with how to upgrade an existing CAG 2010 to the 5.0 software using a USB flash drive, and then set up the basic system parameters:

The second video shows how to configure a pair of appliances for active/passive failover:

You can access several other “how-to” videos by going to http://www.citrix.com/tv, and searching on “Access Gateway 5.0.”

A Better Way to Back Up Your Data

Moose Logic has been building and supporting networks for a long time. And during most of that time we’ve had a real love-hate relationship with most of the backup technologies we’ve implemented and/or recommended.

Tape backups - although they are arguably the best technology for long-term archival storage - are a pain to manage. Tapes wear out. Tape drives get dirty. People just don’t do test restores as often as they should. As a result, all too often, the first time you realize that you’ve got a problem with your backups is when you have a data loss, try to restore from your backups, and find out that they’re no good.

Add to that the astronomical growth in storage capacity, meaning that all the data you need to back up often won’t fit on one tape any more. So, unless you have someone working the night shift who can swap out the tape when it gets full, you’re faced with…

  • Buying multiple tape drives, which typically means you’re going to spend more on your backup software. And if your servers are virtualized, where are you going to install those tape drives?
  • Buying a tape library (a.k.a. autoloader), which can also get expensive.
  • Changing the tape when you come in the next morning, which means that your network performance suffers because you’re trying to finish the backup job(s) while people are trying to get work done.

Then there’s the issue of getting a copy of your data out of the building. Typically, that’s done by having multiple sets of tapes, and a designated employee who takes one set home every Friday and brings the other set in. If s/he remembers. Or isn’t sick or on vacation.

Backing up to external hard drives is a reasonable alternative for some. It solves the capacity issue in most cases. But over the years, we’ve seen reliability issues with some manufacturers’ units. We’ve uncovered nagging little issues, such as units that don’t automatically come back online after a power interruption. And they’re not necessarily the best for long-term archival storage, unless you keep them powered on - or at least power them on once in a while - because hard disks that just sit for long periods of time may develop issues with the lubrication in their bearings and not want to spin back up.

But we’ve finally found an approach that we really, really like. One that, as one of our engineers said in an internal email thread, we actually enjoy managing. In fact, we like it so much we built a backup appliance around it. It’s Microsoft’s System Center Data Protection Manager (SCDPM).

In this installment of the Moose Logic Video Series, our own Scott Gorcester gives you a quick overview of SCDPM 2010:



For more detail on how it works, check out the description of our MooseSentry™ backup appliance.