On Friday, May 9, I went through the most rigorous training exercise of my time learning the art of systems administration. By “training exercise” I mean that our entire VM storage back-end fell over and started convulsing on the server room floor, and I had no clue what to do or even where to start.
It wasn’t one server, but all four of the servers falling over and failing to come back up. Ultimately, I believe it was caused by an Ubuntu update that made the drive mappings for our DRBD system disappear and never come back where we were expecting them.
I’m not going to get into the details of what happened, how long it took (too long), or what we ultimately ended up doing (a band-aid that is working quite well) … but I am going to talk more generally about what I learned from that experience.
- Always apply updates to one server at a time. Always. No exceptions. I had become complacent over months of periodically running updates on the storage servers and always having the system survive unscathed. All it took was one bad update, and now I am more paranoid about updates than ever. Especially when you have a cluster: apply updates to one server, verify that it is working, move the resources back over, and only then update the next one (see the sketch after this list). It is common sense, I knew better, and now I’ll learn from my mistake of ignoring it.
- Be conservative with your infrastructure. I might change my tune on this one again, but you want to push ahead on the client side of things and be as conservative as possible with whatever is backing the client. Being on the cutting edge is going to make you bleed, and trying to do things that others are not is a recipe for a lot of pain and suffering. You want the infrastructure services to be working all of the time, and to push ahead only on what the clients are using around that infrastructure. That means being conservative in the face of all the bells and whistles being tossed around.
- Always know more about your infrastructure than you think you need to. I had not spent the time I needed to learn how our storage system worked at the very lowest level, and we paid the price with a day of lost productivity. If you are going to do something in-house, you need to be willing to spend the time learning what you are doing so that you are comfortable holding your own eggs in your own hands at any time.
- The Linux community is diverse … and the DRBD/HA community is amazing! I have to give credit to two people for walking me through what to do that day. My former co-worker, @acspike, came up and helped as much as he could (which was a ton), and ultimately helped me get the system back up and running. However, @digimer, in the DRBD IRC channel, was the one willing to spend a good two hours walking me through testing which systems were causing the issues and calming me down. Before that I was sick to my stomach; after, I could at least keep down my lunch. Without those two people I’d probably still be up there trying to get things back up and running.
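
To make that first lesson concrete, here is a minimal sketch of the kind of one-node-at-a-time update loop I have in mind. The node names, the update command, and the health check are all hypothetical placeholders, not our actual setup; the point is only the shape of it: update one node, prove it is healthy, and stop immediately if anything looks wrong.

```python
#!/usr/bin/env python3
# Hypothetical rolling-update sketch: patch one cluster node at a time and
# stop the moment anything fails, instead of updating everything at once.
import subprocess
import sys

NODES = ["storage1", "storage2"]  # placeholder hostnames, not our real nodes
UPDATE_CMD = "sudo apt-get update && sudo apt-get -y upgrade"
HEALTH_CMD = "cat /proc/drbd"     # placeholder check; use whatever proves the node is healthy

def run_on(node, command):
    """Run a shell command on a node over ssh and return its exit status."""
    return subprocess.call(["ssh", node, command])

for node in NODES:
    print(f"--- updating {node} ---")
    if run_on(node, UPDATE_CMD) != 0:
        sys.exit(f"update failed on {node}; not touching the next node")
    if run_on(node, HEALTH_CMD) != 0:
        sys.exit(f"{node} does not look healthy after the update; stopping here")
    print(f"{node} updated and verified; moving on")
```

Nothing fancy, but it encodes the pause-and-verify step I skipped.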
The main takeaway is that it was a good learning experience, even if some of it was stuff I already knew and should have been doing all along. Hopefully I don’t have to do it again.