Last week a client suffered a major hardware failure on a storage system and lost several virtual machines, including one which hosted an application to compute salaries. The failure came exactly a week before pay day. With over 600 sales executives, supervisors and managers expecting their salaries, losing the application at that very moment can only be taken as further proof that Murphy’s Law continues to rule our world.
The software company which built the app didn’t have the resources or knowledge to restore it. So even though I quit working as the salary app architect more than 4 years ago and hadn’t seen them since, they called me desperately asking for help. They wanted me to recover the system on a new server so they could compute the salaries ASAP.
At first I devoted myself to assessing the situation. Obviously, they didn’t have VM backups, or they wouldn’t have called. They lost everything the VM had: software applications, configuration files, Windows registry keys, logs and also most database files and backups. The application created database snapshots automatically, but it was configured to leave them on the same disk as the database itself. And obviously, nobody had thought of moving them elsewhere. This started to sound like quite a challenge.
The good news was that they had a “backup strategy.” Ah, I thought, maybe they copied the database snapshots somewhere else after all. Nope. This so-called “backup strategy” copied only one of the three database snapshots the application created and nothing else. Someone discovered, however, that they had backed up the VM filesystem a year ago, and thus one-year-old copies of the other two snapshots appeared. At least there was reason to hope the system could be restored, but it was a long shot.
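In hindsight, even a tiny check run from another machine would have flagged the gap long before the disk failed. What follows is only a minimal sketch, not anything the client actually had: the function name `find_backup_gaps`, the snapshot file names and the one-day freshness threshold are all hypothetical, chosen for illustration. It reports which of the expected snapshots are missing from the backup location or have not been refreshed recently.

```python
import time
from pathlib import Path


def find_backup_gaps(backup_dir, expected_names, max_age_days=1, now=None):
    """Return {name: reason} for each expected snapshot that is missing or stale.

    backup_dir     -- directory on a *different* disk or host than the database
    expected_names -- file names of every snapshot the app produces (hypothetical)
    max_age_days   -- how old a copy may be before it counts as stale (assumed)
    now            -- reference time in epoch seconds; defaults to the current time
    """
    now = time.time() if now is None else now
    max_age_seconds = max_age_days * 86400
    gaps = {}
    for name in expected_names:
        path = Path(backup_dir) / name
        if not path.is_file():
            # The snapshot was never copied to the backup location.
            gaps[name] = "missing"
        elif now - path.stat().st_mtime > max_age_seconds:
            # A copy exists, but it is older than the allowed age.
            gaps[name] = "stale"
    return gaps
```

Called with the names of all three snapshots, a check like this would have reported two of them as missing on day one, which is exactly the situation we only discovered after the failure.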
This was a complex system involving three databases with history, dozens of configuration files, many add-ons and auxiliary systems. I thought I wasn’t prepared to rise to the challenge. The 4 years since I worked on the system had washed away many memories, including basic knowledge about how the software worked. Add to that the fact that a handful of changes had been introduced since I left, some of them independently of the software company, making the lost application different from the only working copy, which remained at the software company. And there was the fact that the client didn’t use the software the way it was designed to be used, so the recovered data could be littered with errors.
Maybe I would have to dive into the code to produce a new working copy. Maybe I would find the old database backups impossible to work with (they didn’t remember the exact timing of some of the changes they had introduced, so maybe the app expected a schema different from the one the backups showed). The picture of me trying to confront the challenge and being the only hope of paying the staff in time made me feel ill at ease. I didn’t know how to start. I couldn’t think of a plan. Worst of all, I might spend days, weeks or even months trying to repair it to no avail. I felt paralyzed and I seriously considered turning down the job.
But then I thought: well, what the hell. Let’s try, it might be fun and I have nothing to lose. (It turns out I did have something to lose: a beautiful holiday. Then again, they were willing to pay for it.)
I was completely honest with the client. I said I remembered nothing about the system, and that before trying to recover it I would have to study it. (“Nothing?” “Yes, nothing. I remember nothing at all.”) I also said I would only try to recover the software and that, from what I saw, the probability of success was below 50%. Their request to reinstall the software on their machine was out of the question: the pressing issue was paying salaries, not installing the app. So I proposed a time-limited plan during which I would only assess, study and then try to run the software company’s mirror app with their data and the outdated database backups. I also asked that they pay me regardless of the outcome.
The proposed course of action lifted a huge weight off me, especially the fact that it was time-limited. I worked towards the goal of making the software run on the software company’s premises by Friday. It was Tuesday morning. If I failed to compute salaries, I would at least be able to present a course of action to recover the system. Divide and conquer relieved my anxiety and enabled me to work towards an achievable goal, however daunting the big picture looked.
The first thing I did was to try and understand how the system worked. I remembered having written a manual, so I asked for it. The existence of an application manual was a complete surprise to them: many times they had wondered to no avail how the system worked, and the answer had been right there all along. First win. Using the instruction manual I was able to make the mirror app work, although with test data. Second win.
A lot of things came back to me during these first steps. After that, I devoted myself to trying to use the app as a black box with the client’s data. Given the complexity of the system and the limited time I had, I promised myself not to delve into the code but to try and make it work by changing input parameters and configuration settings. If I had to step into the code, I would leave that to a later stage of the recovery process. This decision turned out to be crucial to my success.
Although with errors, by Wednesday afternoon I had the system working with the client’s input data and old database snapshots. Third win.
Long story short, after making a couple of changes I had the system working by Thursday morning and the client could pay their employees. With the comfort of working towards an achievable goal and the commitment not to take “stage 2” steps (i.e., looking at or changing the source code), I was able to succeed.
Divide and conquer proved to be a good strategy yet again.