Tags: hardware, thoughts
Disaster Planning… how much is enough?
After a flurry of activity over the past couple of years, we are now taking a step back to see where to refocus our energy. Sometimes, when things are growing and changing at a quick pace, it is easy to push aside important, but not seemingly urgent, matters. One such area that has our renewed focus is our disaster preparedness.
I imagine we are in the same boat as other medium-sized businesses, no matter the industry. We create and house lots of important data, across various systems, and it is consumed by many staff members in multiple locations. The volume of our data may be higher than average, but there are certainly industries that have much larger numbers. Our data and our reliance on technology are critical to our business, but the same could be said for a multitude of organizations these days.
At the outset of our various initiatives over the past while, we always made sure to cover the basics. Is hardware redundant? Do we have backups? Is data stored in an offsite location? And by and large, it has been sufficient. We have had emergencies, we have had outages, and we have even ‘lost’ lots of data. However, each time, within a reasonable (from my perspective anyway!) timeframe, we were able to recover and resume normal business workflow. We have never actually lost any data; it has always been recoverable from a redundant location or backup.
So, if we have had such success, why give it any further thought? Isn’t the status quo good enough? Well, no. Any outage is still a disruption, and the more we can do to prevent outages, the better off the business is.
The phrase “hindsight is 20/20” rings so true when looking at disaster recovery and prevention. Over the coming posts, I will walk through real incidents we faced, how we dealt with them at the time, and how we could handle (or prevent!) them in the future.
Most recently, we had a hard drive failure in one of our clinic PACS servers. Truth be told, we have had numerous hard drive failures. However, in all other instances, the business was not impacted at all. We planned for this, and as such have RAID in all of our servers. Every other time, we simply responded to pre-programmed alerts (those are key), and the system recovered to a happy state silently in the background, without affecting staff in the slightest.
This most recent hard drive failure happened to be accompanied by a malfunction of a RAID controller, so we were in fact down. Staff at that location could not process patients very well at all. How did this play out? Staff called us, complaining of a seemingly minor issue. We spent time chasing that issue, which eventually led us to the actual problem of the drive/controller failure. We dispatched someone to the site, and they manually switched all the equipment over to a backup PACS server. Thanks to Intelerad’s architecture, this redundancy is very straightforward. The process took time, though, and in the meantime staff were using some documented (and some creative off-the-cuff) processes to keep workflow moving, albeit at a slower pace. While the switchover was under way, staff also used a spare drive to replace the dead one, allowing our PACS vendor to get the server itself back in service.
What can this experience teach us? Well, for one, there is no foolproof plan. We thought we had a dead hard drive covered, but had not anticipated a controller failing along with it. Secondly, we were only informed because staff noticed issues… the server itself was still running (not properly, mind you), so our usual alerts did not fire. Actually, one alert did trigger, but it was incorrectly configured as low priority, and we did not notice it in time. Thirdly, we relied on a manual switchover process. While portions of it could be (and now are) documented and programmed so staff can switch to the backup server with just a button press, there is still equipment that does not have that ability and requires a technical resource to re-program. That said, technology such as load balancers is available and could (possibly) have greatly assisted the failover.
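To make the alert priority lesson concrete, here is a minimal sketch of the kind of check that might have caught our misconfigured alert. It is not how our monitoring actually works: the event names, the host name, and the priority levels are all assumptions made up for illustration, and real RAID controllers and monitoring tools use their own terms. The idea is simply that certain events get a priority floor, no matter what the configuration says.

```python
from dataclasses import dataclass

# Hypothetical event names, invented for this sketch.
CRITICAL_EVENTS = {"raid_controller_fault", "array_offline"}
WARNING_EVENTS = {"drive_failed", "array_degraded"}

PRIORITY_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Alert:
    host: str
    event: str
    configured_priority: str  # what the monitoring config says


def minimum_priority(event: str) -> str:
    """Lowest priority this kind of event should ever be given."""
    if event in CRITICAL_EVENTS:
        return "high"
    if event in WARNING_EVENTS:
        return "medium"
    return "low"


def effective_priority(alert: Alert) -> str:
    """Raise the configured priority to the floor for the event, if needed."""
    floor = minimum_priority(alert.event)
    if PRIORITY_ORDER[alert.configured_priority] < PRIORITY_ORDER[floor]:
        return floor
    return alert.configured_priority


def audit(alerts):
    """Return the alerts whose configured priority is below their floor."""
    return [a for a in alerts if effective_priority(a) != a.configured_priority]
```

Running something like `audit` over the alert configuration from time to time would list every alert that is set lower than it should be, which is exactly the kind of mistake that is hard to spot until the day it matters.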
While we were in a failed-over state, waiting for the server to become operational again, workflow was close to normal, if a tad slower. If we had not had a spare part on hand, or if something more drastic had failed on the server, we would have had to rely on hardware support to provide replacement equipment, which could have taken days, with still more time needed for Intelerad to configure the replacement server. This raises the question: should we have had a spare, configured server ready to go? And that, appropriately enough, takes me back to the original question… how much is enough?








