Tags: ris, thoughts
Disaster Planning… how much is enough? Part 3
We are fortunate (or unfortunate, depending on the day!) to have a RIS developed in house. This affords us some great flexibility, but it can also cause much stress. Without a vendor to fall back on, the buck stops with us when something goes south. Happily, that doesn’t happen often at all, but there is one scenario that allows a different look at disaster preparedness.
There was a backend upgrade that we were very excited to do, as our tests indicated a fantastic improvement in speed for our users. We tested and tested, trying to account for every scenario, as this was a rather large backend change. Our biggest concern was that once users started using the new system and inputting data, rolling back would become exceedingly difficult. If it was going to break, we wanted it to break right away.
The day of reckoning came, and we crossed our fingers. There were no functional differences, so if an issue was going to occur, it was likely to be something quite bizarre. A couple of hours in, we came face to face with our bizarre tormentor. All of a sudden, mid morning, everything ground to a halt. Our super fast upgrade was not so fast after all.
Let me break for an interesting aside. RIS vs PACS: which is more ‘critical’ to business continuity? Looking at this from a workflow perspective (ignoring which system contains data that is more important), our RIS outage highlighted an answer very quickly. We have faced PACS outages, and while they were certainly disruptive and impactful to the business, staff was able to keep moving… slowly, sure, but workflow remained intact. Some credit goes to the distributed nature of Intelepacs. There are single points of failure, but they are few. Modalities could still take images. Rads could view images on modalities, and dictations could happen with Dictaphones. Certainly none of this is ideal or enjoyable for those impacted, but work did not stop dead. With the RIS down, it did. The RIS moves information around and automates so many processes that, without it, staff simply didn’t know what to do next. Support staff can manually push items through the system, but doing so is very, very slow for a small team. So while workflow didn’t completely stop, it was practically at a standstill.
The slowness issue we faced was in fact a direct result of the speed gain we enjoyed. As it turns out, we had a query that was very poorly constructed. In our ‘slow’ system prior to the upgrade, it never had the resources to get out of hand, so we never noticed it. Under our new system, so many instances of this particular query were able to run at once that our server was starved of resources. Restarting it worked for a few moments, but we were quickly down again.
This series of blog posts is about disaster preparedness, not about programming stories, so how does this fit in? The ‘bug’ in our system affected one specific screen. If we could avoid using that screen, everything would be fine. However, the way the system was designed, that screen was needed by everyone to move the study to the next step. It became painfully obvious to us that if only we had put some of this functionality on other screens, work could have still continued. We managed to find and correct this rogue query in a quick, but still unacceptable, amount of time. The next thing we did was add that functionality to the other areas. I’m sure some people would see this as potential code bloat. For us, it gave a measure of comfort that should something similar happen in the future, we’d be ok. It also made a lot of sense from a usability standpoint; we had just ignored it up until that point while we tackled other issues.
The moral this time is that sometimes you can’t possibly be prepared. While technically there are ways we could have stress tested the system prior to rollout, they just aren’t feasible in our small shop. Having been through this experience once, it certainly opened our eyes to how we design in the future. We try to be more aware of having multiple methods to achieve a result. It also highlights the need to revisit systems periodically and check their performance. Find a baseline, and put checks in place to notify someone if that baseline is exceeded. You may not be able to predict what segment of a system may decide to break on you, but after you have been through it a few times, you may surprise yourself with how often you are indeed able to.
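For anyone wondering what a baseline check might look like in practice, here is a minimal sketch in Python. It is not what our RIS actually runs; the function names, the sample timings, and the "three standard deviations" tolerance are all assumptions chosen for illustration. The idea is simply to summarize known-good response times and call a notifier when a new measurement lands well above them.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize known-good response times (seconds) as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def exceeds_baseline(value, baseline, tolerance=3.0):
    """True if value is more than `tolerance` standard deviations above the mean.

    The tolerance of 3.0 is an arbitrary illustrative default, not a recommendation.
    """
    average, spread = baseline
    return value > average + tolerance * spread


def check_and_notify(value, baseline, notify, tolerance=3.0):
    """Call `notify` with a message if the measurement exceeds the baseline."""
    if exceeds_baseline(value, baseline, tolerance):
        notify(f"Response time {value:.2f}s is above the baseline")
        return True
    return False


# Hypothetical timings for one screen, collected while the system was healthy.
healthy_timings = [1.0, 1.2, 0.9, 1.1, 1.0]
baseline = build_baseline(healthy_timings)

# A list stands in for email, paging, or whatever actually reaches a person.
alerts = []
check_and_notify(1.1, baseline, alerts.append)  # within baseline, no alert
check_and_notify(5.0, baseline, alerts.append)  # well above baseline, alert
```

In a real setup the timings would come from wherever you already log query or page durations, and the notifier would be something that actually wakes someone up. The useful part is less the arithmetic than the habit of deciding in advance what ‘normal’ looks like.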
You can never be too prepared…








