Dealing with Major Incidents – Sensible Service Management Part 5
Blog post by ITskeptic
Up to this point in the Sensible Service Management Series, we have looked at service management in general and then at how to respond to incidents and requests. That responding activity was all about the day-to-day activities when things are close to normal. But sometimes things get a long way from normal. The proverbial hits the fan. Something is so catastrophically broken that we need to drop normal processing and switch to a crisis mode. This is known – a bit prosaically – as a Major Incident.
In some schools of thought – me for instance – you can also have a Major Problem. They are different paths to the same point: we respond in the same way.
The first time we have a Major Incident, the response is usually to panic. Or at least, a lot of turmoil ensues, as people rush around working out what to do and who should do it. We get things fixed quicker and we keep our customers happier if we have planned how to deal with Major Incidents. Yes, we need to think about Major Incident Management (MIM) before catastrophe strikes.
Setting the scope for MIM – A Major Incident is more than a high severity incident. We give a high severity incident our highest priority and put our best people on it, yet we still use the normal procedures to deal with it. A Major Incident is also something less than a total disaster, where our systems are wiped out and we have to go to our disaster recovery (DR) plan. A Major Incident sits somewhere in between. Sometimes a high severity incident turns into a Major Incident, when we realize the impact is worse than we thought or when it takes too long to fix and there is no sign of a resolution. And sometimes a Major Incident turns out to be so bad that we give up trying to deal with it and declare a disaster. So there’s a ladder of incidents with increasing severity:
|High Severity Incident|Normal procedure with top priority|
|Major Incident|MIM procedure|
|Disaster|DR plan|
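For readers who think in code, the ladder can be sketched as a small ordered type. This is only an illustration: the names `Level`, `RESPONSE` and `escalate` are made up for this example and do not come from any tool or standard.

```python
from enum import IntEnum


class Level(IntEnum):
    """Rungs of the ladder, ordered by increasing severity."""
    HIGH_SEVERITY_INCIDENT = 1
    MAJOR_INCIDENT = 2
    DISASTER = 3


# Which kind of response applies at each rung, as described in the text.
RESPONSE = {
    Level.HIGH_SEVERITY_INCIDENT: "Normal procedure with top priority",
    Level.MAJOR_INCIDENT: "MIM procedure",
    Level.DISASTER: "DR plan",
}


def escalate(level: Level) -> Level:
    """Move one rung up the ladder; a disaster is already the top rung."""
    if level is Level.DISASTER:
        return level
    return Level(level + 1)


if __name__ == "__main__":
    current = Level.HIGH_SEVERITY_INCIDENT
    current = escalate(current)  # impact is worse than we first thought
    print(current.name, "->", RESPONSE[current])
```

The point of the ordering is that escalation only ever moves one rung at a time, and a human decides when to take that step.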
Notice that I refer to a DR plan, not a DR procedure. In a disaster, we plan the things that need to be done. For a rebuild of our systems, we often define a series of steps that have to happen in a certain order, but there isn’t so much a procedure for dealing with a disaster (do this, then this, then this) as lists and policies. We have lists of the systems we need to restore first, lists of contacts and so on. We have policy that sets the rules and bounds and trigger points. But there is little point in trying to create a defined procedure because the circumstances of every disaster differ so much. We are going to have to make it up to suit the situation. The DR Plan helps us be as prepared as possible.
MIM sits somewhere in between a structured incident procedure and the guidance of a DR plan. In a Major Incident, we can say what the initial procedure will be, but beyond that, it will again vary depending on the circumstances. So, we also need to have a MIM Plan to provide as much guidance and information as possible. The MIM Plan should define the following:
Some organizations put a lot of effort into defining what constitutes a major incident. It is futile to define the exact measures of one (more than 90% of this; three or more of those). Trust me, the real world will invent a new crisis that is outside your definition yet clearly a Major Incident. It is worthwhile defining guidelines or principles for recognizing a Major Incident, but a Major Incident is like art: hard to define but you know it when you see it. And like art, you need a human to recognize it. So the key thing to define is not what a Major Incident is but who gets to declare one. This should be equally broad: any manager above a certain level, say. It must be easy to find a qualifying person in the heat of a crisis.
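If your service desk tooling needs a rule for this, it can be as simple as the sketch below. Notice what is deliberately absent: there are no thresholds on outage size or user counts, only a check on who is making the call. The tier numbers, the `MIN_DECLARING_TIER` threshold and the names are hypothetical; substitute whatever your own organization chart uses.

```python
from dataclasses import dataclass

# Hypothetical threshold: management tiers counted upward, 0 = not a manager.
MIN_DECLARING_TIER = 2


@dataclass(frozen=True)
class StaffMember:
    name: str
    management_tier: int


def may_declare_major_incident(person: StaffMember) -> bool:
    """Any manager at or above the threshold may declare a Major Incident."""
    return person.management_tier >= MIN_DECLARING_TIER


def find_declarers(available: list[StaffMember]) -> list[StaffMember]:
    """List everyone currently reachable who is allowed to declare."""
    return [p for p in available if may_declare_major_incident(p)]


if __name__ == "__main__":
    reachable = [
        StaffMember("Service desk analyst", 0),
        StaffMember("Team lead", 1),
        StaffMember("Operations manager", 2),
    ]
    print([p.name for p in find_declarers(reachable)])
```

Keeping the rule this broad is what makes it easy to find a qualifying person in the middle of the night.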
Other policy sets the principles we work by in addressing an incident, the goals, the rules and bounds, the responsibilities.
There is a Major Incident Manager role. You need several designated people willing to take on this role in a crisis. They need not be the same people as your day-to-day incident process manager. Choose people who are natural leaders, strong communicators and good under pressure.
Other roles include a Communications Manager, to deal with internal and especially external communication, and a Resolution Manager who directs, understands and coordinates the technical people fixing the problem. You need several designated people willing to take on those roles too.
The first few steps in dealing with a Major Incident will be the same in almost all cases. By having these defined, we bootstrap ourselves up to a state where we can start writing the rest of the procedure on the fly. These steps are listed in a checklist I posted at http://www.basicsm.com/declare-major-incident-or-problem.
Once we have a Major Incident Manager in control of a center of operations, we are in a position to plan the next steps to suit the situation and, we hope, resolve the incident.
Those steps will include some standard ones which should also be documented in the MIM Plan:
- The Major Incident Manager reconfirms that there is indeed a Major Incident
- The Communications Manager communicates the schedule and methodology for future updates to all parties
- The Communications Manager notifies all business owners of the service(s) and other stakeholders listed in the Communication Plan
- A Resolver Team or teams are assembled by the Resolution Manager. The Resolver Teams agree to their communication schedules and protocols before getting to work.
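The standard steps above lend themselves to a checklist that records who owns each step. Here is a minimal sketch, assuming you want to track them in a script rather than on paper; the `Step` type and the function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    owner: str
    action: str
    done: bool = False


def standard_steps() -> list[Step]:
    """The standard steps from the MIM Plan, in the order listed."""
    return [
        Step("Major Incident Manager",
             "Reconfirm that there is a Major Incident"),
        Step("Communications Manager",
             "Communicate the schedule and method for future updates"),
        Step("Communications Manager",
             "Notify business owners and other stakeholders"),
        Step("Resolution Manager",
             "Assemble Resolver Team(s) and agree communication protocols"),
    ]


def outstanding(steps: list[Step], owner: str) -> list[str]:
    """Return the actions a given role still has to complete."""
    return [s.action for s in steps if s.owner == owner and not s.done]


if __name__ == "__main__":
    steps = standard_steps()
    steps[0].done = True  # the Major Incident Manager has reconfirmed
    for role in ("Communications Manager", "Resolution Manager"):
        print(role, "->", outstanding(steps, role))
```

A paper checklist does the same job. What matters is that each step has exactly one named owner.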
The Communication Plan should describe who needs to know what, how, and how often – from customers to non-involved internal staff.
It should describe what happens at certain points:
- Regular updates from the Resolution Manager to the Major Incident Manager. Nobody except the Resolution Manager communicates with the Resolver Teams. Leave them alone to get the job done.
- Regular updates from the Communications Manager to stakeholders
- Meetings of the Major Incident Manager and Communications Manager with senior internal and customer management and other stakeholders
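The "who, how, and how often" part of a Communication Plan can be written down as data, which makes it easy to work out when the next update is due. The sketch below is one way to do that. The channels and intervals are illustrative assumptions only, not recommendations, and `UpdateRule` and `next_update_times` are names made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class UpdateRule:
    sender: str
    audience: str
    channel: str          # assumed channel, for illustration
    interval: timedelta   # assumed frequency, for illustration


COMMUNICATION_PLAN = [
    UpdateRule("Resolution Manager", "Major Incident Manager",
               "war room briefing", timedelta(minutes=30)),
    UpdateRule("Communications Manager", "stakeholders",
               "email", timedelta(hours=1)),
]


def next_update_times(declared_at: datetime, now: datetime) -> dict[str, datetime]:
    """Return, per audience, the first scheduled update strictly after `now`."""
    result = {}
    for rule in COMMUNICATION_PLAN:
        # Whole intervals already elapsed since the incident was declared.
        elapsed_periods = (now - declared_at) // rule.interval
        result[rule.audience] = declared_at + (elapsed_periods + 1) * rule.interval
    return result


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 9, 0)
    for audience, due in next_update_times(declared, datetime(2024, 1, 1, 9, 40)).items():
        print(audience, "next update due at", due.strftime("%H:%M"))
```

Anchoring the schedule to the time of declaration, rather than to the last update actually sent, keeps the rhythm steady even when one update runs late.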
Center of Operations
The MIM Plan should describe how to set up and run a “war room” for the duration of the incident.
This includes looking after people: feeding them and making sure there are rosters of staff so that they get some rest, especially the key people. Therefore you need more than one person for each of the key roles. After 12-18 hours under pressure, people are dangerous decision-makers.
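A roster check can be very simple. The sketch below flags anyone who has been on duty for too long. It assumes a 12-hour limit, taken from the lower end of the 12-18 hour range above; pick your own figure. The function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: relieve people at the lower end of the 12-18 hour danger zone.
MAX_SHIFT = timedelta(hours=12)


def due_for_relief(on_duty_since: dict[str, datetime], now: datetime) -> list[str]:
    """Return the names of people who have reached the shift limit."""
    return sorted(
        name for name, started in on_duty_since.items()
        if now - started >= MAX_SHIFT
    )


if __name__ == "__main__":
    started = {
        "Major Incident Manager (first shift)": datetime(2024, 1, 1, 6, 0),
        "Resolution Manager (first shift)": datetime(2024, 1, 1, 14, 0),
    }
    print(due_for_relief(started, now=datetime(2024, 1, 1, 19, 0)))
```

The check only works if there is someone to hand over to, which is why each key role needs more than one designated person.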
If this MIM Plan seems like a lot, it won’t be when the day comes, as it most certainly will. You will wish you had planned more.
One last point: rehearse this just as you rehearse fire drills. Rehearse the initiation until the center of operations is up and running. Rehearse the meetings with stakeholders. And rehearse the root cause analysis and problem resolution. We’ll talk about those last two next time when we look at Problem Management.
Some of the content in this article comes from the MIM checklists at http://www.basicsm.com/checklists. You can use those checklists to help keep control in a crisis. Braun Tacon (http://majorincidenthandling.com/) contributed to the MIM checklists.