Problem Management – Sensible Service Management Part 6
So far, the Sensible Service Management Series has covered incidents and requests. This is the “front-office” activity involved in serving the users: meeting their needs and keeping their services running. There is a “back-office” activity closely related to incidents and requests: Problem Management. A problem is an underlying cause of incidents. Usually it means something is broken. If an alligator bites someone, you fix the incident with bandages and maybe surgery. To fix the problem, you shoot the alligator before it bites again.
Strictly speaking, something can be wrong/broken and not be a problem because it is not causing incidents (yet). I like to call those Faults. You can keep life simple (and your GoToAssist Service Desk configuration simple) if you treat anything wrong/broken as a problem.
It is not uncommon to find an organization that doesn’t use problem records, only incidents. This is a big mistake. An incident record says that a user is unhappy. If we get the user working again, say with a workaround (see our Incident Management post), then the incident is over, though the problem may still be there. When an incident record ends up being used to track the problem, this screws up our reporting: it looks like we have long-running incidents, like we are not looking after the users.
Incident Management and Problem Management are very different activities and need to be kept separate. Keeping them separate makes the incident reporting more accurate, and it has other benefits:
- We can see our “portfolio” of problems: the overall situation gains visibility, so we can prioritize what we need to fix and work out how much resource we need.
- Incident and problem practices often have conflicting objectives. For example, rebooting a server will quickly fix a lot of incidents, but it potentially destroys diagnostic information needed to resolve the underlying problem. These conflicts should not be laid on one person to reconcile. When the two groups conflict, each should make its case to higher management, who resolve it.
So use problem records. We open problem records in several different situations:
- There is an incident and we can see there is an underlying technical cause, even if we don’t yet know exactly what it is (that is, it isn’t, say, a user error or an administration mistake)
- We detect a pattern in incidents and start to suspect there is an underlying cause for them
- We see something is wrong/broken
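The second situation, spotting a pattern in incidents, can be partly automated. Here is a minimal sketch, assuming incidents can be exported with a service name and a normalized symptom. The `Incident` fields, the `suspected_problems` helper, and the threshold of three are all hypothetical choices for illustration; they are not part of GoToAssist Service Desk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    incident_id: str
    service: str   # the service or component the user reported against
    symptom: str   # short, normalized description of what went wrong


def suspected_problems(incidents, threshold=3):
    """Return (service, symptom) pairs seen at least `threshold` times.

    Each pair is only a candidate for a problem record: repetition
    suggests a shared underlying cause but does not prove one.
    """
    counts = Counter((i.service, i.symptom) for i in incidents)
    return sorted(pair for pair, n in counts.items() if n >= threshold)


incidents = [
    Incident("INC-1", "email", "cannot send"),
    Incident("INC-2", "email", "cannot send"),
    Incident("INC-3", "printing", "paper jam"),
    Incident("INC-4", "email", "cannot send"),
]
candidates = suspected_problems(incidents)
print(candidates)
```

A person still has to look at each candidate and decide whether to open a problem record; the count only tells you where to look first.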
If you want, you can be quite general about what you define as a problem. For example, lots of user errors might show you there is a problem with the training.
Track all your problems: prioritize them, work on them, and follow up the slow ones. Record what you did about them, and close them off as you fix them or decide to live with them (if they are too hard or expensive to fix).
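To make that tracking concrete, here is a rough sketch of a problem register, assuming each problem carries a priority, a last-updated date, and one of two closing outcomes (fixed, or accepted as something to live with). The class names, the fields, the "1 is most urgent" convention, and the 30-day idle limit are assumptions made for this example, not features of any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Status(Enum):
    OPEN = "open"
    FIXED = "fixed"        # closed: the cause was removed
    ACCEPTED = "accepted"  # closed: we decided to live with it


@dataclass
class Problem:
    problem_id: str
    summary: str
    priority: int          # 1 = most urgent (assumed convention)
    last_updated: date
    status: Status = Status.OPEN
    resolution_note: str = ""  # what we did about it

    def close(self, status, note):
        """Close the problem as FIXED or ACCEPTED and record why."""
        if status is Status.OPEN:
            raise ValueError("close() needs Status.FIXED or Status.ACCEPTED")
        self.status = status
        self.resolution_note = note


def work_queue(problems):
    """Open problems, most urgent first."""
    open_problems = (p for p in problems if p.status is Status.OPEN)
    return sorted(open_problems, key=lambda p: p.priority)


def stale(problems, today, max_idle_days=30):
    """Open problems nobody has touched for more than max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [p for p in problems
            if p.status is Status.OPEN and p.last_updated < cutoff]


today = date(2024, 6, 1)
register = [
    Problem("PRB-1", "Mail server runs out of disk", 1, date(2024, 5, 28)),
    Problem("PRB-2", "Legacy app crashes on export", 3, date(2024, 3, 1)),
    Problem("PRB-3", "Intermittent Wi-Fi drops", 2, date(2024, 4, 1)),
]
register[1].close(Status.ACCEPTED, "Too expensive to fix; living with it")

print([p.problem_id for p in work_queue(register)])
print([p.problem_id for p in stale(register, today)])
```

Keeping "accepted" as a distinct closing state means a decision to live with a problem stays visible in the register rather than looking like a fix.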
It is not the bosses’ job to solve problems. Problems don’t get escalated. The old manager’s mantra is “Bring me options not problems”. Those doing operations know best how to fix problems.
In order to fix a problem (or an incident) you quite often have to do root cause analysis. There are formal techniques you can use to do this. Some argue that there is no single root cause of problems. It generally takes several causes together to create a problem – they have to “line up” in some way. The first and most obvious cause you find is seldom the end of the story: keep asking “why” until the answers are not useful. Finding root cause is not necessarily about assigning blame – it is about removing cause. Complex systems are in fact permanently broken, so when they actually fail, it may be nobody’s fault. On the other hand, there could be negligence.
Once you are tracking and dealing with problems, the next level of maturity is to “kill the alligators before they bite you”: proactively seek out problems and fix them. When you are really good, you will forestall them and prevent them ever existing. Find a keen, clever, energetic employee and assign them half a day per week to be an Alligator Killer: measure them on how many problems they find and eliminate.
Your register of problems in GoToAssist Service Desk is closely linked to your register of risks, and you may want to link them. An unfixed problem poses the risk of future service interruptions.
The better you get at dealing with problems, the fewer incidents you will have. The other area that you can improve in order to reduce incidents is Change Management. We’ll talk about that next.