Major incidents are a big deal. They are the crisis situations that affect large numbers of your employees, disrupt your operations and undermine your ability to deliver on customer expectations. While you may assume your company is prepared for when a major incident strikes, assuming is not a strategy for success when the stakes are this high. You need knowledgeable and trained staff members who know what must be done and who are equipped with the tools and resources to do their jobs effectively and efficiently.
Every company has a plan for Major Incident Management. It may be as simple as “bring together a bunch of very smart people who are able to investigate and determine what is occurring” or it may be a sophisticated set of processes, decision structures and protocols. If you don’t know your plan (or don’t know if your staff knows), then now is a good time to review and improve your plan. Once a major incident starts, you must focus on acting, not planning, before you lose control of events.
Here are five ways to make your Major Incident Management plan better and easier to execute:
- Understand your data and its accuracy. Having data available to aid in diagnostics and troubleshooting is critical in a major incident situation. IT configuration data, support contacts and up-to-date dependency data are essential. You must understand what you actually have, not just what you think you have. The quality of your available data will be crucial in determining how quickly you can analyze the symptoms of the incident, identify the underlying cause and determine the actions required to restore service.
- Pre-assemble the infrastructure data picture. IT ecosystems are complex. If your company relies on many third-party and cloud services, each with its own suppliers, then putting the picture together may be even more challenging. Before an incident starts, assemble your infrastructure data to find out where there is confusion, where the blind spots are and where your data may not be current. When the time comes to put this data to use, you must be able to trust it.
- Capture periodic “Last-Known-Good” configurations. Your IT environment is constantly evolving with every new user, every new device acquired and every piece of software updated or deployed. Change is one of the biggest causes of outages and major incidents. Unfortunately, once one or more changes have been made and failures start to occur, it can be difficult to establish what the environment looked like beforehand, which leaves you without a target state for restoration. Periodically capturing last-known-good snapshots of your IT configuration data lets you compare the current state against a known working one during an incident, so you can assess impacts and trace a root cause back to a specific change.
- Determine your communication plan. Major Incident Management is more than resolving the technical issue. It is also about managing perceptions and giving users, management and external stakeholders confidence that the incident management team is in control of the incident and is taking all necessary actions to restore service quickly. Identifying impacted user groups, defining target audiences for incident communications and preparing templates before an incident can significantly reduce the effort required to manage communications during the incident and reinforce the perception of control and organization (even if there is chaos behind the scenes).
- Update your asset and support contact information. Maintaining accurate, up-to-date IT asset and support information ensures that when a failure occurs, you know who to engage to help fix it. People are hired or change positions, support vendors change and assets are replaced or added to your IT environment. Monitoring these changes and keeping your records current can help avoid confusion in the middle of a major incident.
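To make the “Last-Known-Good” idea concrete, here is a minimal Python sketch of the compare-and-contrast step. It assumes each snapshot is a simple mapping from an asset identifier to its attributes. The asset names, fields and the `diff_snapshots` helper are hypothetical and for illustration only; a real configuration export will likely look different.

```python
from typing import Any, Dict

# Assumed snapshot shape (hypothetical): asset id -> attribute name -> value.
Snapshot = Dict[str, Dict[str, Any]]


def diff_snapshots(last_known_good: Snapshot, current: Snapshot) -> Dict[str, Any]:
    """Report assets added, removed or changed since the last-known-good snapshot."""
    added = sorted(current.keys() - last_known_good.keys())
    removed = sorted(last_known_good.keys() - current.keys())

    changed: Dict[str, Dict[str, Any]] = {}
    for asset_id in sorted(last_known_good.keys() & current.keys()):
        before, after = last_known_good[asset_id], current[asset_id]
        # Keep only the fields whose values differ, as (before, after) pairs.
        fields = {
            field: (before.get(field), after.get(field))
            for field in sorted(before.keys() | after.keys())
            if before.get(field) != after.get(field)
        }
        if fields:
            changed[asset_id] = fields

    return {"added": added, "removed": removed, "changed": changed}


# Illustrative data only; these assets and versions are made up.
LAST_KNOWN_GOOD: Snapshot = {
    "web-01": {"os": "Ubuntu 22.04", "app_version": "2.3.1"},
    "db-01": {"os": "Ubuntu 22.04", "db_version": "14.9"},
}
CURRENT: Snapshot = {
    "web-01": {"os": "Ubuntu 22.04", "app_version": "2.4.0"},
    "web-02": {"os": "Ubuntu 22.04", "app_version": "2.4.0"},
}

report = diff_snapshots(LAST_KNOWN_GOOD, CURRENT)
for category, details in report.items():
    print(category, details)
```

A report like this gives the incident team a short list of what has changed since the environment was last known to work, which is usually a better starting point than reviewing the entire configuration from scratch.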
Major incidents will happen. You won’t know when they will occur. You won’t know what will cause them. You can take steps today, however, to make your Major Incident Management plan easier to execute. View our webinar titled “Improve Incident Management: Be Ready for the 3 am Call”.