---
title: "Incident communication 101"
author: "Sorry™"
url: "https://books.sorryapp.com/2/incident-communication-101"
---

Incidents are inevitable. In today's fast-paced, always-on world, we depend more on technology than ever. Whether it’s a system-wide outage, a cybersecurity breach, or a short-term glitch in service, every organization will face moments that require clear and effective communication with the people who are impacted. 

While you can’t control when incidents happen, you can control how prepared you are. What sets the best teams apart is not just how quickly they resolve incidents, but how effectively they communicate throughout the entire process.

> Incident communication is more than just relaying updates. It’s about building trust, managing expectations, and minimizing confusion during high-stress and high-stakes situations.

In a hyper-connected society where digital communication is unavoidable, customers, stakeholders, and business teams rely on real-time information. Clear and timely communication can make the difference between a frustrated customer and a loyal one.

**About this eBook**

This eBook, Incident Communication 101, is meant to give you a comprehensive understanding of how to approach communication before, during, and after an incident. From defining an incident to communicating in a way that resonates with your audience, this guide will cover the principles, tools, and best practices for mastering incident communication. 


Whether you’re an engineer, manager, support rep, or founder, this eBook will help you:

- Set up clear processes for communicating when things go wrong.
- Define roles, responsibilities, and communication channels for incidents.
- Write straightforward, empathetic updates that build trust with your audience.
- Increase confidence with stakeholders during moments of disruption.
- Continuously improve your communication strategy and incident response process. 

By learning to communicate effectively in the heat of the moment, you can mitigate the immediate impact of incidents, inspire confidence, and foster long-term trust.


Let’s dive in!


1. The history of incident communication


Incidents have occurred for as long as humans have lived. Fires, natural disasters, pandemics, wars, and infrastructure failures are all early examples. As technology has become prevalent and people have become more connected, how humans respond to and communicate during incidents has evolved.


To examine the evolution of incident communication, we must look at how incident management practices have changed over time.

# Watchtowers and fire signals

![1. The history of incident communication.png](https://books.sorryapp.com/u/1-the-history-of-incident-communication-PCcmQg.png)

Many centuries ago, [communities would use watchtowers](https://artsandculture.google.com/story/the-fortified-towers-wall-platforms-and-watchtowers-of-the-great-wall-simatai-great-wall-tourist-area/SAVhLZQwDXng7Q?hl=en) to detect potential incidents and serve as a first line of defense. Soldiers staffed these watchtowers and were responsible for responding to any signs of trouble. They relayed information using fire signals, sounds, flags, and human messengers.

Though these strategies may seem outdated, they formed the foundation for modern incident management practices, which still prioritize early detection, rapid communication, clearly defined roles, and swift action.


# Incident Command System (ICS)

Fast-forward to the 1970s. Initially developed for California wildfire response, the [Incident Command System (ICS)](https://www.usda.gov/sites/default/files/documents/ICS100.pdf) standardized incident response by introducing common terminology, defining clear roles and responsibilities, establishing effective communication channels, and emphasizing thorough documentation. 


The ICS emphasizes communication as a critical component of effective incident management, explicitly calling attention to the following best practices:

- Keep information clear and to the point to avoid any confusion.
- Use a variety of channels – like radios, status boards, and emails – to ensure everyone stays in the loop.
- Set regular update intervals to keep everyone aware of the latest developments.
- Ensure all messages are aligned and approved by the Incident Commander or the designated lead.

Since then, the ICS has been adopted across various sectors, including public safety, healthcare, and IT, and its communication practices remain relevant today.


# National Incident Management System (NIMS)

The events of September 11, 2001, highlighted the critical need for national standards in incident operations. In 2004, the [National Incident Management System](https://www.fema.gov/sites/default/files/2020-07/fema_nims_doctrine-2017.pdf) was introduced as a standardized approach for managing emergencies and incidents across all levels of government, private sector organizations, and non-governmental organizations.

NIMS is a much more comprehensive approach to managing, responding to, and communicating about incidents, and it encourages the following communication practices:

- Make sure systems work together so agencies can easily coordinate.
- Share real-time, consistent info to keep everyone on the same page.
- Use clear, standardized protocols for smooth communication.
- Rely on tools like radios and digital platforms to stay connected.
- Keep public messaging accurate and consistent through a Public Information Officer.

This holistic approach to incident management continues to play a role in how modern incident management teams communicate.

# Incident communication today

Incident communication and the overall practice of incident management look much different today than in years and centuries prior. While we don’t rely on flags and fire signals, the essence of these approaches is still present. 

Modern incident communication draws from the past to deliver a more streamlined and efficient experience leveraging technology. In software, organizations rely heavily on monitoring, alerting, and communication systems to reach the right people at the right time with relevant information. 

Here’s how incident communication operates today:

- Real-time alerts: Automated systems send immediate alerts to relevant teams and stakeholders. 
- Multi-channel updates: Status pages, social media, email, and in-app notifications ensure clear communication across diverse audiences.
- Human-centric messages: Updates are written with empathy and clarity, emphasizing transparency and building trust.
- Pre-defined playbooks: Teams rely on pre-written communication templates and clearly defined escalation paths to ensure consistency.
- Integrated tools: Platforms like Slack and PagerDuty, along with status page providers like Sorry, streamline external communication and internal coordination. AI and machine learning are becoming more prevalent in predicting incidents and assisting with communication.
- Continuous learning: Post-incident reviews and retrospectives refine processes to improve future communication.

Today, organizations can detect incidents faster, collaborate more productively, and notify the impacted audience nearly instantly.


# Pioneers of modern incident management

Modern incident management draws inspiration from the past but owes much of its recent evolution to the bold experimentation of tech pioneers. Here are some key players worth noting:

 ![Pioneers of modern incident management.png](https://books.sorryapp.com/u/pioneers-of-modern-incident-management-kqLSsh.png)

**Steve Capps: Developer of the “monkey.”**
In 1983, an Apple Macintosh engineer, Steve Capps, [developed a “monkey”](https://folklore.org/Monkey_Lives.html?sort=date) that generated a series of rapid-fire random user interface inputs. The monkey proved to be a valuable testing tool for identifying failures in software applications, eventually inspiring the idea of “chaos engineering” and the “chaos monkey.”

**Jesse Robbins: Creator of “game day”**
In the early 2000s, Amazon engineer Jesse Robbins, known as the “Master of Disaster,” created a program called Game Day. The idea was to simulate real-world incidents by introducing significant failures in software systems. Game days are still used to help teams improve their incident response and operational readiness.

**Netflix Engineers: Chaos engineering & chaos monkey**
A major Netflix outage in 2008 led the company to migrate from hardware servers to AWS, and their engineers needed a way to ensure system resilience in the new highly distributed architecture. 

Inspired by Jesse Robbins’ Game Day, they devised the idea of chaos engineering, which intentionally introduces disruptions in critical systems. The goal is to identify and fix systemic weaknesses before they cause real issues. [Chaos monkey](https://www.gremlin.com/chaos-monkey/the-origin-of-chaos-monkey), inspired by Capps’ software monkey, was built to randomly turn off servers and test how well systems handle failures and failovers.


2. Preparing for incident communication


Being prepared for incidents is critical for effective communication. Pre-incident planning aligns your team with the tools and processes to respond effectively while keeping stakeholders informed. 


All great incident response teams start with a plan. Think of this as the pre-work to nailing how you communicate during incidents.

# Define an incident & set severity levels

A clear definition of "incident" and a solid understanding of severity levels set your team up for success from the start. Without this alignment, your team won’t be able to respond in a unified way.

## Defining an incident

First, you need to know what scenarios qualify as an incident. Defining what an incident means to your organization gets everyone on the same page and ensures incidents are declared consistently. 

A clear definition distinguishes incidents from bugs, prevents unnecessary escalations, streamlines your response process, and ensures resources are allocated appropriately. 

> Sorry’s definition of an incident: An incident is any unplanned disruption, degradation, or issue with a service, system, or process that requires urgent attention and diverts focus from planned work to restore normal operations.

## Severity levels

In addition to defining an incident, you should establish incident severity levels. A shared understanding of severity levels allows you to prioritize incidents, allocate resources, and standardize your communication approach. Most organizations use a scale of three to five severity levels. For example:

- SEV 1: Critical incident with major impact.
- SEV 2: Major incident with moderate impact.
- SEV 3: Minor incident with low impact.

Many organizations create a priority matrix to help determine the severity of incidents. The ITIL incident management priority matrix is a great starting point.


 ![Screenshot 2025-01-17 at 12.52.46.png](https://books.sorryapp.com/u/screenshot-2025-01-17-at-12-52-46-S44Kw3.png) 
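
If you want to apply a priority matrix consistently in tooling, one option is a simple lookup table. The following is a minimal Python sketch, assuming a three-level scale for impact and urgency and the three SEV levels listed earlier. The `Level` scale, the `classify` helper, and every cell value are illustrative assumptions, not ITIL's official values, so adapt them to your own definitions.

```python
from enum import IntEnum


class Level(IntEnum):
    """Shared scale for impact and urgency (hypothetical three-level scale)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Illustrative matrix: (impact, urgency) -> severity label.
# The cell values are assumptions; adjust them to match your own definitions.
SEVERITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "SEV 1",
    (Level.HIGH, Level.MEDIUM): "SEV 1",
    (Level.MEDIUM, Level.HIGH): "SEV 1",
    (Level.HIGH, Level.LOW): "SEV 2",
    (Level.MEDIUM, Level.MEDIUM): "SEV 2",
    (Level.LOW, Level.HIGH): "SEV 2",
    (Level.MEDIUM, Level.LOW): "SEV 3",
    (Level.LOW, Level.MEDIUM): "SEV 3",
    (Level.LOW, Level.LOW): "SEV 3",
}


def classify(impact: Level, urgency: Level) -> str:
    """Look up the severity label for a given impact and urgency."""
    return SEVERITY_MATRIX[(impact, urgency)]
```

Calling `classify(Level.HIGH, Level.MEDIUM)` returns the label stored in that cell of the table. Keeping the matrix as data rather than branching logic makes it easy to review and change alongside your written severity definitions.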

# Define communication channels

Determine the tools and platforms you’ll use for updates to ensure consistent and effective communication. How will you gather the contacts if you need to reach a specific subset of customers? What communication systems will your team need to be trained on and have access to? 

# Identify key stakeholders

Determine individuals or groups who might be impacted by or need to know about incidents. For each severity type, you might have different groups of people. Think of customers, end-users, team leads, and executives. 
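
Writing the mapping down ahead of time can make it easier to apply under pressure. Here is a minimal Python sketch of one way to record which groups hear about each severity level. The group assignments and the `audiences_for` helper are hypothetical placeholders, not a recommendation for any particular organization.

```python
# Hypothetical audience groups per severity level; replace with your own.
AUDIENCES_BY_SEVERITY = {
    "SEV 1": ["customers", "end-users", "team leads", "executives"],
    "SEV 2": ["customers", "end-users", "team leads"],
    "SEV 3": ["team leads"],
}


def audiences_for(severity: str) -> list[str]:
    """Return the groups to notify, or raise if the severity is undefined."""
    try:
        # Return a copy so callers cannot modify the shared mapping.
        return list(AUDIENCES_BY_SEVERITY[severity])
    except KeyError:
        raise ValueError(f"No audience defined for severity {severity!r}") from None
```

Raising an error for an unknown severity, rather than silently notifying no one, is a deliberate choice in this sketch: a gap in the plan should surface during a drill, not during a real incident.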

# Establish internal communication practices

Effective internal communication helps your team make informed choices throughout the lifecycle of an incident. For example, an account executive (AE) might be preparing for a demo with a potential client when a critical server goes down. Without notice, the demo would be a train wreck because the AE wouldn’t know about the issue. With effective internal communication, however, the AE would be promptly informed of the incident and could adjust accordingly.


Setting up clear protocols, practical tools, and efficient workflows ensures your organization has the necessary information to react accordingly. Check out Chapter 5, Incident Communication Best Practices, for more tips.

# Clarify roles and responsibilities

You don’t want to scramble during the incident to figure out who’s responsible for what, as this is valuable time you should spend resolving the incident. Determine incident roles and responsibilities beforehand to ensure tasks are handled and duplicate work is avoided. Assign tasks and decision-making authority to team members to foster accountability and efficiency.

3. Incident roles and responsibilities

Incident roles define who is responsible for what so everyone knows where to focus during an incident. Roles bring order to the incident response process and ensure your team is set up to communicate promptly and effectively.

# Common roles

Every team has different needs and operates accordingly. However, these are some of the most common roles in incident management:

## Incident commander

Every incident needs an incident commander or incident lead. This person is responsible for overseeing the entire incident response process. For minor incidents, the incident commander might even handle stakeholder communication.

## Communications lead

For larger teams or more complex incidents, a dedicated communications lead can help ensure timely and consistent messaging to stakeholders. They might handle external and internal communication across various channels. This role is often filled by someone from the customer support or success team. The communications lead must be a skilled writer and be able to think and act quickly. Most importantly, they need to have a strong understanding of customers and stakeholders.

## Technical lead

The technical lead is responsible for investigating the technical aspects of the incident, identifying the root cause, and developing solutions to resolve the issue. They often communicate updates to the incident commander or communications lead.

## Responder

Depending on the magnitude of the incident, the technical lead might have incident responders working with them. A responder is typically someone from the development or operations team who can execute specific technical tasks to mitigate the incident, such as deploying fixes, running diagnostics, or implementing workarounds.

## Customer support lead

For major incidents that are likely to trigger an influx of support inquiries, a support lead is often added to the incident response team to ensure appropriate communication with impacted users. In large-scale incidents, the support team may need to scale up staffing to manage the volume of inquiries.


# Documentation and training

Incident roles should be well-documented, and training should be conducted to ensure everyone is clear about their roles and responsibilities before an incident occurs. Here are some tips:

**Include incident training in new-hire onboarding:** This instills the importance of incident management from day one.

**Document roles and responsibilities:** Write a short description of each role and store it in your company wiki or anywhere accessible to all team members.

**Conduct regular fire drills or mock incidents:** Run through mock incidents regularly so everyone can get a refresher of what’s expected of them.

4. Incident communication channels


Effective incident communication involves aligning internal teams for a swift and coordinated response while keeping external stakeholders informed to build trust and manage expectations. 

Internally, centralized communication platforms help teams share updates, assign roles, and collaborate without confusion or redundancy. Externally, dedicated channels, such as status pages or customer support ticketing systems, enable teams to respond and set expectations at scale.

# Common incident communication channels:

- **Status pages:** A centralized place for incident communication.
- **Team chat:** For internal teams to communicate in real-time.
- **Alerting platforms:** So internal teams are notified of potential issues.
- **Social media:** To inform your followers of known issues.
- **Email notifications:** To reach a targeted group, often powered by status page notifications.
- **Text notifications:** To reach a targeted group, often powered by status page notifications.
- **Documentation:** For consulting existing technical docs and processes, taking notes, and drafting a post-incident review.
- **Issue tracking:** For assigning incident-related tasks.
- **Help center embed:** To surface active incidents on your help center, minimizing incoming tickets.
- **UI embed:** To surface active incidents in your web UI, reaching users right where they’re experiencing issues.


# Internal communication: keeping your team aligned

Jumping into an active software incident is like being dropped into a high-stakes escape room where the clock is ticking, the lights are flickering, and half the team has different pieces of the puzzle. Reliable communication systems are a must-have for managing incidents and keeping customers updated.

Here are some standard tools used by teams to communicate about incidents internally:

## INTERNAL INCIDENT COMMUNICATION TOOLS

 ![Screenshot 2025-01-17 at 13.04.27.png](https://books.sorryapp.com/u/screenshot-2025-01-17-at-13-04-27-WM3SCw.png) 

# External communication: Keeping your customers informed

As a customer, encountering a software incident is like being stuck in traffic on a busy highway without knowing what’s causing the jam or how long it will last. Frustration builds as you watch the clock tick and wonder if you’ll ever reach your destination. The difference between a terrible and tolerable experience lies in the information you receive. 

The same goes for customers encountering issues with your product or service. The longer they wait for information, the more frustration builds. You can easily keep your customers in the loop with the right tools and processes.

Below are some popular incident communication tools used by teams to communicate externally:

 ![Screenshot 2025-01-17 at 13.05.18.png](https://books.sorryapp.com/u/screenshot-2025-01-17-at-13-05-18-39DFbs.png) 

# Status pages are critical incident communication tools

Status page tools like [Sorry™](https://www.sorryapp.com/) are versatile platforms that enhance internal and external communication during incidents. Internally, they act as a centralized hub for organization-wide visibility, ensuring everyone stays in the loop about the incident’s current status.

Externally, they provide customers and stakeholders with transparent, real-time updates, deflecting inbound support tickets and building trust through proactive communication. 

With [Sorry™](https://www.sorryapp.com/), you can enable customers and stakeholders to subscribe to notifications directly from your status page and embed your system status directly into your web UI.


5. Incident communication best practices

Once you have your team and communication channels ready to go, you need to be ready to communicate when things go wrong. We’ve compiled our favourite incident communication tips from 50 years of combined incident management experience:

# Communicate early
Acknowledge the incident when you’re confident it’s a valid issue. It’s okay to be vague if you don’t have many details. Inform your audience that you’re investigating the problem and will provide more information as soon as you have it.

# Update often
To keep all stakeholders informed, provide regular updates, even if progress is minimal. The more you communicate, the more you’ll reduce uncertainty and avoid negative speculation. Set a cadence for regular updates and stick to it.
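
A cadence is easier to keep when something tracks it for you. Below is a minimal Python sketch that computes when the next update is due and whether one has been missed. The intervals per severity and the function names are hypothetical; pick values your team can realistically meet.

```python
from datetime import datetime, timedelta

# Hypothetical cadences per severity; choose intervals your team can keep.
UPDATE_INTERVALS = {
    "SEV 1": timedelta(minutes=30),
    "SEV 2": timedelta(hours=1),
    "SEV 3": timedelta(hours=4),
}


def next_update_due(last_update: datetime, severity: str) -> datetime:
    """Return the time by which the next update should be posted."""
    return last_update + UPDATE_INTERVALS[severity]


def is_update_overdue(last_update: datetime, severity: str, now: datetime) -> bool:
    """Return True when the promised cadence has been missed."""
    return now > next_update_due(last_update, severity)
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check easy to test and lets a reminder bot or scheduled job supply the current time.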

# Use plain language
Use simple, straightforward language that everyone can understand. Avoid technical jargon or overly complex explanations to ensure clarity and avoid confusion. Simple language builds trust and helps all stakeholders stay informed and aligned.

**Bad:** We’re experiencing high latency due to unexpected degradation in the load balancer throughput capacity. Our SREs are executing a rollback of the most recent deployment and initiating a phased failover to mitigate cascading failures across shards.

**Good:** We're currently experiencing slower system performance due to an issue with our servers. Our team is working to fix it by rolling back recent changes and shifting traffic to backup systems to prevent further problems.

# Minimize panic with context

Sometimes, details about an incident are impossible to share because the information might not be available yet. However, whenever possible, provide context to help your audience understand the severity and scope of the incident. This helps reduce unnecessary worry.

**Bad:** Login functionality is currently down. We’re looking into it.

**Good:** We’re aware of an issue causing delays for some users when logging in. The issue appears to affect fewer than 10% of users, primarily in the North American region.

# Communicate workarounds
Share temporary fixes or alternative solutions users can use while the issue is being investigated and resolved. This minimizes the disruption for the customer and provides an immediate path forward.

**Bad:** Our payment system is down, and we’re working on a fix. We apologize for the inconvenience.

**Good:** We’re currently experiencing an issue with our payment system, and some transactions may not go through. You can still complete your purchase by using PayPal or selecting “Pay Later” at checkout. Thank you for your patience as we resolve this issue.

# Focus on empathy
Empathy is a customer service superpower. Without it, customers feel that you don’t truly understand them. Acknowledge the impact on those affected, apologize when necessary, and show that you genuinely care about resolving the issue.

**Bad:** We’re aware of login issues and working on a fix. Stay tuned for more info.

**Good:** We’re aware of issues logging in to our web app. We know how important our service is to our customers, and we sincerely apologize for the disruption. Our team is actively working to resolve this issue as quickly as possible. We understand this may be frustrating and inconvenient and truly appreciate your patience as we work through this.

# Be transparent
Transparency builds trust, even during challenging moments. Communicate what you know and acknowledge what you don’t. Avoid downplaying the situation, and provide honest updates to maintain trust.

**Bad:** Our website is experiencing technical difficulties due to an upstream provider issue. We will update you as soon as we know more.

**Good:** Our website is currently down due to an error introduced in a recent update. This issue was caused by a misconfiguration in our deployment tests, and we take full accountability for this mistake. Our team is working to resolve this as soon as possible.

# Explain what you’re doing
Share the steps your team has taken to resolve the issue so your audience knows you’re actively working on it. This gives them confidence your team is prioritizing a fix.

**Bad:** We’re aware the app isn’t loading, and we apologize for the inconvenience. Stay tuned for updates.

**Good:** We’re aware that the app isn’t loading for some users. Our team has identified the issue as a server configuration error. We’re currently restarting affected servers and testing to push a fix out as soon as possible. If that doesn’t work, we’ll attempt a rollback. Check back within the hour for more info.

# Set expectations for a resolution
Whenever possible, clearly communicate realistic timelines for resolution, even if it’s just an estimate. This helps customers make decisions and reduces frustration.

**Bad:** The file upload feature isn’t working, and we’re looking into the root cause of this. Thanks for your patience.

**Good:** We’re aware that the file upload feature isn’t currently functioning properly. Our team has identified the issue and is implementing a fix. We expect to have the feature fully operational within the next 3 hours. Thank you for your understanding.

# Use templates
Communicating under pressure can be stressful. Create incident communication templates in advance to ensure clear, consistent, and professional messaging in the heat of the moment. Status page tools like Sorry™ offer the ability to create and store pre-written templates.
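
If you also keep templates in your own tooling, the idea can be sketched in a few lines of Python using the standard library's `string.Template`. This is a minimal illustration, not how Sorry™ or any other status page tool stores templates; the wording, the placeholder names `service` and `next_update`, and the `render_update` helper are all hypothetical.

```python
from string import Template

# Hypothetical template text; the placeholder names are illustrative.
INVESTIGATING = Template(
    "We're aware of an issue affecting $service. "
    "Our team is investigating and will share an update by $next_update."
)


def render_update(template: Template, **fields: str) -> str:
    """Fill in a template; raises KeyError if a placeholder is left unfilled."""
    return template.substitute(**fields)
```

For example, `render_update(INVESTIGATING, service="file uploads", next_update="14:30 UTC")` fills both placeholders. Using `substitute` rather than `safe_substitute` means a forgotten field fails loudly instead of publishing a message with a raw `$placeholder` in it.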

# Practice
Run through mock incidents or “fire drills” using real-life scenarios. Use these practice scenarios to rehearse your communication process and refine anything that falls short. We recommend setting a monthly or quarterly calendar reminder to practice your incident response and communication plan.


6. After the incident

What truly sets resilient teams apart is not only how they handle the heat of the moment but also what they do afterward. The post-incident phase is an opportunity to reflect, learn, and improve so that teams can be even more prepared the next time things go wrong. 

Here are some common tasks after an incident is resolved:

# Conduct a post-incident review

An incident retrospective, or post-incident review, is a structured process for reflecting on and analyzing incidents. Post-incident reviews bring the team together to identify lessons learned and opportunities for improvement. 

Traditionally, these are referred to as postmortems. However, many incident management practitioners have moved away from that term because of its association with death, which can feel counterproductive in collaborative environments. 


The post-incident review is an excellent opportunity to raise customer concerns and see how communication can be improved both internally and externally. 

# Document and share learnings

Capture the incident's timeline, impacts, resolutions, and key takeaways in a central place. Sharing these insights across teams ensures everyone benefits from the lessons learned and reduces the likelihood of similar issues in the future. 
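
A consistent structure makes these records easier to compare across incidents. Below is a minimal Python sketch of one possible shape for such a record, covering the timeline, impact, resolution, and takeaways mentioned here. The class and field names are hypothetical, and a wiki page or document template would serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """A single timestamped note in the incident timeline."""
    at: datetime
    note: str


@dataclass
class IncidentRecord:
    """Hypothetical record structure; field names are illustrative."""
    title: str
    impact: str
    resolution: str = ""
    takeaways: list[str] = field(default_factory=list)
    timeline: list[TimelineEntry] = field(default_factory=list)

    def log(self, at: datetime, note: str) -> None:
        """Add a timeline entry, keeping entries in chronological order."""
        self.timeline.append(TimelineEntry(at, note))
        self.timeline.sort(key=lambda entry: entry.at)
```

Sorting on every `log` call means notes added after the fact, such as a detail remembered during the review, still land in the right place in the timeline.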

# Implement fixes and improvements

Address the root causes identified in the post-incident review and implement any other necessary changes in the incident response process. This might involve bug fixes, monitoring improvements, or refining the response and communication process. 

# Communicate with stakeholders

As we’ve already discussed, transparency is key. For more significant incidents, customers will likely be eager to learn what happened, even long after the incident is resolved. Don’t leave them hanging, as this can cause more frustration. 

Let customers, partners, and other stakeholders know what happened, how it was resolved, and the steps you’re taking to prevent similar incidents. Remember to use plain language and focus on rebuilding trust. 

Many organizations publish a post-incident review writeup directly on their company status page.

# Monitor and validate fixes

Once fixes are in place, monitor systems closely to ensure stability and validate that the changes were effective. Stay close to the support queue to ensure residual effects aren’t causing problems for customers.

# Refine your process

In the spirit of continuous improvement, use the learnings from each incident to improve your overall incident response plan. Update playbooks, improve training programs, and simulate new scenarios to keep your team prepared.


# Moving forward with confidence: mastering incident communication

As explored throughout this eBook, effective communication is essential for maintaining trust and minimizing the impact of unexpected service disruptions. Good incident communication is driven by **preparedness, empathy, transparency, and clarity.**

Remember, the goal is not just to resolve the issue at hand but to build stronger relationships with your audience through thoughtful, timely, and honest communication. By mastering incident communication, you’ll be able to effectively reduce customer frustration and maintain trust even during challenging moments.

We hope you found this eBook helpful. Feel free to share it with a colleague or make it part of your own incident response training program.

We launched Sorry™ in 2014 to help companies improve customer communication during outages. Our incident communication platform is used across a variety of industries globally, including technology, healthcare, government, and more.

By providing real-time updates, templates, and the ability to send email and SMS notifications, Sorry™ provides a stress-free way for businesses to acknowledge incidents promptly, reducing customer frustration and building trust. 

Interested in learning more about Sorry? [Schedule a demo today!](https://www.sorryapp.com/demo/)

Sources: 

https://devops.com/the-evolution-of-incident-management/

https://spike.sh/blog/history-and-evolution-of-incident-management/

https://rootly.com/blog/a-primer-on-the-history-and-evolution-of-incident-management-to-today

https://www.fema.gov/txt/nims/nims_ics_position_paper.txt

https://artsandculture.google.com/story/the-fortified-towers-wall-platforms-and-watchtowers-of-the-great-wall-simatai-great-wall-tourist-area/SAVhLZQwDXng7Q?hl=en

https://www.usda.gov/sites/default/files/documents/ICS100.pdf

https://www.fema.gov/emergency-managers/nims

https://folklore.org/Monkey_Lives.html?sort=date

https://www.itnews.com.au/feature/what-is-chaos-engineering-the-art-of-breaking-things-purposefully-555100

https://www.gremlin.com/chaos-monkey/the-origin-of-chaos-monkey