An Incident Response Overview

Jan 14, 2022

Organizations devote a lot of time to establishing controls, guidelines, and processes to prevent an incident. However, many organizations may not spend as much time on Incident Response, in the hope that they will never have to deal with such events. As is often said, cyber security incidents are not a matter of if but a matter of when. This write-up outlines some of the significant aspects of information security incident response and provides some thoughts, context, and tools where applicable. I’m not an Incident Response specialist; this is part of me sharing as I learn about the discipline during my graduate program.

Acknowledgments

Incident response can be a complex set of activities, and it requires thoughtful planning and resources for an organization to be prepared to deal with such events. This write-up would not have been possible without the countless hours spent by many professionals in developing NIST Special Publication 800–61 Revision 2. To all that have contributed, thank you.

Definitions

To make the process more actionable, it’s crucial to define and determine the scope of what an incident is. According to the previously referenced NIST document, a computer security incident is: “a violation or imminent threat of violation of computer security policies, acceptable use policies, or standard security practices.”

Even using such a definition, the possibilities of what counts as an incident can be endless. Because this document is meant to be accessible to most backgrounds, I’ll limit it to what is known as the CIA triad. This security model helps better frame the different aspects of security:

Confidentiality: Refers to protecting sensitive or private information from unauthorized access by a third party. Generally, this is achieved by having different levels of access to data that should be enforced by the party holding the data.
Integrity: Refers to the assurances that data is protected from unwanted modification or deletion by an unauthorized third party.
Availability: Refers to assurances related to the user’s successful access to the data or system. In other words, the system is operational and accessible as intended.

Importance of Communication in Incident Response

The timeliness of the response once something has gone wrong can be critical, but it’s also vital to handle all incidents with a consistent approach. Together, these two aspects help minimize disruption, preserve data that is valuable for the investigation, and prevent future incidents by identifying the root cause of the problem and how to mitigate it.

External Communication

Besides timeliness and consistency in the process, one of the most critical aspects is communication. An incident handling team has to manage communication at different levels of technical depth with a multitude of stakeholders with wide-ranging motivations:

Customers, Constituents, and the media: This group has three different stakeholders. The type of communication required and expected with customers and constituents can vary significantly depending on the geographical location and the respective laws that apply. Media relations, on the other hand, must be handled carefully as they can have severe legal and reputational implications.
Software & Support Vendors: Expert knowledge may be required from the vendor during the analysis phase of an incident. Constant communication will be needed to mitigate any vendor-related threat in the future.
Law Enforcement Agencies: Depending on the industry affected, regulations may require different law enforcement agencies to participate in the process according to their respective jurisdictions.
Incident Reporters: Government organizations, and federal agencies specifically, must report incidents to the United States Computer Emergency Readiness Team (US-CERT). Specific protocols must be followed for these interactions. Organizations outside government may choose to contact an Information Sharing and Analysis Center (ISAC), an industry-specific private sector group.
Internet Service Providers: The type of attack may warrant communication with an Internet Service Provider to mitigate an ongoing attack and establish measures to prevent future ones.
Other Incident Response teams: Proactively sharing information with peer organizations can help them defend against known threats that have already been uncovered.

Internal Communication

Many groups within an organization will require precise coordination before, during, and after an incident. These groups may have to actively participate or be available for consultation as needed to make time-sensitive decisions. Some of the groups include:

Management: Key stakeholders, as their decisions influence the structure, capabilities, and resources of the Incident Response team. In addition, cybersecurity incidents can have lasting reputational implications and bring regulatory scrutiny.
Information Assurance & IT Support: Information security members of the organization will be instrumental in understanding and coordinating prevention, containment, eradication, and recovery tasks. In addition, they will be needed to support short-term modifications to the existing security controls during the investigation. Similarly, IT experts assist with the required tasks and bring experience and context about the technical nuances of the organization.
Legal Department: Legal teams are responsible for verifying that procedures, policies, controls, and plans are up to date and in compliance with the industry’s regulations. They also play an active role during an ongoing incident, providing expertise in data privacy and security and helping mitigate any legal risk associated with the incident.
Business Continuity Planning: Information security incidents are a considerable disruption to business operations. Business Continuity planning professionals can help alleviate the burden and lower the risk of the incident with their knowledge of mitigation strategies to maintain the organization’s operations.
Physical Security and Facilities Management: Given that some incidents may occur due to physical security breaches, and that physical assets may be required for an incident investigation, it’s essential to coordinate with these teams to accelerate any tasks.

Stages of Incident Handling

In responding to an incident, several phases each play an essential role in the overall mitigation of the event. Each of these phases also has its own tools, resources, and skills. This means that for an Incident Response team or program to work effectively, it must rely on a strong foundation of security practices, accompanied by processes and tools for each phase.

Preparation

This phase encompasses being ready for an incident and includes all the actions that help an organization prevent an incident in the first place.

Incident Preparation

The following include some of the tools and resources that need to be available for successful incident handling:

Digital forensic workstations, backup devices, and digital forensics software: During the response, tasks will include creating disk images (post-attack), preserving logs from the period leading up to the attack, and collecting other relevant data to be analyzed. Once this data is extracted from hard drives and the system’s memory, other forensics software will be required to analyze and better understand what led to the issue.
Spare workstations and servers: During the recovery phases, the organization will need servers or workstations that will be used to restore the systems to a known working state.
Packet sniffers and protocol analyzers: These tools enable the analysis of network data flows and, if applicable, the decoding of patterns from the malware.
Documentation including Network diagrams and list of critical assets: Detailed documentation of systems and their respective software and architecture diagrams will be needed to identify which other systems may have been affected or should be looked into.

Incident Prevention

As part of preventing incidents, organizations will require prevention mechanisms including the following:

Risk Assessments: Security best practices recommend organizations perform ongoing assessments to identify risks that could affect the organization. This includes understanding the threat landscape for that organization and prioritizing mitigation of known risks.
Endpoint Security: All systems in an organization should be “hardened”, meaning they follow recommended security guidelines: users have access to only the systems they need to perform their duties, workstations and servers are configured and monitored to minimize risk, and the relevant software patches are applied to mitigate security issues.
Network Security: An organization’s network should ensure that traffic to and from its services is locked down to only activity expressly allowed.
Malware Prevention: Relevant systems should have software that detects and prevents malware from running. This applies to workstations and generic servers as well as specialized servers such as email servers, web proxies, and others.
User Awareness & Training: Preventing cyber attacks is an ongoing process and requires that users are aware and trained to understand the tools in place to mitigate risk from cyber attacks.
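The "only activity expressly allowed" idea from the Network Security item can be sketched as a default-deny check. This is a minimal illustration, not a firewall implementation; the `ALLOWED` rules and the `is_allowed` helper are hypothetical names made up for this example, and it uses only Python's standard `ipaddress` module.

```python
import ipaddress

# Hypothetical allow-list: (source network, destination port) pairs
# that are expressly permitted. Everything else is denied.
ALLOWED = [
    (ipaddress.ip_network("10.0.0.0/8"), 443),   # internal clients to HTTPS
    (ipaddress.ip_network("10.1.2.0/24"), 22),   # admin subnet to SSH
]


def is_allowed(src_ip: str, dst_port: int) -> bool:
    """Return True only if a rule expressly allows the traffic (default deny)."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net and dst_port == port for net, port in ALLOWED)
```

With these example rules, `is_allowed("10.1.2.5", 22)` is permitted, while SSH from any other subnet falls through to the default deny.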

Detection & Analysis

Detection

The number of attack vectors for organizations is ever-increasing and specific to industries and geographical areas. The following summarizes a high-level set of sources of precursors and indicators that may signal an incident has occurred:

IDPS: Intrusion detection and prevention systems are products that flag suspicious events and either save related data for further investigation or immediately mitigate the issue.
SIEMs: Security Information and Event Management products work similarly to intrusion detection and prevention systems, but they generate alerts based on analysis of different types of log data.
Antivirus and Antispam Software: Software that can detect and prevent infections from malware. Some systems are signature-based, which means they only detect known instances of malware. More modern solutions, by contrast, look at how software behaves in a system to flag suspicious activity.
File Integrity Checking Software: These applications are used to identify when changes to critical files occur. They compute a cryptographic checksum for each designated file and alert when an unexpected modification has occurred.
Third-Party Monitoring Services: Third parties sell services that can notify an organization if their assets have been associated with incident activity in other organizations. Other services provide block-lists of indicators such as IP addresses, domains, and others that should be avoided so organizations can build rules in their environments to prevent access to them.
Operating System, service, and application logs: When an incident occurs, data related to operating system actions can help understand baseline behavior vs. abnormal behavior. The logs can also be used to correlate event information.
Network device Logs and Network flows: Network device logs such as firewall and router logs can help identify network trends and correlate events. Network flow refers to communication sessions between hosts. This can be used to identify malware activity, data exfiltration, and other types of behaviors.
Information on new vulnerabilities and exploits: New software vulnerabilities and exploits are regularly discovered by security researchers and threat actors. Organizations should remain up-to-date with vulnerabilities relevant to their systems, as those systems may need to be checked against them if a breach or incident is suspected.
People within and outside the organization: Bug hunters, investigators, customers, and employees from an organization may also uncover issues that should be addressed immediately.
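To make the file integrity checking item above more concrete, here is a minimal sketch of the checksum approach, assuming SHA-256 as the hash. The function names (`sha256_of`, `build_baseline`, `find_changes`) are made up for illustration; real products add scheduling, tamper-resistant baseline storage, and alerting.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_baseline(paths):
    """Record a known-good checksum for each designated file."""
    return {str(p): sha256_of(Path(p)) for p in paths}


def find_changes(baseline):
    """Return the files that are missing or whose checksum no longer matches."""
    changed = []
    for name, expected in baseline.items():
        path = Path(name)
        if not path.exists() or sha256_of(path) != expected:
            changed.append(name)
    return changed
```

The baseline would be captured while the system is in a known-good state, and `find_changes` run periodically afterwards; any file it returns warrants a closer look.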

Analysis

As expected, the analysis part of incident response can sometimes be the trickiest. Much of the difficulty comes from the validation this phase requires, since tools and products often yield false positives or provide incorrect indicators. For this reason, it’s important, when possible, to evaluate each indicator in detail to determine if it’s accurate. Even when an indicator is accurate, it may still reflect expected behavior, so further analysis is needed to understand what is expected vs. not.
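One simple way to picture "expected vs. not" is to compare a new measurement against a baseline of past measurements. The sketch below uses a standard-deviation threshold; the `is_abnormal` name and the threshold of 3 are assumptions for illustration, and real environments usually need more robust statistics.

```python
import statistics


def is_abnormal(history, observed, threshold=3.0):
    """Flag a value that deviates from the baseline by more than
    `threshold` sample standard deviations.

    `history` needs at least two past measurements.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat baseline: any difference is a deviation.
        return observed != mean
    return abs(observed - mean) / stdev > threshold


# Hypothetical hourly outbound traffic in megabytes.
baseline_mb = [100, 102, 98, 101, 99]
```

Here `is_abnormal(baseline_mb, 101)` stays within the norm, while a sudden 150 MB hour would be flagged for review. A flag like this is still only an indicator and needs validation.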

NIST outlines the following recommendations in Special Publication 800–61, which can help with validation of patterns and the analysis phase in general:

Profile networks and systems: In this context, profiling refers to measuring the characteristics of known activity so that changes in said characteristics can be identified more quickly. Examples include running file integrity checking software or measuring network bandwidth usage, which can help highlight peaks in usage levels against known benchmarks.
Understand normal behaviors: The Incident Response discipline requires analyzing various types of logs and data, including network, system, and application. An incident responder needs to understand baseline behavior in all these systems to determine when something has gone out of the norm.
Create a log retention policy: The logs that capture the behavior mentioned earlier can be located in several places. These may include firewalls, application-specific logs, or network logs. A predictable retention policy is helpful because it provides consistency when analyzing data from multiple places across a standard time frame. Also, log data going back months can be beneficial during an investigation since incidents may sometimes go unnoticed for long periods.
Perform event correlation: Because an attacker’s action may be logged in different systems, it’s essential to correlate data across sources. This helps when building a map of the attack and understanding the series of steps an attacker followed.
Keep all host clocks synchronized: Similar to having consistency of data retention, a key data quality requirement is to ensure the timestamps across workstations, servers, and systems are in sync.
Run packet sniffers to collect additional data: Network packet capture may be necessary if an incident is occurring or has occurred over the network. It’s helpful to configure the packet sniffer with specified criteria so that the volume of data remains manageable. In some instances, due to privacy regulations and concerns, organizations may need incident handlers to request explicit approvals to configure packet sniffers in the corporate network.
Filter the data: Prioritizing which indicators should be investigated first and in more detail is critical in time-sensitive situations. Some techniques include labeling indicators into categories and assigning each a level of significance.
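The event correlation and clock synchronization recommendations go hand in hand: events from different sources only line up if their timestamps are comparable. Below is a minimal sketch that normalizes timestamps to UTC and merges two sources into a single timeline. The log entries, source names, and function names are made up for illustration, and the timestamps are assumed to be ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries a UTC offset and convert it to UTC.

    Timestamps without an offset would be interpreted as local time,
    so this sketch assumes every source logs its offset.
    """
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)


def build_timeline(*sources):
    """Merge (source_name, records) pairs into one list sorted by UTC time."""
    events = []
    for source_name, records in sources:
        for timestamp, message in records:
            events.append((to_utc(timestamp), source_name, message))
    return sorted(events, key=lambda event: event[0])


# Hypothetical records: the firewall logs in UTC, the web app in UTC-5.
firewall = [("2022-01-14T10:00:05+00:00", "connection allowed from 203.0.113.7")]
webapp = [
    ("2022-01-14T05:00:09-05:00", "admin password changed"),
    ("2022-01-14T05:00:01-05:00", "failed login for admin"),
]

timeline = build_timeline(("firewall", firewall), ("webapp", webapp))
for when, source, message in timeline:
    print(when.isoformat(), source, message)
```

Sorted naively by the raw strings, the web app events would appear five hours before the firewall event; once normalized, they interleave in the order they actually happened.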

Categorization

The following tables are from the NIST 800–61 rev 2 document and are an excellent foundation to help establish different categorizations that may be helpful during Incident Response.
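As a sketch of how such categorizations might be used in tooling, the snippet below encodes the functional impact, information impact, and recoverability categories from NIST SP 800–61 Rev. 2, assuming those are the tables in question. The `IncidentCategory` type and the `needs_escalation` rule are hypothetical and are not taken from NIST; each organization would define its own prioritization.

```python
from dataclasses import dataclass
from enum import Enum


class FunctionalImpact(Enum):
    NONE = "None"
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


class InformationImpact(Enum):
    NONE = "None"
    PRIVACY_BREACH = "Privacy Breach"
    PROPRIETARY_BREACH = "Proprietary Breach"
    INTEGRITY_LOSS = "Integrity Loss"


class Recoverability(Enum):
    REGULAR = "Regular"
    SUPPLEMENTED = "Supplemented"
    EXTENDED = "Extended"
    NOT_RECOVERABLE = "Not Recoverable"


@dataclass(frozen=True)
class IncidentCategory:
    functional: FunctionalImpact
    information: InformationImpact
    recoverability: Recoverability


def needs_escalation(category: IncidentCategory) -> bool:
    """Hypothetical triage rule for illustration only; not defined by NIST."""
    return (
        category.functional is FunctionalImpact.HIGH
        or category.information is not InformationImpact.NONE
        or category.recoverability
        in (Recoverability.EXTENDED, Recoverability.NOT_RECOVERABLE)
    )
```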

Containment, Eradication & Recovery

Evidence Gathering

During the investigation, and especially in the early stages, it’s important to gather sufficiently detailed evidence in a timely manner for the analysis phase. If the incident leads to a legal proceeding, exceptional care in handling and preserving the evidence will be required. A chain of custody helps trace the steps the evidence took during the incident. NIST recommends that a detailed log be kept for:

Identifying information such as locations, hostnames, serial numbers, and others
Details of the person who collected or handled the evidence
Time and date of each occurrence of evidence handling
Locations where the evidence was stored
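The log fields listed above map naturally onto a small data structure. The following is a minimal sketch of a chain-of-custody record; the class and field names are assumptions chosen for illustration, and a real evidence log would also need tamper protection and sign-off procedures.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class CustodyEntry:
    """One occurrence of evidence handling."""
    handler: str            # person who collected or handled the evidence
    action: str             # what was done, e.g. "collected", "transferred"
    storage_location: str   # where the evidence was stored afterwards
    timestamp: datetime     # time and date of the handling, in UTC


@dataclass
class EvidenceItem:
    """Identifying information plus the running custody log."""
    description: str
    hostname: str
    serial_number: str
    location: str
    log: List[CustodyEntry] = field(default_factory=list)

    def record(self, handler: str, action: str, storage_location: str,
               timestamp: Optional[datetime] = None) -> CustodyEntry:
        """Append a handling event; defaults to the current UTC time."""
        entry = CustodyEntry(handler, action, storage_location,
                             timestamp or datetime.now(timezone.utc))
        self.log.append(entry)
        return entry


# Hypothetical usage with made-up values.
disk = EvidenceItem("Laptop disk image", "ws-042", "SN-EXAMPLE-001", "HQ, 3rd floor")
disk.record("A. Analyst", "collected", "Evidence locker 2")
disk.record("B. Examiner", "transferred for analysis", "Forensics lab")
```

Making each entry immutable (`frozen=True`) reflects the idea that custody records are appended to, never edited.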

The collection of evidence can be challenging for multiple reasons. For example, the threat actor may have “cleaned their tracks” in the system. Also, some types of evidence, such as in-memory data, may be lost if the proper actions are not taken. Finally, evidence may have to be gathered via snapshots of full system hard drives, or by the incident response team enabling additional network logging, packet sniffing, etc.

Containing the Incident

Key decisions will have to be made regarding containment strategies, and they will have tradeoffs. Every organization should define what it considers bearable risk when facing an incident. Some useful criteria for determining a containment strategy include:

Potential damage to and theft of resources
Need for evidence preservation
Service availability
Time and resources needed to implement the strategy
Effectiveness of the strategy
Duration of the solution

Eradicating the Incident

After containment, it’s critical to ensure that malware is deleted from the systems, that compromised user accounts are disabled, and that any uncovered vulnerabilities are addressed. During the evidence gathering process, it’s key to identify all affected hosts. The eradication and recovery effort can take days, weeks, or even months.

Post-Incident Activity

Creating a Follow-up Report

Identifying what happened, how, and how to prevent it is a critical phase of Incident Response. This ensures any issues identified can be mitigated and prevented from happening in the future.

Identifying lessons learned and mitigation strategy

NIST outlines a great set of questions that can help in identifying lessons learned during an incident:

What occurred and at what times?
Was the staff able to address the incident effectively? Were the procedures established followed? Are the procedures adequate?
What time-sensitive information did the team not have and why?
What actions were taken that affected the recovery process?
What should be done differently in a future incident?
What are ways to improve information sharing internally and externally?
Which tools or resources are needed to improve how the organization detects, analyzes, and mitigates incidents?

Summary

Incident Response has many phases, each of which plays an essential role in effective incident handling. The table below summarizes some of the critical elements of these phases: