June 23rd, 2016
Every now and then, the internet is set on fire. All too often when this happens -- and this does happen all too often -- we scramble, wave our arms in the air and enter panic mode, trying to patch systems left and right until eventually the excitement wears off and we continue with our day to day routine.
Unfortunately, this does not lead to an elimination of the given attack vector, and the next time a major vulnerability in a widely used library drops, we rinse and repeat.
A formal Incident Response Process can help reduce stress, improve the efficiency of your limited resources, and help yield actual results in keeping your users and data secure. The following is an outline of an incident response process as I might design it; it is based on some of my experiences and observations at different companies but does not reflect any one employer's specific process.
Within this document, I will focus on "major" incidents. The nature of computer security requires our Incident Response Process to be adaptive, since minor incidents may evolve to become major incidents as our understanding of the impact changes. As a result, your Incident Response Process should be applicable (and be followed!) for "minor" incidents as well.
Typical examples of "major" incidents include:
Some of these events are known a priori, such as by way of responsible disclosure through your Bug Bounty program or within the community; some of these events hit us without advance notice, such as the sudden disclosure of a 0-day vulnerability or an immediate alert condition (most frequently: a human going "huh, that's weird").
Your Incident Response Process needs to be documented and accessible for everybody involved. Ensure that all participants know where to find it, be that on your wiki, as a formal policy document, a shared Google doc, or whatever works best for your organization. This document is the place people will go to during an incident, in a time of high stress and tremendous pressure, so it needs to be clearly written, easy to find and read, and properly linked.
Your Incident Response Process should follow a runbook style, to allow incident responders to walk a simple decision tree and execute the required steps. You may be able to automate many of the tasks involved, which has the added benefit that they are executed reliably and no steps are missed.
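As a sketch of what such automation might look like, the following hypothetical Python runbook executes each step in order and records its completion, so no step is silently skipped. The step names and actions are illustrative, not part of any real runbook:

```python
# Illustrative runbook: an ordered list of steps, each a function that
# operates on a shared incident context. All names here are made up.

def create_master_ticket(ctx):
    ctx["ticket"] = "INC-" + ctx["incident_id"]

def open_chat_channel(ctx):
    ctx["channel"] = "#incident-" + ctx["incident_id"]

def start_timeline(ctx):
    ctx["timeline"] = []

RUNBOOK = [create_master_ticket, open_chat_channel, start_timeline]

def run(incident_id):
    ctx = {"incident_id": incident_id, "completed": []}
    for step in RUNBOOK:
        step(ctx)
        # Record each completed step so responders can verify nothing was missed.
        ctx["completed"].append(step.__name__)
    return ctx
```

Even this trivial structure gives you a checklist that is executed the same way every time, which is exactly what you want at 3am under pressure.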
Incident Response is divided into the following steps:
It's critical to understand (and, in execution, remember) that the primary task of the Incident Response team is not to complete these tasks, but to coordinate them. As such, efficient communication between the IR team and other teams and individuals is crucial.
Communications amongst participants happen synchronously and in real-time (e.g. via IRC or some other online chat, face to face, or over the phone / video chat) as well as asynchronously (e.g. via email). Communications may be one-to-many (e.g. announcements), one-to-one (e.g. notification or dialog amongst individuals), or many-to-many (e.g. discussions); they may be confidential (initial disclosure or impact analysis), semi-confidential (internal discussions amongst different teams or organizations), internal-open (announcements to your company or organization at large), or public (on the internet or within public communities).
For each of these different types of communication, you will need a suitable channel. People will contact your IR team in a variety of ways. Not all alerts or disclosures necessarily trigger an incident and not all incidents are of equal importance or urgency. Your incident intake might be divided into:
This list of contact methods is in ascending order of priority and your on-call staff should track and respond to incoming requests appropriately.
In addition to the above, you also need to provide a method for people from outside the company to engage your team. This, however, goes a bit beyond the scope of just Incident Response, and may take the form of a Bug Bounty program, participation in community discussion forums or disclosure lists (such as Operations Security Trust), a public contact address with a public PGP key tied to it, or a variety of other possibilities.
Within this document, let us assume that incident intake begins with somebody within the company, regardless of how they became aware of the issue or by which channel they were notified.
It is important that everybody within the company knows how and where to report any security issues they encounter. You need to make sure that your contact information and engagement process is clearly spelled out and easy to find.
Upon notification or discovery of an issue, the first task of the IR staff is to identify and classify the incident. For this, your team may need to consult with SMEs on the affected piece of software and the extent to which it applies to your infrastructure and software stack, your larger infosec team on the perceived impact, your internal red team for exploitability, as well as your compliance team for input on any legal obligations resulting from the given issue.
Issue severity classification is a black art all by itself, requiring intimate understanding of all these factors. One factor is whether the vulnerability or issue is publicly known. Another is whether it is actively being exploited, assumed to be feasible to be exploited, or unlikely to be exploited. For example, a vulnerability found to have been present for years and disclosed without advance notification to the software vendors should be assumed to be actively exploited, while a responsibly disclosed vulnerability that requires extraordinary capabilities may well not be considered as such.
All this therefore requires a reasonably accurate and realistic Threat Model. But you need more than just a designation based on vulnerability type. It is a common mistake to declare e.g. all Remote Code Execution (RCE) vulnerabilities to be of the highest priority, regardless of what might be exposed from the vulnerable system. Severity should include a combined scoring of at least:
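One hypothetical way to combine such factors into a single score is a simple weighted sum; the factors, weights, and ratings below are assumptions for illustration, not a standard formula:

```python
# Illustrative severity scoring: each factor gets a 0..3 rating, weighted
# and summed. Factor names and weights are assumptions for this sketch.

FACTORS = {
    "exploitability":   3,  # actively exploited > feasible > unlikely
    "exposure":         2,  # internet-facing vs. internal-only
    "data_sensitivity": 2,  # what the vulnerable system can reach
    "public":           1,  # publicly disclosed issues score higher
}

def severity(ratings):
    """ratings: factor name -> 0..3; returns a weighted total score."""
    return sum(FACTORS[name] * ratings.get(name, 0) for name in FACTORS)

# An RCE on an isolated lab host with no sensitive data...
lab_rce = severity({"exploitability": 3, "exposure": 0,
                    "data_sensitivity": 0, "public": 1})
# ...can score lower than an information leak on an internet-facing
# system holding sensitive data:
edge_leak = severity({"exploitability": 2, "exposure": 3,
                      "data_sensitivity": 3, "public": 1})
```

Note how the internet-facing information leak outranks the RCE on the isolated lab host, reflecting the point above that vulnerability type alone is a poor proxy for severity.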
When identifying the severity of an issue, the team also needs to determine to what extent they may further disclose the vulnerability. One useful method of identifying with whom sensitive information may be shared is the Traffic Light Protocol or TLP. Considering and abiding by TLP designations even within your organization or your team(s) is critical to avoid accidental exposure of confidential information.
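A minimal sketch of such a sharing check, using a simplified rendering of the TLP audience categories (the audience labels are a simplification for illustration):

```python
# Simplified mapping of TLP designations to the audiences information
# may be shared with; the audience labels are illustrative shorthand.

TLP_AUDIENCE = {
    "RED":   {"named_recipients"},
    "AMBER": {"named_recipients", "need_to_know_in_org"},
    "GREEN": {"named_recipients", "need_to_know_in_org", "community"},
    "WHITE": {"named_recipients", "need_to_know_in_org", "community", "public"},
}

def may_share(tlp, audience):
    """Return True if information at the given TLP level may be shared
    with the given audience."""
    return audience in TLP_AUDIENCE[tlp.upper()]
```

Encoding the designations this way, even informally, makes the "may I forward this?" decision a lookup rather than a judgment call made under stress.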
When processing the intake, your team may decide to treat a given issue as an incident. This decision may at times be based on incomplete information, such as a security pre-announcement.
At this point, your team should start the formal Incident Response Process, following the process outline you have described in your policy document.
The first steps of tracking an incident should include the identification of an incident lead, creation of a master or parent ticket, an incident chat channel, a timeline document, as well as an incident information document or wiki page. The order in which these are created does not matter, but all should be part of the list of items to check off as incident tracking begins:
Even though incident response is a team effort and requires the collaboration of many individuals across different organizations, it is useful to identify one primary person to coordinate the incident. We will refer to this person as the Incident Lead. This person has the responsibility to ensure that the incident is tracked and the Incident Response Process is followed.
The Incident Lead may be one of the first responders or a more senior analyst, but she should be identified and involved early on.
Note: the Incident Lead is not responsible to do all the work, but to make sure it gets done. That is, she needs to help coordinate the research, own the timeline document, review the documentation, and ensure updates and notifications are sent to the appropriate parties.
Note: first and primary responders may change through the course of an incident. Work may be coordinated and driven to completion by other individuals or teams, but in the end it is the responsibility of the Incident Lead to own the final resolution.
This ticket should track all work relating to the incident. It is, by nature, a parent ticket, primarily used to provide a terse summary of the issue, include links to more information, and to link any and all other tickets or problem reports defining outstanding work within your company or organization.
Ideally, you will tag all tickets with a unique incident identifier (such as a CVE number, if available) and use an automated script to correlate and link tickets to the master ticket.
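Such correlation might be sketched as follows; the ticket fields and the `correlate` helper are assumptions, standing in for whatever your ticketing system's API actually provides:

```python
# Hypothetical sketch: given a dump of tickets, find every ticket
# tagged with the incident identifier so it can be linked under the
# master ticket. Field names are illustrative.

def correlate(tickets, incident_id, master_id):
    """Return ids of child tickets tagged with incident_id,
    excluding the master ticket itself."""
    return [t["id"] for t in tickets
            if incident_id in t.get("tags", []) and t["id"] != master_id]
```

Run periodically, a script like this keeps the master ticket's list of children complete without relying on humans to remember to link every new ticket.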
The Incident Lead should own the master ticket and close it only after all action items have been resolved and verified.
To ease discussions around a given incident, I recommend creating a dedicated internal-open chat channel early on in the incident response process. This will be the easiest way for people within your company or organization to ask for feedback, or for your team to coordinate resolution with different teams.
Unless the incident is classified as TLP Amber, you should open this channel to all members of your organization. The channel should be logged, and the title of the channel should be set to include the vulnerability or incident identifier (e.g. CVE number) as well as a link to the incident information document.
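As an illustration, the channel settings described above could be derived mechanically from the classification; the field names below are made up, and the actual chat API call will vary with your platform:

```python
# Illustrative sketch: derive incident channel settings from the
# incident identifier, information document link, and TLP level.
# All field names are assumptions, not any real chat system's API.

def incident_channel(identifier, info_url, tlp):
    return {
        "name": "#incident-" + identifier.lower(),
        # Title carries the identifier and the info document link:
        "topic": "{} | info: {}".format(identifier, info_url),
        "logged": True,
        # Open to the whole organization unless restricted by TLP:
        "open_to_all": tlp.upper() not in ("AMBER", "RED"),
    }
```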
As soon as the Incident Response Process starts, create a timeline document. This document will track important events and will be critical to help you analyze your response process in your post-mortem.
All too often, timelines are reconstructed after the incident. This necessarily leads to incongruities and misleading data, as either information is simply not available (any longer) or is (unintentionally and/or unconsciously) recorded incorrectly. This is why it's important to begin the timeline document early on and to continue to update it throughout the incident.
When creating a timeline document and recording events, you should:
This timeline document will be updated throughout the incident, and should be editable by all incident responders or participants. A shared Google doc or a plain text document under revision control is preferred, to ensure that changes can be tracked.
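A small helper for appending consistently formatted, UTC-timestamped entries to such a plain-text timeline might look like this (the entry format is an assumption; UTC avoids timezone confusion across distributed teams):

```python
from datetime import datetime, timezone

def timeline_entry(event, who, when=None):
    """Format one plain-text timeline line; defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return "{} | {} | {}".format(when.strftime("%Y-%m-%d %H:%M UTC"), who, event)

# A fixed example entry (the date is chosen arbitrarily for illustration):
example = timeline_entry("Advisory received", "alice",
                         datetime(2016, 6, 23, 14, 5, tzinfo=timezone.utc))
```

Appending such lines to a file under revision control gives you both a readable timeline and a change history showing who recorded what, and when.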
The basic structure of this document might be:
You will need to collect a lot of information, answer many questions, and make sure your organization can read up on the best methods to address a given vulnerability. To collect this information, you should create an Incident Information Document early on. It will necessarily be incomplete in the beginning; update it throughout the process.
This will be your go-to document. It should provide all the important information around the incident, the vulnerabilities in question, the work-arounds and solutions as well as answers to the most commonly asked questions.
The basic structure of this document might be:
As you classify the incident, determine who needs to be contacted to help identify impact and risk. Establish which teams are needed to help fix the problem. You should have at hand a list of SMEs for the most common issues (TLS and cryptographic protocols, your serving stack, your primary languages and frameworks, etc.); consult with your red team on the analysis of the attack vectors and realistic exploitability.
It's important to remember that in some cases, depending on which data may be at risk, notifications up the chain may be in order. Your Incident Response Process document should have clear guidelines on when to contact your CISO, and whether she ought to further escalate or notify executive staff.
In addition, publicly disclosed issues with major industry wide impact (think DROWN, POODLE, Shellshock, Heartbleed, ...) may require you to give your PR team a heads-up, as your company or organization may come under external scrutiny and press inquiries may be expected. When something like this strikes, it's also useful to send out a quick note to your organization at large.
As the incident response process starts, a more detailed analysis takes place. During this time, your team will update the timeline as well as the information documents as needed. The analysis includes:
For each of these items, remember to note relevant details in the timeline or incident information documents.
While running the incident, the primary responsibility of the Incident Response team is to dispatch information about the incident. This includes notifying system or software owners of the vulnerability. For this, you need a comprehensive and accurate inventory database mapping systems and software components to the teams responsible for their maintenance.
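With such an inventory in place, finding whom to notify reduces to a simple lookup. The data and structure below are made up for illustration:

```python
# Illustrative inventory: systems mapped to owning teams and the
# software components they run. Real inventories live in a database.

INVENTORY = {
    "web01": {"team": "frontend", "software": ["nginx", "openssl"]},
    "db01":  {"team": "storage",  "software": ["postgres", "openssl"]},
}

def owners_of(component):
    """Return the teams responsible for any system running the component."""
    return sorted({host["team"] for host in INVENTORY.values()
                   if component in host["software"]})
```

When the next library vulnerability drops, `owners_of("openssl")` immediately tells you which teams need a tracking ticket, which is only possible if the inventory is kept accurate.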
The notification of work items most commonly takes the form of tracking tickets. These should:
Ticket creation likely requires some automation. As the incident progresses, you may have to adjust priorities or SLAs, guide owners to work-arounds, and identify follow-up actions. This is the long tail of the Incident Response Process, and care must be taken that work-arounds, fixes, and updates are verified and follow-up actions correctly classified and tracked.
A dashboard that shows the number and status of relevant tickets by organizational leader (e.g. by VP) can be useful in pushing for traction as well as in illustrating and understanding your attack surface and vulnerability.
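A minimal sketch of the aggregation behind such a dashboard, assuming each ticket carries a status and an organizational-leader field (both field names are assumptions):

```python
from collections import Counter

def open_by_vp(tickets):
    """Count tickets that are not yet closed, grouped by organizational
    leader, e.g. to feed a per-VP dashboard."""
    return Counter(t["vp"] for t in tickets if t["status"] != "closed")
```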
The other important means of communication around the incident include the notification of the different audiences we identified earlier. You might consider notifying:
Throughout the incident, individual tasks will be marked as completed (e.g. by closing a ticket), and it is the Incident Response team's responsibility to verify that they were completed correctly and do not require any follow-up actions or to schedule any additional work items that may be needed.
For example, if a code injection vulnerability is found in a given piece of software, then it may be immediately acceptable to e.g. remove the software from the exposed system, or to restrict access to the system. But this is not sufficient to fully resolve the issue: all too frequently, vulnerable packages are kept available and systems get resurrected or reimaged with the vulnerable software. A suitable follow-up item would then be to track the elimination of the vulnerable software version from your repository altogether.
Major incidents require a post-mortem to allow your team(s) to review and learn how to improve the process. Having a complete timeline and a comprehensive incident information document is critical here.
Post-mortems should be scheduled soon after the incident has been completed. Due to the long tail in resolving all follow-up work, the post-mortem may frequently take place even before the full incident is marked as resolved, i.e. while follow-up actions are still outstanding.
During the post-mortem, the Incident Response Lead presents the detailed timeline of the incident, and participants help fill in any gaps. The primary goal is to create a meaningful description of the event, to identify any missed remediation actions or follow-up tasks, and to help refine the process.
Ideally, the flow of the post-mortem would follow this outline:
Follow-up items to help improve the Incident Response Process should also be tracked in your ticketing system, be assigned to specific humans, and have a due date.
Post-mortems should be open to anybody within your company to observe (so long as no restricted or highly confidential information is disclosed), although it's important to avoid getting distracted or derailed in lengthy discussions about future prevention of similar incidents: when these occur, take note of the suggestions, create a ticket for somebody to follow up on or review, and move on.
Lastly, the final post mortem document should be linked to from the Incident Information document.
Incident Response is a hard, laborious, tedious, and frequently thankless job. All too often, we don't know whether we're making any progress, and tracking incidents across a diverse and large infrastructure may feel like an attempt to boil the proverbial ocean.
Having a formal Incident Response Process that responders can adhere to step-by-step may help in assuring that nothing is overlooked, but in the end it is the reflective step of analyzing our own responses that can help us make the biggest impact.