Rob Behnke
September 19th, 2023
In Web3, detailed plans for improving the security of applications cannot eliminate the possibility of hacks. Still, a robust plan for handling vulnerabilities discovered in smart contracts can help mitigate losses for users.
This article explains the basics of formulating a plan for responding effectively to emergency situations in Web3. Whether you're a developer building dApps or an auditor advising clients on security matters, you'll find information in this article useful for improving the resilience and fault tolerance of decentralized protocols.
Before detailing how to craft a response plan, it’s important to know what types of scenarios qualify as a “security incident”. This reduces the risk that your team will handle certain issues poorly because they fail to recognize the security implications.
Bug or vulnerability that puts users’ funds at risk is discovered in your protocol’s smart contracts
Bug or vulnerability that puts users’ funds at risk is discovered in an underlying protocol (which your dApp utilizes)
Bug or vulnerability is discovered in third-party infrastructure and tooling used previously or currently. Examples include oracles, wallet address generators, bridges, multisig wallets, programming languages, and more.
Bug or vulnerability is discovered by an ethical hacker, bug bounty hunter, or security researcher
Active exploit or hack of your protocol is discovered and reported by a user, community member, or unidentified party
Key internal infrastructure is compromised (e.g., theft of founders’ private keys)
Any of these situations calls for a swift response to limit the damage. Wintermute, for example, lost $160M in funds due to an address generator vulnerability.
In the next section we discuss the elements of a well-crafted incident response plan for handling emergencies that have implications for the security of your Web3 project.
The first step in crafting an effective incident response plan is determining who will handle specific tasks during the process. For instance, you don’t want to start thinking about who should do what while a hack is ongoing. Assigning roles beforehand ensures you select people with the right skills and qualities (e.g. crisis management and the ability to work under pressure) and increases the odds of success.
While the description and naming of roles will vary, here are possible roles to think about when drafting your Web3 security incident response document:
Operations: The Operations role should handle on-chain and off-chain monitoring of threats while triaging and prioritizing security issues as needed. The person (or persons) assigned to this position will also be in charge of notifying relevant personnel about the security incident and activating the response plan.
The Operations role also plays a key part during and after the incident response. For instance, they may coordinate deployment of patches to fix smart contract bugs. Other example tasks include clearing queued operations and, if the project uses a multisig wallet, coordinating the signing of multisig transactions.
Furthermore, Operations personnel could document details about the incident response (event timelines, remediation efforts, etc.) to be used in drafting a post-mortem report.
Strategy: Like Operations, the Strategy role isn’t strictly defined. Nevertheless, an ideal requirement for the individual(s) assigned this role is the ability to analyze threats and craft remediation/eradication plans.
Additionally, the Strategy role may be required to drive discussions and collaborations forward as the team works on solving the problem. Hence, whoever holds this role should be skilled at handling high-stakes security incidents, whether that skill comes from dealing with real-world scenarios or from practice.
Communications: Although it’s possible to have Operations handle communications, creating a separate role may improve efficiency. The Communications role manages the flow of information within and outside the war room. Some example tasks:
Communicating with relevant third parties such as auditors, protocol developers, and security experts
Coordinating communications between members of the incident response team and ensuring everyone has access to important information about the situation (e.g., EOA and contract addresses involved in the exploit, key transactions, data from on-chain monitoring tools, etc.)
Vetting information coming into the war room
Communicating with the protocol’s community and other stakeholders
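One way to make role assignments checkable ahead of time is to keep them as data in the response plan. The following is a minimal sketch, not a prescribed format: the role names follow the three roles described above, while the `backup` field, the `missing_roles` helper, and the people named are assumptions made for illustration.

```python
from dataclasses import dataclass

# Role names follow the article; everything else here is illustrative.
REQUIRED_ROLES = ("operations", "strategy", "communications")


@dataclass(frozen=True)
class RoleAssignment:
    role: str
    primary: str
    backup: str  # hypothetical field: who covers if the primary is unreachable


def missing_roles(roster: list[RoleAssignment]) -> list[str]:
    """Return the required roles that have no assignment in the roster."""
    assigned = {entry.role for entry in roster}
    return [role for role in REQUIRED_ROLES if role not in assigned]


# Placeholder names; a real plan would list actual team members.
roster = [
    RoleAssignment("operations", primary="alice", backup="bob"),
    RoleAssignment("strategy", primary="carol", backup="dave"),
]

print(missing_roles(roster))  # ['communications']
```

Running a check like this when the plan is reviewed surfaces unfilled roles before an incident rather than during one.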
“War room” is a colloquial term for a virtual or physical place where team members can collaborate on handling major emergencies. War rooms are necessary because responding to security issues requires gathering experts who are familiar with the situation and can contribute meaningfully to resolving it.
Those assigned the roles described in the previous section will necessarily be part of the war room, as will protocol developers, UI developers, and other key members of the Web3 project's team.
You may invite others (e.g. auditors or whitehat hackers) as long as they are (a) trustworthy, (b) aware of the stakes involved, and (c) relevant to solving the problem. The response plan should identify which parties need to be invited to the war room and document processes for initiating contact.
While there aren’t any standards for structuring a war room, you'll want one with useful features for communication and collaboration. The war room could be anything from a private chat room on Signal, Telegram, or Discord to a Zoom or Google Hangouts call.
After setting up a war room, the next step is to review the threat. This involves assessing known information to confirm the existence and severity of the bug or vulnerability. Below are some questions to guide this phase of the incident response:
Is there concrete evidence validating the issue? Examples would be recent transactions or social media reports from affected users.
Is this an isolated incident, or does it affect multiple components in the protocol? You want to identify domino effects that a particular vulnerability might produce.
Are users’ funds at risk? How much? The amount at stake may inform your team’s response plans—for example, the BNB token bridge exploit (which put nearly $566M at risk) required halting the blockchain as a mitigative measure.
If users’ funds are safe (for now), does the incident still require an intervention?
If users’ funds are at risk, what (immediate) actions can be taken to mitigate losses?
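The questions above can be turned into a simple triage rule so that the first assessment is consistent from one incident to the next. The sketch below is only an illustration: the `ThreatReview` fields, the severity labels, and the 1,000,000 USD threshold are all assumptions, and a real team would choose its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatReview:
    evidence_confirmed: bool   # e.g. exploit transactions or user reports exist
    components_affected: int   # how many protocol components are exposed
    funds_at_risk_usd: float   # best current estimate; 0 if funds are safe
    exploit_ongoing: bool


def triage(review: ThreatReview, critical_usd: float = 1_000_000) -> str:
    """Map review answers to a severity label. The threshold is illustrative."""
    if not review.evidence_confirmed:
        return "unconfirmed"
    if review.exploit_ongoing or review.funds_at_risk_usd >= critical_usd:
        return "critical"
    if review.funds_at_risk_usd > 0 or review.components_affected > 1:
        return "high"
    return "monitor"


report = ThreatReview(
    evidence_confirmed=True,
    components_affected=2,
    funds_at_risk_usd=0,
    exploit_ongoing=False,
)
print(triage(report))  # high
```

Note that the "monitor" outcome still leaves room for intervention: funds being safe for now does not mean the incident can be ignored.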
Once you have enough information confirming the incident, swift effort to determine the root cause(s) of smart contract exploits is critical. Members of the incident response team will likely need to collaborate on reviewing exploit transactions to understand the vulnerability.
Live debugging sessions (via videoconferencing tools like Zoom) are ideal and can be aided with the following tools:
Certain security incidents may require immediate action to mitigate the loss of funds. This is especially the case if an exploit is ongoing or the bug is still active.
Below are some defensive actions to consider if your protocol suffers an exploit:
Updating the UI to display information about the security incident or to disable specific user operations (e.g., deposits and withdrawals)
Performing whitehat rescues (i.e. contracting ethical hackers to drain funds from vulnerable smart contracts)
Taking defensive actions reduces the pressure on incident response teams and leaves more room for properly evaluating attacks before developing long-term solutions.
Like other parts of the incident response, tasks in this phase should be assigned to specific parties beforehand to improve efficiency and speed of execution.
Further steps would include:
Listing EOA or multisig wallet accounts required to execute transactions
Preparing scripts for deploying mitigative measures in advance
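The two preparatory steps above (listing the accounts needed and preparing scripts in advance) can be recorded together, so the team can tell at a glance whether a defensive action is executable right now. This is a rough sketch under stated assumptions: the `MitigationAction` structure, the script path, and the signer labels are hypothetical placeholders, not real addresses or files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MitigationAction:
    name: str
    script: str               # path to a script prepared in advance
    signers: tuple[str, ...]  # accounts allowed to approve the action
    threshold: int            # approvals needed (1 for a single EOA)


def can_execute(action: MitigationAction, available: set[str]) -> bool:
    """True if enough of the listed signers are reachable right now."""
    reachable = available.intersection(action.signers)
    return len(reachable) >= action.threshold


disable_deposits = MitigationAction(
    name="disable-deposits",
    script="scripts/disable_deposits.py",             # placeholder path
    signers=("0xSignerA", "0xSignerB", "0xSignerC"),  # placeholder labels
    threshold=2,
)

print(can_execute(disable_deposits, {"0xSignerA", "0xSignerC"}))   # True
print(can_execute(disable_deposits, {"0xSignerB", "0xOutsider"}))  # False
```

A check of this kind is also a natural thing to exercise during dry runs, since signer availability tends to change over time.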
Transparent, reliable, and consistent communication with users is an important aspect of effective incident response. Besides raising security awareness, it protects the protocol’s reputation and ensures it continues to receive support from the community.
That said, Web3 projects should follow rules guiding crisis communication when dealing with security incidents:
1. Make sure someone reviews all outgoing information for accuracy and to ensure that information that could further jeopardize security efforts isn’t disclosed prematurely.
2. Avoid making commitments to remediating users’ losses until all facts concerning the situation are understood. Sometimes, the magnitude of losses may be larger than realized (making previous remediation plans infeasible).
3. Employ the best communication channels available. For example, communication may start with a post on the project’s Discord server or Telegram channel before publishing an announcement on the company Twitter page.
4. Send regular updates at frequent intervals. Even if no new information is available, posting messages at intervals communicates to users that you’re working to resolve the situation.
Note: This task will likely be handled by the Communications role or someone that acts in that capacity (e.g. community manager or public relations officer).
After implementing mitigative measures, you’ll need to work out a permanent solution. Teams should have discovered the root cause of the smart contract hack by now, so they can brainstorm a fix.
This phase could be split into several sub-tasks:
Different war room participants (coordinated by the Strategy Lead) work on various solutions. Each solution is evaluated according to different criteria such as complexity, implementation timeline, and minimization of losses for users. Teams may need to work out disagreements and evaluate concerns from members before settling on an acceptable solution.
After reaching a consensus, the team runs tests to confirm the patch fixes the flaw. Ganache and Hardhat are useful for forking blockchain state and testing contracts locally.
The patch is reviewed (if needed) by security auditors and other developers to ensure it doesn’t introduce new vulnerabilities. Another approach is to set up independent reviews using a platform like Code4rena (audit contests can be as short as 24-48 hours).
The solution is implemented by the responsible party (assigned beforehand). The incident response plan document should include contract deployment/upgrade scripts, relevant multisig wallet address(es), and other necessary details.
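For the evaluation step described above, where candidate solutions are compared on complexity, implementation timeline, and minimization of losses, a weighted score can help structure the discussion. The sketch below is only one possible approach: the weights, the 1 to 5 rating scale, and the candidate names are invented for illustration, and a score like this would inform the team's decision rather than make it.

```python
# Weights are hypothetical; a team would set its own priorities.
WEIGHTS = {"complexity": 0.2, "timeline": 0.3, "loss_minimization": 0.5}


def score(ratings: dict[str, int]) -> float:
    """Weighted sum of 1-5 ratings, where 5 is best on every criterion."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Invented candidates and ratings, for illustration only.
candidates = {
    "pause-and-patch": {"complexity": 4, "timeline": 5, "loss_minimization": 3},
    "migrate-to-new-contract": {"complexity": 2, "timeline": 2, "loss_minimization": 5},
}

ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
print(ranked[0])  # pause-and-patch
```

Here a rating of 5 on "complexity" means the simplest option, so that higher is better across all three criteria.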
The post-mortem is the final phase of the security incident response process. Here, team members (and other relevant parties) gather to reevaluate the incident, identify other root causes (aside from those mentioned already), and share feedback on the overall incident response process. This provides development teams with useful information for preventing future exploits and improving internal security and incident response processes.
Another reason to conduct a post-mortem is to gather information for the vulnerability disclosure statement. The disclosure statement details the security incident for the benefit of users, developers, and interested stakeholders and highlights actions taken by the team to deal with the vulnerability. This way, users are aware of your team's commitment to proactively handling security issues and safeguarding users’ funds.
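Both the post-mortem and the disclosure statement depend on an accurate record of what happened and when. As a minimal sketch of how event timelines could be captured during the response and rendered afterwards, consider the following; the `TimelineEvent` structure, the output format, and the sample events are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime  # timezone-aware timestamp of the event
    actor: str    # role or person who acted
    note: str     # what happened


def render_timeline(events: list[TimelineEvent]) -> str:
    """Return events in chronological order, one per line, in UTC."""
    ordered = sorted(events, key=lambda event: event.at)
    return "\n".join(
        f"{event.at.astimezone(timezone.utc):%Y-%m-%d %H:%M} UTC"
        f" | {event.actor} | {event.note}"
        for event in ordered
    )


# Invented sample events, logged out of order on purpose.
events = [
    TimelineEvent(datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
                  "operations", "Deposits disabled in UI"),
    TimelineEvent(datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
                  "operations", "Alert triaged, war room opened"),
]

print(render_timeline(events))
```

Sorting at render time means entries can be logged in whatever order they are reported during the incident.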
Designing a detailed incident response playbook for your Web3 project puts you in a good position to respond effectively and efficiently to discovered vulnerabilities. While the stress and pressure associated with security incidents can increase the risk of error, a comprehensive plan to guide emergency response actions helps reduce that risk.
But it isn’t enough to simply create an emergency response plan; you should also perform dry runs at regular intervals so your team is well acquainted with the incident response process. This will improve the effectiveness of the incident response and put your Web3 project ahead of your competition in terms of security preparedness.
For help in putting together a strategic security incident response plan, get in touch with Halborn.