Runbooks and Playbooks
Updated: Jan 5
I've spent too long treating these terms as interchangeable.
Tactical Formula to Address a Situation
Describe a situation that may occur
Describe an end goal
List the steps to reach the end goal
Include the contacts for the situation
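As a rough sketch, the four parts above could be captured in a small data structure. The class name SituationGuide, its field names, and the sample DNS content are placeholders invented for illustration, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class SituationGuide:
    """Hypothetical container for the four parts listed above."""
    situation: str                                      # what may occur
    end_goal: str                                       # what "done" looks like
    steps: list[str] = field(default_factory=list)      # ordered steps to the goal
    contacts: list[str] = field(default_factory=list)   # who to reach for this situation

    def render(self) -> str:
        """Format the guide as plain text for a wiki page or ticket."""
        lines = [f"Situation: {self.situation}", f"End goal: {self.end_goal}", "Steps:"]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, start=1)]
        lines.append("Contacts: " + ", ".join(self.contacts))
        return "\n".join(lines)


# Sample content is invented for illustration only.
dns_change = SituationGuide(
    situation="A DNS record must point at a new load balancer",
    end_goal="Traffic resolves to the new load balancer",
    steps=["Lower the record TTL", "Update the record", "Verify resolution"],
    contacts=["on-call engineer", "network team"],
)
print(dns_change.render())
```

Keeping the parts as separate fields makes it easy to spot a guide that is missing its end goal or its contacts.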
Strategic review of a response to a Situation
Strategic planning for the future
Targeted to leadership
Contains a RACI Chart to identify key roles and responsibilities
Start every automated response with communication
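One way to read "start with communication" is that the notification is the first thing an automated response does, before any action runs. The sketch below assumes a hypothetical run_response helper and a notify callable that you would supply yourself (for example, something that posts to a chat channel); the closing notification is my own addition.

```python
from typing import Callable


def run_response(
    summary: str,
    actions: list[Callable[[], None]],
    notify: Callable[[str], None],
) -> None:
    """Announce the response first, then run each action in order."""
    notify(f"Starting automated response: {summary}")   # communication comes first
    for action in actions:
        action()
    notify(f"Finished automated response: {summary}")   # assumption: also announce completion


# Record everything in one list so the ordering is easy to inspect.
log: list[str] = []
run_response(
    "rotate leaked credential",
    actions=[
        lambda: log.append("action: disable old key"),
        lambda: log.append("action: issue new key"),
    ],
    notify=lambda message: log.append("notify: " + message),
)
print(log)
```

Passing notify in as a parameter keeps the sketch independent of any particular chat or paging tool.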
Security Must Dos
MFA - the best way to handle login
Use Secrets Manager
Limit Security Groups (No 0.0.0.0/0)
Centralize CloudTrail Logs
Validate IAM roles (No "*" wildcards in your IAM policies)
Raise organization security culture
Use the AWS Well-Architected Tool to check your workloads
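Two of the items above (no 0.0.0.0/0 in security groups, no stars in IAM) lend themselves to a quick automated check. The sketch below works on plain dictionaries meant to mirror the shape of an EC2 security group description and an IAM policy document, so it runs without AWS credentials. The function names and the sample group and policy are invented for illustration, and the IAM check only flags a bare "*", not partial wildcards such as "s3:*".

```python
def open_ingress_rules(security_group: dict) -> list[dict]:
    """Return ingress rules that allow traffic from any address."""
    open_cidrs = {"0.0.0.0/0", "::/0"}
    flagged = []
    for rule in security_group.get("IpPermissions", []):
        cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
        if open_cidrs.intersection(cidrs):
            flagged.append(rule)
    return flagged


def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements whose Action or Resource is a bare "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):        # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = statement.get(key, [])
            if isinstance(values, str):     # a single value may be a bare string
                values = [values]
            if "*" in values:
                flagged.append(statement)
                break
    return flagged


# Sample data is invented for illustration only.
group = {
    "GroupId": "sg-example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ],
}
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
}
print([rule["FromPort"] for rule in open_ingress_rules(group)])
print(len(wildcard_statements(policy)))
```

In practice the dictionaries would come from your own inventory of security groups and policies; how you fetch them is outside this sketch.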
Additional Info from Well-Architected Framework
Use runbooks for standard activities such as deployment: Runbooks are the predefined steps to achieve specific outcomes. Use runbooks to perform standard activities, whether done manually or automatically. Examples include deploying a workload, patching it, or making DNS modifications.
For example, put processes in place to ensure rollback safety during deployments. Ensuring that you can roll back a deployment without any disruption for your customers is critical in making a service reliable.
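A minimal sketch of the rollback idea, assuming you can supply an activate function that switches the running version and an is_healthy check. Both callables, the function name, and the version strings are hypothetical. Real zero-disruption rollbacks usually need more than this (for example, keeping the old version running while the new one is verified).

```python
from typing import Callable


def deploy_with_rollback(
    current_version: str,
    new_version: str,
    activate: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Activate new_version; return to current_version if the health check fails.

    Returns the version left running.
    """
    activate(new_version)
    try:
        healthy = is_healthy()
    except Exception:
        healthy = False          # a crashing health check counts as unhealthy
    if healthy:
        return new_version
    activate(current_version)    # roll back to the last known good version
    return current_version


# Simulate a failed deployment: the health check always reports unhealthy.
history: list[str] = []
running = deploy_with_rollback(
    "v1", "v2", activate=history.append, is_healthy=lambda: False
)
print(running, history)
```

The point of the sketch is that the rollback path is part of the deployment procedure itself, not something improvised during an incident.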
For runbook procedures, start with a valid, effective manual process, implement it in code, and trigger automated execution where appropriate.
Even for sophisticated workloads that are highly automated, runbooks are still useful for running game days or meeting rigorous reporting and auditing requirements.
Note that playbooks are used in response to specific incidents, and runbooks are used to achieve specific outcomes. Often, runbooks are for routine activities, while playbooks are used for responding to nonroutine events.
Use playbooks to investigate failures: Enable consistent and prompt responses to failure scenarios that are not well understood, by documenting the investigation process in playbooks. Playbooks are the predefined steps performed to identify the factors contributing to a failure scenario. The results from any process step are used to determine the next steps to take until the issue is identified or escalated.
The playbook is proactive planning that you must do, so as to be able to take reactive actions effectively. When failure scenarios not covered by the playbook are encountered in production, first address the issue (put out the fire). Then go back and look at the steps you took to address the issue and use these to add a new entry in the playbook.
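The "result of each step determines the next step" idea can be sketched as a small lookup table that is walked until the issue is identified or escalated. Everything here is an assumption for illustration: the Playbook type alias, the "DONE:" and "ESCALATE:" prefixes, the step names, and the hard-coded check results.

```python
from typing import Callable

# Each step runs a check and maps its result to the next step name.
# A next-step value starting with "DONE:" or "ESCALATE:" ends the walk.
Playbook = dict[str, tuple[Callable[[], str], dict[str, str]]]


def investigate(playbook: Playbook, start: str, max_steps: int = 20) -> str:
    """Walk the playbook from start until an outcome is reached."""
    step = start
    for _ in range(max_steps):
        if step.startswith(("DONE:", "ESCALATE:")):
            return step
        if step not in playbook:
            return f"ESCALATE: no playbook entry for '{step}'"
        check, next_steps = playbook[step]
        result = check()
        # An unplanned result is escalated rather than guessed at.
        step = next_steps.get(
            result, f"ESCALATE: unexpected result '{result}' at '{step}'"
        )
    return "ESCALATE: step limit reached"


# Checks return fixed strings here; real checks would query your systems.
playbook: Playbook = {
    "check_health_endpoint": (
        lambda: "failing",
        {"ok": "DONE: service healthy", "failing": "check_recent_deploy"},
    ),
    "check_recent_deploy": (
        lambda: "yes",
        {"yes": "DONE: roll back latest deploy", "no": "ESCALATE: page service owner"},
    ),
}
print(investigate(playbook, "check_health_endpoint"))
```

After an incident that the playbook did not cover, the steps you actually took can be added as new keys in the same dictionary.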
Perform post-incident analysis: Review customer-impacting events, and identify the contributing factors and preventative action items. Use this information to develop mitigations to limit or prevent recurrence. Develop procedures for prompt and effective responses. Communicate contributing factors and corrective actions as appropriate, tailored to target audiences.
Assess why existing testing did not find the issue.
Add tests for this case if tests do not already exist.
Conduct game days regularly: Use game days to regularly exercise your procedures for responding to events and failures as close to production as possible (including in production environments) with the people who will be involved in actual failure scenarios. Game days enforce measures to ensure that production events do not impact users.
Game days simulate a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. This will help you understand where improvements can be made and can help develop organizational experience in dealing with events. These should be conducted regularly so that your team builds "muscle memory" on how to respond.
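As a toy illustration of the game day loop (inject a failure, respond, check recovery, clean up), the sketch below uses an in-memory dictionary in place of real infrastructure. The run_game_day helper, its parameters, and the primary/standby scenario are all hypothetical.

```python
from typing import Callable


def run_game_day(
    scenario: str,
    inject_failure: Callable[[], None],
    respond: Callable[[], None],
    is_recovered: Callable[[], bool],
    restore: Callable[[], None],
) -> dict:
    """Inject a failure, let the team respond, and record whether recovery worked."""
    inject_failure()
    try:
        respond()
        recovered = is_recovered()
    finally:
        restore()                # always undo the injected failure
    return {"scenario": scenario, "recovered": recovered}


# Stand-in for real infrastructure state.
state = {"primary_up": True, "serving_from": "primary"}


def take_down_primary() -> None:
    state["primary_up"] = False


def fail_over() -> None:
    state["serving_from"] = "standby"


def bring_back_primary() -> None:
    state["primary_up"] = True
    state["serving_from"] = "primary"


result = run_game_day(
    "primary database outage",
    inject_failure=take_down_primary,
    respond=fail_over,
    is_recovered=lambda: state["serving_from"] == "standby",
    restore=bring_back_primary,
)
print(result)
```

In a real game day the respond step is the team following its runbooks and playbooks, and the recorded result feeds the list of improvements.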