Logs of Duty: Disaster Recovery
Because resilience is mostly about confidence.
Rule #1: Write a disaster recovery plan. Make it a beautiful 20-page document. Store it in Confluence, and start telling everyone you’re ready to handle any disaster. With a bit of luck, you will never have to prove it.
Rule #2: It should be okay to skip some components. There is no GPU in the DR region you have picked? Conveniently forget to mention it, and let others find out when needed. Why do you have a GPU in your architecture anyway?
Rule #3: DR drills are not optional. Schedule them on Fridays for maximum attendance. When everyone wants to go home early, everything seems to work smoothly.
Rule #4: Assume automation never fails. The script written three years ago and called “dr.sh” is probably still fine. Nobody has tested it, but it’s in version control, so it must work.
Rule #5: Test failovers during business hours. Because we never know when things will break. And in the next post-mortem, justify your decision by saying you needed to prove the team was not ready. That’s leadership.
Rule #6: Backups are sacred. You back up each database twice a day and keep every backup for five years, but never try to restore one? Trust is part of the culture.
Rule #7: Expect shared ownership and accountability. Everyone is responsible for disaster recovery. When you notice an app team forgot to update your document, trigger a DR.
Rule #8: Be confident. When your leadership team asks if we are ready to handle disasters, speak confidently and reassure them that we are. Forget that you haven't updated or tested your plan in five years.
Rule #9: Forget dependencies. DNS, messaging, and object storage will magically realign when needed.
Rule #10: Blame AWS, GCP & Azure. After your next DR test fails spectacularly, point a finger at your cloud provider's outage and divert attention towards the need for a multi-cloud strategy.


