Today is deployment day.
I’ve been nominated to oversee the deployment of one of our microservices for two reasons:
I’m currently ranked first on the reliability leaderboard! There has been no incident on my watch for the past month. (Don’t be impressed, I’ve just returned from a three-week leave.)
This microservice has famously failed every deployment over the past five months. Someone clearly doesn’t like seeing my name on top of that leaderboard.
But I’m up for the challenge. As I said, I am just back to work and still have a spark of hope living inside.
I reach out to the team. I propose a bet on the deployment outcome. They don’t laugh. I understand.
We go through each step. The deployment includes a SQL script to patch a database before deploying the new version. It gives me the SRE ick: a sudden feeling that I dislike what’s going to happen.
I suggest re-running everything step by step in lower environments. If these guys were expecting someone to boost their confidence, it won’t be me.
Everything goes well.
The deployment window is starting soon.
I announce that we are going to start. Explosion of emojis. I feel the support. In my head, I imagine each emoji as a pat on my back while walking towards the rocket that will send me to the moon. Ah. Death emoji. Back to reality.
We’re starting. Everyone is on the call. Ready to go.
First steps are fine. No issue to report.
It’s time to run the SQL script.
Failure. Am I really surprised?
“Index already exists”.
Okay. Team, what do we do? No answer. Can we drop the existing one? No answer.
Then the alarms start ringing. Our core API backend for our mobile app is down. Our application is unusable. We are asked to join another bridge call.
Forget the moon. This time, it’s a perp walk.
The SQL script stopped halfway and applied only half of the data patch, leaving the database in an inconsistent state. It messed up everything. We drop the existing index, update the script to resume from where it stopped, and run it again. Success.
Services are back to normal. A 23-minute interruption.
After three weeks’ leave, nothing has changed. Still firefighting. The only good thing is that I’m now at the bottom of the leaderboard. I can relax. And the losing streak can continue for this service.
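For the postmortem, here is a minimal sketch of how a patch script can be made safe to re-run, so that "Index already exists" cannot leave the data half-patched. Everything in it is an assumption for illustration: the `users` table, the `idx_users_email_lower` index, the `apply_patch` helper, and SQLite standing in for our real database. The idea is to run the whole patch in one transaction and to create the index only if it is missing.

```python
import sqlite3


def apply_patch(conn, statements):
    """Run every statement in one transaction; undo all of them on failure."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        for sql in statements:
            cur.execute(sql)
    except sqlite3.Error:
        cur.execute("ROLLBACK")  # nothing from this patch is kept
        raise
    cur.execute("COMMIT")


# isolation_level=None lets us control BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Hypothetical starting state: the index is already there from an earlier attempt.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_lower TEXT)"
)
conn.execute("INSERT INTO users (email) VALUES ('A@Example.com'), ('b@example.com')")
conn.execute("CREATE INDEX idx_users_email_lower ON users (email_lower)")

# Hypothetical patch: each statement is safe to run more than once.
PATCH = [
    "UPDATE users SET email_lower = lower(email) WHERE email_lower IS NULL",
    "CREATE INDEX IF NOT EXISTS idx_users_email_lower ON users (email_lower)",
]

apply_patch(conn, PATCH)
apply_patch(conn, PATCH)  # a second run changes nothing and does not fail
```

Whether index creation can be rolled back depends on the database engine, so this is worth checking against the one actually in production. MySQL, for example, commits implicitly on DDL statements, which would bring back the halfway state we hit.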