Slack and the breakfast of champions
Why opening Slack before breakfast is the first mistake of the day.
8:30am. I have just finished making my coffee. I am about to sit down for breakfast with my wife. But I decide to grab my phone and… I make a stupid mistake. I robotically unlock my phone and open Slack.
Please don’t ask me why. Don’t tell me I should learn how to disconnect from work. Just don’t say anything. It just happens. Even if I hid the icon, my thumb would find its way back within a couple of weeks. I wish there were an app that could lock other apps outside working hours. Like a digital babysitter for my dopamine-starved thumb.
Anyway. I see a new SEV-1 incident has been raised. I say “new” because ten minutes ago, I had already opened Slack by accident, and there wasn’t any alert.
I can’t resist. I open the description. An internal application known as MBS is down. Orders are accumulating in a pending state because our systems can no longer detect fraudulent ones.
I have a history with this MBS application. Two SEV-1s and one SEV-2 in the past year. On my watch. When I see these three letters, my PTSD kicks in.
As far as I know, this application was being rewritten from scratch due to its instability. What I am unsure about is whether this new incident is related to the new or old version of the app.
I look at my wife. She looks at me. I don’t need to say anything. She understands. I will skip breakfast with her today.
I join the war room. It is the new version that is crashing. I feel relieved. It’s still a SEV-1, but I am just glad that this time, it’s not happening on my Kubernetes cluster.
We spend twenty minutes troubleshooting. An index is missing from the database. We re-create it. Latency drops sharply. The issue is mitigated.
Now we need to understand why the index was dropped. A few more engineers join the room. These are the ones who know the application very well. Looks like they prioritised their breakfast over the incident. Or they simply didn’t check Slack first thing in the morning, like idiots do. Unlike me.
They find the issue quickly. Business users submitted a CSV file last night with corrupted records, and it was loaded into the database without a single complaint.
Sounds familiar.
I open the RCA of the previous incident. Yes. Exactly the same situation. Manual data input from business users had broken everything as well. I remember the leadership being furious about this design during the not-so-blameless post-mortem. It seems nothing has changed. The only good thing is that, due to the lack of confidence in this service, most consumers have put their API calls behind a feature flag, making it easier to limit the blast radius.
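For anyone who has not seen that pattern, here is a minimal sketch of what "an API call behind a feature flag" can look like on the consumer side. Everything in it is an assumption for illustration: the flag name `fraud_check_enabled`, the in-memory `FLAGS` store, the `call_fraud_api` callable and the `manual_review` fallback are all hypothetical, and none of it is the real MBS client.

```python
from typing import Callable, Dict

# Hypothetical in-memory flag store. A real setup would read this from a
# flag service or configuration system.
FLAGS: Dict[str, bool] = {"fraud_check_enabled": True}


def check_order(order: dict, call_fraud_api: Callable[[dict], bool]) -> str:
    """Return 'approved', 'rejected' or 'manual_review' for one order.

    call_fraud_api is a stand-in for the unstable dependency. It is assumed
    to return True when the order looks fraudulent.
    """
    if not FLAGS.get("fraud_check_enabled", False):
        # Flag off: skip the dependency entirely and take the fallback path.
        return "manual_review"
    try:
        return "rejected" if call_fraud_api(order) else "approved"
    except Exception:
        # Dependency failed: degrade to the fallback instead of crashing.
        return "manual_review"
```

Flipping `FLAGS["fraud_check_enabled"]` to `False` takes the dependency out of the request path without a deploy, which is what keeps the blast radius small. Whether the fallback should be manual review, auto-approval or something else is a business decision this sketch does not try to settle.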
Anyway. Good luck to the team that will have to explain why the design is still the same.
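For what it’s worth, a guard that rejects a bad file before anything reaches the database does not need to be big. Below is a minimal sketch under stated assumptions: the columns `order_id` and `amount` are hypothetical (the real file layout is not described here), and the rule is all-or-nothing, meaning nothing gets loaded if any record fails.

```python
import csv
import io
import math
from typing import List, Tuple

REQUIRED_COLUMNS = ("order_id", "amount")  # hypothetical schema


def validate_csv(text: str) -> Tuple[List[dict], List[str]]:
    """Parse CSV text and return (rows, errors).

    If errors is non-empty, rows is empty and nothing should be loaded.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], [f"missing columns: {', '.join(missing)}"]

    rows: List[dict] = []
    errors: List[str] = []
    for row in reader:
        line_no = reader.line_num  # source line of the record just read
        order_id = (row.get("order_id") or "").strip()
        try:
            amount = float(row.get("amount") or "")
        except ValueError:
            errors.append(f"line {line_no}: amount is not a number")
            continue
        if not order_id:
            errors.append(f"line {line_no}: order_id is empty")
        elif not math.isfinite(amount) or amount < 0:
            errors.append(f"line {line_no}: amount is out of range")
        else:
            rows.append({"order_id": order_id, "amount": amount})

    # All-or-nothing: a single bad record rejects the whole file.
    return ([], errors) if errors else (rows, [])
```

The caller would only start the database load when `errors` comes back empty, and would hand the error list back to whoever uploaded the file.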
For me, it’s time to go back to my breakfast. I wasn’t helpful in this war room. I should really uninstall Slack. Cold eggs, cold coffee, warm regrets. Classic SEV-1 morning.