Beaver Duty

Upstream Alert: One update, no more Internet

A daily OS patch knocked out thousands of OpenAI’s GPU nodes and, with them, a large chunk of ChatGPT and API availability.

Matt
Jul 15, 2025

RCA: https://status.openai.com/incidents/01JXCAW3K3JAE0EP56AEZ7CBG3

What happened?

OpenAI’s highly sought-after GPU fleet hit a wall on June 10, 2025. Imagine you are working from home and your Internet connection suddenly cuts out. Frustrating, right? Now imagine it happens at work. Nobody can do anything, and productivity grinds to a halt. That is precisely what happened to the thousands of GPU nodes that lost network connectivity that day.

A routine daily OS update restarted a Linux network service (systemd-networkd) on the GPU nodes, and the restart did not play well with a custom networking agent that was also running on these hosts. All network routes were removed, and the nodes could no longer connect to the network.
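To make the failure mode concrete, here is a toy Python model of two components managing one routing table. It is only a sketch: the `Route`, `Node`, and `restart_network_service` names, the `keep_foreign` switch, and the example addresses are all hypothetical, and it assumes the network service treats routes it did not declare as stale. It does not reproduce systemd-networkd's real logic or OpenAI's agent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    destination: str
    owner: str  # hypothetical label: which component installed the route


@dataclass
class Node:
    routes: set = field(default_factory=set)

    def is_reachable(self) -> bool:
        # Toy rule: a node with no routes cannot talk to anything.
        return len(self.routes) > 0


def restart_network_service(node: Node, declared: set, keep_foreign: bool) -> None:
    """Toy reconciliation step that runs when the network service restarts."""
    if keep_foreign:
        # Routes installed by other components are left in place.
        node.routes = node.routes | declared
    else:
        # Anything the service did not declare itself is treated as stale.
        node.routes = set(declared)


def simulate(keep_foreign: bool) -> bool:
    """Return True if the node is still reachable after the restart."""
    node = Node()
    # Assumption: the custom agent installed every route on the host.
    node.routes.add(Route("0.0.0.0/0", owner="custom-agent"))
    node.routes.add(Route("10.0.0.0/8", owner="custom-agent"))
    # Assumption: the network service declares no routes of its own.
    restart_network_service(node, declared=set(), keep_foreign=keep_foreign)
    return node.is_reachable()


if __name__ == "__main__":
    print("flush foreign routes:", simulate(keep_foreign=False))
    print("keep foreign routes: ", simulate(keep_foreign=True))
```

In this toy setup, the restart itself is harmless. The damage comes from two components each assuming they own the routing table, so a restart that wipes "unknown" routes leaves the node with none at all.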
