Beaver Duty

Upstream Alert: One update, no more Internet

A daily OS patch knocked out thousands of OpenAI’s GPU nodes and, with them, a large chunk of ChatGPT and API availability.

Matt
Jul 15, 2025

RCA: https://status.openai.com/incidents/01JXCAW3K3JAE0EP56AEZ7CBG3

What happened?

OpenAI’s much-coveted GPU fleet hit a wall on June 10, 2025. Imagine you are working from home and your Internet connection suddenly cuts out. Frustrating, right? Now imagine it happens at work, and nobody can get anything done. Productivity grinds to a halt. That is precisely what happened to the thousands of GPU nodes that lost network connectivity that day.

A routine daily OS update restarted a Linux network service (the famous systemd-networkd) on the GPU nodes, and the restart didn’t play well with a custom networking agent that was also running on these hosts. All network routes were removed, and the nodes could no longer connect.
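
To make the failure mode concrete, here is a minimal sketch of how you might spot routes that vanish across a service restart: snapshot the routing table before and after, then diff the two. This is purely illustrative and not OpenAI's tooling. The function names (`parse_routes`, `missing_routes`) and the sample snapshots are hypothetical, and the text is assumed to look like the output of `ip route show`. The sample keeps one surviving route so the diff has something to compare against; in the real incident, all routes were removed.

```python
def parse_routes(ip_route_output: str) -> set[str]:
    """Return the set of whitespace-normalized, non-empty route lines."""
    routes = set()
    for line in ip_route_output.splitlines():
        normalized = " ".join(line.split())
        if normalized:
            routes.add(normalized)
    return routes


def missing_routes(before: str, after: str) -> list[str]:
    """Routes present in the `before` snapshot but absent from `after`."""
    return sorted(parse_routes(before) - parse_routes(after))


# Hypothetical snapshots in `ip route show` style.
# Addresses come from the reserved documentation ranges.
BEFORE = """\
default via 192.0.2.1 dev eth0 proto dhcp
192.0.2.0/24 dev eth0 proto kernel scope link src 192.0.2.10
198.51.100.0/24 via 192.0.2.254 dev eth0 proto static
"""

AFTER = """\
192.0.2.0/24 dev eth0 proto kernel scope link src 192.0.2.10
"""

lost = missing_routes(BEFORE, AFTER)
for route in lost:
    print(f"route lost after restart: {route}")
```

On a real host, the two snapshots would be captured around the restart rather than hard-coded. A non-empty result after a routine update is exactly the kind of signal worth alerting on.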

© 2025 Matt