We built an internal infrastructure service that was mathematically flawless.
So, the Principal Engineer forced us to intentionally take it offline every quarter.
The junior developers thought he was crazy. But he understood the dark reality of distributed systems.
If your actual performance is much better than your stated SLA, users will come to rely on your current performance. Development teams will build unreasonable dependencies on your service.
They will assume it never fails, and they will stop writing fallback logic.
When your "perfect" system inevitably crashes three years later, it will take the entire company down with it.
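For anyone wondering what "fallback logic" looks like in practice, here is a minimal sketch, not the code from our service. Everything in it is hypothetical: the `ServiceUnavailable` exception, the `fetch` callable, and the five-minute staleness limit are assumptions chosen for illustration. The idea is simply that the caller decides in advance what to do when the dependency is down.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by the (hypothetical) client when the dependency is down."""


def fetch_with_fallback(fetch, key, cache, max_age_s=300.0, now=time.monotonic):
    """Try the live dependency first; fall back to a recent cached value.

    Returns (value, source), where source is "live", "stale", or "degraded".
    `cache` is a plain dict mapping key -> (value, timestamp).
    """
    try:
        value = fetch(key)
    except ServiceUnavailable:
        entry = cache.get(key)
        # Serve the cached value only if it is recent enough to trust.
        if entry is not None and now() - entry[1] <= max_age_s:
            return entry[0], "stale"
        # Nothing usable: tell the caller to run in degraded mode
        # instead of letting the outage propagate.
        return None, "degraded"
    # Success: remember the value so a later outage can be absorbed.
    cache[key] = (value, now())
    return value, "live"
```

A team that has never seen the dependency fail has no reason to write the `except` branch. A scheduled outage gives them one.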
Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed.
Don't overachieve.
Planned downtime isn't a failure. It's a feature: it forces every dependent team to design for the failures that distributed systems guarantee.
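How much downtime can you schedule without breaking your promise? A rough sketch of the arithmetic, assuming an availability target expressed as a fraction of a time window. The 99.9% target and the 90-day quarter below are illustrative assumptions, not our actual SLA.

```python
def unused_downtime_budget_s(slo, window_s, observed_downtime_s):
    """Seconds of downtime still permitted by the target in this window.

    slo: availability target as a fraction, e.g. 0.999 (assumed value).
    window_s: length of the measurement window in seconds.
    observed_downtime_s: downtime already incurred in the window.
    """
    allowed_s = (1.0 - slo) * window_s
    # Never report a negative budget; zero means the target is already spent.
    return max(0.0, allowed_s - observed_downtime_s)


# Illustrative numbers: a 99.9% target over a 90-day quarter with no outages
# leaves roughly 7776 seconds (about 130 minutes) available for a planned one.
quarter_s = 90 * 24 * 3600
budget_s = unused_downtime_budget_s(0.999, quarter_s, 0.0)
```

If that number is still untouched near the end of the quarter, the service is running well ahead of what was promised, and that is exactly when callers start assuming it never fails.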