The On-Call Test of Platform Excitement
Every platform team has an on-call rotation printed somewhere.
Ignore it. That is not the real test.
If you want to know the truth about a platform’s maturity, do not ask who is on-call. Make up an excuse, pretend the assigned person is unavailable, and ask the only question that actually exposes the system:
“Who wants to be on-call this weekend?”
Then watch the room.
If even one person flinches, looks at the floor, or suddenly remembers they need to defrost their fridge, you have your answer: the infrastructure is a liability.
The On-Call Test does not measure staffing. It measures how exciting your infrastructure really is. Is being on-call background noise or a full assault on a weekend?
The Boring Response
Sara: “I can take it.”
Immediate. Unemotional. No theatrics.
That is what calm architecture looks like: predictable, documented, and uneventful. On-call is just another shift, not a survival exercise.
The Broken Response
Immature platforms give you everything else.
Silence. Long, heavy silence.
Confusion. “Wait, isn’t it James’s turn?”
Excuses. “I had plans this weekend.”
Side-eye. The same person getting volunteered again.
If nobody wants to be on-call, your team just delivered a reliability report without saying a word: alerts are noise, incidents recur, runbooks lie, and ownership is foggy. The system is unpredictable.
This isn’t a morale problem; it’s a system reliability problem that engineers are forced to absorb.
Truth: The Economic Cost of the Flinch
That flinch you saw is not psychological. It is the direct financial cost of operational entropy.
Every incident still requires manual heroics and improvisation. The average escalation consumes an engineer’s entire weekend. The system has no guardrails, and the pager becomes a roulette wheel.
Each on-call shift extracts a measurable tax: sleep lost, context switching, delayed recovery. That flinch is the moment an engineer calculates whether they can survive contact with production.
This is operational debt charging interest. And the currency is your engineering hours.
Every frantic weekend steals a weekday feature. Every production fire costs a release. All of it delays future revenue.
The on-call dread is your balance sheet screaming.
Why the Chaos Endures
Organizations love to misdiagnose this. They call it burnout, resourcing, culture or morale issues.
Wrong problem.
The real issue is incentives. Many companies reward the weekend hero, not the engineer who made sure the weekend was boring.
Drama is visible. Prevention is invisible. Until prevention becomes the path to recognition, chaos stays incentivized.
Action: The BoringOps Solution
Getting to “I’ll take it” is not a pep talk. It is a platform engineering discipline that enforces predictability. It is building a system that does not require courage to operate.
This means:
- Systematic Failure Mode Analysis: Stop treating incidents as isolated. Document patterns until the system can predict its own pain points.
- Runbooks as Code: Replace improvisation with enforced automation and remediation loops.
- Clear Ownership Boundaries: No more operational ambiguity. Every component must have a single, named owner responsible for its operational predictability.
When failure paths shrink and resolution stops being artisanal, the team gets its hours back. That is capacity return, and it compounds.
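To make two of the practices above concrete, here is a minimal sketch in Python. It assumes incidents are tagged with a component and a failure mode during review, and that ownership lives in a simple mapping. The `Incident` type, the function names, and all sample data are hypothetical and only there for illustration, not a prescribed tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    component: str      # hypothetical component name
    failure_mode: str   # hypothetical label assigned during incident review


def recurring_failure_modes(incidents, min_count=2):
    """Return (component, failure_mode) pairs seen at least min_count times."""
    counts = Counter((i.component, i.failure_mode) for i in incidents)
    return {pair: n for pair, n in counts.items() if n >= min_count}


def ownership_gaps(owners_by_component):
    """Return components that do not have exactly one named owner."""
    return sorted(
        name
        for name, owners in owners_by_component.items()
        if len(owners) != 1
    )


# Illustrative data only; none of this describes a real system.
incidents = [
    Incident("queue", "disk full"),
    Incident("queue", "disk full"),
    Incident("gateway", "expired certificate"),
]
owners = {"queue": ["sara"], "gateway": [], "scheduler": ["james", "sara"]}

recurring = recurring_failure_modes(incidents)
gaps = ownership_gaps(owners)
print(recurring)  # repeated patterns are candidates for an automated runbook
print(gaps)       # components with zero owners or more than one
```

Run in CI or on a schedule, a check like this turns "incidents recur" and "ownership is foggy" from hallway complaints into a list someone can work through.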
Boring Infrastructure Stabilizes Everything Below It
A mature platform is boring by design. When you ask who wants to cover on-call, someone simply says:
“I’ll take it.”
But if that question triggers tension, negotiations, or evasive maneuvers, your infrastructure is too exciting to be safe.
Make the infrastructure boring. On-call becomes boring too.
Everything else follows.
boring (adj.): When you forgot you were on-call, and nothing happened anyway.