The page comes in at 3 AM. You fumble for your phone, squinting at the screen. By the time you’ve diagnosed the issue, mitigated the problem, and documented what happened, it’s 5 AM. You try to sleep for another hour before your alarm goes off.
This isn’t a rare occurrence. It’s Tuesday.
For too many engineering teams, this is the reality of on-call work. And the data shows it’s unsustainable: 47% of DevOps engineers say on-call overload contributes directly to burnout or frustration.
But on-call doesn’t have to be this way. The best organizations have figured out how to maintain reliable systems without destroying their teams in the process.
The Burnout Reality
The numbers paint a stark picture of engineering burnout in 2024-2025:
- 22% of engineering leaders and developers report critical levels of burnout
- Another 24% report moderate burnout, with only 21% in the “healthy” category
- Two out of three engineers experienced burnout in the past 12 months
- 57% of SREs still spend more than half their week on toil despite automation advances
On-call is a major contributor to this burnout. The “always-on” culture prevalent in DevOps fosters an environment where rest and recovery are deprioritized.
The Hidden Cost: Attrition
Burned-out engineers leave. And replacing them is expensive.
According to research on IT work-life balance, unspoken frustrations about punishing on-call schedules contribute directly to employee attrition. In major tech hubs like London and San Francisco, it can cost $300,000 or more to replace a single engineer.
The same research found:
- 23.1% of IT professionals said poor work-life balance prompted them to consider leaving
- One in four said the “always-on” nature of their work makes their jobs unmanageable
- 56.7% accept disrupted sleep and poor work-life balance as “just part of the job”
Perhaps most concerning: 72% of respondents said their management team has little to no visibility into how on-call work negatively affects employees’ personal lives.
The Google Standard: 2 Pages Per Shift
If anyone has figured out sustainable on-call at scale, it’s Google. Their Site Reliability Engineering (SRE) practices have become the industry standard for good reason.
The Page Budget
Google has established a clear limit: maximum 2 paging incidents per 12-hour on-call shift.
The reasoning is arithmetic. Google has found that, on average, handling an on-call incident (root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs) takes 6 hours. At that rate, two incidents fill an entire 12-hour shift, which makes two the sustainable maximum.
The ideal? A median of zero pages per shift. If a given component causes pages every day, it’s likely that something else will break at some point, causing more incidents than should be permitted.
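The budget arithmetic above can be sketched in a few lines. This is a rough illustration, not Google's tooling: the function names and the sample page counts are hypothetical, and only the 6-hour and 12-hour figures come from the text.

```python
from statistics import median

HOURS_PER_INCIDENT = 6   # average handling time cited above
SHIFT_HOURS = 12         # shift length cited above


def page_budget(shift_hours: float = SHIFT_HOURS,
                hours_per_incident: float = HOURS_PER_INCIDENT) -> int:
    """Largest number of incidents that can be fully handled in one shift."""
    return int(shift_hours // hours_per_incident)


def budget_report(pages_per_shift: list[int]) -> dict:
    """Summarize how a run of shifts compares with the page budget."""
    budget = page_budget()
    return {
        "budget": budget,
        "median_pages": median(pages_per_shift),
        "shifts_over_budget": sum(1 for n in pages_per_shift if n > budget),
    }


# Hypothetical page counts for ten consecutive shifts.
sample = [0, 0, 1, 0, 5, 0, 2, 0, 3, 0]
report = budget_report(sample)
print(report)
```

Tracking both numbers matters: a median of zero can hide the occasional shift that blows well past the budget.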
What Happens When You Exceed the Budget
Google’s SRE book describes a cautionary tale: a team regularly receiving five paging incidents per shift instead of their budget of two.
The consequences were predictable:
- Engineers heroically responded to the daily onslaught of pages but couldn’t keep up
- There wasn’t enough time to find root causes and properly fix issues
- Some engineers left the team to join less operationally burdened teams
- High-quality incident follow-up became rare
Sound familiar? Many teams live this reality without recognizing it as a systemic failure.
The 50/25/25 Rule
Google’s approach to SRE work allocation provides another useful framework:
| Activity | Target % of Time |
|---|---|
| Engineering/project work | At least 50% |
| On-call duties | No more than 25% |
| Other operational work | Up to 25% |
Source: Google SRE Book
The key insight: SRE work should be a healthy mix of duties, with at least half of time spent on project work. On-call should never become someone’s entire job.
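One way to keep an eye on this split is to check logged hours against the three targets. The sketch below assumes time is tracked in three hypothetical buckets named `project`, `on_call`, and `ops`; those names and the function itself are illustrative, not part of any Google tool.

```python
def check_allocation(hours: dict[str, float]) -> list[str]:
    """Return a list of 50/25/25 targets that the logged hours miss."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("no hours logged")

    share = {bucket: spent / total for bucket, spent in hours.items()}
    problems = []
    if share.get("project", 0.0) < 0.50:
        problems.append("project work is below 50%")
    if share.get("on_call", 0.0) > 0.25:
        problems.append("on-call is above 25%")
    if share.get("ops", 0.0) > 0.25:
        problems.append("other operational work is above 25%")
    return problems


# Hypothetical 40-hour week dominated by on-call.
print(check_allocation({"project": 10, "on_call": 20, "ops": 10}))
```

An empty list means the week met all three targets.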
The Science of Sleep and Incident Response
On-call work doesn’t just affect morale - it directly impacts incident response quality.
Sleep Inertia: The 3 AM Liability
When a page wakes you at 3 AM, you’re not operating at full capacity. Sleep inertia is a transition state that occurs upon waking in which alertness and cognitive performance are temporarily degraded.
Research shows:
- Sleep inertia typically lasts 15-30 minutes post-waking
- Performance may not stabilize until 2-3 hours post-waking
- Firefighters reported that sleep inertia is a significant safety concern during alarm response
The implications for incident response are clear: the person responding to your 3 AM outage is cognitively impaired. Building systems that reduce the need for 3 AM responses isn’t just humane - it produces better outcomes.
The Anxiety Effect
Even when the pager doesn’t go off, being on-call affects sleep. Studies have found that on-call workers frequently experience difficulties getting to sleep as well as reduced sleep quality and quantity, sometimes even in the absence of a call.
Research examining how anxiety about missing alarms affects on-call workers found that pre-bed anxiety was increased during on-call conditions, and total sleep time was shorter with lower sleep efficiency compared to control conditions.
Building Sustainable On-Call
The path to sustainable on-call involves structural changes, not just individual coping strategies.
1. Limit Shift Length
A rotation that handles one or more pages per day must be structured sustainably, with shifts limited to 12 hours at most. Shorter shifts are better for mental health.
2. Avoid Consecutive Shifts
Continuous on-call shifts without sufficient breaks can negatively impact mental and physical health. Organizations should:
- Limit consecutive on-call days
- Avoid scheduling employees for back-to-back shifts
- Ensure adequate recovery time between rotations
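Rules like these are easy to check automatically before a schedule is published. The following is a minimal sketch: the `Shift` type and `find_violations` function are hypothetical, the 12-hour shift limit comes from the previous section, and the 12-hour minimum rest gap is an assumption chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Shift(NamedTuple):
    engineer: str
    start: datetime
    end: datetime


MAX_SHIFT = timedelta(hours=12)  # shift limit recommended above
MIN_REST = timedelta(hours=12)   # assumed recovery gap, tune for your team


def find_violations(shifts: list[Shift]) -> list[str]:
    """Flag shifts that are too long or start too soon after the last one."""
    violations = []
    last_end: dict[str, datetime] = {}
    for shift in sorted(shifts, key=lambda s: s.start):
        if shift.end - shift.start > MAX_SHIFT:
            violations.append(f"{shift.engineer}: shift longer than 12 hours")
        previous_end = last_end.get(shift.engineer)
        if previous_end is not None and shift.start - previous_end < MIN_REST:
            violations.append(f"{shift.engineer}: back-to-back shift")
        last_end[shift.engineer] = shift.end
    return violations


# Hypothetical schedule: alice works two shifts in a row, bob works 14 hours.
schedule = [
    Shift("alice", datetime(2025, 1, 6, 8), datetime(2025, 1, 6, 20)),
    Shift("alice", datetime(2025, 1, 6, 20), datetime(2025, 1, 7, 8)),
    Shift("bob", datetime(2025, 1, 7, 8), datetime(2025, 1, 7, 22)),
]
violations = find_violations(schedule)
print(violations)
```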
3. Eliminate Night Shifts with Follow-the-Sun
For globally distributed teams, a multi-site “follow the sun” rotation allows teams to avoid night shifts altogether.
The model works by having teams in different time zones hand off coverage:
| Region | Coverage Window (Americas local time) |
|---|---|
| Americas | 8 AM - 4 PM |
| Asia-Pacific | 4 PM - 12 AM |
| Europe | 12 AM - 8 AM |
Each team works during their normal business hours while providing 24/7 coverage globally.
Night shifts are demanding and draining and can negatively impact employee health and well-being. Follow-the-sun eliminates this entirely for teams with global presence.
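In practice, a follow-the-sun rotation comes down to a lookup from the current time to the region holding the pager. Here is a minimal sketch, assuming three hypothetical eight-hour windows expressed in UTC; real handoff hours depend on where your teams sit and on daylight saving time.

```python
from datetime import datetime, timezone

# Hypothetical handoff windows as (start_hour, end_hour, region) in UTC.
HANDOFFS_UTC = [
    (0, 8, "Asia-Pacific"),
    (8, 16, "Europe"),
    (16, 24, "Americas"),
]


def on_call_region(now: datetime) -> str:
    """Return the region whose window contains the given moment."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in HANDOFFS_UTC:
        if start <= hour < end:
            return region
    raise AssertionError("handoff windows must cover all 24 hours")


print(on_call_region(datetime(2025, 1, 6, 9, 30, tzinfo=timezone.utc)))
```

Working in UTC keeps the handoff boundaries unambiguous, whichever office is reading the schedule.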
4. Involve Engineers in Scheduling
Imposing an on-call schedule on engineers who had no say in it tends to hurt both wellbeing and productivity. Involving engineers in the process:
- Keeps scheduling transparent
- Allows everyone to provide feedback
- Accommodates personal commitments and preferences
- Creates buy-in and shared ownership
5. Enable Manager Support and Recovery
Hands-on managers can recognize when a responder’s on-call experience has been particularly stressful. Managers should:
- Understand what a typical shift looks like
- Recognize when someone has carried an unusually large burden
- Encourage and enable time off to recover after difficult shifts
Good management visibility prevents the 72% problem, where leadership has little to no visibility into how on-call is affecting their teams.
The Compensation Question
One indicator of whether an organization takes on-call seriously: do they compensate for it?
Companies That Pay
According to The Pragmatic Engineer’s analysis:
- Major tech companies often pay $600-$1,000+ USD per week
- Well-funded startups typically offer $400-$800 USD per week
- UK companies show a higher propensity for on-call compensation than US companies
Google’s Model
Google’s compensation model is particularly thoughtful. For every hour you are on-call outside 08:00-18:00 local time:
- If your response SLA is 30-60 minutes or less: 1/3 time-in-lieu
- If your response SLA is 5 minutes or less: 2/3 time-in-lieu
- Time-in-lieu can be used for vacation or paid out quarterly
- Hard cap of 80 hours per quarter
At the cap, that adds up to 320 hours a year: 8 additional 40-hour weeks of vacation or pay.
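The accrual rules above are simple enough to sketch. This is an illustration only: the function name is hypothetical, and it assumes the 80-hour cap applies to accrued time-in-lieu rather than to hours spent on-call. SLAs slower than 60 minutes are not covered by the tiers listed, so the sketch rejects them rather than guess.

```python
def time_in_lieu_hours(off_hours_on_call: float,
                       response_sla_minutes: float,
                       quarterly_cap_hours: float = 80.0) -> float:
    """Time-in-lieu accrued in one quarter, under the assumptions above."""
    if response_sla_minutes <= 5:
        numerator = 2   # 2/3 of an hour credited per on-call hour
    elif response_sla_minutes <= 60:
        numerator = 1   # 1/3 of an hour credited per on-call hour
    else:
        raise ValueError("no tier is described for SLAs slower than 60 minutes")
    accrued = off_hours_on_call * numerator / 3
    return min(accrued, quarterly_cap_hours)


# Hypothetical quarter: 90 off-hours on-call with a 5-minute SLA.
print(time_in_lieu_hours(90, 5))

# Hitting the cap every quarter, expressed in 40-hour weeks per year.
weeks_per_year_at_cap = 80 * 4 / 40
print(weeks_per_year_at_cap)
```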
Why Compensation Matters
Compensation can take various forms:
- Cash payments
- Time off in lieu
- Lightening the load with dedicated SRE staff
- Making rotations voluntary
The specific model matters less than the signal it sends: this work is valued and recognized as a burden beyond normal responsibilities.
Reduce the Pages, Not Just the Pain
The best on-call improvement is fewer pages. Every page that doesn’t need to happen is stress avoided.
Fix Alert Fatigue
Alert fatigue occurs when engineers are overwhelmed by too many alerts and can’t properly triage and respond. Teams must:
- Identify what alerts genuinely need immediate human response
- Determine what can be automated
- Decide what can wait until morning versus requiring a 3 AM page
Some teams report their on-call engineer gets paged about a hundred times in a typical 24-hour shift, with many pages getting ignored while real problems get buried. This is a system failure, not an individual one.
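The three triage questions above map naturally onto a routing decision. The sketch below is one possible encoding, not a feature of any particular alerting tool: the `Alert` fields and `Route` values are hypothetical labels a team would assign during an alert review.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"          # wake a human now
    AUTOMATE = "automate"  # hand to an automated remediation
    TICKET = "ticket"      # queue for business hours


@dataclass
class Alert:
    name: str
    user_impacting: bool
    has_automated_fix: bool
    can_wait_until_morning: bool


def route(alert: Alert) -> Route:
    """Decide where an alert goes, preferring automation over paging."""
    if alert.has_automated_fix:
        return Route.AUTOMATE
    if alert.user_impacting and not alert.can_wait_until_morning:
        return Route.PAGE
    return Route.TICKET


# Hypothetical alerts from a review session.
print(route(Alert("checkout-errors", True, False, False)))
print(route(Alert("disk-80-percent", False, False, True)))
```

Only alerts that survive both filters reach the pager, which is what keeps real problems from getting buried.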
Invest in Reliability
Every recurring page represents an opportunity for permanent improvement. What happens after an incident defines culture more than the incident itself. Postmortems should identify not just what failed, but why - and drive lasting fixes.
If the same issue keeps paging, that’s not bad luck. It’s technical debt demanding attention.
Automate Self-Healing
Many pages are for issues that have known, automatable solutions. Invest in:
- Auto-scaling for capacity issues
- Automatic restarts for known failure modes
- Self-healing infrastructure
- Automated runbooks for common problems
The 2025 State of DevOps Report noted that 57% of SREs still spend more than half their week on toil. Much of this toil represents automation opportunities.
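A common shape for self-healing is "try the known fix first, page only if it fails." Here is a minimal sketch of that pattern. The three callables (`check_healthy`, `restart`, `page`) are hypothetical hooks you would wire to your own health check, restart mechanism, and paging tool; a real version would also wait between attempts.

```python
from typing import Callable


def remediate_or_page(check_healthy: Callable[[], bool],
                      restart: Callable[[], None],
                      page: Callable[[str], None],
                      max_attempts: int = 3) -> bool:
    """Restart up to max_attempts times; page a human only if that fails."""
    for _ in range(max_attempts):
        if check_healthy():
            return True
        restart()
    if check_healthy():
        return True
    page(f"service still unhealthy after {max_attempts} automatic restarts")
    return False


# Hypothetical service that recovers after its second restart.
state = {"restarts": 0}
pages: list[str] = []


def fake_restart() -> None:
    state["restarts"] += 1


recovered = remediate_or_page(lambda: state["restarts"] >= 2, fake_restart, pages.append)
print(recovered, pages)
```

Capping the number of attempts is the important design choice: automation that retries forever hides the failure instead of escalating it.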
The Sustainable On-Call Checklist
Use this checklist to evaluate your on-call practices:
Structural Health
- Shifts are 12 hours or less
- No back-to-back shifts for the same person
- Adequate recovery time between rotations
- Engineers have input into scheduling
- Night shifts are avoided or shared fairly
Page Health
- Average 2 or fewer pages per shift
- Median pages per shift is near zero
- Recurring pages get permanently fixed
- Alerts are regularly reviewed and pruned
- Non-urgent issues don’t page after hours
Cultural Health
- On-call burden is visible to management
- Difficult shifts are followed by recovery time
- Compensation or recognition exists for on-call work
- Postmortems focus on systems, not blame
- Engineers can raise concerns without retaliation
Work Balance
- On-call is no more than 25% of an engineer’s time
- At least 50% of time is spent on project work
- On-call engineers aren’t also expected to hit sprint goals
- Mental health resources are available
The Long Game
Sustainable on-call isn’t about making an unsustainable situation slightly more bearable. It’s about building systems, processes, and culture that support both reliability and the humans who maintain it.
The organizations that get this right don’t view it as a trade-off. They understand that burned-out engineers make mistakes, miss problems, and eventually leave. Sustainable on-call isn’t just humane - it’s the foundation of sustainable reliability.
The pager will always exist. But it doesn’t have to be a source of dread. With thoughtful design, appropriate limits, and genuine organizational support, on-call can be a manageable part of engineering work rather than the thing that drives your best people away.