On-Call That Doesn't Burn Out Your Team: Building Sustainable Incident Response

The page comes in at 3 AM. You fumble for your phone, squinting at the screen. By the time you’ve diagnosed the issue, mitigated the problem, and documented what happened, it’s 5 AM. You try to sleep for another hour before your alarm goes off.

This isn’t a rare occurrence. It’s Tuesday.

For too many engineering teams, this is the reality of on-call work. And the data shows it’s unsustainable: 47% of DevOps engineers say on-call overload contributes directly to burnout or frustration.

But on-call doesn’t have to be this way. The best organizations have figured out how to maintain reliable systems without destroying their teams in the process.

The Burnout Reality

The numbers paint a stark picture of engineering burnout in 2024-2025:

22% of engineering leaders and developers report critical levels of burnout
Another 24% report moderate burnout, with only 21% in the “healthy” category
Two out of three engineers experienced burnout in the past 12 months
57% of SREs still spend more than half their week on toil despite automation advances

On-call is a major contributor to this burnout. The “always-on” culture prevalent in DevOps fosters an environment where rest and recovery are deprioritized.

The Hidden Cost: Attrition

Burned-out engineers leave. And replacing them is expensive.

According to research on IT work-life balance, unspoken frustrations about punishing on-call schedules contribute directly to employee attrition. In major tech hubs like London and San Francisco, it can cost $300,000 or more to replace a single engineer.

The same research found:

23.1% of IT professionals said poor work-life balance prompted them to consider leaving
One in four said the “always-on” nature of their work makes their jobs unmanageable
56.7% accept disrupted sleep and poor work-life balance as “just part of the job”

Perhaps most concerning: 72% of respondents said their management team has little to no visibility into how on-call work negatively affects employees’ personal lives.

The Google Standard: 2 Pages Per Shift

If anyone has figured out sustainable on-call at scale, it’s Google. Their Site Reliability Engineering (SRE) practices have become the industry standard for good reason.

The Page Budget

Google has established a clear limit: maximum 2 paging incidents per 12-hour on-call shift.

The reasoning is simple arithmetic. Google has found that on average, dealing with the tasks involved in an on-call incident - root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs - takes 6 hours. At that rate, two incidents fill an entire 12-hour shift, which makes two the sustainable maximum.

The ideal? A median of zero pages per shift. If a given component pages every day, much of the budget is already spent before anything else goes wrong, so when something else breaks, as it eventually will, the shift ends up with more incidents than should be permitted.
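The budget arithmetic is easy to turn into a recurring check. Below is a minimal sketch, assuming you can export a count of pages per shift from your paging tool; the function names and the sample data are hypothetical, and the 6-hour and 12-hour figures are simply the ones quoted above.

```python
from statistics import median

# Figures quoted in the text; replace them with your own measurements.
HOURS_PER_INCIDENT = 6
SHIFT_HOURS = 12


def page_budget(shift_hours: float = SHIFT_HOURS,
                hours_per_incident: float = HOURS_PER_INCIDENT) -> int:
    """Maximum number of incidents that fit into one shift."""
    return int(shift_hours // hours_per_incident)


def shift_report(pages_per_shift: list[int]) -> dict:
    """Summarize recent shifts against the budget (hypothetical helper)."""
    budget = page_budget()
    return {
        "budget": budget,
        "median_pages": median(pages_per_shift),
        "shifts_over_budget": sum(1 for p in pages_per_shift if p > budget),
    }


# Made-up week of shifts: mostly quiet, two bad ones.
report = shift_report([0, 0, 1, 0, 5, 0, 3])
print(report)
```

Tracking both numbers matters: a median of zero can hide the occasional shift that blows well past the budget.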

What Happens When You Exceed the Budget

Google’s SRE book describes a cautionary tale: a team regularly receiving five paging incidents per shift instead of their budget of two.

The consequences were predictable:

Engineers heroically responded to the daily onslaught of pages but couldn’t keep up
There wasn’t enough time to find root causes and properly fix issues
Some engineers left the team to join less operationally burdened teams
High-quality incident follow-up became rare

Sound familiar? Many teams live this reality without recognizing it as a systemic failure.

The 50/25/25 Rule

Google’s approach to SRE work allocation provides another useful framework:

Activity	Target % of Time
Engineering/project work	At least 50%
On-call duties	No more than 25%
Other operational work	Up to 25%

Source: Google SRE Book

The key insight: SRE work should be a healthy mix of duties, with at least half of time spent on project work. On-call should never become someone’s entire job.
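One way to keep this split honest is to check it against logged hours. The sketch below assumes hours are tracked in the three categories from the table; the function name and the messages are illustrative only.

```python
def check_allocation(project_h: float, on_call_h: float,
                     other_ops_h: float) -> list[str]:
    """Return a list of ways the logged hours break the 50/25/25 targets."""
    total = project_h + on_call_h + other_ops_h
    if total <= 0:
        raise ValueError("no hours logged")

    problems = []
    if project_h / total < 0.50:
        problems.append("project work is below 50%")
    if on_call_h / total > 0.25:
        problems.append("on-call is above 25%")
    if other_ops_h / total > 0.25:
        problems.append("other operational work is above 25%")
    return problems


# A 40-hour week split exactly on target produces no findings.
print(check_allocation(project_h=20, on_call_h=10, other_ops_h=10))
# A week dominated by on-call work fails two of the three targets.
print(check_allocation(project_h=10, on_call_h=20, other_ops_h=10))
```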

The Science of Sleep and Incident Response

On-call work doesn’t just affect morale - it directly impacts incident response quality.

Sleep Inertia: The 3 AM Liability

When a page wakes you at 3 AM, you’re not operating at full capacity. Sleep inertia is a transition state that occurs upon waking in which alertness and cognitive performance are temporarily degraded.

Research shows:

Sleep inertia typically lasts 15-30 minutes after waking
Performance may not stabilize until 2-3 hours after waking
Firefighters reported that sleep inertia is a significant safety concern during alarm response

The implications for incident response are clear: the person responding to your 3 AM outage is cognitively impaired. Building systems that reduce the need for 3 AM responses isn’t just humane - it produces better outcomes.

The Anxiety Effect

Even when the pager doesn’t go off, being on-call affects sleep. Studies have found that on-call workers frequently experience difficulties getting to sleep as well as reduced sleep quality and quantity, sometimes even in the absence of a call.

Research examining how anxiety about missing alarms affects on-call workers found that pre-bed anxiety was increased during on-call conditions, and total sleep time was shorter with lower sleep efficiency compared to control conditions.

Building Sustainable On-Call

The path to sustainable on-call involves structural changes, not just individual coping strategies.

1. Limit Shift Length

An on-call rotation that has to handle one or more pages per day must be structured in a sustainable way, with recommended shift lengths limited to 12 hours. Shorter shifts are better for mental health.

Team members run the risk of exhaustion when shifts run long, and when people are tired, they make mistakes.

2. Avoid Consecutive Shifts

Continuous on-call shifts without sufficient breaks can negatively impact mental and physical health. Organizations should:

Limit consecutive on-call days
Avoid scheduling employees for back-to-back shifts
Ensure adequate recovery time between rotations
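Rules like these are easier to enforce when the schedule is checked automatically before it is published. Here is a minimal sketch, assuming shifts are available as (engineer, start, end) tuples. The 12-hour shift limit comes from the text; the 12-hour minimum rest period is a placeholder assumption to replace with your own policy.

```python
from datetime import datetime, timedelta

MAX_SHIFT = timedelta(hours=12)  # shift limit recommended in the text
MIN_REST = timedelta(hours=12)   # assumption: choose your own recovery window


def find_violations(shifts):
    """Check (engineer, start, end) tuples for long shifts and short rest."""
    problems = []
    last_end = {}  # engineer -> end time of their previous shift
    for engineer, start, end in sorted(shifts, key=lambda s: s[1]):
        if end - start > MAX_SHIFT:
            problems.append(f"{engineer}: shift at {start:%Y-%m-%d %H:%M} is too long")
        previous_end = last_end.get(engineer)
        if previous_end is not None and start - previous_end < MIN_REST:
            problems.append(f"{engineer}: too little rest before {start:%Y-%m-%d %H:%M}")
        last_end[engineer] = end
    return problems


# Hypothetical rotation: "ana" is scheduled back-to-back, "ben" is fine.
schedule = [
    ("ana", datetime(2025, 1, 6, 8), datetime(2025, 1, 6, 20)),
    ("ana", datetime(2025, 1, 6, 20), datetime(2025, 1, 7, 10)),
    ("ben", datetime(2025, 1, 7, 10), datetime(2025, 1, 7, 22)),
]
for problem in find_violations(schedule):
    print(problem)
```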

3. Eliminate Night Shifts with Follow-the-Sun

For globally distributed teams, a multi-site “follow the sun” rotation allows teams to avoid night shifts altogether.

The model works by having teams in different time zones hand off coverage:

Region	Coverage Window (Americas time)
Americas	8 AM - 4 PM
Asia-Pacific	4 PM - 12 AM
Europe	12 AM - 8 AM

Each team works during their normal business hours while providing 24/7 coverage globally.
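Before committing to a handoff order, it is worth converting each coverage window into the covering team’s local time, because the order only works if every window lands in that team’s daytime. The sketch below makes several assumptions: three hypothetical sites at fixed UTC offsets (-8, +8, and 0), daylight saving ignored, and 8-hour windows starting at the UTC hours shown.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sites with fixed UTC offsets (daylight saving ignored).
SITES = {
    "Americas": timezone(timedelta(hours=-8)),
    "Asia-Pacific": timezone(timedelta(hours=8)),
    "Europe": timezone.utc,
}

# Assumed start hour, in UTC, of each site's 8-hour coverage window.
WINDOW_START_UTC = {"Americas": 16, "Asia-Pacific": 0, "Europe": 8}


def local_window(site: str) -> tuple[int, int]:
    """Return the (start_hour, end_hour) of a site's window in local time."""
    start_utc = datetime(2025, 1, 6, WINDOW_START_UTC[site], tzinfo=timezone.utc)
    end_utc = start_utc + timedelta(hours=8)
    tz = SITES[site]
    return start_utc.astimezone(tz).hour, end_utc.astimezone(tz).hour


for site in SITES:
    start, end = local_window(site)
    print(f"{site}: {start:02d}:00 - {end:02d}:00 local")
```

With these particular offsets, every site covers 08:00 to 16:00 local time. Real sites rarely line up this neatly, so expect some windows to start early or end late.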

Night shifts are demanding and draining and can negatively impact employee health and well-being. Follow-the-sun eliminates this entirely for teams with global presence.

4. Involve Engineers in Scheduling

Putting people on-call without giving them a say in the schedule hurts both well-being and productivity. Involving engineers in the process:

Keeps scheduling transparent
Allows everyone to provide feedback
Accommodates personal commitments and preferences
Creates buy-in and shared ownership

5. Enable Manager Support and Recovery

Hands-on managers can recognize when a responder’s on-call experience has been particularly stressful. Managers should:

Understand what a typical shift looks like
Recognize when someone has carried an unusually large burden
Encourage and enable time off to recover after difficult shifts

Good management visibility prevents the problem described earlier, where 72% of respondents said leadership has little to no idea how on-call is affecting their teams.

The Compensation Question

One indicator of whether an organization takes on-call seriously: do they compensate for it?

Companies That Pay

Companies like Google, Intercom, Spotify, LaunchDarkly, CircleCI, and PayPal compensate at or above $1,000 USD per week for on-call responsibilities.

According to The Pragmatic Engineer’s analysis:

Major tech companies often pay $600-$1,000+ USD per week
Well-funded startups typically offer $400-$800 USD per week
UK companies show a higher propensity for on-call compensation than US companies

Google’s Model

Google’s compensation model is particularly thoughtful. For any hour outside of 08:00-18:00 your local time where you are on-call:

If your response SLA is 30-60 minutes or less: 1/3 time-in-lieu
If your response SLA is 5 minutes or less: 2/3 time-in-lieu
Time-in-lieu can be used for vacation or paid out quarterly
Hard cap of 80 hours per quarter

With the cap at 80 hours per quarter, this can add up to 320 hours a year: 8 additional 40-hour weeks of vacation or pay.
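The accrual rules above can be sketched as a small calculator. This is an illustration of the rules as summarized here, not an official formula: the function name is hypothetical, the cap is read as applying to accrued time-in-lieu, and SLAs slower than 60 minutes are rejected because the summary does not cover them.

```python
QUARTERLY_CAP_HOURS = 80  # hard cap quoted in the text


def time_in_lieu(off_hours_on_call: float, response_sla_minutes: int) -> float:
    """Hours of time-in-lieu accrued in one quarter, after the cap."""
    if response_sla_minutes <= 5:
        thirds = 2  # 2/3 of each off-hours on-call hour
    elif response_sla_minutes <= 60:
        thirds = 1  # 1/3 of each off-hours on-call hour
    else:
        raise ValueError("SLA tier not covered by the rules described here")
    accrued = off_hours_on_call * thirds / 3
    return min(accrued, QUARTERLY_CAP_HOURS)


# 90 off-hours on-call hours in a quarter at each tier:
print(time_in_lieu(90, response_sla_minutes=30))  # 30.0
print(time_in_lieu(90, response_sla_minutes=5))   # 60.0
# Heavy quarters hit the cap:
print(time_in_lieu(300, response_sla_minutes=5))  # 80
```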

Why Compensation Matters

Companies that care about healthy on-call practices or want to minimize attrition make it clear that on-call is additional work and offer some form of compensation for it.

Compensation can take various forms:

Cash payments
Time off in lieu
Lightening the load with dedicated SRE staff
Making rotations voluntary

The specific model matters less than the signal it sends: this work is valued and recognized as a burden beyond normal responsibilities.

Reduce the Pages, Not Just the Pain

The best on-call improvement is fewer pages. Every page that doesn’t need to happen is stress avoided.

Fix Alert Fatigue

Alert fatigue occurs when engineers are overwhelmed by too many alerts and can’t properly triage and respond. Teams must:

Identify what alerts genuinely need immediate human response
Determine what can be automated
Decide what can wait until morning versus requiring a 3 AM page

Some teams report their on-call engineer gets paged about a hundred times in a typical 24-hour shift, with many pages getting ignored while real problems get buried. This is a system failure, not an individual one.
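The triage questions above can be written down as an explicit routing rule, which makes it easier to review alerts one by one. The sketch below is deliberately simplified: the `Alert` fields and the three destinations are hypothetical, and a real setup would also page a human when an automated fix fails.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    user_impacting: bool     # are customers affected right now?
    has_automated_fix: bool  # is there a trusted automated remediation?


def route(alert: Alert) -> str:
    """Decide where an alert goes: automation, a page, or a ticket."""
    if alert.has_automated_fix:
        return "automate"  # let the remediation run first
    if alert.user_impacting:
        return "page"      # needs a human immediately
    return "ticket"        # can wait until morning


# Made-up alerts, one for each destination.
alerts = [
    Alert("checkout_error_rate_high", user_impacting=True, has_automated_fix=False),
    Alert("worker_out_of_memory", user_impacting=True, has_automated_fix=True),
    Alert("disk_70_percent_full", user_impacting=False, has_automated_fix=False),
]
for alert in alerts:
    print(f"{alert.name}: {route(alert)}")
```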

Invest in Reliability

Every recurring page represents an opportunity for permanent improvement. What happens after an incident defines culture more than the incident itself. Postmortems should identify not just what failed, but why - and drive lasting fixes.

If the same issue keeps paging, that’s not bad luck. It’s technical debt demanding attention.
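Finding the repeat offenders does not require special tooling. Here is a minimal sketch, assuming you can export a list of alert names from your paging history; the threshold and the sample log are made up.

```python
from collections import Counter


def recurring_alerts(page_log: list[str], min_count: int = 3) -> list[tuple[str, int]]:
    """Return alerts that paged at least min_count times, most frequent first."""
    counts = Counter(page_log)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Hypothetical page history for one rotation.
log = ["disk_full", "api_5xx", "disk_full", "disk_full", "cert_expiry", "api_5xx"]
print(recurring_alerts(log, min_count=2))
# [('disk_full', 3), ('api_5xx', 2)]
```

Reviewing this list in each postmortem or rotation handoff gives the team a concrete, ranked backlog of permanent fixes.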

Automate Self-Healing

Many pages are for issues that have known, automatable solutions. Invest in:

Auto-scaling for capacity issues
Automatic restarts for known failure modes
Self-healing infrastructure
Automated runbooks for common problems

The 2025 State of DevOps Report noted that 57% of SREs still spend more than half their week on toil. Much of this toil represents automation opportunities.
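An automated runbook for a known failure mode can be as simple as “restart a bounded number of times, then escalate.” The sketch below assumes you supply three callables, `check`, `restart`, and `page`, which are placeholders for your own health check, remediation, and paging integration.

```python
import time
from typing import Callable


def self_heal(check: Callable[[], bool],
              restart: Callable[[], None],
              page: Callable[[str], None],
              max_restarts: int = 2,
              wait_seconds: float = 0.0) -> bool:
    """Try bounded automatic restarts; page a human only if they fail."""
    for _ in range(max_restarts):
        if check():
            return True
        restart()
        time.sleep(wait_seconds)  # give the service time to come back
    if check():
        return True
    page(f"service still unhealthy after {max_restarts} restarts")
    return False


# Simulated service that recovers after one restart.
state = {"healthy": False, "restarts": 0}
pages: list[str] = []


def fake_restart() -> None:
    state["restarts"] += 1
    state["healthy"] = True


recovered = self_heal(lambda: state["healthy"], fake_restart, pages.append)
print(recovered, state["restarts"], pages)
```

Capping the number of restarts is the important design choice: it keeps automation from masking a real outage indefinitely while still absorbing the routine failures.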

The Sustainable On-Call Checklist

Use this checklist to evaluate your on-call practices:

Structural Health

Shifts are 12 hours or less
No back-to-back shifts for the same person
Adequate recovery time between rotations
Engineers have input into scheduling
Night shifts are avoided or shared fairly

Page Health

Average 2 or fewer pages per shift
Median pages per shift is near zero
Recurring pages get permanently fixed
Alerts are regularly reviewed and pruned
Non-urgent issues don’t page after hours

Cultural Health

On-call burden is visible to management
Difficult shifts are followed by recovery time
Compensation or recognition exists for on-call work
Postmortems focus on systems, not blame
Engineers can raise concerns without retaliation

Work Balance

On-call is no more than 25% of an engineer’s time
At least 50% of time is spent on project work
On-call engineers aren’t also expected to hit sprint goals
Mental health resources are available

The Long Game

Sustainable on-call isn’t about making an unsustainable situation slightly more bearable. It’s about building systems, processes, and culture that support both reliability and the humans who maintain it.

The organizations that get this right don’t view it as a trade-off. They understand that burned-out engineers make mistakes, miss problems, and eventually leave. Sustainable on-call isn’t just humane - it’s the foundation of sustainable reliability.

The pager will always exist. But it doesn’t have to be a source of dread. With thoughtful design, appropriate limits, and genuine organizational support, on-call can be a manageable part of engineering work rather than the thing that drives your best people away.

Effective alerting is the foundation of sustainable on-call. FlareWarden monitors from multiple global locations and validates issues before alerting, so your team only gets notified about real problems - reducing noise and protecting your engineers’ sleep.