Graceful degradation
What is it?
Graceful degradation is the planned ability of a system to keep its most important safety-related functions available when parts fail or resources run short. Instead of stopping entirely, the system enters a predefined degraded mode: non-critical features are paused, simplified, or substituted so that alarm, interlock, shutdown, or other protection functions remain dependable. The priorities, triggers, and behaviors are designed up front and verified, so the degraded state is deliberate and auditable—not accidental.
How it supports functional safety
By assigning and enforcing priorities, graceful degradation reduces the chance that systematic failures—such as unbounded logging, chatty diagnostics, or UI tasks—consume resources needed by the safety function. It also limits the propagation of software faults by isolating or shedding affected features. While it does not prevent random/common-cause hardware faults, it can detect their effects (e.g., power droop, bus errors, memory pressure) and trigger safe fallbacks so the safety function does not silently act on corrupted or delayed information.
When to use
- Mixed-criticality products (e.g., life-safety alarms + non-critical reporting on the same platform).
- Systems with known resource ceilings or brownout risk (CPU, memory, I/O bandwidth, power/thermal headroom).
- Installations where a controlled degraded period is necessary to reach or maintain a safe state (e.g., evacuation alarm/annunciation while history export is deferred).
- Where communications can be intermittent (keep local protection; shed remote services gracefully).
- Situations where minimum diagnostic coverage must be retained while avoiding operator overload during abnormal conditions.
Inputs & Outputs
Inputs
- Hazard analysis & criticality assignment for each function (safety function vs. convenience features).
- Resource & health monitors (CPU/memory thresholds, brownout flag, bus errors, watchdog pre-timeout).
- Mode table with triggers, priorities, and allowed substitutions/simplifications (see the data sketch after this section).
- Operator communication requirements (annunciation, indicators, messages).
Outputs
- Active degraded mode with a bounded, deterministic service set (what runs, what is shed).
- Safe reactions taken (e.g., local alarm latched, outputs driven to safe state, inhibited non-critical services).
- Operator/HMI indication of degraded state and any functional limitations.
- Evidence (event record) sufficient for post-event analysis without jeopardizing safety at runtime.
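The mode table listed under Inputs can be kept as plain, reviewable data rather than scattered conditionals. The sketch below is illustrative only: the service names, triggers, decisions, and timing bounds are assumptions standing in for values that would come from the project's own hazard analysis.

/* Sketch of a mode table as constant data. All entries are illustrative
   placeholders, not values from a real project. */
#include <stdint.h>

typedef enum { KEEP, SIMPLIFY, SHED } service_decision_t;

typedef struct {
    const char        *service;        /* e.g., "alarm_scan", "history_export"     */
    service_decision_t on_overload;    /* decision when CPU/memory trigger fires   */
    service_decision_t on_comms_loss;  /* decision when the building network drops */
    uint32_t           max_latency_ms; /* timing bound that must still be met (0 = n/a) */
} mode_table_entry_t;

/* Priorities come from the hazard analysis; safety functions are always KEEP. */
static const mode_table_entry_t MODE_TABLE[] = {
    { "alarm_scan",         KEEP,     KEEP,     100  },   /* life-safety               */
    { "evac_annunciation",  KEEP,     KEEP,     500  },   /* life-safety               */
    { "self_tests",         SIMPLIFY, KEEP,     5000 },   /* rate-limited, not removed */
    { "history_export",     SHED,     SIMPLIFY, 0    },   /* buffered, exported later  */
    { "maintenance_web_ui", SHED,     SHED,     0    },   /* convenience only          */
};

Keeping the table as data makes each keep/shed decision easy to trace to the hazard analysis and to audit after an event.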
Procedure
- Identify functions & assign priorities. From the hazard analysis, classify each function (e.g., life-safety alarm, emergency notification, history export, remote dashboard) and justify its priority.
- Define a mode table. For each trigger (CPU > x%, low voltage, comms loss, memory pressure, self-test failure), specify which functions are kept, shed, or simplified; the timing bounds that must still be met; and the safe reaction if conditions worsen.
- Instrument resource & fault detection. Add monitors with hysteresis and debounce (see the monitor sketch after this list). Bound detection latency so mode switching remains deterministic.
- Design arbitration. Ensure the safety function has non-preemptible time/space budgets; gate low-priority tasks with admission control and rate limits.
- Annunciate degraded operation. Provide clear indicators/messages and maintenance guidance; avoid alarm floods.
- Verify degraded modes. Use fault injection (e.g., CPU load, comms drop, brownout simulation) to prove timing, outputs, and safe reactions meet requirements.
- Document & justify. Trace priorities and mode behavior to hazards/SIL and keep acceptance criteria auditable.
Worked Example
High-level
Fire detection & alarm panel. One platform hosts: (1) life-safety alarm & sounder activation, (2) emergency voice announcements, (3) event history export to a building server, and (4) a maintenance web UI. During a CPU overload combined with intermittent building network, the panel must ensure alarm loop scanning and sounder/voice activation remain within timing bounds. It intentionally defers history export and disables the maintenance web UI until resources recover, while clearly indicating “Degraded Mode” on the local display.
Code-level
// Pseudo-C demonstrating safety-first priorities for a life-safety alarm panel
if (brownout() || cpu_load() > 90 || comms_unstable()) {
    set_mode(DEGRADED);
    admit(alarm_scan);                 // keep life-safety alarm processing
    admit(evac_annunciation);          // keep sirens/voice evacuation
    shed(maintenance_web_ui);          // SAFE REACTION: disable non-safety UI
    defer(history_export);             // SAFE REACTION: buffer now; export later
    rate_limit(self_tests, 1 /*Hz*/);  // SAFE REACTION: simplify non-critical diagnostics
    if (loop_fault_detected()) {
        activate_zone_alarm();         // SAFE REACTION: enter safe state (fail-safe alarm)
        latch_fault_indicator();       // inform operator until acknowledged
    }
} else {
    set_mode(NORMAL);
    run_all_services();
}
Result: Life-safety alarm/evacuation performance is preserved with deterministic timing; non-critical features are paused without risking loss of protection.
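The helpers used above (admit, shed, defer, rate_limit) are placeholders. One minimal way they could be realized, assuming a cooperative scheduler and a service registry where entries such as alarm_scan are service_t records, is sketched below; it is an illustration, not the panel's actual implementation.

/* Sketch: one possible realization of the service helpers used above,
   assuming a simple cooperative scheduler. Names and fields are illustrative. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    void     (*run)(void);      /* the service's periodic work function     */
    bool       enabled;         /* admit() sets true, shed() sets false     */
    bool       deferred;        /* defer(): buffer work, resume when NORMAL */
    uint32_t   min_period_ms;   /* rate_limit(): minimum spacing of runs    */
    uint32_t   last_run_ms;     /* timestamp of the last execution          */
} service_t;

static void admit(service_t *s) { s->enabled = true;  s->deferred = false; }
static void shed(service_t *s)  { s->enabled = false; }
static void defer(service_t *s) { s->deferred = true; }
static void rate_limit(service_t *s, uint32_t hz)
{
    s->min_period_ms = (hz > 0u) ? (1000u / hz) : 0u;   /* simplify, do not remove */
}

/* Scheduler tick: runs only admitted, non-deferred services whose rate limit
   has elapsed. Safety services are listed first in the registry so they
   receive their time budget before any optional service is considered. */
static void scheduler_tick(service_t *services, int count, uint32_t now_ms)
{
    for (int i = 0; i < count; i++) {
        service_t *s = &services[i];
        if (!s->enabled || s->deferred)                  continue;
        if (now_ms - s->last_run_ms < s->min_period_ms)  continue;
        s->last_run_ms = now_ms;
        s->run();
    }
}

Because shedding only clears a flag, nothing is torn down: deferred and shed services resume deterministically once the monitor returns the panel to NORMAL.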
Quality criteria
- Priority traceability: Each keep/shed decision is justified by the hazard analysis and SIL/PL requirements.
- Determinism under stress: Worst-case execution times and communication latencies for safety functions are met in degraded mode.
- Clear annunciation: Operators can recognize degraded mode and know what is limited.
- Bounded transitions: Entry/exit criteria and hysteresis prevent oscillation between modes.
- Evidence: Minimal, non-disruptive records of mode changes and causes are preserved.
Common pitfalls
- Wrong priorities: Business features outrank safety. Mitigation: lock priorities from hazard analysis, independent safety review.
- Mode thrashing: Rapid entry/exit causes instability. Mitigation: add hysteresis/time guards and resource headroom.
- Hidden degradation: Operators unaware of limitations. Mitigation: mandatory indicators and concise messages.
- Unbounded shedding: Safety support functions (e.g., essential self-tests) also vanish. Mitigation: minimum service set and rate-limited simplification instead of removal.
- Masking latent faults: Long periods in degraded mode hide failures. Mitigation: proof-test intervals and maintenance actions after degradation.
FAQ
Is graceful degradation the same as redundancy?
No. Redundancy provides extra capacity to continue full service after faults. Graceful degradation deliberately reduces non-critical services so the safety function remains dependable even without spares.
Can graceful degradation replace going to a safe state?
No. Degradation buys time and preserves safety functions, but if conditions continue to deteriorate or integrity targets are not met, the system must transition to a defined safe state.
How do I prove degraded operation is safe?
Define acceptance criteria (timing, coverage, annunciation), inject representative faults (load, comms loss, brownout), and demonstrate that the safety function meets its requirements in each degraded mode.
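As a concrete illustration, a host-side fault-injection test can force one trigger and then check the acceptance criteria for that degraded mode. The hooks below (inject_cpu_load, alarm_scan_latency_ms, and so on) are hypothetical harness functions, not a real API.

/* Sketch of a fault-injection check for one degraded mode; all hooks are
   hypothetical and would be bound to the project's own test harness. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

extern void     inject_cpu_load(uint8_t percent);
extern void     wait_ms(uint32_t ms);
extern int      current_mode(void);
extern uint32_t alarm_scan_latency_ms(void);
extern bool     service_running(int service_id);
extern bool     degraded_indicator_active(void);
enum { DEGRADED_MODE = 1, ALARM_SCAN = 0, MAINTENANCE_UI = 3 };

void test_degraded_mode_under_cpu_overload(void)
{
    inject_cpu_load(95);                        /* trigger: CPU load above 90 %      */
    wait_ms(1000);                              /* allow the bounded detection time  */

    assert(current_mode() == DEGRADED_MODE);    /* mode entered deliberately         */
    assert(alarm_scan_latency_ms() <= 100);     /* acceptance criterion: timing      */
    assert(service_running(ALARM_SCAN));        /* safety function kept              */
    assert(!service_running(MAINTENANCE_UI));   /* non-critical service shed         */
    assert(degraded_indicator_active());        /* annunciation criterion            */
}

Repeating a test like this for each trigger in the mode table, plus the transition back to normal operation, provides the auditable evidence the procedure calls for.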
Does graceful degradation apply only to hardware?
No. While often discussed in hardware contexts, it applies to the total system: software services, communications, diagnostics, and HMI.