Graceful degradation is a design and architecture technique in functional safety where a safety-related system moves from a fully redundant, nominal configuration into a defined degraded configuration after one or more faults are detected, while still ensuring a safe state or acceptable risk reduction.
Instead of the safety function disappearing abruptly when one channel fails, the system:
Detects the fault (using diagnostics and comparison between channels),
Isolates or de-rates the faulty element,
Continues operating in a reduced MooN configuration (e.g. 2oo3 → 1oo2),
Enforces a transition to a safe state if further degradation would violate the safety integrity target.
Under IEC 61508, this typically applies to redundant architectures designed to provide hardware fault tolerance (HFT ≥ 1), for example 1oo2, 2oo3, or more complex MooN structures implemented in sensors, logic solvers, and final elements.
How it supports functional safety
Graceful degradation supports functional safety by shaping what happens after a fault occurs. IEC 61508 does not only require you to meet a numerical failure target (PFDavg or PFH); it also requires that safety-related systems respond to faults in a predictable, specified manner consistent with the Safety Requirements Specification (SRS).
Systematic failures
From a systematic failure perspective, graceful degradation forces the engineering team to:
Explicitly specify degraded-state behaviour in the SRS, instead of leaving fault responses implicit or tool-dependent.
Design deterministic state transitions (normal → degraded → safe) in hardware and software, avoiding ambiguous behaviour that could be a source of systematic faults.
Implement and verify fault-handling logic as a first-class safety requirement, thereby reducing the risk of design, implementation, or integration errors in fault response.
Document assumptions about allowed degraded operation, which can then be reviewed, tested, and audited systematically.
Many “mysterious” field behaviours after a fault are in fact systematic design gaps in the degraded state, not random hardware failures. A well-defined graceful degradation concept closes this gap.
Random and common-cause hardware faults
From a hardware-fault perspective, graceful degradation is the operational consequence of using redundancy to meet SIL targets:
Random hardware faults in one channel are detected by diagnostics (self-tests, comparison, cross-monitoring) and isolated from the remaining channels.
The system continues to provide the safety function in a reduced architecture (e.g. 2oo3 → 1oo2), with a recalculated and higher, but still acceptable, probability of dangerous failure.
The time spent in degraded operation is limited, so that the increased risk contribution remains inside the overall PFDavg / PFH target.
The technique works together with common-cause failure (CCF) mitigation: physical separation, environmental protection, independent power, and diverse design reduce the chance that the same cause defeats all redundant channels simultaneously.
Graceful degradation does not remove common-cause failures, but it ensures that a single fault or local disturbance does not immediately translate into a complete loss of the safety function.
When to use
Graceful degradation is appropriate when:
The required SIL is achieved using redundant hardware architectures (e.g. sensors in 1oo2 or 2oo3, logic solvers with multiple processing channels, redundant actuators).
Hardware Fault Tolerance (HFT) and Safe Failure Fraction (SFF) requirements from IEC 61508-2 are met using redundancy and diagnostics.
Continuous or high-availability operation is needed, and an immediate trip on every single detected fault is not operationally acceptable, provided the safety integrity can still be maintained.
Reliable diagnostics are available to detect deviation, channel disagreement, or internal failures in at least one of the redundant channels.
The organisation is prepared to control and monitor time in degraded mode through procedures and maintenance response.
It should not be used as a “safety blanket” when:
CCF vulnerabilities are high (e.g. all channels in one cabinet with poor fire/EMI protection, shared single power supply, identical configuration errors).
Diagnostic coverage is low or unproven, so that faults may remain latent in multiple channels.
There is no realistic ability to track degraded mode duration and enforce repair or safe shutdown.
Inputs & Outputs
Inputs
Target SIL and associated PFDavg / PFH limits.
Chosen hardware architecture (MooN structure, device type A/B, SFF values).
Failure rate data for dangerous, safe, detected and undetected failures (e.g. λD, λDD, λDU).
Assumed repair times and maintenance response (MTTR, maximum allowed degraded time).
SRS specification of normal, degraded, and safe states, including escalation criteria.
Outputs
Documented degraded-state behaviour (logic, states, alarms) in the SRS and design specifications.
Updated reliability calculations for both nominal and degraded configurations (PFDavg / PFH).
Defined monitoring and alarming strategy for entering, remaining in, and leaving degraded mode.
Procedures for maintenance, repair, and, if needed, forced transition to a safe state after a maximum degraded duration.
Evidence for the safety case showing that the overall safety integrity requirements are met, including degraded operation.
Procedure
Derive SIL and architectural needs.
From hazard and risk analysis, determine required SIL and whether redundancy is needed to meet IEC 61508-2 architectural constraints (HFT, SFF).
Select the MooN architecture.
Choose appropriate MooN structures (e.g. 1oo2, 2oo3) for sensors, logic solvers, and final elements that can support the required SIL with justified failure data and CCF assumptions.
Define normal and degraded states in the SRS.
Specify in the Safety Requirements Specification how the system behaves in normal and degraded modes: which channels are active, what voting logic is used, and what must trigger transition to a safe state.
Design fault detection and isolation.
Implement diagnostics (self-tests, comparison of channels, watchdogs, plausibility checks) capable of detecting faulty channels in time. Define the rules that determine when a channel is isolated and how the remaining channels are used.
Quantify degraded-mode integrity.
Perform reliability calculations (PFDavg or PFH) for both the nominal and degraded configurations, including β-factors for CCF. Determine the maximum allowable time the system can stay in degraded mode while still meeting overall targets.
Implement monitoring, alarming, and time limits.
Ensure operators are informed via clear alarms when the system enters degraded mode, and enforce time limits for repair or shutdown. Define escalation rules if the limit is reached.
Verify with fault injection and validation tests.
Test the implementation by injecting representative faults (real or simulated) in one channel at a time. Verify that the system correctly detects, isolates, and transitions through states according to the specification, and that a safe reaction occurs when integrity can no longer be guaranteed.
Capture assumptions and evidence in the safety case.
Document all assumptions about degraded mode operation, repair times, diagnostic coverage, and CCF mitigation, and include the test results and calculations as part of the safety case.
Worked Example
High-level
Consider a SIL 3 overspeed protection function for a critical rotating machine. The function uses three independent speed sensors to detect overspeed and initiate a trip.
Architecture: 2oo3 voting on speed measurement.
Diagnostics: plausibility checks and cross-comparison between channels each cycle.
Normal state: all three channels active; trip occurs if at least two channels detect speed above the trip threshold.
Failure scenario:
Sensor 2 develops a fault and starts to output values significantly deviating from Sensors 1 and 3.
The comparison logic detects persistent disagreement beyond a configured tolerance and duration.
Sensor 2 is marked as faulty, automatically isolated from the voting logic, and an alarm is raised indicating “Overspeed protection degraded – 1oo2.”
The system continues to operate using Sensors 1 and 3 in an effective 1oo2 configuration (HFT now 0 instead of 1).
Reliability calculations have already shown that this degraded state is acceptable for up to 24 hours if no further faults occur.
If a second sensor fails, or if the 24-hour time limit is reached without repair, the logic forces a safe shutdown of the machine.
This behaviour is documented in the SRS, verified by test, and included in the safety case. The machine is protected throughout: first by redundancy, then by controlled degraded operation, and finally by a safe shutdown when integrity can no longer be assured.
Code-level
// Pseudo C-like example of graceful degradation voting logic for a 2oo3 speed SIF
#include <stdbool.h>

#define TRIP_LIMIT_RPM        5000
#define MAX_DEGRADED_TIME_SEC (24UL * 3600UL)

typedef struct {
    int  value_rpm;
    bool valid;
} Channel;

Channel ch1, ch2, ch3;
bool system_degraded = false;
unsigned long degraded_start_time = 0;

/* Platform-specific services, declared here so the example is complete */
int read_speed_sensor(int id);
bool check_plausibility(int rpm);
unsigned long now_seconds(void);
void trip_machine(void);
void raise_alarm(const char *message);

void overspeed_protection_cycle(void) {
    // Read channels
    ch1.value_rpm = read_speed_sensor(1);
    ch2.value_rpm = read_speed_sensor(2);
    ch3.value_rpm = read_speed_sensor(3);

    ch1.valid = check_plausibility(ch1.value_rpm);
    ch2.valid = check_plausibility(ch2.value_rpm);
    ch3.valid = check_plausibility(ch3.value_rpm);

    // Basic plausibility & disagreement detection
    int valid_count = (ch1.valid ? 1 : 0) + (ch2.valid ? 1 : 0) + (ch3.valid ? 1 : 0);

    if (valid_count < 2) {
        // SAFE REACTION: not enough valid channels, integrity lost
        trip_machine(); // force safe state
        raise_alarm("Overspeed protection integrity lost");
        return;
    }

    // Determine if system has entered degraded mode. The flag deliberately
    // latches until a maintenance reset (not shown), so a transiently
    // recovering channel cannot silently clear the degraded alarm.
    if (valid_count == 2 && !system_degraded) {
        system_degraded = true;
        degraded_start_time = now_seconds();
        raise_alarm("Overspeed protection degraded (1oo2 active)");
    }

    // Enforce maximum degraded duration
    if (system_degraded) {
        unsigned long elapsed = now_seconds() - degraded_start_time;
        if (elapsed > MAX_DEGRADED_TIME_SEC) {
            // SAFE REACTION: degraded time exceeded, trip to avoid excessive risk
            trip_machine();
            raise_alarm("Max degraded time exceeded – forced shutdown");
            return;
        }
    }

    // Count valid channels above the trip limit (invalid channels ignored)
    int above_limit_count = 0;
    if (ch1.valid && ch1.value_rpm > TRIP_LIMIT_RPM) above_limit_count++;
    if (ch2.valid && ch2.value_rpm > TRIP_LIMIT_RPM) above_limit_count++;
    if (ch3.valid && ch3.value_rpm > TRIP_LIMIT_RPM) above_limit_count++;

    // Trip rule: 2oo3 with all channels healthy; effective 1oo2 once a
    // channel has been isolated, so a single valid vote then trips
    int required_votes = (valid_count == 3) ? 2 : 1;
    if (above_limit_count >= required_votes) {
        trip_machine();
    }
}
Quality criteria
Clarity: Normal, degraded, and safe states are explicitly defined and unambiguous in the SRS and design documents.
Determinism: Transitions between states are deterministic, fully specified, and testable; no hidden or tool-dependent behaviours.
Traceability: Fault handling and degraded-mode behaviour are traceable from hazards and safety requirements through to implementation and test cases.
Quantitative justification: Reliability calculations explicitly cover both full and degraded configurations, including time-in-degraded-mode assumptions and β-factors for CCF.
Diagnostic robustness: Fault detection mechanisms are shown (by analysis and test) to achieve the required diagnostic coverage and to discriminate safely between channels.
Operational control: Alarms, procedures, and time limits for degraded mode are clearly defined, communicated, and enforceable in the operating organisation.
Auditability: All assumptions, design decisions, and test results supporting graceful degradation are recorded and can be independently assessed in a functional safety assessment.
Common pitfalls
Indefinite degraded operation.
Systems remain in 1oo2 or single-channel mode for weeks or months without repair. Mitigation: Define and enforce a strict maximum degraded duration and escalate to safe shutdown if it is exceeded.
No recalculation of integrity in degraded mode.
Only the full architecture is analysed; degraded mode is treated as “it will probably be fine.” Mitigation: Perform separate PFDavg/PFH calculations for degraded mode and integrate time-in-degraded into the average risk model.
Weak or unproven diagnostics.
Assumed channel disagreement detection does not actually catch realistic failure modes. Mitigation: Validate diagnostics with realistic fault injection and conservative modelling of diagnostic coverage.
High common-cause exposure.
Redundant channels share cabinets, power, or environment to such an extent that one disturbance can defeat all channels. Mitigation: Improve CCF measures (physical separation, diversity, independent power, environmental hardening) and re-evaluate β-factors.
Ambiguous alarms and unclear responsibilities.
Operators do not understand the significance of degraded alarms or who needs to act. Mitigation: Rationalise alarms, train operators, and include degraded-mode responses in operating procedures and drills.
Testing only normal operation.
Factory and site acceptance tests focus on nominal behaviour; degraded states and transitions are barely exercised. Mitigation: Include degraded-mode scenarios and fault injection as standard test cases with acceptance criteria.
Related techniques
Fault detection and diagnostics (including channel comparison and self-tests)
Hardware Fault Tolerance (HFT) design and assessment
Common Cause Failure (CCF) analysis and β-factor modelling
Diverse redundancy (hardware and software)
Safe state enforcement and fail-safe design
Safety Requirements Specification (SRS) for fault handling and degraded states
References
IEC 61508-2: Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 2: Requirements for E/E/PE safety-related systems.
IEC 61508-6: Functional safety – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3 (including common-cause failure measures).
Xing, L., Meshkat, L., Donohue, S. – “Reliability analysis of hierarchical computer-based systems subject to common-cause failures”, Reliability Engineering & System Safety.
Relevant industry guidance on SIL verification and SIS design (e.g. ISA / CCPS technical reports).
FAQ
Does graceful degradation guarantee that the original SIL is maintained?
No. After degradation, the effective architecture and HFT have changed. You must recalculate PFDavg/PFH and confirm that, given the limited time in degraded mode, the overall risk target is still met. In some cases the SIL claim must be reduced during degraded mode.
Is graceful degradation a mandatory requirement in IEC 61508?
IEC 61508 does not use the phrase “graceful degradation” as a defined term, but once you choose redundant architectures, the standard expects that you define and verify how the system behaves when faults are detected. A well-engineered graceful degradation concept is often the only practical way to satisfy architectural and fault tolerance requirements.
Can graceful degradation replace common-cause failure mitigation?
No. If a common cause defeats multiple channels simultaneously (fire, flooding, shared power failure, common software defect), there may be no redundancy left to degrade. CCF measures (separation, diversity, protection) must be in place regardless of graceful degradation.
How long is it safe to remain in a degraded state?
There is no fixed number in the standard. The allowable degraded time is a design choice that must be justified by reliability calculations, maintenance capability, and risk acceptance. It must be specified in the SRS, implemented in logic (timers, monitoring), and enforced operationally.