Graceful Degradation in Functional Safety — IEC 61508

19 February 2026 · Dr. Michel Houtermans · 11 min read

Graceful degradation is a design technique where a safety-related system transitions from full redundancy to a reduced — but still safe — configuration after detecting a fault, instead of losing the safety function abruptly. It is central to how redundant architectures behave under IEC 61508.

What is graceful degradation?

In functional safety, graceful degradation means a safety-related system moves from its nominal, fully redundant configuration into a defined degraded configuration after one or more faults are detected — while still ensuring acceptable risk reduction.

Instead of the safety function disappearing when one channel fails, the system:

  • Detects the fault using diagnostics and channel comparison
  • Isolates or de-rates the faulty element
  • Continues operating in a reduced MooN configuration (e.g. 2oo3 → 1oo2)
  • Enforces a transition to a safe state if further degradation would violate the safety integrity target

Under IEC 61508, this typically applies to redundant architectures designed to provide hardware fault tolerance (HFT ≥ 1) — for example 1oo2, 2oo3, or more complex MooN structures implemented in sensors, logic solvers, and final elements.

The key question is not whether a system can tolerate a fault — it is whether the system's behaviour after a fault is specified, quantified, and verified.

How graceful degradation supports functional safety

IEC 61508 does not only require you to meet a numerical failure target (PFDavg or PFH). It also requires that safety-related systems respond to faults in a predictable, specified manner consistent with the Safety Requirements Specification (SRS). Graceful degradation is how that requirement is met in redundant architectures.

Systematic failures

From a systematic failure perspective, graceful degradation forces the engineering team to:

  • Explicitly specify degraded-state behaviour in the SRS, instead of leaving fault responses implicit or tool-dependent
  • Design deterministic state transitions (normal → degraded → safe) in hardware and software, avoiding ambiguous behaviour that could itself be a source of systematic faults
  • Implement and verify fault-handling logic as a first-class safety requirement, reducing the risk of design, implementation, or integration errors in fault response
  • Document assumptions about allowed degraded operation, which can then be reviewed, tested, and audited

Important: Many "mysterious" field behaviours after a fault are in fact systematic design gaps in the degraded state — not random hardware failures. A well-defined graceful degradation concept closes this gap.
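
The deterministic normal → degraded → safe transitions described above can be sketched as a pure transition function. This is a minimal illustration, not a reference implementation: the state and event names (SafetyState, SafetyEvent, next_state) are hypothetical, and a real design would take its states and escalation criteria from the SRS.

```c
#include <assert.h>

/* Hypothetical state and event names, chosen for illustration only. */
typedef enum { STATE_NORMAL, STATE_DEGRADED, STATE_SAFE } SafetyState;

typedef enum {
    EVT_NONE,             /* nothing new this cycle */
    EVT_CHANNEL_FAULT,    /* diagnostics isolated one channel */
    EVT_DEGRADED_TIMEOUT, /* maximum degraded duration reached */
    EVT_REPAIR_CONFIRMED  /* faulty channel repaired and re-validated */
} SafetyEvent;

/* Pure transition function: the same (state, event) pair always yields
 * the same next state, so every transition can be listed and tested. */
SafetyState next_state(SafetyState state, SafetyEvent event)
{
    switch (state) {
    case STATE_NORMAL:
        return (event == EVT_CHANNEL_FAULT) ? STATE_DEGRADED : STATE_NORMAL;
    case STATE_DEGRADED:
        switch (event) {
        case EVT_CHANNEL_FAULT:    return STATE_SAFE;   /* second fault */
        case EVT_DEGRADED_TIMEOUT: return STATE_SAFE;   /* time limit reached */
        case EVT_REPAIR_CONFIRMED: return STATE_NORMAL;
        default:                   return STATE_DEGRADED;
        }
    case STATE_SAFE:
    default:
        /* Latched: leaving the safe state needs a separate, deliberate reset.
         * An unknown (corrupted) state value is also treated as safe. */
        return STATE_SAFE;
    }
}

/* Example walk through the specified path: normal -> degraded -> safe. */
void example_walk(void)
{
    SafetyState s = STATE_NORMAL;
    s = next_state(s, EVT_CHANNEL_FAULT);
    assert(s == STATE_DEGRADED);
    s = next_state(s, EVT_DEGRADED_TIMEOUT);
    assert(s == STATE_SAFE);
}
```

Keeping the transition logic free of side effects is one way to make the state behaviour reviewable line by line against the SRS; alarms and outputs would then be driven from the resulting state.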

Random and common-cause hardware faults

From a hardware-fault perspective, graceful degradation is the operational consequence of using redundancy to meet SIL targets:

  • Random hardware faults in one channel are detected by diagnostics and isolated from the remaining channels
  • The system continues to provide the safety function in a reduced architecture (e.g. 2oo3 → 1oo2), with a recalculated — and higher, but still acceptable — probability of dangerous failure
  • The time spent in degraded operation is limited, so that the increased risk contribution remains inside the overall PFDavg / PFH target
  • The technique works together with common-cause failure (CCF) mitigation: physical separation, environmental protection, independent power, and diverse design reduce the chance that the same cause defeats all channels simultaneously

Graceful degradation does not remove common-cause failures. But it ensures that a single fault or local disturbance does not immediately translate into a complete loss of the safety function.
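
Why redundancy alone cannot absorb common-cause failures can be seen from a commonly used simplified approximation for a 1oo2 pair in low-demand mode. It is shown here only as an illustration: it ignores repair time and diagnostics, the proof test interval symbol T_I is an assumption of this sketch, and IEC 61508-6 gives more detailed formulas.

```latex
\[
\mathrm{PFD}_{avg}^{1oo2} \;\approx\;
\underbrace{\frac{\bigl((1-\beta)\,\lambda_{DU}\,T_I\bigr)^{2}}{3}}_{\text{independent failures}}
\;+\;
\underbrace{\beta\,\frac{\lambda_{DU}\,T_I}{2}}_{\text{common-cause failures}}
\]
```

With purely illustrative values λDU·T_I = 0.01 and β = 0.1, the independent term is about 2.7 × 10⁻⁵ and the common-cause term is 5 × 10⁻⁴. The common-cause term behaves like a single channel scaled by β, so adding or isolating channels does not reduce it; only CCF measures that lower β do.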

When to use graceful degradation

Graceful degradation is appropriate when:

  • The required SIL is achieved using redundant hardware architectures (e.g. 1oo2, 2oo3 sensors, multiple processing channels, redundant actuators)
  • HFT and SFF requirements from IEC 61508-2 are met using redundancy and diagnostics
  • Continuous or high-availability operation is needed, and an immediate trip on every single detected fault is not operationally acceptable — provided the safety integrity can still be maintained
  • Reliable diagnostics are available to detect channel disagreement or internal failures
  • The organisation is prepared to control and monitor time in degraded mode through procedures and maintenance response

It should not be used as a safety blanket when:

  • CCF vulnerabilities are high (e.g. all channels in one cabinet, shared single power supply, identical configuration errors)
  • Diagnostic coverage is low or unproven, so faults may remain latent in multiple channels
  • There is no realistic ability to track degraded mode duration and enforce repair or safe shutdown

Inputs and outputs

Inputs

  • Target SIL and associated PFDavg / PFH limits
  • Chosen hardware architecture (MooN structure, device type A/B, SFF values)
  • Failure rate data split into safe and dangerous, detected and undetected failures (λS, λD, λDD, λDU)
  • Diagnostic concept and coverage (online tests, proof tests, cross-monitoring, comparison logic)
  • Common-cause failure analysis results (e.g. β-factor, IEC 61508-6 scoring)
  • Assumed repair times and maintenance response (MTTR, maximum allowed degraded time)
  • SRS specification of normal, degraded, and safe states, including escalation criteria

Outputs

  • Documented degraded-state behaviour (logic, states, alarms) in the SRS and design specifications
  • Updated reliability calculations for both nominal and degraded configurations (PFDavg / PFH)
  • Defined monitoring and alarming strategy for entering, remaining in, and leaving degraded mode
  • Procedures for maintenance, repair, and forced transition to a safe state after a maximum degraded duration
  • Evidence for the safety case showing that overall safety integrity requirements are met, including degraded operation

Procedure

  1. Derive SIL and architectural needs. From hazard and risk analysis, determine the required SIL and whether redundancy is needed to meet IEC 61508-2 architectural constraints (HFT, SFF).
  2. Select the MooN architecture. Choose appropriate MooN structures (e.g. 1oo2, 2oo3) for sensors, logic solvers, and final elements that can support the required SIL with justified failure data and CCF assumptions.
  3. Define normal and degraded states in the SRS. Specify how the system behaves in normal and degraded modes: which channels are active, what voting logic is used, and what triggers transition to a safe state.
  4. Design fault detection and isolation. Implement diagnostics (self-tests, channel comparison, watchdogs, plausibility checks) capable of detecting faulty channels in time. Define the rules for isolating a channel and using the remaining channels.
  5. Quantify degraded-mode integrity. Perform reliability calculations (PFDavg or PFH) for both the nominal and degraded configurations, including β-factors for CCF. Determine the maximum allowable time in degraded mode while still meeting overall targets.
  6. Implement monitoring, alarming, and time limits. Ensure operators are informed via clear alarms when the system enters degraded mode, and enforce time limits for repair or shutdown. Define escalation rules if the limit is reached.
  7. Verify with fault injection and validation tests. Inject representative faults in one channel at a time. Verify that the system correctly detects, isolates, and transitions through states according to specification — and that a safe reaction occurs when integrity can no longer be guaranteed.
  8. Capture assumptions and evidence in the safety case. Document all assumptions about degraded mode operation, repair times, diagnostic coverage, and CCF mitigation. Include test results and calculations as part of the safety case.

Graceful degradation is only "graceful" if you can prove that the degraded configuration still meets its integrity claim for a limited time. If you do not quantify and control that time, you are simply running a broken system.
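
Step 5 of the procedure can be sketched numerically. The snippet below assumes a deliberately simple model in which the overall PFDavg is the time-weighted average of the nominal and degraded values; the function names and all figures used with them are hypothetical, and a real analysis would follow the project's own reliability model.

```c
#include <assert.h>

/* Time-weighted PFDavg when a given fraction (0.0 to 1.0) of operating
 * time is spent in the degraded configuration. Simplified model. */
double weighted_pfd(double pfd_nominal, double pfd_degraded, double fraction)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    return (1.0 - fraction) * pfd_nominal + fraction * pfd_degraded;
}

/* Largest fraction of operating time that may be spent in degraded mode
 * while the time-weighted PFDavg stays at or below the target.
 * Returns 0.0 if even the nominal configuration misses the target and
 * 1.0 if the degraded configuration meets the target on its own. */
double max_degraded_fraction(double pfd_nominal, double pfd_degraded,
                             double pfd_target)
{
    assert(pfd_nominal >= 0.0 && pfd_degraded >= 0.0 && pfd_target > 0.0);

    if (pfd_nominal > pfd_target) {
        return 0.0;
    }
    if (pfd_degraded <= pfd_target) {
        return 1.0;
    }
    /* Solve (1 - f) * nominal + f * degraded = target for f. */
    return (pfd_target - pfd_nominal) / (pfd_degraded - pfd_nominal);
}
```

With illustrative inputs of 2e-4 (nominal), 5e-3 (degraded) and a target of 1e-3, max_degraded_fraction returns one sixth. Turning such a fraction into a per-event limit (for example a number of hours) also requires the expected frequency of detected channel faults, which this sketch does not model.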

Worked example — SIL 3 overspeed protection

Consider a SIL 3 overspeed protection function for a critical rotating machine. The function uses three independent speed sensors to detect overspeed and initiate a trip.

  • Architecture: 2oo3 voting on speed measurement
  • Diagnostics: plausibility checks and cross-comparison between channels each cycle
  • Normal state: all three channels active; trip occurs if at least two channels detect speed above the trip threshold

Failure scenario

  1. Sensor 2 develops a fault and outputs values significantly deviating from Sensors 1 and 3.
  2. The comparison logic detects persistent disagreement beyond a configured tolerance and duration.
  3. Sensor 2 is marked as faulty, automatically isolated from the voting logic, and an alarm is raised: "Overspeed protection degraded — 1oo2."
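  4. The system continues using Sensors 1 and 3 in an effective 1oo2 configuration, in which either remaining sensor alone can initiate a trip. A 1oo2 vote still tolerates one dangerous sensor fault (HFT 1); HFT would drop to 0 only if the remaining pair were voted 2oo2.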
  5. Reliability calculations have already shown that this degraded state is acceptable for up to 24 hours if no further faults occur.
  6. If a second sensor fails, or if the 24-hour time limit is reached without repair, the logic forces a safe shutdown of the machine.

This behaviour is documented in the SRS, verified by test, and included in the safety case. The machine is protected throughout: first by redundancy, then by controlled degraded operation, and finally by a safe shutdown when integrity can no longer be assured.
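
Steps 1 to 3 of the failure scenario (persistent disagreement beyond a tolerance and duration, followed by isolation) could look roughly like the sketch below. It compares each channel with the median of the three readings, which is one possible comparison method and not necessarily the one used in a given product. TOLERANCE_RPM, PERSISTENCE_CYCLES and the DisagreementMonitor type are hypothetical, and the sketch only covers the first fault among three channels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>   /* abs, NULL */

#define NUM_CHANNELS       3
#define TOLERANCE_RPM      50  /* hypothetical allowed deviation from the median */
#define PERSISTENCE_CYCLES 5   /* hypothetical number of consecutive bad cycles */

typedef struct {
    int  disagree_count[NUM_CHANNELS]; /* consecutive cycles outside tolerance */
    bool isolated[NUM_CHANNELS];       /* latched until repair */
} DisagreementMonitor;

static int median_of_three(int a, int b, int c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

/* Call once per cycle with the three raw readings. A channel is isolated
 * only after it has deviated from the median for PERSISTENCE_CYCLES
 * consecutive cycles; a single good cycle resets its counter. */
void monitor_update(DisagreementMonitor *m, const int rpm[NUM_CHANNELS])
{
    assert(m != NULL && rpm != NULL);

    int median = median_of_three(rpm[0], rpm[1], rpm[2]);

    for (int i = 0; i < NUM_CHANNELS; i++) {
        if (m->isolated[i]) {
            continue; /* already removed from voting */
        }
        if (abs(rpm[i] - median) > TOLERANCE_RPM) {
            m->disagree_count[i]++;
            if (m->disagree_count[i] >= PERSISTENCE_CYCLES) {
                m->isolated[i] = true;
            }
        } else {
            m->disagree_count[i] = 0; /* transient deviation, start over */
        }
    }
}
```

The persistence counter is what separates a transient disturbance from a faulty sensor. Both the tolerance and the duration would need to be justified against the process dynamics and the required fault reaction time.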

Code-level example

The following pseudo-C example illustrates the voting logic, degraded-mode detection, and time-limited operation described above.

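// Pseudo C-like example of graceful degradation voting logic for a 2oo3 speed SIF

#include <stdbool.h>

#define TRIP_LIMIT_RPM 5000
// Unsigned long arithmetic avoids overflow on targets with a 16-bit int
#define MAX_DEGRADED_TIME_SEC (24UL * 3600UL)

typedef struct {
    int value_rpm;
    bool valid;
} Channel;

Channel ch1, ch2, ch3;

bool system_degraded = false;
unsigned long degraded_start_time = 0;

// Platform-specific functions, assumed to be provided elsewhere
int read_speed_sensor(int id);
bool check_plausibility(int rpm);
unsigned long now_seconds(void);
void trip_machine(void);
void raise_alarm(const char *message);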

void overspeed_protection_cycle(void) {
    // Read channels
    ch1.value_rpm = read_speed_sensor(1);
    ch2.value_rpm = read_speed_sensor(2);
    ch3.value_rpm = read_speed_sensor(3);

    ch1.valid = check_plausibility(ch1.value_rpm);
    ch2.valid = check_plausibility(ch2.value_rpm);
    ch3.valid = check_plausibility(ch3.value_rpm);

    // Count the channels that passed the plausibility check
    // (cross-channel disagreement detection is omitted here for brevity)
    int valid_count = (ch1.valid ? 1 : 0)
                    + (ch2.valid ? 1 : 0)
                    + (ch3.valid ? 1 : 0);

    if (valid_count < 2) {
        // SAFE REACTION: Not enough valid channels, integrity lost
        trip_machine();
        raise_alarm("Overspeed protection integrity lost");
        return;
    }

    // Determine if system has entered degraded mode
    if (valid_count == 2 && !system_degraded) {
        system_degraded = true;
        degraded_start_time = now_seconds();
        raise_alarm("Overspeed protection degraded (1oo2 active)");
    }

    // Enforce maximum degraded duration
    if (system_degraded) {
        unsigned long elapsed = now_seconds() - degraded_start_time;
        if (elapsed > MAX_DEGRADED_TIME_SEC) {
            // SAFE REACTION: Degraded time exceeded
            trip_machine();
            raise_alarm("Max degraded time exceeded – forced shutdown");
            return;
        }
    }

    // Vote over the valid channels only: 2oo3 with three valid channels,
    // 1oo2 once one channel has been isolated
    int above_limit_count = 0;
    if (ch1.valid && ch1.value_rpm > TRIP_LIMIT_RPM) above_limit_count++;
    if (ch2.valid && ch2.value_rpm > TRIP_LIMIT_RPM) above_limit_count++;
    if (ch3.valid && ch3.value_rpm > TRIP_LIMIT_RPM) above_limit_count++;

    // Trip rule: two votes with three valid channels, one vote in degraded (1oo2) mode
    int required_votes = (valid_count == 3) ? 2 : 1;
    if (above_limit_count >= required_votes) {
        trip_machine();
    }
}

Quality criteria

  • Clarity: Normal, degraded, and safe states are explicitly defined and unambiguous in the SRS and design documents.
  • Determinism: Transitions between states are deterministic, fully specified, and testable — no hidden or tool-dependent behaviours.
  • Traceability: Fault handling and degraded-mode behaviour are traceable from hazards and safety requirements through to implementation and test cases.
  • Quantitative justification: Reliability calculations explicitly cover both full and degraded configurations, including time-in-degraded-mode assumptions and β-factors for CCF.
  • Diagnostic robustness: Fault detection mechanisms are shown by analysis and test to achieve the required diagnostic coverage.
  • Operational control: Alarms, procedures, and time limits for degraded mode are clearly defined, communicated, and enforceable.
  • Auditability: All assumptions, design decisions, and test results are recorded and can be independently assessed in a functional safety assessment.

Common pitfalls

Indefinite degraded operation

Systems remain in 1oo2 or single-channel mode for weeks or months without repair. This silently erodes the safety integrity to unacceptable levels.

Mitigation: Define and enforce a strict maximum degraded duration. Escalate to safe shutdown if it is exceeded.

No recalculation of integrity in degraded mode

Only the full architecture is analysed; degraded mode is treated as "it will probably be fine."

Mitigation: Perform separate PFDavg / PFH calculations for degraded mode and integrate time-in-degraded into the average risk model.

Weak or unproven diagnostics

Assumed channel disagreement detection does not actually catch realistic failure modes.

Mitigation: Validate diagnostics with realistic fault injection and conservative modelling of diagnostic coverage.

High common-cause exposure

Redundant channels share cabinets, power, or environment to such an extent that one disturbance can defeat all channels simultaneously.

Mitigation: Improve CCF measures — physical separation, diversity, independent power, environmental hardening — and re-evaluate β-factors.

Ambiguous alarms and unclear responsibilities

Operators do not understand the significance of degraded alarms or who needs to act.

Mitigation: Rationalise alarms, train operators, and include degraded-mode responses in operating procedures and drills.

Testing only normal operation

Factory and site acceptance tests focus on nominal behaviour; degraded states and transitions are barely exercised.

Mitigation: Include degraded-mode scenarios and fault injection as standard test cases with acceptance criteria.
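
One way to make degraded-mode scenarios standard test cases is a table of injected faults with the expected reaction for each. The sketch below is an assumption-laden illustration: decide, ChannelInput, FaultCase and run_fault_cases are hypothetical names, the voter is reduced to a pure function (2oo3 with three valid channels, 1oo2 with two, trip with fewer), and the 5000 rpm limit is only an example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRIP_LIMIT_RPM 5000 /* illustrative trip threshold */

typedef struct { int value_rpm; bool valid; } ChannelInput;
typedef enum { ACTION_RUN, ACTION_TRIP } Action;

/* Pure decision function: 2oo3 with three valid channels,
 * 1oo2 with two, and an unconditional trip with fewer than two. */
Action decide(const ChannelInput ch[3])
{
    assert(ch != NULL);

    int valid = 0;
    int above = 0;
    for (int i = 0; i < 3; i++) {
        if (!ch[i].valid) continue;
        valid++;
        if (ch[i].value_rpm > TRIP_LIMIT_RPM) above++;
    }
    if (valid < 2) return ACTION_TRIP;
    return (above >= (valid == 3 ? 2 : 1)) ? ACTION_TRIP : ACTION_RUN;
}

typedef struct {
    const char  *name;     /* scenario description for the test report */
    ChannelInput ch[3];    /* injected channel readings and validity */
    Action       expected; /* reaction required by the specification */
} FaultCase;

static const FaultCase fault_cases[] = {
    { "all healthy, normal speed",      {{3000, true}, {3010, true}, {2990, true}}, ACTION_RUN  },
    { "all healthy, overspeed",         {{5200, true}, {5210, true}, {5190, true}}, ACTION_TRIP },
    { "one healthy channel reads high", {{5200, true}, {3010, true}, {2990, true}}, ACTION_RUN  },
    { "ch2 isolated, normal speed",     {{3000, true}, {0, false},   {2990, true}}, ACTION_RUN  },
    { "ch2 isolated, one channel high", {{5200, true}, {0, false},   {2990, true}}, ACTION_TRIP },
    { "two channels isolated",          {{3000, true}, {0, false},   {0, false}},   ACTION_TRIP },
};

/* Returns the number of failing cases; 0 means every injected fault
 * produced the specified reaction. */
int run_fault_cases(void)
{
    int failures = 0;
    for (size_t i = 0; i < sizeof fault_cases / sizeof fault_cases[0]; i++) {
        if (decide(fault_cases[i].ch) != fault_cases[i].expected) {
            failures++;
        }
    }
    return failures;
}
```

Each row doubles as an acceptance criterion: the scenario name, the injected condition, and the required reaction are stated together, so a reviewer can check the table against the SRS without reading the voter itself.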

Frequently asked questions

Does graceful degradation guarantee that the original SIL is maintained?

No. After degradation, the effective architecture and HFT have changed. You must recalculate PFDavg / PFH and confirm that, given the limited time in degraded mode, the overall risk target is still met. In some cases the SIL claim must be reduced during degraded mode.

Is graceful degradation a mandatory requirement in IEC 61508?

IEC 61508 does not make graceful degradation a blanket requirement; it lists it as a technique (see the IEC 61508-7 Annex C entry under References). But once you choose redundant architectures, the standard expects that you define and verify how the system behaves when faults are detected. A well-engineered graceful degradation concept is often the only practical way to satisfy architectural and fault tolerance requirements.

Can graceful degradation replace common-cause failure mitigation?

No. If a common cause defeats multiple channels simultaneously — fire, flooding, shared power failure, common software defect — there may be no redundancy left to degrade into. CCF measures (separation, diversity, protection) must be in place regardless.

How long is it safe to remain in a degraded state?

There is no fixed number in the standard. The allowable degraded time is a design choice that must be justified by reliability calculations, maintenance capability, and risk acceptance. It must be specified in the SRS, implemented in logic (timers, monitoring), and enforced operationally.

References

  • IEC 61508-2 — Requirements for E/E/PE safety-related systems
  • IEC 61508-6 — Guidelines on the application of IEC 61508-2 and IEC 61508-3 (including common-cause failure measures)
  • IEC 61508-7:2010 Annex C — Graceful degradation (Table A.2 reference)
  • ISA / CCPS technical reports on SIL verification and SIS design
