Fault Detection in Functional Safety — IEC 61508

Fault detection is the deliberate act of checking a system, subsystem, or software component for erroneous states and stopping those errors from propagating to the safety function. It uses value-domain, time-domain, and structural checks to discover faults early and trigger a defined safe reaction within the fault tolerant time interval (FTTI). FTTI is the ISO 26262 term; the closest IEC 61508 concept is the process safety time.

What is fault detection?

Fault detection uses value-domain checks (limits, plausibility, monotonicity), time-domain checks (timeouts, deadlines, execution jitter), and structural means (redundancy, diversity, voting) to identify erroneous states.

In self-checking systems, a component relinquishes control or triggers a safe state when it detects its own results are incorrect. Fault detection can be implemented at multiple levels: physical (e.g. temperature, voltage), logical (e.g. error-detecting codes), functional (e.g. assertions), and external (e.g. cross-checks against independent measurements).
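
As an illustration of a structural check, the following is a minimal sketch of a 2oo3 voter over three redundant temperature channels. The names `vote_2oo3` and `vote_result_t`, and the agreement tolerance of 2.0 C, are assumptions made for this example, not values from IEC 61508.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical tolerance: two channels "agree" if within 2.0 C (tenths of a degree).
#define AGREE_TOL_Cx10 20

typedef struct {
    bool    valid;        // true if at least two channels agree
    int16_t value;        // voted value (median), meaningful only if valid
    bool    discrepancy;  // true if any pair of channels disagrees
} vote_result_t;

static bool agree(int16_t a, int16_t b) {
    int32_t d = (int32_t)a - (int32_t)b;
    return d <= AGREE_TOL_Cx10 && d >= -AGREE_TOL_Cx10;
}

// Median of three: an outlier is always the minimum or maximum,
// so the median comes from the agreeing pair.
static int16_t median3(int16_t a, int16_t b, int16_t c) {
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

vote_result_t vote_2oo3(int16_t a, int16_t b, int16_t c) {
    bool ab = agree(a, b), bc = agree(b, c), ac = agree(a, c);
    vote_result_t r;
    r.valid       = ab || bc || ac;     // at least one agreeing pair
    r.discrepancy = !(ab && bc && ac);  // report even a tolerated fault
    r.value       = median3(a, b, c);
    return r;
}
```

The voter both tolerates a single faulty channel (`valid` stays true) and detects it (`discrepancy` is set), so the fault can be logged and repaired before a second channel fails.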

How it supports functional safety

Fault detection reduces the likelihood that systematic failures — such as software bugs, logic defects, integration mistakes, or incorrect assumptions — will silently propagate to the safety function. By detecting anomalies quickly and activating a defined safe reaction, the design limits the consequence of such failures within the FTTI.

Fault detection also intercepts manifestations of random or common-cause hardware faults when they appear as corrupted values, timing violations, or protocol errors — so the safety function does not act on bad data.

The key question is: for every signal and computation that could cause a hazard if wrong — do you have a check, and does the check trigger a safe reaction in time?

When to use

  • When a single, unchecked computation or sensor value could lead directly to a hazardous control action
  • When SIL targets require diagnostic coverage of input processing, control logic, or communication paths
  • When communication links may drop, corrupt, or reorder messages (e.g. fieldbus, CAN, Ethernet)
  • When environmental or process dynamics make "stuck-at", out-of-range, or implausible values credible

Inputs and outputs

Inputs

  • Critical signals and computed results (sensor values, actuator commands, estimates)
  • Timing information (task periods, deadlines, heartbeat events)
  • Protocol metadata (CRC, sequence counters, timestamps, freshness indicators)
  • Configuration for limits, rate-of-change, plausibility rules, and timeouts

Outputs

  • Fault flags and diagnostic codes (latched as appropriate)
  • Safe reaction commands (e.g. shut down heater, hold last safe value, switch to redundant channel)
  • Degraded or limp-home mode selection
  • Traceable diagnostic records (event logs, counters, timestamps)
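
The outputs listed above could be held in a small diagnostic record. This is a sketch only: the type `diag_record_t`, the fault codes, and the two reset conditions are hypothetical and would come from the safety requirements of the actual system.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical fault codes, one bit each so several can be latched at once.
typedef enum {
    FAULT_RANGE   = 1u << 0,
    FAULT_RATE    = 1u << 1,
    FAULT_STUCK   = 1u << 2,
    FAULT_TIMEOUT = 1u << 3,
    FAULT_CRC     = 1u << 4
} fault_code_t;

typedef struct {
    uint32_t latched;     // bitmask of faults seen since the last reset
    uint32_t count;       // total detections (saturating)
    uint32_t first_ms;    // timestamp of the first latched fault
    uint32_t last_ms;     // timestamp of the most recent detection
    int16_t  last_value;  // data value that triggered the last detection
} diag_record_t;

void diag_report(diag_record_t *d, fault_code_t code, uint32_t now_ms, int16_t value) {
    if (d->latched == 0u) d->first_ms = now_ms;
    d->latched |= (uint32_t)code;
    if (d->count < UINT32_MAX) d->count++;
    d->last_ms    = now_ms;
    d->last_value = value;
}

// The safe reaction stays in force while anything is latched.
bool diag_heater_permitted(const diag_record_t *d) {
    return d->latched == 0u;
}

// Supervised reset: clears the latch only if both conditions hold.
// Counters and timestamps are kept for trend analysis.
bool diag_supervised_reset(diag_record_t *d, bool operator_confirmed, bool inputs_healthy) {
    if (!operator_confirmed || !inputs_healthy) return false;
    d->latched = 0u;
    return true;
}
```

Keeping the counters across a reset means repeated nuisance trips remain visible in the record even after the latch has been cleared.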

Procedure

  1. Map hazards to checks. Identify signals and computations whose unchecked failure could cause a hazard. Derive value-domain, time-domain, and structural checks from the FMEA/FMEDA and FTTI.
  2. Select detection mechanisms. Combine range/limit checks, rate-of-change and plausibility rules, timeouts and watchdogs, error-detecting codes (e.g. CRC), and — where justified — redundancy with voting or a diverse monitor.
  3. Define safe reactions. For each detected fault, specify a deterministic, bounded-time reaction (e.g. discard, hold last known safe value, enter safe state, switch channel) and how it is latched and reset.
  4. Implement locally. Place checks at the smallest practical subsystem (function/module) to localise diagnosis. Propagate only validated data to higher levels with associated diagnostic status.
  5. Instrument diagnostics. Count detections, timestamp events, and store context with affected data to support trend analysis and investigation.
  6. Verify and validate. Use unit tests, boundary tests, timing analysis, and fault injection to confirm detection coverage and that reaction time ≤ FTTI. Document evidence for assessment.

Justify each detection mechanism against the required SIL and diagnostic coverage targets, and prove the safe reaction is independent and timely (reaction time bounded ≤ FTTI) under worst-case conditions.
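
Steps 2 and 3 can be captured as a table that pairs each check with its reaction and timing, so the budget against the FTTI can be reviewed and tested in one place. The sketch below is illustrative: the check names, reactions, and millisecond figures are invented for the example and are not derived from any real analysis.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { REACT_DISCARD_SAMPLE, REACT_HOLD_LAST_SAFE, REACT_SAFE_STATE } reaction_t;

typedef struct {
    const char *check;      // detection mechanism
    reaction_t  reaction;   // what happens when it fires
    bool        latched;    // does the fault need a supervised reset?
    uint16_t    detect_ms;  // assumed worst-case time to detect
    uint16_t    react_ms;   // assumed worst-case time to complete the reaction
} check_spec_t;

// Hypothetical figures; real values come from timing analysis.
static const check_spec_t CHECKS[] = {
    { "crc",       REACT_DISCARD_SAMPLE, false,  10,  0 },
    { "range",     REACT_SAFE_STATE,     true,   10, 20 },
    { "rate",      REACT_SAFE_STATE,     true,   10, 20 },
    { "freshness", REACT_SAFE_STATE,     true,  200, 20 },
};
#define CHECK_COUNT (sizeof CHECKS / sizeof CHECKS[0])

// Returns NULL if every check satisfies detect + react <= FTTI,
// otherwise the first entry that exceeds the budget.
const check_spec_t *first_budget_violation(uint16_t ftti_ms) {
    for (size_t i = 0; i < CHECK_COUNT; i++) {
        uint32_t total = (uint32_t)CHECKS[i].detect_ms + CHECKS[i].react_ms;
        if (total > ftti_ms) return &CHECKS[i];
    }
    return NULL;
}
```

With these example figures, an FTTI of 500 ms passes, while an FTTI of 200 ms is violated by the freshness check (200 ms timeout plus 20 ms reaction). That suggests the timeout must be chosen from the FTTI, not the other way round.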

Worked example — heater safety controller

A heater is controlled by a safety controller using a temperature sensor message that carries a sequence counter and CRC. Software implements: (1) CRC check, (2) range limits, (3) rate-of-change and "stuck" detection, and (4) freshness timeout. If any check fails, the controller disables the heater and latches a diagnostic until a supervised reset.

Code-level example

#include <stdint.h>
#include <stdbool.h>

#define TEMP_MAX_Cx10     1200   // 120.0 C in tenths
#define TEMP_MIN_Cx10     -400   // -40.0 C in tenths
#define MAX_ROC_Cx10      50     // max change per sample, either direction (5.0 C)
#define STUCK_LIMIT       5      // max consecutive repeats of a reading before flag
#define FRESHNESS_MS      200    // data must arrive within 200 ms (frame-level check, not shown here)

extern uint32_t platform_millis(void);   // time base for the freshness check (not shown here)
extern void heater_off(void);         // SAFE REACTION actuator
extern void log_fault(const char*);   // traceable diagnostic
extern bool read_sensor_frame(uint8_t* buf, uint32_t len);  // frame reception (CRC and sequence checks not shown here)

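static bool fault_latched = false;     // set on any fault; cleared only by a supervised reset (not shown)

static void safe_shutdown(const char* reason) {
    heater_off();                  // SAFE REACTION: enter safe state
    fault_latched = true;          // latch so later good samples cannot re-enable heat
    log_fault(reason);             // record fault
}

bool check_temperature(int16_t new_temp) {
    static int16_t last_temp = 0x7FFF;   // sentinel: no previous sample yet
    static int repeat_count = 0;

    if (fault_latched) {
        return false;              // remain in the safe state until reset
    }
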
    if (new_temp > TEMP_MAX_Cx10 || new_temp < TEMP_MIN_Cx10) {
        safe_shutdown("Out of range");
        return false;
    }

    if (last_temp != 0x7FFF) {
        int16_t roc = new_temp - last_temp;
        if (roc > MAX_ROC_Cx10 || roc < -MAX_ROC_Cx10) {
            safe_shutdown("Rate of change fault");
            return false;
        }
        if (new_temp == last_temp) {
            if (++repeat_count > STUCK_LIMIT) {
                safe_shutdown("Sensor stuck");
                return false;
            }
        } else {
            repeat_count = 0;
        }
    }

    last_temp = new_temp;
    return true;  // value accepted
}
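
The listing above covers the value-domain checks only. A sketch of the frame-level checks named in the worked example (CRC, sequence counter, freshness timeout) might look like the following. The 4-byte frame layout, the CRC-8 polynomial 0x07, and the names `validate_frame` and `link_is_fresh` are assumptions for illustration; the source does not specify the message format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN     4u    // assumed layout: temp_hi, temp_lo, sequence, crc
#define FRESHNESS_MS  200u  // data must arrive within 200 ms

typedef enum { FRAME_OK, FRAME_BAD_CRC, FRAME_BAD_SEQUENCE } frame_status_t;

typedef struct {
    bool     have_frame;  // false until the first valid frame
    uint8_t  last_seq;    // sequence counter of the last accepted frame
    uint32_t last_rx_ms;  // reception time of the last accepted frame
} link_state_t;

// Bitwise CRC-8, polynomial 0x07, initial value 0x00 (assumed parameters).
uint8_t crc8(const uint8_t *data, size_t len) {
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u) : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

// Accepts a frame only if the CRC matches and the sequence counter
// is exactly one ahead of the previous accepted frame (modulo 256).
frame_status_t validate_frame(link_state_t *st, const uint8_t frame[FRAME_LEN],
                              uint32_t now_ms, int16_t *temp_out) {
    if (crc8(frame, FRAME_LEN - 1u) != frame[FRAME_LEN - 1u]) return FRAME_BAD_CRC;

    uint8_t seq = frame[2];
    if (st->have_frame && seq != (uint8_t)(st->last_seq + 1u)) return FRAME_BAD_SEQUENCE;

    st->have_frame = true;
    st->last_seq   = seq;
    st->last_rx_ms = now_ms;
    *temp_out = (int16_t)(((uint16_t)frame[0] << 8) | frame[1]);
    return FRAME_OK;
}

// Call periodically, independent of frame arrival, so silence is detected.
// Unsigned subtraction keeps the comparison correct across timer wrap-around.
bool link_is_fresh(const link_state_t *st, uint32_t now_ms) {
    return st->have_frame && (uint32_t)(now_ms - st->last_rx_ms) <= FRESHNESS_MS;
}
```

Only a temperature that passes `validate_frame` would be handed to `check_temperature`. The freshness check runs on its own schedule because a timeout, by definition, cannot be detected in the receive path. Since the worked example latches on any failure, this sketch does not resynchronise the sequence counter after an error; a supervised reset would reinitialise `link_state_t`.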

Result: The controller will not act on corrupted, stale, or implausible data. Instead it deterministically disables heat within the FTTI, latches the condition, and provides diagnostics.

Quality criteria

  • Timeliness: Worst-case detection + reaction time proven ≤ FTTI.
  • Determinism: Checks have bounded execution time and well-defined priorities.
  • Coverage: The set of checks demonstrably covers relevant failure modes.
  • Independence: Monitors are sufficiently diverse to avoid common systematic faults.
  • Traceability: Faults, counters, and timestamps are recorded and linked to requirements and tests.

Common pitfalls

Detection without reaction

Faults are flagged but no safe reaction is executed — the system continues with bad data.

Mitigation: Always specify and test a bounded safe reaction for every detection.

Single check syndrome

Only one type of check is applied, leaving other failure modes undetected.

Mitigation: Combine value, time, and structural checks for defence in depth.

Non-diverse monitor

The monitor duplicates the main function, so both fail the same way.

Mitigation: Use dissimilar or simple monitors, not duplicates of the main logic.
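
A minimal sketch of what "simple, not duplicate" can mean: the monitor below does not recompute the control law, it only enforces a safety envelope on the command. The proportional controller, the setpoint, and the cutoff value are hypothetical. In a real design the monitor would ideally also read an independent sensor.

```c
#include <stdbool.h>
#include <stdint.h>

#define SETPOINT_Cx10   800   // hypothetical setpoint, 80.0 C
#define CUTOFF_Cx10    1000   // monitor's own limit, 100.0 C

// Main function: illustrative proportional controller, duty in 0..100 %.
uint8_t control_duty(int16_t temp) {
    int32_t duty = ((int32_t)SETPOINT_Cx10 - temp) / 4;  // 1 % per 0.4 C of error
    if (duty < 0)   duty = 0;
    if (duty > 100) duty = 100;
    return (uint8_t)duty;
}

// Monitor: knows nothing about the control law. It checks only that the
// command is well-formed and that no heat is requested above the cutoff.
bool monitor_permits(int16_t temp, uint8_t duty) {
    if (duty > 100u) return false;
    if (temp >= CUTOFF_Cx10 && duty != 0u) return false;
    return true;
}

// The command reaches the actuator only through the monitor.
uint8_t gated_duty(int16_t temp, uint8_t commanded) {
    return monitor_permits(temp, commanded) ? commanded : 0u;
}
```

Because the monitor shares no arithmetic with the controller, a defect in the gain or clamping logic of `control_duty` cannot also disable the envelope check.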

Over-sensitive thresholds

Thresholds set too tight cause nuisance trips and erode operator trust.

Mitigation: Tune thresholds using real data and add hysteresis.
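
One way to add hysteresis is to use separate trip and clear thresholds, combined with a short debounce. The sketch below assumes a non-latched over-temperature alarm; the thresholds, the debounce count, and the name `overtemp_update` are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRIP_Cx10      1000  // hypothetical: alarm at or above 100.0 C
#define CLEAR_Cx10      950  // clear only below 95.0 C
#define TRIP_DEBOUNCE     3  // consecutive samples at or above TRIP needed

typedef struct {
    bool    active;      // alarm currently raised
    uint8_t over_count;  // consecutive samples at or above TRIP_Cx10
} overtemp_t;

bool overtemp_update(overtemp_t *s, int16_t temp) {
    if (!s->active) {
        if (temp >= TRIP_Cx10) {
            if (++s->over_count >= TRIP_DEBOUNCE) s->active = true;
        } else {
            s->over_count = 0;        // a single good sample restarts the debounce
        }
    } else if (temp < CLEAR_Cx10) {   // inside the band the alarm stays raised
        s->active = false;
        s->over_count = 0;
    }
    return s->active;
}
```

Debouncing trades nuisance trips for detection latency: three samples at the sensor's update period must be added to the worst-case detection time when checking against the FTTI.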

FTTI exceeded

Detection and reaction take longer than the fault tolerant time interval allows.

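Mitigation: Analyse worst-case timing, give detection tasks sufficient priority, and remove the hazard (for example, de-energise the output) as soon as a fault is suspected rather than after lengthy confirmation.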

Frequently asked questions

Does fault detection always require redundancy?

No. Redundancy is one option. Many designs combine simple value and time checks (limits, plausibility, freshness) with protocol checks (CRC, sequence) and a lightweight diverse monitor to achieve coverage with lower cost and complexity.

How is diagnostic coverage shown for software fault detection?

IEC 61508 treats diagnostic coverage (DC) primarily as a hardware metric, but software detection can prevent unsafe outputs and support the overall safety case. Show evidence via requirements traceability, fault-injection tests, coverage of the identified failure modes, and timing proofs that reaction time ≤ FTTI.

What is the difference between fault detection and fault tolerance?

Detection finds erroneous states and triggers a safe reaction (e.g. shutdown or degrade). Tolerance continues service despite faults (e.g. hot standby with voting). Many safety designs detect and then either tolerate (switch channel) or go safe.

What evidence convinces assessors?

Documented rationale for checks, independence and diversity argument, WCET/timing analysis, repeatable fault-injection results, and latched diagnostics tied to hazards and FTTI demonstrate adequacy.

Related techniques

  • Watchdog timer — detects timing and control-flow faults
  • Assertion programming — detects software logic errors at runtime
  • N-version programming / diverse monitor — detects faults via independent implementations and voting
  • Error-detecting codes (CRC, parity) — detect data corruption in memory and communication
  • 1oo2 / 2oo3 voting — architectural patterns that enable fault detection and safe selection

References

  • IEC 61508-3:2010/2017 — Annex A & Annex C (C.3.1 Fault detection and diagnosis)
  • Laprie et al. — Dependability: Basic Concepts and Terminology (Springer)
  • Redmill, F.J. — Dependability of Critical Computer Systems (Elsevier, 1988)
