Fault detection

What is it?

Fault detection is the deliberate act of checking a system, subsystem, or software component for erroneous states and stopping those errors from propagating. It uses value-domain checks (limits, plausibility, monotonicity), time-domain checks (timeouts, deadlines, execution jitter), and structural means (redundancy, diversity, voting). In self-checking systems, a component relinquishes control or triggers a safe state when it detects its own results are incorrect. Fault detection can be implemented at multiple levels: physical (e.g., temperature, voltage), logical (e.g., error-detecting codes), functional (e.g., assertions), and external (e.g., cross-checks against independent measurements).

How it supports functional safety

Fault detection reduces the likelihood that systematic failures (such as software bugs, logic defects, integration mistakes, or incorrect assumptions) will silently propagate to the safety function. By detecting anomalies quickly and activating a defined safe reaction, the design reaches a safe state within the fault-tolerant time interval (FTTI), before the failure can lead to a hazardous event. Fault detection also intercepts manifestations of random or common-cause hardware faults when they appear as corrupted values, timing violations, or protocol errors, so the safety function does not act on bad data.

When to use

When a single, unchecked computation or sensor value could lead directly to a hazardous control action.
When SIL targets require diagnostic coverage of input processing, control logic, or communication paths.
When communication links may drop, corrupt, or reorder messages (e.g., fieldbus, CAN, Ethernet).
When environmental or process dynamics make “stuck-at”, out-of-range, or implausible values credible.

Inputs & Outputs

Inputs

Critical signals and computed results (sensor values, actuator commands, estimates).
Timing information (task periods, deadlines, heartbeat/“kick” events).
Protocol metadata (CRC, sequence counters, timestamps, freshness indicators).
Configuration for limits, rate-of-change, plausibility rules, and timeouts.

Outputs

Fault flags and diagnostic codes (latched as appropriate).
Safe reaction commands (e.g., shut down heater, hold last safe value, switch to redundant channel).
Degraded/limp-home mode selection.
Traceable diagnostic records (event logs, counters, timestamps).
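
The "latched" fault flags and detection counters listed above can be sketched as a small fault store. This is a minimal illustration, not a prescribed design: the fault identifiers, the `fault_store_t` type, and the two-condition reset rule are all assumptions made for the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical fault identifiers; a real project derives these from its hazard analysis. */
typedef enum {
    FAULT_RANGE = 0,
    FAULT_RATE,
    FAULT_STUCK,
    FAULT_FRESHNESS,
    FAULT_COUNT
} fault_id_t;

typedef struct {
    uint32_t latched;                  /* one sticky bit per fault id */
    uint16_t detections[FAULT_COUNT];  /* saturating counters for trend analysis */
} fault_store_t;

/* Record a detection: set the sticky bit and count the event. */
void fault_raise(fault_store_t *fs, fault_id_t id) {
    fs->latched |= (uint32_t)1u << id;
    if (fs->detections[id] < UINT16_MAX) {
        fs->detections[id]++;
    }
}

/* True while any fault is latched; the safe reaction must stay active. */
bool fault_any(const fault_store_t *fs) {
    return fs->latched != 0u;
}

/* Assumed reset policy: clear the latch only if the reset is authorized
   AND the fault condition is no longer present. Counters are kept. */
bool fault_supervised_reset(fault_store_t *fs, bool authorized, bool condition_cleared) {
    if (!authorized || !condition_cleared) {
        return false;
    }
    fs->latched = 0u;
    return true;
}
```

Keeping the counters across a reset preserves the diagnostic history even after the system has been returned to service.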

Procedure

Map hazards to checks. Identify signals and computations whose unchecked failure could cause a hazard; derive value-domain, time-domain, and structural checks from the FMEA/FMEDA and FTTI.
Select detection mechanisms. Combine range/limit checks, rate-of-change and plausibility rules, timeouts and watchdogs, error-detecting codes (e.g., CRC), and—where justified—redundancy with voting or a diverse monitor.
Define safe reactions. For each detected fault, specify a deterministic, bounded-time reaction (e.g., discard/ignore, hold last known safe value, enter safe state, switch channel) and how it is latched and reset.
Implement locally. Place checks at the smallest practical subsystem (function/module) to localize diagnosis; propagate only validated data to higher levels with associated diagnostic status.
Instrument diagnostics. Count detections, timestamp events, and store context with affected data to support trend analysis and investigation.
Verify & validate. Use unit tests, boundary tests, timing analysis, and fault injection to confirm detection coverage and that worst-case detection plus reaction time ≤ FTTI; document the evidence for assessment.
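
As one possible time-domain mechanism for the "timeouts and watchdogs" step, the sketch below shows a windowed heartbeat monitor that flags a task kicking too early, too late, or not at all. The type and function names (`window_monitor_t`, `wm_kick`, `wm_overdue`) are hypothetical, and timestamps are assumed to come from a free-running millisecond counter supplied by the caller.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { HB_OK, HB_TOO_EARLY, HB_TOO_LATE } hb_result_t;

typedef struct {
    uint32_t min_period_ms;  /* kicks closer together than this are suspicious */
    uint32_t max_period_ms;  /* kicks further apart than this are a deadline miss */
    uint32_t last_kick_ms;
    bool     armed;          /* false until the first kick has been seen */
} window_monitor_t;

void wm_init(window_monitor_t *m, uint32_t min_ms, uint32_t max_ms) {
    m->min_period_ms = min_ms;
    m->max_period_ms = max_ms;
    m->last_kick_ms = 0u;
    m->armed = false;
}

/* Called by the monitored task once per cycle. */
hb_result_t wm_kick(window_monitor_t *m, uint32_t now_ms) {
    hb_result_t result = HB_OK;
    if (m->armed) {
        /* Unsigned subtraction stays correct across counter wrap-around. */
        uint32_t dt = now_ms - m->last_kick_ms;
        if (dt < m->min_period_ms) {
            result = HB_TOO_EARLY;
        } else if (dt > m->max_period_ms) {
            result = HB_TOO_LATE;
        }
    }
    m->last_kick_ms = now_ms;
    m->armed = true;
    return result;
}

/* Polled by an independent supervisor to catch a task that stopped kicking. */
bool wm_overdue(const window_monitor_t *m, uint32_t now_ms) {
    return m->armed && ((uint32_t)(now_ms - m->last_kick_ms) > m->max_period_ms);
}
```

The early-kick check matters because a task stuck in a tight loop can keep a simple timeout watchdog satisfied while doing no useful work.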

Worked Example

High-level

A heater is controlled by a safety controller using a temperature sensor message that carries a sequence counter and CRC. Software implements: (1) CRC check, (2) range limits, (3) rate-of-change and “stuck” detection, and (4) freshness timeout. If any check fails, the controller disables the heater and latches a diagnostic until a supervised reset.

Code-level

#include <stdint.h>
#include <stdbool.h>

#define TEMP_MAX_Cx10     1200    // 120.0°C in tenths
#define TEMP_MIN_Cx10     (-400)  // -40.0°C in tenths (parenthesized for safe macro expansion)
#define MAX_ROC_Cx10      50      // max change per sample, rise or fall (5.0°C)
#define STUCK_LIMIT       5       // consecutive repeats tolerated; the next repeat raises the fault
#define FRESHNESS_MS      200    // data must arrive within 200 ms

extern uint32_t platform_millis(void);
extern void heater_off(void);         // SAFE REACTION actuator
extern void log_fault(const char*);   // traceable diagnostic
extern bool read_sensor_frame(uint8_t* buf, uint32_t len);

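static bool fault_latched = false;    // set on any fault; cleared only by a supervised reset (not shown)

static void safe_shutdown(const char* reason) {
    heater_off();                  // SAFE REACTION: enter safe state
    fault_latched = true;          // latch so later good samples cannot re-enable heat
    log_fault(reason);             // record fault
}

// Simplified example: implements only the range, rate-of-change, and stuck checks.
// The CRC, sequence-counter, and freshness checks on the frame are not shown here.
bool check_temperature(int16_t new_temp) {
    static int16_t last_temp = 0x7FFF;   // sentinel: no previous sample yet
    static int repeat_count = 0;

    if (fault_latched) {
        return false;              // stay in the safe state until reset
    }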

    if (new_temp > TEMP_MAX_Cx10 || new_temp < TEMP_MIN_Cx10) {
        safe_shutdown("Out of range");
        return false;
    }

    if (last_temp != 0x7FFF) {
        int16_t roc = new_temp - last_temp;
        if (roc > MAX_ROC_Cx10 || roc < -MAX_ROC_Cx10) {
            safe_shutdown("Rate of change fault");
            return false;
        }
        if (new_temp == last_temp) {
            if (++repeat_count > STUCK_LIMIT) {
                safe_shutdown("Sensor stuck");
                return false;
            }
        } else {
            repeat_count = 0;
        }
    }

    last_temp = new_temp;
    return true; // value accepted
}

Result: The controller will not act on corrupted, stale, or implausible data; instead it deterministically disables heat within the FTTI, latches the condition, and provides diagnostics.
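
The listing above covers the value-domain checks only. The protocol and freshness checks from the high-level description could look roughly like the sketch below. Everything about the frame is an assumption made for illustration: a 4-byte layout (temperature high byte, low byte, sequence counter, CRC), a CRC-8 with polynomial 0x07 and zero initial value, and the names `frame_checker_t` and `check_frame`. A real sensor protocol defines its own layout and CRC parameters.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define FRAME_LEN     4u     /* assumed layout: temp_hi, temp_lo, seq, crc8 */
#define FRESHNESS_MS  200u   /* data older than this is rejected */

/* Bitwise CRC-8, polynomial 0x07, initial value 0x00 (assumed parameters). */
uint8_t crc8(const uint8_t *data, size_t len) {
    uint8_t crc = 0x00u;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u) : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

typedef enum { FRAME_OK, FRAME_BAD_CRC, FRAME_BAD_SEQUENCE, FRAME_STALE } frame_status_t;

typedef struct {
    uint8_t expected_seq;
    bool    have_seq;     /* false until the first valid frame is seen */
} frame_checker_t;

/* Validates one frame; writes the decoded temperature only when FRAME_OK. */
frame_status_t check_frame(frame_checker_t *fc, const uint8_t frame[FRAME_LEN],
                           uint32_t rx_time_ms, uint32_t now_ms, int16_t *temp_out) {
    /* Freshness: unsigned subtraction is safe across timer wrap-around. */
    if ((uint32_t)(now_ms - rx_time_ms) > FRESHNESS_MS) {
        return FRAME_STALE;
    }
    /* Integrity: the CRC covers every byte except the CRC byte itself. */
    if (crc8(frame, FRAME_LEN - 1u) != frame[FRAME_LEN - 1u]) {
        return FRAME_BAD_CRC;
    }
    /* Sequence: catches repeated, dropped, and reordered frames. */
    uint8_t seq = frame[2];
    bool in_order = !fc->have_seq || (seq == fc->expected_seq);
    fc->expected_seq = (uint8_t)(seq + 1u);   /* resynchronize on the received counter */
    fc->have_seq = true;
    if (!in_order) {
        return FRAME_BAD_SEQUENCE;
    }
    /* Decode big-endian tenths of a degree (two's complement assumed). */
    *temp_out = (int16_t)(uint16_t)(((uint16_t)frame[0] << 8) | frame[1]);
    return FRAME_OK;
}
```

In this arrangement a caller would pass the decoded value on to the range and rate-of-change checks only when the status is `FRAME_OK`, and treat any other status as a fault requiring the safe reaction.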

Quality criteria

Timeliness: Worst-case detection + reaction time proven ≤ FTTI.
Determinism: Checks have bounded execution time and well-defined priorities.
Coverage: The set of checks demonstrably covers relevant failure modes.
Independence: Monitors sufficiently diverse to avoid common systematic faults.
Traceability: Faults, counters, and timestamps are recorded and linked to requirements and tests.

Common pitfalls

Detection without reaction. Mitigation: Always specify and test a bounded safe reaction.
Single-check syndrome. Mitigation: Combine value, time, and structural checks.
Non-diverse monitor. Mitigation: Use dissimilar/simple monitors, not duplicates.
Over-sensitive thresholds. Mitigation: Tune thresholds using real data and add hysteresis.
FTTI exceeded. Mitigation: Analyze timing, prioritize detection tasks, and pre-emptively disable hazards.
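
For the over-sensitive threshold pitfall, one common approach is to combine a hysteresis band with a confirmation count, so that noise near the limit does not trip the fault. The sketch below is an assumption-laden illustration: the `debounced_limit_t` type, its field names, and the example thresholds are hypothetical, and the confirmed flag is deliberately not latched here.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int16_t set_threshold;    /* condition becomes active at or above this value */
    int16_t clear_threshold;  /* condition becomes inactive at or below this value */
    uint8_t confirm_samples;  /* consecutive active samples needed to confirm */
    uint8_t count;            /* current run of active samples (saturates) */
    bool    active;           /* raw condition after hysteresis */
    bool    confirmed;        /* debounced result reported to the caller */
} debounced_limit_t;

/* Feed one sample; returns true once the over-limit condition is confirmed. */
bool limit_update(debounced_limit_t *d, int16_t value) {
    if (value >= d->set_threshold) {
        d->active = true;
    } else if (value <= d->clear_threshold) {
        d->active = false;
    }
    /* Between the two thresholds the previous state is kept (hysteresis band). */

    if (d->active) {
        if (d->count < d->confirm_samples) {
            d->count++;
        }
    } else {
        d->count = 0u;
    }
    d->confirmed = (d->count >= d->confirm_samples);
    return d->confirmed;
}
```

The confirmation count adds up to `confirm_samples` sample periods of detection latency, which has to be included in the timing budget against the FTTI.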

FAQ

Does fault detection always require redundancy?

No. Redundancy is one option. Many designs combine simple value/time checks (limits, plausibility, freshness) with protocol checks (CRC, sequence) and a lightweight diverse monitor to achieve coverage with lower cost and complexity.

How is “diagnostic coverage” shown for software fault detection?

IEC 61508 defines diagnostic coverage (DC) primarily for hardware, but software detection can prevent unsafe outputs and support the overall safety case. Show evidence via requirements traceability, fault-injection tests, coverage of the identified failure modes, and timing proofs that detection plus reaction time ≤ FTTI.

What’s the difference between fault detection and fault tolerance?

Detection finds erroneous states and triggers a safe reaction (e.g., shutdown or degrade). Tolerance continues service despite faults (e.g., hot standby with voting). Many safety designs detect and then either tolerate (switch channel) or go safe.
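
To make the distinction concrete, the sketch below shows a median-of-three voter, one simple way to combine both ideas: the median output tolerates a single faulty channel, while the outlier mask detects which channel disagreed. The three-channel setup, the `vote3` function, and the tolerance parameter are assumptions for illustration.

```c
#include <stdint.h>

typedef struct {
    int16_t value;     /* median of the three channels */
    uint8_t outliers;  /* bit 0 = channel A, bit 1 = channel B, bit 2 = channel C */
} vote_result_t;

/* Median of three values: clamp c into the range spanned by a and b. */
static int16_t median3(int16_t a, int16_t b, int16_t c) {
    int16_t lo = (a < b) ? a : b;
    int16_t hi = (a < b) ? b : a;
    if (c < lo) { return lo; }
    if (c > hi) { return hi; }
    return c;
}

/* Absolute difference computed in 32 bits to avoid int16_t overflow. */
static int32_t abs_diff(int16_t x, int16_t y) {
    int32_t d = (int32_t)x - (int32_t)y;
    return (d < 0) ? -d : d;
}

/* Selects the median and marks every channel further than `tolerance` from it. */
vote_result_t vote3(int16_t a, int16_t b, int16_t c, int16_t tolerance) {
    vote_result_t r;
    r.value = median3(a, b, c);
    r.outliers = 0u;
    if (abs_diff(a, r.value) > tolerance) { r.outliers |= 0x01u; }
    if (abs_diff(b, r.value) > tolerance) { r.outliers |= 0x02u; }
    if (abs_diff(c, r.value) > tolerance) { r.outliers |= 0x04u; }
    return r;
}
```

With one outlier the system can continue on the median and log the faulty channel; with two or more outliers the median itself is no longer trustworthy, so going to the safe state is the more defensible reaction.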

What evidence convinces assessors?

Documented rationale for checks, independence/diversity argument, WCET/timing analysis, repeatable fault-injection results, and latched diagnostics tied to hazards and FTTI demonstrate adequacy.
