Performance benchmarking in safety-critical systems has a fundamental problem: you can’t compare performance across platforms unless you first prove the outputs are identical.
The Problem
Standard benchmarking tools measure how fast code runs. They don’t verify that the code produces the same results on different hardware. For most software, this doesn’t matter. For safety-critical ML inference, it matters enormously.
Consider deploying a neural network trained on x86 to a RISC-V edge device. The standard approach: benchmark both, compare latency. But what if floating-point rounding differences, SIMD variations, or library implementations cause the outputs to differ? You’ve optimised for speed on a system that’s now producing different results.
DO-178C, IEC 62304, and ISO 26262 all require evidence of correct behaviour. A benchmark showing “RISC-V is 2.3x slower than x86” is meaningless if the outputs don’t match.
The Solution: The Bit-Identity Gate
certifiable-bench introduces a simple but critical concept: the bit-identity gate. You can only compare performance after proving the outputs are identical.
```c
cb_compare_results(&x86_result, &riscv_result, &comparison);

if (comparison.outputs_identical) {
    // Outputs match — performance comparison is meaningful
    printf("RISC-V is %.2fx slower\n", comparison.latency_ratio_q16 / 65536.0);
} else {
    // Outputs differ — comparison is invalid
    printf("ERROR: Outputs differ, comparison invalid\n");
}
```

This changes the workflow. Instead of "benchmark first, hope the outputs match", it becomes "verify outputs, then benchmark". The verification uses FIPS 180-4 SHA-256 to hash outputs during the benchmark run (outside the timing loop), then compares the hashes across platforms.
What’s Implemented
The harness provides six modules, each with formal requirements documentation:
| Module | SRS | Tests | Purpose |
|---|---|---|---|
| Timer | SRS-001 | 10,032 | High-resolution timing with 23ns measured overhead |
| Metrics | SRS-002 | 1,502 | Statistics, histograms, outlier detection, WCET estimation |
| Platform | SRS-006 | 35 | CPU detection, hardware counters, environment monitoring |
| Verify | SRS-004 | 113 | SHA-256 hashing, golden references, result binding |
| Runner | SRS-003 | 92 | Warmup phases, critical loop, configurable iterations |
| Report | SRS-005 | 66 | JSON/CSV output, cross-platform comparison |
Total: 11,840 assertions across approximately 233 SHALL statements in the requirements documentation.
The Critical Loop
The benchmark runner separates timing from verification:
```c
for (i = 0; i < config->measure_iterations; i++) {
    /* === CRITICAL LOOP START === */
    t_start = cb_timer_now_ns();
    rc = fn(ctx, input, output);
    t_end = cb_timer_now_ns();
    /* === CRITICAL LOOP END === */

    samples[i] = t_end - t_start;

    /* Verification OUTSIDE critical timing */
    if (config->verify_outputs) {
        cb_verify_ctx_update(&verify_ctx, output, output_size);
    }
}
```

Verification happens outside the timed region. The final hash covers all outputs from all iterations, ensuring determinism across the entire run.
Statistics That Matter for Certification
Beyond mean and standard deviation, certifiable-bench computes metrics required for safety certification:
- WCET Estimation: Worst-Case Execution Time bound, calculated as max + 6×stddev. This provides a conservative estimate for real-time scheduling.
- Percentiles: p50, p95, p99 for understanding tail latency.
- Outlier Detection: MAD-based detection (Iglewicz & Hoaglin method) to identify anomalous samples caused by thermal throttling, interrupts, or other interference.
- Histogram: Configurable bins with overflow/underflow tracking for visualising the full latency distribution.
All statistics use integer-only arithmetic where possible, avoiding floating-point non-determinism in the measurement infrastructure itself.
Platform Detection
The harness automatically detects:
- Architecture: x86_64, aarch64, riscv64
- CPU model: From /proc/cpuinfo
- Frequency: From sysfs, with stability monitoring
- Hardware counters: Via perf_event (cycles, instructions, cache misses, branch mispredictions)
- Environment: Temperature, throttle events
A stability check flags results if CPU frequency drifts more than 5% during the benchmark run.
Result Binding
Each benchmark result is cryptographically bound to its context:
```
H(output_hash || platform || config || stats || timestamp)
```

This creates a tamper-evident record. If someone claims "we achieved X latency on platform Y", the hash can be verified against the full result data.
Usage
```c
#include "cb_runner.h"
#include "cb_report.h"

cb_result_code_t my_inference(void *ctx, const void *in, void *out) {
    // Your neural network inference
    return CB_OK;
}

int main(void) {
    cb_config_t config;
    cb_result_t result;

    cb_config_init(&config);
    config.warmup_iterations = 100;
    config.measure_iterations = 1000;

    cb_run_benchmark(&config, my_inference, model_ctx,
                     input_data, input_size,
                     output_data, output_size,
                     &result);

    cb_print_summary(&result);
    cb_write_json(&result, "benchmark_x86.json");
    return 0;
}
```

Pipeline Context
certifiable-bench sits between certifiable-inference (the deterministic ML engine) and certifiable-harness (the end-to-end verification system):
```
certifiable-inference ──→ certifiable-bench ──→ certifiable-harness
          ↑                                             │
          └───────────────── Performance data ──────────┘
```

certifiable-bench runs the inference engine with benchmark instrumentation and produces JSON reports; certifiable-harness consumes these for cross-platform comparison. Model bundles from certifiable-deploy can include baseline benchmark data for regression testing.
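As an illustration of what such a per-platform report might carry (field names here are invented for this sketch, not the harness's actual schema):

```json
{
  "platform": "riscv64",
  "iterations": 1000,
  "latency_ns": { "p50": 412000, "p95": 431000, "p99": 440000, "wcet_bound": 512000 },
  "output_sha256": "…",
  "result_binding": "…"
}
```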
Why This Matters
For aerospace (DO-178C), medical devices (IEC 62304), and automotive (ISO 26262), timing data is part of the certification evidence package. Section 6.3.4 of DO-178C specifically requires “Software Timing and Sizing Data” as a verification output.
But timing data without correctness verification is incomplete evidence. certifiable-bench provides both: proof of determinism and measurement of performance.
The bit-identity gate ensures you never accidentally compare apples to oranges. If x86 and RISC-V produce different outputs, you find out immediately, before drawing conclusions about relative performance.
Getting Started
```sh
git clone https://github.com/SpeyTech/certifiable-bench
cd certifiable-bench
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
ctest --output-on-failure
```

Documentation
The repository includes formal documentation:
- CB-MATH-001: Mathematical foundations (statistics, verification, comparison algorithms)
- CB-STRUCT-001: Data structure specifications
- SRS-001 through SRS-006: Requirements documents with ~233 SHALL statements
Current Status
The harness is feature-complete and ready for integration testing:
- ✅ All statistics and verification tests pass
- ✅ JSON/CSV reporting working
- ✅ Cross-platform comparison implemented
- ⏳ Bit-identity verification on RISC-V pending hardware access
- ⏳ CI regression detection pending infrastructure setup
As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context.