Performance benchmarking in safety-critical systems has a fundamental problem: you can’t compare performance across platforms unless you first prove the outputs are identical.
The Problem
Standard benchmarking tools measure how fast code runs. They don’t verify that the code produces the same results on different hardware. For most software, this doesn’t matter. For safety-critical ML inference, it matters enormously.
Consider deploying a neural network trained on x86 to a RISC-V edge device. The standard approach: benchmark both, compare latency. But what if floating-point rounding differences, SIMD variations, or library implementations cause the outputs to differ? You’ve optimised for speed on a system that’s now producing different results.
DO-178C, IEC 62304, and ISO 26262 all require evidence of correct behaviour. A benchmark showing “RISC-V is 2.3x slower than x86” is meaningless if the outputs don’t match.
The Solution: The Bit-Identity Gate
certifiable-bench introduces a simple but critical concept: the bit-identity gate. You can only compare performance after proving the outputs are identical.
```c
cb_compare_results(&x86_result, &riscv_result, &comparison);

if (comparison.outputs_identical) {
    // Outputs match — performance comparison is meaningful
    printf("RISC-V is %.2fx slower\n", comparison.latency_ratio_q16 / 65536.0);
} else {
    // Outputs differ — comparison is invalid
    printf("ERROR: Outputs differ, comparison invalid\n");
}
```

This changes the workflow. Instead of "benchmark first, hope the outputs match", it becomes "verify outputs, then benchmark". The verification uses FIPS 180-4 SHA-256 to hash outputs during the benchmark run (outside the timing loop), then compares the hashes across platforms.
What’s Implemented
The harness provides six modules, each with formal requirements documentation:
| Module | SRS | Tests | Purpose |
|---|---|---|---|
| Timer | SRS-001 | 10,032 | High-resolution timing with 23ns measured overhead |
| Metrics | SRS-002 | 1,502 | Statistics, histograms, outlier detection, WCET estimation |
| Platform | SRS-006 | 35 | CPU detection, hardware counters, environment monitoring |
| Verify | SRS-004 | 113 | SHA-256 hashing, golden references, result binding |
| Runner | SRS-003 | 92 | Warmup phases, critical loop, configurable iterations |
| Report | SRS-005 | 66 | JSON/CSV output, cross-platform comparison |
Total: 11,840 assertions across approximately 233 SHALL statements in the requirements documentation.
The Critical Loop
The benchmark runner separates timing from verification:
```c
for (i = 0; i < config->measure_iterations; i++) {
    /* === CRITICAL LOOP START === */
    t_start = cb_timer_now_ns();
    rc = fn(ctx, input, output);
    t_end = cb_timer_now_ns();
    /* === CRITICAL LOOP END === */

    samples[i] = t_end - t_start;

    /* Verification OUTSIDE critical timing */
    if (config->verify_outputs) {
        cb_verify_ctx_update(&verify_ctx, output, output_size);
    }
}
```

Verification happens outside the timed region. The final hash covers all outputs from all iterations, ensuring determinism across the entire run.
Statistics That Matter for Certification
Beyond mean and standard deviation, certifiable-bench computes metrics required for safety certification:
- WCET Estimation: Worst-Case Execution Time bound, calculated as max + 6×stddev. This provides a conservative estimate for real-time scheduling.
- Percentiles: p50, p95, p99 for understanding tail latency.
- Outlier Detection: MAD-based detection (Iglewicz & Hoaglin method) to identify anomalous samples caused by thermal throttling, interrupts, or other interference.
- Histogram: Configurable bins with overflow/underflow tracking for visualising the full latency distribution.
All statistics use integer-only arithmetic where possible, avoiding floating-point non-determinism in the measurement infrastructure itself.
Platform Detection
The harness automatically detects:
- Architecture: x86_64, aarch64, riscv64
- CPU model: From /proc/cpuinfo
- Frequency: From sysfs, with stability monitoring
- Hardware counters: Via perf_event (cycles, instructions, cache misses, branch mispredictions)
- Environment: Temperature, throttle events
A stability check flags results if CPU frequency drifts more than 5% during the benchmark run.
Result Binding
Each benchmark result is cryptographically bound to its context:
```
H(output_hash || platform || config || stats || timestamp)
```

This creates a tamper-evident record. If someone claims "we achieved X latency on platform Y", the hash can be verified against the full result data.
Usage
```c
#include "cb_runner.h"
#include "cb_report.h"

cb_result_code_t my_inference(void *ctx, const void *in, void *out) {
    // Your neural network inference
    return CB_OK;
}

int main(void) {
    cb_config_t config;
    cb_result_t result;

    cb_config_init(&config);
    config.warmup_iterations = 100;
    config.measure_iterations = 1000;

    cb_run_benchmark(&config, my_inference, model_ctx,
                     input_data, input_size,
                     output_data, output_size,
                     &result);

    cb_print_summary(&result);
    cb_write_json(&result, "benchmark_x86.json");
    return 0;
}
```

Pipeline Context
certifiable-bench sits between certifiable-inference (the deterministic ML engine) and certifiable-harness (the end-to-end verification system):
```
certifiable-inference ──→ certifiable-bench ──→ certifiable-harness
          ↑                                             │
          └───────────────── Performance data ──────────┘
```

certifiable-bench runs the inference engine with benchmark instrumentation and produces JSON reports; certifiable-harness consumes these for cross-platform comparison. Model bundles from certifiable-deploy can include baseline benchmark data for regression testing.
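As an illustration of what such a per-platform report might carry (field names here are invented for this sketch, not the harness's actual schema):

```json
{
  "platform": "riscv64",
  "iterations": 1000,
  "latency_ns": { "p50": 412000, "p95": 431000, "p99": 440000, "wcet_bound": 512000 },
  "output_sha256": "…",
  "result_binding": "…"
}
```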
Why This Matters
For aerospace (DO-178C), medical devices (IEC 62304), and automotive (ISO 26262), timing data is part of the certification evidence package. Section 6.3.4 of DO-178C specifically requires “Software Timing and Sizing Data” as a verification output.
But timing data without correctness verification is incomplete evidence. certifiable-bench provides both: proof of determinism and measurement of performance.
The bit-identity gate ensures you never accidentally compare apples to oranges. If x86 and RISC-V produce different outputs, you find out immediately, before drawing conclusions about relative performance.
Getting Started
```sh
git clone https://github.com/SpeyTech/certifiable-bench
cd certifiable-bench
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
ctest --output-on-failure
```

Documentation
The repository includes formal documentation:
- CB-MATH-001: Mathematical foundations (statistics, verification, comparison algorithms)
- CB-STRUCT-001: Data structure specifications
- SRS-001 through SRS-006: Requirements documents with ~233 SHALL statements
Current Status
The harness is feature-complete and ready for integration testing:
- ✅ All statistics and verification tests pass
- ✅ JSON/CSV reporting working
- ✅ Cross-platform comparison implemented
- ⏳ Bit-identity verification on RISC-V pending hardware access
- ⏳ CI regression detection pending infrastructure setup
As with any architectural approach, suitability depends on system requirements, risk classification, and regulatory context.