The model times out. The feature store is unreachable. Memory pressure forces garbage collection mid-inference. The GPU throws a CUDA error.
What happens next determines whether your users experience a brief hiccup or a complete outage.
Most ML systems have one answer to failure: return an error. HTTP 500, stack trace in the logs, maybe a retry. But production systems that serve real users need a better answer. They need graceful degradation — the ability to provide something useful even when the primary path fails.
A degraded answer is almost always better than no answer. A safe default is almost always better than an error page.
Why ML Systems Fail
Before designing fallbacks, understand the failure modes. ML inference fails differently than traditional services because it has more moving parts, each with its own failure characteristics.
Timeout Failures
The model takes too long. This is the most common failure in production ML. Causes include: input larger than expected, cold start after scaling, resource contention, or the model simply being too slow for production latency requirements.
Timeouts are predictable — you know they’ll happen, you just don’t know when. Design for them.
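Designing for timeouts means enforcing them at the call site rather than hoping the model returns in time. A minimal sketch using a thread pool — the helper name `predict_with_timeout` and the synchronous `model.predict(dict) -> dict` interface are assumptions for illustration, not part of any particular framework:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

_executor = ThreadPoolExecutor(max_workers=8)

def predict_with_timeout(model, input_data: dict, timeout_ms: int) -> dict:
    """Run model.predict, raising TimeoutError if it exceeds the budget.

    Note: the worker thread keeps running after a timeout. This bounds
    the caller's latency, not the work itself.
    """
    future = _executor.submit(model.predict, input_data)
    try:
        return future.result(timeout=timeout_ms / 1000.0)
    except FuturesTimeout:
        future.cancel()  # Best effort; a no-op once the task is running
        raise TimeoutError(f"prediction exceeded {timeout_ms}ms")
```

The trade-off is worth stating: a thread-based timeout protects the caller's latency budget, but the abandoned inference still consumes resources until it finishes.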
Resource Exhaustion
Memory pressure, GPU out of memory, thread pool exhausted, connection pool drained. Resource failures cascade: one component under pressure slows down, causing backpressure on upstream components, which then also fail.
Resource exhaustion is often self-inflicted. A traffic spike causes slowdowns, which cause retries, which cause more traffic, which causes more slowdowns.
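One way to break that loop is to bound retries and add jitter so clients don't retry in lockstep. A minimal sketch — the limits and the `call` parameter are illustrative:

```python
import random
import time

def retry_with_backoff(call, max_attempts: int = 3,
                       base_delay_s: float = 0.05,
                       max_delay_s: float = 1.0):
    """Retry `call` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Retry budget exhausted: surface the failure
            # Full jitter: sleep a random amount up to the capped backoff
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

The jitter matters more than the exact schedule: without it, every client that saw the same failure retries at the same instant, recreating the spike.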
Dependency Failures
Feature stores, embedding services, preprocessing pipelines, model registries — modern ML systems have many dependencies. Any one of them can fail, and they fail independently.
Dependency failures require dependency awareness. You can’t fall back gracefully if you don’t know what failed.
Model Failures
The model itself can fail: NaN outputs, confidence below threshold, exceptions in custom layers, corrupted weights after a bad deployment. Model failures are often silent — the model returns something, but that something is garbage.
Model failures need validation. Trust but verify.
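A validation gate can catch silent failures before garbage reaches callers. A sketch, assuming predictions are dicts with a numeric `score` in [0, 1] — the field names and thresholds are illustrative:

```python
import math

def validate_prediction(prediction: dict,
                        min_confidence: float = 0.1) -> bool:
    """Reject NaN/inf scores, out-of-range values, and low confidence."""
    score = prediction.get("score")
    if not isinstance(score, (int, float)) or not math.isfinite(score):
        return False  # NaN, inf, missing, or non-numeric
    if not 0.0 <= score <= 1.0:
        return False  # Outside the model's declared output range
    if prediction.get("confidence", 1.0) < min_confidence:
        return False  # Too uncertain to act on
    return True
```

A prediction that fails validation should be treated exactly like an exception: log it, count it, and fall through to the next level of the fallback hierarchy.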
Data Failures
Bad input data: missing features, out-of-range values, encoding mismatches, null where not-null was expected. Data failures are caller problems, but they’re your responsibility to handle gracefully.
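Handling them gracefully starts with explicit checks at the boundary. A sketch assuming a flat dict of numeric features — the schema shape here is an illustration, not a standard:

```python
def sanitize_input(raw: dict, schema: dict) -> dict:
    """Fill missing features with defaults; clamp out-of-range values.

    `schema` maps feature name -> (default, min, max).
    """
    clean = {}
    for name, (default, lo, hi) in schema.items():
        value = raw.get(name)
        if not isinstance(value, (int, float)):
            value = default  # Missing or wrong type: use the default
        clean[name] = min(max(value, lo), hi)  # Clamp to the valid range
    return clean
```

Whether to silently repair or to reject bad input is a domain decision; the point is to make it a decision, not an exception thrown from deep inside the model.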
The Fallback Hierarchy
When the primary model can’t answer, fall back through progressively simpler alternatives. Each level trades accuracy for availability.
Level 1: Cached Predictions
For requests you’ve seen before, return the cached answer. Cache hit rates vary wildly by use case — recommendation systems might hit 60%, fraud detection might hit 5% — but any cache hit is a request you don’t need to compute.
```python
import hashlib
import json
from typing import Optional


class PredictionCache:
    def __init__(self, redis_client, ttl_seconds: int = 3600):
        self.redis = redis_client
        self.ttl = ttl_seconds

    def _cache_key(self, model_version: str, input_data: dict) -> str:
        # Deterministic key from the input payload
        input_hash = hashlib.sha256(
            json.dumps(input_data, sort_keys=True).encode()
        ).hexdigest()[:16]
        return f"pred:{model_version}:{input_hash}"

    def get(self, model_version: str, input_data: dict) -> Optional[dict]:
        key = self._cache_key(model_version, input_data)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, model_version: str, input_data: dict, prediction: dict):
        key = self._cache_key(model_version, input_data)
        self.redis.setex(key, self.ttl, json.dumps(prediction))
```

Cache considerations:
- Key design: Include model version. A cached prediction from v2 isn’t valid for v3.
- TTL: Balance freshness against hit rate. Shorter TTL for fast-changing domains.
- Stale-while-revalidate: Return stale cache while refreshing in the background.
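The stale-while-revalidate pattern can be sketched on top of a cache like the one above. Here the second, longer-lived "stale" key namespace and the background refresh thread are illustrative choices, not part of any particular cache API:

```python
import threading
from typing import Optional

def get_with_revalidate(cache, model, model_version: str,
                        input_data: dict) -> Optional[dict]:
    """Serve fresh cache if present; else serve stale and refresh async."""
    fresh = cache.get(model_version, input_data)
    if fresh is not None:
        return fresh
    stale = cache.get(f"{model_version}:stale", input_data)
    if stale is not None:
        # Refresh in the background; the caller gets the stale answer now
        def refresh():
            result = model.predict(input_data)
            cache.set(model_version, input_data, result)
            cache.set(f"{model_version}:stale", input_data, result)
        threading.Thread(target=refresh, daemon=True).start()
        return stale
    return None  # Nothing cached: fall through to the model path
```

Give the stale keys a much longer TTL than the fresh keys so there is something to serve during an outage; the staleness you can tolerate is a product decision.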
Level 2: Simplified Model
Maintain a lightweight model that trades accuracy for speed and reliability. This might be:
- A smaller version of the same architecture (distilled model)
- A simpler model type (logistic regression instead of deep learning)
- A model with fewer features (dropping expensive-to-compute features)
```python
from typing import Tuple

# Assumes a structlog-style `logger` and a _predict_with_timeout helper
# (see the timeout discussion above) are available.


class FallbackModelChain:
    def __init__(self, primary_model, fallback_model, timeout_ms: int = 100):
        self.primary = primary_model
        self.fallback = fallback_model
        self.timeout_ms = timeout_ms

    def predict(self, input_data: dict) -> Tuple[dict, str]:
        """Returns a (prediction, source) tuple."""
        # Try the primary model, bounded by a timeout
        try:
            result = self._predict_with_timeout(
                self.primary, input_data, self.timeout_ms
            )
            return result, "primary"
        except TimeoutError:
            logger.warning("primary_model_timeout",
                           timeout_ms=self.timeout_ms)
        except Exception as e:
            logger.error("primary_model_error",
                         error_type=type(e).__name__)

        # Fall back to the simplified model
        try:
            result = self.fallback.predict(input_data)
            return result, "fallback"
        except Exception as e:
            logger.error("fallback_model_error",
                         error_type=type(e).__name__)
            raise  # Let the outer handler deal with it
```

The fallback model should be fast, reliable, and always available. Load it eagerly at startup. Keep it in memory. Don't let it share resources with the primary model.
Level 3: Rule-Based Fallback
When models fail, fall back to rules. Rules are fast, predictable, and explainable. They won’t be as accurate as your model, but they provide a reasonable answer.
```python
class RuleBasedFallback:
    """Simple rules for fraud detection when models are unavailable."""

    def predict(self, transaction: dict) -> dict:
        score = 0.0
        reasons = []

        # High amount
        if transaction.get("amount", 0) > 5000:
            score += 0.3
            reasons.append("high_amount")

        # New account
        account_age_days = transaction.get("account_age_days", 0)
        if account_age_days < 7:
            score += 0.25
            reasons.append("new_account")

        # Unusual merchant category
        high_risk_categories = {"gambling", "crypto", "wire_transfer"}
        if transaction.get("merchant_category") in high_risk_categories:
            score += 0.2
            reasons.append("high_risk_category")

        # International transaction from a domestic account
        if (transaction.get("is_international") and
                not transaction.get("has_international_history")):
            score += 0.15
            reasons.append("unusual_international")

        return {
            "score": min(score, 1.0),
            "confidence": 0.4,  # Low confidence for rule-based output
            "source": "rules",
            "reasons": reasons
        }
```

Rules should be conservative. When you don't know, err on the side that causes less damage. For fraud detection, that might mean flagging more for review. For recommendations, that might mean showing popular items instead of personalised ones.
Level 4: Safe Default
When everything else fails, return a safe default. What “safe” means depends on your domain:
- Fraud detection: Allow the transaction but flag for manual review
- Content recommendation: Show trending/popular content
- Search ranking: Return results in chronological order
- Pricing: Use list price without dynamic adjustments
- Ad targeting: Show untargeted ads
```python
class SafeDefault:
    """Last resort when all prediction methods fail."""

    def __init__(self, domain: str):
        self.defaults = {
            "fraud": {
                "score": 0.0,
                "action": "allow",
                "flag_for_review": True,
                "confidence": 0.0,
                "source": "safe_default"
            },
            "recommendation": {
                "items": [],  # Will be filled with trending content
                "strategy": "popularity",
                "confidence": 0.0,
                "source": "safe_default"
            },
            "pricing": {
                "use_list_price": True,
                "discount": 0.0,
                "confidence": 0.0,
                "source": "safe_default"
            }
        }
        self.domain = domain

    def get(self) -> dict:
        return self.defaults.get(self.domain, {"source": "safe_default"})
```

The safe default should never fail. It should have no dependencies, no computation, no I/O. It's a constant that gets returned when everything else is broken.
Circuit Breakers: Failing Fast
When a service is failing, continuing to call it makes things worse. Timeouts pile up, resources get exhausted, and the failure cascades.
Circuit breakers prevent this by tracking failure rates and “opening” when failures exceed a threshold. An open circuit fails immediately without calling the underlying service.
```python
import time
from enum import Enum
from threading import Lock


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_requests: int = 3
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0
        self.lock = Lock()

    def can_execute(self) -> bool:
        with self.lock:
            if self.state == CircuitState.CLOSED:
                return True
            if self.state == CircuitState.OPEN:
                # Check if the recovery timeout has passed
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                    return True
                return False
            if self.state == CircuitState.HALF_OPEN:
                # Allow a limited number of requests to test recovery
                return self.success_count < self.half_open_requests
            return False

    def record_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.half_open_requests:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0

    def record_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
```

Use separate circuit breakers for each dependency. The feature store might be healthy while the model service is failing. Breaking the circuit on one shouldn't affect the other.
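Keeping one breaker per dependency is easiest with a small registry that creates breakers on demand. A sketch — the `BreakerRegistry` name and factory-based design are illustrative; it takes any zero-argument factory, such as `CircuitBreaker` itself:

```python
from threading import Lock

class BreakerRegistry:
    """One circuit breaker per named dependency, created on demand."""

    def __init__(self, breaker_factory):
        self._factory = breaker_factory  # e.g. CircuitBreaker
        self._breakers = {}
        self._lock = Lock()

    def for_dependency(self, name: str):
        with self._lock:
            if name not in self._breakers:
                self._breakers[name] = self._factory()
            return self._breakers[name]
```

Callers then ask for the breaker by dependency name — `registry.for_dependency("feature_store").can_execute()` — so the feature store and the model service trip independently.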
Putting It Together
A production inference endpoint that degrades gracefully:
```python
class ResilientPredictor:
    def __init__(self, config):
        # load_model and _predict_with_timeout are service-level helpers,
        # defined elsewhere.
        self.cache = PredictionCache(config.redis_client)
        self.primary_model = load_model(config.primary_model_path)
        self.fallback_model = load_model(config.fallback_model_path)
        self.rules = RuleBasedFallback()
        self.safe_default = SafeDefault(config.domain)
        self.circuit_breaker = CircuitBreaker()
        self.timeout_ms = config.timeout_ms

    def predict(self, input_data: dict) -> dict:
        # Level 0: Check the cache first
        cached = self.cache.get(self.primary_model.version, input_data)
        if cached:
            return {**cached, "source": "cache"}

        # Level 1: Try the primary model (if the circuit allows)
        if self.circuit_breaker.can_execute():
            try:
                result = self._predict_with_timeout(
                    self.primary_model, input_data, self.timeout_ms
                )
                self.circuit_breaker.record_success()
                self.cache.set(self.primary_model.version, input_data, result)
                return {**result, "source": "primary"}
            except Exception as e:
                self.circuit_breaker.record_failure()
                logger.warning("primary_failed", error=str(e))

        # Level 2: Try the fallback model
        try:
            result = self.fallback_model.predict(input_data)
            return {**result, "source": "fallback"}
        except Exception as e:
            logger.warning("fallback_failed", error=str(e))

        # Level 3: Try rules
        try:
            result = self.rules.predict(input_data)
            return result
        except Exception as e:
            logger.error("rules_failed", error=str(e))

        # Level 4: Safe default (should never fail)
        return self.safe_default.get()
```

Every request gets an answer. The answer might come from the primary model, a cache, a fallback, rules, or a safe default — but it's always an answer.
Observability for Degradation
Track which fallback level served each request. This isn’t just operational awareness — it’s product signal.
```python
# Emit metrics on every prediction
metrics.increment("prediction_served", tags={
    "source": result.get("source", "unknown"),
    "model_version": self.primary_model.version
})

# Alert when fallback usage exceeds a threshold, e.g.
# "more than 10% of requests served by fallback for 5 minutes"
```

A sudden spike in fallback usage indicates a problem even if users aren't complaining. They're getting answers, but not the best answers. That's signal worth investigating.
For comprehensive monitoring approaches, see The Observability Blind Spot and the certifiable-monitor project for deterministic drift detection.
Design Principles
Fail fast, recover slow. Open the circuit quickly when things go wrong. Close it slowly, verifying the service has actually recovered.
Independent fallbacks. Each fallback level should have independent failure modes. If the fallback model shares a GPU with the primary model, they’ll fail together.
Test the fallbacks. Fallback code paths are rarely exercised in normal operation. Use chaos engineering to verify they work: inject failures, kill dependencies, exhaust resources.
Make degradation visible. Users might not notice degraded responses, but your team should. Dashboard the fallback rates. Alert on anomalies.
Design for the worst case. Assume every component will fail simultaneously. What answer do you give then? That’s your safe default, and it should be acceptable.
The Alternative Is Worse
Without graceful degradation, ML systems have two modes: working and broken. Users experience either perfect service or error pages. There’s no middle ground.
With graceful degradation, systems have a spectrum: optimal, degraded-but-functional, minimal-but-safe. Users always get something useful. The system absorbs failures instead of propagating them.
Production systems fail. The question is whether they fail gracefully or fail loudly. Graceful degradation turns potential outages into service quality variations — a much better trade-off for everyone involved.
For more on production ML reliability, see Debugging Model Behavior in Production and Production AI Systems: What 30 Years of UNIX Taught Me.