The Symptoms
The model worked perfectly in development. Accuracy was 94%. Latency was 50ms. Integration tests passed. Staging looked good.
Then you deploy to production. Within an hour, customers report incorrect predictions. You check the dashboard - accuracy is 67%. Some predictions are returning null. P99 latency is 8 seconds.
You roll back. Everything returns to normal. You try deploying again the next day. Same problems. The model itself hasn’t changed. The code hasn’t changed. But something is different between staging and production.
This is the most frustrating type of ML failure: behavior that appears only in production and can’t be reproduced in development.
The Investigation Process
Debugging production model behavior follows a pattern. Work through these steps systematically.
Step 1: Capture the Failing Input
You need the exact input that caused the problem. Not a similar input. Not a representative sample. The specific input that failed.
```python
# Add input capture on prediction failures
import time

def predict_with_capture(model, input_data, request_id):
    try:
        result = model.predict(input_data)
        return result
    except Exception as e:
        # Capture the failing input so it can be replayed later
        failing_input = {
            'request_id': request_id,
            'input_data': sanitize_for_storage(input_data),
            'error': str(e),
            'timestamp': time.time(),
            'model_version': model.version
        }
        # Store for later analysis (store_failing_input, hash_input and log
        # are serving-stack helpers; one possible shape is sketched below)
        store_failing_input(failing_input)
        log.error("prediction_failed",
                  request_id=request_id,
                  error=str(e),
                  input_hash=hash_input(input_data))
        raise

def sanitize_for_storage(input_data):
    """Remove PII but keep structure and statistics."""
    if isinstance(input_data, dict):
        return {k: sanitize_value(v) for k, v in input_data.items()}
    return input_data

def sanitize_value(value):
    """Replace sensitive values while preserving type/range."""
    if isinstance(value, str):
        return f"<string length={len(value)}>"
    elif isinstance(value, (int, float)):
        return f"<numeric value ~{int(value)}>"  # Approximate
    return value
```

Why this works: You can’t debug what you can’t reproduce. Capturing the failing input lets you reproduce the failure locally.
Common mistake: Only logging aggregated statistics. “5% of predictions failed” tells you nothing about which inputs failed or why.
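The capture code above assumes a couple of helpers that most serving stacks already have in some form. If yours doesn’t, a minimal sketch works; the file-based storage and SHA-256 fingerprint here are illustrative assumptions, not a prescribed design:

```python
import hashlib
import json

def hash_input(input_data):
    """Stable fingerprint of an input, useful for grouping repeat failures."""
    canonical = json.dumps(input_data, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def store_failing_input(record, path="failing_inputs.jsonl"):
    """Append the sanitized failure record to a local JSON-lines file."""
    with open(path, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
```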
Step 2: Check Input Distribution Shift
Production data often differs from training data in subtle ways.
```python
# Compare production input distributions to training
from statistics import mean, stdev as std  # or numpy equivalents

def analyze_input_distribution(production_inputs, training_inputs):
    """Compare key statistics between production and training."""
    stats = {}
    for field in production_inputs[0].keys():
        prod_values = [inp[field] for inp in production_inputs]
        train_values = [inp[field] for inp in training_inputs]

        if isinstance(prod_values[0], (int, float)):
            stats[field] = {
                'prod_mean': mean(prod_values),
                'train_mean': mean(train_values),
                'prod_std': std(prod_values),
                'train_std': std(train_values),
                'prod_min': min(prod_values),
                'train_min': min(train_values),
                'prod_max': max(prod_values),
                'train_max': max(train_values),
            }
            # Flag significant differences
            mean_diff = abs(stats[field]['prod_mean'] - stats[field]['train_mean'])
            if mean_diff > stats[field]['train_std']:
                stats[field]['warning'] = 'mean_shift'

        elif isinstance(prod_values[0], str):
            prod_unique = set(prod_values)
            train_unique = set(train_values)
            stats[field] = {
                'prod_unique_count': len(prod_unique),
                'train_unique_count': len(train_unique),
                'new_values': prod_unique - train_unique,
                'missing_values': train_unique - prod_unique
            }
            if stats[field]['new_values']:
                stats[field]['warning'] = 'unseen_categories'

    return stats

# Run this analysis periodically
prod_inputs = load_recent_production_inputs(hours=24)
train_inputs = load_training_sample()
distribution_stats = analyze_input_distribution(prod_inputs, train_inputs)

# Alert on significant shifts
for field, stats in distribution_stats.items():
    if 'warning' in stats:
        log.warning("distribution_shift",
                    field=field,
                    warning=stats['warning'],
                    details=stats)
```

What to look for:
- Numeric features outside training range
- Categorical features with unseen values
- Skewed distributions (prod mean far from training mean)
- Missing or null values not present in training
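Comparing means against the training standard deviation catches gross shifts; for numeric features, a two-sample test is a more principled drift signal. A minimal sketch, assuming SciPy is available (an assumption, not something the code above requires):

```python
from scipy.stats import ks_2samp

def numeric_drift(prod_values, train_values, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
    production and training samples come from different distributions."""
    result = ks_2samp(prod_values, train_values)
    return {
        'ks_statistic': result.statistic,
        'p_value': result.pvalue,
        'drifted': result.pvalue < alpha,
    }
```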
Real example: A fraud detection model failed because production transactions included a new payment_method value (“buy_now_pay_later”) that wasn’t in the training data. The model had no learned behavior for this value.
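A cheap defense is to validate categorical fields against the vocabulary the model was trained on and map anything unseen to an explicit fallback, rather than letting the encoder fail or silently misbehave. A sketch; the vocabulary values and fallback policy are illustrative assumptions:

```python
KNOWN_PAYMENT_METHODS = {'credit_card', 'debit_card', 'bank_transfer'}  # from training data

def validate_categorical(value, known_values, field_name, fallback='__unknown__'):
    """Map unseen categories to an explicit fallback and log them."""
    if value not in known_values:
        log.warning("unseen_category", field=field_name, value=value)
        return fallback
    return value

payment_method = validate_categorical(
    input_data['payment_method'], KNOWN_PAYMENT_METHODS, 'payment_method')
```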
Step 3: Reproduce in Isolation
Take the failing input and run it through the model in a controlled environment.
```python
# Reproduce failure locally
import math

def reproduce_failure(failing_input_record):
    """Attempt to reproduce a production failure locally."""
    # Load the exact model version from production
    model = load_model(version=failing_input_record['model_version'])

    # Reconstruct the input
    input_data = failing_input_record['input_data']

    # Reproduce the prediction
    try:
        result = model.predict(input_data)
        print(f"Local prediction succeeded: {result}")
        print("This suggests an environment difference, not a model issue")
        return {'reproduced': False, 'local_result': result}
    except Exception as e:
        print(f"Local prediction failed: {e}")
        print("Error reproduced - this is a model/input issue")
        # Analyze the failure
        analyze_prediction_failure(model, input_data, e)
        return {'reproduced': True, 'error': str(e)}

def analyze_prediction_failure(model, input_data, error):
    """Deep dive into why the prediction failed."""
    # Check input validity
    print("\n=== Input Validation ===")
    for key, value in input_data.items():
        print(f"{key}: type={type(value)}, value={value}")
        # Check for NaN/Inf
        if isinstance(value, float):
            if math.isnan(value):
                print(f"  WARNING: {key} is NaN")
            if math.isinf(value):
                print(f"  WARNING: {key} is Inf")

    # Check feature preprocessing
    print("\n=== Feature Preprocessing ===")
    try:
        features = model.preprocess(input_data)
        print(f"Preprocessing succeeded: {features}")
    except Exception as e:
        print(f"Preprocessing failed: {e}")
        print("Issue is in feature engineering, not model inference")
        return

    # Check model inference
    print("\n=== Model Inference ===")
    try:
        output = model.forward(features)
        print(f"Model inference succeeded: {output}")
    except Exception as e:
        print(f"Model inference failed: {e}")
        print("Issue is in model execution")
```

If it reproduces locally: Problem is in the model or input. Proceed to Step 4.
If it doesn’t reproduce locally: Problem is environmental (dependencies, resources, data sources). Proceed to Step 5.
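Wired back to Step 1, reproduction is a single call. load_failing_input here is a hypothetical counterpart to store_failing_input, and the request id is made up for illustration:

```python
# Pull the record captured in Step 1 and replay it locally
record = load_failing_input(request_id="req-4821")  # hypothetical helper and id
outcome = reproduce_failure(record)
print(outcome)
```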
Step 4: Isolate the Layer
A serving pipeline has layers: input validation, feature extraction, preprocessing, inference, and postprocessing. Find which layer fails.
```python
# Test each layer independently
def isolate_failing_layer(model, input_data):
    """Determine which model layer causes failure."""
    layers = []

    # Layer 1: Input validation
    try:
        validated = model.validate_input(input_data)
        layers.append(('validation', 'pass', validated))
    except Exception as e:
        layers.append(('validation', 'fail', str(e)))
        return layers  # Can't proceed

    # Layer 2: Feature extraction
    try:
        features = model.extract_features(validated)
        layers.append(('feature_extraction', 'pass', features))
    except Exception as e:
        layers.append(('feature_extraction', 'fail', str(e)))
        return layers

    # Layer 3: Preprocessing (scaling, encoding)
    try:
        preprocessed = model.preprocess(features)
        layers.append(('preprocessing', 'pass', preprocessed))
    except Exception as e:
        layers.append(('preprocessing', 'fail', str(e)))
        return layers

    # Layer 4: Model inference
    try:
        raw_output = model.forward(preprocessed)
        layers.append(('inference', 'pass', raw_output))
    except Exception as e:
        layers.append(('inference', 'fail', str(e)))
        return layers

    # Layer 5: Postprocessing
    try:
        final_output = model.postprocess(raw_output)
        layers.append(('postprocessing', 'pass', final_output))
    except Exception as e:
        layers.append(('postprocessing', 'fail', str(e)))

    return layers

# Run isolation analysis
layers = isolate_failing_layer(model, failing_input)
for layer_name, status, result in layers:
    print(f"{layer_name}: {status}")
    if status == 'fail':
        print(f"  Error: {result}")
        print(f"  Issue is in {layer_name} layer")
        break
```

Common failure points:
Preprocessing: Scaling/normalization with unexpected input ranges

```python
# Fails if input is outside the training range
scaled = (value - mean) / std  # std might be 0 for constant features
```

Feature extraction: Missing or malformed fields

```python
# Fails if the field doesn't exist
age = input_data['user_age']  # KeyError if missing
```

Inference: NaN or Inf propagation through the network

```python
# NaN inputs create NaN outputs
output = model(features)  # Silently propagates NaN
```

Step 5: Check Environmental Differences
If the failure doesn’t reproduce locally, the environment differs between staging and production.
```python
# Compare staging vs production environments
import os
import platform
import sys

import numpy as np
import pandas as pd
import psutil
import tensorflow as tf

def compare_environments():
    """Capture environment details for comparison."""
    env_info = {
        # Python environment
        'python_version': sys.version,
        'platform': platform.platform(),
        # Package versions
        'tensorflow_version': tf.__version__,
        'numpy_version': np.__version__,
        'pandas_version': pd.__version__,
        # System resources
        'cpu_count': os.cpu_count(),
        'available_memory_gb': psutil.virtual_memory().available / 1e9,
        # Model file hash (verify the model artifact is identical)
        'model_file_hash': hash_file('model.pt'),
        # Configuration
        'model_config': model.get_config(),
        # Data sources
        'feature_db_host': os.getenv('FEATURE_DB_HOST'),
        'feature_db_version': get_db_version(),
    }
    return env_info

# Run this in each environment and collect the results
# (shown schematically here - in practice, serialize the dict in staging
# and in production, then compare the two snapshots)
staging_env = compare_environments()
production_env = compare_environments()

# Find differences
differences = []
for key in staging_env:
    if staging_env[key] != production_env[key]:
        differences.append({
            'key': key,
            'staging': staging_env[key],
            'production': production_env[key]
        })

if differences:
    print("Environment differences found:")
    for diff in differences:
        print(f"  {diff['key']}:")
        print(f"    Staging: {diff['staging']}")
        print(f"    Production: {diff['production']}")
```

Common environmental causes:
Version mismatches: TensorFlow 2.10 vs 2.12 - subtle behavior changes (a startup version check is sketched after this list)
Resource constraints: Staging has 8GB RAM, production has 4GB - OOM failures
Data sources: Staging uses cached data, production queries live database - latency/content differences
Concurrency: Staging is single-threaded, production is multi-threaded - race conditions
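Version mismatches in particular are cheap to catch before the first request is served. A minimal sketch that compares installed packages against pinned expectations; the pinned values are illustrative, and in practice you would read them from whatever lockfile you already ship:

```python
from importlib.metadata import version

# Versions the model was validated against (illustrative values)
EXPECTED_VERSIONS = {
    'tensorflow': '2.12.0',
    'numpy': '1.24.3',
    'pandas': '2.0.1',
}

def check_versions(expected=EXPECTED_VERSIONS):
    """Fail fast at startup if the serving environment has drifted from
    the environment the model was validated in."""
    mismatches = {
        pkg: {'expected': want, 'installed': version(pkg)}
        for pkg, want in expected.items()
        if version(pkg) != want
    }
    if mismatches:
        raise RuntimeError(f"Dependency drift detected: {mismatches}")
```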
Step 6: Enable Prediction Logging
For issues that are intermittent or rare, enable detailed prediction logging to capture context when failures occur.
```python
# Detailed prediction logging
import random
import socket
import time

class PredictionLogger:
    def __init__(self, sample_rate=0.1):
        self.sample_rate = sample_rate

    def should_log(self, is_error=False):
        """Always log errors, sample successes."""
        if is_error:
            return True
        return random.random() < self.sample_rate

    def log_prediction(self, input_data, output, error=None, metadata=None):
        """Log prediction with full context."""
        if not self.should_log(is_error=error is not None):
            return
        metadata = metadata or {}
        log_entry = {
            'timestamp': time.time(),
            'input_hash': hash_input(input_data),
            'input_stats': compute_input_stats(input_data),
            'output': output if error is None else None,
            'error': str(error) if error else None,
            'metadata': metadata,
            'environment': {
                'model_version': metadata.get('model_version'),
                'host': socket.gethostname(),
                'memory_usage_mb': get_memory_usage(),
            }
        }
        # Store for analysis
        prediction_log.write(log_entry)

# Use in serving
logger = PredictionLogger(sample_rate=0.1)

def predict(input_data, request_id):
    metadata = {'request_id': request_id, 'model_version': model.version}
    try:
        output = model.predict(input_data)
        logger.log_prediction(input_data, output, metadata=metadata)
        return output
    except Exception as e:
        logger.log_prediction(input_data, None, error=e, metadata=metadata)
        raise
```

Analysis queries:
```python
# Find patterns in failures
failures = prediction_log.query(
    "SELECT * FROM predictions WHERE error IS NOT NULL"
)

# Group by error type, most common first
error_counts = failures.groupby('error').size().sort_values(ascending=False)
print("Most common errors:")
print(error_counts)

# Find input patterns that fail, starting with the most common errors
for error_type in error_counts.index[:5]:
    error_inputs = failures[failures['error'] == error_type]
    print(f"\n=== {error_type} ===")
    print("Input characteristics:")
    print(error_inputs['input_stats'].describe())
```

Common Production-Only Failures
Scale-Dependent Failures
Problems that only appear at production traffic levels.
Memory leaks: Small per-request leak becomes catastrophic at 1000 req/sec

```python
# Leaked memory accumulates
cache = {}                  # Never cleared
cache[request_id] = result  # Grows forever
```

Resource exhaustion: Connection pools, file handles, GPU memory

```python
# Runs out after 1000 requests
db_connection = create_connection()  # Never closed
```

Mitigation: Load testing at production scale, resource monitoring, connection pooling
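Two of those mitigations fit in a few lines. A sketch of a bounded cache (so per-request entries cannot grow forever) and a connection released deterministically; get_connection stands in for whatever pooled driver you actually use:

```python
from collections import OrderedDict

class BoundedCache:
    """LRU cache with a hard size limit."""
    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # Mark as recently used
            return self._data[key]
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # Evict the oldest entry

# Connections released deterministically instead of leaking handles
with get_connection() as db_connection:  # get_connection: your pooled factory
    rows = db_connection.execute(query)
```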
Data-Dependent Failures
Problems triggered by specific data patterns that appear rarely.
Adversarial inputs: Unusual combinations not in training data
```python
# Model never saw age=150
prediction = model.predict({'age': 150, 'income': 50000})
```

Edge cases: Extreme values, empty lists, null handling

```python
# Division by zero on empty list
avg_purchase = sum(purchases) / len(purchases)  # len=0 in prod
```

Mitigation: Input validation, defensive coding, edge case testing
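The first two mitigations are a handful of lines each. A sketch; the field names, ranges, and fallback value are illustrative assumptions:

```python
def validate_input(input_data):
    """Reject out-of-range values before they reach the model."""
    age = input_data.get('age')
    if age is None or not (0 <= age <= 120):
        raise ValueError(f"age out of expected range: {age!r}")
    return input_data

def safe_mean(values, default=0.0):
    """Guard the empty-list case instead of dividing by zero."""
    return sum(values) / len(values) if values else default
```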
Timing-Dependent Failures
Problems that depend on request timing or state.
Race conditions: Multiple requests modifying shared state
```python
# Not thread-safe
if cache.get(key) is None:
    cache[key] = expensive_computation()  # Race condition
return cache[key]
```

Stale caches: Features cached too long, out of sync with model

```python
# Feature cached 1 hour ago, model updated 30 minutes ago
features = feature_cache.get(user_id)  # Stale
```

Mitigation: Thread-safe code, cache invalidation, versioning
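A sketch that addresses both: a lock so concurrent requests cannot race on the same key, and a cache key that includes the model version so a rollout naturally invalidates old entries. Holding the lock during computation is the simple, conservative choice; treat this as illustrative, not a drop-in:

```python
import threading

class VersionedFeatureCache:
    """Thread-safe feature cache keyed by (entity, model_version)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get_or_compute(self, user_id, model_version, compute_fn):
        key = (user_id, model_version)
        with self._lock:
            if key not in self._data:
                self._data[key] = compute_fn(user_id)
            return self._data[key]
```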
The Debugging Toolkit
Essential tools for production model debugging:
1. Request replay: Capture and replay production requests locally (a sketch follows below)
2. Diff tool: Compare staging vs production environments
3. Input profiler: Analyze input distributions over time
4. Layer inspector: Step through model layers with real inputs
5. Prediction logs: Comprehensive logging with sampling (see The Observability Gap in ML Systems)
These tools, combined with the systematic process above, make most production failures debuggable.
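A request replay harness can be as small as a loop over the records captured in Step 1. This sketch reuses the hypothetical JSON-lines store and load_model helper from earlier; adapt the record shape to whatever your stack actually stores:

```python
import json

def replay_requests(path="failing_inputs.jsonl", model_version="current"):
    """Re-run captured production inputs through a locally loaded model
    and report which ones still fail."""
    model = load_model(version=model_version)
    still_failing = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            try:
                model.predict(record['input_data'])
            except Exception as e:
                still_failing.append((record['request_id'], str(e)))
    return still_failing
```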
The Unsexy Truth
Most production model failures aren’t exotic ML problems. They’re boring software engineering problems:
- Missing input validation
- Unhandled edge cases
- Version mismatches
- Resource constraints
- Race conditions
The ML model itself is usually fine. The infrastructure around it is broken.
Fix the infrastructure using standard debugging techniques. The systematic process above works because it treats model serving as a software system, not as magic.
When your model works in staging but fails in production, it’s telling you something about the difference between those environments. Listen to it. Capture the failing inputs. Reproduce the failure. Isolate the cause. Fix the infrastructure.
Then add tests to prevent regression. The same failure pattern shouldn’t surprise you twice.
Related Reading
For more on production ML infrastructure:
- The Observability Gap in ML Systems - What to log to make debugging possible
- Production AI Systems: What 30 Years of UNIX Taught Me - Infrastructure principles including failure handling
- Model Serving Architecture Patterns - Architecture decisions that affect debuggability