The 3AM Page
The model serving cluster is down. Again. Production traffic is failing. The error message says “Internal Server Error.” The logs say nothing useful.
You ssh into a pod. CPU looks fine. Memory looks fine. The model loaded successfully an hour ago. Predictions were working. Then they stopped. No obvious trigger. No deployment. No config change. Just… stopped.
You restart the pods. Traffic recovers. Problem “solved.” You go back to bed.
Three days later, it happens again. Same symptoms. Same non-explanation. Same restart-and-hope fix.
This is the observability gap in ML systems. Traditional monitoring tells you the system is broken. It doesn’t tell you why the model stopped making predictions.
What Traditional Observability Misses
Standard monitoring tracks system metrics: CPU, memory, disk, network. These work for stateless services. They fail for ML systems because the interesting failures happen inside the model.
System says: “Everything’s fine! CPU at 40%, memory at 60%.”
Reality: The model is returning garbage predictions because input distributions shifted and nobody noticed.
System says: “Pod restarted due to OOM kill.”
Reality: A single adversarial input caused the model to allocate unbounded memory, but you’ll never know which input because it’s not logged.
System says: “P99 latency increased from 50ms to 200ms.”
Reality: A specific class of inputs takes 10x longer to process, but you don’t know which class or why.
Traditional observability gives you symptoms. ML observability requires understanding model behavior, not just process health.
The Missing Logs
When I started working with ML systems, I asked: “Where are the prediction logs?”
The response: “We log accuracy metrics to MLflow.”
That’s not observability. That’s research tracking. In production, I need to answer different questions:
- Which prediction failed?
- What input caused this latency spike?
- Did the model see this input pattern before?
- When did prediction quality start degrading?
- Which model version served this request?
None of these questions are answerable from accuracy dashboards or system metrics. They require logging what actually happened during serving.
What to Log (And What Not To)
Log Every Prediction
Not just errors. Not just slow requests. Every single prediction.
```python
import hashlib
import json
import time

def serve_prediction(model_id, input_data, request_id):
    start = time.monotonic()
    model = get_model(model_id)  # look up the already-loaded model for this id
    # Compute input hash for deduplication/lookup
    input_hash = hashlib.sha256(
        json.dumps(input_data, sort_keys=True).encode()
    ).hexdigest()[:16]
    try:
        result = model.predict(input_data)
        latency_ms = (time.monotonic() - start) * 1000
        # Log success with context
        log.info("prediction_success",
            request_id=request_id,
            model_id=model_id,
            model_version=model.version,
            input_hash=input_hash,
            input_size=len(str(input_data)),
            output_hash=hashlib.sha256(str(result).encode()).hexdigest()[:16],
            latency_ms=latency_ms,
            timestamp=time.time()
        )
        return result
    except Exception as e:
        latency_ms = (time.monotonic() - start) * 1000
        # Log failure with the same context
        log.error("prediction_failure",
            request_id=request_id,
            model_id=model_id,
            model_version=model.version,
            input_hash=input_hash,
            error_type=type(e).__name__,
            error_msg=str(e)[:200],  # Truncate long errors
            latency_ms=latency_ms,
            timestamp=time.time()
        )
        raise
```

This gives you a complete prediction history. When something breaks, you can reconstruct exactly what happened.
Log Input Characteristics
Don’t log the raw input (privacy, storage cost). Log characteristics that help debug.
```python
import json

def log_input_stats(input_data):
    """Log statistical properties of the input, not the input itself."""
    if isinstance(input_data, dict):
        stats = {
            "num_fields": len(input_data),
            "field_names": sorted(input_data.keys()),
            "total_size_bytes": len(json.dumps(input_data))
        }
        # For numeric fields, log the values themselves
        for key, value in input_data.items():
            if isinstance(value, (int, float)):
                stats[f"{key}_value"] = value
    elif isinstance(input_data, list):
        stats = {
            "list_length": len(input_data),
            "item_types": sorted({type(x).__name__ for x in input_data}),
            "total_size_bytes": len(json.dumps(input_data))
        }
    else:
        # Fall back to type and size so unexpected inputs still get logged
        stats = {
            "input_type": type(input_data).__name__,
            "total_size_bytes": len(str(input_data))
        }
    log.info("input_characteristics", **stats)
```

When you see "P99 latency spiked at 2AM," you can correlate it with "inputs with list_length > 1000 started appearing at 2AM."
Log Model State Changes
Models aren’t static. They get loaded, unloaded, swapped, updated. Log every state transition.
```python
import time

class ModelServer:
    def load_model(self, model_id, version):
        start = time.monotonic()
        log.info("model_load_start",
            model_id=model_id,
            version=version,
            timestamp=time.time()
        )
        try:
            model = load_model_from_storage(model_id, version)
            memory_mb = get_model_memory_usage(model)
            log.info("model_load_success",
                model_id=model_id,
                version=version,
                memory_mb=memory_mb,
                load_time_ms=(time.monotonic() - start) * 1000,
                timestamp=time.time()
            )
            return model
        except Exception as e:
            log.error("model_load_failure",
                model_id=model_id,
                version=version,
                error=str(e),
                timestamp=time.time()
            )
            raise
```

When models mysteriously stop working, you can see: "Model B version 2.1 was loaded at 02:47; failures started at 02:48."
What NOT to Log
Don’t log model internals during serving. No gradients, no activations, no attention weights. These are expensive to compute and rarely useful for production debugging.
Don’t log PII. Hash inputs instead of logging raw data. If you need to debug specific inputs, store hashes and retrieve inputs separately with proper access controls.
Don’t log everything to stdout. Use structured logging (JSON) that can be parsed and indexed. Use log levels appropriately (INFO for normal operations, WARN for degraded states, ERROR for failures).
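The log calls above follow the bound-keyword style of libraries like structlog, but you can get machine-parseable JSON lines from the standard library alone. This is a minimal sketch (the `fields` attribute name and `make_logger` helper are my own conventions, not from the post):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for easy parsing and indexing."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "event": record.getMessage(),
            "timestamp": record.created,
        }
        # Merge structured fields attached via the `extra` argument
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)

def make_logger(name="serving"):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

log = make_logger()
log.info("prediction_success",
         extra={"fields": {"request_id": "abc123", "latency_ms": 12.5}})
```

Each line is now a self-describing JSON object, so any log aggregator can index the fields without custom parsing.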
The Correlation Problem
Logs are useless if you can’t correlate them. Every log entry needs a request ID that spans the entire request lifecycle.
```python
import uuid

@app.route('/predict', methods=['POST'])
def predict_endpoint():
    # Generate the request ID at the entry point
    request_id = str(uuid.uuid4())
    log.info("request_start",
        request_id=request_id,
        endpoint="/predict",
        timestamp=time.time()
    )
    try:
        # Pass request_id through the entire stack
        result = serve_prediction(
            model_id=request.json['model_id'],
            input_data=request.json['input'],
            request_id=request_id
        )
        log.info("request_success",
            request_id=request_id,
            status_code=200,
            timestamp=time.time()
        )
        return jsonify(result), 200
    except Exception as e:
        log.error("request_failure",
            request_id=request_id,
            error=str(e),
            timestamp=time.time()
        )
        return jsonify({"error": "Internal error"}), 500
```

Now when you see a failure, you can grep for the request_id and see the entire request flow: when it started, which model served it, what the input characteristics were, and where it failed.
Detection vs. Diagnosis
Traditional monitoring detects problems: “Latency increased.” ML observability enables diagnosis: “Latency increased because inputs with >500 tokens started appearing, and those take 10x longer to process.”
Detection gets you paged. Diagnosis gets you back to sleep.
Without proper logging:
- “Model is failing” → restart pods, hope it works
- “Latency increased” → scale up, hope it helps
- “Accuracy dropped” → no idea when or why
With proper logging:
- “Model is failing” → specific input pattern triggers OOM, add input validation
- “Latency increased” → P99 driven by large inputs, add size limits or separate queue
- “Accuracy dropped” → distribution shift detected at specific timestamp, trigger retraining
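The "add input validation" and "add size limits" fixes above can be as simple as a guard that runs before the model ever sees the input. A minimal sketch, with illustrative limits (the names and thresholds here are placeholders, not from the post; real values come from your own latency and OOM measurements):

```python
import json

# Illustrative limits -- tune from your own incident data
MAX_INPUT_BYTES = 64 * 1024
MAX_LIST_LENGTH = 1000

class InputRejected(ValueError):
    """Raised when an input matches a pattern class known to cause incidents."""
    pass

def validate_input(input_data):
    """Reject oversized inputs before they reach the model."""
    size = len(json.dumps(input_data))
    if size > MAX_INPUT_BYTES:
        raise InputRejected(f"input too large: {size} bytes")
    if isinstance(input_data, list) and len(input_data) > MAX_LIST_LENGTH:
        raise InputRejected(f"list too long: {len(input_data)} items")
    return input_data
```

Rejections surface as logged, attributable errors instead of mystery OOM kills.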
The Storage Cost Objection
“But logging every prediction is expensive!”
Yes. It costs money. Know what costs more? Not being able to debug production issues. Not knowing when your model started failing. Not being able to reproduce incidents.
ML systems that work at 3AM are worth more than the S3 bill for prediction logs.
Practical cost management:
- Use log sampling for high-volume endpoints (log 1 in 100 for routine requests, 100% for errors)
- Compress logs before storage
- Use log retention policies (7 days hot, 30 days warm, archive after 90)
- Store aggregated statistics rather than every single prediction
But start with logging everything. Optimize later when you know what you actually need.
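The "1 in 100 for routine requests, 100% for errors" sampling policy can be sketched like this. Hashing the request ID makes the decision deterministic, so every service in the request path samples the same requests (the function name and rate are illustrative):

```python
import hashlib

SAMPLE_RATE = 0.01  # log 1 in 100 routine requests (illustrative)

def should_log(request_id, is_error):
    """Errors are always logged; routine requests are sampled deterministically."""
    if is_error:
        return True
    # Hash the request ID so the same request samples identically on every host
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE
```

Deterministic sampling preserves the correlation property: if a request is logged anywhere in the stack, it is logged everywhere.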
Observability Enables Everything Else
Good observability isn’t just for debugging. It enables:
Model monitoring: You can’t detect drift without comparing current inputs to historical inputs.
A/B testing: You can’t measure model improvements without detailed prediction logs.
Incident response: You can’t fix what you can’t see.
Compliance: You can’t audit model decisions without prediction history.
Cost optimization: You can’t optimize what you don’t measure.
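Once input characteristics are logged, even a crude drift check becomes possible. A minimal sketch, assuming you have historical and current samples of some logged numeric feature (production systems typically use proper statistical tests like PSI or Kolmogorov-Smirnov instead):

```python
from statistics import mean, stdev

def drift_score(historical, current):
    """Shift of the current mean from the historical mean,
    measured in historical standard deviations."""
    mu, sigma = mean(historical), stdev(historical)
    if sigma == 0:
        return float("inf") if mean(current) != mu else 0.0
    return abs(mean(current) - mu) / sigma

def check_drift(historical, current, threshold=3.0):
    """Flag drift when the mean shifts more than `threshold` sigmas."""
    return drift_score(historical, current) > threshold
```

The point is not the statistic; it is that without logged input characteristics, no drift check of any sophistication is possible.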
Observability is infrastructure. Like networking or storage, you build it once and everything else benefits.
Start Tomorrow
The principles here build on Production AI Systems: What 30 Years of UNIX Taught Me - observability is just one UNIX principle applied to ML.
If your ML serving system has poor observability:
Day 1: Add structured logging to every prediction. Log input hash, output hash, latency, model version, timestamp.
Day 2: Add request IDs that span the entire request lifecycle. Now you can correlate logs across services.
Day 3: Add input characteristic logging. Log sizes, types, statistical properties - not raw data.
Day 4: Set up log aggregation (ELK, Splunk, CloudWatch Logs - doesn’t matter which). Make logs searchable.
Day 5: Create dashboards that matter: prediction volume over time, latency percentiles by input size, error rates by model version.
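The "latency percentiles by input size" dashboard reduces to a small aggregation over the prediction logs. A sketch, assuming log records shaped like the `prediction_success` entries above (the bucketing scheme and nearest-rank percentile are my choices, not from the post):

```python
import math
from collections import defaultdict

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    idx = max(0, math.ceil(len(ordered) * 0.99) - 1)
    return ordered[idx]

def latency_by_size_bucket(records, bucket_bytes=1024):
    """Group logged predictions by input-size bucket and compute P99 latency per bucket."""
    buckets = defaultdict(list)
    for rec in records:
        bucket = rec["input_size"] // bucket_bytes
        buckets[bucket].append(rec["latency_ms"])
    return {b: p99(v) for b, v in sorted(buckets.items())}
```

This is the query that turns "P99 went up" into "P99 went up for inputs over 4KB" in one step.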
This isn’t ML-specific. It’s just observability applied to ML systems. The same logging discipline that kept traditional services debuggable works for models too.
The Unsexy Truth (Again)
Production ML failures are debuggable. But only if you log what matters.
The interesting failures aren’t visible in CPU graphs or memory charts. They’re visible in prediction logs, input characteristics, and model state transitions.
Most ML teams discover this the hard way, after the third 3AM page for an issue they can’t diagnose.
Build observability first. Add ML second. Not the other way around.
Then when production breaks at 3AM, you’ll actually be able to figure out why.