The standard advice for ML serving is simple: keep it stateless. Each request contains everything needed. Each response is independent. Scale horizontally by adding replicas. This is good advice — until it isn’t.
Recommendation systems that ignore what you just clicked. Chatbots that forget your name mid-conversation. Fraud detection that can’t see the pattern across transactions. Search that doesn’t understand “show me more like that.”
Real ML systems often need state. Session context, conversation history, user embeddings, interaction patterns, online learning updates. The question isn’t whether to have state — it’s where to put it and how to manage it.
When Stateless Isn’t Enough
Stateless inference works when the input contains all necessary context. For many use cases, it does. Image classification doesn’t need to know what images you classified yesterday. Sentiment analysis doesn’t care about previous sentences in other documents.
But several common patterns require state:
Conversational Systems
“Show me blue ones” only makes sense if the system remembers you were looking at shoes. Conversation context accumulates across turns. Without state, every message needs to include the entire conversation history — possible but inefficient, and it pushes complexity to the client.
Personalised Recommendations
Good recommendations depend on user history. What did they click? What did they buy? What did they ignore? This isn’t just “features in, prediction out” — it’s continuous state evolution based on feedback.
Session-Based Behaviour
User behaviour within a session differs from behaviour across sessions. A user browsing casually behaves differently than one actively shopping. Session state — time on site, pages visited, cart contents — improves predictions.
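The kinds of session features described here can be sketched as a small extraction step. The field names (`started_at`, `pages`, `cart`) are illustrative, not from any particular framework:

```python
import time
from typing import Optional

def session_features(session: dict, now: Optional[float] = None) -> dict:
    """Turn raw session state into model-ready features (illustrative fields)."""
    now = now if now is not None else time.time()
    return {
        "time_on_site_s": now - session.get("started_at", now),
        "pages_visited": len(session.get("pages", [])),
        "cart_size": len(session.get("cart", [])),
        "has_cart": bool(session.get("cart")),
    }

features = session_features(
    {"started_at": 1000.0, "pages": ["home", "shoes", "shoes/42"], "cart": ["sku-42"]},
    now=1090.0,
)
# features == {"time_on_site_s": 90.0, "pages_visited": 3, "cart_size": 1, "has_cart": True}
```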
Online Learning
Models that update based on feedback need to maintain learned state. Contextual bandits, reinforcement learning, adaptive systems — these require state that evolves with each interaction.
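As a concrete sketch of state that evolves with each interaction, an epsilon-greedy bandit keeps per-arm counts and value estimates that must persist between requests. The arm names here are made up:

```python
import random

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}    # learned state: pulls per arm
        self.values = {a: 0.0 for a in arms}  # learned state: reward estimates

    def select(self) -> str:
        # Explore with probability epsilon, otherwise exploit the best arm
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, arm: str, reward: float):
        # Incremental mean: state changes with every piece of feedback
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

bandit = EpsilonGreedyBandit(["banner_a", "banner_b"], epsilon=0.0)
bandit.update("banner_b", 1.0)
assert bandit.select() == "banner_b"  # greedy pick reflects the learned state
```

If this state is not persisted between requests, every replica starts learning from scratch; which of the stores below holds it is exactly the question this section addresses.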
Multi-Turn Tasks
Complex tasks span multiple requests. Document processing with iterative refinement. Multi-step reasoning. Workflows with branching logic. Each step depends on results from previous steps.
Where State Lives
State has to live somewhere. The options differ in latency, durability, and complexity.
Client-Side State
The client sends all necessary state with each request. The server remains stateless.
# Client sends conversation history with each request
```json
{
  "session_id": "abc123",
  "message": "Show me blue ones",
  "history": [
    {"role": "user", "content": "I'm looking for running shoes"},
    {"role": "assistant", "content": "I can help with that. Any preferences on brand or price range?"},
    {"role": "user", "content": "Under $150, any brand"}
  ]
}
```

Advantages: Server stays stateless. Easy to scale. No session affinity required.
Disadvantages: Payload grows with history. Client must manage state. Can’t update state server-side (e.g., computed embeddings). Clients can tamper with state.
Client-side state works for short conversations and when you trust the client. It fails for long sessions, sensitive state, or server-computed features.
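One common mitigation for client tampering is to have the server sign the state blob before handing it back, then verify the signature on the next request. A minimal sketch using an HMAC; the key handling is illustrative (in practice the secret comes from configuration and is rotated):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # illustrative; load from secure config in practice

def sign_state(state: dict) -> str:
    """Serialize state and attach an HMAC so the server can detect tampering."""
    payload = json.dumps(state, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.b64encode(payload).decode() + "." + base64.b64encode(sig).decode()

def verify_state(token: str) -> dict:
    """Reject the state if the signature does not match."""
    payload_b64, sig_b64 = token.split(".")
    payload = base64.b64decode(payload_b64)
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.b64decode(sig_b64)):
        raise ValueError("client state failed verification")
    return json.loads(payload)

token = sign_state({"history": ["looking for shoes"]})
assert verify_state(token) == {"history": ["looking for shoes"]}
```

Note this prevents modification, not inspection; if the state itself is sensitive, it needs encryption, not just signing.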
In-Memory Server State
The server maintains state in memory, keyed by session ID.
```python
import time
from threading import Lock
from typing import Any, Dict

class InMemoryStateStore:
    def __init__(self, ttl_seconds: int = 3600):
        self.state: Dict[str, Dict[str, Any]] = {}
        self.timestamps: Dict[str, float] = {}
        self.ttl = ttl_seconds
        self.lock = Lock()

    def get(self, session_id: str) -> Dict[str, Any]:
        with self.lock:
            self._evict_expired()
            return self.state.get(session_id, {})

    def set(self, session_id: str, state: Dict[str, Any]):
        with self.lock:
            self.state[session_id] = state
            self.timestamps[session_id] = time.monotonic()

    def update(self, session_id: str, updates: Dict[str, Any]):
        with self.lock:
            self.state.setdefault(session_id, {}).update(updates)
            self.timestamps[session_id] = time.monotonic()

    def _evict_expired(self):
        now = time.monotonic()
        expired = [
            sid for sid, ts in self.timestamps.items()
            if now - ts > self.ttl
        ]
        for sid in expired:
            del self.state[sid]
            del self.timestamps[sid]
```

Advantages: Low latency. Simple implementation. No external dependencies.
Disadvantages: Lost on server restart. Requires session affinity (sticky sessions). Doesn’t scale across replicas. Memory pressure under load.
In-memory state works for development and single-instance deployments. It fails at scale or when durability matters.
External State Store
State lives in Redis, DynamoDB, or another external store. All server replicas access the same state.
```python
import json
from typing import Any, Dict

import redis

class RedisStateStore:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self.redis = redis_client
        self.ttl = ttl_seconds
        self.prefix = "ml:session:"

    def _key(self, session_id: str) -> str:
        return f"{self.prefix}{session_id}"

    def get(self, session_id: str) -> Dict[str, Any]:
        data = self.redis.get(self._key(session_id))
        if data:
            return json.loads(data)
        return {}

    def set(self, session_id: str, state: Dict[str, Any]):
        self.redis.setex(self._key(session_id), self.ttl, json.dumps(state))

    def update(self, session_id: str, updates: Dict[str, Any]):
        # Get-modify-set with optimistic locking (WATCH/MULTI/EXEC)
        key = self._key(session_id)
        with self.redis.pipeline() as pipe:  # context manager resets the pipeline
            while True:
                try:
                    pipe.watch(key)
                    current = self.get(session_id)
                    current.update(updates)
                    pipe.multi()
                    pipe.setex(key, self.ttl, json.dumps(current))
                    pipe.execute()
                    break
                except redis.WatchError:
                    continue  # Key changed under us; retry

    def append_to_list(self, session_id: str, field: str, value: Any, max_length: int = 100):
        """Append to a list field with bounded length."""
        key = self._key(session_id)
        # Lua script for atomic append + trim
        script = """
        local current = redis.call('GET', KEYS[1])
        local data = current and cjson.decode(current) or {}
        data[ARGV[1]] = data[ARGV[1]] or {}
        table.insert(data[ARGV[1]], cjson.decode(ARGV[2]))
        while #data[ARGV[1]] > tonumber(ARGV[3]) do
            table.remove(data[ARGV[1]], 1)
        end
        redis.call('SETEX', KEYS[1], ARGV[4], cjson.encode(data))
        return #data[ARGV[1]]
        """
        self.redis.eval(script, 1, key, field, json.dumps(value), max_length, self.ttl)
```

Advantages: Survives restarts. Scales horizontally. Shared across replicas. Rich data structures (Redis lists, sets, sorted sets).
Disadvantages: Network latency on every access. External dependency. Requires serialization. Consistency considerations.
External state is the default choice for production systems. Redis is common for session state; DynamoDB or Cassandra for larger state or higher durability requirements.
Database-Backed State
For state that must survive indefinitely — user profiles, long-term preferences, model feedback — use a proper database.
```python
from datetime import datetime
from typing import Optional

from sqlalchemy import JSON, Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class UserState(Base):
    __tablename__ = 'user_state'

    user_id = Column(String, primary_key=True)
    preferences = Column(JSON, default=dict)
    embedding = Column(JSON, default=list)  # Store as list, convert to numpy on load
    interaction_count = Column(Integer, default=0)
    last_updated = Column(DateTime, default=datetime.utcnow)

class DatabaseStateStore:
    def __init__(self, connection_string: str):
        engine = create_engine(connection_string)
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def get_user_state(self, user_id: str) -> Optional[UserState]:
        session = self.Session()
        try:
            return session.query(UserState).filter_by(user_id=user_id).first()
        finally:
            session.close()

    def update_user_embedding(self, user_id: str, new_embedding: list):
        session = self.Session()
        try:
            state = session.query(UserState).filter_by(user_id=user_id).first()
            if state:
                state.embedding = new_embedding
                state.last_updated = datetime.utcnow()
            else:
                state = UserState(user_id=user_id, embedding=new_embedding)
                session.add(state)
            session.commit()
        finally:
            session.close()
```

Advantages: Durable. Queryable. Supports complex relationships. Transactional.
Disadvantages: Higher latency. More complex operations. Schema management.
Use databases for state that has long-term value: user profiles, aggregated feedback, learned preferences. Don’t use databases for ephemeral session state.
Architectural Patterns
Pattern 1: Layered State
Different types of state live in different stores, accessed in order of latency:
```python
class LayeredStateManager:
    def __init__(self, memory_store, redis_store, db_store):
        self.memory = memory_store  # Hot session state
        self.redis = redis_store    # Warm session state
        self.db = db_store          # Cold user state

    def get_context(self, session_id: str, user_id: str) -> dict:
        context = {}

        # Layer 1: In-memory session state (conversation history)
        session_state = self.memory.get(session_id)
        if session_state:
            context['history'] = session_state.get('history', [])

        # Layer 2: Redis session state (computed features)
        redis_state = self.redis.get(session_id)
        if redis_state:
            context['session_embedding'] = redis_state.get('embedding')
            context['click_history'] = redis_state.get('clicks', [])

        # Layer 3: Database user state (long-term preferences)
        user_state = self.db.get_user_state(user_id)
        if user_state:
            context['preferences'] = user_state.preferences
            context['user_embedding'] = user_state.embedding

        return context
```

This pattern keeps hot data close (in-memory), warm data accessible (Redis), and cold data durable (database).
Pattern 2: Event Sourcing
Instead of storing current state, store the sequence of events that produced it. Reconstruct state by replaying events.
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    session_id: str
    event_type: str
    payload: dict
    timestamp: datetime

class EventSourcedState:
    def __init__(self, event_store):
        self.events = event_store

    def append_event(self, session_id: str, event_type: str, payload: dict):
        event = Event(
            session_id=session_id,
            event_type=event_type,
            payload=payload,
            timestamp=datetime.utcnow(),
        )
        self.events.append(event)

    def get_state(self, session_id: str) -> dict:
        # Fold the event log into current state
        events = self.events.get_events(session_id)
        state = {"history": [], "preferences": {}, "context": None}
        for event in events:
            if event.event_type == "message":
                state["history"].append(event.payload)
            elif event.event_type == "preference_update":
                state["preferences"].update(event.payload)
            elif event.event_type == "context_set":
                state["context"] = event.payload.get("context")
        return state
```

Advantages: Complete audit trail. Can reconstruct state at any point. Natural fit for ML feedback loops.
Disadvantages: State reconstruction cost. Storage growth. Complexity.
Event sourcing works well for systems where the history itself has value — audit requirements, debugging, replay for analysis.
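To bound the reconstruction cost, a common refinement is snapshotting: periodically persist the folded state along with the index of the last event applied, then replay only the tail. A minimal in-memory sketch; the reducer and store names are illustrative:

```python
class SnapshottingState:
    def __init__(self, reducer, snapshot_every=100):
        self.reducer = reducer          # function (state, event) -> state
        self.snapshot_every = snapshot_every
        self.snapshots = {}             # session_id -> (state, last_event_index)

    def get_state(self, session_id, events):
        # Start from the latest snapshot instead of an empty state
        state, start = self.snapshots.get(session_id, ({}, 0))
        state = dict(state)  # don't mutate the stored snapshot
        for i in range(start, len(events)):
            state = self.reducer(state, events[i])
        # Refresh the snapshot once enough new events have accumulated
        if len(events) - start >= self.snapshot_every:
            self.snapshots[session_id] = (dict(state), len(events))
        return state

def count_reducer(state, event):
    state[event] = state.get(event, 0) + 1
    return state

store = SnapshottingState(count_reducer, snapshot_every=2)
s = store.get_state("s1", ["click", "click", "view"])
assert s == {"click": 2, "view": 1}
```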
Pattern 3: Sticky Sessions with Replication
Route all requests for a session to the same server. Replicate state asynchronously for fault tolerance.
```python
import time

# Load balancer routes by session_id hash to a consistent server (conceptual)

class StickySessionManager:
    def __init__(self, local_store, replication_queue):
        self.local = local_store
        self.replication = replication_queue

    def update_state(self, session_id: str, state: dict):
        # Update local state immediately
        self.local.set(session_id, state)
        # Queue asynchronous replication for fault tolerance
        self.replication.put({
            "session_id": session_id,
            "state": state,
            "timestamp": time.time(),
        })

    def get_state(self, session_id: str) -> dict:
        # Always read from the local store for lowest latency
        return self.local.get(session_id)
```

Advantages: Lowest latency for reads. Simple consistency model. Server can cache aggressively.
Disadvantages: Requires session-aware load balancing. Failover complexity. Uneven load distribution.
Sticky sessions work when latency is critical and sessions are relatively short-lived.
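The "route by session_id hash" idea can be sketched in a few lines. Real load balancers typically use consistent hashing instead of a plain modulus, so that resizing the server pool relocates fewer sessions:

```python
import hashlib

def route(session_id: str, servers: list) -> str:
    """Deterministically map a session to one server (naive modulo hashing)."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["srv-0", "srv-1", "srv-2"]
# The same session always lands on the same server:
assert route("abc123", servers) == route("abc123", servers)
```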
The CAP Theorem Applies
Stateful ML services face the same distributed-systems trade-offs as any other stateful service. The CAP theorem says you cannot guarantee all three of the following at once:
- Consistency: All replicas see the same state
- Availability: Every request receives a response
- Partition tolerance: The system continues operating despite network failures
For most ML use cases, availability matters more than strict consistency. A recommendation based on slightly stale state is better than no recommendation. A conversation that occasionally loses context is better than a conversation that fails entirely.
Design for eventual consistency. Accept that state might be briefly inconsistent across replicas. Use idempotent operations where possible.
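One way to get idempotent operations is to tag each update with an ID and skip updates that have already been applied, so a retried request or a replayed replication message is a safe no-op. A minimal sketch with illustrative names:

```python
class IdempotentStore:
    def __init__(self):
        self.state = {}
        self.applied = set()  # IDs of updates already applied

    def apply(self, update_id: str, updates: dict):
        if update_id in self.applied:
            return  # duplicate delivery: no effect
        self.state.update(updates)
        self.applied.add(update_id)

store = IdempotentStore()
store.apply("u1", {"clicks": 1})
store.apply("u1", {"clicks": 1})  # retried delivery of the same update
assert store.state == {"clicks": 1}
```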
Practical Considerations
State size matters. Session state should be small — conversation history, recent interactions, computed embeddings. Large state (full user history, large models) belongs in external storage with lazy loading.
TTL everything. Sessions end. Users leave. State that lingers forever consumes resources forever. Set TTLs on all session state. Archive or delete on expiry.
Instrument state operations. Track state store latency. Alert on failures. Monitor cache hit rates. State store problems become inference problems.
Plan for state loss. Despite best efforts, state gets lost — store failures, network partitions, bugs. Design for graceful degradation when state is missing. Can you serve a reasonable (if suboptimal) response without state?
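A minimal sketch of that graceful degradation: wrap state reads so a store failure yields empty context instead of a failed request. The store and logger names are illustrative:

```python
import logging

logger = logging.getLogger("state")

def get_context_or_default(store, session_id: str) -> dict:
    try:
        return store.get(session_id)
    except Exception:  # broad by design: any store failure degrades, not fails
        logger.warning("state store unavailable; serving without context")
        return {}  # the model still runs, just without personalisation

class BrokenStore:
    def get(self, session_id):
        raise ConnectionError("store down")

assert get_context_or_default(BrokenStore(), "abc123") == {}
```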
Test stateful behaviour. Stateful systems have more edge cases. What happens when state is partially corrupted? When the store is slow? When a session spans a server restart? These scenarios need explicit testing.
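One way to make these scenarios explicitly testable is to inject a fake store that fails on demand. A sketch using plain asserts; in practice this would live in a pytest case:

```python
class FlakyStore:
    """Test double: fails the first N reads, then succeeds."""
    def __init__(self, fail_first_n: int):
        self.calls = 0
        self.fail_first_n = fail_first_n

    def get(self, session_id):
        self.calls += 1
        if self.calls <= self.fail_first_n:
            raise TimeoutError("store slow")
        return {"history": []}

def get_with_retry(store, session_id, attempts=3):
    for i in range(attempts):
        try:
            return store.get(session_id)
        except TimeoutError:
            if i == attempts - 1:
                raise  # exhausted retries: surface the failure

store = FlakyStore(fail_first_n=2)
assert get_with_retry(store, "abc123") == {"history": []}
assert store.calls == 3  # two failures, then success
```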
Choosing Your Approach
Start with the simplest approach that meets your requirements:
- Stateless if you can include all context in the request
- Client-side state for short sessions with trusted clients
- Redis for most production session state needs
- Database for durable, long-term user state
- Event sourcing when history and audit matter
- Sticky sessions when latency is critical and sessions are short
Don’t over-engineer. Many systems that think they need complex state management actually need a Redis instance and thoughtful key design. Add complexity only when simpler approaches demonstrably fail.
For more on production ML architecture, see Graceful Degradation in ML Systems for handling state store failures, and The Observability Blind Spot for monitoring stateful services.