State Management in ML Services: Beyond Stateless Inference

Architectural patterns for ML systems that need to remember

Published: January 23, 2026 · Reading time: 8 min
[Figure: Comparison of stateless inference versus a stateful ML service with session context and state persistence]

The standard advice for ML serving is simple: keep it stateless. Each request contains everything needed. Each response is independent. Scale horizontally by adding replicas. This is good advice — until it isn’t.

Recommendation systems that ignore what you just clicked. Chatbots that forget your name mid-conversation. Fraud detection that can’t see the pattern across transactions. Search that doesn’t understand “show me more like that.”

Real ML systems often need state. Session context, conversation history, user embeddings, interaction patterns, online learning updates. The question isn’t whether to have state — it’s where to put it and how to manage it.

When Stateless Isn’t Enough

Stateless inference works when the input contains all necessary context. For many use cases, it does. Image classification doesn’t need to know what images you classified yesterday. Sentiment analysis doesn’t care about previous sentences in other documents.

But several common patterns require state:

Conversational Systems

“Show me blue ones” only makes sense if the system remembers you were looking at shoes. Conversation context accumulates across turns. Without state, every message needs to include the entire conversation history — possible but inefficient, and it pushes complexity to the client.

Personalised Recommendations

Good recommendations depend on user history. What did they click? What did they buy? What did they ignore? This isn’t just “features in, prediction out” — it’s continuous state evolution based on feedback.

Session-Based Behaviour

User behaviour within a session differs from behaviour across sessions. A user browsing casually behaves differently than one actively shopping. Session state — time on site, pages visited, cart contents — improves predictions.

Online Learning

Models that update based on feedback need to maintain learned state. Contextual bandits, reinforcement learning, adaptive systems — these require state that evolves with each interaction.
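As a concrete illustration of state that evolves per interaction, here is a minimal epsilon-greedy bandit sketch. The arm names and rewards are placeholders; the point is that the per-arm counts and value estimates are learned state that must persist between requests.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit: per-arm state updated on each feedback event."""
    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # pulls per arm
        self.values = defaultdict(float)  # running mean reward per arm

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental mean: this is the state that must survive between requests
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(["layout_a", "layout_b"], epsilon=0.1)
bandit.update("layout_a", 1.0)
bandit.update("layout_a", 0.0)
bandit.update("layout_b", 1.0)
```

If this state lives only in one process's memory, the model effectively resets on every restart — which is exactly why the storage options below matter.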

Multi-Turn Tasks

Complex tasks span multiple requests. Document processing with iterative refinement. Multi-step reasoning. Workflows with branching logic. Each step depends on results from previous steps.

Where State Lives

State has to live somewhere. The options differ in latency, durability, and complexity.

Client-Side State

The client sends all necessary state with each request. The server remains stateless.

# Client sends conversation history with each request
{
    "session_id": "abc123",
    "message": "Show me blue ones",
    "history": [
        {"role": "user", "content": "I'm looking for running shoes"},
        {"role": "assistant", "content": "I can help with that. Any preferences on brand or price range?"},
        {"role": "user", "content": "Under $150, any brand"}
    ]
}

Advantages: Server stays stateless. Easy to scale. No session affinity required.

Disadvantages: Payload grows with history. Client must manage state. Can’t update state server-side (e.g., computed embeddings). Clients can tamper with state.

Client-side state works for short conversations and when you trust the client. It fails for long sessions, sensitive state, or server-computed features.
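One common mitigation for tampering, sketched here with Python's standard hmac module, is to sign server-issued state so the client can carry it but not alter it. The secret and payload fields are placeholders, not part of the original design.

```python
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # placeholder; load from a secret store in practice

def sign_state(state: dict) -> dict:
    """Attach an HMAC so client-side tampering is detectable."""
    payload = json.dumps(state, sort_keys=True)
    mac = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return {"state": payload, "mac": mac}

def verify_state(envelope: dict) -> dict:
    """Reject state whose signature does not match."""
    expected = hmac.new(SECRET, envelope["state"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["mac"]):
        raise ValueError("state has been tampered with")
    return json.loads(envelope["state"])

envelope = sign_state({"history_len": 3, "budget": 150})
```

This keeps the server stateless while closing the tampering hole, though it does nothing about payload growth or server-computed features.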

In-Memory Server State

The server maintains state in memory, keyed by session ID.

from threading import Lock
from typing import Dict, Any
import time

class InMemoryStateStore:
    def __init__(self, ttl_seconds: int = 3600):
        self.state: Dict[str, Dict[str, Any]] = {}
        self.timestamps: Dict[str, float] = {}
        self.ttl = ttl_seconds
        self.lock = Lock()
    
    def get(self, session_id: str) -> Dict[str, Any]:
        with self.lock:
            self._evict_expired()
            return self.state.get(session_id, {})
    
    def set(self, session_id: str, state: Dict[str, Any]):
        with self.lock:
            self.state[session_id] = state
            self.timestamps[session_id] = time.monotonic()
    
    def update(self, session_id: str, updates: Dict[str, Any]):
        with self.lock:
            if session_id not in self.state:
                self.state[session_id] = {}
            self.state[session_id].update(updates)
            self.timestamps[session_id] = time.monotonic()
    
    def _evict_expired(self):
        now = time.monotonic()
        expired = [
            sid for sid, ts in self.timestamps.items()
            if now - ts > self.ttl
        ]
        for sid in expired:
            del self.state[sid]
            del self.timestamps[sid]

Advantages: Low latency. Simple implementation. No external dependencies.

Disadvantages: Lost on server restart. Requires session affinity (sticky sessions). Doesn’t scale across replicas. Memory pressure under load.

In-memory state works for development and single-instance deployments. It fails at scale or when durability matters.

External State Store

State lives in Redis, DynamoDB, or another external store. All server replicas access the same state.

import json
from typing import Dict, Any, Optional
import redis

class RedisStateStore:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 3600):
        self.redis = redis_client
        self.ttl = ttl_seconds
        self.prefix = "ml:session:"
    
    def _key(self, session_id: str) -> str:
        return f"{self.prefix}{session_id}"
    
    def get(self, session_id: str) -> Dict[str, Any]:
        data = self.redis.get(self._key(session_id))
        if data:
            return json.loads(data)
        return {}
    
    def set(self, session_id: str, state: Dict[str, Any]):
        self.redis.setex(
            self._key(session_id),
            self.ttl,
            json.dumps(state)
        )
    
    def update(self, session_id: str, updates: Dict[str, Any]):
        # Get-modify-set with optimistic locking (WATCH/MULTI/EXEC)
        key = self._key(session_id)
        with self.redis.pipeline() as pipe:
            while True:
                try:
                    pipe.watch(key)
                    data = pipe.get(key)  # read on the watched connection
                    current = json.loads(data) if data else {}
                    current.update(updates)
                    pipe.multi()
                    pipe.setex(key, self.ttl, json.dumps(current))
                    pipe.execute()
                    break
                except redis.WatchError:
                    continue  # Another writer changed the key; retry
    
    def append_to_list(self, session_id: str, field: str, value: Any, max_length: int = 100):
        """Append to a list field with bounded length."""
        key = self._key(session_id)
        # Use Lua script for atomic append+trim
        script = """
        local current = redis.call('GET', KEYS[1])
        local data = current and cjson.decode(current) or {}
        data[ARGV[1]] = data[ARGV[1]] or {}
        table.insert(data[ARGV[1]], cjson.decode(ARGV[2]))
        while #data[ARGV[1]] > tonumber(ARGV[3]) do
            table.remove(data[ARGV[1]], 1)
        end
        redis.call('SETEX', KEYS[1], ARGV[4], cjson.encode(data))
        return #data[ARGV[1]]
        """
        self.redis.eval(script, 1, key, field, json.dumps(value), max_length, self.ttl)

Advantages: Survives restarts. Scales horizontally. Shared across replicas. Rich data structures (Redis lists, sets, sorted sets).

Disadvantages: Network latency on every access. External dependency. Requires serialization. Consistency considerations.

External state is the default choice for production systems. Redis is common for session state; DynamoDB or Cassandra for larger state or higher durability requirements.

Database-Backed State

For state that must survive indefinitely — user profiles, long-term preferences, model feedback — use a proper database.

from sqlalchemy import Column, String, Integer, JSON, DateTime, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from datetime import datetime
from typing import Optional

Base = declarative_base()

class UserState(Base):
    __tablename__ = 'user_state'
    
    user_id = Column(String, primary_key=True)
    preferences = Column(JSON, default=dict)
    embedding = Column(JSON, default=list)  # Store as list, convert to numpy when needed
    interaction_count = Column(Integer, default=0)
    last_updated = Column(DateTime, default=datetime.utcnow)

class DatabaseStateStore:
    def __init__(self, connection_string: str):
        engine = create_engine(connection_string)
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)
    
    def get_user_state(self, user_id: str) -> Optional[UserState]:
        session = self.Session()
        try:
            return session.query(UserState).filter_by(user_id=user_id).first()
        finally:
            session.close()
    
    def update_user_embedding(self, user_id: str, new_embedding: list):
        session = self.Session()
        try:
            state = session.query(UserState).filter_by(user_id=user_id).first()
            if state:
                state.embedding = new_embedding
                state.last_updated = datetime.utcnow()
            else:
                state = UserState(user_id=user_id, embedding=new_embedding)
                session.add(state)
            session.commit()
        finally:
            session.close()

Advantages: Durable. Queryable. Supports complex relationships. Transactional.

Disadvantages: Higher latency. More complex operations. Schema management.

Use databases for state that has long-term value: user profiles, aggregated feedback, learned preferences. Don’t use databases for ephemeral session state.

Architectural Patterns

Pattern 1: Layered State

Different types of state live in different stores, accessed in order of latency:

class LayeredStateManager:
    def __init__(self, memory_store, redis_store, db_store):
        self.memory = memory_store   # Hot session state
        self.redis = redis_store     # Warm session state
        self.db = db_store           # Cold user state
    
    def get_context(self, session_id: str, user_id: str) -> dict:
        context = {}
        
        # Layer 1: In-memory session state (conversation history)
        session_state = self.memory.get(session_id)
        if session_state:
            context['history'] = session_state.get('history', [])
        
        # Layer 2: Redis session state (computed features)
        redis_state = self.redis.get(session_id)
        if redis_state:
            context['session_embedding'] = redis_state.get('embedding')
            context['click_history'] = redis_state.get('clicks', [])
        
        # Layer 3: Database user state (long-term preferences)
        user_state = self.db.get_user_state(user_id)
        if user_state:
            context['preferences'] = user_state.preferences
            context['user_embedding'] = user_state.embedding
        
        return context

This pattern keeps hot data close (in-memory), warm data accessible (Redis), and cold data durable (database).

Pattern 2: Event Sourcing

Instead of storing current state, store the sequence of events that produced it. Reconstruct state by replaying events.

from dataclasses import dataclass
from typing import List
from datetime import datetime

@dataclass
class Event:
    session_id: str
    event_type: str
    payload: dict
    timestamp: datetime

class EventSourcedState:
    def __init__(self, event_store):
        self.events = event_store
    
    def append_event(self, session_id: str, event_type: str, payload: dict):
        event = Event(
            session_id=session_id,
            event_type=event_type,
            payload=payload,
            timestamp=datetime.utcnow()
        )
        self.events.append(event)
    
    def get_state(self, session_id: str) -> dict:
        events = self.events.get_events(session_id)
        state = {"history": [], "preferences": {}, "context": None}
        
        for event in events:
            if event.event_type == "message":
                state["history"].append(event.payload)
            elif event.event_type == "preference_update":
                state["preferences"].update(event.payload)
            elif event.event_type == "context_set":
                state["context"] = event.payload.get("context")
        
        return state

Advantages: Complete audit trail. Can reconstruct state at any point. Natural fit for ML feedback loops.

Disadvantages: State reconstruction cost. Storage growth. Complexity.

Event sourcing works well for systems where the history itself has value — audit requirements, debugging, replay for analysis.
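The usual answer to reconstruction cost is snapshotting: persist the materialised state every N events and replay only the tail. The sketch below is an illustrative in-memory version (the snapshot interval and event types follow the example above but are assumptions, not a prescribed design).

```python
import copy

class SnapshottingEventStore:
    """Event store that snapshots state every N events so replay only covers the tail."""
    def __init__(self, snapshot_every: int = 50):
        self.events = []  # (event_type, payload), in arrival order
        self.snapshot = ({"history": [], "preferences": {}}, 0)  # (state, next event index)
        self.snapshot_every = snapshot_every

    def append(self, event_type: str, payload: dict):
        self.events.append((event_type, payload))
        if len(self.events) % self.snapshot_every == 0:
            # Persist current state; later reads replay from this point
            self.snapshot = (self.get_state(), len(self.events))

    def get_state(self) -> dict:
        state, start = copy.deepcopy(self.snapshot[0]), self.snapshot[1]
        for event_type, payload in self.events[start:]:
            if event_type == "message":
                state["history"].append(payload)
            elif event_type == "preference_update":
                state["preferences"].update(payload)
        return state

store = SnapshottingEventStore(snapshot_every=2)
store.append("message", {"role": "user", "content": "hi"})
store.append("preference_update", {"colour": "blue"})  # snapshot taken here
store.append("message", {"role": "user", "content": "show me more"})
```

Replay cost is then bounded by the snapshot interval rather than the full event history, at the price of storing periodic snapshots alongside the log.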

Pattern 3: Sticky Sessions with Replication

Route all requests for a session to the same server. Replicate state asynchronously for fault tolerance.

# Load balancer configuration (conceptual)
# Route by session_id hash to consistent server

import time

class StickySessionManager:
    def __init__(self, local_store, replication_queue):
        self.local = local_store
        self.replication = replication_queue
    
    def update_state(self, session_id: str, state: dict):
        # Update local immediately
        self.local.set(session_id, state)
        
        # Queue async replication
        self.replication.put({
            "session_id": session_id,
            "state": state,
            "timestamp": time.time()
        })
    
    def get_state(self, session_id: str) -> dict:
        # Always read from local for lowest latency
        return self.local.get(session_id)

Advantages: Lowest latency for reads. Simple consistency model. Server can cache aggressively.

Disadvantages: Requires session-aware load balancing. Failover complexity. Uneven load distribution.

Sticky sessions work when latency is critical and sessions are relatively short-lived.
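The session-to-server routing itself can be sketched as a deterministic hash assignment. The server names below are placeholders; note that plain modulo hashing remaps most sessions when the server list changes, which is why real load balancers use consistent or rendezvous hashing.

```python
import hashlib

SERVERS = ["ml-serve-0", "ml-serve-1", "ml-serve-2"]  # placeholder replica names

def route(session_id: str, servers=SERVERS) -> str:
    """Deterministically map a session to one server so its state stays local."""
    digest = hashlib.sha256(session_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Because the mapping is pure, every request for a session lands on the same replica without the balancer storing any routing table.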

The CAP Theorem Applies

Stateful ML services face the same distributed-systems trade-offs as any other stateful service. The CAP theorem says you can guarantee at most two of:

  • Consistency: All replicas see the same state
  • Availability: Every request receives a response
  • Partition tolerance: The system continues operating despite network failures

For most ML use cases, availability matters more than strict consistency. A recommendation based on slightly stale state is better than no recommendation. A conversation that occasionally loses context is better than a conversation that fails entirely.

Design for eventual consistency. Accept that state might be briefly inconsistent across replicas. Use idempotent operations where possible.
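Idempotency is often achieved by tagging each update with a unique ID and deduplicating on apply, so retried or replayed deliveries are harmless. A minimal sketch (the counter field and event IDs are illustrative):

```python
class IdempotentStateUpdater:
    """Apply each uniquely-identified update at most once, so retries are safe."""
    def __init__(self):
        self.state = {"clicks": 0}
        self.applied = set()  # IDs of updates already applied

    def apply(self, update_id: str, delta: int) -> int:
        if update_id in self.applied:
            return self.state["clicks"]  # duplicate delivery: no double-count
        self.applied.add(update_id)
        self.state["clicks"] += delta
        return self.state["clicks"]

updater = IdempotentStateUpdater()
updater.apply("evt-1", 1)
updater.apply("evt-1", 1)  # retried delivery, ignored
updater.apply("evt-2", 1)
```

In production the applied-ID set would itself live in the state store with a TTL, but the principle is the same: at-least-once delivery plus deduplication gives effectively exactly-once updates.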

Practical Considerations

State size matters. Session state should be small — conversation history, recent interactions, computed embeddings. Large state (full user history, large models) belongs in external storage with lazy loading.

TTL everything. Sessions end. Users leave. State that lingers forever consumes resources forever. Set TTLs on all session state. Archive or delete on expiry.

Instrument state operations. Track state store latency. Alert on failures. Monitor cache hit rates. State store problems become inference problems.

Plan for state loss. Despite best efforts, state gets lost — store failures, network partitions, bugs. Design for graceful degradation when state is missing. Can you serve a reasonable (if suboptimal) response without state?
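The degradation path can be as simple as wrapping state reads so that store failures fall back to an empty context instead of failing the request. The store interface below mirrors the sketches above but is illustrative.

```python
import logging

logger = logging.getLogger("state")

def get_context_or_default(store, session_id: str, default=None) -> dict:
    """Return session state, or a safe empty context if the store is unavailable."""
    try:
        return store.get(session_id)
    except Exception:
        # Degrade: serve a generic (unpersonalised) response rather than failing
        logger.warning("state store unavailable for session %s", session_id)
        return default if default is not None else {}

class FlakyStore:
    """Stand-in for a store that is down."""
    def get(self, session_id):
        raise ConnectionError("store down")
```

With this in place, a Redis outage turns personalised responses into generic ones instead of turning every inference request into an error.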

Test stateful behaviour. Stateful systems have more edge cases. What happens when state is partially corrupted? When the store is slow? When a session spans a server restart? These scenarios need explicit testing.

Choosing Your Approach

Start with the simplest approach that meets your requirements:

  1. Stateless if you can include all context in the request
  2. Client-side state for short sessions with trusted clients
  3. Redis for most production session state needs
  4. Database for durable, long-term user state
  5. Event sourcing when history and audit matter
  6. Sticky sessions when latency is critical and sessions are short

Don’t over-engineer. Many systems that think they need complex state management actually need a Redis instance and thoughtful key design. Add complexity only when simpler approaches demonstrably fail.


For more on production ML architecture, see Graceful Degradation in ML Systems for handling state store failures, and The Observability Blind Spot for monitoring stateful services.

About the Author

William Murray is a Regenerative Systems Architect with 30 years of UNIX infrastructure experience, specializing in deterministic computing for safety-critical systems. Based in the Scottish Highlands, he operates SpeyTech and maintains several open-source projects including C-Sentinel and c-from-scratch.
