AI Architecture

Model Serving Architecture Patterns

Understanding latency, throughput, and the trade-offs between them

Published: January 13, 2026 20:35
Reading time: 5 min

The Question Nobody Asks

“How should we serve our models?”

Most teams reach for the first solution that works: wrap the model in Flask, throw it behind a load balancer, call it done. This works until it doesn’t.

Then someone asks: “Why is P99 latency 5 seconds when P50 is 50ms?” Or: “Why can we only handle 10 requests per second?” Or: “Why did one slow request take down the entire service?”

These aren’t model problems. They’re architecture problems. The way you structure model serving determines what’s possible and what’s impossible.

The Three Fundamental Patterns

Every model serving architecture is a variation on three basic patterns. Each optimizes for different constraints. Each has different failure modes.

Pattern 1: Single-Process Serving

The simplest pattern: one process loads one model, serves all requests sequentially.

# Single-process server
# load_model and PredictRequest are placeholders for your framework's
# model loader and request schema
from fastapi import FastAPI

class ModelServer:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def predict(self, request):
        return self.model.predict(request.data)

# FastAPI wrapper: one process, one model, sequential requests
app = FastAPI()
server = ModelServer("model.pt")

@app.post("/predict")
def predict_endpoint(request: PredictRequest):
    return server.predict(request)

What this optimizes for:

  • Simplicity (minimal code, easy to reason about)
  • Memory efficiency (one model copy in memory)
  • Consistent behavior (no parallelism surprises)

Where this breaks:

  • Throughput: Sequential processing caps requests/second
  • Latency: One slow request blocks all others (head-of-line blocking)
  • Availability: Process crash = service down

When to use it:

  • Low traffic (< 10 requests/second)
  • Latency not critical (P99 can be seconds)
  • Model fits comfortably in memory
  • Development/testing environments

Production reality: This works for far more use cases than people admit. A single process can handle 50-100 fast predictions per second. That’s enough for many internal tools and early-stage products.

Pattern 2: Multi-Worker Serving

Multiple processes, each with its own model copy, behind a load balancer.

# With Gunicorn/Uvicorn workers:
# gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app
import os
import structlog  # structured logging; stdlib logging works too

log = structlog.get_logger()

class ModelServer:
    def __init__(self, model_path):
        # Each worker loads its own model copy
        self.model = load_model(model_path)
        self.pid = os.getpid()
        log.info("model_loaded", worker_pid=self.pid)

    def predict(self, request):
        return self.model.predict(request.data)

What this optimizes for:

  • Throughput (N workers = N× capacity)
  • Isolation (one worker crash doesn’t affect others)
  • Load balancing (distribute work across workers)

Where this breaks:

  • Memory: N workers = N× model memory
  • Cold start: every worker must load its own model copy before serving, and concurrent loads contend for disk and memory bandwidth
  • Inconsistency: Different workers may have different model versions during deployment

When to use it:

  • Medium traffic (10-1000 requests/second)
  • Model fits in memory multiple times
  • Latency matters (need parallelism)
  • Can’t afford custom infrastructure

Sizing workers: a common rule of thumb is workers = 2 × CPU_cores + 1 for CPU-bound models. For GPU models, use one worker per GPU (workers = number_of_GPUs).
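That rule of thumb is easy to encode. A minimal sketch (the function name is mine, and the heuristic is exactly the one stated above, not a universal law):

```python
import os

def suggested_workers(num_gpus: int = 0) -> int:
    """Worker-count heuristic: 2 * cores + 1 for CPU-bound models,
    one worker per GPU for GPU models."""
    if num_gpus > 0:
        return num_gpus
    cores = os.cpu_count() or 1
    return 2 * cores + 1

# An 8-core CPU box suggests 17 workers; a 4-GPU box suggests 4.
```

Treat the output as a starting point for load testing, not a final answer.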

The memory trap: A 2GB model with 8 workers needs 16GB just for models. Add framework overhead, request buffering, and OS needs - suddenly you need 24GB. Always measure actual memory usage under load.
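The arithmetic above can be sketched as a back-of-envelope estimator. The overhead numbers are illustrative assumptions, not measurements:

```python
def estimated_memory_gb(model_gb: float, workers: int,
                        overhead_per_worker_gb: float = 0.5,
                        os_reserve_gb: float = 2.0) -> float:
    """Rough memory estimate for multi-worker serving.
    Overhead defaults are guesses; always measure under real load."""
    return (model_gb + overhead_per_worker_gb) * workers + os_reserve_gb

# 2 GB model, 8 workers, 0.5 GB/worker overhead, 2 GB OS reserve -> 22.0 GB
```

The article's "suddenly you need 24GB" is the same effect with slightly larger overheads, which is exactly why you measure rather than trust defaults.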

Pattern 3: Async Batch Serving

Collect requests into batches, process batches through model, distribute results.

import asyncio

class BatchModelServer:
    def __init__(self, model_path, batch_size=32, timeout_ms=10):
        self.model = load_model(model_path)
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.pending = []

    def start(self):
        """Start the timeout worker; call once from a running event loop."""
        self._timeout_task = asyncio.create_task(self._batch_timeout_worker())

    async def predict(self, request_id, data):
        """Add request to batch, return when processed"""
        future = asyncio.get_running_loop().create_future()
        self.pending.append((request_id, data, future))

        # Trigger batch if full
        if len(self.pending) >= self.batch_size:
            asyncio.create_task(self._process_batch())

        return await future

    async def _process_batch(self):
        """Process accumulated requests as a batch"""
        if not self.pending:
            return

        # Extract batch
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]

        ids, inputs, futures = zip(*batch)

        # Batch inference
        try:
            batch_input = stack_inputs(inputs)
            batch_output = self.model.predict(batch_input)

            # Distribute results
            for req_id, output, future in zip(ids, batch_output, futures):
                future.set_result(output)

        except Exception as e:
            # Fail all requests in the batch
            for future in futures:
                future.set_exception(e)

    async def _batch_timeout_worker(self):
        """Flush partial batches after the timeout"""
        while True:
            await asyncio.sleep(self.timeout_ms / 1000)
            if self.pending:
                asyncio.create_task(self._process_batch())

What this optimizes for:

  • GPU efficiency (batching maximizes GPU utilization)
  • Throughput (batch processing is faster than N individual calls)
  • Cost (fewer GPU instances needed for same throughput)

Where this breaks:

  • Latency: Waiting for a batch to fill adds up to timeout_ms of delay per request
  • Complexity: Async code, result distribution, timeout management
  • Fairness: Large requests can starve small requests

When to use it:

  • GPU-based models (batching gives 5-10× throughput improvement)
  • High throughput requirements (> 1000 req/sec)
  • Can tolerate added latency (10-100ms batching delay acceptable)
  • Cost optimization matters (GPU time is expensive)

The latency-throughput trade-off: Larger batches = higher throughput but higher latency. Smaller batches = lower latency but lower throughput. There’s no free lunch.
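A toy cost model makes the trade-off concrete. The numbers here are illustrative assumptions (fixed per-batch overhead plus a per-item cost), not measurements from any real model:

```python
def batch_metrics(batch_size: int, per_item_ms: float, fixed_overhead_ms: float):
    """Toy cost model: each batched call pays a fixed overhead plus a
    per-item cost. Returns (throughput in req/s, ms to process one batch)."""
    batch_time_ms = fixed_overhead_ms + batch_size * per_item_ms
    throughput = batch_size / (batch_time_ms / 1000.0)
    return throughput, batch_time_ms

# With 20 ms fixed overhead and 1 ms per item:
#   batch_size=1  -> ~48 req/s,  21 ms per batch
#   batch_size=32 -> ~615 req/s, 52 ms per batch
```

Throughput rises roughly 13× while per-request latency rises 2.5×: the spectrum in one calculation.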

The Patterns Combined

Production systems often combine patterns:

Hybrid: Multi-Worker + Batching

# Multiple workers, each doing batching
# gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app

class HybridModelServer:
    def __init__(self, model_path):
        self.batch_server = BatchModelServer(
            model_path,
            batch_size=32,
            timeout_ms=10
        )
    
    async def predict(self, request):
        return await self.batch_server.predict(
            request.id,
            request.data
        )

This gives you:

  • Worker-level isolation (failures contained)
  • Batch-level GPU efficiency
  • Scalability (add workers for more capacity)

At the cost of:

  • Maximum complexity
  • Maximum memory (N workers × model size)
  • Hardest to debug
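To see the batching flow run end to end, here is a self-contained toy version with a stand-in model (doubling integers in place of real inference). Note that the flusher task must be scheduled explicitly, or partial batches below batch_size never complete:

```python
import asyncio

class TinyBatcher:
    """Minimal, runnable illustration of the batching pattern above."""

    def __init__(self, batch_size=4, timeout_ms=10):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.pending = []  # list of (input, future) pairs

    async def predict(self, x):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((x, fut))
        if len(self.pending) >= self.batch_size:
            self._flush()  # full batch: process immediately
        return await fut

    def _flush(self):
        batch, self.pending = self.pending, []
        if not batch:
            return
        xs, futs = zip(*batch)
        outs = [x * 2 for x in xs]  # one "batched" stand-in model call
        for out, fut in zip(outs, futs):
            fut.set_result(out)

    async def run_flusher(self):
        # Without this task, a partial batch waits forever
        while True:
            await asyncio.sleep(self.timeout_ms / 1000)
            self._flush()

async def main():
    b = TinyBatcher()
    flusher = asyncio.create_task(b.run_flusher())
    results = await asyncio.gather(*(b.predict(i) for i in range(6)))
    flusher.cancel()
    return results

# asyncio.run(main()) returns [0, 2, 4, 6, 8, 10]: the first four requests
# flush as a full batch, the last two flush on the 10 ms timeout.
```

The same shape scales up: swap the stand-in model for real batched inference and run one of these per worker.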

Latency vs Throughput: The Fundamental Trade-off

You cannot optimize for both simultaneously. Every architecture choice moves you along this spectrum.

Optimize for latency:

  • Single-process serving (no queueing)
  • Small or no batching (immediate processing)
  • Overprovisioned capacity (requests never wait)
  • Multiple model replicas (parallel serving)

Result: Low latency, high cost, lower throughput.

Optimize for throughput:

  • Large batches (maximize GPU utilization)
  • Request queueing (keep GPU busy)
  • Underprovisioned capacity (queue absorbs bursts)
  • Fewer replicas (higher utilization per instance)

Result: High throughput, lower cost, higher latency.

Production reality: Most systems need something in between. The right balance depends on your use case, not universal best practices.

Failure Modes by Pattern

Different patterns fail differently. Understanding failure modes helps you pick the right pattern.

Single-Process Failure Modes

Head-of-line blocking: One slow request blocks all subsequent requests.

# Request 1: 100ms (normal)
# Request 2: 10s (adversarial input)
# Request 3: 100ms (normal) - but waits 10s for Request 2

Mitigation: Request timeouts, input validation, separate queues for different request types.
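One common way to implement the request timeout is to run inference on a worker thread and bound the wait. A sketch, with `predict_fn` standing in for your model's predict call:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PredictTimeout

_executor = ThreadPoolExecutor(max_workers=4)

def predict_with_deadline(predict_fn, data, timeout_s=1.0):
    """Bound how long a caller waits on one prediction.
    Caveat: the worker thread keeps running past the deadline; the
    caller just stops waiting, so pair this with input validation
    to avoid stuck threads piling up."""
    future = _executor.submit(predict_fn, data)
    try:
        return future.result(timeout=timeout_s)
    except PredictTimeout:
        raise TimeoutError(f"prediction exceeded {timeout_s}s deadline")
```

The caveat in the comment is the reason timeouts alone aren't enough: they protect callers, not the worker.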

Process crash: One bad input crashes the process, service down until restart.

Mitigation: Process supervision (systemd, k8s), health checks, circuit breakers. See Production AI Systems: What 30 Years of UNIX Taught Me for proven failure handling patterns.

Multi-Worker Failure Modes

Memory exhaustion: Workers compete for memory, OOM killer terminates processes randomly.

Mitigation: Resource limits per worker, monitoring memory per worker, autoscaling based on memory not just CPU.

Version skew: During deployment, some workers have new model, some have old model. Requests get inconsistent results.

Mitigation: Blue-green deployment (switch all at once) or canary with traffic shaping (gradual migration).

Batch Serving Failure Modes

Batch contamination: One bad input in batch causes entire batch to fail.

# Batch of 32 requests
# Request 15 has malformed input
# All 32 requests fail

Mitigation: Input validation before batching, per-request error handling, batch splitting on failure.
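The batch-splitting mitigation is simple to sketch: if the batched call fails, retry items one at a time so a single bad input fails alone. `batched_predict` and `toy_model` are stand-ins for illustration:

```python
def predict_batch_with_fallback(batched_predict, inputs):
    """On batch failure, retry each item individually so one bad
    input produces one failed request, not a failed batch."""
    try:
        return batched_predict(inputs)
    except Exception:
        results = []
        for item in inputs:
            try:
                results.append(batched_predict([item])[0])
            except Exception as exc:
                results.append(exc)  # per-request failure, not batch-wide
        return results

# Toy model that rejects any batch containing a negative number:
def toy_model(batch):
    if any(x < 0 for x in batch):
        raise ValueError("malformed input")
    return [x * 10 for x in batch]
```

With inputs `[1, -2, 3]`, the batched call fails, but items 1 and 3 still succeed on retry; only the malformed request gets an error. The retry doubles work in the failure case, which is fine if failures are rare.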

Latency variance: Some requests wait 1ms for a batch, others wait the full 50ms timeout. The queueing component of P99 latency approaches the timeout duration, on top of actual processing time.

Mitigation: Smaller batches (trades throughput for consistency), adaptive batching (adjust size based on traffic), separate batches by priority.
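Adaptive batching can be as simple as a feedback rule on queue depth. One illustrative policy (an assumption, not the only design):

```python
def adapt_batch_size(current: int, queue_depth: int,
                     min_size: int = 1, max_size: int = 64) -> int:
    """Grow the batch when the queue is backing up; shrink it when
    traffic is light so lone requests aren't held for the full timeout."""
    if queue_depth > current * 2:
        return min(current * 2, max_size)
    if queue_depth < current // 2:
        return max(current // 2, min_size)
    return current
```

Run it between batches: a deep queue doubles the batch (favoring throughput), an empty queue halves it (favoring latency).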

Choosing the Right Pattern

Ask these questions in order:

1. What’s your traffic volume?

  • < 10 req/sec → Single-process is fine
  • 10-1000 req/sec → Multi-worker
  • > 1000 req/sec → Consider batching

2. What’s your latency requirement?

  • P99 < 100ms → No batching, multi-worker
  • P99 < 1s → Small batches (5-10ms timeout)
  • P99 < 5s → Batching acceptable

3. What hardware are you using?

  • CPU-only → Multi-worker (cheap parallelism)
  • GPU → Batching (essential for GPU efficiency)

4. What’s your memory budget?

  • Model × N workers fits in RAM → Multi-worker
  • Model × N workers exceeds RAM → Single-process or batching

5. Can you tolerate complexity?

  • No → Single-process or multi-worker
  • Yes → Batching or hybrid
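The five questions compress into a small heuristic. The thresholds are the article's rules of thumb, not hard limits, and the function name is mine:

```python
def choose_pattern(req_per_sec: float, p99_budget_ms: float, has_gpu: bool,
                   n_copies_fit_in_ram: bool, tolerate_complexity: bool) -> str:
    """Encode the decision checklist above as one function."""
    if has_gpu and tolerate_complexity and req_per_sec > 1000:
        return "async batch serving"
    if req_per_sec < 10:
        return "single-process"
    if n_copies_fit_in_ram and p99_budget_ms < 1000:
        return "multi-worker"
    if tolerate_complexity:
        return "async batch serving"
    return "single-process"
```

Useful less as code than as a check that your answers to the five questions actually point at one pattern.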

Production-Ready Implementation

Here’s a robust multi-worker serving implementation with proper error handling and observability (see The Observability Gap in ML Systems for what to log and why):

import time
import logging

# load_model is a placeholder for your framework's model loader

class ProductionModelServer:
    def __init__(self, model_path, timeout_ms=100):
        self.model = self._load_with_retry(model_path)
        self.timeout_ms = timeout_ms
        self.stats = {"success": 0, "timeout": 0, "error": 0}

    def _load_with_retry(self, model_path, retries=3):
        """Load model with retries and exponential backoff"""
        for attempt in range(retries):
            try:
                logging.info("Loading model, attempt %d", attempt + 1)
                model = load_model(model_path, timeout=30)
                logging.info("Model loaded successfully")
                return model
            except Exception as e:
                logging.error("Load failed: %s", e)
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff

    def predict(self, request_id, data):
        """Predict with timeout and comprehensive error handling"""
        start = time.monotonic()

        try:
            # Validate input; the generic handler below counts the failure
            if not self._validate_input(data):
                raise ValueError("Invalid input format")

            # Predict with timeout
            result = self._predict_with_timeout(data)

            latency_ms = (time.monotonic() - start) * 1000
            self.stats["success"] += 1

            logging.info("prediction_success request_id=%s latency_ms=%.1f",
                         request_id, latency_ms)

            return result

        except TimeoutError:
            self.stats["timeout"] += 1
            logging.error("prediction_timeout request_id=%s timeout_ms=%d",
                          request_id, self.timeout_ms)
            raise

        except Exception as e:
            self.stats["error"] += 1
            logging.error("prediction_error request_id=%s error=%s",
                          request_id, e)
            raise

    def _predict_with_timeout(self, data):
        """Run prediction with a timeout"""
        # Implementation depends on framework:
        # TensorFlow: pass timeout options to the run call
        # PyTorch: run inference in a worker thread and bound the wait
        return self.model.predict(data)

    def _validate_input(self, data):
        """Validate input before prediction"""
        # Check type, shape, ranges
        return True

    def health_check(self):
        """Health check for load balancer"""
        return {
            "status": "healthy",
            "model_loaded": self.model is not None,
            "stats": self.stats
        }

The Unsexy Truth

The right serving pattern depends on your constraints, not what’s popular.

Single-process serving handles more traffic than people think. Multi-worker serving solves most problems without exotic infrastructure. Batching is essential for GPUs but adds complexity.

Start simple. Measure. Add complexity only when you have evidence it’s needed.

Most serving performance problems aren’t architectural - they’re resource limits, memory leaks, or inefficient models. Fix those first before redesigning your architecture.

The best serving architecture is the simplest one that meets your requirements. Everything else is premature optimization.

About the Author

William Murray is a Regenerative Systems Architect with 30 years of UNIX infrastructure experience, specializing in deterministic computing for safety-critical systems. Based in the Scottish Highlands, he operates SpeyTech and maintains several open-source projects including C-Sentinel and c-from-scratch.

Let's Discuss Your AI Infrastructure

Available for UK-based consulting on production ML systems and infrastructure architecture.
