
Cost Engineering for ML Infrastructure: What Actually Matters

Where the money goes and what to optimise first

Published: January 24, 2026 · Reading time: 6 min

[Figure: chart showing ML infrastructure costs dominated by storage and idle compute rather than GPUs]

Note: Cost figures in this article are indicative estimates based on typical cloud pricing patterns. Actual costs vary significantly based on provider, region, commitment level, and workload characteristics. Use these as directional guidance, not precise projections.

The conversation usually starts with GPUs. “We need more GPUs.” “GPU costs are killing us.” “If we had better GPUs, we could train faster.”

GPUs are expensive. They’re also visible — big line items that draw attention. But in most ML infrastructure deployments, GPUs aren’t where the money actually goes. Or rather, they’re not where the waste goes.

The biggest cost savings in ML infrastructure rarely come from GPU optimisation. They come from the boring stuff: storage nobody remembers to delete, compute that sits idle, data that crosses network boundaries repeatedly, and operational complexity that requires expensive humans to manage.

Where Cost Actually Goes

A typical ML infrastructure bill breaks down differently than most teams expect:

Storage and data management: 30-40%

Training data, model checkpoints, experiment artifacts, logs, feature stores, vector databases. Data accumulates. Old experiments never get deleted. Checkpoints from failed runs persist forever. That “temporary” dataset from six months ago is still in S3.

Idle and overprovisioned compute: 20-30%

Inference clusters sized for peak load running at 15% utilisation. Development GPU instances left running overnight. Kubernetes clusters with generous resource requests and minimal actual usage. Auto-scaling that scales up eagerly and down reluctantly.

Active compute (including GPUs): 15-25%

The actual work — training runs, inference serving, data processing. The part everyone focuses on.

Network egress: 8-15%

Data moving between regions, between clouds, between services. Every byte that leaves a cloud provider’s network costs money. Cross-region replication, API responses, model downloads.

Operational overhead: 5-10%

Monitoring, logging, security scanning, compliance tooling. The infrastructure to manage the infrastructure.

The exact percentages vary by organisation, but the pattern is consistent: storage and idle resources typically exceed active compute costs. Yet optimisation efforts focus disproportionately on making training 10% faster rather than eliminating the 25% of compute that does nothing.

The 80/20 of Cost Optimisation

Optimise in order of impact, not visibility.

1. Kill Idle Resources

The highest-ROI cost optimisation is eliminating resources that aren’t doing anything. This sounds obvious. It’s also consistently underexecuted.

Find idle compute:

import boto3
from datetime import datetime, timedelta, timezone

def find_idle_instances(min_cpu_percent: float = 5.0, days: int = 7):
    """Find EC2 instances with sustained low CPU utilisation."""
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')
    
    idle_instances = []
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)
    
    # Paginate: a single describe_instances call truncates on large accounts
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    
    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                
                # Average hourly CPU utilisation over the lookback window
                response = cloudwatch.get_metric_statistics(
                    Namespace='AWS/EC2',
                    MetricName='CPUUtilization',
                    Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                    StartTime=start_time,
                    EndTime=end_time,
                    Period=3600,  # Hourly
                    Statistics=['Average']
                )
                
                if response['Datapoints']:
                    avg_cpu = sum(d['Average'] for d in response['Datapoints']) / len(response['Datapoints'])
                    if avg_cpu < min_cpu_percent:
                        idle_instances.append({
                            'instance_id': instance_id,
                            'instance_type': instance['InstanceType'],
                            'avg_cpu': avg_cpu,
                            'launch_time': instance['LaunchTime']
                        })
    
    return idle_instances

Run this weekly. Automate termination for development instances. Require justification for production instances below threshold.
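A simple triage policy makes that automation concrete. The sketch below is illustrative: it assumes each record carries the `avg_cpu` figure from the finder above plus a resolved `environment` tag (which the finder as written does not return), and the action names are placeholders for whatever your automation actually does.

```python
def idle_action(instance: dict, cpu_threshold: float = 5.0) -> str:
    """Triage an instance record: keep it, stop it automatically, or escalate."""
    if instance['avg_cpu'] >= cpu_threshold:
        return 'keep'
    if instance.get('environment') == 'development':
        return 'stop'  # development boxes are safe to stop automatically
    return 'flag-for-review'  # production instances need a human sign-off
```

Wiring this to actual termination is a one-line `ec2.stop_instances` call for the 'stop' case; the point is that the policy decision is explicit and auditable.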

GPU utilisation is typically abysmal:

Most GPU instances run at 15-30% utilisation. The GPU is expensive, but it spends most of its time waiting — waiting for data to load, waiting for preprocessing, waiting for the next batch.

# Check GPU utilisation on NVIDIA instances
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv -l 5

If utilisation is consistently below 50%, you’re paying for capacity you’re not using. Options: smaller instances, better batching, or GPU sharing (MIG on A100s, time-slicing on smaller GPUs).
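To act on those readings programmatically rather than eyeballing the terminal, the CSV output can be parsed and averaged. A minimal sketch; the sample format matches `nvidia-smi --format=csv` output with its header row, which repeats when polling with `-l`:

```python
def parse_gpu_util(csv_output: str):
    """Parse nvidia-smi CSV output into (gpu_pct, memory_pct) samples."""
    samples = []
    for line in csv_output.strip().splitlines():
        if not line or line.startswith('utilization'):
            continue  # skip blank lines and header rows
        gpu, mem = (field.strip().rstrip(' %') for field in line.split(','))
        samples.append((float(gpu), float(mem)))
    return samples

def mean_gpu_util(csv_output: str) -> float:
    """Average GPU utilisation across all samples."""
    samples = parse_gpu_util(csv_output)
    return sum(gpu for gpu, _ in samples) / len(samples)
```

Feed it a few hours of samples and compare the mean against the 50% line before deciding whether to downsize.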

2. Implement Storage Lifecycle Policies

Data has a half-life. Training data from last year’s model is rarely accessed. Checkpoints from experiments that didn’t pan out serve no purpose. Logs older than your retention requirement are liability, not asset.

Tiered storage:

# S3 lifecycle policy (conceptual)
lifecycle_policy = {
    "Rules": [
        {
            "ID": "training-data-lifecycle",
            "Filter": {"Prefix": "training-data/"},
            "Status": "Enabled",
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"  # Infrequent access
                },
                {
                    "Days": 90,
                    "StorageClass": "GLACIER"  # Archive
                }
            ]
        },
        {
            "ID": "experiment-cleanup",
            "Filter": {"Prefix": "experiments/"},
            "Status": "Enabled",
            "Expiration": {"Days": 180}  # Delete after 6 months
        },
        {
            "ID": "checkpoint-cleanup",
            "Filter": {"Prefix": "checkpoints/"},
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30}
        }
    ]
}

The key insight: most ML data follows a predictable access pattern. Hot for days to weeks during active development, then rarely or never accessed again. Design storage policies around this reality.

Cost difference is dramatic:

  • S3 Standard: ~$0.023/GB/month
  • S3 Infrequent Access: ~$0.0125/GB/month
  • S3 Glacier: ~$0.004/GB/month

Moving 10TB from Standard to Glacier saves roughly $190/month. Multiply by the petabytes many organisations accumulate and the savings become substantial.
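The arithmetic generalises into a small helper. This uses the indicative prices listed above, not live pricing, so treat the output as directional:

```python
# Indicative per-GB-month prices from the list above (not live pricing)
PRICE_PER_GB_MONTH = {
    'standard': 0.023,
    'infrequent_access': 0.0125,
    'glacier': 0.004,
}

def monthly_savings(gb: float, from_class: str, to_class: str) -> float:
    """Monthly saving from moving `gb` gigabytes between storage classes."""
    return gb * (PRICE_PER_GB_MONTH[from_class] - PRICE_PER_GB_MONTH[to_class])

# 10TB from Standard to Glacier:
# monthly_savings(10 * 1024, 'standard', 'glacier')  -> ~$194/month
```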

3. Reduce Network Egress

Cloud providers charge asymmetrically: ingress is free, egress is expensive. Every byte leaving their network costs money. Cross-region traffic costs more. Cross-cloud traffic costs the most.

Common egress waste:

  • Training data downloaded repeatedly instead of cached locally
  • Model artifacts transferred between regions for each deployment
  • Feature store queries crossing availability zones
  • Logs shipped to external monitoring without aggregation

Strategies:

import os

class EgressAwareDataLoader:
    """Cache remote data locally to avoid repeated egress charges."""
    
    def __init__(self, remote_path: str, local_cache: str, max_cache_gb: int = 100):
        self.remote_path = remote_path
        self.local_cache = local_cache
        self.max_cache_bytes = max_cache_gb * 1024**3
    
    def get(self, key: str) -> bytes:
        local_path = os.path.join(self.local_cache, key)
        
        # Check local cache first
        if os.path.exists(local_path):
            with open(local_path, 'rb') as f:
                return f.read()
        
        # Download from remote (pays the egress charge once)
        data = self._download_remote(key)
        
        # Cache locally (with eviction if needed)
        self._ensure_cache_space(len(data))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, 'wb') as f:
            f.write(data)
        
        return data
    
    def _download_remote(self, key: str) -> bytes:
        # Fetch from object storage (S3, GCS, ...); backend-specific
        raise NotImplementedError
    
    def _get_cache_size(self) -> int:
        """Total bytes currently in the local cache."""
        return sum(
            os.path.getsize(os.path.join(root, fname))
            for root, _, filenames in os.walk(self.local_cache)
            for fname in filenames
        )
    
    def _ensure_cache_space(self, needed_bytes: int):
        """Evict least-recently-accessed files until the new item fits."""
        current_size = self._get_cache_size()
        if current_size + needed_bytes <= self.max_cache_bytes:
            return
        
        # Get files sorted by access time, oldest first
        files = []
        for root, _, filenames in os.walk(self.local_cache):
            for fname in filenames:
                path = os.path.join(root, fname)
                files.append((os.path.getatime(path), path, os.path.getsize(path)))
        files.sort()
        
        # Evict until we have space
        freed = 0
        for atime, path, size in files:
            if current_size - freed + needed_bytes <= self.max_cache_bytes:
                break
            os.remove(path)
            freed += size

Colocate compute with data:

If your training data is in us-east-1, run training in us-east-1. This sounds obvious but is routinely violated when teams use whatever region has available GPU capacity.
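A quick pre-flight check can catch the mismatch before a training job starts burning egress. A sketch assuming S3 and boto3; one wrinkle worth knowing is that `get_bucket_location` reports us-east-1 buckets with a null `LocationConstraint`:

```python
def bucket_region(location_constraint):
    """S3 reports us-east-1 buckets with a null LocationConstraint; normalise that."""
    return location_constraint or 'us-east-1'

def check_colocation(bucket_name: str) -> bool:
    """True when the current session's region matches the data bucket's region."""
    import boto3  # deferred so bucket_region stays dependency-free
    s3 = boto3.client('s3')
    constraint = s3.get_bucket_location(Bucket=bucket_name)['LocationConstraint']
    data_region = bucket_region(constraint)
    compute_region = boto3.session.Session().region_name
    if compute_region != data_region:
        print(f"WARNING: compute in {compute_region}, data in {data_region}; "
              "every epoch of remote reads pays cross-region charges")
    return compute_region == data_region
```

Run it at job submission time and fail loudly, rather than discovering the mismatch on the monthly bill.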

4. Use Spot Instances for Training

Training workloads are interruptible. Checkpoints allow resumption. This makes them ideal for spot/preemptible instances at 60-90% discount.

import logging
import os
import signal
import sys
import time

import torch

logger = logging.getLogger(__name__)

class SpotTrainingManager:
    """Manage training on spot instances with automatic checkpointing."""
    
    def __init__(self, model, optimizer, run_id: str, checkpoint_interval: int = 300):
        self.model = model
        self.optimizer = optimizer
        self.run_id = run_id
        self.checkpoint_interval = checkpoint_interval
        self.last_checkpoint = time.time()
        self.current_epoch = 0
        self.current_step = 0
        self.best_metric = None
        
        # Register spot interruption handler
        signal.signal(signal.SIGTERM, self._handle_interruption)
    
    def _handle_interruption(self, signum, frame):
        """Save checkpoint on spot interruption (2-minute warning)."""
        logger.warning("Spot interruption received, saving checkpoint")
        self.save_checkpoint(emergency=True)
        sys.exit(0)
    
    def training_step(self, batch):
        # Normal training
        loss = self.model.train_step(batch)
        self.current_step += 1
        
        # Periodic checkpointing
        if time.time() - self.last_checkpoint > self.checkpoint_interval:
            self.save_checkpoint()
            self.last_checkpoint = time.time()
        
        return loss
    
    def save_checkpoint(self, emergency: bool = False):
        checkpoint = {
            'model_state': self.model.state_dict(),
            'optimizer_state': self.optimizer.state_dict(),
            'epoch': self.current_epoch,
            'step': self.current_step,
            'best_metric': self.best_metric
        }
        
        # torch.save can't write to s3:// URLs directly: save locally, then upload
        os.makedirs('/tmp/checkpoints', exist_ok=True)
        local_path = f"/tmp/checkpoints/run-{self.run_id}-step-{self.current_step}.pt"
        torch.save(checkpoint, local_path)
        self._upload(local_path, f"s3://checkpoints/run-{self.run_id}/step-{self.current_step}.pt")
        
        if emergency:
            logger.info(f"Emergency checkpoint saved: {local_path}")
    
    def _upload(self, local_path: str, remote_path: str):
        # Push to durable storage (e.g. boto3 upload_file); backend-specific
        ...
Spot considerations:

  • Checkpoint frequently (every 5-10 minutes for long training)
  • Use instance types with higher spot availability
  • Implement automatic restart from latest checkpoint
  • Consider spot fleets across multiple instance types

5. Right-Size Before Upgrading

The instinct when inference is slow is to get bigger instances. Often the better answer is to use current instances more efficiently.

Batching:

Single-request inference wastes GPU parallelism. Batching multiple requests together amortises overhead and increases throughput.

import asyncio

class BatchingInferenceServer:
    """Batch concurrent requests on a single event loop. The pending list is
    only touched from the loop thread, so no locking is required."""
    
    def __init__(self, model, max_batch_size: int = 32, max_wait_ms: int = 50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.pending_requests = []
        self._flush_handle = None
    
    async def predict(self, input_data: dict) -> dict:
        future = asyncio.get_running_loop().create_future()
        self.pending_requests.append((input_data, future))
        
        if len(self.pending_requests) >= self.max_batch_size:
            self._process_batch()
        elif self._flush_handle is None:
            # Flush partial batches after max_wait_ms so low traffic
            # doesn't leave requests waiting for a full batch
            self._flush_handle = asyncio.get_running_loop().call_later(
                self.max_wait_ms / 1000, self._process_batch)
        
        return await future
    
    def _process_batch(self):
        if self._flush_handle is not None:
            self._flush_handle.cancel()
            self._flush_handle = None
        if not self.pending_requests:
            return
        
        batch = self.pending_requests[:self.max_batch_size]
        self.pending_requests = self.pending_requests[self.max_batch_size:]
        
        # Batch inference
        inputs = [req[0] for req in batch]
        results = self.model.predict_batch(inputs)
        
        # Resolve futures
        for (_, future), result in zip(batch, results):
            future.set_result(result)
Model optimisation:

Before buying bigger GPUs, try:

  • Quantisation (INT8 inference is 2-4x faster than FP32)
  • Pruning (remove unnecessary weights)
  • Distillation (train a smaller model to mimic the large one)
  • ONNX Runtime or TensorRT optimisation

A quantised model on a smaller instance often beats an unoptimised model on a larger instance — at a fraction of the cost.
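One way to make that comparison concrete is cost per million requests: how long an instance must run to serve the traffic, times its hourly rate. The instance prices and throughputs in the comments are purely illustrative, not benchmarks:

```python
def cost_per_million_requests(instance_hourly_usd: float, sustained_rps: float) -> float:
    """Serving cost for one million requests at a sustained throughput."""
    hours_needed = 1_000_000 / sustained_rps / 3600
    return instance_hourly_usd * hours_needed

# Illustrative only: a quantised model on a $0.50/hr instance at 200 req/s
# vs an unoptimised model on a $2.00/hr instance at 300 req/s
# cost_per_million_requests(0.50, 200)  -> ~$0.69 per million
# cost_per_million_requests(2.00, 300)  -> ~$1.85 per million
```

In this illustration the bigger instance is faster in absolute terms yet more than twice as expensive per request, which is the trade the paragraph above describes.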

The Hidden Cost of Complexity

Infrastructure cost isn’t just cloud bills. It’s also engineering time.

A system that requires a dedicated ML platform team to operate costs more than a simpler system that application developers can manage themselves. Kubernetes clusters need care and feeding. Custom training frameworks need maintenance. Bespoke serving infrastructure needs debugging.

Before adding infrastructure complexity to solve a cost problem, calculate the total cost including engineering time. Sometimes the answer is “pay the cloud bill and ship features instead.”
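That calculation can be as simple as subtracting loaded engineering cost from the cloud saving. The $100/hour default below is an assumed loaded rate for illustration, not a benchmark; substitute your own:

```python
def net_monthly_saving(cloud_saving: float,
                       eng_hours_per_month: float,
                       loaded_hourly_rate: float = 100.0) -> float:
    """Cloud saving minus the ongoing engineering time a more complex system consumes."""
    return cloud_saving - eng_hours_per_month * loaded_hourly_rate

# A $3,000/month optimisation that takes 40 engineer-hours/month to operate
# loses money: net_monthly_saving(3000, 40) -> -1000.0
```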

Measurement Before Optimisation

You can’t optimise what you don’t measure. Before starting cost optimisation:

Tag everything:

import boto3

ec2 = boto3.client('ec2')

# Apply consistent tags to all resources
REQUIRED_TAGS = {
    'team': 'ml-platform',
    'project': 'recommendation-engine',
    'environment': 'production',
    'cost-center': 'product-12345'
}

def create_instance_with_tags(instance_config: dict) -> dict:
    instance_config['TagSpecifications'] = [{
        'ResourceType': 'instance',
        'Tags': [{'Key': k, 'Value': v} for k, v in REQUIRED_TAGS.items()]
    }]
    return ec2.run_instances(**instance_config)

Allocate costs to teams:

Cloud cost allocation by tag lets you answer “which team is spending the most?” and “which project has runaway costs?” Without this, cost optimisation is organisational guesswork.
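With AWS Cost Explorer, for instance, grouping spend by a cost-allocation tag is a single API call. A sketch assuming the `team` tag from the example above has been activated as a cost-allocation tag in the billing console:

```python
def cost_by_tag_params(tag_key: str, start: str, end: str) -> dict:
    """Build a Cost Explorer get_cost_and_usage request grouped by a tag."""
    return {
        'TimePeriod': {'Start': start, 'End': end},
        'Granularity': 'MONTHLY',
        'Metrics': ['UnblendedCost'],
        'GroupBy': [{'Type': 'TAG', 'Key': tag_key}],
    }

def monthly_cost_by_team(start: str, end: str) -> dict:
    """Map each team tag value to its monthly spend in USD."""
    import boto3  # deferred so the param builder stays dependency-free
    ce = boto3.client('ce')
    response = ce.get_cost_and_usage(**cost_by_tag_params('team', start, end))
    costs = {}
    for result in response['ResultsByTime']:
        for group in result['Groups']:
            tag_value = group['Keys'][0]  # e.g. 'team$ml-platform'
            costs[tag_value] = float(group['Metrics']['UnblendedCost']['Amount'])
    return costs
```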

Track cost per prediction:

def calculate_cost_per_prediction(
    monthly_cost: float,
    monthly_predictions: int
) -> float:
    """
    Simple but useful: what does each prediction cost?
    Track this over time to see if you're improving.
    """
    return monthly_cost / monthly_predictions if monthly_predictions > 0 else 0

# Example: $10,000/month infrastructure, 50M predictions
# Cost per prediction: $0.0002 (0.02 cents)

This single metric — cost per prediction — captures the efficiency of your entire ML infrastructure. Track it monthly. Set targets. Celebrate improvements.

The Optimisation Sequence

When cloud bills arrive and leadership asks for cuts, resist the urge to start with GPUs. Instead:

  1. Find and kill idle resources — immediate savings, zero risk
  2. Implement storage lifecycle — significant savings, low effort
  3. Reduce egress — moderate savings, requires architecture awareness
  4. Use spot for training — good savings, requires checkpoint discipline
  5. Right-size inference — moderate savings, requires load testing
  6. Optimise models — variable savings, requires ML expertise
  7. Then consider GPU changes — often unnecessary if steps 1-6 are done

Most organisations can cut ML infrastructure costs 30-50% without touching model architecture or GPU selection. The savings are in the boring stuff — storage policies, idle detection, network topology. Not exciting. Very effective.


For more on ML infrastructure architecture, see When You Don’t Need a Feature Store on avoiding premature complexity, and Graceful Degradation in ML Systems for handling cost-driven capacity constraints.

About the Author

William Murray is a Regenerative Systems Architect with 30 years of UNIX infrastructure experience, specializing in deterministic computing for safety-critical systems. Based in the Scottish Highlands, he operates SpeyTech and maintains several open-source projects including C-Sentinel and c-from-scratch.

Let's Discuss Your AI Infrastructure

Available for UK-based consulting on production ML systems and infrastructure architecture.
