ML Model Serving System Design: From Prototype to Production

Getting a machine learning model to work in a Jupyter notebook is one thing. Serving that model reliably at scale, with low latency, high availability, and continuous updates, is an entirely different challenge. ML model serving system design sits at the intersection of machine learning engineering, distributed systems, and DevOps. It requires you to think carefully about inference latency, throughput, resource utilization, model versioning, and graceful degradation.

This guide walks through the complete landscape of ML model serving, from choosing the right serving architecture to scaling inference, managing deployment pipelines, running A/B tests, optimizing latency, and monitoring production models. Whether you are preparing for a system design interview or building a real-time ML system design for your organization, this article provides the depth and practical detail you need.

Why Model Serving Is Hard

The gap between a working prototype and a production inference system is enormous. In a notebook, you load a model, feed it data, and get predictions. In production, you must handle:

Concurrent requests: Thousands of users hitting your model at once, each expecting sub-100ms responses
Model versioning: Rolling out new model versions without downtime or degrading the user experience
Resource management: GPUs are expensive. You need to maximize utilization without starving any single request
Fault tolerance: What happens when a model server crashes mid-inference? When a GPU runs out of memory?
Data consistency: Ensuring the same preprocessing pipeline runs in training and serving (training-serving skew)
Cost optimization: Balancing performance requirements against cloud compute bills that can spiral quickly

Most ML teams discover these challenges the hard way. A model that takes 2 seconds per prediction in a notebook becomes a bottleneck when 500 requests arrive simultaneously. A model that works perfectly on the training distribution starts producing nonsensical outputs when real-world data shifts. These are the problems that ML model serving system design must solve.

Serving Architecture Patterns

The first decision in any AI inference system design is how to expose your model to consumers. There are several well-established patterns, each with distinct tradeoffs.

REST API Serving

The most straightforward approach is wrapping your model behind a REST API using frameworks like FastAPI or Flask.

# FastAPI model serving example
from fastapi import FastAPI
import torch

app = FastAPI()
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = torch.jit.load("model.pt")
    model.eval()

@app.post("/predict")
async def predict(request: PredictionRequest):
    with torch.no_grad():
        inputs = preprocess(request.data)
        output = model(inputs)
        return {"prediction": postprocess(output)}

Strengths: Simple to build and debug. Works with any ML framework. Easy to add authentication, logging, and middleware. Familiar to most developers.

Weaknesses: HTTP overhead adds latency. Not ideal for very high throughput. No built-in model management or batching.

Best for: Internal tools, low-to-medium traffic APIs, rapid prototyping, teams without dedicated ML infrastructure.

gRPC Serving

For higher performance, gRPC provides binary serialization (Protocol Buffers) and HTTP/2 multiplexing. Purpose-built serving frameworks like TensorFlow Serving and NVIDIA Triton Inference Server use gRPC natively.

// Inference service definition
service InferenceService {
  rpc Predict(PredictRequest) returns (PredictResponse);
  rpc StreamPredict(stream PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  int64 model_version = 2;
  repeated float input_tensor = 3;
  repeated int64 input_shape = 4;
}

Strengths: 2-10x lower latency than REST for large payloads. Built-in streaming support. Strong typing via protobuf. Efficient binary serialization reduces bandwidth.

Weaknesses: Harder to debug (binary protocol). Requires protobuf schema management. Less tooling for browser clients (though gRPC-Web exists).

Best for: High-throughput internal services, inter-service communication, large tensor inputs/outputs, real-time ML system design use cases.

Serverless Inference

AWS Lambda, SageMaker Serverless Inference, and Google Cloud Functions allow you to serve models without managing infrastructure.

# AWS Lambda handler for model inference
import json
import boto3
import numpy as np

# Model loaded during cold start
model = load_model_from_s3("s3://models/production/v12/model.tar.gz")

def handler(event, context):
    body = json.loads(event["body"])
    inputs = preprocess(body["features"])
    prediction = model.predict(inputs)
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction.tolist()})
    }

Strengths: Zero infrastructure management. Automatic scaling to zero (no cost when idle). Pay-per-request pricing works well for bursty traffic.

Weaknesses: Cold start latency (seconds for large models). Memory and package size limits. No GPU support on basic Lambda (SageMaker Serverless does support GPU). Not suitable for large models.

Best for: Lightweight models (under 500MB), sporadic traffic, cost-sensitive workloads, preprocessing/postprocessing steps.

Streaming Inference

For real-time use cases like large language model generation, live video analysis, or continuous sensor data processing, streaming inference delivers results incrementally.

# Server-Sent Events for streaming inference
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

@app.post("/generate")
async def generate_stream(request: GenerateRequest):
    async def token_generator():
        for token in model.generate_stream(request.prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        token_generator(),
        media_type="text/event-stream"
    )

Strengths: Lower time-to-first-token. Better user experience for generative models. Enables real-time processing pipelines.

Weaknesses: More complex client-side handling. Harder to load balance. Connection management overhead.

Best for: LLM serving, real-time video/audio processing, live recommendation updates, chat applications.

Comparison of Serving Approaches

Approach	Latency	Throughput	Scalability	Complexity	Cost Model
REST API	Medium	Medium	Manual/Auto	Low	Always-on
gRPC	Low	High	Manual/Auto	Medium	Always-on
Serverless	High (cold)	Variable	Automatic	Low	Pay-per-use
Streaming	Low (first token)	Medium	Manual/Auto	High	Always-on
Triton/TF Serving	Low	Very High	Manual/Auto	Medium	Always-on

Real-Time vs Batch vs Near-Real-Time

Not every prediction needs to happen in milliseconds. Choosing the right inference pattern can dramatically reduce costs and complexity.

Real-Time Inference

Predictions are generated on-demand for each request, typically within 10-200ms.

When to use:

User-facing features requiring immediate response (search ranking, content recommendations, fraud detection at checkout)
Interactive applications where latency directly impacts user experience
Decisions that depend on the most current data available

Tradeoffs: Requires always-on infrastructure. Must handle peak traffic. Most expensive per-prediction.

Batch Inference

Predictions are computed for large datasets on a schedule (hourly, daily, weekly).

When to use:

Generating recommendations for all users overnight
Scoring entire customer databases for marketing campaigns
Periodic report generation or risk assessments
Any scenario where predictions can be pre-computed and cached

Tradeoffs: Results may be stale. Requires storage for pre-computed predictions. Much cheaper per-prediction.

Near-Real-Time (Micro-Batch)

A middle ground where predictions are generated in small batches every few seconds or minutes.

When to use:

Streaming data pipelines (Kafka consumers processing events)
Systems that can tolerate seconds of delay but not hours
Aggregation-dependent features (e.g., "purchases in last 5 minutes")

Tradeoffs: More complex pipeline architecture. Must tune batch size and window duration.

Hybrid Architectures

Most production systems combine multiple patterns. A common approach:

+------------------------------------------------------------------+
|                     HYBRID INFERENCE ARCHITECTURE                 |
+------------------------------------------------------------------+
|                                                                    |
|   REAL-TIME PATH                                                   |
|   +---------+     +----------------+     +------------------+      |
|   | API     | --> | Feature Store  | --> | Real-Time Model  |      |
|   | Request |     | (Online)       |     | Server (GPU)     |      |
|   +---------+     +----------------+     +------------------+      |
|                                                 |                  |
|                                                 v                  |
|                                          +-------------+           |
|                                          |  Response   |           |
|                                          |  Cache      |           |
|                                          +-------------+           |
|                                                                    |
|   NEAR-REAL-TIME PATH                                              |
|   +---------+     +----------------+     +------------------+      |
|   | Kafka   | --> | Stream         | --> | Micro-Batch      |      |
|   | Events  |     | Processor      |     | Inference        |      |
|   +---------+     +----------------+     +------------------+      |
|                                                 |                  |
|                                                 v                  |
|                                          +-------------+           |
|                                          | Feature     |           |
|                                          | Store Update|           |
|                                          +-------------+           |
|                                                                    |
|   BATCH PATH                                                       |
|   +---------+     +----------------+     +------------------+      |
|   | Cron /  | --> | Data Warehouse | --> | Batch Inference  |      |
|   | Airflow |     | (Full Dataset) |     | Cluster (Spark)  |      |
|   +---------+     +----------------+     +------------------+      |
|                                                 |                  |
|                                                 v                  |
|                                          +-------------+           |
|                                          | Results DB  |           |
|                                          | (Pre-comp.) |           |
|                                          +-------------+           |
+------------------------------------------------------------------+

In this pattern, the batch path pre-computes predictions for most users. The real-time path handles cases where fresh predictions are critical (new users, recently changed data). The near-real-time path continuously updates features used by the real-time model.

Scaling Inference

Once your serving architecture is in place, the next challenge is scaling it to handle production traffic. Scalable AI inference system design requires thinking about compute, memory, and network at every layer.

Horizontal Scaling and Load Balancing

The most straightforward scaling approach is running multiple model server replicas behind a load balancer.

# Kubernetes deployment for model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 8
  selector:
    matchLabels:
      app: model-server
  template:
    spec:
      containers:
      - name: inference
        image: model-server:v2.1
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
          requests:
            cpu: "4"
            memory: "8Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"

Key considerations:

Health checks: Model loading can take 30-60 seconds. Use readiness probes to prevent traffic before the model is ready.
Autoscaling metrics: Scale on custom metrics like queue depth or GPU utilization, not just CPU. Standard CPU metrics miss GPU-bound workloads entirely.
Session affinity: Generally unnecessary for stateless inference, but critical for streaming or multi-turn interactions.

GPU Sharing and Scheduling

GPUs are expensive. Running one model per GPU is wasteful if the model does not fully utilize the hardware.

Multi-Process Service (MPS): NVIDIA MPS allows multiple processes to share a single GPU. Each process gets a portion of GPU compute. Useful when individual inference requests do not saturate the GPU.

Multi-Instance GPU (MIG): Available on A100 and H100 GPUs. Physically partitions a GPU into up to 7 isolated instances. Each instance has dedicated memory and compute. Provides hardware-level isolation, preventing noisy-neighbor effects.

Time-Slicing: The simplest approach. Multiple pods share a GPU by taking turns. Kubernetes enables this with nvidia.com/gpu fractional requests. Works for lightweight models but adds latency due to context switching.

+--------------------------------------------------+
|              NVIDIA A100 GPU (80GB)               |
+--------------------------------------------------+
|  MIG Instance 1   |  MIG Instance 2  | MIG Inst 3|
|  (40GB, 4 SMs)    |  (20GB, 2 SMs)   | (20GB)    |
|                    |                   |           |
|  Large Language    |  Embedding        | Ranking   |
|  Model             |  Model            | Model     |
|  (latency-critical)|  (high-throughput)| (batch)   |
+--------------------------------------------------+

Model Optimization

Before scaling horizontally, optimize the model itself. Smaller, faster models reduce infrastructure costs.

Quantization: Reduces model precision from FP32 to FP16, INT8, or even INT4. Often achieves 2-4x speedup with minimal accuracy loss.

# Post-training quantization with PyTorch
import torch

model_fp32 = load_model()
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},
    dtype=torch.qint8
)
# Model size: 400MB -> 100MB
# Latency: 45ms -> 18ms

Knowledge Distillation: Train a smaller "student" model to mimic a larger "teacher" model. The student can be 3-10x smaller while retaining 95-99% of the teacher's performance.

Pruning: Remove redundant weights or neurons. Structured pruning removes entire channels or layers, making the model physically smaller. Unstructured pruning zeros out individual weights, requiring sparse computation support.

ONNX Conversion: Convert models to ONNX (Open Neural Network Exchange) format for cross-framework optimization. ONNX Runtime provides optimized execution across CPUs, GPUs, and specialized hardware.

Request Batching

Processing multiple requests together exploits GPU parallelism. A single prediction might use 5% of GPU capacity, but a batch of 32 uses 80%.

Dynamic batching collects incoming requests and groups them into batches based on configurable time windows and maximum batch sizes.

  Requests arrive:  t=0ms  t=2ms  t=3ms  t=5ms  t=8ms
                      |      |      |      |      |
                      v      v      v      v      v
                  +-----------------------------------+
                  |      Dynamic Batcher              |
                  |  max_batch_size: 32                |
                  |  max_delay_ms: 10                  |
                  +-----------------------------------+
                              |
              Batch formed at t=10ms (4 requests)
                              |
                              v
                  +-----------------------------------+
                  |      GPU Inference                |
                  |  Batch of 4: 12ms total           |
                  |  vs. 4 x 8ms = 32ms sequential   |
                  +-----------------------------------+
                              |
                  Results dispatched to each client

NVIDIA Triton Inference Server provides dynamic batching out of the box. You configure max_batch_size and batching_delay_ms, and Triton handles the rest. TensorFlow Serving also supports batching via its --enable_batching flag and a batching parameters file.

Scaled Serving Infrastructure

Putting it all together, here is what a production-grade scaled inference system looks like:

+-----------------------------------------------------------------------+
|                  SCALED MODEL SERVING INFRASTRUCTURE                   |
+-----------------------------------------------------------------------+
|                                                                         |
|  +----------+     +-----------+     +-----------------------------+     |
|  | Clients  | --> | CDN /     | --> | API Gateway                 |     |
|  | (Web,    |     | Edge      |     | (Auth, Rate Limit, Route)   |     |
|  |  Mobile) |     | Cache     |     +-----------------------------+     |
|  +----------+     +-----------+                  |                      |
|                                                  v                      |
|                                    +---------------------------+        |
|                                    |     Load Balancer         |        |
|                                    | (Round-Robin / Least-Conn)|        |
|                                    +---------------------------+        |
|                                       |        |        |               |
|                                       v        v        v               |
|                                  +--------+--------+--------+           |
|                                  | Model  | Model  | Model  |           |
|                                  | Server | Server | Server |           |
|                                  | Pod 1  | Pod 2  | Pod N  |           |
|                                  |        |        |        |           |
|                                  | GPU    | GPU    | GPU    |           |
|                                  | +Batch | +Batch | +Batch |           |
|                                  +--------+--------+--------+           |
|                                       |        |        |               |
|                                       v        v        v               |
|                                  +----------------------------+         |
|                                  |     Prediction Cache       |         |
|                                  |     (Redis Cluster)        |         |
|                                  +----------------------------+         |
|                                              |                          |
|                            +-----------------+-----------------+        |
|                            |                                   |        |
|                            v                                   v        |
|                   +-----------------+              +----------------+   |
|                   | Feature Store   |              | Model Registry |   |
|                   | (Feast/Tecton)  |              | (MLflow)       |   |
|                   +-----------------+              +----------------+   |
|                                                                         |
|  +---------------------------+    +-----------------------------+       |
|  | Autoscaler               |    | Monitoring                  |       |
|  | (HPA on queue depth,     |    | (Prometheus, Grafana,       |       |
|  |  GPU util, latency)      |    |  custom ML metrics)         |       |
|  +---------------------------+    +-----------------------------+       |
+-----------------------------------------------------------------------+

Model Registry and Deployment Pipelines

A model registry is the single source of truth for all trained models. It tracks model artifacts, metadata, performance metrics, lineage, and deployment status.

Model Registry Options

MLflow Model Registry: Open-source, widely adopted. Supports model stages (Staging, Production, Archived). Integrates with most ML frameworks. Can be self-hosted or used via Databricks.

Weights and Biases (W&B): Strong experiment tracking and artifact versioning. Model Registry links artifacts to training runs. Good collaboration features for team-based ML development.

SageMaker Model Registry: Deep AWS integration. Approval workflows built in. Connects directly to SageMaker endpoints for deployment. Best choice for AWS-native ML pipelines.

CI/CD for Models

ML models need the same deployment rigor as application code. A model CI/CD pipeline typically includes:

+---------------------------------------------------------------+
|                    MODEL CI/CD PIPELINE                        |
+---------------------------------------------------------------+
|                                                                 |
|  1. TRIGGER                                                     |
|     - New model registered in registry                          |
|     - Scheduled retraining completes                            |
|     - Manual promotion request                                  |
|                                                                 |
|  2. VALIDATION GATE                                             |
|     - Run model on holdout test set                             |
|     - Compare metrics against production baseline               |
|     - Check for regression on critical slices                   |
|     - Verify input/output schema compatibility                  |
|     - Run bias and fairness checks                              |
|                                                                 |
|  3. INTEGRATION TESTING                                         |
|     - Deploy to staging environment                             |
|     - Run end-to-end inference tests                            |
|     - Verify latency meets SLA (p99 < 100ms)                   |
|     - Load test at 2x expected peak traffic                     |
|                                                                 |
|  4. DEPLOYMENT                                                  |
|     - Blue-green or canary rollout                              |
|     - Gradual traffic shift (1% -> 10% -> 50% -> 100%)         |
|     - Automated rollback on metric degradation                  |
|                                                                 |
|  5. POST-DEPLOYMENT                                             |
|     - Monitor prediction distributions                          |
|     - Compare against shadow model predictions                  |
|     - Alert on drift or anomalies                               |
+---------------------------------------------------------------+

Blue-Green and Canary Deployments

Blue-green deployment for ML models maintains two identical environments. The "blue" environment serves production traffic while "green" runs the new model version. After validation, traffic switches entirely to green. This approach provides instant rollback by switching back to blue if issues arise.

Canary deployment gradually shifts traffic to the new model version. Start with 1-5% of traffic, monitor metrics, then incrementally increase. This limits the blast radius of a bad model.

# Canary routing logic
def route_request(request, canary_percentage=5):
    # Deterministic routing based on user ID
    # ensures same user always hits same model version
    bucket = hash(request.user_id) % 100

    if bucket < canary_percentage:
        return call_model("v2-canary")
    else:
        return call_model("v1-production")

For ML models specifically, canary deployments should monitor not just standard health metrics but also prediction distribution shifts, confidence score distributions, and downstream business metrics (click-through rate, conversion rate, user engagement).

A/B Testing for ML Models

A/B testing is fundamental to measuring whether a new model actually improves outcomes. It goes beyond canary deployment by applying statistical rigor to the comparison.

Traffic Splitting Strategies

Random splitting: Each request is randomly assigned to model A or model B. Simple but can introduce noise if user behavior varies across sessions.

User-level splitting: Each user is consistently assigned to one model variant for the duration of the experiment. Reduces noise from within-user variability. Implemented via hashing the user ID.

Stratified splitting: Users are divided into strata (e.g., by geography, device type, engagement level), and splitting happens within each stratum. Ensures balanced representation across segments.

Multi-armed bandit: Dynamically shifts traffic toward the better-performing model during the experiment. Reduces the cost of running an inferior model on half your traffic. Thompson Sampling and Upper Confidence Bound (UCB) are common algorithms.

Statistical Significance in ML Experiments

ML A/B tests require careful statistical methodology:

Sample size calculation: Before the experiment, calculate the required sample size to detect a meaningful effect. For a 1% improvement in click-through rate with 80% power and 95% confidence, you may need millions of observations.
Multiple testing correction: If you are evaluating 10 metrics simultaneously, apply Bonferroni correction or control false discovery rate (FDR) to avoid false positives.
Sequential testing: Rather than waiting for a fixed sample size, sequential testing frameworks (like always-valid confidence intervals) allow you to check results continuously without inflating false positive rates.
Novelty and primacy effects: Users may initially engage more with a new model simply because it is different. Run experiments long enough (typically 2-4 weeks) to let these effects dissipate.

Shadow Mode Deployment

Before exposing users to a new model, run it in shadow mode. The production model serves all live traffic while the new model receives a copy of each request and generates predictions that are logged but not returned to users.

+----------------------------------------------------------------+
|                    SHADOW MODE DEPLOYMENT                       |
+----------------------------------------------------------------+
|                                                                  |
|   +----------+     +------------------+     +----------------+   |
|   | Incoming | --> | Production Model | --> | Response to    |   |
|   | Request  |     | (v1, live)       |     | User           |   |
|   +----------+     +------------------+     +----------------+   |
|        |                                                         |
|        | (async copy)                                            |
|        v                                                         |
|   +------------------+     +------------------+                  |
|   | Shadow Model     | --> | Log Predictions  |                  |
|   | (v2, not live)   |     | (compare later)  |                  |
|   +------------------+     +------------------+                  |
|                                                                  |
|   Analysis: Compare v1 vs v2 predictions offline                 |
|   - Prediction agreement rate                                    |
|   - Confidence distribution differences                          |
|   - Latency comparison                                           |
|   - Error case analysis                                          |
+----------------------------------------------------------------+

Shadow mode is especially valuable for high-stakes applications (fraud detection, medical diagnosis, autonomous systems) where the cost of a bad prediction is severe. It lets you measure how the new model would have performed without any risk to users.

Latency Optimization Techniques

In a real-time ML system design, every millisecond counts. Latency optimization operates at multiple levels.

Model Compilation

Converting models from eager-mode frameworks to optimized runtimes dramatically reduces inference latency.

TorchScript (PyTorch): Traces or scripts your PyTorch model into an intermediate representation that can be optimized and run without a Python runtime.

# TorchScript compilation
import torch

model = MyModel()
model.eval()

# Tracing (works for models without control flow)
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")

# Scripting (works for models with control flow)
scripted_model = torch.jit.script(model)

TensorRT (NVIDIA): Optimizes models specifically for NVIDIA GPUs. Applies layer fusion, kernel auto-tuning, precision calibration, and memory optimization. Can deliver 2-6x speedup over native PyTorch.

ONNX Runtime: Framework-agnostic optimization. Supports CPU, GPU, and specialized accelerators. Graph optimizations include constant folding, redundant node elimination, and operator fusion.

Optimization	Typical Speedup	Accuracy Impact	Complexity
TorchScript	1.2-2x	None	Low
ONNX Runtime	1.5-3x	None	Low
TensorRT FP16	2-4x	Minimal	Medium
TensorRT INT8	3-6x	Small (calibration needed)	High
Quantization + Pruning	4-8x	Moderate	High

Caching Strategies

Not every request needs a fresh inference call. Strategic caching can eliminate redundant computation.

Prediction cache: For deterministic models, cache the mapping from input features to output predictions. A Redis cluster with TTL-based expiration works well. Hash the input features to create cache keys.

import hashlib
import redis
import json

cache = redis.Redis(host="redis-cluster", port=6379)

def predict_with_cache(features, ttl_seconds=300):
    # Create deterministic cache key from features
    key = hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(key)
    if cached:
        return json.loads(cached)

    prediction = model.predict(features)
    cache.setex(key, ttl_seconds, json.dumps(prediction))
    return prediction

Embedding cache: For models that compute embeddings (search, recommendations), cache the embeddings rather than recomputing them. User embeddings change infrequently and can be cached for hours or days.

Feature cache: Cache expensive feature computations in a feature store. Features like "user's average session duration over 30 days" do not change with each request and can be pre-computed.

Edge Deployment

For ultra-low latency requirements (under 10ms), deploy models directly to edge locations close to users.

CDN-based inference: Services like Cloudflare Workers AI and AWS CloudFront Functions run lightweight models at edge locations worldwide.
On-device inference: TensorFlow Lite, Core ML, and ONNX Runtime Mobile run models directly on user devices. Zero network latency, but limited model size and compute.
Edge servers: Deploy model servers in regional data centers or at the network edge. Useful for applications like real-time game AI, autonomous vehicles, or industrial IoT.

The tradeoff with edge deployment is always model size versus latency. Larger, more accurate models must run in centralized data centers. Smaller, distilled models can run at the edge.

Monitoring Model Serving

Production ML systems require monitoring at both the infrastructure level and the model performance level. A model can be "up" from a systems perspective while producing degraded predictions due to data drift.

Latency Percentiles

Track latency at multiple percentiles, not just the average.

p50 (median): The typical user experience. Target depends on use case, but under 50ms is common for real-time serving.
p95: The experience for 1 in 20 users. Should still be acceptable. Often 2-3x the p50.
p99: The worst-case experience (excluding extreme outliers). If your p99 is 500ms but your SLA is 200ms, you have a tail latency problem.
p99.9: Critical for high-traffic services where even 0.1% of slow requests affects thousands of users.

  Latency Distribution Example:

  p50:  23ms  |========
  p75:  35ms  |============
  p90:  58ms  |====================
  p95:  89ms  |==============================
  p99: 245ms  |=================================================================================
  p99.9: 890ms|[off chart - investigate these]

Tail latency spikes often indicate garbage collection pauses, GPU memory contention, model loading, or batch size variability. Investigate p99 anomalies aggressively.

Throughput and Error Rates

Requests per second (RPS): Track per model version and per endpoint. Set alerts for both traffic drops (possible upstream issue) and spikes (possible attack or upstream bug).
Error rate: Categorize errors: model errors (NaN outputs, shape mismatches), infrastructure errors (OOM, timeouts), and client errors (bad input). Each category requires different remediation.
Queue depth: For systems with request batching, monitor how many requests are waiting. Growing queues indicate the system is falling behind.

Model Performance Drift

This is where ML monitoring diverges from traditional application monitoring. Models degrade over time as the real world changes.

Data drift: The distribution of input features shifts from what the model was trained on. For example, a fraud detection model trained on pre-pandemic transaction data will see very different patterns post-pandemic.

Concept drift: The relationship between inputs and outputs changes. A recommendation model trained when users preferred long-form content may become stale as short-form video becomes dominant.

Prediction drift: The distribution of model outputs changes, even if accuracy has not been measured. A sudden shift in prediction confidence or class distribution warrants investigation.

Monitoring approach:

Compare incoming feature distributions against training data using statistical tests (KS test, PSI, Jensen-Shannon divergence)
Track prediction distributions over time with rolling windows
Log ground truth labels when available and compute accuracy metrics on a delay
Set alerts on distribution shift magnitude and accuracy degradation thresholds

Resource Utilization

GPU utilization: Target 70-85%. Below 50% means you are overpaying. Above 90% risks latency spikes from contention.
GPU memory: Monitor peak usage. OOM kills are the most common cause of inference server crashes.
CPU utilization: Preprocessing and postprocessing often happen on CPU. A CPU bottleneck can starve the GPU.
Network I/O: Large model inputs (images, audio) can saturate network bandwidth, especially with high request concurrency.

Architecture Diagram: Complete Production Model Serving Stack

Here is a comprehensive view of a production ML model serving system, showing all the components discussed in this guide working together:

+===========================================================================+
||                 COMPLETE PRODUCTION MODEL SERVING STACK                  ||
+===========================================================================+
|                                                                             |
|   CLIENTS                                                                   |
|   +-------+  +--------+  +---------+                                        |
|   | Web   |  | Mobile |  | Internal|                                        |
|   | App   |  | App    |  | Service |                                        |
|   +---+---+  +---+----+  +----+----+                                        |
|       |          |             |                                             |
|       +----------+-------------+                                            |
|                  |                                                           |
|                  v                                                           |
|   EDGE / GATEWAY LAYER                                                      |
|   +------------------------------------------------------------------+      |
|   |  CDN + Edge Cache (CloudFront / Cloudflare)                      |      |
|   +------------------------------------------------------------------+      |
|   |  API Gateway (Kong / AWS API GW)                                 |      |
|   |  - Authentication    - Rate Limiting    - Request Routing        |      |
|   +------------------------------------------------------------------+      |
|                  |                                                           |
|                  v                                                           |
|   ROUTING LAYER                                                             |
|   +------------------------------------------------------------------+      |
|   |  Traffic Router / A/B Test Controller                            |      |
|   |  - Model version routing    - Canary % control                   |      |
|   |  - Shadow mode forking      - User-level assignment              |      |
|   +------------------------------------------------------------------+      |
|           |              |                |                                  |
|           v              v                v                                  |
|   SERVING LAYER                                                             |
|   +-----------------+ +-----------------+ +-----------------+               |
|   | Model v2        | | Model v1        | | Shadow Model v3 |               |
|   | (Canary - 5%)   | | (Prod - 95%)    | | (Logging Only)  |               |
|   |                 | |                 | |                 |               |
|   | Triton Server   | | Triton Server   | | Triton Server   |               |
|   | - Dynamic Batch | | - Dynamic Batch | | - Async Logging |               |
|   | - GPU (MIG)     | | - GPU (MIG)     | | - GPU (shared)  |               |
|   +-----------------+ +-----------------+ +-----------------+               |
|           |              |                |                                  |
|           v              v                v                                  |
|   DATA LAYER                                                                |
|   +------------------------------------------------------------------+      |
|   | Prediction Cache (Redis)    | Feature Store (Feast / Tecton)     |      |
|   +------------------------------------------------------------------+      |
|   | Embedding Store (Pinecone / pgvector)  | Ground Truth Logger     |      |
|   +------------------------------------------------------------------+      |
|                                                                             |
|   ML PLATFORM LAYER                                                         |
|   +------------------------------------------------------------------+      |
|   | Model Registry (MLflow)                                          |      |
|   |  - Model artifacts    - Metadata    - Lineage    - Approvals     |      |
|   +------------------------------------------------------------------+      |
|   | CI/CD Pipeline (GitHub Actions / Argo)                           |      |
|   |  - Validation gates   - Integration tests   - Auto-rollout      |      |
|   +------------------------------------------------------------------+      |
|   | Training Pipeline (Kubeflow / SageMaker)                         |      |
|   |  - Scheduled retraining    - Hyperparameter tuning               |      |
|   +------------------------------------------------------------------+      |
|                                                                             |
|   MONITORING LAYER                                                          |
|   +------------------------------------------------------------------+      |
|   | Infrastructure (Prometheus + Grafana)                            |      |
|   |  - Latency (p50/p95/p99)   - GPU util   - Error rates           |      |
|   +------------------------------------------------------------------+      |
|   | Model Performance (Evidently / Arize / WhyLabs)                  |      |
|   |  - Data drift    - Prediction drift    - Accuracy tracking       |      |
|   +------------------------------------------------------------------+      |
|   | Business Metrics (Datadog / internal dashboards)                 |      |
|   |  - CTR    - Conversion    - Revenue impact                       |      |
|   +------------------------------------------------------------------+      |
|   | Alerting (PagerDuty / OpsGenie)                                  |      |
|   |  - Latency SLA breach   - Drift threshold   - Error spike        |      |
|   +------------------------------------------------------------------+      |
|                                                                             |
+=============================================================================+

This architecture handles the full lifecycle: requests flow through the gateway and traffic router, hit the appropriate model version, leverage caching and feature stores, and feed into a comprehensive monitoring stack. The ML platform layer manages model artifacts, automated pipelines, and deployment orchestration.

Key Takeaways

Designing an ML model serving system for production requires thinking far beyond the model itself. Here are the principles that matter most:

Match the serving pattern to your use case. Not every prediction needs real-time inference. Batch and near-real-time patterns can reduce costs by 10-100x for the right workloads.
Optimize the model before scaling the infrastructure. Quantization, distillation, and compilation often deliver more speedup per dollar than adding more GPUs.
Dynamic batching is nearly free performance. If you are serving on GPUs, enable dynamic batching. The throughput gains are substantial with minimal latency impact.
Treat model deployment like application deployment. Blue-green deployments, canary rollouts, and automated rollback are not optional for production ML systems.
Shadow mode before A/B test before full rollout. This progression, from zero-risk validation to controlled experiment to full deployment, prevents costly mistakes.
Monitor predictions, not just infrastructure. A model server can report 100% uptime and 20ms latency while the model itself produces increasingly wrong predictions due to data drift.
Cache aggressively. Many ML applications see significant input repetition. A prediction cache with a 5-minute TTL can eliminate 30-60% of inference calls.
Plan for GPU efficiency from the start. MIG partitioning, time-slicing, and workload-aware scheduling prevent the common pattern of paying for 8 GPUs while only using 2 at capacity.

Design Your ML Serving Architecture

Visualizing these complex serving architectures helps teams align on design decisions before writing code. InfraSketch's AI-powered system design tool lets you generate production ML serving diagrams from natural language descriptions, complete with inference servers, feature stores, model registries, and monitoring stacks.

Describe your ML serving requirements in plain English and get a complete architecture diagram in seconds, along with a detailed design document covering scaling strategies, deployment patterns, and monitoring approaches.

ML Model Serving System Design: From Prototype to Production

ML Model Serving System Design: From Prototype to Production

Why Model Serving Is Hard

Serving Architecture Patterns

REST API Serving

gRPC Serving

Serverless Inference

Streaming Inference

Comparison of Serving Approaches

Real-Time vs Batch vs Near-Real-Time

Real-Time Inference

Batch Inference

Near-Real-Time (Micro-Batch)

Hybrid Architectures

Scaling Inference

Horizontal Scaling and Load Balancing

GPU Sharing and Scheduling

Model Optimization

Request Batching

Scaled Serving Infrastructure

Model Registry and Deployment Pipelines

Model Registry Options

CI/CD for Models

Blue-Green and Canary Deployments

A/B Testing for ML Models

Traffic Splitting Strategies

Statistical Significance in ML Experiments

Shadow Mode Deployment

Latency Optimization Techniques

Model Compilation

Caching Strategies

Edge Deployment

Monitoring Model Serving

Latency Percentiles

Throughput and Error Rates

Model Performance Drift

Resource Utilization

Architecture Diagram: Complete Production Model Serving Stack

Key Takeaways

Design Your ML Serving Architecture

Related Resources

Try InfraSketch Tools

Generate a Diagram

System Design Tool

Design Doc Generator

All Tools

Related Articles

AI Pipeline System Design: Data Ingestion, Training, and Serving

Streaming ML System Design: Real-Time Machine Learning Architecture

AI System Design: A Comprehensive Reference Architecture