MLOps System Design: Reference Architecture for 2026
Machine learning models are only valuable when they run reliably in production. The challenge is that ML systems are fundamentally different from traditional software. They depend on data that shifts over time, models that degrade silently, and training pipelines that are expensive, slow, and difficult to reproduce. MLOps, the discipline of applying DevOps principles to machine learning, exists to solve these problems. It provides the practices, tools, and architectures needed to take a model from a Jupyter notebook to a production service that teams can trust.
This guide presents a complete MLOps system design reference architecture. It covers maturity levels, core components, monitoring strategies, CI/CD pipelines for ML, infrastructure platforms, cost management, and the anti-patterns that trip teams up most often. Whether you are preparing for a system design interview or building a production ML platform, this reference architecture gives you a concrete blueprint to work from.
What Is MLOps?
MLOps (Machine Learning Operations) is the set of practices that combines machine learning, DevOps, and data engineering to deploy and maintain ML models in production reliably and efficiently. It addresses the unique challenges of ML systems that traditional software engineering practices do not cover.
In traditional software, the behavior of a system is defined entirely by code. If the code does not change, the system behaves the same way tomorrow as it does today. ML systems break this assumption. Their behavior depends on three moving parts:
- Code: The model architecture, feature engineering logic, training scripts, and serving infrastructure.
- Data: The training data, feature values, and incoming request distributions. All of these change over time.
- Models: The trained artifacts that encode learned patterns. These degrade as the world changes.
MLOps provides discipline around all three axes. It borrows concepts from DevOps (version control, CI/CD, monitoring, infrastructure as code) and extends them with ML-specific practices like data versioning, experiment tracking, model registries, feature stores, and model performance monitoring.
Why MLOps Matters
Without MLOps, teams run into predictable failure modes:
- Models that worked in notebooks fail in production because the training environment does not match the serving environment.
- Model performance degrades silently because nobody is monitoring prediction quality after deployment.
- Retraining is manual and slow because there is no automated pipeline to retrain on fresh data.
- Experiments are not reproducible because datasets, hyperparameters, and code versions were not tracked together.
- Deployments are risky because there is no rollback mechanism, no canary testing, and no way to compare a new model to the current one in production.
MLOps solves each of these by treating the entire ML lifecycle, from data ingestion to model retirement, as an engineering problem with repeatable, automated, observable processes.
MLOps Maturity Levels
Google's MLOps maturity framework defines three levels that describe how automated and reliable an organization's ML processes are. Understanding where your team falls helps prioritize what to build next.
Level 0: Manual Process
At Level 0, everything is manual. Data scientists train models in notebooks, hand off artifacts to engineers, and deployments happen through ad hoc scripts or manual uploads. There is no pipeline, no automation, and no monitoring.
Characteristics:
- Training is interactive and notebook-driven
- No separation between experimentation and production code
- Model deployment is a manual, infrequent process
- No monitoring of model performance in production
- Retraining happens when someone notices a problem (or doesn't happen at all)
Appropriate for: Early prototyping, proof-of-concept projects, teams with fewer than two ML models in production.
Level 1: ML Pipeline Automation
At Level 1, the training process is automated as a pipeline. Data ingestion, feature engineering, training, evaluation, and deployment run as a connected workflow. The pipeline can be triggered on a schedule or by data changes, and it produces consistent, reproducible results.
Characteristics:
- Automated training pipeline (end-to-end)
- Feature store provides consistent features for training and serving
- Experiment tracking captures metrics, parameters, and artifacts
- Model registry stores versioned model artifacts
- Continuous training on new data (scheduled or triggered)
- Manual deployment approval (human-in-the-loop before production)
Appropriate for: Teams with a handful of models in production that retrain periodically.
Level 2: CI/CD for ML
At Level 2, the pipeline itself is under CI/CD. Changes to training code, feature definitions, or model architecture trigger automated testing, validation, and deployment. The system can retrain, evaluate, and promote models to production without human intervention (with appropriate guardrails).
Characteristics:
- CI/CD for pipeline code (not just model artifacts)
- Automated model validation (performance thresholds, bias checks, regression tests)
- Automated deployment with canary or blue-green strategies
- Comprehensive monitoring with automated alerting and retraining triggers
- A/B testing infrastructure for comparing models in production
- Full reproducibility from data version to deployed model
Appropriate for: Teams running many models in production at scale, organizations where ML is a core business capability.
Maturity Level Comparison
| Capability | Level 0 | Level 1 | Level 2 |
|---|---|---|---|
| Training process | Manual | Automated pipeline | Automated + CI/CD |
| Model deployment | Manual scripts | Automated with approval | Automated with guardrails |
| Monitoring | None | Basic metrics | Full observability |
| Retraining | Ad hoc | Scheduled/triggered | Automated with validation |
| Feature management | Ad hoc | Feature store | Feature store + versioning |
| Experiment tracking | Notebooks | Centralized tracking | Tracking + automated comparison |
| Testing | None | Pipeline tests | Full ML testing suite |
| Time to deploy | Days/weeks | Hours | Minutes |
Most organizations start at Level 0 and should aim for Level 1 as soon as they have a model in production. Level 2 is the target for teams operating ML at scale. Jumping straight to Level 2 without the fundamentals of Level 1 in place is a common and costly mistake.
Reference Architecture Components
The following diagram shows a complete MLOps reference architecture. Each component serves a specific role in the lifecycle of an ML model, from raw data to production predictions.
┌──────────────────────────────────────────────────────────────────────────┐
│ MLOps Reference Architecture │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ Data Sources │───>│ Data │───>│ Feature │ │
│ │ (DBs, APIs, │ │ Pipeline │ │ Store │ │
│ │ Streams) │ │ (Ingestion, │ │ (Online + │ │
│ └─────────────┘ │ Cleaning) │ │ Offline) │ │
│ └─────────────┘ └──────┬───────┘ │
│ │ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────▼───────┐ │
│ │ Source │───>│ CI/CD │───>│ Training │ │
│ │ Control │ │ Pipeline │ │ Pipeline │ │
│ │ (Code+Data) │ │ (Test,Build)│ │ (Train,Eval) │ │
│ └─────────────┘ └─────────────┘ └──────┬───────┘ │
│ │ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────▼───────┐ │
│ │ Experiment │<───│ Model │<───│ Model │ │
│ │ Tracking │ │ Registry │ │ Validation │ │
│ │ (MLflow, │ │ (Versioned │ │ (Perf, Bias, │ │
│ │ W&B) │ │ Artifacts) │ │ Regression) │ │
│ └─────────────┘ └──────┬──────┘ └──────────────┘ │
│ │ │
│ ┌──────▼───────┐ │
│ │ Deployment │ │
│ │ (Blue-Green, │ │
│ │ Canary, A/B)│ │
│ └──────┬───────┘ │
│ │ │
│ ┌─────────────┐ ┌──────▼───────┐ ┌──────────────┐ │
│ │ Monitoring │<───│ Serving │<───│ API Gateway │<── Requests │
│ │ (Data,Model,│ │ Infra │ │ (Rate Limit, │ │
│ │ Infra) │ │ (K8s, Lambda)│ │ Auth, Route)│ │
│ └──────┬──────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Alerting + │── Trigger retraining if drift/degradation detected │
│ │ Retraining │ │
│ │ Triggers │ │
│ └─────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
Let's walk through each component in detail.
Source Control for Code and Data
Version control in MLOps extends beyond code. You need to version the data, the configuration, and the pipeline definitions alongside the model code.
Code versioning uses standard Git workflows. All training scripts, feature engineering code, pipeline definitions, serving code, and infrastructure-as-code live in a repository. Branch-based workflows (feature branches, pull requests, code review) apply just as they do in traditional software development.
Data versioning is the ML-specific addition. Tools like DVC (Data Version Control) track large datasets alongside Git without storing the actual data in the repository. DVC stores metadata (hashes, pointers) in Git while the data itself lives in cloud storage (S3, GCS, Azure Blob). This means you can check out a specific commit and get the exact code, configuration, and dataset that produced a given model.
Pipeline versioning defines the training workflow as code. Tools like Kubeflow Pipelines, Airflow, or Prefect let you define DAGs (directed acyclic graphs) that describe each step of data processing, training, evaluation, and registration. When the pipeline definition changes, it goes through the same code review and testing process as application code.
Key tools: Git, DVC, LakeFS, Delta Lake, Pachyderm
Feature Store
A feature store is a centralized repository for storing, managing, and serving ML features. It solves a critical problem in ML systems: the inconsistency between features computed during training and features computed during serving (known as the training-serving skew).
A feature store has two interfaces:
- Offline store: Provides historical feature values for training. Backed by a data warehouse or data lake (BigQuery, Snowflake, S3/Parquet). Supports point-in-time lookups to prevent data leakage during training.
- Online store: Provides low-latency feature values for real-time inference. Backed by a key-value store (Redis, DynamoDB, Bigtable). Sub-millisecond reads for serving.
┌──────────────────────┐
│ Feature Store │
│ │
Batch jobs ──────>│ ┌────────────────┐ │<────── Training jobs
Stream jobs ─────>│ │ Feature │ │ (offline reads)
│ │ Definitions │ │
│ │ (schemas, │ │
│ │ transforms) │ │
│ └───────┬────────┘ │
│ │ │
│ ┌─────▼─────┐ │
│ │ Offline │ │
│ │ Store │ │
│ │ (Parquet/ │ │
│ │ BigQuery)│ │
│ └─────┬─────┘ │
│ │ Materialize│
│ ┌─────▼─────┐ │
│ │ Online │ │<────── Serving infra
│ │ Store │ │ (real-time reads)
│ │ (Redis/ │ │
│ │ DynamoDB) │ │
│ └───────────┘ │
└──────────────────────┘
Why it matters: Without a feature store, teams often compute features differently in training notebooks and serving code. A feature that uses a 30-day rolling average in training might accidentally use a 7-day window in production, leading to silent prediction errors. The feature store ensures consistency by defining features once and serving them to both contexts.
For a deeper dive into feature store architecture, see our guide on Feature Store System Design.
Key tools: Feast (open-source), Tecton, Hopsworks, SageMaker Feature Store, Vertex AI Feature Store
Training Pipeline
The training pipeline automates the process of turning raw data and feature definitions into a trained, evaluated model. At Level 1 maturity and above, this pipeline runs without human intervention.
A typical training pipeline includes these stages:
- Data ingestion: Pull fresh data from sources or the feature store.
- Data validation: Check for schema changes, missing values, outliers, and distribution drift compared to the training baseline.
- Feature engineering: Compute features (or pull pre-computed features from the feature store).
- Model training: Run the training algorithm with tracked hyperparameters.
- Model evaluation: Compute metrics (accuracy, precision, recall, AUC, F1) on a held-out test set.
- Model validation: Compare the new model against the current production model. Check performance thresholds, bias metrics, and regression tests.
- Model registration: If validation passes, register the model in the model registry with metadata (metrics, data version, code version, parameters).
Orchestration tools: Kubeflow Pipelines, Apache Airflow, Prefect, Dagster, Argo Workflows, AWS Step Functions
Training frameworks: PyTorch, TensorFlow, XGBoost, scikit-learn, Hugging Face Transformers
Experiment tracking: MLflow, Weights & Biases, Neptune, Comet ML
Model Registry
The model registry is the single source of truth for trained model artifacts. It stores model versions alongside their metadata, making it possible to reproduce, compare, roll back, and audit any model that has ever been deployed.
Each entry in the registry includes:
- Model artifact: The serialized model file (ONNX, TorchScript, SavedModel, pickle)
- Metrics: Performance metrics on evaluation datasets
- Parameters: Hyperparameters used during training
- Data version: A pointer to the exact dataset used for training
- Code version: The Git commit hash of the training code
- Lineage: The pipeline run that produced the model
- Stage: The model's lifecycle stage (staging, production, archived)
The model registry enables critical workflows. When a production model degrades, you can roll back to the previous version in seconds. When a regulatory audit asks how a model was trained, you can trace from the deployed artifact back to the exact data and code.
Key tools: MLflow Model Registry, SageMaker Model Registry, Vertex AI Model Registry, Weights & Biases Model Registry
Serving Infrastructure
Serving infrastructure takes a registered model and makes it available for predictions. The right serving pattern depends on your latency requirements, throughput, and cost constraints.
Real-time serving handles individual prediction requests with low latency (typically under 100ms). The model runs as a microservice behind an API, often containerized and deployed on Kubernetes or a managed service.
Batch serving processes large volumes of data on a schedule. This pattern is appropriate when predictions do not need to be immediate (e.g., daily recommendation updates, risk scoring overnight).
Streaming serving processes events from a message queue (Kafka, Kinesis) and produces predictions in near-real-time. This is the pattern for use cases like fraud detection, where predictions must be fast but can tolerate slightly higher latency than synchronous APIs.
┌─────────────────────────────────────────────────────────────┐
│ Serving Patterns │
│ │
│ Real-time: Request ──> API ──> Model ──> Response │
│ (REST/gRPC) (< 100ms) │
│ │
│ Batch: Data ──> Spark/Beam ──> Model ──> Storage │
│ (scheduled) (minutes/hours) │
│ │
│ Streaming: Events ──> Stream Processor ──> Model ──> Out │
│ (Kafka/Kinesis) (near real-time) │
│ │
└─────────────────────────────────────────────────────────────┘
For a comprehensive guide to serving architecture, see ML Model Serving System Design.
Key tools: TensorFlow Serving, TorchServe, Triton Inference Server, Seldon Core, BentoML, KServe, SageMaker Endpoints, Vertex AI Prediction
Monitoring and Alerting
Monitoring is arguably the most important and most neglected component of an MLOps architecture. ML models degrade silently. Unlike a crashed server (which triggers an immediate alert), a model that starts making bad predictions looks perfectly healthy from an infrastructure perspective. It returns HTTP 200 responses with valid JSON payloads. The latency is fine. CPU utilization is normal. But the predictions are wrong.
We cover monitoring in depth in the next section.
ML Monitoring and Observability
ML monitoring goes beyond traditional application monitoring. You need three layers of observability: data quality, model performance, and infrastructure health.
Data Quality Monitoring
Data is the foundation of every ML system. If the data changes, the model's behavior changes, often in ways that are hard to predict. Data quality monitoring catches these changes before they reach the model.
Schema drift detects structural changes in input data. A column that was always present might start arriving as null. A categorical field might gain new values. An integer field might start containing strings. Schema monitoring catches these changes at the boundary of your system.
Distribution drift (also called data drift or covariate shift) detects when the statistical distribution of input features changes compared to the training data. For example, if a model was trained on transactions averaging $50 and starts receiving transactions averaging $500, the model is operating outside its training distribution.
Missing value monitoring tracks the percentage of null or missing values for each feature. A feature that was 99% populated during training but drops to 70% in production indicates an upstream data pipeline issue.
Volume monitoring tracks the number of incoming records over time. A sudden drop in volume might indicate a broken data pipeline rather than a genuine business change.
# Example: Data quality checks with Great Expectations
import great_expectations as gx
context = gx.get_context()
# Define expectations
suite = context.add_expectation_suite("input_data_quality")
suite.add_expectation(
gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id")
)
suite.add_expectation(
gx.expectations.ExpectColumnValuesToBeBetween(
column="transaction_amount", min_value=0, max_value=100000
)
)
suite.add_expectation(
gx.expectations.ExpectColumnDistinctValuesToBeInSet(
column="currency", value_set=["USD", "EUR", "GBP", "JPY"]
)
)
# Validate incoming batch
results = context.run_checkpoint(
checkpoint_name="production_data_check",
batch_request=batch_request,
)
if not results.success:
trigger_alert("Data quality check failed", results)
Model Performance Monitoring
Model performance monitoring tracks whether the model's predictions are accurate over time. This is the most critical layer of ML monitoring because it directly measures whether the model is delivering business value.
Prediction drift (also called concept drift) occurs when the relationship between inputs and outputs changes. The model's predictions may shift even when the input distribution looks stable, because the underlying patterns in the data have changed. For example, a credit risk model trained before a recession may assign incorrect risk scores during the recession because the relationship between income and default risk has changed.
Accuracy monitoring tracks standard ML metrics (accuracy, precision, recall, F1, AUC, RMSE) over time. This requires ground truth labels, which may be delayed. In some domains (e.g., fraud detection), you know the true label within days. In others (e.g., long-term churn prediction), it may take months.
Prediction distribution monitoring tracks the distribution of model outputs. If a binary classifier that historically predicted "positive" 10% of the time suddenly predicts "positive" 40% of the time, something has changed, even if you do not yet have ground truth to measure accuracy directly.
Segment-level monitoring breaks performance down by important dimensions (geography, customer segment, product category). A model might perform well overall but degrade significantly for a specific segment.
┌────────────────────────────────────────────────────────────┐
│ Model Monitoring Dashboard │
│ │
│ Accuracy (7-day rolling) Prediction Distribution │
│ ┌──────────────────────┐ ┌────────────────────┐ │
│ │ 0.95 ─────┐ │ │ ▓▓▓ │ │
│ │ 0.90 └──┐ │ │ ▓▓▓▓▓ │ │
│ │ 0.85 └─── ! │ │ ▓▓▓▓▓▓▓▓ │ │
│ │ 0.80 ────threshold── │ │ ▓▓▓▓▓▓▓▓▓▓▓ │ │
│ │ Week1 Week2 Week3 │ │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │ │
│ └──────────────────────┘ │ 0.0 0.5 1.0 │ │
│ └────────────────────┘ │
│ Data Drift Score Latency P99 │
│ ┌──────────────────────┐ ┌────────────────────┐ │
│ │ 0.3 │ │ 150ms │ │
│ │ 0.2 ─────────────────│ │ 100ms ──────────── │ │
│ │ 0.1 ────threshold── │ │ 50ms │ │
│ │ 0.0 │ │ 0ms │ │
│ └──────────────────────┘ └────────────────────┘ │
└────────────────────────────────────────────────────────────┘
Infrastructure Monitoring
Infrastructure monitoring tracks the health of the serving systems, training clusters, and data pipelines that support your ML workloads.
Latency monitoring measures the end-to-end time from receiving a prediction request to returning a response. Track P50, P95, and P99 latency. Set SLOs (service level objectives) and alert when latency exceeds them.
Throughput monitoring tracks the number of predictions served per second. This helps with capacity planning and detecting traffic anomalies.
Resource utilization tracks CPU, GPU, memory, and disk usage for both training and serving workloads. GPU utilization is particularly important because GPUs are expensive and underutilization is a common source of waste.
Pipeline health monitors the success/failure rate of training pipeline runs, data pipeline runs, and feature computation jobs. A failed pipeline run that goes unnoticed means the model is serving stale predictions.
Alerting and Automated Retraining Triggers
Monitoring is only useful if it triggers action. Define clear alerting rules that distinguish between informational, warning, and critical conditions.
Example alerting rules:
| Condition | Severity | Action |
|---|---|---|
| Model accuracy drops below 0.85 | Warning | Notify ML team, investigate |
| Model accuracy drops below 0.80 | Critical | Trigger automated retraining |
| Data drift score exceeds 0.3 | Warning | Investigate data pipeline |
| Feature missing rate exceeds 10% | Critical | Alert data engineering team |
| Serving latency P99 exceeds 200ms | Warning | Check resource utilization |
| Training pipeline fails 3x in a row | Critical | Page on-call engineer |
Automated retraining closes the loop. When monitoring detects that a model has degraded beyond a threshold, it triggers the training pipeline to retrain on fresh data. The retrained model goes through the same validation process before being promoted to production.
Monitor ──> Drift Detected ──> Trigger Pipeline ──> Train ──> Validate ──> Deploy
│
(fails) │ (passes)
│ │
▼ ▼
Alert Team Production
Monitoring Tools
| Tool | Type | Strengths |
|---|---|---|
| Evidently AI | Open-source | Data drift, model performance, reports and dashboards |
| Whylabs | SaaS | Data profiling, drift detection, anomaly detection |
| Arize | SaaS | Full observability platform, embedding drift, LLM monitoring |
| Fiddler | SaaS | Explainability, bias monitoring, performance tracking |
| NannyML | Open-source | Performance estimation without ground truth |
| Prometheus + Grafana | Open-source | Infrastructure metrics, custom dashboards |
| Datadog ML Monitoring | SaaS | Integrated with existing infrastructure monitoring |
For more on building observable AI pipelines, including monitoring considerations at each stage, see our dedicated guide.
CI/CD for ML Models
CI/CD for ML is more complex than CI/CD for traditional software because you need to test three things: the code, the data, and the model. A broken unit test fails immediately and loudly. A subtly degraded model passes every infrastructure check and quietly makes bad predictions for weeks.
Code Testing
Code testing for ML covers the same ground as traditional software testing, plus ML-specific concerns.
Unit tests for feature transforms verify that your feature engineering logic produces correct outputs. These tests are fast, deterministic, and should run on every commit.
# Example: Unit testing a feature transform
def compute_rolling_avg(transactions, window_days=30):
"""Compute rolling average transaction amount."""
cutoff = datetime.now() - timedelta(days=window_days)
recent = [t.amount for t in transactions if t.timestamp >= cutoff]
return sum(recent) / len(recent) if recent else 0.0
def test_rolling_avg_basic():
transactions = [
Transaction(amount=100.0, timestamp=datetime(2026, 1, 15)),
Transaction(amount=200.0, timestamp=datetime(2026, 1, 20)),
]
result = compute_rolling_avg(transactions, window_days=30)
assert result == 150.0
def test_rolling_avg_empty():
result = compute_rolling_avg([], window_days=30)
assert result == 0.0
def test_rolling_avg_filters_old():
transactions = [
Transaction(amount=100.0, timestamp=datetime(2025, 1, 1)), # old
Transaction(amount=200.0, timestamp=datetime(2026, 1, 20)), # recent
]
result = compute_rolling_avg(transactions, window_days=30)
assert result == 200.0
Integration tests for pipelines verify that the pipeline stages connect correctly, that data flows through without errors, and that the output schema matches expectations. These tests typically run on a small sample dataset.
Serving tests verify that the model serving endpoint loads the model correctly, handles edge cases (missing features, malformed inputs), and returns responses in the expected format.
Data Testing
Data testing validates that the data feeding your pipeline meets quality expectations. This is where tools like Great Expectations shine.
Data contracts define the expected schema, value ranges, distributions, and relationships between fields. They serve as a formal agreement between data producers and consumers.
# Example: Data contract for user features
contract:
name: user_features
version: 2.1
fields:
- name: user_id
type: string
nullable: false
unique: true
- name: account_age_days
type: integer
nullable: false
min: 0
max: 36500
- name: total_transactions
type: integer
nullable: false
min: 0
- name: avg_transaction_amount
type: float
nullable: true
min: 0
max: 1000000
- name: country_code
type: string
nullable: false
allowed_values: ["US", "GB", "DE", "FR", "JP", "CA", "AU"]
freshness:
max_staleness: 24h
volume:
min_rows_per_day: 10000
Distribution tests compare the current data distribution against a reference distribution (usually the training data). If the Kolmogorov-Smirnov test or Population Stability Index exceeds a threshold, the pipeline should alert or halt.
Key tools: Great Expectations, Deequ, TDDA, Pandera, Soda
Model Testing
Model testing goes beyond "did the training script run without errors." It validates that the trained model meets quality standards before deployment.
Performance threshold tests verify that the model meets minimum accuracy, precision, recall, or other metric thresholds on a held-out evaluation set. A model that scores below the threshold is rejected.
Regression tests compare the new model against the current production model on the same evaluation set. The new model should perform at least as well as the current one. If it doesn't, the deployment is blocked.
Bias and fairness tests check whether the model's performance varies across demographic groups or sensitive attributes. Tools like Fairlearn and AIF360 provide metrics for disparate impact, equalized odds, and demographic parity.
Adversarial tests check how the model handles edge cases, out-of-distribution inputs, and adversarial examples. Does the model fail gracefully, or does it make confident but incorrect predictions?
# Example: Model validation pipeline
def validate_model(new_model, current_model, eval_data):
"""Validate new model before deployment."""
results = {}
# Performance threshold
new_accuracy = evaluate(new_model, eval_data)
results["accuracy"] = new_accuracy
if new_accuracy < 0.85:
return {"passed": False, "reason": f"Accuracy {new_accuracy} below 0.85"}
# Regression test
current_accuracy = evaluate(current_model, eval_data)
if new_accuracy < current_accuracy - 0.02:
return {"passed": False, "reason": f"Regression: {new_accuracy} < {current_accuracy}"}
# Bias check
for segment in ["age_group", "region", "gender"]:
segment_metrics = evaluate_by_segment(new_model, eval_data, segment)
max_gap = max(segment_metrics.values()) - min(segment_metrics.values())
if max_gap > 0.10:
return {"passed": False, "reason": f"Bias in {segment}: gap {max_gap}"}
return {"passed": True, "metrics": results}
Deployment Strategies
ML models need the same deployment safety nets as traditional services, with some ML-specific additions.
Blue-green deployment maintains two identical environments. The new model deploys to the inactive environment, gets validated, and traffic switches over. If something goes wrong, you switch back instantly.
Canary deployment routes a small percentage of traffic (e.g., 5%) to the new model while the rest continues to hit the current model. You monitor the canary for errors, latency regressions, and prediction quality before gradually increasing traffic.
Shadow deployment sends all traffic to both models but only returns the current model's predictions. The new model's predictions are logged for comparison but never shown to users. This is the safest strategy for validating a model in production without any user impact.
A/B testing routes traffic between two models based on user segments and measures the business impact (conversion rate, revenue, engagement) of each model. This is the gold standard for validating that a new model delivers real business value.
┌─────────────────────────────────────────────────────────────┐
│ Deployment Strategies │
│ │
│ Blue-Green: [Traffic] ──> [Model v2] │
│ [Model v1] (standby) │
│ │
│ Canary: [Traffic] ──95%──> [Model v1] │
│ ──5%───> [Model v2] │
│ │
│ Shadow: [Traffic] ──> [Model v1] ──> Response │
│ ──> [Model v2] ──> Log only │
│ │
│ A/B Test: [Traffic] ──50%──> [Model v1] ──> Measure │
│ ──50%──> [Model v2] ──> Measure │
└─────────────────────────────────────────────────────────────┘
Infrastructure Platforms
Several platforms provide managed or semi-managed MLOps infrastructure. The right choice depends on your existing cloud provider, team expertise, scale, and need for customization.
Kubernetes + Kubeflow
Kubeflow is an open-source ML toolkit that runs on Kubernetes. It provides components for every stage of the ML lifecycle: notebooks (Jupyter), pipelines (Kubeflow Pipelines), training (TFJob, PyTorchJob), serving (KServe), and experiment tracking (Katib).
Strengths: Full control, cloud-agnostic, extensible, large community. You can run it on any Kubernetes cluster (on-prem, AWS, GCP, Azure).
Weaknesses: Complex to set up and maintain. Requires Kubernetes expertise. The operational burden is significant, especially for small teams.
Best for: Organizations with strong platform engineering teams, multi-cloud or hybrid environments, teams that need full customization.
AWS SageMaker
SageMaker is AWS's fully managed ML platform. It covers the entire ML lifecycle: data labeling (Ground Truth), notebooks (Studio), feature store, training (managed training jobs with automatic scaling), model registry, deployment (real-time, batch, async), monitoring (Model Monitor), and pipelines.
Strengths: Deep AWS integration, managed infrastructure, built-in model monitoring, spot instance support for training, pay-per-use pricing.
Weaknesses: AWS lock-in, some features feel bolted-on rather than cohesive, pricing can be opaque, limited customization compared to Kubeflow.
Best for: Teams already invested in AWS, organizations that want managed infrastructure, teams without deep Kubernetes expertise.
Google Vertex AI
Vertex AI is Google Cloud's unified ML platform. It provides managed notebooks, a feature store, AutoML, custom training, model registry, serving (with traffic splitting), pipelines, model monitoring, and experiment tracking.
Strengths: Strong integration with BigQuery and other GCP data services, good AutoML capabilities, clean UI, built-in experiment tracking.
Weaknesses: GCP lock-in, smaller ecosystem compared to AWS, some features are newer and less battle-tested.
Best for: Teams on GCP, organizations using BigQuery for analytics, teams that want a clean integrated experience.
Azure Machine Learning
Azure ML provides a complete MLOps platform with managed notebooks, data labeling, automated ML, designer (visual pipeline builder), managed training, model registry, managed endpoints, and pipeline orchestration.
Strengths: Enterprise integration (Active Directory, Azure DevOps), strong hybrid cloud support, good compliance certifications, designer for non-code users.
Weaknesses: Azure lock-in, UI can feel fragmented, documentation is sometimes inconsistent.
Best for: Enterprise organizations on Azure, teams needing Active Directory integration, organizations with strict compliance requirements.
Platform Comparison
| Capability | Kubeflow | SageMaker | Vertex AI | Azure ML |
|---|---|---|---|---|
| Setup complexity | High | Low | Low | Medium |
| Cloud lock-in | None | AWS | GCP | Azure |
| Feature store | Third-party | Built-in | Built-in | Built-in |
| Model monitoring | Third-party | Built-in | Built-in | Built-in |
| Pipeline orchestration | Kubeflow Pipelines | SageMaker Pipelines | Vertex Pipelines | Azure Pipelines |
| GPU support | Full | Full | Full | Full |
| Autoscaling | Manual config | Managed | Managed | Managed |
| Cost | Infrastructure only | Pay-per-use | Pay-per-use | Pay-per-use |
| Customization | Unlimited | Moderate | Moderate | Moderate |
| Multi-cloud | Yes | No | No | Partial (Arc) |
To visualize these architectures interactively, try our ML System Design Tool, which generates architecture diagrams from natural language descriptions.
Cost Management and FinOps for ML
ML workloads are expensive. Training large models requires powerful GPUs, storing massive datasets consumes storage budgets, and serving predictions at scale drives compute costs. FinOps (Financial Operations) for ML requires specific strategies that go beyond traditional cloud cost management.
GPU Cost Optimization
GPUs are the single largest cost driver for most ML teams. A single NVIDIA A100 instance costs roughly $3-4 per hour on-demand across major cloud providers.
Spot instances (preemptible VMs) offer 60-90% discounts compared to on-demand pricing. Training jobs, which can be checkpointed and resumed, are ideal candidates for spot instances. The key is implementing checkpointing logic that saves model state periodically and resumes from the last checkpoint when a spot instance is reclaimed.
# Example: Checkpointing for spot-instance training
import torch
def save_checkpoint(model, optimizer, epoch, loss, path):
torch.save({
"epoch": epoch,
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"loss": loss,
}, path)
def load_checkpoint(model, optimizer, path):
checkpoint = torch.load(path)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
return checkpoint["epoch"], checkpoint["loss"]
# Save every N steps
for step, batch in enumerate(dataloader):
loss = train_step(model, batch)
if step % 500 == 0:
save_checkpoint(model, optimizer, epoch, loss, "checkpoint.pt")
Right-sizing means matching the GPU type and count to the workload. Many teams default to the largest available GPU when a smaller one would suffice. Profile your training jobs to determine actual GPU memory usage, compute utilization, and bottlenecks (CPU, I/O, network) before selecting instance types.
Scheduling runs training jobs during off-peak hours when spot instance availability is higher and prices are lower. Automated scheduling tools can queue jobs and launch them when prices drop below a threshold.
Storage Optimization
ML datasets are large and multiply quickly as teams create different versions, preprocessing variants, and feature sets.
Data lifecycle policies automatically move older data to cheaper storage tiers. Recent data stays in hot storage (S3 Standard, GCS Standard) for active training. Older versions move to infrequent access tiers. Archived data moves to cold storage (S3 Glacier, GCS Archive).
Deduplication eliminates redundant copies. A naive workflow creates copies at every pipeline stage (raw, cleaned, featurized, train/test splits). Deduplication tools and data versioning (DVC, LakeFS) reduce this waste by storing only diffs between versions.
Compression reduces storage costs and speeds up data loading. Parquet and ORC formats provide excellent compression ratios for tabular data. For image and video datasets, consider storing in compressed formats and decompressing on-the-fly during training.
Inference Cost Management
For models serving real-time predictions, inference costs can exceed training costs over time, especially at scale.
Model optimization reduces the computational cost per prediction. Techniques include:
- Quantization: Reduce model weights from 32-bit floats to 16-bit or 8-bit integers. Reduces memory usage and increases throughput with minimal accuracy loss.
- Pruning: Remove model weights with small magnitudes. Reduces model size and inference time.
- Distillation: Train a smaller "student" model to mimic a larger "teacher" model. The student is cheaper to serve.
- ONNX Runtime: Convert models to ONNX format and serve with ONNX Runtime, which applies graph optimizations and hardware-specific acceleration.
Autoscaling adjusts the number of serving replicas based on traffic patterns. Scale to zero during periods of no traffic (if your latency SLO allows a cold start). Scale up quickly in response to traffic spikes. Use predictive scaling for workloads with predictable daily or weekly patterns.
Batching groups multiple prediction requests and processes them together. This is more efficient for GPU-based inference because GPUs are designed for parallel computation. The trade-off is slightly increased latency for individual requests.
| Optimization | Savings | Trade-off |
|---|---|---|
| Spot instances (training) | 60-90% | Job interruptions, checkpointing overhead |
| Right-sizing GPUs | 20-50% | Requires profiling effort |
| Model quantization (INT8) | 2-4x throughput | Minimal accuracy loss (< 1% typical) |
| Model distillation | 5-10x smaller model | Requires additional training effort |
| Autoscaling to zero | 90%+ during idle | Cold start latency |
| Request batching | 2-5x throughput | Increased P99 latency |
MLOps Anti-Patterns
Knowing what not to do is just as valuable as knowing the best practices. These are the most common anti-patterns that ML teams encounter.
1. Notebook-Driven Development
The pattern: Data scientists develop models entirely in Jupyter notebooks. The "production deployment" involves copying cells from a notebook into a Python script. No version control, no tests, no code review.
Why it fails: Notebooks encourage non-linear execution (running cells out of order), hidden state, and implicit dependencies. Code that works in a notebook often fails in production because of differences in execution order, environment variables, library versions, and data access patterns.
The fix: Use notebooks for exploration and prototyping only. Once a model shows promise, refactor the code into properly structured Python modules with tests, type hints, and documentation. The training pipeline should run the module code, never a notebook.
2. No Monitoring After Deployment
The pattern: The team celebrates deploying a model to production and moves on to the next project. Nobody checks whether the model's predictions are still accurate two weeks, two months, or two years later.
Why it fails: Models degrade over time. Data distributions shift. User behavior changes. Business conditions evolve. A model that was 95% accurate at deployment may be 70% accurate six months later, and nobody notices until a business stakeholder complains about declining metrics.
The fix: Set up monitoring from day one. Track prediction distributions, data drift, and (when available) accuracy metrics over time. Define alerting thresholds. Automate retraining when degradation exceeds acceptable bounds.
3. Manual Deployments
The pattern: Deploying a new model involves SSH-ing into a server, copying a model file, restarting a service, and hoping nothing breaks. There is no rollback mechanism, no canary deployment, and no automated validation.
Why it fails: Manual deployments are slow, error-prone, and risky. They create a bottleneck where only specific team members know the deployment process. When a bad model is deployed, rolling back is painful and time-consuming.
The fix: Automate the deployment pipeline. Use blue-green or canary deployments. Implement automated model validation before production promotion. Ensure one-click (or zero-click) rollback to the previous model version.
4. Training-Serving Skew
The pattern: Features are computed differently in training (batch, in Python) and serving (real-time, in Java/Go). The training code uses pandas, the serving code uses a different library, and they produce slightly different results for the same input.
Why it fails: Even small numerical differences in feature computation can cause significant degradation in model performance. A feature that rounds differently, handles nulls differently, or computes a moving average with a different implementation can produce predictions that do not match what the model learned during training.
The fix: Use a feature store that serves the same feature values to both training and serving. If a feature store is not feasible, write feature computation logic once and share it between training and serving (e.g., as a shared library).
5. Over-Engineering the Platform
The pattern: Before deploying a single model, the team spends six months building a "complete MLOps platform" with a feature store, model registry, automated retraining, A/B testing, and multi-model serving.
Why it fails: You cannot know what you need until you have models in production. Teams that build platforms speculatively end up with complex infrastructure that does not fit their actual workflows. Meanwhile, the business waits months for any ML value.
The fix: Start simple. Deploy your first model with basic tooling (a script, a container, basic monitoring). Add infrastructure as you encounter real problems. Let production experience drive platform investment, not speculation.
6. Ignoring Data Quality
The pattern: The team focuses all its energy on model architecture and hyperparameter tuning while treating data as a given. Data pipeline failures, schema changes, and quality issues are discovered only when model performance degrades.
Why it fails: In most ML systems, data quality has a larger impact on model performance than model architecture. A simple model on clean, high-quality data will outperform a sophisticated model on noisy, inconsistent data.
The fix: Invest in data quality from the start. Implement data contracts, schema validation, distribution monitoring, and missing value checks. Treat data quality as a first-class concern, not an afterthought.
Designing Your MLOps Architecture
Building a complete MLOps system can feel overwhelming. The reference architecture above has many components, and implementing all of them at once is neither practical nor advisable. Here is a pragmatic approach to building out your MLOps capabilities.
Start with the basics: Version control for code, a basic training script (not a notebook), a containerized serving endpoint, and simple monitoring (latency, error rate, prediction distribution). This gets you to a functional Level 0.5.
Add pipeline automation: Convert your training script into a pipeline with data validation, training, evaluation, and registration stages. Add experiment tracking. This is Level 1.
Add a feature store: When you have more than two models sharing features, or when you encounter training-serving skew, introduce a feature store. Start with a lightweight option like Feast before investing in a managed service.
Add CI/CD and automated validation: When you have enough models that manual validation becomes a bottleneck, automate model testing and deployment. Add canary deployments and automated retraining triggers. This is Level 2.
Add advanced monitoring: When you have enough production experience to understand your failure modes, invest in data drift detection, segment-level monitoring, and automated alerting.
For patterns specific to ML system design, including recommended approaches for different types of models and data, see our dedicated guide. If you are building real-time AI systems, our guide on Real-World AI System Architecture covers production deployment patterns in depth.
Conclusion
MLOps is not a single tool or platform. It is a set of engineering practices that make machine learning systems reliable, reproducible, and maintainable in production. The reference architecture in this guide provides a blueprint, but the most important thing is to start where you are and build incrementally.
The key takeaways:
- Maturity is a journey. Move from Level 0 to Level 1 before attempting Level 2. Each level solves real problems and provides a foundation for the next.
- Monitoring is not optional. ML models degrade silently. If you deploy a model without monitoring, you are flying blind.
- Automate the pipeline, not just the training. CI/CD for ML covers code, data, and models. All three need testing and validation.
- Start simple. Over-engineering your platform before you have production experience is one of the most common and expensive mistakes in MLOps.
- Data quality matters more than model complexity. Invest in data quality monitoring, validation, and contracts early.
The MLOps ecosystem is maturing rapidly, and the tools available in 2026 make it significantly easier to build production ML systems than even two years ago. The challenge is no longer "can we build this infrastructure" but "can we adopt the right practices and build incrementally."
Ready to design your MLOps architecture? Try InfraSketch's ML System Design Tool to generate architecture diagrams from natural language descriptions. Describe your ML pipeline, and get an interactive diagram you can refine, annotate, and export as a design document.
