AI Pipeline System Design: Data Ingestion, Training, and Serving

Building a production AI system requires far more than training a model in a notebook and calling it done. The real challenge lies in designing the end-to-end ML system design architecture that moves data from raw sources through feature engineering, training, evaluation, and finally into serving infrastructure where predictions reach users. Each phase introduces its own set of engineering problems around reliability, scalability, and correctness.

This guide walks through all three phases of an AI pipeline system design (data ingestion, training, and serving), covering the infrastructure decisions, orchestration patterns, and common pitfalls that separate prototype ML systems from production-grade ones. Whether you are preparing for a system design interview or building a real ML platform, understanding how these components connect is essential.

The Three Phases of an AI Pipeline

Every production ML system can be decomposed into three major phases:

Data Ingestion and Feature Engineering: Getting raw data into the system, validating it, and transforming it into features the model can consume.
Training Infrastructure: Computing model weights from training data, tuning hyperparameters, and managing GPU resources efficiently.
Serving Infrastructure: Deploying trained models to handle prediction requests in real time, in batch, or at the edge.

These phases are not isolated. They share data contracts, depend on consistent feature transformations, and require coordinated orchestration. A failure in any phase cascades through the entire pipeline.

+------------------+     +------------------+     +------------------+
|  Data Ingestion  | --> | Model Training   | --> | Model Serving    |
|  & Feature Eng.  |     | & Evaluation     |     | & Monitoring     |
+------------------+     +------------------+     +------------------+
        |                        |                        |
        v                        v                        v
  Feature Store          Model Registry           Prediction Logs
  Data Catalog           Experiment Tracker       A/B Test Results

The rest of this guide examines each phase in detail, then connects them into a unified architecture.

Data Ingestion Architecture

Data ingestion is the foundation of any AI pipeline. The quality, freshness, and reliability of your data directly determine the upper bound of your model's performance. There are three primary ingestion patterns: batch, streaming, and hybrid.

Batch Ingestion with ETL

Batch ingestion processes data in scheduled intervals (hourly, daily, or weekly). This is the most common pattern for training pipelines because models are typically retrained on a cadence rather than continuously.

Common tools:

Apache Spark: Distributed processing for large-scale ETL
AWS Glue / Azure Data Factory: Managed ETL services
dbt: SQL-based transformation for warehouse-centric pipelines
Apache Airflow: Orchestration of batch ETL DAGs

Typical batch ingestion flow:

+-----------+     +-----------+     +------------+     +---------------+
| Source DBs| --> | Extract   | --> | Transform  | --> | Load to       |
| APIs      |     | (Pull)    |     | (Clean,    |     | Data Lake /   |
| Files     |     |           |     |  Validate) |     | Feature Store |
+-----------+     +-----------+     +------------+     +---------------+

Batch ingestion works well when your data sources update infrequently and your model does not need predictions based on the last few seconds of data. The trade-off is latency: data may be hours or days old by the time it reaches the model.

Streaming Ingestion with Kafka and Kinesis

Streaming ingestion processes data as it arrives, enabling near-real-time feature computation and model updates. This pattern is necessary for use cases like fraud detection, recommendation engines, and real-time bidding.

Common tools:

Apache Kafka: High-throughput distributed event streaming
Amazon Kinesis: Managed streaming on AWS
Apache Flink: Stateful stream processing
Apache Pulsar: Multi-tenant, geo-replicated messaging

Streaming ingestion architecture:

+-------------+     +----------+     +----------------+     +---------+
| Event       | --> | Kafka /  | --> | Stream         | --> | Feature |
| Producers   |     | Kinesis  |     | Processor      |     | Store   |
| (Apps, IoT, |     | (Buffer) |     | (Flink/Spark   |     | (Online |
|  Sensors)   |     |          |     |  Streaming)    |     |  Store) |
+-------------+     +----------+     +----------------+     +---------+

Streaming pipelines are more complex to operate. You need to handle out-of-order events, exactly-once semantics, backpressure, and schema evolution. The operational overhead is significant, so only adopt streaming when the use case truly demands low-latency features.

Hybrid Approach

Most production ML systems use a hybrid approach, sometimes called the Lambda architecture. Batch pipelines handle historical data and periodic retraining, while streaming pipelines compute real-time features for serving. The feature store bridges the two by providing a unified interface for both batch and real-time features.

Data Validation and Quality Checks

Raw data is unreliable. Schemas change without notice, upstream services introduce bugs, and data distributions shift over time. Without validation, these issues silently degrade model performance.

Validation strategies:

Schema validation: Enforce column types, required fields, and allowed values using tools like Great Expectations, Pandera, or TensorFlow Data Validation (TFDV).
Statistical validation: Monitor distributions, null rates, and value ranges. Alert when metrics deviate beyond thresholds.
Referential integrity: Verify that foreign keys, join keys, and entity identifiers are consistent across datasets.
Freshness checks: Confirm that data partitions arrive on schedule. Missing partitions should block downstream processing.

# Example: Great Expectations validation
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("training_data.csv")

# Schema checks
validator.expect_column_to_exist("user_id")
validator.expect_column_values_to_be_of_type("user_id", "int64")

# Statistical checks
validator.expect_column_values_to_not_be_null("label", mostly=0.99)
validator.expect_column_mean_to_be_between("feature_a", min_value=0.1, max_value=10.0)

# Distribution checks
validator.expect_column_kl_divergence_to_be_less_than(
    "feature_b", partition_object=reference_distribution, threshold=0.1
)

Data validation should be a hard gate in your pipeline. If validation fails, the pipeline should halt and alert rather than silently pushing bad data into training.

Feature Engineering Pipelines

Feature engineering transforms raw data into the numeric representations that models consume. The challenge is doing this consistently between training and serving, at scale, and with low latency for real-time use cases.

Batch Feature Computation

Batch features are computed on a schedule over historical data. These include aggregations (average transaction amount over 30 days), window functions (rolling counts), and complex joins across multiple tables.

Common patterns:

SQL-based transforms: Run in Spark SQL or warehouse (BigQuery, Redshift, Snowflake)
PySpark jobs: For transforms requiring custom logic
dbt models: For teams that prefer SQL-first feature engineering

Batch features are typically materialized into an offline feature store for training and backfilled into the online store for serving.

Real-Time Feature Computation

Some features must be computed at request time because they depend on data that is only seconds old. Examples include the number of login attempts in the last five minutes, current session duration, and the most recent item a user viewed.

Real-time features are computed by stream processors (Flink, Spark Structured Streaming) consuming from Kafka or Kinesis. They are written to a low-latency online store (Redis, DynamoDB, Cassandra) for retrieval during inference.

+----------+     +---------+     +----------------+     +-------------+
| Kafka    | --> | Flink   | --> | Online Feature | --> | Serving     |
| Events   |     | Window  |     | Store (Redis)  |     | Application |
|          |     | Agg.    |     |                |     |             |
+----------+     +---------+     +----------------+     +-------------+

Feature Stores for Consistency

A feature store is a centralized system that manages the storage, serving, and discovery of ML features. It solves the critical problem of training-serving skew by ensuring that the same transformation logic produces features for both training and inference.

Key capabilities of a feature store:

Capability	Description
Dual storage	Offline store for training (S3/warehouse), online store for serving (Redis/DynamoDB)
Point-in-time correctness	Retrieves features as they existed at a specific timestamp, preventing data leakage
Feature registry	Searchable catalog of all features with metadata, ownership, and lineage
Materialization	Automated syncing of features from offline to online stores
Monitoring	Tracks feature freshness, coverage, and distribution drift

Popular feature store implementations:

Feast: Open-source, cloud-agnostic
Tecton: Managed platform built on Feast concepts
Amazon SageMaker Feature Store: AWS-native
Databricks Feature Store: Integrated with MLflow
Hopsworks: Open-source with extensive support for real-time features

For a deeper dive into feature store architecture, see our dedicated guide on Feature Store System Design.

Transformation Registries

A transformation registry centralizes the definitions of all feature transformations so they can be reused, versioned, and applied consistently. Without one, teams end up duplicating transformation logic between training scripts and serving code, which is one of the most common sources of training-serving skew.

# Example: Feast feature definition
from feast import Entity, Feature, FeatureView, FileSource, ValueType

user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="avg_transaction_30d", dtype=ValueType.FLOAT),
        Feature(name="login_count_7d", dtype=ValueType.INT64),
        Feature(name="account_age_days", dtype=ValueType.INT64),
    ],
    batch_source=FileSource(
        path="s3://features/user_features.parquet",
        timestamp_field="event_timestamp",
    ),
)

By defining features declaratively, the feature store can generate the correct retrieval logic for both batch (training) and online (serving) contexts.

Training Infrastructure

Training infrastructure must balance three competing concerns: speed (time to train), cost (GPU hours are expensive), and reproducibility (every experiment must be traceable). The choices you make here depend on model size, dataset scale, and budget.

Single-Node Training

For models that fit in the memory of a single GPU (most tabular ML, small to medium deep learning models), single-node training is the simplest and most cost-effective approach. Use a single GPU instance (AWS p3.2xlarge, GCP n1-standard with T4/V100) with frameworks like PyTorch, TensorFlow, or XGBoost.

When single-node is sufficient:

Dataset fits in memory (or can be loaded in chunks)
Model has fewer than a few hundred million parameters
Training completes in a reasonable time (under a few hours)
You do not need to iterate on experiments rapidly

Distributed Training

When models or datasets exceed what a single machine can handle, you need distributed training. There are two primary strategies.

Data Parallelism: Each worker holds a complete copy of the model and processes a different shard of the training data. Gradients are synchronized across workers after each step (via AllReduce or parameter servers). This is the most common distributed training strategy and scales well for models that fit on a single GPU.

+----------+     +----------+     +----------+
| Worker 1 |     | Worker 2 |     | Worker 3 |
| (Full    |     | (Full    |     | (Full    |
|  Model)  |     |  Model)  |     |  Model)  |
| Shard 1  |     | Shard 2  |     | Shard 3  |
+----+-----+     +----+-----+     +----+-----+
     |                |                |
     +-------+--------+--------+------+
             |                 |
        Gradient Sync (AllReduce)
             |
       +-----+-----+
       | Updated   |
       | Model     |
       +-----------+

Model Parallelism: The model itself is split across multiple GPUs because it is too large to fit in a single GPU's memory. Different layers or tensor slices reside on different devices. This is necessary for large language models (billions of parameters) and requires careful pipeline scheduling to minimize idle GPU time.

Frameworks for distributed training:

PyTorch Distributed (DDP): Native data parallelism in PyTorch
DeepSpeed: Microsoft's library for large model training (ZeRO optimizer)
Horovod: Framework-agnostic distributed training library
Ray Train: Distributed training with Ray's actor model
FSDP (Fully Sharded Data Parallel): PyTorch's memory-efficient alternative to DDP

Hyperparameter Tuning

Hyperparameter tuning searches the space of model configurations to find the best performing set. This is inherently parallelizable and benefits significantly from distributed infrastructure.

Tuning strategies:

Strategy	How It Works	Best For
Grid Search	Exhaustive search over defined grid	Small search spaces, few hyperparameters
Random Search	Random sampling from distributions	Medium search spaces, faster than grid
Bayesian Optimization	Models objective function, picks promising points	Expensive evaluations, continuous parameters
Hyperband / ASHA	Early stopping of poor trials	Large search spaces, limited budget
Population-Based Training	Evolves hyperparameters during training	Long training runs, RL and GANs

Tools:

Optuna: Pythonic, supports pruning and multiple backends
Ray Tune: Distributed tuning with support for all strategies above
W&B Sweeps: Integrated with Weights & Biases experiment tracking
SageMaker Automatic Model Tuning: Managed Bayesian optimization

Spot and Preemptible Instances for Cost Reduction

GPU instances are expensive. A single p4d.24xlarge (8x A100 GPUs) costs over $30 per hour on-demand. Spot instances (AWS) or preemptible VMs (GCP) offer 60-90% discounts but can be reclaimed with short notice.

Making spot instances work for training:

Checkpointing: Save model state every N steps to durable storage (S3, GCS). On preemption, resume from the last checkpoint.
Elastic training: Use frameworks like PyTorch Elastic or Ray Train that can scale workers up and down dynamically.
Mixed fleets: Combine on-demand base capacity with spot instances for burst capacity.
Instance diversification: Request multiple instance types to reduce the probability of simultaneous preemption across all workers.

# Example: PyTorch checkpointing for spot instance resilience
import torch

def save_checkpoint(model, optimizer, epoch, step, loss, path):
    torch.save({
        "epoch": epoch,
        "step": step,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, path)

def load_checkpoint(model, optimizer, path):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"], checkpoint["step"]

GPU Cluster Management

At scale, managing GPU clusters becomes a discipline of its own. Key considerations include:

Scheduling: Use Kubernetes with GPU device plugins or Slurm for HPC-style scheduling. Ensure fair queuing across teams.
Multi-tenancy: Isolate workloads using namespaces, resource quotas, and GPU partitioning (MIG on A100s).
Networking: Distributed training is bottlenecked by inter-node communication. Use high-bandwidth interconnects (NVLink, InfiniBand, EFA on AWS).
Storage: Training data must be accessible at high throughput. Use parallel file systems (Lustre, FSx) or high-performance object storage.
Monitoring: Track GPU utilization, memory usage, and communication overhead. Low GPU utilization indicates a bottleneck elsewhere (data loading, CPU preprocessing, network).

+-------------------+     +-------------------+
| GPU Cluster       |     | Shared Storage    |
| +------+ +------+ |     | +---------------+ |
| | Node | | Node | |     | | Training Data | |
| | 8xGPU| | 8xGPU| |<--->| | (Lustre/FSx)  | |
| +------+ +------+ |     | +---------------+ |
| +------+ +------+ |     | +---------------+ |
| | Node | | Node | |     | | Checkpoints   | |
| | 8xGPU| | 8xGPU| |<--->| | (S3/GCS)      | |
| +------+ +------+ |     | +---------------+ |
+-------------------+     +-------------------+
        |
+-------------------+
| Job Scheduler     |
| (K8s / Slurm)     |
+-------------------+

Model Evaluation and Validation

Training a model is only half the battle. Before any model reaches production, it must pass through a rigorous evaluation and validation pipeline that goes beyond checking a single accuracy number.

Offline Metrics

Offline evaluation computes metrics on held-out test sets. The specific metrics depend on the problem type:

Classification: Precision, recall, F1, AUC-ROC, AUC-PR, calibration curves
Regression: MAE, RMSE, R-squared, quantile errors
Ranking: NDCG, MAP, MRR
Generative models: Perplexity, BLEU, ROUGE, human evaluation scores

Beyond aggregate metrics, evaluate performance across slices of your data. A model that performs well overall but fails for a specific demographic or region is not ready for production. Tools like Fairlearn, Aequitas, and What-If Tool support slice-based evaluation.

A/B Testing Integration

Offline metrics correlate imperfectly with business outcomes. A/B testing (or online experimentation) measures the actual impact of a new model on users. The evaluation pipeline should integrate with the experimentation platform to:

Generate predictions from both the current model and the candidate model
Route a percentage of traffic to each model
Collect outcome metrics (click-through rate, conversion, revenue)
Run statistical tests to determine if the candidate is significantly better

Design your serving infrastructure to support traffic splitting from the start. Retrofitting A/B testing into a monolithic serving system is painful.

Champion-Challenger Pattern

The champion-challenger pattern (also called shadow mode or canary deployment) provides a safer alternative to full A/B testing. The current production model (champion) serves all real traffic. The new model (challenger) receives the same requests but its predictions are logged rather than returned to users.

Benefits:

No risk to user experience during evaluation
Collects realistic prediction distributions for comparison
Can run alongside A/B tests for additional validation

Flow:

                +---> Champion Model ---> Response to User
User Request ---|
                +---> Challenger Model ---> Log predictions
                                            (compare offline)

Model Registry

A model registry is the central repository for trained model artifacts. It serves as the handoff point between training and serving, storing not just the model weights but also:

Metadata: Training dataset version, hyperparameters, training duration
Metrics: All offline evaluation results
Lineage: Which code commit, feature versions, and data snapshots produced this model
Stage: Development, staging, production, or archived
Signatures: Input/output schema for serving validation

Popular model registries:

MLflow Model Registry: Open-source, widely adopted
Weights & Biases Model Registry: Integrated with experiment tracking
Amazon SageMaker Model Registry: AWS-native
Vertex AI Model Registry: GCP-native

A model should not be deployed to production without being registered, evaluated, and explicitly promoted through the registry's stage gates.

Serving Infrastructure

Serving infrastructure delivers model predictions to end users and downstream systems. The requirements vary dramatically depending on whether you need real-time responses, batch predictions, or edge deployment.

Online Serving with REST and gRPC

Online serving handles individual prediction requests with low-latency requirements (typically under 100ms). The model runs behind an API endpoint that accepts input features and returns predictions.

Architecture components:

+--------+     +-----------+     +---------------+     +-----------+
| Client | --> | Load      | --> | Model Server  | --> | Response  |
|        |     | Balancer  |     | (TF Serving / |     |           |
|        |     | (ALB/NLB) |     |  Triton /     |     |           |
|        |     |           |     |  TorchServe)  |     |           |
+--------+     +-----------+     +---------------+     +-----------+
                                       |
                                 +-----+-----+
                                 | Feature   |
                                 | Store     |
                                 | (Online)  |
                                 +-----------+

Model serving frameworks:

Framework	Strengths	Best For
TensorFlow Serving	Mature, gRPC native, model versioning	TensorFlow models
Triton Inference Server	Multi-framework, dynamic batching, GPU optimization	Mixed model types, GPU serving
TorchServe	PyTorch native, easy model packaging	PyTorch models
BentoML	Framework-agnostic, simple API, containerized	Rapid prototyping, multi-framework
Seldon Core	Kubernetes-native, inference graphs, A/B testing	K8s environments
Ray Serve	Python-native, composable, scales horizontally	Complex inference pipelines

Key serving concerns:

Autoscaling: Scale model replicas based on request rate, latency percentiles, or GPU utilization. Use Kubernetes HPA or cloud-specific autoscalers.
Batching: Accumulate multiple requests and process them together to maximize GPU throughput. Triton and TF Serving support dynamic batching natively.
Caching: Cache predictions for repeated inputs. This is especially effective for recommendation systems where the same items are scored frequently.
Health checks: Implement readiness and liveness probes. A model server that loaded a corrupt model should not receive traffic.

Batch Prediction

Not all predictions need to happen in real time. Batch prediction processes large volumes of inputs on a schedule and writes results to a database or data warehouse for later retrieval.

When to use batch prediction:

Predictions are needed for all entities periodically (daily recommendations, risk scores)
Latency requirements are relaxed (hours, not milliseconds)
Input data is only available in bulk
Cost optimization is important (batch processing is more efficient per prediction)

Architecture:

+------------+     +-----------+     +---------------+     +----------+
| Input Data | --> | Spark /   | --> | Model         | --> | Results  |
| (S3/       |     | Dataflow  |     | (loaded from  |     | (DB /    |
|  Warehouse)|     |           |     |  registry)    |     |  Cache)  |
+------------+     +-----------+     +---------------+     +----------+

Batch prediction jobs run on the same distributed compute infrastructure as training (Spark, Ray, SageMaker Batch Transform) but with read-only model access.

Edge Deployment

Edge deployment pushes models to devices that are close to the data source: mobile phones, IoT sensors, autonomous vehicles, or on-premise servers. This is necessary when network latency is unacceptable, connectivity is unreliable, or data privacy prevents sending raw data to the cloud.

Edge deployment considerations:

Model compression: Quantization (INT8/INT4), pruning, knowledge distillation to reduce model size and inference cost.
Runtime: TensorFlow Lite, ONNX Runtime, Core ML, TensorRT for hardware-specific optimization.
Updates: Over-the-air (OTA) model updates with rollback capability.
Monitoring: Collect inference telemetry from edge devices for aggregate monitoring (you cannot inspect individual requests at scale).

Model Optimization for Serving

Production models often need optimization to meet latency and cost targets. Common techniques include:

Quantization: Reduce precision from FP32 to FP16 or INT8. Reduces memory footprint and speeds up inference, often with minimal accuracy loss.
Pruning: Remove weights that contribute little to predictions. Structured pruning removes entire neurons or channels.
Distillation: Train a smaller "student" model to mimic a larger "teacher" model. The student is faster to serve while retaining most of the teacher's accuracy.
ONNX conversion: Export models to ONNX format for cross-framework optimization and hardware-specific acceleration.
Compilation: Use tools like TVM, XLA, or TensorRT to compile models into optimized machine code for the target hardware.

Orchestration

ML pipelines have complex dependencies between data ingestion, feature computation, training, evaluation, and deployment. Orchestration tools manage these dependencies as directed acyclic graphs (DAGs), handling scheduling, retries, monitoring, and alerting.

Orchestration Tool Comparison

Tool	Architecture	ML-Specific Features	Strengths	Weaknesses
Apache Airflow	Central scheduler, workers	Limited (needs plugins)	Mature, large community, extensive integrations	UI dated, dynamic DAGs complex, not ML-native
Kubeflow Pipelines	Kubernetes-native	Full ML lifecycle	Deep K8s integration, containerized steps, experiment tracking	Requires K8s expertise, complex setup
Dagster	Asset-based, software-defined	Good (IO managers, partitions)	Type safety, testability, asset lineage	Newer community, fewer integrations
Prefect	Hybrid (cloud + agent)	Moderate	Simple Python API, good observability, easy local dev	Less mature than Airflow for complex workflows
ZenML	ML-specific abstraction	Full ML lifecycle	Stack-agnostic, pipeline caching, artifact tracking	Newer, smaller community
Metaflow	Python-native	Full ML lifecycle	Simple API, built-in versioning, Netflix-proven	AWS-centric, limited scheduling

DAG-Based Pipeline Management

Each stage of the ML pipeline becomes a node in a DAG. The orchestrator ensures that stages run in the correct order, retries failed stages, and provides visibility into pipeline health.

Example: Training pipeline DAG

validate_data --> compute_features --> split_data
                                          |
                                    +-----+-----+
                                    |           |
                              train_model   train_baseline
                                    |           |
                                    +-----+-----+
                                          |
                                    evaluate_models
                                          |
                                    register_model
                                          |
                                    deploy_to_staging
                                          |
                                    run_integration_tests
                                          |
                                    promote_to_production

Scheduling and Monitoring

Scheduling patterns:

Cron-based: Run pipelines on fixed schedules (e.g., retrain daily at 2 AM UTC)
Event-driven: Trigger pipelines when new data arrives (S3 event, Kafka topic)
Data-aware: Run only when upstream data partitions are available (Dagster sensors, Airflow data-aware scheduling)

Monitoring essentials:

Pipeline health: Success/failure rates, duration trends, SLA compliance
Data quality: Validation results at each stage, schema drift detection
Model quality: Prediction distribution shifts, feature drift, accuracy degradation
Infrastructure: GPU utilization, memory pressure, queue depth, cost per pipeline run

Set up alerting for pipeline failures, SLA breaches, and data quality violations. Use PagerDuty, OpsGenie, or Slack integrations so the on-call engineer is notified immediately.

Connecting the Phases: End-to-End Architecture

Now that we have examined each phase independently, let us connect them into a complete AI pipeline architecture. The following diagram shows how all components interact in a production ML system.

+===========================================================================+
|                        END-TO-END AI PIPELINE                             |
+===========================================================================+
|                                                                           |
|  DATA INGESTION                                                           |
|  +----------+  +----------+                                               |
|  | Batch    |  | Streaming|                                               |
|  | Sources  |  | Sources  |                                               |
|  | (DBs,    |  | (Kafka,  |                                               |
|  |  Files)  |  |  Events) |                                               |
|  +----+-----+  +----+-----+                                               |
|       |              |                                                    |
|       v              v                                                    |
|  +----+--------------+----+                                               |
|  |  Data Validation       |                                               |
|  |  (Great Expectations)  |                                               |
|  +----+--------------+----+                                               |
|       |              |                                                    |
|       v              v                                                    |
|  +---------+   +-----------+                                              |
|  | Offline  |   | Online    |                                              |
|  | Feature  |   | Feature   |                                              |
|  | Store    |   | Store     |                                              |
|  | (S3/WH)  |   | (Redis)   |                                              |
|  +----+-----+   +-----+-----+                                              |
|       |               |                                                   |
+-------+---------------+---------------------------------------------------+
|       |               |                                                   |
|  TRAINING             |                                                   |
|       v               |                                                   |
|  +-----------+        |                                                   |
|  | Training  |        |                                                   |
|  | Cluster   |        |                                                   |
|  | (GPUs)    |        |                                                   |
|  +-----+-----+        |                                                   |
|        |               |                                                  |
|        v               |                                                  |
|  +------------+        |                                                  |
|  | Evaluation |        |                                                  |
|  | Pipeline   |        |                                                  |
|  +-----+------+        |                                                  |
|        |               |                                                  |
|        v               |                                                  |
|  +------------+        |                                                  |
|  | Model      |        |                                                  |
|  | Registry   |        |                                                  |
|  +-----+------+        |                                                  |
|        |               |                                                  |
+--------+---------------+--------------------------------------------------+
|        |               |                                                  |
|  SERVING               |                                                  |
|        v               v                                                  |
|  +-----+------+  +-----+-----+                                            |
|  | Model      |  | Feature   |                                            |
|  | Server     |<-| Retrieval |                                            |
|  | (Triton/   |  | (Online   |                                            |
|  |  TF Serve) |  |  Store)   |                                            |
|  +-----+------+  +-----------+                                            |
|        |                                                                  |
|        v                                                                  |
|  +------------+     +---------------+                                     |
|  | Prediction |---->| Monitoring    |                                     |
|  | Response   |     | & Alerting    |                                     |
|  +------------+     +-------+-------+                                     |
|                             |                                             |
|                             v                                             |
|                     +-------+-------+                                     |
|                     | Retrain       |                                     |
|                     | Trigger       |-----> (Back to Training)            |
|                     +---------------+                                     |
|                                                                           |
+===========================================================================+
|  ORCHESTRATION: Airflow / Kubeflow / Dagster                              |
+===========================================================================+

Key integration points:

Feature Store as the bridge: The feature store connects ingestion to both training and serving. Training reads from the offline store. Serving reads from the online store. Both use the same feature definitions.
Model Registry as the handoff: The registry is the single source of truth for model artifacts. Training writes to it. Serving reads from it. No model reaches production without passing through the registry.
Monitoring closes the loop: Prediction monitoring detects data drift, model degradation, and feature staleness. When metrics breach thresholds, it triggers retraining automatically or alerts the ML team.
Orchestration ties it together: The orchestration layer manages scheduling, dependencies, and failure handling across all three phases. It ensures that data validation completes before feature computation, features are ready before training starts, and models pass evaluation before deployment.

For more on how these patterns fit into broader MLOps practices, see our guide on MLOps System Design.

Common Pitfalls

Even well-designed ML systems fall prey to recurring problems. Here are the most common pitfalls and how to avoid them.

1. Training-Serving Skew

Training-serving skew occurs when the features or preprocessing applied during training differ from what happens during serving. This is the single most common source of silent model degradation in production.

Causes:

Different code paths for training and serving feature computation
Using future data during training that is unavailable at serving time
Differences in library versions between training and serving environments
Inconsistent handling of missing values, outliers, or categorical encodings

Prevention:

Use a feature store to enforce consistent transformations
Containerize both training and serving environments with identical dependencies
Log serving-time features and periodically compare them against training distributions
Write integration tests that verify feature parity between training and serving code paths

2. Data Leakage

Data leakage means that information from the test set or the future leaks into the training process, producing overly optimistic evaluation metrics that do not generalize to production.

Common sources:

Using features computed from the full dataset (including test rows) during preprocessing
Temporal leakage: using features derived from events after the prediction timestamp
Target leakage: using variables that are proxies for the label (e.g., using "loan approved" to predict "creditworthy")

Prevention:

Use point-in-time-correct feature retrieval (feature store with event timestamps)
Split data temporally, not randomly, for time-series problems
Review feature definitions with domain experts to identify proxy variables
Validate that training metrics and serving metrics are in the same range

3. Ignoring Monitoring

Many teams invest heavily in model development and then deploy without adequate monitoring. Without monitoring, model degradation is invisible until users complain or business metrics decline weeks later.

What to monitor:

Prediction distribution: Compare the distribution of predictions over time. A sudden shift indicates data drift or a broken feature pipeline.
Feature distributions: Monitor each input feature for drift using statistical tests (KS test, PSI, Jensen-Shannon divergence).
Latency: Track p50, p95, and p99 inference latency. Degradation may indicate resource contention or a model that has grown too large.
Error rates: Monitor for increased null predictions, out-of-range values, or serving errors.
Business metrics: Correlate model predictions with downstream outcomes (conversion rates, fraud catch rates, user engagement).

Tools: Evidently AI, Arize, WhyLabs, Fiddler, Amazon SageMaker Model Monitor, Seldon Alibi Detect.

4. Over-Engineering

Not every ML system needs Kubernetes, a feature store, distributed training, and a full MLOps platform on day one. Over-engineering introduces unnecessary complexity, slows iteration, and consumes engineering bandwidth better spent on model improvement.

Start simple:

Begin with batch predictions and a cron job before building real-time serving
Use a single GPU before investing in distributed training
Store features in a database before adopting a feature store
Deploy with a simple Flask/FastAPI server before adopting Triton or Seldon
Use MLflow for experiment tracking before building a custom platform

Scale when you have evidence: Add infrastructure complexity only when you can point to a specific bottleneck or requirement that justifies it. "Netflix does it this way" is not a valid reason for a 10-person team.

For more patterns on when to introduce complexity, see our guide on ML System Design Patterns.

Conclusion

Designing an end-to-end AI pipeline system requires thinking across data ingestion, feature engineering, training, evaluation, serving, and orchestration. Each phase has its own set of infrastructure challenges, and the connections between phases (particularly the feature store and model registry) are where most production ML systems succeed or fail.

The key takeaways from this guide:

Data quality is the foundation. No amount of model sophistication compensates for bad data. Invest in validation, monitoring, and governance early.
Feature stores prevent training-serving skew. A centralized feature store that serves both training and inference is the single most impactful infrastructure investment for ML teams.
Start simple, scale deliberately. Begin with batch ingestion, single-node training, and simple REST serving. Add streaming, distributed training, and GPU-optimized serving only when specific requirements demand it.
Monitoring closes the loop. Production ML systems degrade over time as data distributions shift. Without monitoring and automated retraining triggers, models silently become stale.
Orchestration is the glue. DAG-based pipeline management ensures that all phases run reliably, in the correct order, with proper error handling and observability.

Whether you are designing a fraud detection system, a recommendation engine, or a computer vision pipeline, these patterns and components form the backbone of production ML infrastructure. The specific tools may change, but the architectural principles remain consistent.

Want to visualize your AI pipeline architecture? Try InfraSketch to generate end-to-end ML system diagrams from natural language descriptions. Describe your data sources, training requirements, and serving needs, and get a complete architecture diagram in seconds.

AI Pipeline System Design: Data Ingestion, Training, and Serving

AI Pipeline System Design: Data Ingestion, Training, and Serving

The Three Phases of an AI Pipeline

Data Ingestion Architecture

Batch Ingestion with ETL

Streaming Ingestion with Kafka and Kinesis

Hybrid Approach

Data Validation and Quality Checks

Feature Engineering Pipelines

Batch Feature Computation

Real-Time Feature Computation

Feature Stores for Consistency

Transformation Registries

Training Infrastructure

Single-Node Training

Distributed Training

Hyperparameter Tuning

Spot and Preemptible Instances for Cost Reduction

GPU Cluster Management

Model Evaluation and Validation

Offline Metrics

A/B Testing Integration

Champion-Challenger Pattern

Model Registry

Serving Infrastructure

Online Serving with REST and gRPC

Batch Prediction

Edge Deployment

Model Optimization for Serving

Orchestration

Orchestration Tool Comparison

DAG-Based Pipeline Management

Scheduling and Monitoring

Connecting the Phases: End-to-End Architecture

Common Pitfalls

1. Training-Serving Skew

2. Data Leakage

3. Ignoring Monitoring

4. Over-Engineering

Conclusion

Related Resources

Try InfraSketch Tools

Generate a Diagram

System Design Tool

Design Doc Generator

All Tools

Related Articles

Machine Learning System Design Patterns: The Complete Guide

AI System Design: A Comprehensive Reference Architecture

ML Model Serving System Design: From Prototype to Production