Enterprise GPU-as-a-Service Architecture with Red Hat OpenShift AI

The demand for GPU compute in enterprise environments has exploded.

At Kloia, we kept seeing the same pattern: organizations buying GPU capacity they couldn't share efficiently. So we designed and deployed a GPU-as-a-Service (GPUaaS) architecture that solves this problem using Red Hat OpenShift AI on AWS. This post walks through the architecture in detail.

Why OpenShift AI on AWS?

AWS provides the broadest selection of GPU instance types - A100s, H100s, H200s - across multiple regions, with Elastic Fabric Adapter (EFA) for high-bandwidth inter-node communication. But raw cloud GPUs alone don't give you a platform. You still need orchestration, multi-tenancy, security, model lifecycle management, and cost control.

OpenShift bridges that gap. It's an enterprise Kubernetes distribution with built-in security policies, RBAC, and operator-based lifecycle management.

And Red Hat OpenShift AI provides the complete AI/ML lifecycle. Engineers develop models in Jupyter notebooks, track experiments, deploy through automated MLOps pipelines built on Kubeflow Pipelines, and serve models through inference endpoints.

All inference endpoints expose standardized OpenAI-compatible APIs, so consuming applications don't need to know which runtime is serving the model. The platform supports multiple model architectures through a unified serving layer.
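Because every endpoint speaks the OpenAI API shape, a minimal client needs nothing model-specific. As a sketch - the base URL and model name below are placeholders, not values from our platform:

```python
import json
import urllib.request


def chat_completion_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def query_model(base_url: str, model: str, prompt: str) -> str:
    """POST to the /v1/chat/completions route of an inference endpoint.

    base_url is hypothetical - substitute the route exposed by your
    model serving deployment.
    """
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_completion_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping the serving runtime behind the route changes nothing in this client; that is the point of the unified serving layer.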

And a centralized model registry provides versioned, tagged storage for trained model artifacts. This is the single source of truth for what models exist, who trained them, what data they were trained on, and which versions are deployed.

The combination gives you a self-service AI platform where data scientists can request GPU resources on demand, train or fine-tune models, and deploy them to production - all without filing infrastructure tickets.

Core Infrastructure: Multi-Zone Architecture

Control Plane

The OpenShift control plane runs across three master nodes. In the multi-zone configuration, these are distributed across three AWS availability zones in the UAE region, so the cluster management layer remains operational even if an entire availability zone becomes unavailable.

Failover is automatic - no manual intervention required.

GPU Worker Nodes and MachineSets

Each GPU type in the cluster is managed through its own OpenShift MachineSet. In our architecture, separate MachineSets exist for A100, H100, and H200 instance types, each mapping to a specific EC2 instance family. This separation exists because each GPU type runs a different instance type with different pricing, availability characteristics, and workload suitability. It's the natural way to organize heterogeneous GPU infrastructure in OpenShift.

What makes this operationally significant is how it integrates with the Cluster Autoscaler. Because each GPU type is its own MachineSet with independent min/max replica counts, the autoscaler provisions nodes per GPU type based on pending pod requests.
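As a sketch of how this is wired up - the MachineSet name, namespace, and replica bounds here are illustrative - each GPU family's MachineSet gets its own MachineAutoscaler, which is what gives it independent min/max scaling:

```yaml
# Illustrative only: names, instance family, and bounds are placeholders.
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-h100-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 0                           # scale the pool to zero when idle
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: cluster-gpu-h100-zone-a          # MachineSet backed by H100 instances
```

One such object per GPU type lets the autoscaler grow and shrink each pool independently of the others.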

 

GPU Resource Management and Auto-Scaling

This is the core of the GPUaaS value proposition. The system continuously balances GPU availability with cost efficiency through several integrated components, creating a fluid, demand-driven resource pool.

Resource Inventory and Intelligent Scheduling

The OpenShift API maintains a real-time inventory of available GPUs by type — how many A100s are idle, how many H100s are allocated, what's the queue depth for H200 requests.

The scheduler is configured with GPU-aware policies. Priority-based scheduling ensures production inference preempts development notebooks when resources are constrained. Resource matching places workloads on the right GPU type based on affinity rules and GPU Feature Discovery (GFD) labels. Preemption policies define which workloads can be evicted and which are protected.
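A minimal sketch of the priority side of this, using standard Kubernetes PriorityClasses (the names and values are illustrative, not our production settings):

```yaml
# Illustrative PriorityClass pair: production inference outranks notebooks.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-production
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Production inference; may preempt lower-priority GPU workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: notebook-dev
value: 1000
preemptionPolicy: Never                  # notebooks never preempt anything
description: "Development notebooks; evictable when GPUs are constrained."
```

Pods reference these via `priorityClassName`, and the scheduler handles eviction ordering from there.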

Two-Tier Auto-Scaling in Detail

The auto-scaling system operates at two distinct layers, each solving a different problem. Understanding the boundary between them is important because it determines how fast the system responds and what it costs.

Tier 1 — Pod-level scaling (seconds). KEDA watches application-specific metrics through ScaledObject custom resources: inference queue depth from the vLLM metrics endpoint, request latency percentiles from Prometheus, GPU utilization reported by the DCGM Exporter, or even custom metrics like tokens-per-second throughput.

When a trigger fires, KEDA adjusts the replica count of the model serving Deployment. If there's available GPU capacity on existing nodes, the new pods schedule immediately.

KEDA also supports scale-to-zero, which is critical for cost management. A model that nobody is querying at 2 AM can scale down to zero replicas, freeing its GPU for other workloads or allowing the node itself to be reclaimed.

Additionally, KEDA supports cron-based scheduling. If you know your finance team runs batch inference every morning at 8 AM, you can pre-scale the deployment before demand arrives, eliminating cold start entirely.
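A hedged sketch of what such a ScaledObject can look like - the Deployment name, query, thresholds, and schedule are invented for illustration:

```yaml
# Illustrative KEDA ScaledObject combining a Prometheus trigger with a
# cron-based pre-scale window; all names and numbers are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving                      # the model serving Deployment
  minReplicaCount: 0                       # scale-to-zero when idle
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.openshift-monitoring.svc:9090
        query: sum(vllm:num_requests_waiting)   # inference queue depth
        threshold: "10"
    - type: cron                           # pre-scale for the 8 AM batch window
      metadata:
        timezone: Asia/Dubai
        start: "45 7 * * *"
        end: "0 12 * * *"
        desiredReplicas: "4"
```

The cron trigger guarantees warm replicas before the known demand spike; the Prometheus trigger handles everything unplanned.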

 

Tier 2 — Node-level scaling (minutes): When KEDA scales up pods but there's no GPU capacity available to schedule them, the pods enter a `Pending` state with an Unschedulable condition.

The OpenShift Cluster Autoscaler detects these pending pods, identifies which MachineSet can satisfy the GPU resource request (based on the nvidia.com/gpu resource and node affinity labels), and triggers the MachineSet to provision a new EC2 instance. The new node boots, joins the cluster, the NVIDIA GPU Operator provisions drivers and DCGM, and the pending pods get scheduled. This cycle takes 3 to 7 minutes depending on instance type and AMI caching.
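At the cluster level, GPU-wide guardrails live in the ClusterAutoscaler resource. A sketch, with illustrative limits and timings:

```yaml
# Illustrative cluster-wide autoscaler config: caps total GPUs and
# reclaims nodes that sit unneeded; values are placeholders.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    gpus:
      - type: nvidia.com/gpu
        min: 0
        max: 32                  # hard ceiling on cluster GPU count
  scaleDown:
    enabled: true
    unneededTime: 10m            # reclaim idle GPU nodes after 10 minutes
```

The per-MachineSet bounds decide where capacity comes from; this object decides how much total GPU the cluster may ever hold and how aggressively idle nodes are reclaimed.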

Active-Active Inference Across Availability Zones

GPU worker nodes can run in an active-active configuration across two availability zones. This isn't just infrastructure redundancy - it directly enables active-active LLM inference at the application layer.

In OpenShift AI, model serving deployments (via KServe or the vLLM ServingRuntime) are standard Kubernetes Deployments with multiple replicas.

When those replicas are spread across two AZs through pod anti-affinity rules and topology spread constraints, you get LLM inference endpoints that are simultaneously serving from both zones. The Internal Network Load Balancer distributes incoming inference requests across all healthy replicas regardless of zone, with health checks removing any unhealthy endpoint from the rotation.
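A sketch of the pod-template fragment that produces this spread (the `app` label is illustrative):

```yaml
# Illustrative pod-template fragment from a serving Deployment:
# spread replicas evenly across AZs, and prefer distinct nodes within a zone.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: llm-serving
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: llm-serving
```

The hard zone constraint guarantees both zones serve; the soft host anti-affinity spreads replicas within each zone without blocking scheduling when nodes are scarce.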

Storage: FSx for NetApp ONTAP

AI workloads need shared, high-performance storage that works across zones. FSx for NetApp ONTAP in Multi-AZ configuration automatically replicates data between availability zones with an active-passive HA setup, providing seamless failover for persistent volumes.

NetApp's ONTAP technology also gives you data deduplication and compression — particularly valuable for AI datasets and model artifacts that contain significant redundant information. For performance-critical workloads, zone-specific EBS volumes still provide high-IOPS local storage with automated backup and cross-zone snapshot replication.
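In Kubernetes terms, FSx for ONTAP is typically consumed through NetApp's Trident CSI driver. A sketch - the class name, backend options, and sizes are placeholders:

```yaml
# Illustrative StorageClass and PVC for shared FSx for ONTAP storage
# via the Trident CSI driver; names and sizes are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-ontap-nas
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets
spec:
  accessModes: ["ReadWriteMany"]   # mountable from any node in any zone
  storageClassName: fsx-ontap-nas
  resources:
    requests:
      storage: 2Ti
```

`ReadWriteMany` is what lets notebooks, training jobs, and serving pods in different zones mount the same datasets and checkpoints.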

Distributed Workload Architecture

Large-scale AI workloads don't fit on a single GPU. The architecture supports the full spectrum of parallelism strategies needed for training and inference at scale.

Multi-node distributed training uses PyTorch's Fully Sharded Data Parallel (FSDP), which shards model parameters, gradients, and optimizer states across GPUs on multiple nodes. The NVIDIA Collective Communications Library (NCCL) handles GPU-to-GPU communication, and the AWS Elastic Fabric Adapter provides the high-bandwidth, low-latency networking that distributed training requires. GPUDirect RDMA additionally allows direct memory access between GPUs across nodes, bypassing the CPU.

Each worker node runs NVIDIA GPU drivers and Kubernetes Device Plugins managed by the NVIDIA GPU Operator. Pods share access to persistent volumes backed by FSx for ONTAP, so training checkpoints and datasets are accessible from any node in the cluster.
This infrastructure supports everything from single-GPU fine-tuning to multi-node training with 4D parallelism (data + tensor + pipeline + sequence parallelism), depending on model size and available hardware.
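Submitted through the Kubeflow training operator, a multi-node run looks roughly like this - the image, script, and replica/GPU counts are hypothetical:

```yaml
# Illustrative Kubeflow PyTorchJob for multi-node FSDP training.
# Image, script name, and counts are placeholders for your environment.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/train:latest          # hypothetical image
              command: ["torchrun", "--nproc_per_node=8", "train_fsdp.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  vpc.amazonaws.com/efa: 1             # attach an EFA interface
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/train:latest
              command: ["torchrun", "--nproc_per_node=8", "train_fsdp.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  vpc.amazonaws.com/efa: 1
```

The operator wires up rendezvous between master and workers; requesting the `vpc.amazonaws.com/efa` resource is what puts NCCL traffic onto EFA.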

Conclusion

 

Building an enterprise AI platform isn't about provisioning GPUs in the cloud. It's about creating a self-service platform where teams can move from experimentation to production without infrastructure friction.

The two-tier autoscaling system ensures you only pay for GPU instances when workloads actually need them. The distributed workload architecture supports the full spectrum of parallelism strategies - from single-GPU fine-tuning to multi-node training with 4D parallelism. And active-active inference across availability zones provides zone-level fault tolerance for production model serving.

If you're evaluating how to deliver AI infrastructure to your organization, we'd love to discuss how this architecture applies to your specific requirements.

Got questions?

Others frequently ask…
  • GPUaaS lets organizations request GPU resources on demand instead of buying fixed capacity upfront. Key differences:
     
    • Pay-per-use: Only pay when you use the GPUs
    • Auto-scaling: Matches demand automatically, scales to zero during idle periods
    • GPU flexibility: Switch between A100s, H100s, or H200s as needed
    • Self-service: Data scientists deploy models without IT tickets
    • Cost efficiency: Eliminates wasted capacity and reduces overall infrastructure costs
    • Multi-tenant: Multiple teams share GPU pools with resource isolation
     
    It transforms GPU infrastructure from capital expense (CapEx) into flexible operational expense (OpEx).
  • OpenShift AI Advantages:
    • Built-in security, RBAC, and multi-tenant isolation
    • Complete AI/ML lifecycle: Jupyter, MLOps pipelines, experiment tracking, model registry
    • OpenAI-compatible APIs work with any model runtime
    • Standardized model serving layer across KServe, vLLM, Seldon
     
    AWS Advantages:
    • Widest GPU selection (A100, H100, H200) across regions
    • Elastic Fabric Adapter (EFA) for low-latency inter-GPU communication
    • High GPU availability and reliable infrastructure
    • Native integration with enterprise AWS environments
     
    Alternatives: Azure AI Studio (Microsoft integration), Google Vertex AI (TensorFlow focus), On-premises OpenShift (data residency needs). In our view, OpenShift on AWS offers the best flexibility for enterprise GPUaaS.
     
  • Tier 1 - Pod-Level Scaling (Seconds)
     
    KEDA monitors metrics from vLLM (inference queue depth), Prometheus (request latency), DCGM (GPU utilization), and custom metrics. When thresholds are crossed:
    • Immediately adjusts model deployment replicas
    • New pods schedule on available GPU capacity
    • Scales to zero when no queries exist, freeing GPUs
    • Pre-scales before known demand patterns (e.g., scheduled batch jobs at 8 AM)
     
    Tier 2 - Node-Level Scaling (3-7 Minutes)
     
    When all GPUs are full and pods remain pending:
    1. Cluster Autoscaler detects pending pods and their GPU requirements
    2. Identifies appropriate MachineSet (A100, H100, or H200)
    3. Provisions new EC2 instance in AWS
    4. NVIDIA GPU Operator installs drivers and monitoring
    5. Pending pods schedule onto new node
     
    Why Two Tiers? Pod scaling responds in seconds to traffic spikes; node scaling amortizes EC2 launch costs across multiple deployments.
     
  • Traditional setups use active-passive redundancy (one zone serves, one waits). This architecture enables true active-active serving:
     
    Pod Distribution:
    • GPU worker nodes run simultaneously across multiple availability zones
    • Model serving deployments have replicas spread across zones via pod anti-affinity rules and topology spread constraints
    • Both zones actively serve inference requests at the same time
     
    Traffic Distribution:
    • Internal Network Load Balancer distributes incoming requests across all healthy replicas
    • Health checks automatically remove unhealthy endpoints
    • Requests failover transparently if a zone goes down
     
    Benefits:
    • True Redundancy: Both zones actively serve; neither is idle
    • Higher Throughput: Doubles inference capacity
    • Automatic Failover: Single AZ failure doesn't degrade service
    • Zero Cold Starts: No warm-up during failover
     
    This is more than infrastructure redundancy; it enables active-active LLM inference at the application layer.
     
  • The architecture supports four distributed training strategies:
     
    1. Data Parallelism: Dataset split across GPUs, each holds full model copy. Best for large datasets.
    2. Tensor Parallelism: Model layers split horizontally across GPUs. Best for models needing extra parallelization.
    3. Pipeline Parallelism: Model split into sequential stages across GPUs. Best for very large sequential models.
    4. Sequence Parallelism: Long sequences split across GPUs. Critical for LLMs with multi-thousand-token contexts.
     
    Technical Foundation:
    • PyTorch FSDP: Partitions model parameters, gradients, optimizer states across GPUs
    • NVIDIA NCCL: GPU-to-GPU communication library
    • AWS Elastic Fabric Adapter (EFA): Low-latency, high-bandwidth networking
    • GPUDirect RDMA: Direct GPU memory access, bypasses CPU bottlenecks
    • FSx NetApp ONTAP: Shared persistent storage across all nodes with deduplication and compression
     
    Supported Scenarios: Single-GPU fine-tuning, multi-GPU data parallelism (2-16 GPUs), multi-node distributed training, and 4D parallelism for trillion-parameter models.
Emre Kasgur

Software Engineer @kloia