Why GPU Cloud Costs Matter in Enterprise AI

Most enterprises underestimate GPU infrastructure costs because they focus on API spending. A team that runs OpenAI's GPT-4 API or Google's Gemini API at scale will pay millions in token costs annually. But when the same team moves to self-hosted or fine-tuned models, GPU costs dwarf token spending. A single NVIDIA H100 GPU rents for roughly USD 12 per hour on AWS. A cluster of eight H100s costs approximately USD 98 per hour, or over USD 850,000 per year if run continuously.

GPU cloud compute is episodic, not continuous. Most training runs last hours to days. Inference services run continuously but often with low utilisation. The cost challenge is not the hardware itself—it is the combination of on-demand pricing, underutilised capacity, and missed commitment discounts. Enterprises that treat GPU costs as fixed infrastructure expenses rather than optimisable spend typically pay 40 to 85 percent more than necessary.

This matters because GPU costs are the direct infrastructure layer underneath GenAI. When enterprises move beyond managed APIs to hosted inference or fine-tuning, they own the infrastructure bill. Understanding the economic trade-offs between hyperscalers and specialist providers, between on-demand and committed capacity, and between training and inference architectures is now a core FinOps discipline.

Current GPU Pricing: Hyperscaler Comparison (2026)

Hyperscaler on-demand GPU pricing remains the baseline against which all other options are measured. AWS, Azure, and Google Cloud all offer enterprise-grade GPU instances with compliance, SLAs, and ecosystem lock-in benefits. Hyperscaler pricing is intentionally high to encourage commitment purchases or migration to specialist providers.

AWS EC2 GPU Instances

AWS P4d instances with eight NVIDIA A100 GPUs cost approximately USD 32 per hour on-demand. The newer P5 instances with eight H100 GPUs cost approximately USD 98 per hour. A typical GenAI training job running on P5 instances for 72 hours costs over USD 7,000 in compute alone, before data transfer or storage.

AWS P3 instances with NVIDIA V100 GPUs (older, cheaper) run at approximately USD 24 per hour for eight GPUs, making them viable for smaller or less time-critical training workloads. However, V100 instances are being phased out as A100 and H100 become the standard.

Azure Virtual Machines

Azure's ND-series instances (A100 and H100) price comparably to AWS. An ND96asr_v4 instance with eight A100 GPUs costs approximately USD 33 per hour. The ND H100 v5 series with eight H100 GPUs costs a similar rate to AWS, at around USD 98 per hour. Azure pricing also varies by region, with EU regions typically 10 to 15 percent more expensive than US East.

Google Cloud A3 Instances

Google Cloud's A3 instances for enterprise H100 workloads match AWS and Azure pricing at approximately USD 98 per hour for eight H100s. However, GCP includes a unique advantage: automatic Sustained Use Discounts (SUDs), which begin once an instance runs for more than 25 percent of the month and reach up to 30 percent off for full-month usage, with no explicit commitment. This is a significant advantage for continuous inference workloads.

Specialist GPU Cloud Providers

Lambda Labs, CoreWeave, and RunPod operate dedicated GPU clouds at a 60 to 85 percent discount to hyperscaler on-demand rates. Lambda's A100 instances cost USD 1.10 per GPU-hour; H100 instances run at USD 2.49 per GPU-hour. CoreWeave pricing is similar. These specialist providers target training and batch workloads where compliance and ecosystem integration are not required.

The trade-off is substantial. A 72-hour training job on eight CoreWeave H100s costs roughly USD 1,400 in compute; the same job on AWS P5 costs about USD 7,000. However, specialist providers lack the compliance certifications (HIPAA, FedRAMP, GDPR), SLAs, and ecosystem integration (SageMaker, Vertex AI, Azure ML) that regulated enterprises require.
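
The per-job arithmetic is simple enough to sketch directly. The rates below are this section's illustrative figures, not current quotes, and the Lambda H100 rate is assumed equal to CoreWeave's per the text:

```python
# 72-hour, 8-GPU H100 training job cost at the per-GPU rates quoted in
# this section (illustrative figures, not current quotes).
HOURS = 72
GPUS = 8

providers = {
    "AWS P5 on-demand": 98.0 / 8,  # USD 98/hr for the whole 8-GPU instance
    "CoreWeave H100": 2.49,        # per-GPU hourly rate
    "Lambda H100": 2.49,           # assumed similar to CoreWeave, per the text
}

for name, per_gpu_rate in providers.items():
    print(f"{name}: USD {per_gpu_rate * GPUS * HOURS:,.0f}")
```

Substituting your own negotiated rates into the dictionary gives the same comparison for your workloads.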

Commitment Discount Structures: Reserved Instances and Savings Plans

All three hyperscalers offer multi-year commitment discounts that reduce GPU costs by up to 72 percent. These discounts are the single largest lever for cost optimisation on hyperscalers, but most enterprises fail to use them.

AWS Reserved Instances and Savings Plans

AWS Reserved Instances (RIs) for P5 instances provide up to a 72 percent discount for a 3-year commitment, taking a USD 98 per hour on-demand H100 instance down to approximately USD 27 per hour. AWS Savings Plans work similarly but offer flexibility across instance families and regions, making them more valuable for enterprises with mixed GPU workloads.

The commercial challenge: commitment requires forecasting GPU capacity three years in advance and paying upfront for capacity you may not fully utilise. Many enterprises avoid commitments because training workloads are bursty and hard to forecast. However, inference workloads—running models in production—are continuous and predictable, making them ideal for RIs or Savings Plans.
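
The commitment decision reduces to a break-even utilisation. Using this article's illustrative rates (USD 98 on-demand, USD 27 committed, for an eight-H100 instance), a commitment pays off only when the instance runs more than the ratio of the two rates:

```python
# Break-even analysis for a 3-year commitment, using this article's
# illustrative rates (assumptions, not current AWS pricing).
ON_DEMAND = 98.0   # USD/hr, 8x H100 on-demand
COMMITTED = 27.0   # USD/hr effective, 3-year committed rate

# The commitment is paid for every hour whether used or not, so it beats
# on-demand only when utilisation exceeds COMMITTED / ON_DEMAND.
break_even = COMMITTED / ON_DEMAND
print(f"Break-even utilisation: {break_even:.0%}")

def monthly_cost(utilisation, hours_in_month=730):
    on_demand_cost = ON_DEMAND * hours_in_month * utilisation
    committed_cost = COMMITTED * hours_in_month  # paid regardless of use
    return on_demand_cost, committed_cost

# A production inference service running 24/7 (100% utilisation):
od, ri = monthly_cost(1.0)
print(f"On-demand: USD {od:,.0f}/month vs committed: USD {ri:,.0f}/month")
```

At these rates the break-even sits below 30 percent utilisation, which is why always-on inference is the natural candidate for commitments and bursty training is not.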

AWS also offers Enterprise Discount Program (EDP) agreements and GPU commitment negotiation through account managers, where large enterprises can secure custom discounts of up to 75 percent. This requires understanding EDP commitment structures and demonstrating multi-year GPU spend forecasts.

Azure Reservations

Azure Reservations for ND-series H100 instances provide up to a 72 percent discount for 3-year terms. Like AWS commitments, the capacity is paid for whether used or not, with limited flexibility to downgrade or cancel. Azure Reservations are also less granular than AWS Savings Plans, which can make them riskier for enterprises with evolving AI infrastructure plans.

GCP Committed Use Discounts with Automatic SUDs

GCP Committed Use Discounts (CUDs) for A3 instances provide up to a 70 percent discount for 3-year commitments. However, GCP has a unique advantage: automatic Sustained Use Discounts. If an A3 instance runs for more than 25 percent of a month, GCP automatically applies a discount that grows with usage, reaching up to 30 percent for a full month, without any commitment. For continuous inference workloads, GCP's automatic SUD can approach the value of a commitment discount without the upfront cost or lock-in.

This makes GCP the most attractive hyperscaler for continuous, steady-state AI inference where workload forecasting is predictable.
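
GCP's published SUD schedule for eligible instance types bills each successive quarter-month of usage at a lower rate; whether and how it applies to a given GPU SKU should be confirmed against current GCP pricing. A simplified model of that tiered schedule:

```python
# Simplified model of GCP's tiered Sustained Use Discount schedule: each
# successive quarter of the month's usage is billed at a lower rate,
# yielding up to 30% off for full-month usage. Tier rates are illustrative;
# confirm eligibility and rates for your SKU against current GCP pricing.
TIERS = [(0.25, 1.00), (0.50, 0.80), (0.75, 0.60), (1.00, 0.40)]

def sud_effective_rate(usage_fraction):
    """Blended billing rate for a given fraction of the month used."""
    billed, prev = 0.0, 0.0
    for upper, rate in TIERS:
        if usage_fraction <= prev:
            break
        slice_ = min(usage_fraction, upper) - prev
        billed += slice_ * rate
        prev = upper
    return billed / usage_fraction if usage_fraction else 1.0

print(f"{sud_effective_rate(1.00):.0%} of list price at full-month usage")
print(f"{sud_effective_rate(0.25):.0%} of list price at 25% usage")
```

The blended rate at full-month usage works out to 70 percent of list price, i.e. the 30 percent discount described above, with no action from the customer.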

Spot and Preemptible Instances: The Training Workload Opportunity

This section applies the core principles of FinOps for enterprise technology spend to GPU infrastructure. The same cost attribution and workload optimisation disciplines that apply to software licensing apply equally to cloud compute.

Spot instances (AWS), Spot VMs (Azure), and Preemptible VMs (GCP) offer up to 90 percent cost reduction compared to on-demand pricing. A USD 98 per hour H100 on-demand instance can fall to roughly USD 10 per hour on the spot market, though spot prices fluctuate with demand. For training workloads that tolerate interruption, this is transformational.

The constraint is critical: spot capacity can be reclaimed at short notice (a two-minute warning on AWS, as little as 30 seconds on Azure and GCP). This is unacceptable for real-time inference or latency-sensitive production workloads. However, training jobs that checkpoint regularly and can resume from the last checkpoint are ideal candidates.

Spot Instance Architecture Patterns

Enterprises that optimise GPU training costs separate training environments from inference environments. Training clusters run on spot instances with automated checkpoint-and-resume logic. Inference services run on reserved or on-demand capacity in high-availability configurations. This architectural separation is the foundation of cost-optimised AI infrastructure, and aligns with enterprise AI infrastructure governance principles.

PyTorch Lightning, TensorFlow, and Hugging Face Transformers all support distributed checkpointing natively, making spot instance interruption handling straightforward. The operational overhead is minimal when implemented correctly.
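
The checkpoint-and-resume pattern itself is framework-agnostic. A minimal stdlib-only sketch follows; the checkpoint path, file format, and "training" update are illustrative, and a real job would save model and optimiser state via its framework:

```python
import json
import os
import tempfile

# Minimal checkpoint-and-resume sketch for interruptible (spot) training.
CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.json")

def save_checkpoint(step, state):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)  # atomic rename: interruption mid-write cannot corrupt it

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}  # no checkpoint: fresh start

def train(total_steps=100, checkpoint_every=10):
    step, state = load_checkpoint()  # resume wherever the last spot VM died
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step   # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(step, state)
    return step, state

if os.path.exists(CKPT):
    os.remove(CKPT)  # start clean for this demonstration
final_step, final_state = train()
print(final_step, final_state["loss"])
```

If the instance is reclaimed mid-run, simply re-invoking `train()` on replacement capacity picks up from the last multiple of `checkpoint_every`, so at most a few minutes of work is repeated.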

Spot Instance Cost Profiles

A 72-hour training job on AWS P5 spot instances costs approximately USD 700 in compute (roughly USD 10 per instance-hour × 72 hours), plus a small allowance for re-running work lost between checkpoints. The equivalent on-demand cost is about USD 7,000. The spot discount compounds dramatically for long-running training jobs and multi-experiment research phases.

Training vs Inference: Divergent Cost Optimisation Strategies

GPU costs diverge sharply between training and inference, making a unified cost optimisation strategy ineffective.

Training Cost Profile

Training is episodic: a team might run 50 training experiments over a month, each lasting 6 to 72 hours. Total GPU utilisation is unpredictable. The cost lever is per-job, not per-month. Cost optimisation strategy: use spot instances, benchmark training efficiency (GPU utilisation targeting 70+ percent), and avoid standing GPU capacity.

Inference Cost Profile

Inference is continuous: a deployed model serves traffic 24 hours per day, 365 days per year. GPU utilisation is predictable and measurable. The cost lever is the utilisation rate—do not overprovision capacity. Cost optimisation strategy: use reserved instances or commitments, continuously monitor utilisation, and implement request batching and quantisation to reduce per-inference GPU cost.

Enterprises that combine these strategies achieve 40 to 60 percent cost reduction without changing workloads. A hybrid approach—training on spot, inference on reserved—combines benefits of both models.
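
A rough monthly-bill comparison for the hybrid approach, reusing this article's illustrative instance-hour rates (assumptions, not quotes):

```python
# Estimated monthly GPU bill for a hybrid strategy (training on spot,
# inference on committed capacity) versus paying on-demand for everything.
# USD/hr figures for an 8x H100 instance, taken from this article's
# illustrative examples; substitute your own rates.
ON_DEMAND, SPOT, RESERVED = 98.0, 10.0, 27.0

def monthly_bill(training_hours, inference_hours):
    hybrid = training_hours * SPOT + inference_hours * RESERVED
    all_on_demand = (training_hours + inference_hours) * ON_DEMAND
    return hybrid, all_on_demand

# e.g. 200 hours of training experiments plus one always-on inference instance
hybrid, baseline = monthly_bill(training_hours=200, inference_hours=730)
saving = 1 - hybrid / baseline
print(f"Hybrid USD {hybrid:,.0f} vs on-demand USD {baseline:,.0f} ({saving:.0%} saved)")
```

The exact saving depends on the training-to-inference mix; the point of the sketch is that the two workload types hit different price points, so modelling them together understates both.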

Hyperscaler vs Specialist GPU Cloud: When to Use Each

The choice between hyperscalers and specialist GPU clouds is not economic alone—it is strategic, operational, and regulatory.

When to Use Hyperscalers

Use AWS, Azure, or GCP for: regulated workloads (HIPAA, FedRAMP, GDPR residency requirements); enterprises with existing enterprise contracts (EDP, unified spending commitments); workloads requiring ecosystem integration (SageMaker, Azure ML, Vertex AI); or teams that need managed services and ops overhead reduction.

Hyperscaler enterprise AI infrastructure governance is built into the platform. Access controls, audit logging, and cost allocation are native. For regulated enterprises, this ecosystem advantage often outweighs pure cost considerations.

When to Use Specialist GPU Clouds

Use Lambda, CoreWeave, or RunPod for: non-regulated training workloads; short-term experimental phases where GPU commitment is temporary; startups and scaleups without existing cloud contracts; or teams that want minimal ops overhead and do not need platform services.

Specialist providers excel at training cost optimisation but lack the ecosystem and compliance certifications required for regulated production inference. For larger enterprises managing multi-cloud infrastructure, see our guidance on OCI and multi-cloud infrastructure cost governance.

Hybrid Strategy: The Optimal Model

Many enterprises adopt a hybrid model: training runs on specialist GPU clouds (CoreWeave, Lambda) during development and experimentation. Production inference runs on hyperscaler-managed services (SageMaker, Vertex AI) for compliance and reliability. This separates cost-optimised workloads from managed-service workloads, using the right tool for each.

Data Egress and Hidden GPU Costs

GPU cost optimisation must account for data egress. Moving trained model weights and training data between clouds adds a cost that is usually modest per transfer but compounds at scale, and it must be included when comparing specialist providers against hyperscalers.

AWS charges USD 0.09 per GB for data egress to the internet. Azure charges USD 0.087 per GB. GCP varies by region but averages USD 0.08 per GB. Egress is billed by the provider the data leaves, and specialist providers set their own rates. At around USD 0.09 per GB, a 50 GB model transferred between clouds for inference costs USD 4.50 in transfer fees; a 500 GB training dataset costs USD 45.

For hybrid strategies, factor egress into the cost model: if you train on specialist clouds and infer on hyperscalers, include the cost of transferring model weights and datasets. A strategy that saves USD 5,000 on training and costs USD 500 in egress is still a win, but only if the accounting makes the egress visible.
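
Folding egress into the comparison is a one-line adjustment. A small sketch using the roughly USD 0.09 per GB figure above (the rate you actually pay depends on the provider the data leaves):

```python
# Net benefit of training on a specialist cloud once egress is included.
# The default egress rate is the illustrative ~USD 0.09/GB from this
# section; substitute the rate charged by the cloud the data leaves.
def net_hybrid_saving(gpu_saving_usd, transfer_gb, egress_usd_per_gb=0.09):
    egress_cost = transfer_gb * egress_usd_per_gb
    return gpu_saving_usd - egress_cost, egress_cost

# e.g. a 50 GB model plus a 500 GB dataset moved to the inference cloud
net, egress = net_hybrid_saving(gpu_saving_usd=5000, transfer_gb=550)
print(f"Egress USD {egress:,.2f}; net saving USD {net:,.2f}")
```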

FinOps Disciplines for GPU Cost Governance

Cost optimisation without governance leads to cost creep. FinOps disciplines applied to GPU workloads ensure sustained savings.

GPU Workload Tagging and Cost Allocation

Tag every GPU instance with team, project, and cost centre. AWS, Azure, and GCP all support resource tagging. Use tags to allocate GPU costs back to teams and enforce cost ownership. Teams that see their GPU spend are more likely to optimise.
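
A minimal showback sketch: the instance records below are invented sample data standing in for a cloud billing export (AWS CUR, Azure Cost Management, or a GCP BigQuery billing export), rolled up by the team tag:

```python
from collections import defaultdict

# Tag-based showback: roll GPU spend up to teams from tagged instance
# records. Records here are invented sample data; in practice they come
# from the cloud provider's billing export.
instances = [
    {"id": "i-0a1", "tags": {"team": "nlp", "project": "chatbot"}, "cost": 7056.0},
    {"id": "i-0b2", "tags": {"team": "vision", "project": "ocr"}, "cost": 1434.0},
    {"id": "i-0c3", "tags": {"team": "nlp", "project": "search"}, "cost": 720.0},
]

spend_by_team = defaultdict(float)
for inst in instances:
    team = inst["tags"].get("team", "untagged")  # surface missing tags too
    spend_by_team[team] += inst["cost"]

for team, spend in sorted(spend_by_team.items()):
    print(f"{team}: USD {spend:,.0f}")
```

The `untagged` bucket matters as much as the team totals: a large untagged balance is the first governance problem to fix.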

Cost-per-Training-Run Metric

Track the cost of each training job (per-GPU hourly rate × number of GPUs × hours) and report the metric to teams. When a team sees it spent USD 8,000 to train a model that could have cost USD 800 on spot instances, it changes behaviour.

GPU Utilisation Monitoring

Target 70+ percent GPU utilisation. Use NVIDIA DCGM (Data Center GPU Manager) to monitor GPU utilisation in production. Inference workloads running below 40 percent utilisation are over-provisioned; training workloads running below 60 percent are inefficient. Use these metrics to drive architectural change.
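
Those thresholds can be encoded as a simple triage rule (the category labels here are illustrative):

```python
# Triage rule for GPU utilisation, using the thresholds from this section:
# inference below 40% is over-provisioned, training below 60% is
# inefficient, and 70%+ is on target for both.
def classify_utilisation(workload, utilisation):
    if utilisation >= 0.70:
        return "on target"
    if workload == "inference" and utilisation < 0.40:
        return "over-provisioned"
    if workload == "training" and utilisation < 0.60:
        return "inefficient"
    return "review"

print(classify_utilisation("inference", 0.35))  # over-provisioned
print(classify_utilisation("training", 0.55))   # inefficient
print(classify_utilisation("training", 0.82))   # on target
```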

Idle GPU Detection and Auto-Shutdown

Implement automation to detect idle GPUs (zero utilisation for 30+ minutes) and shut them down. Many enterprises leave training instances running after jobs complete, wasting thousands of dollars per month. Automated shutdown based on utilisation metrics prevents this.
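
A sketch of the detection logic, assuming utilisation is sampled every five minutes (from DCGM or the provider's monitoring API); the shutdown call itself is provider-specific and omitted:

```python
# Flag instances whose GPU utilisation has been zero for 30+ minutes,
# given readings sampled every 5 minutes (sampling cadence is an
# assumption; adjust to your monitoring setup).
IDLE_MINUTES = 30
SAMPLE_INTERVAL_MIN = 5

def is_idle(samples):
    """samples: utilisation readings (0.0-1.0), most recent last."""
    needed = IDLE_MINUTES // SAMPLE_INTERVAL_MIN
    recent = samples[-needed:]
    return len(recent) >= needed and all(u == 0.0 for u in recent)

busy = [0.8, 0.7, 0.0, 0.9, 0.6, 0.7]          # actively training
finished = [0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # job ended, instance left on
print(is_idle(busy), is_idle(finished))
```

In production this check runs on a schedule, and a `True` result triggers the provider's stop-instance API after an optional notification grace period.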

Model Serving Efficiency

Reduce per-inference GPU cost through request batching and quantisation. A model that processes single requests inefficiently can be optimised with batching logic. Quantisation (FP16, INT8) reduces memory footprint and allows more concurrent requests per GPU. These optimisations reduce GPU count required for the same throughput.
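
The effect of batching on fleet size is simple arithmetic. The per-GPU throughput figures below are illustrative assumptions, not benchmarks:

```python
import math

# How raising per-GPU throughput (via batching and/or quantisation)
# shrinks the GPU fleet needed to hit a throughput target. The requests-
# per-second figures are illustrative assumptions, not benchmarks.
def gpus_needed(target_rps, per_gpu_rps):
    return math.ceil(target_rps / per_gpu_rps)

TARGET_RPS = 400
unbatched = gpus_needed(TARGET_RPS, per_gpu_rps=25)   # one request at a time
batched = gpus_needed(TARGET_RPS, per_gpu_rps=100)    # with dynamic batching
print(f"{unbatched} GPUs unbatched vs {batched} with batching")
```

A 4x throughput improvement per GPU translates directly into a 4x smaller fleet for the same traffic, which is why serving efficiency is a cost lever and not just a latency one.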

Ready to optimise GPU costs across your AI infrastructure?

FinOps advisors can help you design hybrid strategies, negotiate commitments, and implement workload separation.
GPU cloud cost optimisation advisory →

The Strategic Framework: Putting It Together

A complete GPU cost optimisation strategy has four layers:

  • Workload Architecture: Separate training (spot instances, specialist clouds) from inference (reserved instances, hyperscalers). This is the primary lever.
  • Commitment Strategy: Reserve capacity for continuous inference workloads. Use spot for episodic training. Forecast accurately—commitments locked for three years are difficult to escape.
  • Cloud Selection: Use hyperscalers for compliance and ecosystem. Use specialist clouds for cost. Hybrid is optimal for mature AI programmes.
  • Operational Governance: Tag costs, monitor utilisation, measure cost-per-job, auto-shutdown idle capacity, and optimise model serving efficiency.

Enterprises executing all four layers typically achieve 40 to 60 percent sustained GPU cost reduction compared to on-demand baseline. Additional savings—up to 85 percent—require accepting trade-offs: specialist provider risk, spot instance interruption risk, or architectural complexity.

The most valuable insight is simple: GPU costs are optimisable. Most enterprises pay hyperscaler on-demand rates because they lack a cost optimisation framework. A FinOps discipline applied to GPU infrastructure delivers measurable, sustained savings that compound annually.

Stay updated on AI and cloud cost governance

Subscribe to the Redress Compliance newsletter for FinOps insights, GPU pricing updates, and practical cost optimisation frameworks.