Why AI Budgets Fail: The SaaS Mental Model Problem

Enterprise procurement teams have spent fifteen years building cost models around per-user, per-month SaaS pricing. AI API costs work on an entirely different economic logic. The input you pay for is not a user seat — it is a token, the fundamental unit of text that a model processes. You pay once for tokens submitted to the model (input tokens) and again for tokens the model generates in response (output tokens). Output tokens cost two to ten times more than input tokens depending on the model.

This difference matters enormously at budget time. A team that estimates "500 users at enterprise AI pricing" will produce a wildly different number than a team that models "500 users generating 4,000 tokens of input and receiving 800 tokens of output, 120 times per month, across a mix of GPT-5.4 and Claude Sonnet." The latter estimate is defensible. The former is a guess.
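The token-based estimate can be sketched in a few lines. The rates below are assumed flagship-tier list prices, not any specific contract's terms; swap in the current rates for your actual model mix:

```python
# Token-based monthly cost estimate for the scenario above.
# INPUT_RATE / OUTPUT_RATE are assumed list prices in USD per million tokens.
USERS = 500
REQUESTS_PER_USER = 120          # per month
INPUT_TOKENS = 4_000             # per request, including system prompt
OUTPUT_TOKENS = 800              # per request
INPUT_RATE, OUTPUT_RATE = 1.75, 14.00

monthly_input = USERS * REQUESTS_PER_USER * INPUT_TOKENS
monthly_output = USERS * REQUESTS_PER_USER * OUTPUT_TOKENS
cost = (monthly_input * INPUT_RATE + monthly_output * OUTPUT_RATE) / 1_000_000

print(f"{monthly_input:,} input + {monthly_output:,} output tokens/month")
print(f"estimated spend: ${cost:,.2f}/month")  # → estimated spend: $1,092.00/month
```

Every input to this calculation is observable and auditable, which is what makes the token-based estimate defensible in front of finance.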

The enterprise AI licensing guide covering OpenAI, Anthropic, Google, and AWS provides vendor-by-vendor context on how each provider structures commercial agreements. This page focuses on the budget modelling mechanics that sit beneath those commercial terms.

Current Token Pricing: GPT-5.4, Claude, and Gemini

Token prices have fallen sharply over the past 18 months — roughly 80 percent across major providers between early 2025 and early 2026. That creates a complication for budget models: price assumptions that were accurate six months ago are now materially wrong. Always anchor your model to current list rates and apply a downward-adjustment scenario assumption for mid-year repricing.

OpenAI GPT-5.4

GPT-5.4 is the current OpenAI flagship model as of Q1 2026, following the retirement of GPT-4o in February 2026. Input token rates for GPT-5.4 via direct API sit at approximately $1.75 per million tokens; output tokens are approximately $14 per million tokens. Cached input tokens — input that matches a previously submitted and cached prompt prefix — receive a 90 percent discount, bringing cached input cost to approximately $0.175 per million tokens. This caching discount is one of the most powerful levers in AI cost management for workloads with stable system prompts or repeated document contexts.

For enterprise contracts negotiated directly through OpenAI, pricing starts at $45 to $75 per user per month with a 150-seat minimum and annual commitment. Volume discounts of 20 to 40 percent are achievable at the 500-seat-plus tier. Detailed negotiation strategy is covered in the OpenAI enterprise procurement negotiation playbook.

Anthropic Claude

Claude Sonnet 4.6 — Anthropic's primary enterprise model — prices at $3 per million input tokens and $15 per million output tokens via direct API. The Anthropic Message Batches API provides a 50 percent discount for non-time-sensitive workloads, bringing batch processing rates to $1.50 input and $7.50 output per million tokens. Prompt caching on Claude offers a 90 percent discount on cached reads, the same rate as OpenAI.

Claude enterprise seat licensing runs $30 to $35 per user per month for 500-plus seats on annual commitment. The Anthropic Claude enterprise licensing guide for 2026 covers the full commercial structure, including volume tiers and what is negotiable at each level.

Google Gemini

Gemini pricing via Vertex AI sits at approximately $2 per million input tokens and $12 per million output tokens for the Gemini Pro tier. Context window pricing matters for Gemini: requests exceeding 128K tokens see a tiered rate increase. Enterprise agreements via Google Cloud include Gemini API access as part of broader Committed Use Discount (CUD) structures.

Azure OpenAI vs Direct OpenAI

Azure OpenAI Service uses the same underlying GPT models but bills through Azure consumption credits, which can apply against Enterprise Agreements, Microsoft Azure Consumption Commitments (MACC), and Reserved Capacity. The detailed comparison of Azure OpenAI versus direct OpenAI enterprise procurement is essential reading before committing spend to either channel — the financial and contractual implications diverge significantly at scale.

Need help building your enterprise AI token budget model?

Our AI procurement advisory team has modelled AI consumption costs for 100+ enterprise clients across GPT, Claude, and Gemini deployments.
Talk to Our AI Procurement Specialists →

Building Your AI Token Budget Model: The Five-Layer Framework

A reliable enterprise AI token budget model has five layers. Most organisations build only layers one and two and wonder why their actuals diverge from budget by 40 to 60 percent.

Layer 1: Use Case Inventory

Begin by inventorying every AI use case in deployment or active planning. Group them into categories: document processing, code assistance, customer-facing chat, internal search and retrieval, agentic automation, and content generation. Each category has a radically different token profile. A document summarisation task might consume 4,000 to 8,000 input tokens and generate 400 to 800 output tokens. An agentic workflow processing a complex research request may consume 1 to 3.5 million tokens per task across multiple model calls.

Layer 2: Token Volume Per Use Case

For each use case, estimate the following: average input tokens per request (including system prompt), average output tokens per request, requests per user per day, and active user count. The system prompt is frequently underestimated — a detailed system prompt for a legal document assistant can be 3,000 to 8,000 tokens, prepended to every single request. At 50 requests per user per day across 200 users, that system prompt alone consumes roughly 0.9 to 2.4 billion tokens per month (assuming a 30-day month) before a single word of actual user input is counted.
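The system-prompt overhead is worth computing explicitly, because it is pure multiplication and the result routinely surprises teams. A minimal sketch, assuming a 30-day billing month:

```python
# Monthly token load generated by the system prompt alone.
PROMPT_TOKENS_LOW, PROMPT_TOKENS_HIGH = 3_000, 8_000
REQUESTS_PER_USER_PER_DAY = 50
USERS = 200
DAYS = 30  # assumed billing month

low = PROMPT_TOKENS_LOW * REQUESTS_PER_USER_PER_DAY * USERS * DAYS
high = PROMPT_TOKENS_HIGH * REQUESTS_PER_USER_PER_DAY * USERS * DAYS
print(f"{low / 1e9:.1f}B to {high / 1e9:.1f}B prompt tokens/month")  # → 0.9B to 2.4B
```

Multiply those tokens by your input rate (or the cached rate, if the prompt is stable enough to cache) and the system prompt becomes a budget line item in its own right.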

Layer 3: Model Mix and Routing Logic

Not every request needs the most capable — and most expensive — model. Simple classification, short summarisation, and structured data extraction often perform identically on smaller, cheaper models. Enterprise deployments with intelligent model routing have achieved 27 to 55 percent cost reduction without measurable quality degradation in RAG (retrieval-augmented generation) setups. Your budget model should reflect the actual model mix, not assume every call goes to the flagship tier.

Layer 4: Caching and Batch Offload

Prompt caching and batch processing are the two most impactful cost levers available today. Prompt caching applies when a request begins with a prompt prefix that matches previously submitted and cached content. For workloads with stable system prompts and document contexts, caching hit rates of 60 to 80 percent are achievable, reducing effective input token costs by 54 to 72 percent. Batch processing, where latency tolerance exists, provides an additional 50 percent reduction on both input and output tokens. Combined, these two techniques can reduce gross token spend by 70 to 90 percent on suitable workloads.
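The combined effect of both levers can be sketched under stated assumptions — a 90 percent discount on cached input reads and a 50 percent batch discount on both token types, per the provider terms described above:

```python
def effective_rates(input_rate, output_rate, cache_hit=0.0, batched=False):
    """Effective $/M rates after prompt caching and batch discounts.

    Assumes cached input reads cost 10% of list price and that batching
    halves both input and output rates.
    """
    eff_in = (1 - cache_hit) * input_rate + cache_hit * input_rate * 0.10
    eff_out = output_rate
    if batched:
        eff_in, eff_out = eff_in * 0.5, eff_out * 0.5
    return eff_in, eff_out

# Claude Sonnet list rates from the text: $3 in / $15 out per million.
print(effective_rates(3.0, 15.0, cache_hit=0.6))                # caching only
print(effective_rates(3.0, 15.0, cache_hit=0.8, batched=True))  # caching + batch
```

At a 60 percent hit rate, the effective input rate drops from $3.00 to about $1.38 per million — the 54 percent reduction cited above. Adding batch processing at an 80 percent hit rate takes input to roughly $0.42 per million.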

Layer 5: Growth and Volatility Assumptions

AI usage within enterprises is growing at 36 percent year on year on average, but that average masks enormous variance. Coding assist deployments are growing at 80 to 120 percent annually as GitHub Copilot-equivalent tools become standard developer tooling. Agentic workflows, once deployed, generate token volumes that scale with the number of tasks queued — a small increase in task volume can produce a disproportionate increase in token spend. Build your budget model with three scenarios: conservative (current run rate plus 20 percent growth), base-case (plus 50 percent growth), and stress-case (plus 150 percent growth). Present all three to finance.
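Once current run rate is instrumented, generating the three scenarios is trivial. The run-rate figure below is purely illustrative:

```python
# Three growth scenarios applied to a current annual AI run rate.
RUN_RATE = 240_000  # assumed current annual spend, USD (illustrative)
GROWTH = {"conservative": 0.20, "base-case": 0.50, "stress-case": 1.50}

projection = {name: RUN_RATE * (1 + g) for name, g in GROWTH.items()}
for name, total in projection.items():
    print(f"{name}: ${total:,.0f}")
```

Presenting the full range, rather than a single point estimate, keeps the budget conversation honest when actuals start to diverge.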

Use Case Token Benchmarks

The following benchmarks are derived from enterprise deployments across our advisory engagements and should be used as starting points, not final estimates. Actual volumes vary based on system prompt length, user behaviour, and model response verbosity settings.

  • Legal document review: 6,000 to 12,000 input tokens, 500 to 1,200 output tokens per document. High value, low frequency.
  • Code completion (inline): 800 to 2,000 input tokens, 100 to 400 output tokens. High frequency — developers average 40 to 80 completions per day.
  • Complex debugging and code review: 20,000 to 80,000 input tokens (full file context), 2,000 to 5,000 output tokens per session. Can reach 500,000 tokens for multi-file analysis.
  • Customer support chat: 1,500 to 4,000 input tokens (conversation history), 200 to 600 output tokens per turn. Multi-turn conversations compound token usage.
  • Internal search and Q&A (RAG): 3,000 to 8,000 input tokens (retrieved context plus query), 300 to 800 output tokens.
  • Agentic research and execution tasks: 1,000,000 to 3,500,000 tokens per complex task across multiple model calls.
  • Document generation (reports, proposals): 2,000 to 5,000 input tokens, 3,000 to 8,000 output tokens. Output-heavy — model generates most of the tokens.

PTU vs PAYG: The Budget Forecasting Decision

Azure OpenAI offers two billing models: pay-as-you-go (PAYG), where you pay per token consumed, and Provisioned Throughput Units (PTU), a reserved capacity commitment billed monthly regardless of utilisation. For enterprise budget forecasting, PTU creates predictable fixed costs and eliminates the variance problem. The trade-off is utilisation risk: if your team under-uses the provisioned capacity, you have paid for unused throughput.

The PTU breakeven calculation: PTU commitments on Azure OpenAI start at $2,448 per month for a minimum provisioned unit, so PTU economics become favourable only when consistent monthly PAYG spend on a given workload exceeds that minimum commitment. Below that threshold, PAYG flexibility outweighs PTU cost efficiency. For workloads with predictable, steady-state usage — production customer support bots, internal search deployments with defined user populations — PTU delivers 50 to 70 percent cost reduction versus PAYG at equivalent throughput.
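The decision logic can be sketched as follows. The 60 percent discount is an assumed midpoint of the 50 to 70 percent range quoted above, and the model ignores growth and multi-unit scaling:

```python
# PTU vs PAYG decision sketch using the figures in the text.
PTU_MIN_MONTHLY = 2_448   # minimum provisioned unit, USD/month
PTU_DISCOUNT = 0.60       # assumed midpoint of the 50-70% range

def ptu_monthly_cost(payg_spend):
    """PTU cost for equivalent throughput, floored at the minimum commitment."""
    return max(PTU_MIN_MONTHLY, payg_spend * (1 - PTU_DISCOUNT))

def prefer_ptu(payg_spend):
    return ptu_monthly_cost(payg_spend) < payg_spend

print(prefer_ptu(1_500))   # → False: below the minimum commitment, PAYG wins
print(prefer_ptu(6_000))   # → True: pays $2,448 instead of $6,000
```

The floor at the minimum commitment is what creates the utilisation risk described above: below breakeven, you are paying for throughput you do not use.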

Full analysis of Azure versus direct OpenAI commercial structures, including PTU specifics, is in the Azure OpenAI vs direct OpenAI enterprise comparison.

From Token Counts to Business Outcomes

Finance teams will challenge AI budgets that are expressed in tokens and millions of dollars of API spend without a clear link to business value. The most effective budget presentations translate token economics into cost-per-outcome metrics: cost per document processed, cost per code review, cost per customer interaction resolved.

A simple example: a legal team processing 800 contract reviews per month at 8,000 input tokens and 800 output tokens per review, using GPT-5.4 at current rates with 60 percent caching, incurs a gross input token cost of approximately $0.006 per review and an output cost of approximately $0.011 per review — under two cents per contract review, or roughly $14 per month across all 800 reviews, in AI API costs. Against a paralegal hourly rate and the number of hours saved, this becomes a straightforward ROI calculation that finance can approve and audit.
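The arithmetic behind that example, using the GPT-5.4 list rates and 90 percent cached-read discount quoted earlier:

```python
# Per-review cost for the contract-review example above.
REVIEWS_PER_MONTH = 800
IN_TOKENS, OUT_TOKENS = 8_000, 800
IN_RATE, OUT_RATE = 1.75, 14.00   # GPT-5.4 list $/M from the text
CACHE_HIT = 0.60                  # 60% of input tokens served from cache

eff_in_rate = (1 - CACHE_HIT) * IN_RATE + CACHE_HIT * IN_RATE * 0.10
per_review = (IN_TOKENS * eff_in_rate + OUT_TOKENS * OUT_RATE) / 1e6
monthly = per_review * REVIEWS_PER_MONTH
print(f"${per_review:.4f}/review, ${monthly:.2f}/month")  # → $0.0176/review, $14.11/month
```

Dividing the hours saved per review into the per-review cost gives the cost-per-outcome figure finance actually wants to see.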

The broader context for AI cost governance and how token costs fit into enterprise financial planning is covered in detail in the AI consumption billing and token cost control guide.

Common Budget Overrun Scenarios

Three patterns account for the majority of AI budget overruns in enterprises we have worked with.

Shared API key sprawl: Development teams spin up AI integrations using a single API key. Without per-team or per-application cost attribution, there is no visibility into which projects are consuming tokens until the monthly invoice arrives. Implementing per-application API keys with tagging and quota limits at the outset eliminates this problem.

Prompt changes causing 10x token spikes: A developer adds context to a system prompt — adding 5,000 tokens — without recalculating the downstream cost impact. If that system prompt is prepended to 50,000 API calls per month, the change adds 250 million input tokens and potentially $437 to $1,750 in monthly cost, depending on the model. Pre-deployment token cost impact assessment is a cheap control that catches expensive mistakes.
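That cost delta is a one-line calculation, which is exactly why the pre-deployment check is cheap. The rates below are illustrative input prices across a few model tiers:

```python
# Monthly cost impact of adding tokens to a high-volume system prompt.
ADDED_TOKENS = 5_000
CALLS_PER_MONTH = 50_000
RATES = [1.75, 3.00, 7.00]  # illustrative $/M input rates by model tier

added_per_month = ADDED_TOKENS * CALLS_PER_MONTH  # 250M extra tokens
impact = [added_per_month * r / 1e6 for r in RATES]
for rate, cost in zip(RATES, impact):
    print(f"${cost:,.0f}/month extra at ${rate}/M input")
```

Wiring this check into the prompt-change review process turns an invoice surprise into a line item in a pull request.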

Agentic workflow underestimation: Teams build agentic workflows in development where the number of model calls per task is small and controlled. In production, with real user inputs, the same workflow spawns three to eight times more model calls than anticipated. Always measure agentic token consumption under realistic load before committing to a budget figure.

Five Steps to a Defensible AI Budget

A rigorous AI budget process follows five steps. Organisations that complete all five are significantly more likely to land within 15 percent of their actual spend.

Step 1: Instrument existing usage. Before forecasting future costs, measure current consumption. Deploy per-application token logging, segment by use case, and run for 30 days before building the model. Real consumption data is worth more than any benchmark estimate.

Step 2: Model each use case separately. A consolidated "AI budget" that aggregates all use cases into a single number will be wrong. Each use case has a different token profile, growth rate, and caching opportunity. Model them separately and aggregate.

Step 3: Apply optimisation scenarios. Build the budget in three versions: fully unoptimised (every token at PAYG list rates), partially optimised (prompt caching on stable workloads, model routing applied to simple tasks), and fully optimised (batch processing for non-time-sensitive work, PTU for steady-state workloads, full caching). Present the range to leadership.

Step 4: Set quota-based controls. Budget commitments without enforcement controls are wishes, not budgets. Configure token quota limits per application, per team, and per model in your API gateway or AI platform dashboard before the budget period starts. This is covered in depth in the enterprise guide to negotiating OpenAI contracts, which includes governance provisions available at the contract level.

Step 5: Build a monthly review cadence. AI token costs change faster than any other enterprise software cost category. Monthly review of actual versus budget, with a structured variance analysis, allows you to catch overruns before they compound.

Stay Current on AI Licensing and Cost Management

Token pricing, model availability, and enterprise contract structures shift continuously. Subscribe to the Redress Compliance newsletter for monthly updates on AI commercial changes that affect enterprise buyers.

Getting Expert Support for AI Budget Modelling

Token cost forecasting is not a one-time activity. As models evolve, pricing shifts, and deployment patterns change, the budget model must be updated. Our enterprise AI procurement advisory specialists have modelled token consumption economics for over 100 enterprise clients, from 500-seat pilot deployments to multi-million-dollar production AI platforms. We provide vendor-neutral analysis, use-case benchmarking, and commercial negotiation support across OpenAI, Anthropic, Google, and AWS.

Download the AI platform contract negotiation guide for a detailed treatment of the commercial terms that sit alongside token pricing — including enterprise agreement structure, data governance provisions, and exit rights that affect the total cost and risk profile of your AI platform commitments.