The Token Economy: Understanding AI Billing Units
The shift from cloud compute billing (measured in virtual machine hours, compute capacity, storage tiers) to AI token billing represents a fundamental change in how enterprises budget technology spend. A token is the smallest unit of text processed by a language model — roughly 4 characters of English text, or about three-quarters of a typical word. Crucially, token costs are split into two tiers: input tokens (the prompt and context you send to the model) and output tokens (the completion the model returns to you).
This distinction matters enormously. Output tokens cost five to eight times more than input tokens on every major platform (see the tiers below). A single production query generating a 1,000-token response might consume 500–1,000 input tokens (prompt + context) and 1,000 output tokens, costing £0.01–£0.08 depending on the model tier selected. Multiply that by millions of queries across your organisation each month, with no governance, and token costs become a line item that rivals your compute spend.
Current 2026 pricing tiers across the three major vendors:
- Anthropic Claude: Haiku £0.80/$1 input / £4/$5 output per million tokens; Sonnet £2.40/$3 input / £12/$15 output; Opus £12/$15 input / £60/$75 output
- OpenAI GPT-5 family: Mini £0.20/$0.25 input / £1.60/$2 output; Standard £1.40/$1.75 input / £11.20/$14 output; Pro £16.80/$21 input / £134.40/$168 output
- Google Gemini: Flash £0.40/$0.50 input / £2.40/$3 output; Pro £1.60/$2 input / £9.60/$12 output per million tokens
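As a sanity check on the arithmetic, here is a minimal cost calculator using the USD figures above. The model names and prices are hardcoded illustrations taken from the list; substitute your actual contracted rates:

```python
# Per-million-token USD prices from the tiers above (illustrative, not authoritative).
PRICES = {
    "claude-haiku":  {"input": 1.00,  "output": 5.00},
    "claude-sonnet": {"input": 3.00,  "output": 15.00},
    "claude-opus":   {"input": 15.00, "output": 75.00},
    "gpt5-mini":     {"input": 0.25,  "output": 2.00},
    "gpt5-standard": {"input": 1.75,  "output": 14.00},
    "gpt5-pro":      {"input": 21.00, "output": 168.00},
    "gemini-flash":  {"input": 0.50,  "output": 3.00},
    "gemini-pro":    {"input": 2.00,  "output": 12.00},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single query: tokens / 1M x per-million rate."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 1,000-token prompt with a 1,000-token completion on Sonnet: $0.003 + $0.015 = $0.018
cost = query_cost("claude-sonnet", 1_000, 1_000)
```

The same function, run over a month of tagged usage records, gives you the per-team figures a chargeback report needs.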
Most enterprises have no visibility into which models their teams are using, which queries cost the most, or where token spend is concentrating. This opacity is the core problem. Unlike cloud infrastructure spend (tracked through FinOps tooling, cost allocation tags, and chargeback), AI token spend is often scattered across developer personal API keys, product team subscriptions, and corporate contracts — creating what we call the "shadow AI" problem.
The Enterprise Token Spend Problem: Shadow APIs and No Allocation
The FinOps Foundation has begun extending its cost governance framework to AI infrastructure, but most enterprises are 12–18 months behind on implementation. The problem manifests in three ways:
1. Distributed Purchasing and No Centralised Visibility
Different teams purchase API access through different channels: one team via AWS Bedrock (bundled with compute), another via Azure OpenAI (billed through an Azure enterprise agreement), a third via direct OpenAI API keys. A fourth team might use Claude directly through Anthropic's console. Without a central billing aggregator, you cannot see total monthly token consumption across the organisation. Finance discovers token costs only when the credit card bill arrives — long after decisions were made about which models to use.
2. Shadow AI: Developers Using Personal Keys
Engineers spinning up prototypes often create personal OpenAI or Anthropic API keys to avoid procurement delays. Those keys work, the prototype goes into production, and suddenly production token costs are landing on someone's personal account. Most CFOs have no idea how many unauthorised API keys are in production. Worse, those keys have no rate limits, no usage monitoring, and no way to tie costs back to teams. This is AI's equivalent of shadow IT spend.
3. No Framework for Cost Allocation or Chargeback
Even if you do have a single contract with OpenAI or Anthropic, there's typically no mechanism to allocate token costs back to teams, products, or use cases. Cloud providers solved this with tagging, cost allocation tags, and chargeback reporting. AI vendors are just beginning to offer APIs for this. Most enterprises still have conversations like: "We spent £50,000 on tokens this month. Who do we charge?" The answer: no one, because you cannot measure it.
These three factors create a compounding cost problem. Teams have no incentive to optimise because they don't see the cost. Finance cannot forecast. Procurement cannot negotiate because they don't know how much you're actually consuming. The solution is enterprise technology governance extended to AI: central ownership, cost visibility, and chargeback accountability.
Prompt Caching: The Highest-ROI Cost Lever
Of all the cost reduction tactics available, prompt caching delivers the fastest ROI and requires the least architectural rework. The mechanism is simple: if you send the same large prompt or context block to a model multiple times, both OpenAI and Anthropic will cache that content and charge only a fraction of the base input token rate, roughly 10%, on subsequent cache hits. Google's context caching works similarly.
A typical production use case: a RAG (retrieval-augmented generation) application where you retrieve a 10,000-token document chunk from your knowledge base, prepend it to a user query, and send both to Claude or GPT. Without caching, that 10,000-token document is processed and charged at full rate on every query. With caching, the document is charged at full rate once, then at 10% of the input rate on every hit thereafter. If 80% of your queries reuse the same document, you save approximately 72% on input token costs for that workflow.
Suitable use cases for prompt caching:
- System prompts: A 500–2,000 token system prompt defining the model's role, constraints, and output format. Used identically across 100,000+ queries monthly. Cache this.
- Document context: Multi-document RAG where the same PDFs, specifications, or policy documents appear in 60%+ of queries. Cache the documents once, query them repeatedly.
- Conversation history: Multi-turn chatbots or agent loops where the conversation history is prepended to every query. Cache the history after each turn.
- Few-shot examples: If you're using few-shot prompting (supplying examples of desired input-output pairs), cache the examples. They don't change per query.
Implementation: Prompt caching is supported on Claude (all tiers), GPT-4o and GPT-4 Turbo (OpenAI), and Gemini 2.0 Pro (Google). It requires a single API parameter flag and minimal code change. No infrastructure change required.
Sizing the impact: For a typical enterprise RAG application processing 1 million queries monthly with 5,000-token average prompt size (document + context + query), prompt caching can reduce input token costs from 5 billion tokens to approximately 1.5 billion tokens monthly — a 70% reduction in input costs. Output costs remain unchanged (you still generate the same completions), but input savings alone often exceed the entire FinOps team's cost target.
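The sizing above can be sketched directly. This helper computes billed input-token equivalents under caching, assuming cache hits are billed at 10% of the base rate (the figure used in this section; check your vendor's actual cached-token price):

```python
def effective_input_tokens(queries: int, prompt_tokens: int,
                           cache_hit_rate: float,
                           cached_price_fraction: float = 0.10) -> float:
    """Billed input-token equivalents with prompt caching.
    Cache misses pay the full rate; hits pay cached_price_fraction of it."""
    hits = queries * cache_hit_rate
    misses = queries - hits
    return prompt_tokens * (misses + hits * cached_price_fraction)

# 1M monthly queries with 5,000-token prompts: 5.0B tokens billed without caching.
baseline = 1_000_000 * 5_000
# With a 78% cache hit rate, billed equivalents drop to ~1.49B, i.e. the
# roughly 70% input-cost reduction described above.
with_cache = effective_input_tokens(1_000_000, 5_000, cache_hit_rate=0.78)
```

The hit rate is the variable to measure in your own traffic; everything else follows mechanically.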
Model Tiering Strategy: The 70/20/10 Rule
Not every query requires GPT-5 Pro or Claude Opus. Yet most organisations, lacking a routing strategy, default all traffic to the most capable (and most expensive) model available. This is equivalent to running all your workloads on the most powerful cloud instance type regardless of whether they need it.
The 70/20/10 rule is a FinOps discipline applied to models:
- 70% of queries: Route to the budget tier model. Claude Haiku (£0.80/$1 input, £4/$5 output), OpenAI Mini (£0.20/$0.25 input, £1.60/$2 output), or Google Gemini Flash (£0.40/$0.50 input, £2.40/$3 output). These models are sufficient for classification, summarisation, simple code generation, and customer-facing chat.
- 20% of queries: Route to mid-tier. Claude Sonnet (£2.40/$3 input, £12/$15 output), OpenAI Standard (£1.40/$1.75 input, £11.20/$14 output), or Gemini Pro (£1.60/$2 input, £9.60/$12 output). These handle more complex reasoning, multi-step tasks, and code review.
- 10% of queries: Route to premium tier. Claude Opus (£12/$15 input, £60/$75 output), OpenAI Pro (£16.80/$21 input, £134.40/$168 output). Use only for high-stakes reasoning, novel problems, or when you have budget flexibility.
The key is identifying which use cases actually require premium models. Many organisations discover that 70% of their workload runs fine on Haiku or Mini — but they've been using Opus or Pro anyway. A simple test: run the same prompt through both budget and premium tiers, compare outputs. If the budget tier's answer is acceptable, save the cost.
Practical impact: an enterprise running 10 million monthly queries at an average of 1,500 tokens per query (input + output) spends on the order of £80,000–£110,000 monthly with all traffic routed to Sonnet or Standard, and several times that on Opus or Pro. Applying 70/20/10 routing typically cuts total model spend by 60–80% against that premium-tier baseline, with zero change to user experience for the 70% of queries that don't require premium reasoning.
Implementation requires model routing logic in your application layer or through a proxy (AWS Bedrock, Azure OpenAI, or self-hosted gateway). The routing logic can be simple (based on query type, user tier, or confidence thresholds) or sophisticated (ML-based prediction of which tier is needed). Start simple.
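A minimal rule-based router along these lines might look as follows. The task categories, tier names, and model choices are illustrative assumptions, not a prescribed taxonomy:

```python
# Hypothetical rule-based router: map coarse query categories to model tiers.
TIER_BY_TASK = {
    "classification":  "budget",
    "summarisation":   "budget",
    "chat":            "budget",
    "code_review":     "mid",
    "multi_step":      "mid",
    "novel_reasoning": "premium",
}

MODEL_BY_TIER = {
    "budget":  "claude-haiku",
    "mid":     "claude-sonnet",
    "premium": "claude-opus",
}

def route(task_type: str, user_tier: str = "standard") -> str:
    """Pick a model: unknown task types default to the budget tier,
    and non-premium users are capped at the mid tier."""
    tier = TIER_BY_TASK.get(task_type, "budget")
    if tier == "premium" and user_tier != "premium":
        tier = "mid"
    return MODEL_BY_TIER[tier]
```

Defaulting unknown tasks to the budget tier is a deliberate choice: it makes premium spend opt-in rather than accidental, which is the whole point of the 70/20/10 discipline.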
Context Window Optimisation: RAG and Chunking
Context tokens are often the largest cost component in production applications. Unlike a one-shot query (where input might be 200–500 tokens), RAG systems, agent loops, and multi-step reasoning chains prepend retrieved documents, conversation history, and intermediate results to every query. A single retrieve-augment-generate-retrieve cycle can easily consume 10,000–20,000 input tokens.
Every unnecessary token in the context window costs money on every query. Here are three levers:
1. Retrieval Optimisation (RAG Precision)
If your retrieval system is too broad (returning 20 documents when 3 would suffice), you're including thousands of unnecessary context tokens. Tighter retrieval filters, re-ranking, and hybrid search reduce context size. Tools like Cohere Rerank or semantic filtering (using vector embeddings) improve retrieval precision and directly reduce context tokens consumed.
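One simple, vendor-neutral piece of this is enforcing a token budget on whatever your retriever or re-ranker returns. A sketch, assuming (score, text) pairs from upstream and a crude 4-characters-per-token estimate:

```python
def select_context(candidates: list[tuple[float, str]],
                   token_budget: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit within a token budget.
    Uses a rough 4-chars-per-token estimate; swap in a real tokenizer for accuracy."""
    chosen, used = [], 0
    for score, text in sorted(candidates, reverse=True):
        est_tokens = len(text) // 4 + 1
        if used + est_tokens <= token_budget:
            chosen.append(text)
            used += est_tokens
    return chosen
```

A hard budget like this turns "how many documents should we retrieve?" into a cost-bounded question rather than a fixed top-k guess.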
2. Context Summarisation
For long conversations or multi-step agent loops, summarise old conversation history instead of appending it all to each new query. After 10 turns of dialogue, summarise the first 8 turns into a 200-token summary, then append only the summary + the last 2 turns. This keeps context size bounded while preserving context coherence. Claude's prompt caching works particularly well here: cache the summary once, update it periodically.
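A minimal sketch of the bounding logic, with the summarisation step stubbed out (in production it would be a cheap model call, e.g. to the budget tier):

```python
def bounded_history(turns: list[str], keep_last: int = 2,
                    summarise=lambda text: f"[summary of {len(text)} chars]") -> list[str]:
    """Replace all but the last `keep_last` turns with a single summary turn.
    The `summarise` callable is stubbed here; in production it is itself a
    (cheap) model call whose output you cache."""
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarise(" ".join(old))] + recent
```

The result is a context whose size is bounded by one summary plus `keep_last` turns, no matter how long the conversation runs.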
3. Chunking Strategy
Document chunking for RAG (splitting a 50-page PDF into searchable chunks) is an art. Too-large chunks include irrelevant text alongside relevant content. Too-small chunks break semantic meaning. Optimal chunk size varies by use case but typically 500–1,500 tokens per chunk yields the best precision. Semantic chunking (splitting on meaning boundaries, not arbitrary token counts) further improves quality and reduces context bloat.
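A rough stand-in for this is sentence-boundary packing: split on sentence endings and pack sentences into chunks up to a token ceiling. True semantic chunking splits on meaning boundaries, but this sketch shows the bounding mechanics:

```python
import re

def chunk_text(text: str, max_tokens: int = 1_000) -> list[str]:
    """Split text on sentence boundaries and pack sentences into chunks of at
    most ~max_tokens (crude 4-chars-per-token estimate). A simple baseline;
    real semantic chunkers split on meaning, not punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, used = [], [], 0
    for sentence in sentences:
        est = len(sentence) // 4 + 1
        if current and used + est > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += est
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because no sentence is ever split mid-thought, each chunk stays self-contained, which is what keeps retrieval precision high at smaller chunk sizes.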
Combined impact: Moving from naive RAG (retrieving 5 documents of 3,000 tokens each = 15,000 context tokens per query) to optimised RAG (retrieving 2 documents of 800 tokens each, summarised = 2,000 context tokens) reduces input costs by 87% on that workflow. Scale this across millions of queries, and context optimisation is a multi-million-pound initiative.
Chargeback and Cost Allocation: Building Enterprise Governance
You cannot manage what you do not measure. FinOps practitioners know this from cloud governance; it applies identically to AI spend. Building a chargeback framework requires four elements:
1. Tagging at Query Time
When your application calls OpenAI, Anthropic, or Google APIs, include metadata tags identifying the team, product, environment, and use case. All three vendors support custom metadata:
- Anthropic: Headers parameter; include team_id, product_name, environment (prod/staging), use_case
- OpenAI: Custom metadata in request body; Azure OpenAI uses cost allocation tags
- Google Vertex AI: Labels (team, product, cost_center, environment)
This requires one-time integration work. The benefit is immediate: you can trace every token back to its source.
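A vendor-neutral sketch of query-time tagging. The `metadata` field name and the `CostTags` structure here are illustrative assumptions; each provider's API documentation defines the exact mechanism (header, body field, or label):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CostTags:
    """Allocation tags attached to every model request (illustrative schema)."""
    team_id: str
    product_name: str
    environment: str  # "prod" | "staging"
    use_case: str

def tagged_request(payload: dict, tags: CostTags) -> dict:
    """Merge allocation tags into a request body. The field name ('metadata')
    varies by vendor; check the provider's API docs before relying on it."""
    return {**payload, "metadata": asdict(tags)}

req = tagged_request(
    {"model": "claude-sonnet", "max_tokens": 512},
    CostTags("platform-eng", "support-bot", "prod", "ticket-summarisation"),
)
```

Routing all calls through one wrapper like this is what makes the tagging genuinely "one-time" work rather than a per-team migration.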
2. Monthly Cost Allocation Reporting
Export your usage data from each vendor's usage or billing API, aggregated by tag, and produce a monthly chargeback report. Format: Team A consumed 500M tokens, cost £2,400; Team B consumed 200M tokens, cost £960. Present these monthly to team leads, finance, and the CFO. Visibility drives behaviour change. Tools like AWS Bedrock's cost reporting and third-party spend management platforms can automate this for large-scale deployments.
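The aggregation step itself is simple once tags are in place. A sketch, assuming usage records exported as team/token pairs and a blended per-million rate computed upstream:

```python
from collections import defaultdict

def chargeback(records: list[dict], price_per_million: float) -> dict:
    """Aggregate tagged usage records into a monthly chargeback view.
    records: [{'team': ..., 'tokens': ...}, ...] from vendor usage exports.
    price_per_million: blended GBP rate per million tokens (assumed given)."""
    totals = defaultdict(int)
    for r in records:
        totals[r["team"]] += r["tokens"]
    return {team: {"tokens": t, "cost": round(t / 1_000_000 * price_per_million, 2)}
            for team, t in totals.items()}

report = chargeback(
    [{"team": "A", "tokens": 300_000_000},
     {"team": "A", "tokens": 200_000_000},
     {"team": "B", "tokens": 200_000_000}],
    price_per_million=4.80,
)
# Team A: 500M tokens at £4.80/M = £2,400; Team B: 200M tokens = £960
```

A real pipeline would carry separate input/output token counts and per-model rates, but the shape of the report is the same.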
3. Establish Cost Ownership
Assign accountability. Who owns AI spend for each team? Who approves new model deployments? Who reviews monthly chargeback? Without clear ownership, there is no incentive to optimise. This is basic FinOps hygiene, often overlooked in AI spend.
4. Cost Governance and Approval Gates
Establish a simple gate: teams can spin up new API keys, but any deployment consuming >£500/month must be reviewed and approved by a cost committee (finance + engineering leadership). Similarly, shift to a premium model (Opus, Pro) requires business justification. These gates prevent shadow spend from metastasising.
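The gate can be encoded as a one-line policy check using the thresholds above (both the £500 figure and the tier names are illustrative and should match your own policy):

```python
def requires_review(projected_monthly_gbp: float, model_tier: str,
                    threshold_gbp: float = 500.0) -> bool:
    """Cost-committee gate from the policy above: spend over the monthly
    threshold, or any use of a premium-tier model, triggers review."""
    return projected_monthly_gbp > threshold_gbp or model_tier == "premium"
```

Embedding this check in your deployment pipeline or API gateway, rather than a wiki page, is what stops shadow spend from metastasising.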
Reference implementations: AWS Bedrock includes native cost allocation; Azure OpenAI integrates with Azure Cost Management; for direct API consumption, use a gateway such as LiteLLM, or instrumentation via MLflow or Weights & Biases, to tag queries and build chargeback.
Enterprise Contract Negotiation: Commits and Volume Discounts
All three major providers offer commit-based discounts at enterprise scale. Unlike cloud providers (which negotiate discounts for 3-year commitments), AI vendors typically offer discounts on annual token consumption commitments.
OpenAI Enterprise
OpenAI Enterprise contracts start at roughly £36–60/$45–75 per user per month (minimum 150 seats), with volume discounts available for token commitments above 10 billion monthly. Enterprise agreements also include a data policy (no customer data used for model improvement), IP indemnification, and service-level agreements. Standard commercial terms: GPT-5 Pro at £16.80/$21 per million input tokens typically negotiates down to £12.80–14.40/$16–18 on an enterprise agreement with an annual token commitment.
Anthropic Volume Tiers
Anthropic offers volume discounts at 500+ seat commitments and tiered pricing for organisations consuming >50 billion tokens monthly. A typical enterprise agreement might negotiate Claude Opus from £12/$15 per million input tokens down to £9–10/$12–13 with annual commitment. Importantly, Anthropic's contracts include explicit data governance: Claude does not train on customer API traffic.
Google Vertex AI
Google positions Vertex AI (managed Gemini) as part of the broader cloud platform. Pricing tiers are available for annual commitments. A 50 billion monthly token commitment with Google might unlock 20–25% discounts against standard Gemini pricing. However, you must be a meaningful Google Cloud customer to access these discounts.
Key Contract Negotiation Points
- Data residency: Where is your data processed and stored? (EU customers particularly sensitive here)
- Data governance: Can the vendor use your API traffic to train or improve models? (Most enterprises require "no model improvement" clauses)
- IP indemnification: If your generated content matches copyrighted material, who bears the liability?
- Exit rights: How do you transition out of the contract? (Important given vendor lock-in risk)
- Service levels: What is the SLA for availability and response times?
- Volume commitment: What baseline token consumption are you committing to? Are overages billed at the standard rate or at a discounted rate?
Negotiation horizon: 6–8 weeks from initial outreach to signed agreement, including legal review. Start conversations 90 days before your renewal, or before monthly spend reaches £10,000 (the threshold at which vendors typically assign account management).
Building the AI Cost Governance Program
Implementing all of these levers requires a structured programme, not ad-hoc improvements. The roadmap:
Month 1–2: Discovery and Baseline
- Audit existing AI spend across all channels (direct APIs, cloud provider native services, third-party integrations). Use APIs, billing exports, and developer interviews to build a complete picture.
- Establish baseline monthly token consumption and cost. Include "shadow APIs" — personal keys, trial accounts, non-production experiments.
- Map which models and vendors are in use, and for what use cases.
Month 3–4: Implement Tagging and Measurement
- Deploy metadata tagging at query time (team, product, environment, use case).
- Build a cost aggregation pipeline (export usage APIs daily, aggregate by tag, store in a data warehouse or FinOps tool).
- Establish ownership and baseline accountability (CFO/Finance reviews monthly chargeback for first time).
Month 5–6: Optimisation Pilots
- Identify high-token-cost workflows (top 10–20% of spend).
- Pilot prompt caching on two production RAG systems (document-heavy, high-query volume).
- Pilot model tiering: test Haiku/Mini on 20% of traffic; measure quality degradation (usually none). Gradually increase to 70% if results are acceptable.
- Review context window sizes in RAG systems; implement chunking and retrieval optimisation for top 3 workflows.
Month 7–8: Governance Framework
- Establish cost approval gates: new APIs require team lead sign-off; >£500/month spend requires financial review.
- Document cost ownership: who approves model changes, API keys, new deployments?
- Publish monthly chargeback reporting to all teams.
Month 9+: Contract Renegotiation
- With baseline data and cost trends, approach vendors for volume discounts or annual commitments.
- Estimate likely annual spend (Month 1–6 data × growth factor); negotiate volume discounts accordingly.
- Include data governance and exit-rights clauses in agreements.
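A simple way to produce the Month 1–6 extrapolation: average the observed months and compound an assumed monthly growth rate forward twelve months. The growth rate is the key assumption to pressure-test before taking the number into a negotiation:

```python
def annual_spend_estimate(monthly_costs: list[float],
                          monthly_growth: float) -> float:
    """Project 12 months of spend forward from the average of observed months,
    compounding an assumed monthly growth rate (illustrative model, not a
    forecast methodology)."""
    base = sum(monthly_costs) / len(monthly_costs)
    return sum(base * (1 + monthly_growth) ** m for m in range(1, 13))

# Six observed months averaging £20k, assumed to grow 5% per month:
estimate = annual_spend_estimate(
    [18_000, 19_000, 20_000, 20_500, 21_000, 21_500], 0.05)
```

Running the estimate at two or three growth rates gives you a commitment band: commit near the low case and let overages cover the upside.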
Expected outcomes of a mature AI cost governance programme: 50–70% reduction in cost-per-token through a combination of model tiering (60–80% on routed traffic), prompt caching (70% on suitable workflows), context optimisation (20–40%), and contract discounts (15–25%). Total programme cost (one FTE plus tooling): £80,000–150,000 over 9 months. Payback period: typically 4–6 months once optimisations are deployed. This mirrors established infrastructure cost governance practice in multi-cloud environments.
Building Your AI Cost Framework: Next Steps
Token cost management is not complicated. It requires discipline, measurement, and the same FinOps practices that organisations have applied to cloud spend for the past five years. The difference: with AI, the upside is faster. A single prompt caching implementation can save 70% on input costs for a workflow within weeks. A model tiering strategy deployed across 10 million monthly queries delivers 60–80% savings without any change to end-user experience.
Start with measurement. Know what you're spending, where it's being spent, and why. From there, the optimisation roadmap becomes obvious. Most organisations we work with move from £0 of measured AI spend to a structured, cost-governed, 50–70% optimised programme within 9 months.
The enterprises winning at AI cost management are extending their established enterprise FinOps practices into AI. They're treating tokens like any other compute resource: measurable, taggable, allocatable, and optimisable. The organisations that don't? They'll wake up in Q3 2026 to a £500,000/month AI bill they cannot explain and have no way to reduce.
Token costs are variable, consumption-based, and growing fast. Build the governance framework now, while vendors are still optimising for market adoption and discount negotiations are still possible.