How to use this assessment: Work through each item and mark it complete once you have confirmed the position with your AI vendor or internal team. Items flagged High Risk represent the most common sources of material overspend in enterprise AI deployments. A score of 15 or more confirmed items indicates a well-governed AI cost position. Fewer than 10 confirmed items suggests significant exposure.

Scoring Guide
Tally your confirmed items against these benchmarks to determine your current AI token cost maturity.
0 – 9 High Exposure
10 – 14 Partial Governance
15 – 20 Well Governed

Section 1: Pricing Model Fundamentals

Most enterprise teams anchor their AI cost projections on the input token price alone. Actual spend is driven by output tokens, which are typically priced at three to five times the input rate, and by the output-to-input ratio of your specific workflows. Confirm you have baseline clarity across these foundational items before modelling any scenario.

1. You have mapped the input token price and output token price separately for every model in production.
Output tokens are charged at a premium relative to input tokens across all major providers — typically three to five times the input rate. A workflow that generates 500 output tokens for every 1,000 input tokens is spending proportionally far more on outputs than the headline rate implies. Failing to separate these costs is the most common source of first-year AI cost overruns.
● High Risk
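A minimal per-request cost calculation makes the split concrete. The $3 input and $15 output rates below are illustrative assumptions, not any vendor's actual prices:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost of one request in dollars, given per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 1,000 input / 500 output tokens at $3 in, $15 out per million:
# inputs contribute $0.003 and outputs $0.0075, so outputs are roughly
# 71 percent of spend despite being half the token volume.
cost = request_cost(1_000, 500, input_rate=3.0, output_rate=15.0)
```

Running this against actual token counts from your production logs, per model, is the quickest way to surface workloads where the headline input rate is misleading.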
2. You have calculated your actual input-to-output token ratio for each use case.
Chatbot conversations, document summarisation, code generation, and agentic workflows each carry very different input-to-output ratios. A coding assistant generating 800 output tokens per 200 input tokens runs at 4x output weighting; the same premium model that looks cheap on input pricing becomes expensive in practice. Instrument your pipelines to log actual token counts before you commit to a model selection.
● High Risk
3. You understand how context window size translates into billable tokens per request.
A 128k-token context window is a capability, not a free resource. Every token in the context window — including the full conversation history in multi-turn applications — is billed on each request. An enterprise chatbot with a 20,000-token system prompt resends those 20,000 tokens with every user message. Without prompt caching, this silently multiplies input token costs by the average conversation length.
● High Risk
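The compounding effect is easy to model. Assuming no caching and no trimming, billable input tokens for a conversation grow with every turn because the full history is resent each time:

```python
def conversation_input_tokens(system_prompt_tokens: int,
                              avg_turn_tokens: int,
                              turns: int) -> int:
    """Total billable input tokens across a multi-turn conversation.

    On turn k the request resends the system prompt plus all k prior
    turns of history (no caching, no trimming applied).
    """
    total = 0
    for k in range(1, turns + 1):
        total += system_prompt_tokens + avg_turn_tokens * k
    return total

# 20,000-token system prompt, 300-token turns, 10-turn conversation:
# the system prompt alone is billed ten times (200,000 tokens).
tokens = conversation_input_tokens(20_000, 300, 10)
```

Note the quadratic term: doubling conversation length roughly quadruples the history portion of the bill, which is why caching and trimming (items 5 and 10) matter so much.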
4. You have reviewed the rate schedule for thinking or reasoning tokens on extended-reasoning models.
Models with chain-of-thought reasoning capabilities — including OpenAI o-series and Claude's extended thinking mode — generate internal reasoning tokens that are billed separately from visible output tokens. These reasoning tokens can represent 60 to 80 percent of total output token cost on complex analytical tasks. If you are using reasoning models without tracking thinking token consumption, your cost model is incomplete.
● High Risk

Section 2: Cost Reduction Levers

LLM API prices dropped approximately 80 percent between early 2025 and early 2026, yet enterprise AI bills are rising — because usage and complexity are growing faster than unit prices are falling. The following optimisation levers are available to all enterprise buyers but are consistently under-utilised.

5. You have enabled prompt caching for all use cases with static or semi-static system prompts.
Prompt caching reduces input token costs by 75 to 90 percent on the cached portion. OpenAI's GPT-5 family, Anthropic's Claude, and Google's Gemini context caching all charge roughly 10 percent of the base input rate on cache hits. For enterprise applications where the system prompt is hundreds or thousands of tokens repeated across every request, caching is the single highest-return optimisation available, typically reducing total input costs by 50 to 70 percent.
● High Risk
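A simplified monthly-cost sketch shows the scale of the saving. This ignores cache-write premiums, minimum cacheable lengths, and TTL rules, all of which vary by provider, and the rates are illustrative:

```python
def cached_input_cost(prompt_tokens: int, cached_tokens: int,
                      requests: int, base_rate: float,
                      cache_hit_multiplier: float = 0.10) -> float:
    """Monthly input cost in dollars with prompt caching enabled.

    The first request writes the cache at the base rate; subsequent
    requests pay cache_hit_multiplier on the cached portion and the
    base rate on the uncached remainder.
    """
    uncached = prompt_tokens - cached_tokens
    first = prompt_tokens * base_rate
    rest = (requests - 1) * (uncached * base_rate
                             + cached_tokens * base_rate * cache_hit_multiplier)
    return (first + rest) / 1_000_000

# 10,000-token prompt with 9,000 tokens cacheable, 100,000 requests
# per month at a $3-per-million base rate:
with_cache = cached_input_cost(10_000, 9_000, 100_000, 3.0)
without_cache = 10_000 * 100_000 * 3.0 / 1_000_000
```

In this scenario the cached cost lands around $570 against $3,000 uncached, an input-cost reduction of roughly 80 percent on this workload.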
6. You route batch-eligible workloads through provider batch APIs to access the 50 percent discount tier.
All major AI API providers — OpenAI, Anthropic, and Google — offer batch processing endpoints that deliver a 50 percent discount in exchange for asynchronous delivery with up to 24-hour turnaround. Document processing, data enrichment, report generation, nightly analytics, and any workflow that does not require real-time response qualifies. Most enterprises that have not explicitly reviewed their workload inventory are routing batch-eligible tasks through the real-time API and paying double the necessary rate.
● High Risk
7. You have implemented model routing to direct queries to the appropriate model tier based on task complexity.
The single most impactful cost optimisation for high-volume deployments is routing queries based on complexity. A typical enterprise distribution — 70 percent of queries to a budget model such as GPT-5 Mini at $0.25 per million input tokens, 20 percent to a mid-tier model, and 10 percent to a premium model — reduces blended token cost by 50 to 70 percent versus routing all queries to the flagship model. Implement a lightweight classification layer before each AI call and review model routing rules quarterly as capability gaps narrow.
● Medium Risk
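The blended-rate arithmetic behind that claim is straightforward. The budget and premium rates below come from the text; the $2.50 mid-tier rate is an assumption for illustration:

```python
def blended_rate(tiers: list) -> float:
    """Blended per-million-token rate for a routing mix.

    tiers: (traffic_share, per_million_rate) pairs; shares must sum to 1.
    """
    assert abs(sum(share for share, _ in tiers) - 1.0) < 1e-9
    return sum(share * rate for share, rate in tiers)

# 70% budget ($0.25), 20% mid-tier ($2.50), 10% premium ($10.00):
mix = blended_rate([(0.70, 0.25), (0.20, 2.50), (0.10, 10.00)])
flagship_only = 10.00
```

With these assumed rates the blended cost is $1.675 per million tokens versus $10.00 for flagship-only routing. The saving depends entirely on how much traffic the classification layer can safely push to the cheaper tiers.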
8. You have applied output constraints — max_tokens limits and structured output modes — to all production pipelines.
Without explicit max_tokens parameters and conciseness instructions, LLMs default to verbose outputs. Enforcing JSON-only responses, setting max_tokens to the minimum necessary for the task, and including a clear instruction to be concise reduces output token consumption by 30 to 50 percent in structured-output pipelines, with no quality degradation. This is a zero-cost optimisation that should be implemented at pipeline build time, not as a retrofit after costs escalate.
● Medium Risk
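At build time the constraint is a few request parameters. The sketch below uses OpenAI-style chat-completion field names (`max_tokens`, `response_format`); other providers expose the same controls under different names, so treat the exact keys as an assumption to verify against your SDK:

```python
def constrained_request_params(prompt: str, max_tokens: int = 256) -> dict:
    """Build request parameters that cap output length and force
    concise, JSON-only responses."""
    return {
        "messages": [
            {"role": "system",
             "content": "Respond with valid JSON only. Be concise."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,            # hard ceiling on billable output
        "response_format": {"type": "json_object"},  # structured-output mode
    }

params = constrained_request_params("Classify this ticket.", max_tokens=128)
```

Setting the ceiling per pipeline, rather than relying on a global default, is what keeps the 30 to 50 percent saving from regressing as prompts evolve.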

Want us to calculate your current AI token spend and identify savings?

We benchmark against 500+ enterprise AI deployments to quantify your optimisation potential.
Download the Guide →

Section 3: Hidden and Ancillary Costs

OpsLyft's analysis of enterprise AI deployments found that hidden costs — retrieval augmentation, embedding generation, context window management, retry logic, and error handling — routinely add 40 to 60 percent on top of the inference bill. These costs are real but rarely appear in vendor pricing pages.

9. You have quantified embedding generation costs separately from inference costs in your RAG pipelines.
Retrieval-Augmented Generation architectures generate embedding tokens for every document chunk, every user query, and every retrieval call. Embedding API costs are charged separately from inference costs and do not appear in your LLM API bill. At scale, embedding costs for a 10-million-document knowledge base with daily delta updates can easily reach $5,000 to $15,000 per month, a cost category that many teams discover only when reviewing their first quarterly AI infrastructure bill.
● Medium Risk
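A recurring-cost sketch helps separate this line item from the inference bill. The volumes and the $0.10-per-million embedding rate below are hypothetical; substitute your own corpus delta and query figures:

```python
def monthly_embedding_cost(daily_delta_chunks: int, chunk_tokens: int,
                           daily_queries: int, query_tokens: int,
                           rate_per_million: float, days: int = 30) -> float:
    """Recurring monthly embedding spend from delta re-indexing plus
    query embedding. The initial full-corpus embed is a one-off cost
    excluded here."""
    daily_tokens = (daily_delta_chunks * chunk_tokens
                    + daily_queries * query_tokens)
    return daily_tokens * days * rate_per_million / 1_000_000

# 100,000 re-embedded chunks/day at 500 tokens each, plus 50,000
# queries/day at 20 tokens each, at $0.10 per million tokens:
monthly = monthly_embedding_cost(100_000, 500, 50_000, 20, 0.10)
```

Even at a cheap per-million rate, delta re-indexing dominates the query side; tracking the two separately tells you whether chunking strategy or query volume is driving the bill.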
10. You have reviewed context window creep in multi-turn applications and implemented context trimming or summarisation.
Every new message in a multi-turn conversation resends the entire conversation history. A 50-turn customer support conversation with an average of 300 tokens per turn accumulates 15,000 tokens of context by turn 50 — tokens that are billed every request. Without rolling window trimming or periodic summarisation, average context token costs grow linearly with conversation length. Implement summarisation at turn 10 to 15 and replace full conversation history with a compressed summary to cap context growth.
● High Risk
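A rolling-window trim is a few lines of code. This sketch keeps the most recent turns within a token budget; the summarisation call that replaces evicted turns is a separate model request, not shown:

```python
def rolling_window(history, budget_tokens):
    """Keep the most recent turns whose token counts fit the budget.

    history: list of (message_text, token_count) pairs, oldest first.
    Turns evicted from the window are candidates for a one-off
    summarisation call whose short output replaces them.
    """
    kept = []
    total = 0
    for msg, n in reversed(history):      # walk newest to oldest
        if total + n > budget_tokens:
            break
        kept.append((msg, n))
        total += n
    return list(reversed(kept))           # restore chronological order

turns = [("turn1", 100), ("turn2", 200), ("turn3", 300)]
trimmed = rolling_window(turns, budget_tokens=500)
```

With the trim in place, per-request context cost is capped at the budget instead of growing linearly with conversation length.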
11. You have measured token consumption from agentic AI workflows against the per-task estimates from pilot phase.
Gartner's March 2026 analysis confirms that agentic AI models require 5 to 30 times more tokens per task than standard chatbots, due to tool call loops, self-critique passes, multi-step planning, and error recovery. An agentic workflow that appeared to cost $0.02 per task in a controlled pilot may cost $0.30 to $0.60 per task in production with full error handling, retry logic, and sub-agent coordination. Verify production token consumption against pilot projections before any large-scale agentic rollout.
● High Risk
12. You have priced multimodal inputs — images, audio, and video — using provider-specific token conversion rates.
Multimodal inputs are billed using token equivalents, not file size. Google's Gemini 2.0 Flash charges 258 tokens per second of video and 25 tokens per second of audio. A 1024x1024 image converts to approximately 1,290 tokens. An enterprise document processing workflow that ingests 10,000 PDFs per month containing diagrams and charts may consume tens of millions of image tokens that are absent from the text-only cost model. Audit every modality present in your input pipeline against its token conversion rate before forecasting costs.
● Medium Risk
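The conversion arithmetic is simple once the per-modality rates are pinned down. The constants below are the Gemini 2.0 Flash figures quoted above; confirm current rates with each provider before using them in a forecast:

```python
VIDEO_TOKENS_PER_SEC = 258
AUDIO_TOKENS_PER_SEC = 25
TOKENS_PER_IMAGE = 1_290   # approximate, for a 1024x1024 image

def multimodal_input_tokens(video_seconds: float = 0,
                            audio_seconds: float = 0,
                            images: int = 0) -> int:
    """Token-equivalent input volume for non-text modalities."""
    return int(video_seconds * VIDEO_TOKENS_PER_SEC
               + audio_seconds * AUDIO_TOKENS_PER_SEC
               + images * TOKENS_PER_IMAGE)

# 10,000 PDFs per month, assuming an average of 4 chart images each:
image_tokens = multimodal_input_tokens(images=10_000 * 4)
```

Forty thousand images works out to over 50 million image tokens per month, a line item that is invisible in a text-only cost model.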

Section 4: Vendor Pricing Structures and Comparison

Headline vendor prices have converged significantly in 2026, but structural differences in pricing models, caching mechanisms, and enterprise tiers create meaningful cost divergence at enterprise scale. The following items address the vendor comparison work that should precede any multi-year commitment.

13. You have run a like-for-like cost comparison across OpenAI, Anthropic, Google (Vertex AI), and Azure OpenAI for your specific workloads.
Headline per-token rates are insufficient for workload-level cost comparison. Effective cost depends on input-to-output ratio, caching eligibility, batch API applicability, context window requirements, and rate limit constraints specific to your workload profile. Gemini 3 Flash is the lowest-cost option for many high-volume, lower-complexity tasks at $0.50 per million input tokens. Claude Haiku 4.5 at $1.00 per million input tokens leads on cache-heavy workloads. Azure OpenAI pricing may differ from OpenAI direct due to enterprise agreements. Run each workload through each vendor's calculator using actual token counts from production logs before committing.
● Medium Risk
14. You have assessed Azure OpenAI's Provisioned Throughput Units (PTU) model against pay-as-you-go for your peak workload profile.
Azure OpenAI's PTU model allocates guaranteed throughput capacity in exchange for a monthly or annual reservation fee, providing predictable pricing that eliminates per-token rate exposure. PTUs make economic sense when workloads are predictable, run at high sustained throughput, and require consistent latency guarantees. They are poor value for bursty, unpredictable, or low-volume workloads where reserved capacity sits idle. Model your peak-to-average utilisation ratio before committing to PTU reservations — low-utilisation commitments are a common source of AI budget waste.
● Medium Risk
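The break-even logic can be sketched with a utilisation calculation. Actual PTU sizing is done in throughput units rather than raw token counts, and every figure below is hypothetical; the point is the shape of the comparison:

```python
def ptu_break_even_utilisation(monthly_reservation: float,
                               reserved_tokens_per_month: int,
                               payg_rate_per_million: float) -> float:
    """Utilisation fraction above which a reserved-capacity model beats
    pay-as-you-go for the same workload."""
    payg_cost_at_full = (reserved_tokens_per_month
                         * payg_rate_per_million / 1_000_000)
    return monthly_reservation / payg_cost_at_full

# Hypothetical: $10,000/month reservation covering 5B tokens of
# capacity, versus $5 per million pay-as-you-go.
u = ptu_break_even_utilisation(10_000, 5_000_000_000, 5.0)
```

Here the reservation pays off only above 40 percent sustained utilisation; a bursty workload averaging 20 percent would pay double the pay-as-you-go equivalent.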
15. You understand the fine-tuning cost structure — training tokens, hosting fees, and inference price premiums — for any model customisation projects.
Fine-tuning a base model involves three separate cost components that are frequently conflated: training token cost (charged per million tokens in the training dataset), hosting fees for the fine-tuned model endpoint (often a fixed monthly fee regardless of usage), and an inference price premium (fine-tuned model endpoints are typically priced 30 to 100 percent higher than base model endpoints). A fine-tuning project with a 10-million-token training dataset, $300 per month endpoint hosting fee, and 2x inference price premium can have a total cost of ownership that exceeds the savings from improved model accuracy in any use case below 50,000 daily requests.
● Lower Risk
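Putting the three components into one monthly figure makes the comparison against the base model tractable. The 10-million-token dataset, $300 hosting fee, and 2x premium come from the example above; the $25-per-million training rate, 30-million-token monthly inference volume, and 12-month amortisation are assumptions:

```python
def fine_tune_monthly_tco(training_tokens: int, training_rate: float,
                          hosting_fee: float,
                          monthly_inference_tokens: int,
                          base_rate: float, premium_multiplier: float,
                          amortisation_months: int = 12) -> float:
    """Monthly total cost of a fine-tuned endpoint, amortising the
    one-off training run over amortisation_months."""
    training_monthly = (training_tokens * training_rate
                        / 1_000_000 / amortisation_months)
    inference = (monthly_inference_tokens * base_rate
                 * premium_multiplier / 1_000_000)
    return training_monthly + hosting_fee + inference

tco = fine_tune_monthly_tco(10_000_000, 25.0, 300.0,
                            30_000_000, 3.0, 2.0)
base_only = 30_000_000 * 3.0 / 1_000_000   # same volume on the base model
```

Under these assumptions the fine-tuned endpoint costs roughly $500 per month against $90 on the base model; the accuracy gain has to be worth that premium at your request volume.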
"Enterprise AI spend is growing 30 to 50 percent quarter-on-quarter for most organisations. The teams that built cost governance into their AI architecture from the start are achieving 60 to 80 percent savings versus teams that treated optimisation as a post-deployment exercise." — Senior Licensing Advisor, Redress Compliance

Section 5: Contract Terms and Negotiation

Enterprise buyers who commit to meaningful consumption volumes can secure 25 to 40 percent below list rates, along with commercial protections — data isolation, IP ownership, SLA upgrades, and price stability — that are absent from standard API terms. The following items address the negotiation preparation that transforms a commodity API relationship into a governed enterprise contract.

16. You have obtained volume pricing commitments and price-lock terms from your primary AI vendor.
All major AI API providers offer negotiated enterprise pricing for organisations committing to minimum annual consumption. OpenAI enterprise accounts committing at the 500-seat level can achieve 15 to 25 percent below list; two-year or three-year commitments add a further 5 to 15 percent. Google will reduce per-token rates and lock pricing for multi-year terms on consumption volume commitments. Without an enterprise agreement, you are subject to list price changes with limited notice — AI pricing has moved dramatically in both directions over short periods. Negotiate a price stability clause covering a minimum of 24 months alongside any volume commitment.
● High Risk
17. Your contract includes a Most Favoured Nation (MFN) clause that passes list price reductions to your enterprise rate.
AI API list prices have fallen approximately 80 percent between early 2025 and early 2026. An enterprise agreement that locks you into a fixed rate without an MFN clause means competitors on the standard API may be paying less than your negotiated enterprise price. An MFN clause — or a rate re-opener tied to list price changes exceeding a defined percentage — ensures your enterprise rate stays below or equal to market rates throughout the contract term. This is a standard commercial protection that most vendors will accept in enterprise negotiations.
● Medium Risk
18. You have assessed the cost impact of data residency and sovereignty requirements on your vendor shortlist.
The EU AI Act, fully applicable from August 2026, and equivalent regulations across APAC and the Americas impose data residency, audit, and documentation requirements on high-risk AI systems. Data residency-compliant deployments — regional API endpoints, private cloud hosting, or sovereign AI infrastructure — carry cost premiums of 10 to 25 percent versus standard shared-infrastructure API pricing, per Gartner estimates. Regulated industries (financial services, healthcare, defence) typically pay the higher end of this range. Factor compliance costs into your total cost of ownership before selecting a vendor architecture.
● Medium Risk

Section 6: Governance and FinOps Controls

The FinOps Foundation reports that 98 percent of organisations now manage some form of AI spend, up from 63 percent the prior year — yet only 44 percent have financial guardrails in place. Without active governance, AI token costs compound silently across teams, applications, and use cases.

19. You have implemented per-team, per-application token usage tagging, budget alerts, and hard consumption caps in your AI API governance framework.
Token consumption without tagging is invisible spend. Implement API key segmentation by team and application, daily token consumption dashboards with anomaly alerting, review triggers at 80 percent of monthly budget, and hard consumption caps that prevent any single use case from exhausting the enterprise allocation. The FinOps Foundation's 2026 guidance treats AI cost governance as a policy-as-code discipline alongside security and compliance controls. Organisations that treat it as a finance-only concern consistently report surprise overruns when new use cases reach scale without governance handoffs.
● High Risk
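The guardrail logic itself is simple; the hard part is wiring it to per-key spend data. A minimal sketch of the threshold check, with the 80 percent review trigger from above:

```python
def budget_status(spent: float, monthly_budget: float,
                  alert_threshold: float = 0.80) -> str:
    """Classify a team or application API key's month-to-date spend.

    Returns 'ok', 'review' (alert threshold crossed), or 'capped'
    (hard cap reached; further requests should be blocked pending
    review).
    """
    if spent >= monthly_budget:
        return "capped"
    if spent >= alert_threshold * monthly_budget:
        return "review"
    return "ok"

status = budget_status(spent=850.0, monthly_budget=1_000.0)
```

Run this per API key on every ingestion cycle and route 'review' and 'capped' states to the owning team, not just to finance.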
20. You conduct a quarterly AI vendor pricing review against current list rates and emerging alternatives.
AI token pricing is the fastest-moving cost category in enterprise technology, with new models of equivalent capability at lower price points released on a quarterly cycle. A model that represented best value at selection may be two to three generations behind the cost-efficiency frontier within 12 months. Formalise a quarterly review process that benchmarks your current production models against new releases from all major providers, using a standardised workload test suite. Assign ownership to a named individual or team; AI FinOps does not self-maintain.
● Medium Risk

Ready to build a governed AI contract and cost architecture?

Download our AI Platform Contract Negotiation Guide — covering all major vendors, pricing structures, and negotiation tactics.
Download Free Guide →

Next Steps

Score your confirmed items against the benchmarks at the top of this page. If you are in the High Exposure or Partial Governance bands, the following three actions will deliver the largest immediate impact:

First, enable prompt caching on all use cases with static system prompts. This is a configuration change that typically requires less than one engineering day and can reduce monthly input token costs by 50 to 70 percent immediately.

Second, audit your batch API eligibility. Identify every workload in production that does not require real-time response and migrate it to the batch API endpoint. The 50 percent discount on batch workloads is the most accessible cost reduction lever available without model changes or architectural redesign.

Third, engage your primary AI vendor's enterprise account team with documented consumption data and a credible multi-vendor comparison. Volume commitments with price stability, MFN protection, and data residency terms are negotiable — but only when you approach the conversation with benchmark data and a credible alternative.

Redress Compliance works exclusively on the buyer side, with no vendor affiliations. Our GenAI advisory practice has benchmarked AI token costs, negotiated enterprise AI contracts, and built FinOps governance frameworks across 500+ enterprise engagements. Contact us for a confidential review of your AI cost position.