Why Falling Prices Have Not Reduced Bills

The most common misconception in enterprise AI budgeting is that lower model prices automatically translate into lower costs. They do not when usage scales proportionally, and it almost always does. Coding assistant adoption jumped from 11 percent to 50 percent of enterprise LLM usage between 2024 and 2026. Agentic workflows generate token volumes that dwarf chat-based use cases. Retrieval-augmented generation deployments send large document contexts with every query. The unit price falls while the consumption volume climbs.

The organisations that actually reduce their AI costs do it by addressing consumption efficiency, not by waiting for vendor price reductions. The three primary levers — prompt caching, model routing, and batch processing — are available today across every major provider and require engineering investment, not vendor negotiation. The broader context for managing AI consumption costs is covered in the AI consumption billing and token cost control guide.

Technique 1: Prompt Caching

Prompt caching is the highest-impact single technique available for reducing AI API costs. All three major providers — OpenAI, Anthropic, and Google — now offer prompt caching with a 90 percent discount on cached input tokens. The mechanism: when a request begins with a context window prefix that matches a previously submitted and cached prefix, the provider reuses the cached computation rather than processing the input from scratch. You pay only a fraction of the standard input token rate for the cached portion.

How Prompt Caching Works in Practice

For caching to trigger, the cached prefix must be at the beginning of the context window and must meet a minimum length threshold (typically 1,024 tokens or more, depending on the provider). This design means caching is most effective for workloads where a large, stable prefix — typically the system prompt plus any fixed document context — precedes a variable user query.

Consider a legal document review assistant with a 5,000-token system prompt that provides the AI with jurisdiction rules, formatting requirements, and review criteria. Every request from every user begins with that same 5,000-token prefix before the actual document or query. Without caching, you pay full input token rates on 5,000 tokens per request. With caching and a 90 percent cache-read discount, the effective cost of those 5,000 tokens drops to the equivalent of 500 tokens. At 1,000 requests per day, that is an effective reduction of 4.5 million input tokens per day from caching alone.
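The arithmetic above is simple enough to sanity-check in a few lines. A back-of-the-envelope sketch using the example's figures:

```python
# Back-of-the-envelope savings for the legal review example above.
PREFIX_TOKENS = 5_000        # stable system prompt, sent with every request
CACHE_READ_DISCOUNT = 0.90   # cached input tokens cost 10% of the standard rate
REQUESTS_PER_DAY = 1_000

# Effective token cost of the cached prefix per request (500 tokens equivalent)
effective_prefix = PREFIX_TOKENS * (1 - CACHE_READ_DISCOUNT)

# Daily effective input-token reduction from caching alone
daily_saving = (PREFIX_TOKENS - effective_prefix) * REQUESTS_PER_DAY
print(f"{daily_saving / 1e6:.1f}M input tokens saved per day")
```

This assumes a 100 percent cache hit rate; real hit rates are lower, which is why the hit-rate discussion below matters.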

Caching Across Providers

Anthropic Claude: Prompt caching is enabled via cache_control parameters in the API. Cache TTL is five minutes by default, configurable. Cache read tokens cost 10 percent of standard input token rates — a 90 percent discount. This makes Anthropic's caching one of the most economical options for high-throughput enterprise deployments. Full commercial context is in the Claude enterprise licensing guide for 2026.
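A minimal sketch of what this looks like in a request body, with the stable system prompt marked cacheable via cache_control. The model id and prompt text are placeholders, and the dict is shaped for the Anthropic Messages API:

```python
# Sketch of an Anthropic Messages API request with prompt caching enabled.
# The stable system prompt carries a cache_control marker so subsequent
# requests that reuse the same prefix are billed at the cache-read rate.
SYSTEM_PROMPT = "You are a legal document review assistant. ..."  # stable, >=1,024 tokens in practice

def build_cached_request(user_query: str) -> dict:
    """Build the request body; pass to anthropic.Anthropic().messages.create(**body)."""
    return {
        "model": "claude-sonnet-4-6",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # marks the cacheable prefix
            }
        ],
        "messages": [{"role": "user", "content": user_query}],
    }

body = build_cached_request("Summarise clause 4.2 of the attached contract.")
```

Only the portion up to and including the cache_control marker is cached; the variable user message after it is billed at standard rates.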

OpenAI GPT-5.4: Prompt caching is automatic for context windows over 1,024 tokens. Cached input tokens are billed at 10 percent of standard input rates — matching Anthropic's 90 percent discount. Cache eligibility persists for approximately 5 to 10 minutes of inactivity. For high-volume production applications, cache warming (sending a representative request at regular intervals to keep the cache live) is a common optimisation. The OpenAI enterprise procurement playbook covers how caching mechanics interact with enterprise agreement terms and volume commitments.
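Cache warming can be as simple as a timer check. A minimal sketch, assuming a five-minute inactivity TTL and an illustrative one-minute safety margin (both values are assumptions, not provider guarantees):

```python
# Sketch of a cache-warming check: if the cache is about to expire due to
# inactivity, send a cheap keep-alive request to keep the prefix cached.
CACHE_TTL_SECONDS = 300      # assume ~5 minutes of inactivity before eviction
SAFETY_MARGIN_SECONDS = 60   # warm the cache 1 minute before expiry

def needs_warming(seconds_since_last_request: float) -> bool:
    """True when a keep-alive request should be sent to preserve the cache."""
    return seconds_since_last_request >= CACHE_TTL_SECONDS - SAFETY_MARGIN_SECONDS
```

A scheduler polls this check and, when it returns true, fires one representative request with the standard prefix; the cost of the keep-alive is usually trivial against the cache-read savings it preserves.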

Google Gemini: Context caching on Vertex AI is available with a 90 percent discount on cached token reads, and Google eliminated long-context surcharges on the 1M token window in 2025, making cached large-context workloads significantly more economical.

Maximising Cache Hit Rates

Cache hit rate is the primary determinant of caching value. Architectures that maximise cache hit rates share three characteristics: stable system prompts (avoid dynamic variables injected into the system prompt at request time), prefix-first document ordering (place document context before user queries, not after), and high request frequency per session (more requests from the same application in the same time window means more cache hits).

A well-optimised production deployment with a stable 6,000-token system prompt and 200 daily active users each making 20 requests can achieve 80 to 90 percent cache hit rates on the system prompt prefix — reducing effective input token cost by 72 to 81 percent on that component.
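The relationship between hit rate and saving is a single multiplication. A quick sketch:

```python
# Effective saving on the cached prefix as a function of cache hit rate.
def prefix_saving(hit_rate: float, cache_discount: float = 0.90) -> float:
    """Fraction of the prefix's input-token cost eliminated by caching."""
    return hit_rate * cache_discount

low = prefix_saving(0.80)   # 80% hit rate at a 90% discount -> 72% saved
high = prefix_saving(0.90)  # 90% hit rate at a 90% discount -> 81% saved
```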

Want a detailed AI API cost optimisation assessment?

Our AI cost optimisation specialists have identified an average cost-reduction opportunity of 65 percent across the enterprise AI deployments we have audited.
Talk to Our AI Cost Specialists →

Technique 2: Intelligent Model Routing

Not every AI task requires the same model. GPT-5.4 and Claude Sonnet 4.6 are the right tools for complex reasoning, nuanced generation, and tasks where quality is paramount. They are not the right tools for simple classification, short summarisation, entity extraction, or structured data parsing — tasks that smaller, cheaper models handle at equivalent quality for a fraction of the cost.

Intelligent model routing evaluates each incoming request and directs it to the most cost-efficient model capable of handling it. Enterprise deployments with model routing in RAG setups have documented 27 to 55 percent overall cost reduction without measurable quality degradation. One published case reduced average tokens per request from 10,500 to 650 tokens (a 94 percent reduction) by combining routing with optimised prompt design.

Routing Architecture

A practical model routing architecture operates at two levels: task-type routing and quality-threshold routing.

Task-type routing classifies incoming requests by task type — classification, extraction, summarisation, generation, reasoning — and routes each type to a pre-selected model tier. Classification and extraction go to small models (GPT-4o Mini equivalent, Claude Haiku); summarisation and short generation go to mid-tier models; complex reasoning and long-form generation go to flagship models. This routing is deterministic and requires no per-request ML inference.
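Because task-type routing is deterministic, it can be a plain lookup table. A minimal sketch with illustrative tier names:

```python
# Deterministic task-type routing: each classified task type maps to a
# fixed model tier. Tier names are illustrative placeholders.
ROUTE_TABLE = {
    "classification": "small",        # GPT-4o Mini / Claude Haiku tier
    "extraction": "small",
    "summarisation": "mid",
    "short_generation": "mid",
    "reasoning": "flagship",
    "long_form_generation": "flagship",
}

def route(task_type: str) -> str:
    """Return the model tier for a task type; default to flagship when unsure."""
    return ROUTE_TABLE.get(task_type, "flagship")
```

Defaulting unknown task types to the flagship tier trades a little cost for safety; the reverse default risks silent quality loss.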

Quality-threshold routing uses a small, fast model to generate an initial response and evaluates it against a quality threshold. If the response meets the threshold, it is returned. If it does not, the request is escalated to a more capable model. This approach is most effective for customer-facing applications where quality is variable across request types.
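A minimal sketch of the escalation logic, with call_model and score_quality as stand-ins for real inference and a real evaluator:

```python
# Quality-threshold routing sketch: try the cheap model first, escalate to
# the flagship model only when the scored response falls below threshold.
QUALITY_THRESHOLD = 0.8

def answer(request, call_model, score_quality):
    """call_model(tier, request) -> response; score_quality(response) -> float."""
    draft = call_model("small", request)
    if score_quality(draft) >= QUALITY_THRESHOLD:
        return draft                        # cheap path: small model suffices
    return call_model("flagship", request)  # escalate to the capable model
```

The economics depend on the escalation rate: if most drafts pass the threshold, the blended cost approaches the small model's rate while worst-case quality stays at flagship level.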

Routing Infrastructure Tools

Two primary tools have emerged for enterprise model routing. LiteLLM is an open-source model router that provides a unified API interface across 100-plus AI providers, supports load balancing, fallback configuration, and spend tracking, and is self-hosted at no cost. It is the most commonly deployed option in enterprise environments that prioritise data sovereignty. PortKey is a managed AI gateway at approximately $49 per month for mid-tier usage, offering model routing, semantic caching, observability, and fallback management with less engineering overhead. Both tools can be configured to enforce cost-per-request limits and route based on budget constraints in addition to quality requirements.

Technique 3: Batch Processing APIs

Batch processing provides a 50 percent discount on both input and output tokens across OpenAI and Anthropic, with no quality degradation on the model response — you receive exactly the same output from the same model, delivered within a processing window of up to 24 hours rather than in real time. This is the most straightforward cost reduction available for any workload that does not require an immediate response.

The practical scope of batch-eligible workloads is larger than most teams initially assume. Document processing pipelines — legal review, contract extraction, compliance screening — are batch-eligible because documents can be queued for processing overnight. Data enrichment, entity extraction from records, content moderation, translation, and quality assurance workflows are all batch candidates. Even some customer-facing use cases, such as generating personalised report content or pre-computing AI responses for predictable query types, can use batch processing effectively.

OpenAI Batch API

The OpenAI Batch API accepts JSON files of requests with up to 50,000 requests per batch or 100MB file size. Responses are returned within 24 hours. The 50 percent discount applies to all GPT-5.4 and GPT-4o requests processed through the Batch API. For enterprises processing high volumes of documents on non-time-sensitive schedules, this halves the cost of their most document-intensive AI workloads. Enterprise OpenAI contract negotiations can explicitly include batch API access as a contractual commitment, ensuring price stability.
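A minimal sketch of building the batch input file, one JSON request per line. The model id and prompts are placeholders; the resulting JSONL file is then uploaded and submitted through the Batch API:

```python
import json

# Sketch of an OpenAI Batch API input file: one JSON request per line.
# Each line carries a custom_id so responses can be matched back to inputs.
def build_batch_line(custom_id: str, document_text: str) -> str:
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",  # illustrative; any batch-eligible model
            "messages": [
                {"role": "system", "content": "Summarise the contract."},
                {"role": "user", "content": document_text},
            ],
        },
    }
    return json.dumps(request)

lines = [build_batch_line(f"contract-{i}", f"...contract text {i}...") for i in range(3)]
jsonl = "\n".join(lines)  # write this to a .jsonl file for upload
```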

Anthropic Message Batches API

The Anthropic Message Batches API provides the same 50 percent discount model for Claude Sonnet 4.6 requests. Batches can contain up to 10,000 requests with responses returned within 24 hours.

Combining the Techniques: A Worked Example

The techniques compound when deployed together. Consider an enterprise document processing pipeline: 10,000 legal contracts processed monthly, each requiring a 6,000-token system prompt, 4,000 tokens of document content, and generating 1,000 tokens of output summary.

Without optimisation at GPT-5.4 PAYG rates: 100 million input tokens at $1.75 per million = $175, plus 10 million output tokens at $14 per million = $140. Total: $315 per month.

With prompt caching (80 percent hit rate on the 6,000-token system prompt, with cached reads billed at 10 percent of the standard rate): effective input cost falls from $175 to approximately $100. Adding batch processing (50 percent discount on the remaining input and output charges, where the provider allows the two discounts to stack): effective total approximately $120 per month. Combined saving: approximately 62 percent from the $315 baseline.
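The worked example can be checked with a small cost model. The figures mirror the pipeline above; the model assumes the cache and batch discounts stack, which holds on Anthropic's Message Batches API:

```python
# Cost model for the worked example: prompt caching on the system-prompt
# prefix, with batch processing applied on top of the remaining charges.
def monthly_cost(contracts, prefix_tok, doc_tok, out_tok,
                 in_rate, out_rate,          # $ per million tokens
                 hit_rate=0.0, cache_discount=0.9, batch_discount=0.0):
    # Cache hits are billed at (1 - cache_discount); misses at full rate.
    eff_prefix = prefix_tok * (hit_rate * (1 - cache_discount) + (1 - hit_rate))
    input_cost = contracts * (eff_prefix + doc_tok) * in_rate / 1e6
    output_cost = contracts * out_tok * out_rate / 1e6
    return (input_cost + output_cost) * (1 - batch_discount)

baseline = monthly_cost(10_000, 6_000, 4_000, 1_000, 1.75, 14.0)
optimised = monthly_cost(10_000, 6_000, 4_000, 1_000, 1.75, 14.0,
                         hit_rate=0.8, batch_discount=0.5)
```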

This compounding arithmetic is how the 95 percent cost reduction headline becomes achievable once model routing and output-length controls are layered on top of caching and batching. It requires the right workload characteristics, but those characteristics describe a significant proportion of enterprise AI workloads.

Additional Optimisation Techniques

Beyond the three primary techniques, several supplementary optimisations compound the savings.

Context window management: Many RAG implementations retrieve too much context — sending 10 chunks of retrieved content when two or three would suffice for the query. Each unnecessary chunk adds input tokens without improving response quality. Re-ranking retrieved context and enforcing stricter relevance thresholds before sending to the model reduces context length materially. The reduction is immediate and requires no vendor changes.
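A minimal sketch of stricter context selection, assuming relevance scores already come back from the retriever (the threshold and cap are illustrative values, not recommendations):

```python
# Sketch of stricter RAG context selection: keep only chunks above a
# relevance threshold, capped at a small top-k, before building the prompt.
RELEVANCE_THRESHOLD = 0.75
MAX_CHUNKS = 3

def select_context(scored_chunks):
    """scored_chunks: list of (relevance_score, chunk_text) pairs, any order."""
    ranked = sorted(scored_chunks, key=lambda sc: sc[0], reverse=True)
    kept = [text for score, text in ranked if score >= RELEVANCE_THRESHOLD]
    return kept[:MAX_CHUNKS]
```

Every chunk dropped here is input tokens that are never billed, which is why the saving is immediate and vendor-independent.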

Output length control: Output tokens cost 4 to 8 times more than input tokens. Prompts that explicitly constrain output length — "respond in 150 words or fewer", "return only the structured JSON fields" — reduce output token costs proportionally. For structured extraction tasks, specifying the exact output schema and format can reduce output token volume by 60 to 80 percent compared to open-ended generation.

Response caching: For queries that recur identically or near-identically — FAQ responses, standard report sections, regulatory compliance summaries — caching AI responses at the application layer means you pay for the token generation once and retrieve from cache thereafter. Semantic caching (matching queries by meaning rather than exact text) using vector similarity extends cache applicability to near-identical queries and is supported natively in PortKey and via custom implementation in LiteLLM.
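An exact-match version of this cache fits in a few lines. A sketch, with normalisation kept deliberately simple; a semantic cache would swap the hash key for a vector-similarity lookup:

```python
import hashlib

# Sketch of an exact-match response cache at the application layer.
_cache: dict[str, str] = {}

def _key(query: str) -> str:
    """Normalise whitespace and case, then hash to a stable cache key."""
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(normalised.encode()).hexdigest()

def cached_answer(query: str, generate) -> str:
    """Return a cached response when available; otherwise generate and store."""
    k = _key(query)
    if k not in _cache:
        _cache[k] = generate(query)  # pay for token generation once
    return _cache[k]
```

In production this dict would be a shared store such as Redis with a TTL, so cached answers expire when the underlying content changes.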

The broader commercial context — including how these optimisation techniques affect enterprise agreement economics and PTU commitments — is covered in the enterprise AI licensing guide for 2026 and the Azure OpenAI vs direct OpenAI enterprise comparison.

AI Cost Optimisation Updates

Caching mechanisms, batch API features, and model pricing change regularly. Subscribe to the Redress Compliance newsletter for monthly AI cost optimisation updates for enterprise teams.

Client result: In one engagement, a global professional services firm processing 12,000 contracts per month faced AI API costs of $280,000 annually. Redress identified and implemented prompt caching on their 8,000-token system prompt combined with batch API for overnight document runs. Total AI API cost dropped to $31,000 — an 89% reduction. The engagement fee was less than 4% of the first-year saving.

Where to Start: Prioritising Your Optimisation Effort

With multiple techniques available, the question is where to invest engineering effort first. The priority sequence should be determined by the magnitude of potential saving and implementation cost.

Prompt caching delivers the highest return for the lowest engineering investment. If your application has system prompts longer than 1,024 tokens and sends multiple requests per session, caching is available immediately by structuring your API calls correctly. Start here.

Batch processing requires identifying batch-eligible workloads and updating the relevant pipeline to use the Batch API endpoint rather than the standard API. It is a relatively small code change with a guaranteed 50 percent saving on those workloads, and a high priority for any document processing or data enrichment pipeline.

Model routing requires more engineering investment but delivers sustained savings as workload volumes grow. It is most impactful for organisations running diverse workloads across multiple model tiers. Download the AI platform contract and optimisation guide for a comprehensive optimisation implementation framework, and speak with our enterprise AI cost advisory specialists to identify the highest-value opportunities in your specific deployment profile.