Vertex AI: The Enterprise Gateway to Gemini

Google offers Gemini model access through two distinct channels: the Gemini Developer API (ai.google.dev), targeted at individual developers and early-stage experimentation, and Vertex AI on Google Cloud Platform, the enterprise deployment path that provides the data governance, compliance, network controls, and commercial structures required for production workloads in regulated environments.

For enterprise buyers, Vertex AI is the appropriate deployment platform in virtually all cases. The Gemini Developer API lacks the VPC Service Controls, CMEK key management, audit logging, and enterprise SLA structures that production AI workloads require. It also does not integrate with your GCP billing and committed use discount structure, meaning you lose the ability to leverage your existing Google Cloud commercial agreements against Vertex AI consumption.

Understanding Vertex AI Gemini pricing therefore means understanding two layers simultaneously: the per-token rates for each Gemini model, and how those token costs interact with your broader GCP commercial structure — including committed use discounts, billing organisation hierarchy, and any Google Cloud Committed Use programs you have in place.

Gemini Model Pricing on Vertex AI: The Full Rate Card

Vertex AI pricing for Gemini models is published by Google and updated periodically as new model generations are released. The following rates reflect the current generation as of early 2025. All prices are per million tokens unless stated otherwise.

| Model | Input (≤200K ctx) | Output | Input (>200K ctx) | Cache Read |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $10.00 | $2.50 | $0.31 |
| Gemini 2.5 Flash | $0.30 | $2.50 | $0.60 | $0.075 |
| Gemini 2.0 Flash | $0.10 | $0.40 | $0.10 | $0.025 |
| Gemini 1.5 Pro | $1.25 | $5.00 | $2.50 | $0.31 |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.15 | $0.01875 |

These are published list rates. Enterprise organisations with annual Vertex AI or broader GCP spend commitments above $500,000 can negotiate custom pricing that reduces these rates by 15 to 40 percent depending on volume and commitment structure.

The 200,000 Token Context Window Threshold

One of the most commercially significant pricing features of Vertex AI Gemini is the context window threshold pricing model. Input tokens are priced at standard rates up to 200,000 tokens per request. Beyond 200,000 tokens, the input token rate doubles. This threshold has material implications for applications that use large-context reasoning — including document analysis, multi-turn conversations with extended history, or retrieval-augmented generation with large knowledge bases.

An application processing a 300,000-token input document using Gemini 2.5 Pro pays: 200,000 tokens at $1.25 per million ($0.25) plus 100,000 tokens at $2.50 per million ($0.25), totalling $0.50 per API call in input costs alone before any output tokens are generated. At 1,000 calls per day this is $15,000 per month in input token costs — a figure that often surprises teams who modelled costs based only on the standard rate.
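For teams building their own cost models, the tiered arithmetic is easy to express directly. The sketch below reproduces the worked example using the Gemini 2.5 Pro list rates from the table above; the 1,000 calls per day is the illustrative volume from the example, and none of this substitutes for your own billing data.

```python
# Minimal sketch of the two-tier context pricing described above, using the
# Gemini 2.5 Pro list rates from the table. Illustrative only, not a billing tool.

THRESHOLD = 200_000                   # tokens billed at the standard input rate
STANDARD_RATE = 1.25 / 1_000_000      # input rate, <=200K context, per token
LONG_CONTEXT_RATE = 2.50 / 1_000_000  # input rate, >200K context, per token

def input_cost(tokens: int) -> float:
    """Input cost for a single request under the threshold pricing model."""
    standard = min(tokens, THRESHOLD)
    long_context = max(tokens - THRESHOLD, 0)
    return standard * STANDARD_RATE + long_context * LONG_CONTEXT_RATE

per_call = input_cost(300_000)        # $0.25 + $0.25 = $0.50
monthly = per_call * 1_000 * 30       # 1,000 calls per day over 30 days
print(f"Per call: ${per_call:.2f}, monthly: ${monthly:,.0f}")  # $0.50 and $15,000
```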

Context Caching: The Most Underutilised Cost Lever

Google's context caching feature allows applications to cache large, frequently reused system prompts, reference documents, or conversation context server-side, then reference the cached content at a dramatically reduced rate on subsequent API calls. Cache read pricing is approximately 25 percent of the standard input rate across most Gemini models. Cache storage costs $1.00 per million tokens per hour, which is relevant for long-lived caches but minor relative to inference savings for most use cases.

The practical impact of context caching depends on application design. For a customer service application with a 10,000-token system prompt shared across all conversations, implementing context caching reduces the effective input cost of that system prompt by 75 percent across every API call after the initial cache write. For an application processing the same 50,000-token reference document repeatedly, caching can reduce costs by 60 to 75 percent on that document's contribution to total input costs.
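To see how the 25 percent cache read rate and the per-hour storage charge interact, the sketch below prices the shared 10,000-token system prompt example both ways. The Gemini 2.5 Pro rates come from the table above; the call volume and cache lifetime are assumptions chosen for illustration, not recommendations.

```python
# Hedged sketch of the caching trade-off for a shared 10,000-token system prompt.
# Rates are the Gemini 2.5 Pro list prices quoted above; traffic and cache
# lifetime are assumed values.

PROMPT_TOKENS = 10_000
INPUT_RATE = 1.25 / 1_000_000             # standard input, per token
CACHE_READ_RATE = 0.31 / 1_000_000        # cache read, per token (~25% of input)
STORAGE_RATE_PER_HOUR = 1.00 / 1_000_000  # cache storage, per token per hour

calls_per_hour = 500   # assumed traffic
hours = 24             # assumed cache lifetime

uncached = PROMPT_TOKENS * INPUT_RATE * calls_per_hour * hours
cached = (PROMPT_TOKENS * CACHE_READ_RATE * calls_per_hour * hours
          + PROMPT_TOKENS * STORAGE_RATE_PER_HOUR * hours)

print(f"Uncached: ${uncached:.2f}, cached: ${cached:.2f}, "
      f"saving: {1 - cached / uncached:.0%}")   # roughly 75% on the prompt's cost
```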

Google's own data suggests that context caching can reduce overall Gemini API costs by up to 90 percent for applications with large, repeated prompts. In practice, we see 40 to 70 percent reductions in applications where caching is properly implemented — representing one of the highest-return cost optimisation actions available without changing model selection or application architecture.


Provisioned Throughput: When Pay-As-You-Go Stops Being Optimal

Vertex AI offers a provisioned throughput option for Gemini models, where organisations purchase guaranteed capacity in advance rather than competing for shared capacity on the pay-as-you-go tier. Provisioned throughput is priced at a premium to pay-as-you-go per-token rates but provides throughput guarantees, lower latency, and predictable cost floors for high-volume workloads.

The economics of provisioned throughput favour organisations with predictable, high-volume API usage that can be modelled with reasonable accuracy 12 months in advance. The risk is overprovisioning: most enterprises size throughput to peak demand projections, leaving 30 to 50 percent of provisioned capacity unused during off-peak periods. The minimum commitment for provisioned throughput is monthly, which reduces the risk of annual overprovisioning relative to the annual reserved models used by some competitors.

When to Choose Provisioned vs Pay-As-You-Go

Pay-as-you-go is appropriate for development, testing, variable workloads, and organisations in the first 12 months of Vertex AI production deployment where usage patterns are not yet established. Provisioned throughput becomes economically viable when monthly Vertex AI spend exceeds approximately $10,000 consistently and usage patterns are predictable enough to model within a 20 percent variance band. The transition from pay-as-you-go to provisioned throughput should be an explicit commercial decision made on consumption data, not on the recommendation of Google's account team.
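One way to keep that decision explicit is to require that a provisioned commitment beats the pay-as-you-go bill even at the low end of your demand forecast. The figures below are placeholders; the provisioned price in particular must come from your own Google quote, not from any published schedule.

```python
# Illustrative decision check for provisioned throughput vs pay-as-you-go.
# The spend figures are placeholders; substitute your own consumption data
# and the provisioned price quoted for your workload.

def provisioned_is_cheaper(expected_payg_monthly: float,
                           provisioned_monthly_price: float,
                           demand_variance: float = 0.20) -> bool:
    """True only if the commitment wins even in the low-demand case."""
    low_case_payg = expected_payg_monthly * (1 - demand_variance)
    return provisioned_monthly_price <= low_case_payg

# $12,000/month expected pay-as-you-go spend, $10,500/month provisioned quote:
print(provisioned_is_cheaper(12_000, 10_500))  # False within a 20% variance band
```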

Grounding Costs: The Add-On That Changes Your TCO

Grounding connects Gemini model responses to real-world, up-to-date information by linking API calls to Google Search, web results, or Google Maps data. Grounding is priced separately from token consumption and adds materially to the cost of applications that use it at scale.

Google Search Grounding is priced at $35 per 1,000 grounded prompts after the first 1,500 free prompts per day. Web Grounding for Enterprise costs $45 per 1,000 grounded prompts, and Google Maps Grounding is $25 per 1,000 grounded prompts. For an enterprise application generating 100,000 grounded prompts per month, Search Grounding alone adds up to $3,500 to the monthly Vertex AI bill before the daily free allowance is applied, a cost that is entirely separate from token consumption and frequently omitted from initial TCO modelling.
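Because the free allowance is applied per day, the monthly grounding bill depends on how traffic is distributed. The sketch below uses the Search Grounding rate quoted above and assumes an even daily spread, which is an assumption rather than a property of the pricing.

```python
# Sketch of monthly Google Search Grounding cost with the daily free allowance
# folded in. Rates are the figures quoted above; even daily traffic is assumed.

RATE_PER_1000 = 35.0   # Search Grounding, per 1,000 grounded prompts
FREE_PER_DAY = 1_500
DAYS = 30

def monthly_grounding_cost(grounded_prompts_per_month: int) -> float:
    per_day = grounded_prompts_per_month / DAYS
    billable_per_day = max(per_day - FREE_PER_DAY, 0)
    return billable_per_day * DAYS * RATE_PER_1000 / 1_000

print(f"${monthly_grounding_cost(100_000):,.0f}")  # roughly $1,925 after the free tier
print(f"${100_000 * RATE_PER_1000 / 1_000:,.0f}")  # $3,500 with no free allowance
```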

The Gemini Enterprise Subscription: A Separate Commercial Vehicle

Separate from Vertex AI consumption-based pricing, Google offers a Gemini Enterprise subscription at $45 per user per month (annual commitment) or $54 per user per month on a monthly basis. Gemini Enterprise provides access to Gemini features within Google Workspace — including document summarisation, email assistance, meeting notes, and integrated AI features across Google productivity applications.

The critical distinction for enterprise buyers is that Gemini Enterprise on Workspace and Gemini on Vertex AI are separate commercial offerings serving different use cases. Workspace Gemini Enterprise provides end-user productivity AI features. Vertex AI Gemini provides API access for building custom AI applications. Organisations that deploy both may be paying for overlapping AI capabilities in some use cases — particularly if they are building internal enterprise applications on Vertex AI that replicate functionality already available through Workspace Gemini Enterprise.

"Gemini Enterprise on Workspace and Gemini Pro on Vertex AI are not substitutes — they are different products for different buyers within your organisation. But budget governance that treats them as separate cost centres without coordination creates duplicate spend that compounds at renewal."

GCP Agreement Interactions: Where Hidden Savings Exist

Vertex AI consumption on Google Cloud Platform counts toward your total GCP billing volume, which determines your qualification for Committed Use Discount programs and any custom pricing negotiated in a Google Cloud enterprise agreement. This interaction creates an important commercial dynamic: organisations with existing GCP committed spend may find that adding Vertex AI consumption accelerates their qualification for higher discount tiers that apply across all GCP services, not just AI.

For example, an organisation spending $800,000 per year on GCP compute and storage that adds $200,000 of Vertex AI consumption moves into a $1,000,000 annual spend tier, unlocking discount structures available at that threshold. The incremental discount on non-AI GCP services may partially or fully offset the cost of Vertex AI consumption, depending on the discount delta between tiers and the organisation's specific GCP workload mix.
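Whether the offset covers the new spend depends entirely on the discount delta between tiers, which is specific to each enterprise agreement. The percentages below are hypothetical and exist only to show the shape of the calculation.

```python
# Illustrative check of whether a higher GCP discount tier offsets new Vertex AI
# spend. The tier discounts are hypothetical; actual figures come from your own
# Google Cloud enterprise agreement.

existing_gcp_spend = 800_000    # annual compute and storage spend
vertex_ai_spend = 200_000       # new annual Vertex AI consumption

current_tier_discount = 0.08    # assumed discount below the $1M tier
next_tier_discount = 0.12      # assumed discount at the $1M tier

# Incremental saving on the existing (non-AI) portfolio from crossing the tier.
offset = existing_gcp_spend * (next_tier_discount - current_tier_discount)
net_vertex_cost = vertex_ai_spend - offset

print(f"Tier offset: ${offset:,.0f}, net Vertex AI cost: ${net_vertex_cost:,.0f}")
# With these assumed tiers: $32,000 offset, $168,000 net cost
```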

This interaction is rarely surfaced by Google's account teams in standard renewal conversations, because it requires modelling the GCP portfolio holistically rather than selling Vertex AI as a standalone product. Independent advisory support is particularly valuable in identifying and quantifying these cross-product commercial interactions.

Six Cost Optimisation Actions Before You Sign

1. Implement Context Caching: Before committing to any Vertex AI spend level, implement context caching for all applications with large, repeated system prompts or reference documents. The cost reduction — typically 40 to 70 percent on cached content — should be reflected in your consumption baseline before any commercial commitment is made.

2. Right-Size Model Selection: Deploy Gemini 2.5 Flash for high-volume routine tasks and reserve Gemini 2.5 Pro for complex reasoning tasks that genuinely require it. The cost differential is roughly 4x on both input and output tokens, and a mixed deployment strategy can reduce blended token costs by 30 to 50 percent relative to Pro-only deployments; see the blended-rate sketch after this list.

3. Model Context Window Usage: Identify all application paths that exceed 200,000 tokens per request and evaluate whether context caching, retrieval-augmented generation with shorter retrieved excerpts, or document chunking can reduce context window usage below the 2x pricing threshold.

4. Evaluate Grounding Costs Separately: If your application uses grounding, model grounding costs as a separate line item from token consumption. For high-grounding applications, grounding costs can represent 20 to 40 percent of total Vertex AI spend and warrant separate optimisation analysis.

5. Coordinate Vertex AI and GCP Agreement Negotiations: Negotiate Vertex AI pricing within the context of your overall GCP relationship, not as a standalone API cost discussion. The commercial leverage available from your total GCP spend is significantly greater than what is available from Vertex AI consumption alone.

6. Request Custom Pricing at Threshold: If your projected Vertex AI spend exceeds $500,000 annually, request custom pricing before signing any commitment. Published list rates are not the commercial reality for enterprise deployments at this scale.
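The blended-rate effect in action 2 is easiest to see with a quick model-mix calculation. The sketch below uses the Gemini 2.5 Pro and 2.5 Flash list rates from the table; the traffic split and monthly token volumes are assumptions, and the right split depends on which of your tasks genuinely need Pro.

```python
# Hedged blended-rate sketch for the mixed Pro/Flash strategy in action 2.
# Rates are the Gemini 2.5 Pro and 2.5 Flash list prices from the table;
# the traffic split and monthly token volumes are assumed for illustration.

RATES = {  # (input, output) in dollars per million tokens, <=200K context
    "gemini-2.5-pro":   (1.25, 10.00),
    "gemini-2.5-flash": (0.30, 2.50),
}

def monthly_cost(mix: dict, input_mtok: float, output_mtok: float) -> float:
    """mix maps model name to its share of traffic; shares should sum to 1.0."""
    return sum(share * (RATES[m][0] * input_mtok + RATES[m][1] * output_mtok)
               for m, share in mix.items())

pro_only = monthly_cost({"gemini-2.5-pro": 1.0}, input_mtok=500, output_mtok=100)
mixed = monthly_cost({"gemini-2.5-pro": 0.4, "gemini-2.5-flash": 0.6},
                     input_mtok=500, output_mtok=100)
print(f"Pro only: ${pro_only:,.0f}, 40/60 mix: ${mixed:,.0f}, "
      f"saving: {1 - mixed / pro_only:.0%}")   # about 45% with these assumptions
```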
