The Cloudflare Blog

Your AI bill is out of control. Cloudflare can fix it now.

Ming Lu — Fri, 05 Jun 2026 13:00:00 GMT

There isn't a CIO on the planet not worried about AI spend right now. CFOs are increasingly nervous, too.

For fear of falling behind, many companies have pushed their employees to use AI as aggressively as possible. The edict was clear: "Move fast, we'll figure out the bill later." And for the most part, it worked: AI has been genuinely transformational for the teams that leaned in.

But the costs are real: we’ve heard countless horror stories of huge bills and painful overages on token spend.

Today, we're announcing spend controls in Cloudflare AI Gateway, and a closed beta for identity-driven budgets and routing using Cloudflare Access and your existing identity provider.

As we’ve spoken with hundreds of companies about their AI strategy, we’ve seen a common story: The company gives every engineer access to frontier models through a shared API key. Usage takes off. At the end of the month, finance pulls the invoice and nobody can explain where the money went. Was it the machine learning team training a new pipeline? Was it an intern running Claude Opus on email triage? Was it a runaway continuous integration job that burned through 50 million tokens in a weekend? Nobody knows, because the API key doesn't tell you who used it.

Without guidelines, staff will generally reach for the biggest model available. And why wouldn't they? If there's no budget, no visibility, and no routing logic, the rational move is to use the most powerful model for everything. The problem is that most tasks don't need a frontier model. A code review summary doesn't need the same model as a complex architecture refactor. A log parser doesn't need the same model as a customer-facing content generator. It should be easy to select the right tool for the job, rather than defaulting to the most powerful and expensive one. And it should be simple to see where the spend is going.

You can't calculate ROI on your AI spend without visibility into what you're spending, and you can't protect that ROI without controls. Every other line item in a business has a budget and per-team attribution, and AI spend should be no different.

What AI Gateway is

AI Gateway sits between your applications and AI providers. Instead of calling OpenAI, Anthropic, Google, or any other provider directly, your requests route through AI Gateway first.

This immediately gives you several useful tools:

Unified billing to easily switch between different providers and models
Logging across all providers — every request, token count, and cost in one place
Response caching
Rate limiting
Content guardrails and the ability to block Personally Identifiable Information (PII) and secrets before they reach the model
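
To make the routing concrete, here is a minimal sketch of pointing a client at a gateway instead of a provider. It assumes the gateway URL shape from the AI Gateway documentation (`https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}`); the helper name, account ID, and gateway name are placeholders chosen for this example.

```typescript
// Build the base URL a client would use so that requests pass through
// AI Gateway rather than going straight to the provider.
// Assumption: URL shape as documented for AI Gateway; the IDs are placeholders.
function gatewayBaseUrl(accountId: string, gatewayId: string, provider: string): string {
  const parts = [accountId, gatewayId, provider].map((p) => encodeURIComponent(p));
  return `https://gateway.ai.cloudflare.com/v1/${parts.join("/")}`;
}

// Hand this to the SDK or HTTP client in place of the provider's own base URL.
const openAiViaGateway = gatewayBaseUrl("my-account-id", "my-gateway", "openai");
```

Because only the base URL changes, the rest of the application code can usually stay as it is.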

Until now, however, AI Gateway had no easy way to answer who is spending what, or to let you set limits on that spend.

You could see aggregate usage across your account. But you couldn't see that Jane from engineering burned through $2,000 on Claude this month while the entire data science team only used $400. You couldn't set a budget that said "engineering gets $5,000/month on frontier models, interns get $200/month on Kimi K2.6."

That changes today.

Spend limits: budgets for AI usage

AI Gateway now supports spend limits as a core feature. These are true cost controls: budgets set in dollars, not tokens, that track cumulative spend across all requests and operate independently of traditional rate limiting.

You can scope limits to any combination of dimensions: model, provider, or admin-defined custom attributes like user, team, or application. Windows can be fixed (resets on the first of the month, Monday, or midnight) or rolling, and set to daily, weekly, or monthly.
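
The difference between fixed and rolling windows can be sketched as follows. This is an illustration only: it assumes windows are aligned to UTC, which the post does not specify, and the function names are hypothetical.

```typescript
type Period = "daily" | "weekly" | "monthly";

// Start of the fixed window that contains `now`.
// Assumption: windows are aligned to UTC; the real service may use another time zone.
function fixedWindowStart(now: Date, period: Period): Date {
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth();
  const d = now.getUTCDate();
  if (period === "monthly") return new Date(Date.UTC(y, m, 1));
  if (period === "daily") return new Date(Date.UTC(y, m, d));
  // Weekly: step back to Monday. getUTCDay() is 0 for Sunday, 1 for Monday.
  const daysSinceMonday = (now.getUTCDay() + 6) % 7;
  return new Date(Date.UTC(y, m, d - daysSinceMonday));
}

// A rolling window looks back a fixed duration from `now` instead of resetting.
function rollingWindowStart(now: Date, days: number): Date {
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
}

const publishedAt = new Date("2026-06-05T13:00:00Z"); // a Friday
const weekStart = fixedWindowStart(publishedAt, "weekly"); // Monday, 1 June 2026
const rollingStart = rollingWindowStart(publishedAt, 7); // 29 May 2026, 13:00 UTC
```

With a fixed window, spend resets for everyone at the same moment; with a rolling window, old spend ages out continuously, which avoids a burst of usage right after each reset.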

AI Gateway calculates cost per request based on the model's pricing, and tracks cumulative spend against your limit in real time. You can easily track your model spend on our analytics dashboard and filter by model, provider, or any custom attribute.
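
As a rough illustration of that accounting, the sketch below computes a per-request cost from token counts and tracks it against a dollar limit. The model names, prices, and the `SpendTracker` class are invented for this example and are not AI Gateway's implementation.

```typescript
interface ModelPrice {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Hypothetical prices for illustration only; real prices come from each provider.
const PRICES: Record<string, ModelPrice> = {
  "frontier-model": { inputPerMillionUsd: 10, outputPerMillionUsd: 30 },
  "small-model": { inputPerMillionUsd: 0.5, outputPerMillionUsd: 1.5 },
};

// Cost of one request in dollars, from its input and output token counts.
function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`no price configured for ${model}`);
  return (inputTokens * price.inputPerMillionUsd + outputTokens * price.outputPerMillionUsd) / 1e6;
}

// Tracks cumulative spend against a dollar limit for a single window.
class SpendTracker {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }

  remainingUsd(): number {
    return Math.max(0, this.limitUsd - this.spentUsd);
  }

  isExhausted(): boolean {
    return this.spentUsd >= this.limitUsd;
  }
}

const tracker = new SpendTracker(5);
tracker.record(requestCostUsd("frontier-model", 200000, 50000)); // costs $3.50
```

Budgeting in dollars rather than tokens matters because the same token count can cost very different amounts depending on the model.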

You have options for what happens when the budget limit is reached. By default, AI Gateway blocks further requests. Alternatively, you can set up rules through Dynamic Routes that send requests to a fallback model once a spend limit is hit, so that a hard spending cap won’t kill your engineers’ workflow. We’re also working on alerts that fire when a limit is reached.
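
The block-or-fall-back choice can be pictured as a small decision function. This is a sketch of the behavior described above, not the Dynamic Routes configuration format; the type and function names are hypothetical.

```typescript
type LimitAction =
  | { kind: "allow"; model: string }
  | { kind: "fallback"; model: string }
  | { kind: "block"; reason: string };

// Decide what to do with a request given the spend so far in the window.
// `fallbackModel` is optional: with none configured, the default is to block.
function applySpendLimit(
  requestedModel: string,
  spentUsd: number,
  limitUsd: number,
  fallbackModel?: string,
): LimitAction {
  if (spentUsd < limitUsd) return { kind: "allow", model: requestedModel };
  if (fallbackModel !== undefined) return { kind: "fallback", model: fallbackModel };
  return { kind: "block", reason: `spend limit of $${limitUsd} reached` };
}

const overBudget = applySpendLimit("frontier-model", 100, 100, "small-model");
// overBudget.kind is "fallback": the request continues on the cheaper model.
```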

Spend limits are available in open beta today for all AI Gateway users across all plans. Configure them in your gateway settings in the dashboard or via the API.

We use this ourselves

We're tracking token costs inside Cloudflare already. Every Cloudflare employee uses AI tools daily, routing millions of requests and billions of tokens per month through AI Gateway. We faced the same question every company faces at this scale: who's using what, and how do we budget for it?

We solved this by enabling AI Gateway to add identity to every request. When an employee authenticates via Cloudflare Access, we extract their identity from the JSON Web Token (JWT) and attach it as metadata on the AI Gateway request. This makes per-user token consumption, team-level usage breakdowns, and cost attribution across the organization all visible in one place.
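
A minimal sketch of the attachment step, assuming the Access JWT has already been validated and decoded elsewhere. The claim field names and the `identityHeaders` helper are hypothetical; `cf-aig-metadata` is AI Gateway's custom metadata header.

```typescript
// Claims we assume were already validated and decoded from the Access JWT.
// Field names are illustrative; check the token your identity provider issues.
interface AccessClaims {
  email: string;
  groups: string[];
}

// Build request headers that carry identity as AI Gateway custom metadata.
// Assumption: metadata is a flat JSON object of simple values, so only the
// first group is recorded as the team.
function identityHeaders(claims: AccessClaims): Record<string, string> {
  const metadata = {
    user: claims.email,
    team: claims.groups.length > 0 ? claims.groups[0] : "unassigned",
  };
  return { "cf-aig-metadata": JSON.stringify(metadata) };
}

const janeHeaders = identityHeaders({ email: "jane@example.com", groups: ["engineering", "staff"] });
```

Never build this metadata from an unverified token: attribution is only as trustworthy as the validation step that precedes it.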

Identity-driven budgets and policies (closed beta)

In addition to spend limits, today we’re also announcing identity-driven budgets and policies as a closed beta.

Spend limits in AI Gateway let you set budgets by model, provider, or custom attributes. But your application has to pass that metadata, and AI Gateway trusts whatever it receives. For verified, automatic attribution, you need identity.

When combined with Cloudflare Access, AI Gateway can see who is making each request — not just which account, but which employee, which identity provider (IdP) group, which service, etc.

Here's what that looks like in practice.

You can set per-user budgets, say $500/month for individual contributors and $2,000 for senior engineers. When a user hits their limit, requests can be downgraded to a cheaper model or blocked.

You can set per-team model policies. For instance, your ML team gets Claude Opus and GPT-4o. The brand design team can access generative image and video models. Interns use open-source models on Workers AI. These policies map directly to your existing IdP groups, the same identity provider groups you already manage.
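
One way to picture group-based policies is a lookup table keyed by IdP group. The group names, model identifiers, budgets, and the two resolution rules (any group may grant a model; the most generous budget wins) are assumptions made for this sketch, not documented behavior.

```typescript
interface TeamPolicy {
  allowedModels: string[];
  monthlyBudgetUsd: number;
}

// Hypothetical IdP group names, model IDs, and budgets, for illustration only.
const POLICIES: Record<string, TeamPolicy> = {
  "ml-team": { allowedModels: ["claude-opus", "gpt-4o"], monthlyBudgetUsd: 2000 },
  interns: { allowedModels: ["workers-ai-open-model"], monthlyBudgetUsd: 200 },
};

// A user may belong to several groups: allow the model if any group permits it.
function isModelAllowed(groups: string[], model: string): boolean {
  return groups.some((g) => {
    const policy = POLICIES[g];
    return policy !== undefined && policy.allowedModels.indexOf(model) !== -1;
  });
}

// Use the most generous budget among the user's groups (0 if none match).
function monthlyBudgetFor(groups: string[]): number {
  return groups.reduce((best, g) => {
    const policy = POLICIES[g];
    return policy ? Math.max(best, policy.monthlyBudgetUsd) : best;
  }, 0);
}
```

Keying policies on groups rather than individuals means that onboarding, offboarding, and team moves are handled by the identity provider, with no separate list to maintain.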

For CI/CD pipelines and autonomous agents, Access service tokens allow you to give each agent a named identity. You can see that your code review bot used 5 million tokens this week while your documentation generator used 500,000. If one agent is running out of control, apply a budget policy without affecting any others.

Every AI Gateway log entry will include the authenticated identity: email, IdP group, service token name. Export these to your analytics platform, and you've got a cost-by-user-by-team breakdown without building anything custom.
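
Once logs carry identity, the breakdown is a simple group-and-sum. The sketch below assumes a simplified log shape; the field names are hypothetical and the real export format may differ.

```typescript
// Simplified log entry; real AI Gateway logs contain many more fields.
interface GatewayLogEntry {
  email: string;
  idpGroup: string;
  costUsd: number;
}

// Sum cost per key, where the key is derived from each log entry.
function spendBy(
  entries: GatewayLogEntry[],
  keyOf: (entry: GatewayLogEntry) => string,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const entry of entries) {
    const key = keyOf(entry);
    totals.set(key, (totals.get(key) ?? 0) + entry.costUsd);
  }
  return totals;
}

const sampleLogs: GatewayLogEntry[] = [
  { email: "jane@example.com", idpGroup: "engineering", costUsd: 12.5 },
  { email: "jane@example.com", idpGroup: "engineering", costUsd: 7.5 },
  { email: "sam@example.com", idpGroup: "data-science", costUsd: 4 },
];

const spendByUser = spendBy(sampleLogs, (entry) => entry.email);
const spendByTeam = spendBy(sampleLogs, (entry) => entry.idpGroup);
```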

Under the hood, you create a Cloudflare Access application for your AI Gateway endpoint and configure policies based on your IdP groups. When a developer or agent makes a request, they authenticate via OAuth, using the typical CLI device-code flow. AI Gateway validates the token and extracts the identity. You don't need to write a custom Worker, parse JWTs yourself, or rely on honor-system metadata headers.

We recently wrote about how we built our internal AI engineering stack. This is what we are making available today — so you can use it, too, and you don't have to build it yourself.

If you would like access to the closed beta, sign up here.

What's next: from cost control to cost optimization

Setting a budget is necessary. But once you’ve got a budget, how do you make the most of it?

The reality is that not every request needs a frontier model: a summarization task can run on a smaller, cheaper model without meaningful quality loss, while a large-scale code refactor might require the bleeding edge. But without controls, people will almost always opt for the most advanced model.

A solution for that is coming next: We're building intelligent, task-based routing in AI Gateway. For each request, we can analyze and automatically route it to the model that will give you the best result at the lowest cost. This is in active development, so follow our developer docs and changelog.

Get started

It’s free to get started with AI Gateway. Spend limits are available now for all users.

If you haven't already, create a gateway and point your applications at it. From there, set up spend limits in the dashboard or via API. Start with a high limit in monitoring mode to understand your current usage patterns before you start enforcing.

If you want per-user attribution and team-based policies, sign up for the identity-driven budgets closed beta, and we'll get you set up with the Access integration.

We want to hear how you're managing AI costs today. Join the conversation on Cloudflare Community or reach out to discuss your broader AI security strategy.

Building the agentic cloud: everything we launched during Agents Week 2026

Ming Lu — Mon, 20 Apr 2026 13:00:00 GMT

Today marks the end of our first Agents Week, an innovation week dedicated entirely to the age of agents. It couldn’t have been more timely: over the past year, agents have swiftly changed how people work. Coding agents are helping developers ship faster than ever. Support agents resolve tickets end-to-end. Research agents validate hypotheses across hundreds of sources in minutes. And people aren't just running one agent: they're running several in parallel and around the clock.

As Cloudflare's CTO Dane Knecht and VP of Product Rita Kozlov noted in our welcome to Agents Week post, the potential scale of agents is staggering: If even a fraction of the world's knowledge workers each run a few agents in parallel, you need compute capacity for tens of millions of simultaneous sessions. The one-app-serves-many-users model the cloud was built on doesn't work for that. But that's exactly what developers and businesses want to do: build agents, deploy them to users, and run them at scale.

Getting there means solving problems across the entire stack. Agents need compute that scales from full operating systems to lightweight isolates. They need security and identity built into how they run. They need an agent toolbox: the right models, tools, and context to do real work. All the code that agents generate needs a clear path from afternoon prototype to production app. And finally, as agents drive a growing share of Internet traffic, the web itself needs to adapt for the emerging agentic web. Turns out, the containerless, serverless compute platform we launched eight years ago with Workers was ready-made for this moment. Since then, we've grown it into a full platform, and this week we shipped the next wave of primitives purpose-built for agents, organized around exactly those problems.

We are here to create Cloud 2.0 — the agentic cloud. Infrastructure designed for a world where agents are a primary workload.

Here's a list of everything we announced this week — we wouldn’t want you to miss a thing.

Compute

It starts with compute. Agents need somewhere to run, and somewhere to store and run the code they write. Not all agents need the same thing: some need a full operating system to install packages and run terminal commands, while most need something lightweight that starts in milliseconds and scales to millions. This week we shipped the environments to run them, as well as a new Git-compatible workspace for agents:

Announcement	Summary
Artifacts: Versioned storage that speaks Git	Give your agents, developers, and automations a home for code and data. We’ve just launched Artifacts: Git-compatible versioned storage built for agents. Create tens of millions of repos, fork from any remote, and hand off a URL to any Git client.
Agents have their own computers with Sandboxes GA	Cloudflare Sandboxes give AI agents a persistent, isolated environment: a real computer with a shell, a filesystem, and background processes that starts on demand and picks up exactly where it left off.
Dynamic, identity-aware, and secure: egress controls for Sandboxes	Outbound Workers for Sandboxes provide a programmable, zero-trust egress proxy for AI agents. This allows developers to inject credentials and enforce dynamic security policies without exposing sensitive tokens to untrusted code.
Durable Objects in Dynamic Workers: Give each AI-generated app its own database	Durable Object Facets allows Dynamic Workers to instantiate Durable Objects with their own isolated SQLite databases. This enables developers to build platforms that run persistent, stateful code generated on-the-fly.
Rearchitecting the Workflows control plane for the agentic era	Cloudflare Workflows, a durable execution engine for multi-step applications, now supports a concurrency limit of 50,000 and a creation rate limit of 300 through a rearchitected control plane, helping it scale to meet the use cases for durable background agents.

Security

Running agents and their code is only half the challenge. Agents connect to private networks, access internal services, and take autonomous actions on behalf of users. When anyone in an organization can spin up their own agents, security can't be an afterthought. It has to be the default. This week, we launched the tools to make that easy.

Announcement	Summary
Secure private networking for everyone: users, nodes, agents, Workers — introducing Cloudflare Mesh	Cloudflare Mesh provides secure, private network access for users, nodes, and autonomous AI agents. By integrating with Workers VPC, developers can now grant agents scoped access to private databases and APIs without manual tunnels.
Managed OAuth for Access: make internal apps agent-ready in one click	Managed OAuth for Cloudflare Access helps AI agents securely navigate internal applications. By adopting RFC 9728, agents can authenticate on behalf of users without using insecure service accounts.
Securing non-human identities: automated revocation, OAuth, and scoped permissions	Cloudflare is introducing scannable API tokens, enhanced OAuth visibility, and GA for resource-scoped permissions. These tools help developers implement a true least-privilege architecture while protecting against credential leakage.
Scaling MCP adoption: our reference architecture for enterprise MCP deployments	We share Cloudflare's internal strategy for governing MCP using Access, AI Gateway, and MCP server portals. We also launch Code Mode to slash token costs and recommend new rules for detecting Shadow MCP in Cloudflare Gateway.

Agent Toolbox

A capable agent needs to be able to think and remember, communicate, and see. This means being powered with the right models, with access to the right tools and the right context for their task at hand. This week we shipped the primitives — inference, search, memory, voice, email, and a browser — that turn an agent into something that actually gets work done.

Announcement	Summary
Project Think: building the next generation of AI agents on Cloudflare	Announcing a preview of the next edition of the Agents SDK — from lightweight primitives to a batteries-included platform for AI agents that think, act, and persist.
Add voice to your agent	An experimental voice pipeline for the Agents SDK enables real-time voice interactions over WebSockets. Developers can now build agents with continuous STT and TTS in just ~30 lines of server-side code.
Cloudflare Email Service: now in public beta. Ready for your agents	Agents are becoming multi-channel. That means making them available wherever your users already are — including the inbox. Cloudflare Email Service enters public beta with the infrastructure layer to make that easy: send, receive, and process email natively from your agents.
Cloudflare's AI platform: an inference layer designed for agents	We're building Cloudflare into a unified inference layer for agents, letting developers call models from 14+ providers. New features include Workers binding for running third-party models and an expanded catalog with multimodal models.
Building the foundation for running extra-large language models	We built a custom technology stack to run fast large language models on Cloudflare’s infrastructure. This post explores the engineering trade-offs and technical optimizations required to make high-performance AI inference accessible.
Unweight: how we compressed an LLM 22% without sacrificing quality	Running large LLMs across Cloudflare’s network requires us to be smarter and more efficient about GPU memory bandwidth. That’s why we developed Unweight, a lossless inference-time compression system that achieves up to a 22% model footprint reduction, so that we can deliver faster and cheaper inference than ever before.
Agents that remember: introducing Agent Memory	Cloudflare Agent Memory is a managed service that gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time.
AI Search: the search primitive for your agents	AI Search is the search primitive for your agents. Create instances dynamically, upload files, and search across instances with hybrid retrieval and relevance boosting. Just create a search instance, upload, and search.
Browser Run: give your agents a browser	Browser Rendering is now Browser Run, with Live View, Human in the Loop, CDP access, session recordings, and 4x higher concurrency limits for AI agents.

Prototype to production

The best infrastructure is also easy to use. We want to meet developers and their agents where they’re already working (in the terminal, in the editor, in a prompt) and make the full Cloudflare platform accessible without context-switching.

Announcement	Summary
Building a CLI for all of Cloudflare	We’re introducing cf, a new unified CLI designed for consistency across the Cloudflare platform, alongside Local Explorer for debugging local data. These tools simplify how developers and AI agents interact with our nearly 3,000 API operations.
Introducing Agent Lee - a new interface to the Cloudflare stack	Agent Lee is an in-dashboard agent that shifts Cloudflare’s interface from manual tab-switching to a single prompt. Using sandboxed TypeScript, it helps you troubleshoot and manage your stack as a grounded technical collaborator.
Introducing Flagship: feature flags built for the age of AI	Introducing Flagship, a native feature flag service built on Cloudflare’s global network to eliminate the latency of third-party providers. By using KV and Durable Objects, Flagship allows for sub-millisecond flag evaluation.
Deploy Postgres and MySQL databases with PlanetScale + Workers	Learn how to deploy PlanetScale Postgres and MySQL databases via Cloudflare and connect Cloudflare Workers.
Register domains wherever you build: Cloudflare Registrar API now in beta	The Cloudflare Registrar API is now in beta. Developers and AI agents can search, check availability, and register domains at cost directly from their editor, their terminal, or their agent — without leaving their workflow.

Agentic Web

As more agents come online, they're still browsing an Internet that was built for people. Existing websites need new tools to control what bots can access their content, package and present it for agents, and measure how ready they are for this shift.

Announcement	Summary
Introducing the Agent Readiness score. Is your site agent-ready?	The Agent Readiness score can help site owners understand how well their websites support AI agents. Here we explore new standards, share Radar data, and detail how we made Cloudflare’s docs the most agent-friendly on the web.
Redirects for AI Training enforces canonical content	Soft directives don’t stop crawlers from ingesting deprecated content. Redirects for AI Training allows anybody on Cloudflare to redirect verified crawlers to canonical pages with one toggle and no origin changes.
Agents Week: Network performance update	By migrating our request handling layer to a Rust-based architecture called FL2, Cloudflare has increased its performance lead to 60% of the world’s top networks. We use real-user measurements and TCP connection trimeans to ensure our data reflects the actual experience of people on the Internet.
Shared dictionary compression that keeps up with the agentic web	We give you a sneak peek of our support for shared compression dictionaries, show you how it improves page load times, and reveal when you’ll be able to try the beta yourself.

That’s a wrap

Agents Week 2026 is ending, but the agentic cloud is just getting started. Everything we shipped this week — from compute and security to the agent toolbox and the agentic web — is the foundation. We're going to keep building on it to give you everything you need to build what's next.

We also have more blog posts coming out today and tomorrow to continue the story, so keep an eye out for the latest at our blog.

If you're building on any of what we announced this week, we want to hear about it. Come find us on X or Discord, or head to the developer documentation.

Cloudflare’s AI Platform: an inference layer designed for agents

Ming Lu — Thu, 16 Apr 2026 14:05:00 GMT

AI models are changing quickly: the best model to use for agentic coding today might in three months be a completely different model from a different provider. On top of this, real-world use cases often require calling more than one model. Your customer support agent might use a fast, cheap model to classify a user's message; a large, reasoning model to plan its actions; and a lightweight model to execute individual tasks.

This means you need access to all the models, without tying yourself financially and operationally to a single provider. You also need the right systems in place to monitor costs across providers, ensure reliability when one of them has an outage, and manage latency no matter where your users are.

These challenges are present whenever you’re building with AI, but they get even more pressing when you’re building agents. A simple chatbot might make one inference call per user prompt. An agent might chain ten calls together to complete a single task, so a single slow provider doesn't add 50ms, it adds 500ms. And one failed request isn't just a retry: it can set off a cascade of downstream failures.
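
The arithmetic behind that claim is easy to check. The sketch below assumes the calls run sequentially and, for the reliability figure, that each call fails independently; the 99% per-call success rate is an illustrative number, not a measurement.

```typescript
// Extra latency when every call in a sequential chain is slowed by the same amount.
function addedLatencyMs(callsInChain: number, extraPerCallMs: number): number {
  return callsInChain * extraPerCallMs;
}

// Probability that the whole chain completes without a failure, assuming each
// call succeeds independently with probability `perCallSuccess`.
function chainSuccessRate(callsInChain: number, perCallSuccess: number): number {
  return Math.pow(perCallSuccess, callsInChain);
}

const slowdownMs = addedLatencyMs(10, 50); // 500
const tenCallSuccess = chainSuccessRate(10, 0.99); // roughly 0.904
```

Under these assumptions, a provider that looks 99% reliable per call leaves roughly one in ten ten-step tasks hitting at least one failure.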

Since launching AI Gateway and Workers AI, we’ve seen incredible adoption from developers building AI-powered applications on Cloudflare, and we’ve been shipping fast to keep up! In just the past few months, we've refreshed the dashboard and added zero-setup default gateways, automatic retries on upstream failures, and more granular logging controls. Today, we’re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.

One catalog, one unified endpoint

Starting today, you can call third-party models using the same AI.run() binding you already use for Workers AI. If you’re using Workers, switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or any other provider is a one-line change.

const response = await env.AI.run('anthropic/claude-opus-4-6', {
  input: 'What is Cloudflare?',
}, {
  gateway: { id: 'default' },
});

For those who don’t use Workers, we’ll be releasing REST API support in the coming weeks, so you can access the full model catalog from any environment.

We’re also excited to share that you'll now have access to 70+ models across 12+ providers — all through one API, one line of code to switch between them, and one set of credits to pay for them. And we’re quickly expanding this as we go.

You can browse through our model catalog to find the best model for your use case, from open-source models hosted on Cloudflare Workers AI to proprietary models from the major model providers. We’re excited to be expanding access to models from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu, all of whom will provide their models through AI Gateway. Notably, we’re expanding our model offerings to include image, video, and speech models so that you can build multimodal applications.

Accessing all your models through one API also means you can manage all your AI spend in one place. Most companies today are calling an average of 3.5 models across multiple providers, which means no one provider is able to give you a holistic view of your AI usage. With AI Gateway, you’ll get one centralized place to monitor and manage AI spend.

By including custom metadata with your requests, you can get a breakdown of your costs on the attributes that you care about most, like spend by free vs. paid users, by individual customers, or by specific workflows in your app.

const response = await env.AI.run(
  '@cf/moonshotai/kimi-k2.5',
  { prompt: 'What is AI Gateway?' },
  { metadata: { teamId: 'AI', userId: 12345 } },
);

Bring your own model

AI Gateway gives you access to models from all the providers through one API. But sometimes you need to run a model you've fine-tuned on your own data or one optimized for your specific use case. For that, we are working on letting users bring their own model to Workers AI.

The overwhelming majority of our traffic comes from dedicated instances for Enterprise customers who are running custom models on our platform, and we want to bring this to more customers. To do this, we leverage Replicate’s Cog technology to help you containerize machine learning models.

Cog is designed to be quite simple: all you need to do is write down dependencies in a cog.yaml file, and your inference code in a Python file. Cog abstracts away all the hard things about packaging ML models, such as CUDA dependencies, Python versions, weight loading, etc.

Example of a cog.yaml file:

build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"

Example of a predict.py file, which has a function to set up the model and a function that runs when you receive an inference request (a prediction):

from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(self,
            image: Path = Input(description="Image to enlarge"),
            scale: float = Input(description="Factor to scale image by", default=1.5)
    ) -> Path:
        """Run a single prediction on the model"""
        # ... pre-processing: turn `image` into a tensor named `model_input` ...
        output = self.net(model_input)
        # ... post-processing ...
        return output

Then, you can run cog build to build your container image, and push your Cog container to Workers AI. We will deploy and serve the model for you, which you then access through your usual Workers AI APIs.

We’re working on some big projects to be able to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting. We’ve been testing this internally with Cloudflare teams and some external customers who are guiding our vision. If you’re interested in being a design partner with us, please reach out! Soon, anyone will be able to package their model and use it through Workers AI.

The fast path to first token

Using Workers AI models with AI Gateway is particularly powerful if you’re building live agents – where a user's perception of speed hinges on time to first token or how quickly the agent starts responding, rather than how long the full response takes. Even if total inference is 3 seconds, getting that first token 50ms faster makes the difference between an agent that feels zippy and one that feels sluggish.

Cloudflare's network of data centers in 330 cities around the world means AI Gateway is positioned close to both users and inference endpoints, minimizing the network time before streaming begins.

Workers AI also hosts open-source models on its public catalog, which now includes large models purpose-built for agents, including Kimi K2.5 and real-time voice models. When you call these Cloudflare-hosted models through AI Gateway, there's no extra hop over the public Internet since your code and inference run on the same global network, giving your agents the lowest latency possible.

Built for reliability with automatic failover

When building agents, speed is not the only factor that users care about – reliability matters too. Every step in an agent workflow depends on the steps before it. Reliable inference is crucial for agents because one call failing can affect the entire downstream chain.

Through AI Gateway, if you're calling a model that's available on multiple providers and one provider goes down, we'll automatically route to another available provider without you having to write any failover logic of your own.
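
For comparison, this is roughly the failover logic you would otherwise write yourself. It is a generic sketch with hypothetical names; how AI Gateway selects and orders providers is not described in this post.

```typescript
type ProviderCall = () => Promise<string>;

interface ProviderRoute {
  provider: string;
  call: ProviderCall;
}

// Try each provider in order and return the first successful response.
// If all fail, surface every error so the caller can see what went wrong.
async function callWithFailover(
  routes: ProviderRoute[],
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const route of routes) {
    try {
      return { provider: route.provider, text: await route.call() };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${route.provider}: ${message}`);
    }
  }
  throw new Error(`all providers failed (${errors.join("; ")})`);
}
```

Even this small version has to decide on ordering, error reporting, and what counts as a failure, which is the kind of logic the gateway takes off your hands.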

If you’re building long-running agents with Agents SDK, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they’re generated, independently of your agent's lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or paying twice for the same output tokens. Combined with the Agents SDK's built-in checkpointing, the end user never notices.
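
The resume behavior can be sketched as a buffer that outlives the client connection and hands back whatever the client missed. This in-memory class is an illustration with hypothetical names, not how AI Gateway stores responses.

```typescript
// Buffers chunks of a streamed response so a client can reconnect and
// continue from the last chunk it saw. In-memory only, for illustration.
class ResumableStreamBuffer {
  private chunks: string[] = [];
  private done = false;

  append(chunk: string): void {
    if (this.done) throw new Error("stream already finished");
    this.chunks.push(chunk);
  }

  finish(): void {
    this.done = true;
  }

  // Return every chunk from `offset` onward, plus the offset to resume from next time.
  readFrom(offset: number): { chunks: string[]; nextOffset: number; done: boolean } {
    const start = Math.max(0, Math.min(offset, this.chunks.length));
    return { chunks: this.chunks.slice(start), nextOffset: this.chunks.length, done: this.done };
  }
}

const buffer = new ResumableStreamBuffer();
buffer.append("Hel");
buffer.append("lo");
const firstRead = buffer.readFrom(0); // the client sees both chunks, then disconnects
buffer.append(" world"); // generation continues while the client is away
buffer.finish();
const resumed = buffer.readFrom(firstRead.nextOffset); // only the missed chunk
```

Because the model's output is read from the buffer and not regenerated, a reconnect costs no additional output tokens.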

Replicate

The Replicate team has officially joined our AI Platform team, so much so that we don’t even consider ourselves separate teams anymore. We’ve been hard at work on integrations between Replicate and Cloudflare, which include bringing all the Replicate models onto AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. Soon, you’ll be able to access the models you loved on Replicate through AI Gateway, and host the models you deployed on Replicate on Workers AI as well.

Get started

To get started, check out our documentation for AI Gateway or Workers AI. Learn more about building agents on Cloudflare through Agents SDK.
