OpenAI is giving developers a new way to trim their AI bills. The company has rolled out Flex processing, a beta pricing tier that trades fast responses for substantial savings, aimed at background jobs and experimental work where speed isn’t critical.
Flex processing cuts every per‑token rate in half for the o3 and o4‑mini models, but your calls drop to the back of the queue. In practice, that means:
Model | Standard Input | Flex Input | Standard Output | Flex Output |
---|---|---|---|---|
o3 | $10 / M tokens | $5 / M | $40 / M tokens | $20 / M |
o4‑mini | $1.10 / M tokens | $0.55 / M | $4.40 / M tokens | $2.20 / M |
For example, the “$5 / M” means $5 per million tokens processed. In other words, if you send the model one million input tokens (≈ 750,000 English words), you’ll be charged $5 on the Flex tier.
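To make that arithmetic concrete, here is a tiny cost estimator. It is only a sketch: the rates are the Flex numbers from the table above, and the token counts are made up for illustration.

```python
# Flex per-million-token rates from the table above (USD).
FLEX_RATES = {
    "o3": {"input": 5.00, "output": 20.00},
    "o4-mini": {"input": 0.55, "output": 2.20},
}

def flex_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated Flex-tier cost in dollars for a single job."""
    rates = FLEX_RATES[model]
    return (input_tokens / 1_000_000) * rates["input"] \
         + (output_tokens / 1_000_000) * rates["output"]

# Example: an enrichment pass on o4-mini that reads 2M tokens and writes 500k.
print(f"${flex_cost('o4-mini', 2_000_000, 500_000):.2f}")  # -> $2.20
```

At standard pricing the same job would cost $4.40, so the Flex discount works out to exactly half.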
- Target use‑cases: Data enrichment pipelines, large‑scale evaluations, asynchronous tasks, and any project you’d tag as “lower priority” or “non‑production.”
- Trade‑off: Responses may take longer, and, at peak demand, requests might be queued or throttled.
Flex vs. Batch
 | Flex (service_tier="flex") | Batch (/v1/batch endpoint) |
---|---|---|
Call style | Normal synchronous chat/completions call; just add service_tier: "flex" (see the Python sketch below this table). | Upload one .jsonl file that can bundle millions of requests. |
Turn‑around | No stated SLA. Requests run after real‑time traffic clears; typical latency is seconds to minutes, but it can spike to “please retry later” (TechCrunch). | Guaranteed ≤ 24 h. In practice, results often come back within ~10 min–1 h, but OpenAI reserves the full day (OpenAI Help Center, OpenAI Platform). |
Discount | –50 % versus the standard synchronous price list for the same model. | –50 % versus the standard synchronous price list for the same model (OpenAI Help Center). |
Rate limits | Still governed by your normal per‑minute/token caps, just deprioritized. | Separate “batch quota” (today up to ~250 M input tokens in a single job) that doesn’t count against live‑API rate limits (OpenAI Community). |
Streaming / functions | Allowed: everything the live endpoint supports, including streaming chunks and function calling. | No streaming. Each response is written to an output file you download after the job finishes (OpenAI Help Center). |
Integration effort | One extra parameter; ideal if your code already makes chat/completions calls. | Requires building a small pipeline: create file → submit batch → poll status → fetch results (sketched in the second example below). |
Best for | Medium‑latency tasks that still benefit from an immediate HTTP response: user‑facing features that can wait a bit, eval dashboards where freshness matters. | Huge offline workloads: nightly data enrichment, embedding or summarising millions of documents, large prompt A/B tests where real‑time speed is irrelevant. |
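If your code already calls chat/completions, opting into Flex really is a one‑parameter change. Below is a minimal sketch with the official Python SDK; it assumes an OPENAI_API_KEY in the environment and a model your account can access, and the generous client timeout is a precaution against queueing rather than an official recommendation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o4-mini",               # Flex pricing covers o3 and o4-mini
    service_tier="flex",           # the only change versus a standard call
    messages=[{"role": "user", "content": "Summarise this support ticket: ..."}],
    timeout=900,                   # Flex requests can sit in the queue, so allow extra time
)

print(response.choices[0].message.content)
```

The Batch route takes more plumbing. Here is a rough sketch of the create‑file → submit → poll → fetch loop from the table, again with the Python SDK; the documents and prompts are placeholders.

```python
import json
import time

from openai import OpenAI

client = OpenAI()

# 1. Write the requests to a .jsonl file, one request per line with a unique custom_id.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(["first document ...", "second document ..."]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "o4-mini",
                "messages": [{"role": "user", "content": f"Summarise: {doc}"}],
            },
        }) + "\n")

# 2. Upload the file and submit the batch with the 24 h completion window.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the job finishes (anywhere from minutes to the full 24 h).
while batch.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)

# 4. Download the results; each output line is keyed by its custom_id.
if batch.status == "completed":
    print(client.files.content(batch.output_file_id).text)
```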
Why Flex sometimes feels “almost as fast”
Flex jobs piggyback on whatever idle GPU capacity is free at the moment. During quiet periods the queue may be practically empty, so you get your answer in under a minute, essentially the same experience as a full‑price call.
But unlike Batch, there’s no SLA that guarantees completion; at peak usage, you can hit multi‑minute waits or transient “resource unavailable” errors. If consistent latency matters, you still have to pay full price (or build retry logic).
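A pragmatic middle ground is to try Flex first and escalate to the standard tier only when the queue pushes back. This is a hypothetical sketch, not an official pattern; the exact exceptions you’ll see (timeouts, 429‑style “resource unavailable” responses) depend on the SDK version and the moment.

```python
from openai import OpenAI, APIStatusError, APITimeoutError

client = OpenAI()

def ask(prompt: str) -> str:
    """Try the discounted Flex tier first, then retry at standard priority."""
    for tier in ("flex", "default"):
        try:
            response = client.chat.completions.create(
                model="o4-mini",
                service_tier=tier,
                messages=[{"role": "user", "content": prompt}],
                timeout=300 if tier == "flex" else 60,
            )
            return response.choices[0].message.content
        except (APITimeoutError, APIStatusError):
            continue  # Flex queue was busy or timed out; fall through to the next tier
    raise RuntimeError("Both tiers failed; try again later.")
```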
When to choose which
Choose Flex if… | Choose Batch if… |
---|---|
You can tolerate variable latency but not the complexity of new tooling. | You’re processing hundreds of thousands or millions of prompts and don’t need them back immediately. |
You still want each answer back in seconds to minutes via a normal HTTP response. | You need to blow past your normal rate limits or run jobs while you sleep. |
Just remember:
• Seconds‑to‑minutes latency target → Flex.
• Minutes‑to‑hours latency target, huge volume, or you want to forget about rate limits → Batch.
Both tiers deliver the same model quality—only the queueing strategy changes. So if 50 % off was the main attraction of Batch and your workloads need answers in < 10 min, Flex is the simpler lever to pull.
New Verification Requirements
Flex isn’t the only change: developers in usage tiers 1‑3 must now clear an ID‑verification step to unlock o3 (and certain features such as reasoning summaries and the streaming API). OpenAI says the measure helps keep malicious actors out of its ecosystem.
Just hours before OpenAI’s announcement, Google unveiled Gemini 2.5 Flash, a leaner model that squares up to DeepSeek’s R1 while undercutting it on input costs. OpenAI’s move indicates a broader race to serve developers who care as much about price efficiency as raw horsepower.
If your application can tolerate the occasional delay or brief unavailability, Flex processing offers a straightforward way to halve your token spend without switching models or vendors. For latency‑sensitive production systems, the standard full‑priority tier still reigns.
This is a welcome relief as the cost of running frontier models keeps creeping upward, and competitors rush out “budget” alternatives.