OpenRouter's 50% Off GPT-5: Real Costs, RPM Caps, and Clean Benchmarks To Run Right Now

OpenRouter's 50% off GPT-5 promo is live right now. It runs from Sept 17 at 10:00 PST through Sept 24 at 10:00 PST, with a cap of about 20 requests per minute per API key. If you've been waiting to batch a content backlog or run benchmarks, this is the window. I cleared a big backlog today because the math is simple: half the bill for the same work. The latest posts are on adam.holter.com.

Numbers that matter for this week

  • Standard GPT-5 pricing: $1.25 per million input tokens, $10 per million output tokens
  • Cached input tokens: $0.125 per million (about 90% off standard input)
  • Promo: 50% off all GPT-5 usage during the week
  • Rate cap: ~20 RPM per API key

Other providers worth mixing in pipelines:

  • Qwen 30B A3B: about $0.07 in / $0.28 out per million tokens
  • Qwen3-Coder on Cerebras: about $2 in / $2 out per million tokens (see my earlier piece on their speed claims and limits: Cerebras opens a free 1M tokens per day tier)

Worked example A: a 10k-token refactor job

Assume 10,000 input tokens and 10,000 output tokens. No cache hits.

Standard cost
Input: 10,000 × $1.25 / 1,000,000 = $0.0125
Output: 10,000 × $10 / 1,000,000 = $0.10
Total: $0.1125

Promo cost
Input: $0.0125 × 0.5 = $0.00625
Output: $0.10 × 0.5 = $0.05
Total: $0.05625

Savings per 10k+10k job: $0.05625

Scaled view:

  • 1M input + 1M output tokens: Standard $11.25, Promo $5.625

Figure: cost comparison chart. The promo cuts the bill in half; numbers shown for a 10k job and for 1M input + 1M output tokens.
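
The arithmetic in worked example A can be sketched in a few lines of Python. This is only an illustration under the prices listed above; the names `job_cost` and `PROMO_MULTIPLIER` are mine, not anything from OpenRouter, and the function assumes no cache hits.

```python
# USD per million tokens, taken from the pricing list in this post.
GPT5_INPUT_PER_M = 1.25
GPT5_OUTPUT_PER_M = 10.00
PROMO_MULTIPLIER = 0.5  # 50% off during the promo week


def job_cost(tokens_in: int, tokens_out: int, multiplier: float = 1.0) -> float:
    """Return the USD cost of one job, assuming no cached input tokens."""
    input_cost = tokens_in * GPT5_INPUT_PER_M / 1_000_000
    output_cost = tokens_out * GPT5_OUTPUT_PER_M / 1_000_000
    return (input_cost + output_cost) * multiplier


# Print the 10k job and the scaled 1M + 1M view side by side.
for label, n in [("10k + 10k", 10_000), ("1M + 1M", 1_000_000)]:
    standard = job_cost(n, n)
    promo = job_cost(n, n, PROMO_MULTIPLIER)
    print(f"{label}: standard ${standard:.5f}, promo ${promo:.5f}")
```

Keeping the multiplier as a parameter means the same function gives you the standard-rate figure after the promo ends.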

What if I have a large static context?

Cached input tokens are already discounted by ~90% at $0.125 per million. During the promo that falls to $0.0625 per million. Example: 150k tokens of shared context reused across 100 requests is 15M cached tokens.

With cache, standard: 15M × $0.125 / 1M = $1.875
With cache, promo: 15M × $0.0625 / 1M = $0.9375
Without cache, standard: 15M × $1.25 / 1M = $18.75

Cache makes repeated context cheap, and the promo makes it cheaper. This also means A/B tests can be polluted if one branch warms the cache and the other doesn't. Use a cache-aware harness.
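
Here is a rough Python sketch of the shared-context math. It makes the same simplifying assumption as the numbers above: every one of the 100 requests is billed at a single rate (all cached or all uncached), which ignores the first request that warms the cache. The function name `shared_context_cost` is hypothetical.

```python
# USD per million input tokens, from the pricing list in this post.
STANDARD_INPUT_PER_M = 1.25
CACHED_INPUT_PER_M = 0.125
PROMO_MULTIPLIER = 0.5  # 50% off during the promo week


def shared_context_cost(
    context_tokens: int,
    requests: int,
    price_per_m: float,
    multiplier: float = 1.0,
) -> float:
    """USD cost of resending the same context on every request at one rate."""
    total_tokens = context_tokens * requests
    return total_tokens * price_per_m / 1_000_000 * multiplier


cached_standard = shared_context_cost(150_000, 100, CACHED_INPUT_PER_M)
cached_promo = shared_context_cost(150_000, 100, CACHED_INPUT_PER_M, PROMO_MULTIPLIER)
uncached_standard = shared_context_cost(150_000, 100, STANDARD_INPUT_PER_M)
print(cached_standard, cached_promo, uncached_standard)
```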

Worked example B: multi-agent pipelines hit the ~20 RPM cap

Scenario: 10 agents, each attempting 3 requests per minute. That's 30 RPM requested. The cap is ~20 RPM per key, so 10 of those requests will queue each minute. Throughput flattens at 20 RPM until the backlog clears.

Figure: RPM cap chart. Once you request more than 20 RPM on a single key, throughput flattens at the cap.

Queue depth math

If you request 30 RPM on a 20 RPM cap, backlog grows by 10 requests per minute. After 5 minutes, that's a queue of 50 requests. Even if you stop sending new requests, it will still take 2.5 minutes to drain at 20 RPM. Plan your A/B windows accordingly.

Figure: queue depth over time. Backlog grows at 10 requests per minute when you ask for 30 RPM against a 20 RPM cap.
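
The queue math is easy to check with a tiny simulation. This is a sketch that treats the cap as a fixed number of requests served per minute, which is a simplification of however the provider actually enforces it; `backlog_after` and `drain_minutes` are names I made up for the example.

```python
def backlog_after(minutes: int, requested_rpm: int, cap_rpm: int) -> int:
    """Queue depth after sending requested_rpm for the given number of minutes."""
    backlog = 0
    for _ in range(minutes):
        # Each minute the queue gains the new requests and loses cap_rpm of them.
        backlog = max(0, backlog + requested_rpm - cap_rpm)
    return backlog


def drain_minutes(backlog: int, cap_rpm: int) -> float:
    """Minutes needed to clear the queue once no new requests arrive."""
    return backlog / cap_rpm


queue = backlog_after(minutes=5, requested_rpm=30, cap_rpm=20)
print(queue, drain_minutes(queue, cap_rpm=20))
```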

How to measure true per-task cost during a discount

The risk during promos is that you understate future cost by 2×. Fix that with two numbers on every run:

  • Actual billed promo cost
  • Shadow cost at standard rates, using the same token counts

Also record cache hits and misses to avoid false conclusions when one branch benefits from a warm context. This is basic, but most teams skip it during a rush.

// Minimal run log JSON for clean accounting
{
  "run_id": "2025-09-18T10:12:45Z-abc123",
  "key_id": "exp-key-01",
  "model": "openrouter/gpt-5",
  "tokens_in": 10000,
  "tokens_out": 10000,
  "cached_in": 0,
  "rate_limit_events": 0,
  "latency_ms": 1840,
  "billed_promo_usd": 0.05625,
  "shadow_standard_usd": 0.1125,
  "cache_hit": false,
  "status": "ok"
}
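
One way to fill in the two cost fields of that log entry is sketched below. It assumes `cached_in` counts a subset of `tokens_in`, and that the promo is a flat 50% off everything; the promo figure it produces is an estimate, so reconcile it against what you are actually billed. The helper names are hypothetical.

```python
import json

# USD per million tokens, from the pricing list in this post.
INPUT_PER_M = 1.25
CACHED_INPUT_PER_M = 0.125
OUTPUT_PER_M = 10.00
PROMO_MULTIPLIER = 0.5  # 50% off during the promo window


def shadow_standard_usd(tokens_in: int, tokens_out: int, cached_in: int = 0) -> float:
    """Cost at standard rates. Assumes cached_in is a subset of tokens_in."""
    uncached_in = tokens_in - cached_in
    total = (
        uncached_in * INPUT_PER_M
        + cached_in * CACHED_INPUT_PER_M
        + tokens_out * OUTPUT_PER_M
    ) / 1_000_000
    return round(total, 6)


def cost_fields(tokens_in: int, tokens_out: int, cached_in: int = 0) -> dict:
    """Build the token and cost fields for one run-log entry."""
    standard = shadow_standard_usd(tokens_in, tokens_out, cached_in)
    return {
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cached_in": cached_in,
        # Estimated from token counts; the invoice is the source of truth.
        "billed_promo_usd": round(standard * PROMO_MULTIPLIER, 6),
        "shadow_standard_usd": standard,
        "cache_hit": cached_in > 0,
    }


print(json.dumps(cost_fields(10_000, 10_000), indent=2))
```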

Batch sizing under provider caps

With ~20 RPM per key you have two levers: serialize or increase per-request payload size. Larger requests reduce RPM pressure but increase latency and variance. A few rules I've found reliable:

  • Deterministic pipelines: fewer, larger calls are fine if you validate outputs downstream
  • Interactive loops: prefer more, smaller calls to reduce blast radius of a single failure
  • Always track tokens to verify that a bigger batch isn't accidentally inflating total tokens

Throughput planning quick math

Say you need to process 120 requests in an hour. At a hard cap of 20 RPM per key, maximum throughput per key is 1,200 requests per hour if each request is one API call. If your pipeline uses 3 calls per item, your per-key ceiling is 400 completed items per hour. If you need more, you will have to either stagger steps or use additional keys if policy allows.
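
The same planning math as a short Python sketch. It assumes the cap is a steady 20 calls per minute per key with no retries or 429s eating into it; `items_per_hour` and `keys_needed` are illustrative names.

```python
import math


def items_per_hour(cap_rpm: int, calls_per_item: int, keys: int = 1) -> int:
    """Completed items per hour when every item needs calls_per_item API calls."""
    calls_per_hour = cap_rpm * 60 * keys
    return calls_per_hour // calls_per_item


def keys_needed(target_items_per_hour: int, cap_rpm: int, calls_per_item: int) -> int:
    """Smallest number of keys that covers the target, if policy allows sharding."""
    calls_required = target_items_per_hour * calls_per_item
    return math.ceil(calls_required / (cap_rpm * 60))


print(items_per_hour(cap_rpm=20, calls_per_item=1))
print(items_per_hour(cap_rpm=20, calls_per_item=3))
print(keys_needed(target_items_per_hour=1_000, cap_rpm=20, calls_per_item=3))
```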

Provider mixing: when to keep GPT-5 and when to offload

Promos can distort routing and queue depth. Keep GPT-5 for steps where it clearly wins on quality, and offload bulk transforms to cheaper or faster pools.

  • Example split: planning or high-stakes refactor prompts on GPT-5; bulk transforms on Qwen 30B A3B at about $0.07 in / $0.28 out per million
  • Cerebras slots for Qwen3-Coder are roughly $2 in / $2 out per million; for predictable, token-heavy jobs with moderate quality needs, that can be the steady option. I covered practical limits here: Cerebras opens a free 1M tokens per day tier.

Illustrative numbers:

Step 1: GPT-5 plan
2k in + 2k out
Standard: $0.0025 + $0.02 = $0.0225
Promo: $0.01125

Step 2: Bulk transform on Qwen 30B A3B
50k in + 50k out
Cost: 50k × $0.07/M + 50k × $0.28/M = $0.0175

Step 2 on Qwen3-Coder Cerebras instead
50k in + 50k out
Cost: 50k × $2/M + 50k × $2/M = $0.20

During the promo, moving planning to GPT-5 is cheap. The bulk step may still live better on a cheaper provider if quality holds for your task class.
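
To compare splits like this quickly, a small price table is enough. In the sketch below the dictionary keys are hypothetical labels for this example, not real model IDs from any API, and the prices are the approximate figures quoted in this post.

```python
# (input, output) USD per million tokens. Keys are illustrative labels only.
PRICES_PER_M = {
    "gpt-5": (1.25, 10.00),
    "qwen-30b-a3b": (0.07, 0.28),
    "qwen3-coder-cerebras": (2.00, 2.00),
}


def step_cost(model: str, tokens_in: int, tokens_out: int, multiplier: float = 1.0) -> float:
    """USD cost of one pipeline step on the given model, with no cache hits."""
    price_in, price_out = PRICES_PER_M[model]
    cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return cost * multiplier


plan_promo = step_cost("gpt-5", 2_000, 2_000, multiplier=0.5)
bulk_qwen = step_cost("qwen-30b-a3b", 50_000, 50_000)
bulk_cerebras = step_cost("qwen3-coder-cerebras", 50_000, 50_000)
print(f"plan + Qwen bulk:     ${plan_promo + bulk_qwen:.5f}")
print(f"plan + Cerebras bulk: ${plan_promo + bulk_cerebras:.5f}")
```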

Promo playbook: what to actually run this week

  • High-token content refactors, long code audits, and dataset labeling passes
  • Head-to-head quality checks across providers with a fixed rubric and cold/warm cache variants
  • Latency profiling and retry policy tuning while traffic is spiky

If you need a deeper reminder on why prompt recall and long-tail specificity break, I wrote about that here: LLMs as a Lossy Encyclopedia. Use that mindset to design your A/B rubric.

Clean A/B testing during a discount window

  • Freeze model names and versions; record any routing metadata returned by the API
  • Run back-to-back pairs within the same minute to reduce drift from queue depth
  • Use fixed seeds where applicable and lock temperatures
  • Make caches explicit: run both cold-cache and warm-cache trials, or add a benign random salt to force a miss when needed
  • Log tokens, retries, 429s/5xx, and final status per trial
  • Keep shadow standard costs in the logs while you pay promo prices
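
The cold/warm pairing and the random salt can be sketched as follows. This assumes the cache keys on the beginning of the prompt, so the salt goes at the front; that is my assumption, so check how your provider's cache actually matches. The `salted` and `trial_plan` helpers and the salt format are made up for illustration.

```python
import secrets


def salted(prompt: str) -> str:
    """Prefix a benign random tag so the prompt no longer matches a cached prefix."""
    salt = secrets.token_hex(8)  # 16 random hex characters
    return f"[run-salt: {salt}]\n{prompt}"


def trial_plan(prompt: str, pairs: int) -> list[dict]:
    """Interleave cold (salted) and warm (unsalted) trials as back-to-back pairs."""
    plan = []
    for pair_index in range(pairs):
        plan.append({"pair": pair_index, "variant": "cold", "prompt": salted(prompt)})
        plan.append({"pair": pair_index, "variant": "warm", "prompt": prompt})
    return plan


for trial in trial_plan("Summarize the attached changelog.", pairs=2):
    print(trial["pair"], trial["variant"], trial["prompt"][:40].replace("\n", " "))
```

The first warm trial may still be a miss if nothing has warmed the cache yet, so log the cached token count the API reports rather than trusting the variant label.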

Detecting degraded QoS or routing fallbacks

Discount windows move traffic around. You want early warning when the pool is congested or the router swaps capacity behind the scenes.

  • Track TTFB and total latency per request; alert on spikes above your p95 baseline
  • Track 429 and 5xx per minute; auto backoff and spread traffic across keys if allowed
  • Capture response metadata that identifies the serving provider or shard; flag unexpected changes mid-run
  • Sample generations with a fixed test prompt across the run; compare token counts and output length distributions for drift

Simple rule: if latency jumps, error rate climbs, or provider metadata changes, pause the test or split the run so your results aren't polluted.
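
That rule can be turned into a small check over a window of recent requests. The sketch below is one possible shape, not a tuned monitor: the record fields (`latency_ms`, `status_code`, `provider`), the 1.5× spike factor, and the 5% error threshold are all assumptions you should replace with your own.

```python
def p95(values: list) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def pause_reasons(
    baseline_p95_ms: float,
    window: list,
    spike_factor: float = 1.5,
    max_error_rate: float = 0.05,
) -> list:
    """Return the reasons to pause a run; an empty list means keep going."""
    if not window:
        return []
    reasons = []
    if p95([r["latency_ms"] for r in window]) > spike_factor * baseline_p95_ms:
        reasons.append("latency")
    errors = sum(1 for r in window if r["status_code"] == 429 or r["status_code"] >= 500)
    if errors / len(window) > max_error_rate:
        reasons.append("errors")
    if len({r["provider"] for r in window}) > 1:
        reasons.append("provider_change")
    return reasons


baseline = p95(list(range(100, 2100, 100)))  # 20 sample latencies, 100..2000 ms
window = [{"latency_ms": 1500, "status_code": 200, "provider": "a"}] * 10
print(baseline, pause_reasons(baseline, window))
```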

Keys and RPM: how to avoid stepping on yourself

  • Pre-allocate experiment keys by pipeline or team and tag runs with the key id in logs
  • Throttle each agent to a budgeted RPM so aggregate load sits at or below 20 per key
  • If policy allows, shard across keys and enforce per-key ceilings in your scheduler
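
A per-key ceiling can be enforced client-side with a sliding one-minute window. This is a minimal single-threaded sketch; the class name `KeyRateLimiter` and the even split in `per_agent_budget` are my own choices, and the provider may count requests differently than a rolling 60 seconds.

```python
import time
from collections import deque


class KeyRateLimiter:
    """Allow at most max_per_minute requests per key in any rolling 60 seconds."""

    def __init__(self, max_per_minute: int = 20, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self) -> bool:
        """Return True and record the request if the key is under its ceiling."""
        now = self.clock()
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) < self.max_per_minute:
            self.sent.append(now)
            return True
        return False


def per_agent_budget(cap_rpm: int, agents: int) -> int:
    """Even RPM budget per agent so the aggregate stays at or below the cap."""
    return cap_rpm // agents


limiter = KeyRateLimiter(max_per_minute=20)
allowed = sum(limiter.try_acquire() for _ in range(30))
print(allowed, per_agent_budget(cap_rpm=20, agents=10))
```

With 10 agents on one key, the even split gives each agent 2 RPM, which is below the 3 RPM each was attempting in worked example B.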

About GPT-5 Thinking modes in ChatGPT

OpenAI added a thinking time toggle in ChatGPT for GPT-5 with Thinking. Plus, Pro, and Business users now have Standard as the default and Extended as an option, with Pro also getting Light and Heavy. Posts on X claimed indicative "juice" values: Light 5, Standard 18, Extended 64, Heavy 200. If you compare ChatGPT runs to your OpenRouter API runs, expect latency and output length differences based on that setting. For experiments, record the UI mode alongside token counts so you can normalize.

Tactical checklist for this week

  • Short bursts: compress expensive jobs into this window, but always log the standard-rate equivalent cost
  • Pre-allocate experiment keys: keep parallel tests from colliding with the ~20 RPM cap
  • Cache-aware harness: record hits and misses and run paired cold and warm trials
  • Throughput planning: stay at or below 20 RPM per key, or split workloads across keys if policy allows
  • QoS watch: add probes for latency, error rate, and provider metadata changes; pause or reroute on drift
  • Provider mixing: keep GPT-5 for the steps that need it; move bulk tokens to cheaper pools if quality is good enough

Bottom line

The discount is real and it changes the shape of your tests. The cap and routing dynamics are not the same as a normal week, so plan for it. If you have benchmark ideas, drop them in the comments or share your runs. And if you've been sitting on a big batch job, run it now while it costs half as much. I posted a lot today on adam.holter.com because the pricing made it a simple call.

Adam Holter

Founder of Ironwood AI. Writing about AI models, agents, and what's actually happening in the space.