SWE-Bench Pro Commercial Dataset: A harder, cleaner test of AI coding agents on real products

SWE-Bench Pro is the first software agent benchmark that feels like real work. It doesn’t hide ambiguity, it punishes regressions, and it pulls tasks from live products people actually use. The headline: models that cruise past 70% on prior tests land near 23% here, and scores slide further on the private commercial set. If you want to know whether an agent can handle messy codebases with real tests, this is the benchmark that matters.

Why SWE-Bench Pro changes how we measure coding agents

Most coding benchmarks are too clean. They prune the rough edges that make day-to-day engineering painful. SWE-Bench Pro keeps the rough edges and adds guardrails so evaluation is stable and reproducible:

  • Contamination resistance by design: The public and held-out sets draw from GPL-style copyleft repos that are usually excluded from commercial training corpora. The commercial set uses private startup codebases that are not public at all.
  • Real tasks across real products: Instances come from consumer apps, B2B services, and developer tools. Solutions average 107.4 lines of code across 4.1 files. This is long-horizon editing, not single-function trivia.
  • Human-augmented problem briefs: Instead of deleting under-specified issues, humans add just enough context to make them solvable without prescribing an implementation.
  • Reproducible Docker environments: Professional engineers prepare each repo so code and tests run out of the box, which removes the “it failed because of the environment” escape hatch.

That mix leads to a benchmark that is both harder and far less saturated than typical SWE tests. You won’t see inflated scores from models guessing patterns they’ve memorized. You’ll see whether the agent can actually reason through an unfamiliar codebase and ship a safe patch.

Dataset structure and what the commercial set really tests

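The full benchmark contains 1,865 tasks across 41 repos, split into three sets: a public set and a held-out set, both drawn from copyleft repos, and a commercial set drawn from private startup codebases.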

The commercial subset is the cleanest generalization test we have for SWE agents today. If a model performs well on private repositories it never could have seen, it’s doing more than pattern recall. That’s the point: reduce the chance of leakage so we measure actual problem solving.

The metric that matters: Resolve Rate

Resolve Rate is strict on purpose. A task is “resolved” only if both conditions hold in the provided environment:

  • Issue resolved: The new fail-to-pass tests that failed before now pass.
  • No regressions: The original pass-to-pass tests still pass.

No partial credit for clever diffs that break something else. If the patch doesn’t pass both gates, it’s a miss. That’s closer to how teams ship: you don’t take wins that sink the regression suite.
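To make the two gates concrete, here is a minimal sketch of how Resolve Rate could be computed from per-task test outcomes. The `TaskResult` schema and function names are my own assumptions for illustration, not the official harness. I also assume a task must have at least one fail-to-pass test to count as resolved.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after applying the agent's patch.

    Field names are hypothetical; they are not taken from the official harness.
    """
    task_id: str
    fail_to_pass: dict[str, bool]  # test name -> passes after the patch
    pass_to_pass: dict[str, bool]  # test name -> still passes after the patch


def is_resolved(result: TaskResult) -> bool:
    # Gate 1: every fail-to-pass test now passes (assumes at least one exists).
    issue_fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    # Gate 2: no pass-to-pass test regressed.
    no_regressions = all(result.pass_to_pass.values())
    return issue_fixed and no_regressions


def resolve_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks that clear both gates; no partial credit.
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

A patch that fixes the issue but breaks a single pass-to-pass test scores exactly the same as a patch that does nothing.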

How models actually perform on Pro

On SWE-Bench Verified, top models often clear 70%. On Pro, performance drops hard on the public set, and it drops even more on the commercial set. A few highlights:

  • Public set: Frontier models sit in the low 20s. For example, OpenAI GPT-5 scores 23.1% and Claude Opus 4.1 scores 22.7%.
  • Commercial set: Scores fall further. Claude Opus 4.1 moves from 22.7% to 17.8%. OpenAI GPT-5 moves from 23.1% to 14.9%.
  • Older models lag: OpenAI GPT-4o at 4.9% and Qwen-3 32B at 3.4% on the public set.
  • Language effects: Go and Python tasks trend higher. JavaScript and TypeScript are more erratic and often lower.
  • Repository effects: Some repos stay stubbornly below 10% for everyone. Others allow certain models to exceed 50%, which shows codebase structure and documentation matter a lot.
  • Complexity hurts: As patches span more files and lines, success rates sink. Multi-file planning remains a core weakness.

Figure: Public vs Commercial Resolve Rate on SWE-Bench Pro. On the commercial subset, both Claude Opus 4.1 and GPT-5 drop further, a clearer read on generalization to unseen private code.
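The size of that drop is easier to judge in relative terms. This is a quick piece of arithmetic on the two public-to-commercial pairs quoted above; the `drop` helper is just an illustrative name of mine.

```python
# Public and commercial Resolve Rates (percent) as quoted in this post.
scores = {
    "Claude Opus 4.1": (22.7, 17.8),
    "OpenAI GPT-5": (23.1, 14.9),
}


def drop(public: float, commercial: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = public - commercial
    relative = 100.0 * absolute / public
    return round(absolute, 1), round(relative, 1)


for model, (public, commercial) in scores.items():
    points, percent = drop(public, commercial)
    print(f"{model}: -{points} points ({percent}% relative)")
```

Claude Opus 4.1 loses 4.9 points, about 22% of its public score. GPT-5 loses 8.2 points, about 35% of its public score.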

How tasks are built and why that matters

Each instance comes from a real commit pair that both fixes the issue and proves nothing else broke:

  1. Sourcing: Curated public GPL and private company repos.
  2. Environment creation: Engineers assemble Docker builds with all dependencies.
  3. Harvesting: Commit scraping keeps only pairs that add fail-to-pass tests and preserve pass-to-pass tests.
  4. Augmentation: Humans write a problem statement and minimal requirements brief that lets you re-create the patch without prescribing how.

This pipeline avoids the usual pitfalls: toy problems, flaky test rigs, and prompts that read like a classroom exercise. It’s the closest thing we have to replaying a real ticket from a working product.
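The harvesting step is the part that is easiest to picture in code. Below is a rough sketch of the filter, assuming you already have per-test pass/fail results on the parent commit and on the fix commit. The `CommitPair` schema is hypothetical, and I treat a test that disappears after the fix as a regression, which is my assumption rather than a documented rule.

```python
from dataclasses import dataclass


@dataclass
class CommitPair:
    """Test status around a candidate fix commit (hypothetical schema)."""
    before: dict[str, bool]  # test name -> passed on the parent commit
    after: dict[str, bool]   # test name -> passed on the fix commit


def classify_tests(pair: CommitPair) -> tuple[set[str], set[str]]:
    """Split tests into fail-to-pass and pass-to-pass sets."""
    fail_to_pass = {
        name for name, passed in pair.after.items()
        # A test added by the fix commit counts as failing before.
        if passed and not pair.before.get(name, False)
    }
    pass_to_pass = {
        name for name, passed in pair.before.items()
        if passed and pair.after.get(name, False)
    }
    return fail_to_pass, pass_to_pass


def keep_pair(pair: CommitPair) -> bool:
    fail_to_pass, _ = classify_tests(pair)
    # Assumption: a test that passed before and fails or vanishes after is a regression.
    regressed = [
        name for name, passed in pair.before.items()
        if passed and not pair.after.get(name, False)
    ]
    return bool(fail_to_pass) and not regressed
```

A pair survives only if it adds at least one fail-to-pass test and keeps every previously passing test green, which mirrors the two gates in Resolve Rate.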

Why this benchmark is less saturated — and why that’s good

Benchmarks that sit in training sets become stale fast. Scores drift up, everyone tunes to the quirks, and eventually it’s a leaderboard of prompt hacks. SWE-Bench Pro fights that in two ways:

  • Legal and access barriers to training exposure: GPL for the public set and private code for the commercial set reduce the chance of contamination.
  • Scale and variety of tasks: 50–100+ instances per repo with medium-to-large patches across multiple files make overfitting harder.

The result: more headroom for real progress and less noise from memorization. If you improve here, it likely means the agent actually got better at reasoning across codebases.

Practical takeaways for teams

If you run an engineering org or you’re building an agent product, treat SWE-Bench Pro as the filter before you trust an agent with production code.

  • Set expectations: A 70% badge on older benchmarks won’t translate. Expect results in the 10–25% range on Pro today, depending on the model and task mix.
  • Use the commercial set for procurement: If a vendor claims strong generalization, ask for their Commercial leaderboard score at scale.com/leaderboard/swe_bench_pro_commercial.
  • Keep humans in the loop: Even top models miss most tasks. Use agents for triage, reproduction, skeleton patches, and test authoring, then finish with a human review.
  • Choose the right scaffold: Baseline results use the SWE-Agent framework. If you’re experimenting with alternatives like Replit’s approach, my notes on agent autonomy and incentives may help: Replit Agent 3 vs Open Source.
  • Mind cost and throughput: If you need to run many trials across models, see my write-up on clean pricing and caps for GPT-5 access: OpenRouter’s 50% Off GPT5.

Where models are breaking down

The failure modes carry a theme: long-horizon, multi-file edits remain brittle. Agents struggle when the fix touches several layers — say, a serialization tweak in a shared library that cascades into routing, input validation, and a UI regression. Symptoms you’ll see:

  • Narrow local fixes that miss broader invariants, causing hidden test failures.
  • Incorrect mental models of the repo structure, especially in monorepos and codebases with custom tooling.
  • JS and TS drift where dependency graphs and subtle typing issues burn cycles.
  • Fragile search behavior that bounces between files without a stable plan.

To compensate, your scaffold should push explicit planning, aggressive test running, and structured code reading before editing. The benchmark rewards that discipline.
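As a rough illustration of that discipline, here is a bare-bones control loop that reads before it edits, plans explicitly, and runs tests after every edit. Every hook in it (`read`, `plan`, `edit`, `run_tests`) is a hypothetical callable your scaffold would supply. None of this is SWE-Agent's API, and a real scaffold would be far more involved.

```python
from typing import Callable

# Hypothetical hooks supplied by the scaffold; not SWE-Agent APIs.
ReadFn = Callable[[], str]                # gather context from the repo
PlanFn = Callable[[str, str], str]        # (context, test feedback) -> plan
EditFn = Callable[[str], None]            # apply the edits the plan describes
TestFn = Callable[[], tuple[bool, str]]   # run tests -> (all passed, feedback)


def attempt_fix(read: ReadFn, plan: PlanFn, edit: EditFn, run_tests: TestFn,
                max_rounds: int = 3) -> bool:
    """Read first, plan explicitly, then edit and test in a bounded loop."""
    context = read()  # structured reading happens before any edit
    feedback = ""
    for _ in range(max_rounds):
        steps = plan(context, feedback)  # re-plan using the latest failures
        edit(steps)
        passed, feedback = run_tests()   # run the suite after every edit
        if passed:
            return True
    return False
```

The bounded loop is the point: each round feeds test failures back into the plan instead of letting the agent bounce between files without one.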

How to get started

If you need a quick mental model: Pro is the test that cuts away shortcuts. It reflects how teams actually use agents — on real code, in noisy environments, under a strict test suite. It’s the right way to compare models and the right bar to set before you put an agent in your pipeline.

Bottom line

SWE-Bench Pro is tougher, cleaner, and closer to reality. The public set provides a solid comparison point with low contamination risk. The commercial set is the high-signal check for generalization. Scores around 20% may look low next to glossy benchmarks, but they match what I expect from day-to-day use: agents that help, not replace, and still need guardrails. If you’re deciding what to ship, measure here.

Adam Holter

Founder of Ironwood AI. Writing about AI models, agents, and what's actually happening in the space.