What coding agent benchmarks miss
SWE-Bench is necessary, insufficient, and a little misleading. Here’s what to add.
Freebuff Research · Benchmarks and evals
TL;DR
- SWE-Bench measures patch correctness on isolated GitHub issues: useful, not enough.
- Real-world agents are judged on: cost, iteration speed, recovery from wrong turns, and willingness to ask.
- Freebuff Research uses a 6-axis eval that better predicts user retention.
If you optimize purely for SWE-Bench, you ship a brittle agent that nails the benchmark and frustrates users. Here’s why — and what we measure instead.
The 4 things benchmarks miss
1. Cost per accepted change. A patch that takes 80 tool calls may be "right" but unaffordable.
2. Recovery from wrong turns. Real bugs hide. Did the agent backtrack, or dig itself deeper?
3. Asking when uncertain. A good engineer asks. A SWE-Bench-optimized agent always answers.
4. Cross-file consistency. Benchmarks rarely test changes that span 10+ files and must stay in sync.
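The first item above can be made concrete with a small calculation. This is a minimal sketch, assuming each agent run is logged with a dollar cost and an accepted/rejected outcome. The `Run` record and the `cost_per_accepted_change` helper are hypothetical names for illustration, not part of any Freebuff tooling, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent attempt at a task (hypothetical record, for illustration)."""
    cost_usd: float  # total spend on model and tool calls for this run
    accepted: bool   # True if a reviewer accepted the resulting change


def cost_per_accepted_change(runs: list[Run]) -> float:
    """Total spend divided by the number of accepted changes.

    Rejected runs still count toward spend, so wasted attempts raise the figure.
    Returns infinity when nothing was accepted.
    """
    total = sum(r.cost_usd for r in runs)
    accepted = sum(1 for r in runs if r.accepted)
    return total / accepted if accepted else float("inf")


# Illustrative numbers only, not measurements.
runs = [Run(0.50, True), Run(1.25, False), Run(0.25, True)]
print(cost_per_accepted_change(runs))  # 2.00 spent / 2 accepted = 1.0
```

Dividing by accepted changes rather than by total runs is the point of the metric: an agent that is cheap per call but needs many rejected attempts can still be expensive per merged change.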
The Freebuff Research 6-axis eval
| Axis | Freebuff Research 6-axis eval | SWE-Bench Verified alone |
|---|---|---|
| Patch correctness | Yes — weighted 25% | 100% of score |
| Cost per accepted PR | Yes — weighted 20% | Not measured |
| Recovery from wrong turns | Yes — weighted 15% | Not measured |
| Clarification questions asked | Yes — weighted 10% | Not measured |
| Cross-file consistency (>10 files) | Yes — weighted 20% | Not measured |
| User-reported friction | Yes — weighted 10% | Not measured |
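One way to read the weights in the table is as a weighted sum. This is a sketch only: the article does not say how each axis is normalized, so the code assumes every axis has already been mapped to a score in [0, 1] where higher is better (cost and friction would need to be inverted first). The axis keys and the `composite_score` function are illustrative names.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "patch_correctness": 0.25,
    "cost_per_accepted_pr": 0.20,
    "recovery_from_wrong_turns": 0.15,
    "clarification_questions": 0.10,
    "cross_file_consistency": 0.20,
    "user_reported_friction": 0.10,
}


def composite_score(axis_scores: dict[str, float]) -> float:
    """Weighted sum of per-axis scores.

    Assumption: each score is already normalized to [0, 1], higher is better.
    """
    missing = WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    for axis in WEIGHTS:
        if not 0.0 <= axis_scores[axis] <= 1.0:
            raise ValueError(f"{axis} must be in [0, 1]")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)


# An agent that is perfect on patch correctness and scores zero elsewhere
# earns only the 25% that patch correctness carries.
benchmark_only = {axis: 0.0 for axis in WEIGHTS}
benchmark_only["patch_correctness"] = 1.0
print(composite_score(benchmark_only))  # 0.25
```

Under this reading, a perfect SWE-Bench-style result caps out at a quarter of the total score, which is the sense in which patch correctness is necessary but not sufficient.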
Concrete failure modes we caught
- Confident wrong patch: A top SWE-Bench agent failed silently on a typed migration: the tests passed, but the runtime behavior broke. The cross-file consistency check caught it.
- Cost runaway: One frontier agent burned $4.20 on a task another agent solved for $0.18, roughly 23 times the cost. Same correctness; very different ROI.
- No-question bias: SWE-Bench-tuned agents never ask. Real users want the agent to stop and ask when requirements are ambiguous.
What this means for buyers
If you’re comparing agents, don’t stop at SWE-Bench. Run them on 5 of your most representative tasks. Measure cost, time-to-merge, and how many tries it takes. The leaderboard winner is often not the winner on your repo.
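The comparison described above can be kept in a small script. This is a minimal sketch, assuming you record cost, time-to-merge, and number of attempts by hand for each agent and task. The `TaskResult` record, the `summarize` function, and the agent and task names are hypothetical, and the figures are placeholders rather than measurements.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskResult:
    """One agent's outcome on one of your own tasks (hypothetical record)."""
    agent: str
    task: str
    cost_usd: float
    hours_to_merge: float
    attempts: int


def summarize(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Group results by agent and report the three measures named above."""
    by_agent: dict[str, list[TaskResult]] = {}
    for r in results:
        by_agent.setdefault(r.agent, []).append(r)
    return {
        agent: {
            "total_cost_usd": sum(r.cost_usd for r in rows),
            "median_hours_to_merge": median(r.hours_to_merge for r in rows),
            "mean_attempts": sum(r.attempts for r in rows) / len(rows),
        }
        for agent, rows in by_agent.items()
    }


# Placeholder data: two agents on two tasks (use about five of your own).
results = [
    TaskResult("agent_a", "fix-login-bug", 0.25, 2.0, 1),
    TaskResult("agent_a", "rename-api-field", 0.75, 4.0, 2),
    TaskResult("agent_b", "fix-login-bug", 2.0, 1.0, 3),
    TaskResult("agent_b", "rename-api-field", 1.5, 3.0, 1),
]
summary = summarize(results)
for agent, stats in summary.items():
    print(agent, stats)
```

Keeping the three measures side by side, rather than collapsing them into one number, makes trade-offs visible: in the placeholder data one agent merges faster while the other is cheaper and needs fewer attempts.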