What coding agent benchmarks miss

SWE-Bench is necessary, insufficient, and a little misleading. Here’s what to add.

Freebuff Research · Benchmarks and evals

May 22, 2026 · 9 min read

TL;DR

SWE-Bench measures patch correctness on curated, self-contained GitHub issues: useful, but not enough.
Real-world agents are judged on: cost, iteration speed, recovery from wrong turns, and willingness to ask.
Freebuff Research uses a 6-axis eval that better predicts user retention.

If you optimize purely for SWE-Bench, you ship a brittle agent that nails the benchmark and frustrates users. Here’s why — and what we measure instead.

The 4 things benchmarks miss

1. Cost per accepted change. A solution that takes 80 tool calls may be "right" but unaffordable.
2. Recovery from wrong turns. Real bugs hide. Did the agent backtrack, or dig itself deeper?
3. Asking when uncertain. A good engineer asks. A SWE-Bench-optimized agent always answers.
4. Cross-file consistency. Benchmarks rarely test 10+ file changes that must stay in sync.
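To make the first item concrete, here is a minimal sketch of one way to compute cost per accepted change. The `Run` record and its fields are hypothetical names for illustration, not Freebuff's actual schema. The key assumption is that rejected runs still count toward spend, since you pay for them either way.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent run on one task (hypothetical record, for illustration)."""
    tool_calls: int   # number of tool invocations the agent made
    cost_usd: float   # total spend for this run
    accepted: bool    # True if a reviewer accepted the change


def cost_per_accepted_change(runs):
    """Total spend across all runs divided by the number of accepted changes."""
    accepted = sum(1 for r in runs if r.accepted)
    if accepted == 0:
        return float("inf")  # money was spent but nothing was accepted
    return sum(r.cost_usd for r in runs) / accepted


runs = [
    Run(tool_calls=80, cost_usd=4.0, accepted=True),
    Run(tool_calls=12, cost_usd=0.5, accepted=False),
    Run(tool_calls=10, cost_usd=0.5, accepted=True),
]
print(cost_per_accepted_change(runs))  # 5.0 spent / 2 accepted = 2.5
```

Dividing by accepted changes rather than by attempts is a design choice: it penalizes an agent that is cheap per run but rarely produces something mergeable.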

The Freebuff Research 6-axis eval

Axis	Freebuff 6-axis eval	SWE-Bench Verified alone
Patch correctness	Weighted 25%	100% of score
Cost per accepted PR	Weighted 20%	Not measured
Recovery from wrong turns	Weighted 15%	Not measured
Clarification questions asked	Weighted 10%	Not measured
Cross-file consistency (>10 files)	Weighted 20%	Not measured
User-reported friction	Weighted 10%	Not measured
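The weights in the table sum to 100%, so they can be combined into a single score as a weighted average. The sketch below assumes each axis has already been normalized to a 0..1 score where higher is better (so cost and friction would need to be inverted first). The article does not say how that normalization is done, and the axis keys are illustrative names.

```python
# Weights taken from the table above; axis keys are illustrative names.
WEIGHTS = {
    "patch_correctness": 0.25,
    "cost_per_accepted_pr": 0.20,
    "recovery_from_wrong_turns": 0.15,
    "clarification_questions": 0.10,
    "cross_file_consistency": 0.20,
    "user_reported_friction": 0.10,
}


def composite_score(axis_scores):
    """Weighted average of per-axis scores, each assumed to lie in [0, 1]."""
    missing = set(WEIGHTS) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)


# An agent that is perfect on patch correctness and scores zero elsewhere.
patch_only = {axis: 0.0 for axis in WEIGHTS}
patch_only["patch_correctness"] = 1.0
print(composite_score(patch_only))  # 0.25
```

Under this weighting, an agent tuned only for patch correctness tops out at a quarter of the available score, which is the point of the table.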

Concrete failure modes we caught

Confident wrong patch: A top SWE-Bench agent failed silently on a typed migration: the tests passed, but the runtime behavior broke. The cross-file consistency check caught it.
Cost runaway: One frontier agent burned $4.20 on a task another agent solved for $0.18. Same correctness; very different ROI.
No-question bias: SWE-Bench-tuned agents never ask. Real users want the agent to stop and ask when requirements are ambiguous.
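The article does not describe how the cross-file consistency check is implemented. As a rough illustration of the idea, the sketch below flags files that still reference an old identifier after a rename, which is one way a migration can pass its tests yet break at runtime. The function name, file contents, and identifiers are all hypothetical.

```python
import re


def stale_references(files, old_name):
    """Map each path to the 1-based line numbers that still mention old_name.

    `files` is a dict of path -> file contents. Matching is whole-word,
    so renaming `user_name` does not flag `full_name`.
    """
    pattern = re.compile(r"\b" + re.escape(old_name) + r"\b")
    hits = {}
    for path, text in files.items():
        lines = [
            number
            for number, line in enumerate(text.splitlines(), start=1)
            if pattern.search(line)
        ]
        if lines:
            hits[path] = lines
    return hits


# The model was migrated to `full_name`, but one view was left behind.
files = {
    "models.py": "class User:\n    full_name: str\n",
    "views.py": "def show(u):\n    return u.user_name\n",
}
print(stale_references(files, "user_name"))  # {'views.py': [2]}
```

A text scan like this is deliberately crude. A real check would more likely lean on a type checker or an import graph, but even this version catches a rename that a narrow test suite misses.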

What this means for buyers

If you’re comparing agents, don’t stop at SWE-Bench. Run them on 5 of your most representative tasks. Measure cost, time-to-merge, and how many tries it takes. The leaderboard winner is often not the winner on your repo.
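A comparison like this needs very little tooling. The sketch below assumes you record one row per agent per task by hand or from your CI logs. The `TaskResult` fields and the sample numbers are hypothetical, not measurements from this article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One agent's outcome on one of your own tasks (hypothetical record)."""
    agent: str
    task: str
    cost_usd: float
    minutes_to_merge: float
    attempts: int  # how many tries before the change was mergeable


def summarize(results):
    """Average cost, time-to-merge, and attempts per agent."""
    by_agent = {}
    for result in results:
        by_agent.setdefault(result.agent, []).append(result)
    return {
        agent: {
            "mean_cost_usd": mean(r.cost_usd for r in rows),
            "mean_minutes_to_merge": mean(r.minutes_to_merge for r in rows),
            "mean_attempts": mean(r.attempts for r in rows),
        }
        for agent, rows in by_agent.items()
    }


# Made-up sample rows; replace with your own measurements.
results = [
    TaskResult("agent_a", "task_1", 0.5, 10.0, 1),
    TaskResult("agent_a", "task_2", 1.5, 30.0, 3),
    TaskResult("agent_b", "task_1", 0.25, 20.0, 2),
]
for agent, stats in summarize(results).items():
    print(agent, stats)
```

With only five tasks the averages are noisy, so read the per-task rows alongside the summary rather than ranking on means alone.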

Try Freebuff on your hardest task

Free agent, free run, real measurement.