What coding agent benchmarks miss

SWE-Bench is necessary, insufficient, and a little misleading. Here’s what to add.

Freebuff Research · Benchmarks and evals · 9 min read

TL;DR

  • SWE-Bench measures whether a patch passes the tests for a narrowly scoped GitHub issue: useful, not enough.
  • Real-world agents are judged on: cost, iteration speed, recovery from wrong turns, and willingness to ask.
  • Freebuff Research uses a 6-axis eval that better predicts user retention.

If you optimize purely for SWE-Bench, you ship a brittle agent that nails the benchmark and frustrates users. Here’s why — and what we measure instead.

The 4 things benchmarks miss

  1. Cost per accepted change. A run that takes 80 tool calls to reach a passing patch counts as "right" on the benchmark but may be unaffordable in practice.
  2. Recovery from wrong turns. Real bugs hide. Did the agent backtrack, or dig itself deeper?
  3. Asking when uncertain. A good engineer asks. A SWE-Bench-optimized agent always answers.
  4. Cross-file consistency. Benchmarks rarely test changes spanning 10+ files that must stay in sync.
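
The first item can be made concrete with a small calculation. This is a minimal sketch, not Freebuff's implementation: the `Run` record and its fields are hypothetical, and it assumes you can log a dollar cost and a merged/not-merged outcome for every attempt.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent attempt at a task (hypothetical record shape)."""
    cost_usd: float  # total spend on model and tool calls for this attempt
    accepted: bool   # True if a reviewer merged the resulting change


def cost_per_accepted_change(runs: list[Run]) -> float:
    """Total spend across all runs divided by the number of accepted changes.

    Rejected runs still count toward spend, so an agent that needs many
    attempts per merge looks as expensive as it really is.
    """
    accepted = sum(1 for r in runs if r.accepted)
    if accepted == 0:
        return float("inf")  # nothing merged, so the cost is unbounded
    return sum(r.cost_usd for r in runs) / accepted


# Illustrative numbers only: three attempts, two of them merged.
runs = [Run(0.50, True), Run(1.25, False), Run(0.25, True)]
print(cost_per_accepted_change(runs))
```

Dividing by accepted changes rather than by attempts is the point of the metric: a failed attempt is not free just because its patch was thrown away.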

The Freebuff Research 6-axis eval

Feature | Freebuff | SWE-Bench Verified alone
Patch correctness | Yes (weighted 25%) | 100% of score
Cost per accepted PR | Yes (weighted 20%) | Not measured
Recovery from wrong turns | Yes (weighted 15%) | Not measured
Clarification questions asked | Yes (weighted 10%) | Not measured
Cross-file consistency (>10 files) | Yes (weighted 20%) | Not measured
User-reported friction | Yes (weighted 10%) | Not measured
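
One plausible way to turn these weights into a single number is a weighted sum. The sketch below is an assumption: the weights are the ones listed in the table, but the linear combination, the axis key names, and the requirement that every axis be normalized to the range 0 to 1 with higher meaning better (so cost and friction would have to be inverted first) are illustrative choices.

```python
# Weights from the table; the key names are hypothetical labels.
WEIGHTS = {
    "patch_correctness": 0.25,
    "cost_per_accepted_pr": 0.20,
    "recovery_from_wrong_turns": 0.15,
    "clarification_questions": 0.10,
    "cross_file_consistency": 0.20,
    "user_reported_friction": 0.10,
}


def composite_score(axis_scores: dict[str, float]) -> float:
    """Weighted sum of per-axis scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    for axis in WEIGHTS:
        if not 0.0 <= axis_scores[axis] <= 1.0:
            raise ValueError(f"{axis} must be normalized to [0, 1]")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)


# An agent that is perfect on patch correctness and scores zero elsewhere.
benchmark_only = {axis: 0.0 for axis in WEIGHTS}
benchmark_only["patch_correctness"] = 1.0
print(composite_score(benchmark_only))
```

Under this reading, an agent that maximizes patch correctness and nothing else tops out at a quarter of the available score.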

Concrete failure modes we caught

  • Confident wrong patch: A top SWE-Bench agent failed silently on a typed migration. The tests passed, but the runtime behavior broke. The cross-file consistency check caught it.
  • Cost runaway: One frontier agent burned $4.20 on a task another agent solved for $0.18. Same correctness; very different ROI.
  • No-question bias: SWE-Bench-tuned agents never ask. Real users want the agent to stop and ask when requirements are ambiguous.
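
To put the cost-runaway example in proportion, here is the arithmetic as a short sketch. The two dollar figures come from the bullet above; the helper name `cost_ratio` is hypothetical.

```python
def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times more one run cost than another at equal correctness."""
    if cheap_usd <= 0:
        raise ValueError("cheap_usd must be positive")
    return expensive_usd / cheap_usd


ratio = cost_ratio(4.20, 0.18)
print(f"{ratio:.1f}x")  # prints 23.3x
```

A gap of roughly 23x on a single task is invisible to a score that only records whether the patch was correct.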

What this means for buyers

If you’re comparing agents, don’t stop at SWE-Bench. Run them on 5 of your most representative tasks. Measure cost, time-to-merge, and how many tries it takes. The leaderboard winner is often not the winner on your repo.
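
A trial like this needs very little tooling. The sketch below assumes you record one row per agent and task by hand or from your CI logs. The `TaskResult` fields, the agent and task names, and the choice of median and mean as summaries are all illustrative, and the numbers are placeholders rather than measurements.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class TaskResult:
    """One agent's outcome on one of your representative tasks."""
    agent: str
    task: str
    cost_usd: float
    minutes_to_merge: float
    attempts: int  # how many tries before the change was mergeable


def summarize(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Group results by agent and report cost, time-to-merge, and tries."""
    by_agent: dict[str, list[TaskResult]] = {}
    for r in results:
        by_agent.setdefault(r.agent, []).append(r)
    return {
        agent: {
            "total_cost_usd": round(sum(r.cost_usd for r in rows), 2),
            "median_minutes_to_merge": median([r.minutes_to_merge for r in rows]),
            "mean_attempts": mean([r.attempts for r in rows]),
        }
        for agent, rows in by_agent.items()
    }


# Placeholder data: two hypothetical agents on two hypothetical tasks.
results = [
    TaskResult("agent_a", "fix-login-bug", 0.20, 10, 1),
    TaskResult("agent_a", "rename-config-key", 0.40, 30, 3),
    TaskResult("agent_b", "fix-login-bug", 1.50, 12, 1),
    TaskResult("agent_b", "rename-config-key", 2.50, 18, 1),
]
for agent, stats in summarize(results).items():
    print(agent, stats)
```

With only five tasks, treat the summary as a tiebreaker between candidates, not as a statistically meaningful ranking.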


About the author

Freebuff Research · Benchmarks and evals

The Freebuff research arm — benchmarks, evals, and field notes from running coding agents in the wild.

Topics: swe bench coding agents, coding agent evaluation, best free coding agent benchmark, real world agent performance, coding agent benchmarks comparison, evaluation methodology coding agents
