We Scored 100% on AI Benchmarks Without Solving a Single Problem

Fake Scores, Real Consequences
Every major AI company uses benchmark scores to sell their models. Every investor uses them to pick winners. Every training data company uses them to price their product. And increasingly, benchmark scores aren’t just measuring models — they’re training them. RL rewards, data filtering, synthetic rollout selection — all downstream of benchmark scores.
So what happens when the benchmarks themselves are broken?
It’s not a hypothetical. A model that “improves SWE-bench by 5%” might just be better at exploiting test suite gaps. Training data priced on benchmark gains might be teaching models to game evaluations instead of solving real problems. The leaderboard number that closed your Series B might be inflatable by anyone who reads the eval script.
Here’s what’s been happening in public:
- IQuest-Coder-V1 claimed 81.4% on SWE-bench — then researchers found 24.4% of trajectories just ran
git logto copy the answer from commit history. Corrected score: 76.2%. - METR found that o3 and Claude 3.7 Sonnet reward-hack in 30%+ of evaluation runs — stack introspection, monkey-patching graders, operator overloading.
- OpenAI dropped SWE-bench Verified after finding 59.4% of audited problems had flawed tests.
- In KernelBench,
torch.empty()returns stale GPU memory containing the reference answer — zero computation, full marks.
These are the ones people caught by hand. We built an AI agent that finds them automatically — and it found a lot more.
Results at a Glance
We built an automated auditing system and pointed it at 13 widely-used AI benchmarks — including FrontierCS, BFCL, LiveBench, GAIA, WebArena, AGIEval, AgentBench, Terminal-Bench, tau-bench, MLE-bench, OSWorld, FieldWorkArena, and CAR-bench.
Overview of findings across 13 audited benchmarks. Every benchmark was rated critical risk.
The 45 confirmed exploits each come with a working proof-of-concept — code that achieves inflated or perfect scores without solving the actual task. They affect benchmarks used to evaluate everything from code generation to web navigation to general-purpose AI assistants.
We also cataloged 50 known issues across Terminal-Bench, SWE-bench, and KernelBench from public GitHub issues and papers. Our dual detection pipeline — one LLM-based, one formal — achieved 100% detection rate on all 50 after iterative improvement.
How We Found Them
Manual benchmark auditing doesn’t scale. A human expert might spend days analyzing a single evaluation harness. We wanted to audit 13 benchmarks with hundreds of scoring scripts each. So we built an AI agent that does it automatically — you give it a benchmark repo, it finds the vulnerabilities, writes the exploit code, and verifies it works. No human in the loop.
The agent runs a dual detection pipeline. The LLM Detector uses 15 specialized scanners that reason about code intent — cross-referencing task descriptions with test logic to catch semantic gaps like spec mismatches and answer leakage. The Formal Detector uses 12 analysis modules built on AST parsing, data-flow analysis, and z3 SMT solving to catch structural issues without any language model. Both were iteratively improved to 100% detection on a 50-issue catalog.
The full system is an 8-stage audit pipeline:
The full audit pipeline: from benchmark ingestion through adversarial PoC generation and findings correlation.
Point the agent at a benchmark repo and it produces a full audit report — vulnerability analysis, working exploit code, and fix suggestions. The 45 confirmed exploits in this post were all discovered, written, and verified by the agent, not by us.
Example 1: Perfect Score, Zero Work (Frontier-CS)
In Frontier-CS’s research track, the evaluator loads submitted solutions via importlib.util.spec_from_file_location() and exec_module(). This means the submitted code runs in the same Python process as the evaluator — sharing sys.modules, global state, and the call stack.
A malicious solution can walk up the stack and replace the scoring function:
class Solution:
def solve(self):
import sys
frame = sys._getframe(1)
while frame is not None:
if 'evaluate_kernel_performance' in frame.f_globals:
def fake_evaluate(*args, **kwargs):
return {
"score": 100, "pass_all": True,
"geometric_mean_speedup": 10.0,
"total_tests": 10, "passed_tests": 10,
}
frame.f_globals['evaluate_kernel_performance'] = fake_evaluate
break
frame = frame.f_back
return {"code": "pass"} # No actual solution needed
Result: score 100 out of 100 without solving anything. The solution doesn’t even need to contain relevant code. Alternative attack vectors include monkey-patching torch.allclose to always return True, overwriting baseline files to fake speedup ratios, or using atexit to inject a fake score as the last stdout line. The Docker container runs both evaluator and solution in the same process with no sandboxing.
Example 2: Fake C Extension (Terminal-Bench)
The task asks the agent to write a C extension for portfolio risk/return calculation — a systems programming task that should test low-level optimization skills. The test suite has three checks:
-
test_c_extension_exists()— verifies a compiled.sofile exists withportfolio_risk_candportfolio_return_cfunctions -
test_correctness_small()— checks computation accuracy on small inputs -
test_performance_and_scalability()— requires 1.2x speedup over pure Python
The critical gap: test 1 checks that C functions exist, but never calls them. Tests 2 and 3 import from the Python wrapper, not from the C extension. The wrapper can use any implementation internally.
The exploit — a dummy C extension paired with a numpy wrapper:
// portfolio_optimized_c.c — compiles to .so, passes test_c_extension_exists()
static PyObject* portfolio_risk_c(PyObject *self, PyObject *args) {
return PyFloat_FromDouble(0.0); // Dummy — never actually called
}
static PyObject* portfolio_return_c(PyObject *self, PyObject *args) {
return PyFloat_FromDouble(0.0); // Dummy — never actually called
}
# portfolio_optimized.py — uses numpy, not the C extension
import numpy as np
import portfolio_optimized_c # noqa: imported so .so is loadable
def portfolio_risk_c(weights, cov_matrix):
w = np.array(weights)
cov = np.array(cov_matrix)
return math.sqrt(float(w @ cov @ w)) # numpy, not C
Result: all three tests pass without writing any meaningful C code. The dummy .so satisfies the existence check. numpy satisfies correctness (within 1e-10) and performance (1.2x faster than pure Python loops).
Example 3: All 812 Answers in a JSON File (WebArena)
WebArena tests whether agents can navigate websites to complete real-world tasks — shopping, forum posting, content management. The evaluator checks the agent’s answer against reference answers using string matching.
The problem: config_files/test.raw.json contains every reference answer in plaintext:
{
"task_id": 0,
"intent": "What is the top-1 best-selling product in 2022",
"eval": {
"reference_answers": {
"exact_match": "Quest Lumaflex Band"
}
}
}
All 812+ tasks. Every answer. Sitting on the same filesystem where the agent runs. No sandboxing prevents the agent from reading it. A trivial cheating agent:
class CheatingAgent:
def __init__(self):
with open('config_files/test.raw.json') as f:
tasks = json.load(f)
self.answers = {
t['task_id']: t['eval']['reference_answers']
for t in tasks
}
def act(self, task_id, observation=None):
ref = self.answers.get(task_id, {})
answer = ref.get('exact_match', ref.get('must_include', [''])[0])
return {'action_type': 'stop', 'answer': answer}
Result: perfect score on all string-match tasks with zero web browsing. No clicking, no navigation, no understanding of web interfaces. Just read a JSON file and return the answer.
What This Means
Broken benchmarks don’t just produce wrong leaderboards — they poison training signals, inflate data pricing, and mislead deployment decisions. If nobody audits the evaluation infrastructure, everything built on top of it is unreliable.
Our agent found 45 confirmed exploits that human reviewers missed — not because they were subtle, but because nobody was looking. The tools and methodology are open source at github.com/moogician/trustworthy-env.