
Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era

2026-06-16T00:00:00+00:00

TL;DR. Agents can spend test-time compute by trying, observing, and revising, so we ask whether their gains come from a better internal strategy or from something close to repeated sampling. We derive a simple reference line: under repeated sampling, Elo is linear in log test-time compute. On a two-week coding marathon from 2022, current agents plateau within 24 hours, while top humans keep improving over the official two weeks. The takeaway is that humans are still much better at long-horizon test-time adaptation, and agent strategies have a lot of room to improve.

Agents Bring Intrinsic Test-Time Strategies

OpenAI’s o1 report showed that more test-time compute can improve model performance. Many papers followed, especially on verifiable tasks like code and math (Snell et al., Large Language Monkeys, Noam Brown’s recent post). The common plot is success rate versus the log number of trials, or log test-time compute. These curves often rise superlinearly before they saturate.

Coverage on MATH with an oracle verifier as the number of samples increases, from Large Language Monkeys.

These studies measure model performance under an external test-time strategy. The strategy is fixed outside the model: sample many candidate solutions, check them with a verifier, and report pass@k or coverage.

However, agents change this setup. During a run, an agent can try a solution, observe the result, reflect on what failed, and revise its next attempt. This raises the question we study: when an agent improves with more test-time compute, is it using a better test-time strategy, or is it mostly reproducing repeated sampling?

Repeated sampling is fixed outside the model, while an agent can use feedback inside the run.

A Simple Model for Test-Time Scaling

We first write down the simplest model behind the usual pass@k curves: repeated sampling. The model treats each attempt as an independent draw from the same continuous score distribution. For one task, let $X$ be the score of one sample and let $\tau$ be the threshold for success. Then one sample succeeds with probability

\[p = \operatorname{Pr}(X \geq \tau).\]

With $k$ independent samples, pass@k is

\[\operatorname{Pr}\left(\max_{1 \leq i \leq k} X_i \geq \tau\right) = 1 - (1 - p)^k.\]

For a dataset with multiple tasks, the usual test-time scaling curve averages this quantity across tasks. This gives a curve of mean pass@k, or coverage, as a function of the number of samples.
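As a quick numerical illustration of the formula above, the following sketch computes pass@k for one task and averages it over tasks. The function names and the per-task success probabilities are made up for illustration.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def mean_coverage(task_probs: list[float], k: int) -> float:
    """Average pass@k over tasks, i.e. the usual coverage curve at k samples."""
    return sum(pass_at_k(p, k) for p in task_probs) / len(task_probs)


# Hypothetical per-task single-sample success probabilities.
probs = [0.5, 0.1, 0.01]
curve = {k: mean_coverage(probs, k) for k in (1, 10, 100)}
```

Easy tasks saturate after a few samples, while hard tasks keep contributing gains at large k, which is why the averaged curve can keep rising over several orders of magnitude.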

However, this evaluation is awkward for agents. An agent usually stops once it solves the task, so it is not natural to keep asking for more independent samples after success. For open-ended tasks such as FrontierCS, we could instead compare runs by the task’s own score, but raw-score gains are hard to interpret. In the circle-packing tasks studied by AlphaEvolve, improving the objective value from 1 to 2 can be trivial, while improving it from 2.35 to 2.36 can take far more effort. The raw score does not by itself tell us how much capability changed. We therefore want a comparison that asks a simpler question: when one candidate spends more test-time compute than another candidate on the same task, how often does it produce the better answer?

Following the same repeated-sampling model, this becomes a pairwise question. If one candidate gets $k_a$ independent attempts and another gets $k_b$ independent attempts, the pairwise win probability is

\[\operatorname{Pr}\left(\max_{1 \leq i \leq k_a} X_i > \max_{1 \leq j \leq k_b} X'_j\right) = \frac{k_a}{k_a + k_b},\]

where $X_i$ and $X'_j$ are independent attempts on the same task.
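The closed form follows from symmetry: all $k_a + k_b$ draws are i.i.d. and continuous, so the overall maximum is equally likely to be any one of them, and it lands in the first group with probability $k_a / (k_a + k_b)$. A small Monte Carlo check, sketched here with a standard normal score distribution (an arbitrary choice; any continuous distribution should give the same answer):

```python
import random


def simulated_win_rate(k_a: int, k_b: int, trials: int = 50_000, seed: int = 0) -> float:
    """Estimate Pr(best of k_a draws > best of k_b draws) for i.i.d. continuous scores."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best_a = max(rng.gauss(0.0, 1.0) for _ in range(k_a))
        best_b = max(rng.gauss(0.0, 1.0) for _ in range(k_b))
        wins += best_a > best_b
    return wins / trials


estimate = simulated_win_rate(k_a=8, k_b=2)  # closed form: 8 / (8 + 2) = 0.8
```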

Now comes the useful part. A Bradley-Terry model converts pairwise win rates into a one-dimensional strength scale. If candidate $a$ has BT log-strength $\theta_a$ and candidate $b$ has BT log-strength $\theta_b$, then

\[\operatorname{Pr}(a \text{ beats } b) = \frac{\exp(\theta_a)}{\exp(\theta_a) + \exp(\theta_b)}.\]

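For repeated sampling, the pairwise win probability above is matched exactly by setting

\[\theta_a = \log k_a + c, \qquad \theta_b = \log k_b + c,\]

for any shared constant $c$, since then $\exp(\theta_a) / (\exp(\theta_a) + \exp(\theta_b)) = k_a / (k_a + k_b)$.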

Thus, repeated sampling is a test-time strategy whose Elo is linear in log test-time compute (Elo is just a rescaling of the BT log-strength $\theta$). This is useful because it gives us a reference line. To judge an agent’s intrinsic test-time strategy, we can plot its Elo curve as test-time compute increases and compare it to this line. If the curve rises faster than, slower than, or in step with the line, the agent’s strategy is better than, worse than, or equivalent to repeated sampling, respectively.

Repeated sampling gives a linear reference line in Elo versus log test-time compute.
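To put numbers on the reference line, the sketch below assumes the standard Elo convention, in which a rating gap of $D$ corresponds to a win probability of $1 / (1 + 10^{-D/400})$, so that Elo equals $(400 / \ln 10)\,\theta$ plus a constant. The post does not state which scaling its plots use, so the constants here are an assumption.

```python
import math

ELO_PER_NAT = 400.0 / math.log(10.0)  # standard Elo scaling (assumed)


def reference_elo(k: float, offset: float = 0.0) -> float:
    """Elo of repeated sampling with k attempts, up to an arbitrary offset."""
    return ELO_PER_NAT * math.log(k) + offset


def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo win probability of a over b."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


gain_per_doubling = reference_elo(2) - reference_elo(1)      # about 120 Elo
p = elo_win_probability(reference_elo(8), reference_elo(2))  # equals 8 / (8 + 2)
```

Under this convention, each doubling of attempts is worth roughly 120 Elo and each tenfold increase is worth 400.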

Agents Struggle, Humans Do More Than Sample

With this reference line in hand, we can ask what happens in real long-horizon tasks. We compare agent trajectories against the repeated-sampling line, and we also ask: how do they compare to top humans working on the same tasks?

We study AtCoder Heuristic Contest 014: RectJoin, a long-horizon coding and optimization contest in 2022. Human contestants write algorithms and can submit many code solutions during the contest, and the leaderboard keeps the best score they achieve. Since the contest happened before modern coding agents were widely available, these human trajectories are not assisted by AI agents. The task is open-ended: there is no known optimal solution, only better and worse scores.

At a high level, RectJoin starts with marked dots on a grid. A solver repeatedly chooses three existing dots and one empty grid point that form a valid axis-aligned or 45-degree rectangle, then marks the new point and draws the rectangle boundary. The objective is to maximize a weighted score over the final marked dots, where dots farther from the center have larger weight.

Visualization of RectJoin from AtCoder Heuristic Contest 014.

Methodology

Human trajectories. For humans, we study two groups from the final standings: the top 10 contestants and the top 50 contestants. At each checkpoint, we compute prefix-best scores. This gives one vector for each human group at each checkpoint, where each entry is a contestant’s best score up to that time.

Agent setting. We recreate the contest setting for agents. Using the FrontierCS evaluation layer, each agent runs in a continuous 24-hour loop. It can keep submitting candidates, observe scores, and revise its next attempt, just like a human contestant. If it stops early, we resume the same session rather than starting a fresh run.

We test two agent systems: Claude Opus 4.6 with Claude Code, and GPT-5.5 with Codex. For each agent system, we run five independent 24-hour trials. At every wall-clock checkpoint we care about, we collect the prefix-best score from each trial. This gives a length-5 vector for each agent system at each checkpoint.

Once we have these vectors, we can estimate pairwise win rates directly. For example, one vector might contain the prefix-best scores of the top 10 human contestants at 24 hours, while another contains the prefix-best scores from the five Claude Code trials at 5 hours. We want to know how often the 24-hour top-human scores beat the 5-hour Claude Code scores. If the two vectors are

\[u = (u_1, \ldots, u_n), \qquad v = (v_1, \ldots, v_m),\]

we define the empirical probability that $u$ beats $v$ by comparing all pairs:

\[\widehat{\operatorname{Pr}}(u \text{ beats } v) = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \mathbf{1}[u_i > v_j].\]

These pairwise win rates are the inputs to the Elo fit.
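The all-pairs estimator above is a few lines of code. In this sketch the function name and the scores are made up; note that with a strict inequality, a tie counts as a win for neither side.

```python
def empirical_win_rate(u: list[float], v: list[float]) -> float:
    """Fraction of pairs (u_i, v_j) with u_i strictly greater than v_j."""
    wins = sum(1 for ui in u for vj in v if ui > vj)
    return wins / (len(u) * len(v))


# Made-up prefix-best scores, for illustration only.
humans_24h = [105.0, 98.0, 110.0]
agent_5h = [100.0, 101.0]
rate = empirical_win_rate(humans_24h, agent_5h)  # 4 of the 6 pairs
```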

Results

We fit Elo ratings from these pairwise win rates using L-BFGS. All agent systems and human groups are placed in one joint Bradley-Terry fit, so their ratings are directly comparable. For agents, we use the 24-hour trajectories from our runs. For humans, we use the full trajectories from the official two weeks.
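For readers who want to reproduce a joint fit of this kind, here is a minimal dependency-free sketch. It is not the code behind the post: instead of L-BFGS it uses the classic minorization-maximization update for Bradley-Terry, which targets the same maximum-likelihood solution, and it assumes the standard 400-point Elo scaling. It also assumes every candidate wins at least some comparisons; otherwise its strength is driven to zero and the logarithm fails.

```python
import math


def fit_bradley_terry(win_rate: list[list[float]], iters: int = 200) -> list[float]:
    """Fit BT log-strengths from a matrix where win_rate[i][j] = Pr(i beats j)."""
    n = len(win_rate)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            others = [j for j in range(n) if j != i]
            wins = sum(win_rate[i][j] for j in others)
            denom = sum((win_rate[i][j] + win_rate[j][i]) / (p[i] + p[j]) for j in others)
            new_p.append(wins / denom)
        # Strengths are only defined up to scale: fix the geometric mean at 1.
        log_mean = sum(math.log(x) for x in new_p) / n
        p = [x / math.exp(log_mean) for x in new_p]
    return [math.log(x) for x in p]


def to_elo(theta: list[float]) -> list[float]:
    """Rescale BT log-strengths to the standard Elo scale (assumed convention)."""
    return [400.0 / math.log(10.0) * t for t in theta]


# Sanity check on exact repeated-sampling win rates for k = 1, 2, 4 attempts.
ks = [1, 2, 4]
rates = [[a / (a + b) for b in ks] for a in ks]
theta = fit_bradley_terry(rates)
elo = to_elo(theta)
```

On these inputs the fitted log-strengths differ by $\log 2$ per doubling, recovering the linear reference line.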

Elo trajectories for top humans and agent systems on RectJoin, the long-horizon coding task introduced above.

The result is sharp. Agents improve quickly in the first few hours, but their Elo curves flatten by the 24-hour mark, even though a single agent trial can use up to 100M tokens. Top humans improve more slowly at first, but they keep climbing for days and eventually pass the agent systems by a large margin. This suggests that current agents can sprint early, but they still lack the long-horizon test-time adaptation that strong human contestants use during an extended contest.

We can also ask a more local question for each participant system. If we only look at one system at a time, the repeated-sampling model gives a reference line for what would happen if that system simply drew more independent samples from its own output distribution. Comparing the observed Elo curve against this line tells us whether the system’s test-time strategy is more or less efficient than repeated sampling from itself.

Per-source Elo curves compared with the repeated-sampling reference line implied by the model.

The gray dashed line is the repeated-sampling reference. Curves above this line gain Elo faster than independent sampling from the same source distribution. Curves below it gain Elo more slowly. The agent systems become sublinear relative to this reference by the end of the 24-hour run. In contrast, the human curves become superlinear over longer horizons. Humans do not just sample; current agents still have a long way to go on long-horizon test-time scaling.
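One simple way to make "above" and "below" concrete is to compare local slopes. The sketch below assumes the standard Elo scaling, under which repeated sampling gains 400 Elo per tenfold increase in compute; the checkpoint values are synthetic, and the post's own reference line may be anchored differently.

```python
import math


def elo_per_decade(compute: list[float], elo: list[float]) -> list[float]:
    """Local slope of Elo against log10(compute) between consecutive checkpoints."""
    return [
        (elo[i + 1] - elo[i]) / (math.log10(compute[i + 1]) - math.log10(compute[i]))
        for i in range(len(elo) - 1)
    ]


REFERENCE_SLOPE = 400.0  # repeated sampling under the standard Elo scaling (assumed)

# Synthetic curve that flattens: hours of compute and made-up Elo values.
hours = [1.0, 4.0, 24.0]
elo = [1000.0, 1300.0, 1350.0]
slopes = elo_per_decade(hours, elo)
verdicts = ["above" if s > REFERENCE_SLOPE else "below" for s in slopes]
```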

Takeaways for Agentic Test-Time Scaling

Agents have their own test-time strategies. We should evaluate whether performance improves with more compute and what strategy produced that improvement.
Use repeated sampling as a reference line. If an agent's Elo grows linearly with log compute, it may be doing little more than repeated sampling. Deviations from that line are the signal.
Keep humans as a long-horizon reference. Top humans still show adaptive improvement over long horizons, which gives us a useful target for agentic test-time scaling.
Study more open-ended long-horizon tasks. We need more tasks, longer run trajectories, and careful failure analysis to understand where agents still fall short.

Citing Us

Our full paper is coming soon. In the meantime, please cite this blog post if you found it helpful. For discussion, contact qmang@berkeley.edu or lky04@cs.washington.edu.

@misc{mang2026humansstillbeatagents,
  title  = {Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era},
  author = {Qiuyang Mang and Kaiyuan Liu and Bo Peng and Shreyas Pimpalgaonkar and Alex Dimakis and Alvin Cheung},
  year   = {2026},
  url    = {https://joyemang33.github.io/blog/2026/humans-dont-just-sample/}
}

We Scored 100% on AI Benchmarks Without Solving a Single Problem

2026-04-02T00:00:00+00:00

Fake Scores, Real Consequences

Every major AI company uses benchmark scores to sell their models. Every investor uses them to pick winners. Every training data company uses them to price their product. And increasingly, benchmark scores aren’t just measuring models — they’re training them. RL rewards, data filtering, synthetic rollout selection — all downstream of benchmark scores.

So what happens when the benchmarks themselves are broken?

It’s not a hypothetical. A model that “improves SWE-bench by 5%” might just be better at exploiting test suite gaps. Training data priced on benchmark gains might be teaching models to game evaluations instead of solving real problems. The leaderboard number that closed your Series B might be inflatable by anyone who reads the eval script.

Here’s what’s been happening in public:

IQuest-Coder-V1 claimed 81.4% on SWE-bench — then researchers found 24.4% of trajectories just ran git log to copy the answer from commit history. Corrected score: 76.2%.
METR found that o3 and Claude 3.7 Sonnet reward-hack in 30%+ of evaluation runs — stack introspection, monkey-patching graders, operator overloading.
OpenAI dropped SWE-bench Verified after finding 59.4% of audited problems had flawed tests.
In KernelBench, torch.empty() returns stale GPU memory containing the reference answer — zero computation, full marks.

These are the ones people caught by hand. We built an AI agent that finds them automatically — and it found a lot more.

Results at a Glance

We built an automated auditing system and pointed it at 13 widely used AI benchmarks: FrontierCS, BFCL, LiveBench, GAIA, WebArena, AGIEval, AgentBench, Terminal-Bench, tau-bench, MLE-bench, OSWorld, FieldWorkArena, and CAR-bench.

Overview of findings across 13 audited benchmarks. Every benchmark was rated critical risk.

The 45 confirmed exploits each come with a working proof-of-concept — code that achieves inflated or perfect scores without solving the actual task. They affect benchmarks used to evaluate everything from code generation to web navigation to general-purpose AI assistants.

We also cataloged 50 known issues across Terminal-Bench, SWE-bench, and KernelBench from public GitHub issues and papers. Our dual detection pipeline (one LLM-based, one formal) achieved a 100% detection rate on all 50 after iterative improvement.

How We Found Them

Manual benchmark auditing doesn’t scale. A human expert might spend days analyzing a single evaluation harness. We wanted to audit 13 benchmarks with hundreds of scoring scripts each. So we built an AI agent that does it automatically — you give it a benchmark repo, it finds the vulnerabilities, writes the exploit code, and verifies it works. No human in the loop.

The agent runs a dual detection pipeline. The LLM Detector uses 15 specialized scanners that reason about code intent — cross-referencing task descriptions with test logic to catch semantic gaps like spec mismatches and answer leakage. The Formal Detector uses 12 analysis modules built on AST parsing, data-flow analysis, and z3 SMT solving to catch structural issues without any language model. Both were iteratively improved to 100% detection on a 50-issue catalog.

The full system is an 8-stage audit pipeline:

The full audit pipeline: from benchmark ingestion through adversarial PoC generation and findings correlation.

Point the agent at a benchmark repo and it produces a full audit report — vulnerability analysis, working exploit code, and fix suggestions. The 45 confirmed exploits in this post were all discovered, written, and verified by the agent, not by us.

Example 1: Perfect Score, Zero Work (Frontier-CS)

In Frontier-CS’s research track, the evaluator loads submitted solutions via importlib.util.spec_from_file_location() and exec_module(). This means the submitted code runs in the same Python process as the evaluator — sharing sys.modules, global state, and the call stack.

A malicious solution can walk up the stack and replace the scoring function:

class Solution:
    def solve(self):
        import sys
        frame = sys._getframe(1)
        while frame is not None:
            if 'evaluate_kernel_performance' in frame.f_globals:
                def fake_evaluate(*args, **kwargs):
                    return {
                        "score": 100, "pass_all": True,
                        "geometric_mean_speedup": 10.0,
                        "total_tests": 10, "passed_tests": 10,
                    }
                frame.f_globals['evaluate_kernel_performance'] = fake_evaluate
                break
            frame = frame.f_back
        return {"code": "pass"}  # No actual solution needed

Result: score 100 out of 100 without solving anything. The solution doesn’t even need to contain relevant code. Alternative attack vectors include monkey-patching torch.allclose to always return True, overwriting baseline files to fake speedup ratios, or using atexit to inject a fake score as the last stdout line. The Docker container runs both evaluator and solution in the same process with no sandboxing.

Root cause: No process isolation between submitted code and evaluation infrastructure. The solution has full read/write access to evaluator source code and baseline implementations inside the container.
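One way to close the stack-walking hole is to load the submission in a separate interpreter and keep all scoring in the parent. The sketch below is a hypothetical illustration, not the benchmark's actual fix: `run_submission_isolated` is a made-up name, and the `Solution().solve()` interface is copied from the exploit above. It does not address filesystem access, so baseline files and evaluator sources would still need separate protection.

```python
import json
import subprocess
import sys

# Child-side runner: loads the submission and prints its output as JSON.
_RUNNER = """
import importlib.util, json, sys
spec = importlib.util.spec_from_file_location("submission", sys.argv[1])
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
print(json.dumps(module.Solution().solve()))
"""


def run_submission_isolated(solution_path: str, timeout_s: float = 60.0) -> dict:
    """Run an untrusted submission in a separate interpreter and return its output.

    The child never sees the evaluator's frames or globals, so stack walking and
    monkey-patching cannot reach the scoring code. Scoring must still happen in
    the parent, and the returned dict must be treated as untrusted data.
    """
    proc = subprocess.run(
        [sys.executable, "-I", "-c", _RUNNER, solution_path],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    # Fails closed: any extra output on stdout makes the JSON parse raise.
    return json.loads(proc.stdout)
```

The child only returns the candidate solution; the parent then scores it with functions the submission never had a reference to.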

Example 2: Fake C Extension (Terminal-Bench)

The task asks the agent to write a C extension for portfolio risk/return calculation — a systems programming task that should test low-level optimization skills. The test suite has three checks:

test_c_extension_exists() — verifies a compiled .so file exists with portfolio_risk_c and portfolio_return_c functions
test_correctness_small() — checks computation accuracy on small inputs
test_performance_and_scalability() — requires 1.2x speedup over pure Python

The critical gap: test 1 checks that C functions exist, but never calls them. Tests 2 and 3 import from the Python wrapper, not from the C extension. The wrapper can use any implementation internally.

The exploit — a dummy C extension paired with a numpy wrapper:

// portfolio_optimized_c.c: compiles to .so, passes test_c_extension_exists()
// (method table and PyInit_ boilerplate omitted)
#include <Python.h>

static PyObject* portfolio_risk_c(PyObject *self, PyObject *args) {
    return PyFloat_FromDouble(0.0);  // Dummy: never actually called
}
static PyObject* portfolio_return_c(PyObject *self, PyObject *args) {
    return PyFloat_FromDouble(0.0);  // Dummy: never actually called
}

# portfolio_optimized.py: uses numpy, not the C extension
import math

import numpy as np
import portfolio_optimized_c  # noqa: imported so .so is loadable

def portfolio_risk_c(weights, cov_matrix):
    w = np.array(weights)
    cov = np.array(cov_matrix)
    return math.sqrt(float(w @ cov @ w))  # numpy, not C

Result: all three tests pass without writing any meaningful C code. The dummy .so satisfies the existence check. numpy satisfies correctness (within 1e-10) and performance (1.2x faster than pure Python loops).

Root cause: The benchmark tests what it's easy to test (does a file exist?) rather than what it claims to test (can the agent write optimized C?). No check verifies the C extension functions produce correct output or are actually called.
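A stricter test would call the extension's functions directly instead of going through the wrapper. The helper below is a hypothetical sketch of that idea: it requires the attribute to be a C-implemented callable and compares its output with a trusted reference. In the demo, `math.sqrt` stands in for a function from a compiled extension. A C function that merely calls back into numpy would still pass, so this narrows the gap rather than closing it.

```python
import inspect
import math


def check_native_function(module, name, cases, reference, tol=1e-10):
    """Call module.<name> directly and compare it with a trusted reference.

    Also requires the attribute to be a built-in (C-implemented) callable, so a
    pure-Python stand-in does not satisfy the check.
    """
    func = getattr(module, name, None)
    if func is None or not inspect.isbuiltin(func):
        return False
    return all(abs(func(*args) - reference(*args)) <= tol for args in cases)


# Stand-in demo: math.sqrt plays the role of a function from a compiled extension.
ok = check_native_function(math, "sqrt", [(4.0,), (2.0,)], lambda x: x ** 0.5)
```

With a check like this, the dummy functions that return 0.0 would fail the correctness comparison on the first non-trivial input.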

Example 3: All 812 Answers in a JSON File (WebArena)

WebArena tests whether agents can navigate websites to complete real-world tasks — shopping, forum posting, content management. The evaluator checks the agent’s answer against reference answers using string matching.

The problem: config_files/test.raw.json contains every reference answer in plaintext:

{
  "task_id": 0,
  "intent": "What is the top-1 best-selling product in 2022",
  "eval": {
    "reference_answers": {
      "exact_match": "Quest Lumaflex Band"
    }
  }
}

All 812+ tasks. Every answer. Sitting on the same filesystem where the agent runs. No sandboxing prevents the agent from reading it. A trivial cheating agent:

import json

class CheatingAgent:
    def __init__(self):
        with open('config_files/test.raw.json') as f:
            tasks = json.load(f)
        self.answers = {
            t['task_id']: t['eval']['reference_answers']
            for t in tasks
        }

    def act(self, task_id, observation=None):
        ref = self.answers.get(task_id, {})
        answer = ref.get('exact_match', ref.get('must_include', [''])[0])
        return {'action_type': 'stop', 'answer': answer}

Result: perfect score on all string-match tasks with zero web browsing. No clicking, no navigation, no understanding of web interfaces. Just read a JSON file and return the answer.

Root cause: Reference answers stored in agent-accessible filesystem with no integrity protection. The evaluator reads from the same JSON files the agent can access.
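A minimal mitigation is to give the agent a copy of the task file with the evaluator-only fields removed, and to keep the full file where only the evaluator can read it. The helper name below is hypothetical, and the task structure is taken from the JSON excerpt above.

```python
import copy


def agent_visible_tasks(tasks: list[dict]) -> list[dict]:
    """Return a copy of the task list with the evaluator-only 'eval' block removed."""
    visible = copy.deepcopy(tasks)
    for task in visible:
        task.pop("eval", None)
    return visible


tasks = [{
    "task_id": 0,
    "intent": "What is the top-1 best-selling product in 2022",
    "eval": {"reference_answers": {"exact_match": "Quest Lumaflex Band"}},
}]
public = agent_visible_tasks(tasks)  # what the agent's filesystem would hold
```

This only helps if the original file is also made unreadable to the agent process, for example by running the evaluator in a separate container.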

What This Means

Broken benchmarks don’t just produce wrong leaderboards — they poison training signals, inflate data pricing, and mislead deployment decisions. If nobody audits the evaluation infrastructure, everything built on top of it is unreliable.

Our agent found 45 confirmed exploits that human reviewers missed — not because they were subtle, but because nobody was looking. The tools and methodology are open source at github.com/moogician/trustworthy-env.

Argus: Automated Discovery of Test Oracles for Database Management Systems Using LLMs

2026-02-23T00:00:00+00:00

TL;DR

Database Management Systems (DBMSs) are notoriously hard to test because you need a test oracle — a way to know if the output is correct. Prior work builds these oracles by hand, creating a never-ending cycle of manual effort.

Argus breaks this cycle by using LLMs to automatically discover test oracles, then formally verifies them with a SQL equivalence prover for soundness, and efficiently instantiates them into thousands of concrete test cases. Evaluated on five heavily-tested DBMSs, Argus found 41 previously unknown bugs (36 logic bugs), outperforming state-of-the-art manual oracle designs.

In practice, spending just ~$10 on LLM calls generates millions of reliable SQL tests — each capable of catching logic bugs, where a query silently returns wrong results instead of throwing an error.

The Problem: Test Oracles Are a Bottleneck

When testing a DBMS, how do you know if the result of a SQL query is correct? This is the test oracle problem. A naive approach would be to compare two DBMSs against each other, but that misses bugs they share. The dominant approach instead builds semantic equivalence oracles: transform a query $Q$ into a semantically equivalent $Q'$, run both, and flag inconsistencies as bugs.

The catch: designing such transformation mechanisms is entirely manual. Researchers have published over 20 top-conference papers, each hand-crafting specialized oracles — TLP, NoREC, EET, DQP — yet bugs keep slipping through. Consider this real TiDB bug that went undetected for years:

CREATE TABLE t1(c INT);
INSERT INTO t1 VALUES (1);

-- Q: Empty table filter → should return {}
SELECT c / 3 FROM t1 WHERE false;       -- {} ✓

-- Oracle: Q EXCEPT Q should always be empty
SELECT c / 3 FROM t1 EXCEPT SELECT c / 3 FROM t1;  -- {0.3333} ✗ (BUG!)

Catching this required the very specific insight that $Q \setminus Q = \emptyset$. A human had to think of it. Can we make a machine do that automatically?
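To make the oracle concrete, here is a small sketch that applies the $Q \setminus Q = \emptyset$ check using Python's built-in sqlite3 module. SQLite is only a stand-in target here; the bug above was in TiDB, and the helper name is made up.

```python
import sqlite3


def except_self_is_empty(conn: sqlite3.Connection, query: str) -> bool:
    """Oracle check: `Q EXCEPT Q` must return no rows for any deterministic Q."""
    rows = conn.execute(f"{query} EXCEPT {query}").fetchall()
    return rows == []


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1(c INT)")
conn.execute("INSERT INTO t1 VALUES (1)")
holds = except_self_is_empty(conn, "SELECT c / 3 FROM t1")
```

A `False` result would be a candidate logic bug, since no correct engine can return rows for a query minus itself.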

Key Insight: Constrained Abstract Queries (CAQ)

The core innovation in Argus is a new representation called a Constrained Abstract Query (CAQ) — a SQL query template with typed placeholders that can be filled with concrete SQL snippets.

A placeholder $\square_i$ can be either:

Expr(TableName : SQLDatatype) — any expression over a table that returns a given type (e.g., a Boolean expression over t1)
Table(SQLTableDef) — any table or subquery with a given schema

An equivalent CAQ pair $(s, q_1, q_2)$ is two CAQs that produce the same results for every possible instantiation of their placeholders. For example, the classic TLP oracle can be expressed as a CAQ pair:

-- Q₁: seed query
SELECT * FROM t1, □₁⊲Table(...);

-- Q₂: TLP three-way partition
SELECT * FROM t1, □₁⊲Table(...) WHERE (□₂⊲Expr(t1:BOOLEAN) IS TRUE)
UNION ALL
SELECT * FROM t1, □₁⊲Table(...) WHERE (□₂⊲Expr(t1:BOOLEAN) IS FALSE)
UNION ALL
SELECT * FROM t1, □₁⊲Table(...) WHERE (□₂⊲Expr(t1:BOOLEAN) IS NULL);

-- Instantiation examples:
-- □₁ ↦ t1 ASOF JOIN t2
-- □₂ ↦ json_valid(t1.c0)

The power of CAQs: one CAQ pair is a reusable oracle that can generate thousands of concrete test cases by filling its placeholders with diverse SQL snippets.
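The sketch below shows the instantiation idea on a toy scale: both templates of the TLP pair are filled with the same bindings, executed, and compared as multisets. The `{tbl}` and `{pred}` format fields are hypothetical stand-ins for the typed placeholders, SQLite stands in for the DBMS under test, and no type checking of snippets is attempted.

```python
import sqlite3
from collections import Counter

# {tbl} and {pred} are hypothetical stand-ins for the placeholders □₁ and □₂.
Q1 = "SELECT * FROM t1, {tbl}"
Q2 = (
    "SELECT * FROM t1, {tbl} WHERE ({pred}) IS TRUE "
    "UNION ALL SELECT * FROM t1, {tbl} WHERE ({pred}) IS FALSE "
    "UNION ALL SELECT * FROM t1, {tbl} WHERE ({pred}) IS NULL"
)


def results_match(conn: sqlite3.Connection, bindings: dict) -> bool:
    """Fill both templates with the same bindings and compare results as multisets."""
    rows1 = conn.execute(Q1.format(**bindings)).fetchall()
    rows2 = conn.execute(Q2.format(**bindings)).fetchall()
    return Counter(rows1) == Counter(rows2)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(c0 INT);  INSERT INTO t1 VALUES (1), (2), (NULL);
    CREATE TABLE t2(c1 INT);  INSERT INTO t2 VALUES (10), (20);
""")
snippets = [
    {"tbl": "t2", "pred": "t1.c0 > 1"},
    {"tbl": "t2", "pred": "t1.c0 IN (1, NULL)"},
]
mismatches = [b for b in snippets if not results_match(conn, b)]  # candidate bug reports
```

Each new binding is a new concrete test case for the same verified pair, which is where the reuse comes from.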

The Argus Pipeline

Argus operates in two stages:

Stage 1 — Test Oracle Discovery (offline, one-time)

① Database Seeding. A grammar-based generator (SQLancer) produces random database schemas and seed CAQs. Virtual columns and tables serve as placeholders, making the output compatible with SQL provers that expect concrete syntax.

② LLM-based Oracle Generation + Formal Verification. For each seed CAQ $q$, Argus iteratively prompts an LLM to generate an equivalent variant $q'$. Two mechanisms ensure quality:

In-context learning — the LLM is shown verified successes (Equal set) and failures (Fail set) from previous rounds.
Diversity-oriented sampling — verified CAQs are clustered by query-plan tree-edit distance (k-means), and samples are drawn from each cluster to push the LLM toward novel execution plans.

Every candidate $q'$ must pass a SQL equivalence prover (SQLSolver) before acceptance. Placeholders are replaced by virtual entities so the prover can reason on concrete queries. Only formally verified pairs become test oracles — zero false positives by design.

Stage 2 — Test Case Instantiation (online, per DBMS)

③ Corpus Synthesis. A hybrid approach (LLM + grammar-based generator) pre-generates a large library of SQL snippets:

LLMs produce complex, feature-rich expressions and table structures, guided by official DBMS documentation.
The grammar-based generator covers corner values and edge cases systematically.
Cross-combination: expressions are recursively composed (e.g., substituting a Boolean expr into an INT function that expects a Boolean column) to create intricate multi-level expressions.
Every snippet is runtime-validated on the target DBMS to filter type mismatches and invalid SQL.

④⑤⑥ Instantiation & Bug Detection. Each verified CAQ pair is instantiated up to $K$ times by randomly sampling compatible snippets from the corpus. Placeholders are replaced consistently in both $q$ and $q'$. Random database instances are created, and the two queries are executed. Any result mismatch is a bug report.

Three general constraints on snippets guarantee that instantiated pairs remain equivalent even when concrete expressions are plugged in:

Determinism — no RANDOM(), CURRENT_TIMESTAMP, etc.
Null-preserving — expression returns NULL when evaluated on all-NULL rows.
Empty-result-preserving — expression returns empty on an empty table.
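Two of these constraints are easy to approximate with cheap checks. The sketch below is a simplified illustration and not Argus's implementation: determinism is checked with a small, incomplete denylist, and null preservation is checked at runtime on a single all-NULL row in SQLite.

```python
import re
import sqlite3

# A small, incomplete denylist of nondeterministic SQL functions and keywords.
_NONDETERMINISTIC = re.compile(
    r"\b(RANDOM|CURRENT_TIMESTAMP|CURRENT_DATE|CURRENT_TIME)\b", re.I
)


def looks_deterministic(expr: str) -> bool:
    """Syntactic filter: reject expressions that mention a denylisted name."""
    return _NONDETERMINISTIC.search(expr) is None


def is_null_preserving(expr: str, column: str = "c0") -> bool:
    """Runtime check: the expression must evaluate to NULL on an all-NULL row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t1({column} INT)")
    conn.execute("INSERT INTO t1 VALUES (NULL)")
    (value,) = conn.execute(f"SELECT ({expr}) FROM t1").fetchone()
    conn.close()
    return value is None


candidates = ["c0 + 1", "COALESCE(c0, 0)", "c0 + RANDOM()"]
accepted = [e for e in candidates if looks_deterministic(e) and is_null_preserving(e)]
```

Here only `c0 + 1` survives: `COALESCE(c0, 0)` turns NULL into 0, and the last candidate is nondeterministic.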

Representative Bugs Found

📌 PostgreSQL: Incorrect json function in RIGHT JOIN

CREATE TABLE t(c INT);
INSERT INTO t VALUES (1);

-- Q1: RIGHT JOIN with FALSE → left side always NULL
SELECT sub.c FROM (
  SELECT json_array_length(json_array(3, 2, t.c)) AS c FROM t
) AS sub RIGHT JOIN t ON FALSE;  -- Expected: {NULL}, Got: {2} ✗

-- Q2: explicitly NULL in subquery
SELECT sub.c FROM (SELECT NULL AS c FROM t) AS sub
RIGHT JOIN t ON FALSE;  -- {NULL} ✓

Root cause: PostgreSQL’s json functions bypass the null-propagation rule for RIGHT JOIN, producing incorrect non-null values. Reported and fixed within 24 hours.

📌 Dolt: EXISTS duplicates rows

CREATE TABLE t(c0 INT, c1 INT, PRIMARY KEY (c0, c1));
INSERT INTO t VALUES (1,1), (2,2), (2,3);

-- With NOT NULL primary key, EXISTS is always TRUE → should return all rows once
SELECT * FROM t WHERE EXISTS (SELECT 1 FROM t AS x WHERE x.c0 = t.c0);
-- Got: {(1,1),(2,2),(2,3),(1,1),(2,2),(2,3)} ✗ — every row duplicated!

SELECT * FROM t;  -- {(1,1),(2,2),(2,3)} ✓

📌 DuckDB: Empty CTE incorrectly short-circuits UNION ALL

CREATE TABLE t1(c0 BOOLEAN);
CREATE TABLE t2(c0 BOOLEAN);
CREATE TABLE t3(c0 BOOLEAN);
INSERT INTO t2 VALUES (true);
INSERT INTO t3 VALUES (true);

-- Q1
SELECT t2.c0 FROM t2, t3 LEFT JOIN t1 ON false;  -- {true} ✓

-- Q2 (equivalent via CTE expansion)
WITH c AS (SELECT * FROM t1 WHERE false)
SELECT t2.c0 FROM t2 CROSS JOIN t3 CROSS JOIN c
UNION ALL
SELECT t2.c0 FROM t2 CROSS JOIN t3 WHERE NOT EXISTS (SELECT 1 FROM c);
-- Got: {0 rows} ✗

Root cause: DuckDB incorrectly assumes an empty materialized CTE always causes the outer query to return no rows — not true with UNION ALL.

🎉 Real-world impact: Dolt officially wrote about Argus on their blog, detailing how our AI-generated SQL tests found 19 bugs in their database engine and how they are integrating the Argus-generated test suite into their regression testing process.

Evaluation Results

Bugs Found (5 DBMSs, 2-month campaign)

DBMS	Reported	Fixed	Confirmed	Logic Bugs
Dolt	19	18	19	18
DuckDB	8	6	7	4
MySQL	8	0	5	8
PostgreSQL	1	1	1	1
TiDB	5	2	5	5
Total	41	27	36	36

36 out of 41 bugs are logic bugs: the silent, most dangerous class, which causes incorrect query results without any error. Compared to recent works with manually crafted oracles, Argus finds more bugs, even though its target DBMSs have already been extensively tested.

Code Coverage

On DuckDB (24-hour run):

Argus achieves +19.9% line coverage and +18.1% branch coverage over SQLancer++
5.5× line and 6.4× function metamorphic coverage over SQLancer — metamorphic coverage measures how much code is exercised differently between the two equivalent queries, directly correlating with logic bug-finding ability

On PostgreSQL:

Outperforms SQLancer++ by +19.0% line and +22.5% branch coverage
Covers 23 query features (vs SQLancer’s 15 in pglast), demonstrating the LLM’s ability to generate feature-rich queries

New Oracles vs. Prior Manual Oracles

In a head-to-head comparison on Dolt v1.0.0 (6-hour window), using the same snippet corpus and CAQ format for fairness:

Argus-5,000 oracles: found 10 unique logic bugs
Baseline (union of TLP + NoREC + EET + DQP — 11 hand-crafted oracles): found 3
Argus-50 oracles: found only 2 (fewer than the baseline), confirming that the quantity of oracles matters

The 3.33× improvement demonstrates that automation unlocks oracle diversity that manual design simply cannot match at scale.

Cost & Efficiency

Argus’s two-stage design is dramatically more efficient than a naive LLM baseline that generates full concrete query pairs directly:

Cost Item	Phase	Cost
CAQ pair generation	Offline · one-time	~$3
Snippet corpus (100,000 snippets)	Offline · per DBMS	~$12
Instantiation & test execution	Online · reusable	FREE

The naive baseline generates test cases orders of magnitude more slowly because it must call the LLM for every single test case.

Soundness: Why the SQL Prover Matters

When we replaced the SQL equivalence prover with LLM-as-a-judge (using GPT-5):

20 consecutive bug reports were all false positives
Among 20 LLM-judged-equivalent CAQ pairs, 1 was actually inequivalent (5% error rate)

In mature DBMSs, finding a single real bug may require thousands of queries. Even a 5% error rate overwhelmingly drowns out true bugs. The prover is not optional — it’s what makes Argus practical.
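A back-of-the-envelope calculation shows why a small fraction of unsound oracles is so damaging. Every rate below is hypothetical and chosen only to illustrate the arithmetic; only the 5% figure comes from the experiment above.

```python
def report_precision(n_tests: int, bug_rate: float, unsound_fraction: float,
                     unsound_mismatch_rate: float) -> float:
    """Expected share of mismatch reports that point at a real DBMS bug."""
    true_reports = n_tests * (1.0 - unsound_fraction) * bug_rate
    false_reports = n_tests * unsound_fraction * unsound_mismatch_rate
    return true_reports / (true_reports + false_reports)


# Hypothetical rates: one real bug per 5,000 tests, 5% unsound oracles,
# and an unsound oracle disagreeing on half of its instantiations.
precision = report_precision(100_000, 1 / 5_000, 0.05, 0.5)
```

Under these assumptions, fewer than 1 in 100 reports is a real bug (19 true reports against 2,500 false ones), so triage effort goes almost entirely to false alarms.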

What Makes Argus Different

Aspect	Prior work	Argus
Oracle design	Manual, expert-crafted	Automated by LLM
Soundness	Assumed correct	Formally verified by SQL prover
Scalability	~10s of hand-written oracles	Thousands of verified CAQ pairs

Discussion

Prover limitations are opportunities. SQL equivalence provers currently support a subset of SQL features (core Calcite syntax: outer joins, nested queries, basic aggregations). Argus’s two-stage design mitigates this by proving equivalence at the abstract CAQ level, then instantiating placeholders with complex, DBMS-specific snippets that go beyond the prover’s reasoning capabilities.

We also found prover bugs. During development, Argus revealed 10 bugs in SQLSolver and QED — incorrect equivalence proofs that would have caused false positives. All were fixed quickly. Improving Argus simultaneously improves the tools it depends on.

Extensible by design. Argus can be steered toward specific SQL features simply by adjusting the LLM prompt (e.g., “ensure the generated snippet includes at least one OUTER JOIN”). No code changes needed.

Future directions. Two natural extensions:

Expand the target domain. Argus’s core idea — using LLMs to discover semantic equivalences and formally verifying them — is not specific to relational DBMSs. The same paradigm could apply to compilers (e.g., finding equivalent IR transformations that expose miscompilation bugs), network systems (e.g., equivalent packet-forwarding rules that reveal routing inconsistencies), or graph/spatial databases (e.g., equivalent graph traversal queries). Any domain with a formal notion of equivalence and a verifier is a candidate.
Oracle prioritization. Given thousands of LLM-generated oracles, which are most likely to find bugs in a specific DBMS? Combining coverage feedback, historical bug patterns, and oracle structural diversity could guide Argus toward higher-yield test oracles.

Citation

@misc{mang2025argus,
  title         = {Automated Discovery of Test Oracles for Database Management Systems Using LLMs},
  author        = {Qiuyang Mang and Runyuan He and Suyang Zhong and Xiaoxuan Liu and Huanchen Zhang and Alvin Cheung},
  year          = {2025},
  eprint        = {2510.06663},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DB},
  url           = {https://arxiv.org/abs/2510.06663}
}

This work was accepted at SIGMOD 2026.