<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://joyemang33.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://joyemang33.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-06-21T00:59:20+00:00</updated><id>https://joyemang33.github.io/feed.xml</id><title type="html">blank</title><subtitle>An undergraduate student in Testing, Database, Graph. </subtitle><entry><title type="html">Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era</title><link href="https://joyemang33.github.io/blog/2026/humans-dont-just-sample/" rel="alternate" type="text/html" title="Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era"/><published>2026-06-16T00:00:00+00:00</published><updated>2026-06-16T00:00:00+00:00</updated><id>https://joyemang33.github.io/blog/2026/humans-dont-just-sample</id><content type="html" xml:base="https://joyemang33.github.io/blog/2026/humans-dont-just-sample/"><![CDATA[<p style="font-size: 1rem; line-height: inherit; margin: 0 0 1.5rem;"><strong style="font-weight: 700;">TL;DR.</strong> Agents can spend test-time compute by trying, observing, and revising, so we ask whether their gains come from a better internal strategy or from something close to repeated sampling. We derive a simple Elo reference line: under repeated sampling, Elo is linear in log test-time compute. In a 2022 two-week coding marathon, current agents plateau within 24 hours, while top humans keep improving over the official two weeks. 
The takeaway is that humans still do much better long-horizon test-time adaptation, and agent strategies have a lot of room to improve.</p> <h2 id="agents-bring-intrinsic-test-time-strategies">Agents Bring Intrinsic Test-Time Strategies</h2> <p>OpenAI’s <a href="https://openai.com/index/learning-to-reason-with-llms/">o1 report</a> showed that more test-time compute can improve model performance. Many papers followed, especially on verifiable tasks like code and math (<a href="https://arxiv.org/abs/2408.03314">Snell et al.</a>, <a href="https://arxiv.org/abs/2407.21787">Large Language Monkeys</a>, <a href="https://x.com/polynoamial/status/2064210146558136827">Noam Brown’s recent post</a>). The common plot is success rate versus the log number of trials, or log test-time compute. These curves often rise superlinearly before they saturate.</p> <p><img src="/assets/img/large-language-monkeys-math-oracle-verifier.png" alt="MATH oracle verifier coverage as the number of samples increases" style="max-width: 60%; display: block; margin: 1.5rem auto 0.75rem;"/></p> <p style="text-align: center; font-size: 0.9rem; color: #777; margin-top: 0;"> Coverage on MATH with an oracle verifier as the number of samples increases, from <a href="https://arxiv.org/abs/2407.21787">Large Language Monkeys</a>. </p> <p>These studies measure model performance under an external test-time strategy. The strategy is fixed outside the model: sample many candidate solutions, check them with a verifier, and report pass@k or coverage.</p> <p>However, agents change this setup. During a run, an agent can try a solution, observe the result, reflect on what failed, and revise its next attempt. 
This raises the question we study: <strong style="font-weight: 700;">when an agent improves with more test-time compute, is it using a better test-time strategy, or is it mostly reproducing repeated sampling?</strong></p> <p><img src="/assets/img/agent-vs-repeated-sampling-loop.png" alt="External repeated sampling versus an agent internal test-time loop" style="max-width: 70%; display: block; margin: 1.5rem auto 0.75rem;"/></p> <p style="text-align: center; font-size: 0.9rem; color: #777; margin-top: 0;"> Repeated sampling is fixed outside the model, while an agent can use feedback inside the run. </p> <h2 id="a-simple-model-for-test-time-scaling">A Simple Model for Test-Time Scaling</h2> <p>We first write down the simplest model behind the usual pass@k curves: repeated sampling. The model treats each attempt as an independent draw from the same continuous score distribution. For one task, let \(X\) be the score of one sample and let \(\tau\) be the threshold for success. Then one sample succeeds with probability</p> \[p = \operatorname{Pr}(X \geq \tau).\] <p>With \(k\) independent samples, pass@k is</p> \[\operatorname{Pr}\left(\max_{1 \leq i \leq k} X_i \geq \tau\right) = 1 - (1 - p)^k.\] <p>For a dataset with multiple tasks, the usual test-time scaling curve averages this quantity across tasks. This gives a curve of mean pass@k, or coverage, as a function of the number of samples.</p> <p>However, this evaluation is awkward for agents. An agent usually stops once it solves the task, so it is not natural to keep asking for more independent samples after success. For open-ended tasks such as <a href="https://frontier-cs.org/">FrontierCS</a>, we could instead compare runs by the task’s own score. But raw-score gains are hard to interpret. 
In circle-packing tasks studied by <a href="https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">AlphaEvolve</a>, improving the objective value from 1 to 2 can be trivial, while improving it from 2.35 to 2.36 can require a much harder improvement. The score number does not by itself tell us how much capability changed. We therefore want a comparison that asks a simpler question: <strong style="font-weight: 700;">when one candidate spends more test-time compute than another candidate on the same task, how often does it produce the better answer?</strong></p> <p>Following the same repeated-sampling model, this becomes a pairwise question. If one candidate gets \(k_a\) independent attempts and another gets \(k_b\) independent attempts, the pairwise win probability is</p> \[\operatorname{Pr}\left(\max_{1 \leq i \leq k_a} X_i &gt; \max_{1 \leq j \leq k_b} X'_j\right) = \frac{k_a}{k_a + k_b},\] <p>where \(X_i\) and \(X'_j\) are independent attempts on the same task.</p> <p>Now comes the useful part. A Bradley-Terry model converts pairwise win rates into a one-dimensional strength scale. If candidate \(a\) has BT log-strength \(\theta_a\) and candidate \(b\) has BT log-strength \(\theta_b\), then</p> \[\operatorname{Pr}(a \text{ beats } b) = \frac{\exp(\theta_a)}{\exp(\theta_a) + \exp(\theta_b)}.\] <p>We can see that, for repeated sampling, the pairwise win probability above is exactly matched by</p> \[\theta_a = \log k_a + c,\] <p>for any constant \(c\).</p> <p><strong style="font-weight: 700;">Thus, repeated sampling is a test-time strategy whose Elo is linear in log test-time compute.</strong> This is super helpful because it gives us a reference line. To judge an agent’s intrinsic test-time strategy, we can plot its Elo curve as test-time compute increases and compare it to this line. 
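</p> <p>To see where the \(k_a / (k_a + k_b)\) win rate comes from, and what the reference line looks like in rating points, here is a short worked sketch. It assumes continuous scores, so ties have probability zero. It also assumes the conventional Elo scale, on which a 400-point gap means 10-to-1 odds; this post does not state its scaling, and a different scale would only change the slope constant. All \(k_a + k_b\) attempts are i.i.d., so each one is equally likely to be the overall best:</p> \[\operatorname{Pr}(a \text{ beats } b) = \sum_{i=1}^{k_a} \operatorname{Pr}\left(X_i \text{ is the best of all } k_a + k_b \text{ attempts}\right) = \frac{k_a}{k_a + k_b}.\] <p>Converting the BT log-strength \(\theta = \log k + c\) to a rating \(R\) on that assumed scale gives</p> \[R(k) = \frac{400}{\ln 10}\,\theta = 400 \log_{10} k + c', \qquad R(2k) - R(k) = 400 \log_{10} 2 \approx 120.4.\] <p>Under these assumptions, each doubling of attempts is worth about 120 Elo points, which matches a win rate of \(2k / (2k + k) = 2/3\) against the undoubled candidate.</p> <p>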
If the curve lies above, below, or close to this line, the agent’s strategy is better than, worse than, or roughly equivalent to repeated sampling.</p> <p><img src="/assets/img/repeated-sampling-elo-reference.png" alt="Repeated sampling forms a linear Elo reference line as log test-time compute increases" style="max-width: 60%; display: block; margin: 1.5rem auto 0.75rem;"/></p> <p style="text-align: center; font-size: 0.9rem; color: #777; margin-top: 0;"> Repeated sampling gives a linear reference line in Elo versus log test-time compute. </p> <h2 id="agents-struggle-humans-do-more-than-sample">Agents Struggle, Humans Do More Than Sample</h2> <p>With this reference line in hand, we can ask what happens in real long-horizon tasks. We compare agent trajectories against the repeated-sampling line, and we also ask: <strong style="font-weight: 700;">how do they compare to top humans working on the same tasks?</strong></p> <p>We study <a href="https://atcoder.jp/contests/ahc014/tasks/ahc014_a">AtCoder Heuristic Contest 014: RectJoin</a>, <strong style="font-weight: 700;">a long-horizon coding and optimization contest in 2022</strong>. Human contestants write algorithms and can submit many code solutions during the contest, and the leaderboard keeps the best score they achieve. Since the contest happened before modern coding agents were widely available, these human trajectories are not assisted by AI agents. The task is open-ended: there is no known optimal solution, only better and worse scores.</p> <p>At a high level, RectJoin starts with marked dots on a grid. A solver repeatedly chooses three existing dots and one empty grid point that form a valid axis-aligned or 45-degree rectangle, then marks the new point and draws the rectangle boundary. 
The objective is to maximize a weighted score over the final marked dots, where dots farther from the center have larger weight.</p> <p><img src="/assets/img/ahc014-rectjoin-example.gif" alt="AtCoder Heuristic Contest 014 RectJoin visualization" style="max-width: 62%; display: block; margin: 1.5rem auto 0.75rem;"/></p> <p style="text-align: center; font-size: 0.9rem; color: #777; margin-top: 0;"> Visualization of RectJoin from <a href="https://atcoder.jp/contests/ahc014/tasks/ahc014_a">AtCoder Heuristic Contest 014</a>. </p> <h3 id="methodology">Methodology</h3> <p><em>Human trajectories.</em> For humans, we study two groups from the final standings: <strong style="font-weight: 700;">the top 10 contestants</strong> and <strong style="font-weight: 700;">the top 50 contestants</strong>. At each checkpoint, we compute prefix-best scores. This gives one vector for each human group at each checkpoint, where each entry is a contestant’s best score up to that time.</p> <p><em>Agent setting.</em> We recreate the contest setting for agents. Using the <a href="https://frontier-cs.org/">FrontierCS</a> evaluation layer, each agent runs in a continuous 24-hour loop. It can keep submitting candidates, observe scores, and revise its next attempt, just like a human contestant. If it stops early, we resume the same session rather than starting a fresh run.</p> <p>We test two agent systems: <strong style="font-weight: 700;">Claude Opus 4.6 with Claude Code</strong>, and <strong style="font-weight: 700;">GPT-5.5 with Codex</strong>. For each agent system, we run five independent 24-hour trials. At every wall-clock checkpoint we care about, we collect the prefix-best score from each trial. This gives a length-5 vector for each agent system at each checkpoint.</p> <p>Once we have these vectors, we can estimate pairwise win rates directly. 
For example, one vector might contain the prefix-best scores of the top 10 human contestants at 24 hours, while another contains the prefix-best scores from the five Claude Code trials at 5 hours. We want to know how often the 24-hour top-human scores beat the 5-hour Claude Code scores. If the two vectors are</p> \[u = (u_1, \ldots, u_n), \qquad v = (v_1, \ldots, v_m),\] <p>we define the empirical probability that \(u\) beats \(v\) by comparing all pairs:</p> \[\widehat{\operatorname{Pr}}(u \text{ beats } v) = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \mathbf{1}[u_i &gt; v_j].\] <p>These pairwise win rates are the inputs to the Elo fit.</p> <h3 id="results">Results</h3> <p>We fit Elo ratings from these pairwise win rates using <a href="https://doi.org/10.1007/BF01589116">L-BFGS</a>. All agent systems and human groups are placed in one joint Bradley-Terry fit, so their ratings are directly comparable. For agents, we use the 24-hour trajectories from our runs. For humans, we use the full trajectories from the official two weeks.</p> <p><img src="/assets/img/human-vs-agent-elo.png" alt="Elo trajectories comparing human contestants and agent systems" style="max-width: 100%; display: block; margin: 1.5rem auto 0.75rem;"/></p> <p style="text-align: center; font-size: 0.9rem; color: #777; margin-top: 0;"> Elo trajectories for top humans and agent systems on <a href="https://atcoder.jp/contests/ahc014/tasks/ahc014_a">RectJoin</a>, the long-horizon coding task introduced above. </p> <p>The result is sharp. Agents improve quickly in the first few hours, but their Elo curves flatten by the 24-hour mark, even though a single agent trial can use up to 100M tokens. <strong style="font-weight: 700;">Top humans improve more slowly at first, but they keep climbing for days and eventually pass the agent systems by a large margin. 
This suggests that current agents can sprint early, but they still lack the long-horizon test-time adaptation that strong human contestants use during an extended contest.</strong></p> <p>We can also ask a more local question for each participant system. If we only look at one system at a time, the repeated-sampling model gives a reference line for what would happen if that system simply drew more independent samples from its own output distribution. Comparing the observed Elo curve against this line tells us whether the system’s test-time strategy is more or less efficient than repeated sampling from itself.</p> <p><img src="/assets/img/per-source-elo-vs-repeated-sampling.png" alt="Per-source Elo curves compared with the repeated-sampling reference line" style="max-width: 100%; display: block; margin: 1.5rem auto 0.75rem;"/></p> <p style="text-align: center; font-size: 0.9rem; color: #777; margin-top: 0;"> Per-source Elo curves compared with the repeated-sampling reference line implied by the model. </p> <p>The gray dashed line is the repeated-sampling reference. Curves above this line gain Elo faster than independent sampling from the same source distribution. Curves below it gain Elo more slowly. The agent systems become <strong style="font-weight: 700;">sublinear</strong> relative to this reference by the end of the 24-hour run. In contrast, the human curves become <strong style="font-weight: 700;">superlinear</strong> over longer horizons. 
<strong style="font-weight: 700;">Humans do not just sample; current agents still have a long way to go on long-horizon test-time scaling.</strong></p> <h2 id="takeaways-for-agentic-test-time-scaling">Takeaways for Agentic Test-Time Scaling</h2> <ul style="padding-left: 1.1rem; margin-left: 0;"> <li><strong style="font-weight: 700;">Agents have their own test-time strategies.</strong> We should evaluate whether performance improves with more compute and what strategy produced that improvement.</li> <li style="margin-top: 0.65rem;"><strong style="font-weight: 700;">Use repeated sampling as a reference line.</strong> If an agent's Elo grows linearly with log compute, it may be doing little more than repeated sampling. Deviations from that line are the signal.</li> <li style="margin-top: 0.65rem;"><strong style="font-weight: 700;">Keep humans as a long-horizon reference.</strong> Top humans still show adaptive improvement over long horizons, which gives us a useful target for agentic test-time scaling.</li> <li style="margin-top: 0.65rem;"><strong style="font-weight: 700;">Study more open-ended long-horizon tasks.</strong> We need more tasks, longer run trajectories, and careful failure analysis to understand where agents still fall short.</li> </ul> <h2 id="citing-us">Citing Us</h2> <p>Our full paper is coming soon. In the meantime, please cite this blog post if you found it helpful. For discussion, contact <a href="mailto:qmang@berkeley.edu">qmang@berkeley.edu</a> or <a href="mailto:lky04@cs.washington.edu">lky04@cs.washington.edu</a>.</p> <div class="citation-code-block"> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">mang2026humansstillbeatagents</span><span class="p">,</span>
  <span class="na">title</span>  <span class="p">=</span> <span class="s">{Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era}</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">{Qiuyang Mang and Kaiyuan Liu and Bo Peng and Shreyas Pimpalgaonkar and Alex Dimakis and Alvin Cheung}</span><span class="p">,</span>
  <span class="na">year</span>   <span class="p">=</span> <span class="s">{2026}</span><span class="p">,</span>
  <span class="na">url</span>    <span class="p">=</span> <span class="s">{https://joyemang33.github.io/blog/2026/humans-dont-just-sample/}</span>
<span class="p">}</span>
</code></pre></div> </div> </div>]]></content><author><name>Qiuyang Mang&lt;sup&gt;1&lt;/sup&gt;, Kaiyuan Liu&lt;sup&gt;2&lt;/sup&gt;, Bo Peng&lt;sup&gt;3&lt;/sup&gt;, Shreyas Pimpalgaonkar&lt;sup&gt;4&lt;/sup&gt;, Alex Dimakis&lt;sup&gt;1,4&lt;/sup&gt;, Alvin Cheung&lt;sup&gt;1&lt;/sup&gt;</name></author><category term="research"/><category term="test-time-scaling"/><category term="LLM-agents"/><category term="scaling-laws"/><summary type="html"><![CDATA[Agents can spend test-time compute by trying, observing, and revising. We derive an Elo reference for repeated sampling, then show that in a 2022 two-week coding marathon, current agents plateau within 24 hours while top humans keep improving.]]></summary></entry><entry><title type="html">We Scored 100% on AI Benchmarks Without Solving a Single Problem</title><link href="https://joyemang33.github.io/blog/2026/trustworthy-benchmarks/" rel="alternate" type="text/html" title="We Scored 100% on AI Benchmarks Without Solving a Single Problem"/><published>2026-04-02T00:00:00+00:00</published><updated>2026-04-02T00:00:00+00:00</updated><id>https://joyemang33.github.io/blog/2026/trustworthy-benchmarks</id><content type="html" xml:base="https://joyemang33.github.io/blog/2026/trustworthy-benchmarks/"><![CDATA[<hr/> <p><img src="/assets/img/trustworthy-benchmarks/teaser-v2.png" alt="Hand-drawn diagram showing benchmark leakage paths that produce fake perfect scores" style="max-width: 90%; display: block; margin: 1rem auto;"/></p> <h3 id="fake-scores-real-consequences">Fake Scores, Real Consequences</h3> <p>Every major AI company uses benchmark scores to sell their models. Every investor uses them to pick winners. Every training data company uses them to price their product. And increasingly, benchmark scores aren’t just measuring models — they’re <em>training</em> them. 
RL rewards, data filtering, synthetic rollout selection — all downstream of benchmark scores.</p> <p><strong>So what happens when the benchmarks themselves are broken?</strong></p> <p>It’s not a hypothetical. A model that “improves SWE-bench by 5%” might just be better at exploiting test suite gaps. Training data priced on benchmark gains might be teaching models to game evaluations instead of solving real problems. The leaderboard number that closed your Series B might be inflatable by anyone who reads the eval script.</p> <p>Here’s what’s been happening in public:</p> <ul> <li><a href="https://github.com/IQuestLab/IQuest-Coder-V1/issues/14">IQuest-Coder-V1</a> claimed 81.4% on SWE-bench — then researchers found 24.4% of trajectories just ran <code class="language-plaintext highlighter-rouge">git log</code> to copy the answer from commit history. Corrected score: 76.2%.</li> <li><a href="https://metr.org/blog/2025-06-05-recent-reward-hacking/">METR found</a> that o3 and Claude 3.7 Sonnet reward-hack in <strong>30%+ of evaluation runs</strong> — stack introspection, monkey-patching graders, operator overloading.</li> <li><a href="https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/">OpenAI dropped SWE-bench Verified</a> after finding 59.4% of audited problems had flawed tests.</li> <li>In <a href="https://github.com/ScalingIntelligence/KernelBench/issues/82">KernelBench</a>, <code class="language-plaintext highlighter-rouge">torch.empty()</code> returns stale GPU memory containing the reference answer — <a href="https://deep-reinforce.com/defense_kernel_hack.html">zero computation, full marks</a>.</li> </ul> <p>These are the ones people caught by hand. 
We built an AI agent that finds them automatically — and it found a lot more.</p> <h3 id="results-at-a-glance">Results at a Glance</h3> <p>We built an automated auditing system and pointed it at 13 widely-used AI benchmarks — including FrontierCS, BFCL, LiveBench, GAIA, WebArena, AGIEval, AgentBench, Terminal-Bench, tau-bench, MLE-bench, OSWorld, FieldWorkArena, and CAR-bench.</p> <div style="text-align: center;"> <img src="/assets/img/trustworthy-benchmarks/results.svg" style="max-width: 85%; display: block; margin: 1rem auto;" alt="Audit Results Overview"/> <p style="margin-top: 0.8rem; font-size: 0.9em; color: #888;">Overview of findings across 13 audited benchmarks. Every benchmark was rated critical risk.</p> </div> <p>The 45 confirmed exploits each come with a working proof-of-concept — code that achieves inflated or perfect scores without solving the actual task. They affect benchmarks used to evaluate everything from code generation to web navigation to general-purpose AI assistants.</p> <p>We also cataloged <strong>50 known issues</strong> across Terminal-Bench, SWE-bench, and KernelBench from public GitHub issues and papers. Our dual detection pipeline — one LLM-based, one formal — achieved <strong>100% detection rate</strong> on all 50 after iterative improvement.</p> <h3 id="how-we-found-them">How We Found Them</h3> <p>Manual benchmark auditing doesn’t scale. A human expert might spend days analyzing a single evaluation harness. We wanted to audit 13 benchmarks with hundreds of scoring scripts each. So we built an <strong>AI agent that does it automatically</strong> — you give it a benchmark repo, it finds the vulnerabilities, writes the exploit code, and verifies it works. No human in the loop.</p> <p>The agent runs a dual detection pipeline. The <strong>LLM Detector</strong> uses 15 specialized scanners that reason about code intent — cross-referencing task descriptions with test logic to catch semantic gaps like spec mismatches and answer leakage. 
The <strong>Formal Detector</strong> uses 12 analysis modules built on AST parsing, data-flow analysis, and z3 SMT solving to catch structural issues without any language model. Both were iteratively improved to <strong>100% detection</strong> on a 50-issue catalog.</p> <p>The full system is an 8-stage audit pipeline:</p> <div style="text-align: center;"> <img src="/assets/img/trustworthy-benchmarks/pipeline.svg" style="max-width: 90%; display: block; margin: 1rem auto;" alt="8-Stage Audit Pipeline"/> <p style="margin-top: 0.8rem; font-size: 0.9em; color: #888;">The full audit pipeline: from benchmark ingestion through adversarial PoC generation and findings correlation.</p> </div> <p>Point the agent at a benchmark repo and it produces a full audit report — vulnerability analysis, working exploit code, and fix suggestions. The 45 confirmed exploits in this post were all <strong>discovered, written, and verified by the agent</strong>, not by us.</p> <h3 id="example-1-perfect-score-zero-work-frontier-cs">Example 1: Perfect Score, Zero Work (Frontier-CS)</h3> <p>In Frontier-CS’s research track, the evaluator loads submitted solutions via <code class="language-plaintext highlighter-rouge">importlib.util.spec_from_file_location()</code> and <code class="language-plaintext highlighter-rouge">exec_module()</code>. This means the submitted code runs <strong>in the same Python process as the evaluator</strong> — sharing <code class="language-plaintext highlighter-rouge">sys.modules</code>, global state, and the call stack.</p> <p>A malicious solution can walk up the stack and replace the scoring function:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Solution</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">solve</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="kn">import</span> <span class="n">sys</span>
        <span class="n">frame</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="nf">_getframe</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">while</span> <span class="n">frame</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">if</span> <span class="sh">'</span><span class="s">evaluate_kernel_performance</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">frame</span><span class="p">.</span><span class="n">f_globals</span><span class="p">:</span>
                <span class="k">def</span> <span class="nf">fake_evaluate</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
                    <span class="k">return</span> <span class="p">{</span>
                        <span class="sh">"</span><span class="s">score</span><span class="sh">"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span> <span class="sh">"</span><span class="s">pass_all</span><span class="sh">"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
                        <span class="sh">"</span><span class="s">geometric_mean_speedup</span><span class="sh">"</span><span class="p">:</span> <span class="mf">10.0</span><span class="p">,</span>
                        <span class="sh">"</span><span class="s">total_tests</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span> <span class="sh">"</span><span class="s">passed_tests</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
                    <span class="p">}</span>
                <span class="n">frame</span><span class="p">.</span><span class="n">f_globals</span><span class="p">[</span><span class="sh">'</span><span class="s">evaluate_kernel_performance</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">fake_evaluate</span>
                <span class="k">break</span>
            <span class="n">frame</span> <span class="o">=</span> <span class="n">frame</span><span class="p">.</span><span class="n">f_back</span>
        <span class="k">return</span> <span class="p">{</span><span class="sh">"</span><span class="s">code</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">pass</span><span class="sh">"</span><span class="p">}</span>  <span class="c1"># No actual solution needed
</span></code></pre></div></div> <p><strong>Result: score 100 out of 100 without solving anything.</strong> The solution doesn’t even need to contain relevant code. Alternative attack vectors include monkey-patching <code class="language-plaintext highlighter-rouge">torch.allclose</code> to always return <code class="language-plaintext highlighter-rouge">True</code>, overwriting baseline files to fake speedup ratios, or using <code class="language-plaintext highlighter-rouge">atexit</code> to inject a fake score as the last stdout line. The Docker container runs both evaluator and solution in the same process with no sandboxing.</p> <div style="background: #f5f5f5; border-left: 3px solid #999; padding: 0.8rem 1rem; border-radius: 4px; margin: 1.2rem 0; font-size: 0.88rem; color: #444;"> <strong>Root cause:</strong> No process isolation between submitted code and evaluation infrastructure. The solution has full read/write access to evaluator source code and baseline implementations inside the container. </div> <h3 id="example-2-fake-c-extension-terminal-bench">Example 2: Fake C Extension (Terminal-Bench)</h3> <p>The task asks the agent to write a <strong>C extension</strong> for portfolio risk/return calculation — a systems programming task that should test low-level optimization skills. 
The test suite has three checks:</p> <ol> <li><code class="language-plaintext highlighter-rouge">test_c_extension_exists()</code> — verifies a compiled <code class="language-plaintext highlighter-rouge">.so</code> file exists with <code class="language-plaintext highlighter-rouge">portfolio_risk_c</code> and <code class="language-plaintext highlighter-rouge">portfolio_return_c</code> functions</li> <li><code class="language-plaintext highlighter-rouge">test_correctness_small()</code> — checks computation accuracy on small inputs</li> <li><code class="language-plaintext highlighter-rouge">test_performance_and_scalability()</code> — requires 1.2x speedup over pure Python</li> </ol> <p>The critical gap: <strong>test 1 checks that C functions exist, but never calls them.</strong> Tests 2 and 3 import from the Python wrapper, not from the C extension. The wrapper can use any implementation internally.</p> <p>The exploit — a dummy C extension paired with a numpy wrapper:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// portfolio_optimized_c.c — compiles to .so, passes test_c_extension_exists()</span>
<span class="cp">#include</span> <span class="cpf">&lt;Python.h&gt;</span>

<span class="k">static</span> <span class="n">PyObject</span><span class="o">*</span> <span class="nf">portfolio_risk_c</span><span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">PyFloat_FromDouble</span><span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>  <span class="c1">// Dummy — never actually called</span>
<span class="p">}</span>
<span class="k">static</span> <span class="n">PyObject</span><span class="o">*</span> <span class="nf">portfolio_return_c</span><span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">PyFloat_FromDouble</span><span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>  <span class="c1">// Dummy — never actually called</span>
<span class="p">}</span>
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># portfolio_optimized.py — uses numpy, not the C extension
</span><span class="kn">import</span> <span class="n">math</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="n">portfolio_optimized_c</span>  <span class="c1"># noqa: imported so .so is loadable
</span>
<span class="k">def</span> <span class="nf">portfolio_risk_c</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">cov_matrix</span><span class="p">):</span>
    <span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">weights</span><span class="p">)</span>
    <span class="n">cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">cov_matrix</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">math</span><span class="p">.</span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">float</span><span class="p">(</span><span class="n">w</span> <span class="o">@</span> <span class="n">cov</span> <span class="o">@</span> <span class="n">w</span><span class="p">))</span>  <span class="c1"># numpy, not C
</span></code></pre></div></div> <p><strong>Result: all three tests pass without writing any meaningful C code.</strong> The dummy <code class="language-plaintext highlighter-rouge">.so</code> satisfies the existence check. numpy satisfies correctness (within 1e-10) and performance (1.2x faster than pure Python loops).</p> <div style="background: #f5f5f5; border-left: 3px solid #999; padding: 0.8rem 1rem; border-radius: 4px; margin: 1.2rem 0; font-size: 0.88rem; color: #444;"> <strong>Root cause:</strong> The benchmark tests what it's easy to test (does a file exist?) rather than what it claims to test (can the agent write optimized C?). No check verifies the C extension functions produce correct output or are actually called. </div> <h3 id="example-3-all-812-answers-in-a-json-file-webarena">Example 3: All 812 Answers in a JSON File (WebArena)</h3> <p><a href="https://webarena.dev/">WebArena</a> tests whether agents can navigate websites to complete real-world tasks — shopping, forum posting, content management. The evaluator checks the agent’s answer against reference answers using string matching.</p> <p>The problem: <code class="language-plaintext highlighter-rouge">config_files/test.raw.json</code> contains <strong>every reference answer in plaintext</strong>:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"task_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"intent"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is the top-1 best-selling product in 2022"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"eval"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"reference_answers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"exact_match"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Quest Lumaflex Band"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>All 812 tasks. Every answer. Sitting on the same filesystem where the agent runs. No sandboxing prevents the agent from reading it. A trivial cheating agent:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">json</span>


<span class="k">class</span> <span class="nc">CheatingAgent</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="sh">'</span><span class="s">config_files/test.raw.json</span><span class="sh">'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">tasks</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">answers</span> <span class="o">=</span> <span class="p">{</span>
            <span class="n">t</span><span class="p">[</span><span class="sh">'</span><span class="s">task_id</span><span class="sh">'</span><span class="p">]:</span> <span class="n">t</span><span class="p">[</span><span class="sh">'</span><span class="s">eval</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">reference_answers</span><span class="sh">'</span><span class="p">]</span>
            <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">tasks</span>
        <span class="p">}</span>

    <span class="k">def</span> <span class="nf">act</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task_id</span><span class="p">,</span> <span class="n">observation</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">ref</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">answers</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">task_id</span><span class="p">,</span> <span class="p">{})</span>
        <span class="n">answer</span> <span class="o">=</span> <span class="n">ref</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">exact_match</span><span class="sh">'</span><span class="p">,</span> <span class="n">ref</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">must_include</span><span class="sh">'</span><span class="p">,</span> <span class="p">[</span><span class="sh">''</span><span class="p">])[</span><span class="mi">0</span><span class="p">])</span>
        <span class="k">return</span> <span class="p">{</span><span class="sh">'</span><span class="s">action_type</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">stop</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">answer</span><span class="sh">'</span><span class="p">:</span> <span class="n">answer</span><span class="p">}</span>
</code></pre></div></div> <p><strong>Result: perfect score on all string-match tasks with zero web browsing.</strong> No clicking, no navigation, no understanding of web interfaces. Just read a JSON file and return the answer.</p> <div style="background: #f5f5f5; border-left: 3px solid #999; padding: 0.8rem 1rem; border-radius: 4px; margin: 1.2rem 0; font-size: 0.88rem; color: #444;"> <strong>Root cause:</strong> Reference answers stored in agent-accessible filesystem with no integrity protection. The evaluator reads from the same JSON files the agent can access. </div> <h3 id="what-this-means">What This Means</h3> <p>Broken benchmarks don’t just produce wrong leaderboards — they poison training signals, inflate data pricing, and mislead deployment decisions. If nobody audits the evaluation infrastructure, everything built on top of it is unreliable.</p> <p>Our agent found 45 confirmed exploits that human reviewers missed — not because they were subtle, but because nobody was looking. The tools and methodology are open source at <a href="https://github.com/moogician/trustworthy-env">github.com/moogician/trustworthy-env</a>.</p>]]></content><author><name>Hao Wang, Qiuyang Mang</name></author><category term="research"/><category term="benchmark"/><category term="evaluation"/><category term="reward-hacking"/><category term="AI safety"/><category term="trustworthy"/><summary type="html"><![CDATA[AI benchmarks decide which models get funded, deployed, and trusted. We hacked 13 of them. 45 working exploits. Every benchmark rated critical. 
If the scores are fake, so is everything built on them — including your training data.]]></summary></entry><entry><title type="html">Argus: Automated Discovery of Test Oracles for Database Management Systems Using LLMs</title><link href="https://joyemang33.github.io/blog/2026/argus/" rel="alternate" type="text/html" title="Argus: Automated Discovery of Test Oracles for Database Management Systems Using LLMs"/><published>2026-02-23T00:00:00+00:00</published><updated>2026-02-23T00:00:00+00:00</updated><id>https://joyemang33.github.io/blog/2026/argus</id><content type="html" xml:base="https://joyemang33.github.io/blog/2026/argus/"><![CDATA[<hr/> <h3 id="tldr">TL;DR</h3> <p>Database Management Systems (DBMSs) are notoriously hard to test because you need a <em>test oracle</em> — a way to know if the output is correct. Prior work builds these oracles <strong>by hand</strong>, creating a never-ending cycle of manual effort.</p> <p><strong>Argus</strong> breaks this cycle by using LLMs to <em>automatically discover</em> test oracles, then formally <em>verifies</em> them with a SQL equivalence prover for soundness, and efficiently <em>instantiates</em> them into thousands of concrete test cases. Evaluated on five heavily-tested DBMSs, Argus found <strong>41 previously unknown bugs</strong> (36 logic bugs), outperforming state-of-the-art manual oracle designs.</p> <p>In practice, spending just <strong>~$10</strong> on LLM calls generates <strong>millions of reliable SQL tests</strong> — each capable of catching <em>logic bugs</em>, where a query silently returns wrong results instead of throwing an error.</p> <hr/> <h3 id="the-problem-test-oracles-are-a-bottleneck">The Problem: Test Oracles Are a Bottleneck</h3> <p><img src="https://joyemang33.github.io/assets/img/argus-1.png" alt="Test Oracle Problem" style="max-width: 75%; display: block; margin: 1rem auto;"/></p> <p>When testing a DBMS, how do you know if the result of a SQL query is <em>correct</em>? 
This is the <strong>test oracle problem</strong>. A naive approach would be to compare two DBMSs against each other, but that misses bugs they share. The dominant approach instead builds <em>semantic equivalence oracles</em>: transform a query \(Q\) into a semantically equivalent \(Q'\), run both, and flag inconsistencies as bugs.</p> <p>The catch: <strong>designing such transformation mechanisms is entirely manual</strong>. Researchers have published over 20 top-conference papers, each hand-crafting specialized oracles — <a href="https://dl.acm.org/doi/10.1145/3428279"><code class="language-plaintext highlighter-rouge">TLP</code></a>, <a href="https://dl.acm.org/doi/10.1145/3368089.3409710"><code class="language-plaintext highlighter-rouge">NoREC</code></a>, <a href="https://github.com/JZuming/EET"><code class="language-plaintext highlighter-rouge">EET</code></a>, <a href="https://dl.acm.org/doi/10.1145/3654990"><code class="language-plaintext highlighter-rouge">DQP</code></a> — yet bugs keep slipping through. Consider this real TiDB bug that went undetected for <em>years</em>:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t1</span><span class="p">(</span><span class="k">c</span> <span class="nb">INT</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t1</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">);</span>

<span class="c1">-- Reference: an always-false filter → should return {}</span>
<span class="k">SELECT</span> <span class="k">c</span> <span class="o">/</span> <span class="mi">3</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">WHERE</span> <span class="k">false</span><span class="p">;</span>       <span class="c1">-- {} ✓</span>

<span class="c1">-- Oracle: for Q = SELECT c / 3 FROM t1, Q EXCEPT Q should also be empty</span>
<span class="k">SELECT</span> <span class="k">c</span> <span class="o">/</span> <span class="mi">3</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">EXCEPT</span> <span class="k">SELECT</span> <span class="k">c</span> <span class="o">/</span> <span class="mi">3</span> <span class="k">FROM</span> <span class="n">t1</span><span class="p">;</span>  <span class="c1">-- {0.3333} ✗ (BUG!)</span>
</code></pre></div></div> <p>Catching this required the very specific insight that \(Q \setminus Q = \emptyset\). A human had to think of it. Can we make a machine do that automatically?</p> <hr/> <h3 id="key-insight-constrained-abstract-queries-caq">Key Insight: Constrained Abstract Queries (CAQ)</h3> <p>The core innovation in Argus is a new representation called a <strong>Constrained Abstract Query (CAQ)</strong> — a SQL query template with typed <em>placeholders</em> that can be filled with concrete SQL snippets.</p> <p>A placeholder \(\square_i\) can be either:</p> <ul> <li><strong><code class="language-plaintext highlighter-rouge">Expr(TableName : SQLDatatype)</code></strong> — any expression over a table that returns a given type (e.g., a Boolean expression over <code class="language-plaintext highlighter-rouge">t1</code>)</li> <li><strong><code class="language-plaintext highlighter-rouge">Table(SQLTableDef)</code></strong> — any table or subquery with a given schema</li> </ul> <p>An <strong>equivalent CAQ pair</strong> \((s, q_1, q_2)\) consists of a schema \(s\) and two CAQs, \(q_1\) and \(q_2\), that produce the <em>same results for every possible instantiation</em> of their placeholders. For example, the classic <strong>TLP oracle</strong> can be expressed as a CAQ pair:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Q₁: seed query</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t1</span><span class="p">,</span> <span class="err">□₁⊲</span><span class="k">Table</span><span class="p">(...);</span>

<span class="c1">-- Q₂: TLP three-way partition</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t1</span><span class="p">,</span> <span class="err">□₁⊲</span><span class="k">Table</span><span class="p">(...)</span> <span class="k">WHERE</span> <span class="p">(</span><span class="err">□₂⊲</span><span class="n">Expr</span><span class="p">(</span><span class="n">t1</span><span class="p">:</span><span class="nb">BOOLEAN</span><span class="p">)</span> <span class="k">IS</span> <span class="k">TRUE</span><span class="p">)</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t1</span><span class="p">,</span> <span class="err">□₁⊲</span><span class="k">Table</span><span class="p">(...)</span> <span class="k">WHERE</span> <span class="p">(</span><span class="err">□₂⊲</span><span class="n">Expr</span><span class="p">(</span><span class="n">t1</span><span class="p">:</span><span class="nb">BOOLEAN</span><span class="p">)</span> <span class="k">IS</span> <span class="k">FALSE</span><span class="p">)</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t1</span><span class="p">,</span> <span class="err">□₁⊲</span><span class="k">Table</span><span class="p">(...)</span> <span class="k">WHERE</span> <span class="p">(</span><span class="err">□₂⊲</span><span class="n">Expr</span><span class="p">(</span><span class="n">t1</span><span class="p">:</span><span class="nb">BOOLEAN</span><span class="p">)</span> <span class="k">IS</span> <span class="k">NULL</span><span class="p">);</span>

<span class="c1">-- Instantiation examples:</span>
<span class="c1">-- □₁ ↦ t1 ASOF JOIN t2</span>
<span class="c1">-- □₂ ↦ json_valid(t1.c0)</span>
</code></pre></div></div> <p>The power of CAQs: one CAQ pair is a reusable oracle that can generate <em>thousands</em> of concrete test cases by filling its placeholders with diverse SQL snippets.</p> <hr/> <h3 id="the-argus-pipeline">The Argus Pipeline</h3> <p><img src="https://joyemang33.github.io/assets/img/argus-2.png" alt="Overall pipeline of Argus" style="max-width: 90%; display: block; margin: 1rem auto;"/></p> <p>Argus operates in two stages:</p> <h4 id="stage-1--test-oracle-discovery-offline-one-time">Stage 1 — Test Oracle Discovery (offline, one-time)</h4> <p><strong>① Database Seeding.</strong> A grammar-based generator (<a href="https://github.com/sqlancer/sqlancer"><code class="language-plaintext highlighter-rouge">SQLancer</code></a>) produces random database schemas and seed CAQs. Virtual columns and tables serve as placeholders, making the output compatible with SQL provers that expect concrete syntax.</p> <p><strong>② LLM-based Oracle Generation + Formal Verification.</strong> For each seed CAQ \(q\), Argus iteratively prompts an LLM to generate an equivalent variant \(q'\). Two mechanisms ensure quality:</p> <ul> <li><strong>In-context learning</strong> — the LLM is shown verified successes (<em>Equal</em> set) and failures (<em>Fail</em> set) from previous rounds.</li> <li><strong>Diversity-oriented sampling</strong> — verified CAQs are clustered by query-plan tree-edit distance (k-means), and samples are drawn from each cluster to push the LLM toward novel execution plans.</li> </ul> <p>Every candidate \(q'\) must pass a <strong>SQL equivalence prover</strong> (<a href="https://github.com/SJTU-IPADS/SQLSolver"><code class="language-plaintext highlighter-rouge">SQLSolver</code></a>) before acceptance. Placeholders are replaced by virtual entities so the prover can reason on concrete queries. 
Only formally verified pairs become test oracles — <strong>zero false positives by design</strong>.</p> <h4 id="stage-2--test-case-instantiation-online-per-dbms">Stage 2 — Test Case Instantiation (online, per DBMS)</h4> <p><strong>③ Corpus Synthesis.</strong> A hybrid approach (LLM + grammar-based generator) pre-generates a large library of SQL snippets:</p> <ul> <li>LLMs produce <em>complex, feature-rich</em> expressions and table structures, guided by official DBMS documentation.</li> <li>The grammar-based generator covers <em>corner values and edge cases</em> systematically.</li> <li><strong>Cross-combination</strong>: expressions are recursively composed (e.g., plugging a <code class="language-plaintext highlighter-rouge">Boolean</code> expression into an <code class="language-plaintext highlighter-rouge">INT</code>-returning function wherever it expects a Boolean column) to create intricate multi-level expressions.</li> <li>Every snippet is <em>runtime-validated</em> on the target DBMS to filter out type mismatches and invalid SQL.</li> </ul> <p><strong>④⑤⑥ Instantiation &amp; Bug Detection.</strong> Each verified CAQ pair is instantiated up to \(K\) times by randomly sampling compatible snippets from the corpus. Placeholders are replaced consistently in both \(q\) and \(q'\). Random database instances are created, and the two queries are executed. 
Any mismatch between the two results is reported as a bug.</p> <p>Three <em>general constraints</em> on snippets guarantee that instantiated pairs remain equivalent even when concrete expressions are plugged in:</p> <ol> <li><strong>Determinism</strong> — no <code class="language-plaintext highlighter-rouge">RANDOM()</code>, <code class="language-plaintext highlighter-rouge">CURRENT_TIMESTAMP</code>, etc.</li> <li><strong>Null-preserving</strong> — the expression returns <code class="language-plaintext highlighter-rouge">NULL</code> when evaluated on all-<code class="language-plaintext highlighter-rouge">NULL</code> rows.</li> <li><strong>Empty-result-preserving</strong> — the expression returns an empty result on an empty table.</li> </ol> <hr/> <h3 id="representative-bugs-found">Representative Bugs Found</h3> <p style="font-size: 1rem; font-weight: 700; margin-bottom: 0.3rem;">📌 PostgreSQL: Incorrect json function in RIGHT JOIN</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t</span><span class="p">(</span><span class="k">c</span> <span class="nb">INT</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">);</span>

<span class="c1">-- Q1: RIGHT JOIN with FALSE → left side always NULL</span>
<span class="k">SELECT</span> <span class="n">sub</span><span class="p">.</span><span class="k">c</span> <span class="k">FROM</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">json_array_length</span><span class="p">(</span><span class="n">json_array</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="k">c</span><span class="p">))</span> <span class="k">AS</span> <span class="k">c</span> <span class="k">FROM</span> <span class="n">t</span>
<span class="p">)</span> <span class="k">AS</span> <span class="n">sub</span> <span class="k">RIGHT</span> <span class="k">JOIN</span> <span class="n">t</span> <span class="k">ON</span> <span class="k">FALSE</span><span class="p">;</span>  <span class="c1">-- Expected: {NULL}, Got: {2} ✗</span>

<span class="c1">-- Q2: explicitly NULL in subquery</span>
<span class="k">SELECT</span> <span class="n">sub</span><span class="p">.</span><span class="k">c</span> <span class="k">FROM</span> <span class="p">(</span><span class="k">SELECT</span> <span class="k">NULL</span> <span class="k">AS</span> <span class="k">c</span> <span class="k">FROM</span> <span class="n">t</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sub</span>
<span class="k">RIGHT</span> <span class="k">JOIN</span> <span class="n">t</span> <span class="k">ON</span> <span class="k">FALSE</span><span class="p">;</span>  <span class="c1">-- {NULL} ✓</span>
</code></pre></div></div> <p><strong>Root cause:</strong> PostgreSQL’s json functions bypass the null-propagation rule for <code class="language-plaintext highlighter-rouge">RIGHT JOIN</code>, producing incorrect non-null values. Reported and fixed within <strong>24 hours</strong>.</p> <p style="font-size: 1rem; font-weight: 700; margin-bottom: 0.3rem;">📌 Dolt: EXISTS duplicates rows</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t</span><span class="p">(</span><span class="n">c0</span> <span class="nb">INT</span><span class="p">,</span> <span class="n">c1</span> <span class="nb">INT</span><span class="p">,</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="n">c0</span><span class="p">,</span> <span class="n">c1</span><span class="p">));</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">);</span>

<span class="c1">-- With NOT NULL primary key, EXISTS is always TRUE → should return all rows once</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t</span> <span class="k">WHERE</span> <span class="k">EXISTS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">t</span> <span class="k">AS</span> <span class="n">x</span> <span class="k">WHERE</span> <span class="n">x</span><span class="p">.</span><span class="n">c0</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">c0</span><span class="p">);</span>
<span class="c1">-- Got: {(1,1),(2,2),(2,3),(1,1),(2,2),(2,3)} ✗ — every row duplicated!</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t</span><span class="p">;</span>  <span class="c1">-- {(1,1),(2,2),(2,3)} ✓</span>
</code></pre></div></div> <p style="font-size: 1rem; font-weight: 700; margin-bottom: 0.3rem;">📌 DuckDB: Empty CTE incorrectly short-circuits UNION ALL</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t1</span><span class="p">(</span><span class="n">c0</span> <span class="nb">BOOLEAN</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t2</span><span class="p">(</span><span class="n">c0</span> <span class="nb">BOOLEAN</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t3</span><span class="p">(</span><span class="n">c0</span> <span class="nb">BOOLEAN</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t2</span> <span class="k">VALUES</span> <span class="p">(</span><span class="k">true</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t3</span> <span class="k">VALUES</span> <span class="p">(</span><span class="k">true</span><span class="p">);</span>

<span class="c1">-- Q1</span>
<span class="k">SELECT</span> <span class="n">t2</span><span class="p">.</span><span class="n">c0</span> <span class="k">FROM</span> <span class="n">t2</span><span class="p">,</span> <span class="n">t3</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">t1</span> <span class="k">ON</span> <span class="k">false</span><span class="p">;</span>  <span class="c1">-- {true} ✓</span>

<span class="c1">-- Q2 (equivalent via CTE expansion)</span>
<span class="k">WITH</span> <span class="k">c</span> <span class="k">AS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t1</span> <span class="k">WHERE</span> <span class="k">false</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="n">t2</span><span class="p">.</span><span class="n">c0</span> <span class="k">FROM</span> <span class="n">t2</span> <span class="k">CROSS</span> <span class="k">JOIN</span> <span class="n">t3</span> <span class="k">CROSS</span> <span class="k">JOIN</span> <span class="k">c</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">t2</span><span class="p">.</span><span class="n">c0</span> <span class="k">FROM</span> <span class="n">t2</span> <span class="k">CROSS</span> <span class="k">JOIN</span> <span class="n">t3</span> <span class="k">WHERE</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="p">(</span><span class="k">SELECT</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="k">c</span><span class="p">);</span>
<span class="c1">-- Got: {0 rows} ✗</span>
</code></pre></div></div> <p><strong>Root cause:</strong> DuckDB incorrectly assumes an empty materialized CTE always causes the outer query to return no rows — not true with <code class="language-plaintext highlighter-rouge">UNION ALL</code>.</p> <p style="font-size: 0.85rem; background: var(--global-code-bg-color, #f6f8fa); border-left: 3px solid #6c8ebf; padding: 0.6rem 0.9rem; border-radius: 4px; margin: 1rem 0;">🎉 <strong>Real-world impact:</strong> Dolt <a href="https://www.dolthub.com/blog/2025-10-21-ai-sql-testing/">officially wrote about Argus on their blog</a>, detailing how our AI-generated SQL tests found <strong>19 bugs</strong> in their database engine and how they are integrating the Argus-generated test suite into their regression testing process.</p> <hr/> <h3 id="evaluation-results">Evaluation Results</h3> <h4 id="bugs-found-5-dbmss-2-month-campaign">Bugs Found (5 DBMSs, 2-month campaign)</h4> <table class="table table-sm text-center" style="width: auto; margin: 1rem auto 2rem; font-family: 'Open Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; font-size: 0.88rem; border: 1px solid #ccc; color: var(--global-text-color);"> <thead style="background-color: #f0f0f0; font-weight: 600;"> <tr> <th class="text-start" style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">DBMS</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Reported</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Fixed</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Confirmed</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Logic Bugs</th> </tr> </thead> <tbody> <tr><td class="text-start" style="border-color: #ccc; padding: 3px 10px; color: 
var(--global-text-color);">Dolt</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">19</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">18</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">19</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">18</td></tr> <tr><td class="text-start" style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">DuckDB</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">8</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">6</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">7</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">4</td></tr> <tr><td class="text-start" style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">MySQL</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">8</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">0</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">5</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">8</td></tr> <tr><td class="text-start" style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">PostgreSQL</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">1</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">1</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">1</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">1</td></tr> <tr><td class="text-start" style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">TiDB</td><td 
style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">5</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">2</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">5</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">5</td></tr> <tr style="font-weight: 700; border-top: 2px solid #ccc;"><td class="text-start" style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Total</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">41</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">27</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">37</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">36</td></tr> </tbody> </table> <p>36 out of 41 bugs are <strong>logic bugs</strong>, the silent and most dangerous class, which causes incorrect query results without raising any error. 
Compared to recent work with manually crafted oracles, Argus finds <em>more</em> bugs, even though its target DBMSs have already been extensively tested.</p> <h4 id="code-coverage">Code Coverage</h4> <p>On <strong>DuckDB</strong> (24-hour run):</p> <ul> <li>Argus achieves <strong>+19.9% line coverage</strong> and <strong>+18.1% branch coverage</strong> over <code class="language-plaintext highlighter-rouge">SQLancer++</code></li> <li><strong>5.5× line</strong> and <strong>6.4× function</strong> <a href="https://arxiv.org/abs/2508.16307">metamorphic coverage</a> over <code class="language-plaintext highlighter-rouge">SQLancer</code>. <em>Metamorphic coverage</em> measures how much code is exercised <em>differently</em> between the two equivalent queries, which correlates directly with the ability to find logic bugs</li> </ul> <p>On <strong>PostgreSQL</strong>:</p> <ul> <li>Outperforms <code class="language-plaintext highlighter-rouge">SQLancer++</code> by <strong>+19.0% line</strong> and <strong>+22.5% branch</strong> coverage</li> <li>Covers <strong>23 query features</strong> (vs. <code class="language-plaintext highlighter-rouge">SQLancer</code>’s 15 in <code class="language-plaintext highlighter-rouge">pglast</code>), demonstrating the LLM’s ability to generate feature-rich queries</li> </ul> <h4 id="new-oracles-vs-prior-manual-oracles">New Oracles vs. 
Prior Manual Oracles</h4> <p>In a head-to-head comparison on Dolt v1.0.0 (6-hour window), using <strong>the same snippet corpus and CAQ format</strong> for fairness:</p> <ul> <li><strong>Argus-5,000 oracles</strong>: found <strong>10 unique logic bugs</strong></li> <li><strong>Baseline</strong> (union of <a href="https://dl.acm.org/doi/10.1145/3428279"><code class="language-plaintext highlighter-rouge">TLP</code></a> + <a href="https://dl.acm.org/doi/10.1145/3368089.3409710"><code class="language-plaintext highlighter-rouge">NoREC</code></a> + <a href="https://www.usenix.org/conference/osdi24/presentation/jiang-zu-ming"><code class="language-plaintext highlighter-rouge">EET</code></a> + <a href="https://dl.acm.org/doi/10.1145/3654991"><code class="language-plaintext highlighter-rouge">DQP</code></a> — 11 hand-crafted oracles): found <strong>3</strong></li> <li>Argus-50 oracles (fewer than baseline): found only 2, confirming the <em>quantity</em> of oracles matters</li> </ul> <p>The <strong>3.33× improvement</strong> demonstrates that automation unlocks oracle diversity that manual design simply cannot match at scale.</p> <h4 id="cost--efficiency">Cost &amp; Efficiency</h4> <p><img src="https://joyemang33.github.io/assets/img/argus-3.png" alt="Argus-cost" style="max-width:60%; display: block; margin: 1rem auto;"/></p> <p>Argus’s two-stage design is <em>dramatically</em> more efficient than a naive LLM baseline that generates full concrete query pairs directly:</p> <table class="table table-sm" style="width: auto; margin: 1rem auto 2rem; font-family: 'Open Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; font-size: 0.88rem; border: 1px solid #ccc; color: var(--global-text-color);"> <thead style="background-color: #f0f0f0; font-weight: 600;"> <tr> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Cost Item</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; 
color: var(--global-text-color);">Phase</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; text-align: center; color: var(--global-text-color);">Cost</th> </tr> </thead> <tbody> <tr> <td style="border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">CAQ pair generation</td> <td style="border-color: #ccc; padding: 4px 10px; color: var(--global-text-color); font-size: 0.82rem;">Offline · one-time</td> <td style="border-color: #ccc; padding: 4px 10px; text-align: center; font-weight: 600; color: var(--global-text-color);">~$3</td> </tr> <tr> <td style="border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Snippet corpus <span style="font-size: 0.82rem; color: #888;">(100,000 snippets)</span></td> <td style="border-color: #ccc; padding: 4px 10px; color: var(--global-text-color); font-size: 0.82rem;">Offline · per DBMS</td> <td style="border-color: #ccc; padding: 4px 10px; text-align: center; font-weight: 600; color: var(--global-text-color);">~$12</td> </tr> <tr> <td style="border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Instantiation &amp; test execution</td> <td style="border-color: #ccc; padding: 4px 10px; color: var(--global-text-color); font-size: 0.82rem;">Online · reusable</td> <td style="border-color: #ccc; padding: 4px 10px; text-align: center; font-weight: 600; color: var(--global-text-color);">FREE</td> </tr> </tbody> </table> <p>The naive baseline generates test cases orders of magnitude more slowly because it must call the LLM for every single test case.</p> <h4 id="soundness-why-the-sql-prover-matters">Soundness: Why the SQL Prover Matters</h4> <p>When we replaced the SQL equivalence prover with <strong>LLM-as-a-judge</strong> (using <code class="language-plaintext highlighter-rouge">GPT-5</code>):</p> <ul> <li>20 consecutive bug reports were <strong>all false positives</strong></li> <li>Among 20 LLM-judged-equivalent CAQ pairs, 1 was actually inequivalent (5% 
error rate)</li> </ul> <p>In mature DBMSs, finding a single real bug may require <em>thousands</em> of queries. At that rate, even a 5% error rate in the equivalence check produces enough false alarms to drown out the true bugs. The prover is not optional: it is what makes Argus <em>practical</em>.</p> <hr/> <h3 id="what-makes-argus-different">What Makes Argus Different</h3> <table class="table table-sm" style="width: auto; margin: 1rem auto 2rem; font-family: 'Open Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; font-size: 0.88rem; border: 1px solid #ccc; color: var(--global-text-color);"> <thead style="background-color: #f0f0f0; font-weight: 600;"> <tr> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Aspect</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Prior work</th> <th style="background-color: #f0f0f0; border-color: #ccc; padding: 4px 10px; color: var(--global-text-color);">Argus</th> </tr> </thead> <tbody> <tr><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Oracle design</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Manual, expert-crafted</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Automated by LLM</td></tr> <tr><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Soundness</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Assumed correct</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Formally verified by SQL prover</td></tr> <tr><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Scalability</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Tens of hand-written oracles</td><td style="border-color: #ccc; padding: 3px 10px; color: var(--global-text-color);">Thousands of 
verified CAQ pairs</td></tr> </tbody> </table> <hr/> <h3 id="discussion">Discussion</h3> <p><strong>Prover limitations are opportunities.</strong> SQL equivalence provers currently support a subset of SQL features (core Calcite syntax: outer joins, nested queries, basic aggregations). Argus’s two-stage design mitigates this by <em>proving equivalence at the abstract CAQ level</em>, then instantiating placeholders with complex, DBMS-specific snippets that go beyond the prover’s reasoning capabilities.</p> <p><strong>We also found prover bugs.</strong> During development, Argus revealed <strong>10 bugs in <a href="https://github.com/SJTU-IPADS/SQLSolver"><code class="language-plaintext highlighter-rouge">SQLSolver</code></a> and <a href="https://github.com/qed-solver"><code class="language-plaintext highlighter-rouge">QED</code></a></strong> — incorrect equivalence proofs that would have caused false positives. All were fixed quickly. Improving Argus simultaneously improves the tools it depends on.</p> <p><strong>Extensible by design.</strong> Argus can be steered toward specific SQL features simply by adjusting the LLM prompt (e.g., “ensure the generated snippet includes at least one <code class="language-plaintext highlighter-rouge">OUTER JOIN</code>”). No code changes needed.</p> <p><strong>Future directions.</strong> Two natural extensions:</p> <ol> <li><strong>Expand the target domain.</strong> Argus’s core idea — using LLMs to discover semantic equivalences and formally verifying them — is not specific to relational DBMSs. The same paradigm could apply to <strong>compilers</strong> (e.g., finding equivalent IR transformations that expose miscompilation bugs), <strong>network systems</strong> (e.g., equivalent packet-forwarding rules that reveal routing inconsistencies), or <strong>graph/spatial databases</strong> (e.g., equivalent graph traversal queries). 
Any domain with a formal notion of equivalence and a verifier is a candidate.</li> <li><strong>Oracle prioritization.</strong> Given thousands of LLM-generated oracles, which are most likely to find bugs in a specific DBMS? Combining coverage feedback, historical bug patterns, and oracle structural diversity could guide Argus toward higher-yield test oracles.</li> </ol> <hr/> <h3 id="citation">Citation</h3> <div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">mang2025argus</span><span class="p">,</span>
  <span class="na">title</span>         <span class="p">=</span> <span class="s">{Automated Discovery of Test Oracles for Database Management Systems Using LLMs}</span><span class="p">,</span>
  <span class="na">author</span>        <span class="p">=</span> <span class="s">{Qiuyang Mang and Runyuan He and Suyang Zhong and Xiaoxuan Liu and Huanchen Zhang and Alvin Cheung}</span><span class="p">,</span>
  <span class="na">year</span>          <span class="p">=</span> <span class="s">{2025}</span><span class="p">,</span>
  <span class="na">eprint</span>        <span class="p">=</span> <span class="s">{2510.06663}</span><span class="p">,</span>
  <span class="na">archivePrefix</span> <span class="p">=</span> <span class="s">{arXiv}</span><span class="p">,</span>
  <span class="na">primaryClass</span>  <span class="p">=</span> <span class="s">{cs.DB}</span><span class="p">,</span>
  <span class="na">url</span>           <span class="p">=</span> <span class="s">{https://arxiv.org/abs/2510.06663}</span>
<span class="p">}</span>
</code></pre></div></div> <hr/> <p><em>This work was accepted at <a href="https://2026.sigmod.org/">SIGMOD 2026</a>. Find out more: <a href="https://arxiv.org/abs/2510.06663">[arXiv]</a> <a href="/assets/pdf/argus.pdf">[Slides]</a> <a href="https://www.dolthub.com/blog/2025-10-21-ai-sql-testing/">[Dolt Blog]</a></em></p>]]></content><author><name>Qiuyang Mang, Runyuan He, Suyang Zhong, Xiaoxuan Liu, Huanchen Zhang, Alvin Cheung</name></author><category term="research"/><category term="database"/><category term="testing"/><category term="LLM"/><category term="SQL"/><category term="Auto Discovery"/><summary type="html"><![CDATA[We present Argus, a novel framework that uses LLMs to automatically discover test oracles for DBMS testing, finding 41 previously unknown bugs across 5 widely used databases. Accepted at SIGMOD 2026.]]></summary></entry></feed>