AI agents replicate quantum papers on real hardware.
Can AI systematically reproduce quantum computing experiments? We tested 27 claims from 6 landmark papers across 3 quantum processors. 93% pass. The gaps between published results and AI-reproduced results are the finding.
Paper Replications
Sagastizabal et al. 2019
IBM TREX: 0.22 kcal/mol
Kandala et al. 2017
Chemical accuracy on 3 configs
Peruzzo et al. 2014
Coefficient amplification discovered
Cross et al. 2019
QV=32 on IBM & IQM
Harrigan et al. 2021
74.1% approx ratio on Tuna-9
Kim et al. 2023
9-qubit, 180 CZ gates, 14.1x ZNE
Key Findings
Chemical accuracy on real hardware
TREX on IBM Torino. 119x improvement over raw. The simplest mitigation wins.
Coefficient amplification predicts error
H2: ratio 4.4, error 0.22 kcal/mol. HeH+: ratio 7.8, error 4.45 kcal/mol. A 1.8x larger ratio comes with a 20x larger error.
Topology beats scale
Tuna-9 beats IQM Garnet on GHZ-5: 83.8% vs 81.8%. Knowing your chip matters more than its scale.
Most error is readout, not gates
ZNE failed on both backends. Gate folding adds <1.3 kcal/mol. Readout correction is what works.
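A minimal sketch of why readout correction can recover so much accuracy, assuming TREX here means twirled readout mitigation in its usual form: randomly apply an X gate before measurement, flip the recorded bit back classically, and divide by a calibration value measured on |0>. The error rates and function names are hypothetical, and the expectation values are computed exactly rather than sampled. This illustrates the idea for a single qubit and is not the project's pipeline.

```python
def measured_z(p_one, p01, p10, flipped):
    """Exact <Z> seen through a noisy readout channel.

    p_one   : ideal probability that the qubit is in |1>
    p01     : P(read 1 | qubit is 0), hypothetical error rate
    p10     : P(read 0 | qubit is 1), hypothetical error rate
    flipped : if True, an X gate precedes readout and the recorded
              bit is flipped back in classical post-processing
    """
    if flipped:
        p_one = 1.0 - p_one  # the X gate swaps the populations
    prob_read_one = (1.0 - p_one) * p01 + p_one * (1.0 - p10)
    z = 1.0 - 2.0 * prob_read_one
    # Flipping the recorded bit back negates the measured Z value.
    return -z if flipped else z


def twirled_z(p_one, p01, p10):
    """Average over the two twirl settings (no flip, flip)."""
    return 0.5 * (measured_z(p_one, p01, p10, False)
                  + measured_z(p_one, p01, p10, True))


def mitigated_z(p_one, p01, p10):
    """Divide the twirled estimate by the twirled calibration on |0>."""
    calibration = twirled_z(0.0, p01, p10)  # ideal <Z> of |0> is +1
    return twirled_z(p_one, p01, p10) / calibration


if __name__ == "__main__":
    p_one, p01, p10 = 0.3, 0.02, 0.08  # hypothetical values
    print("ideal    :", 1.0 - 2.0 * p_one)
    print("raw      :", measured_z(p_one, p01, p10, False))
    print("mitigated:", mitigated_z(p_one, p01, p10))
```

Twirling turns the asymmetric readout error into a single scale factor, 1 - p01 - p10, so one calibration measurement is enough to undo it. No extra gates are folded into the circuit, which is consistent with the observation above that readout, not gate noise, dominates the error.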
100+ Experiments
Entanglement benchmarking across qubit pairs
3-50 qubit multipartite entanglement
H2 and HeH+ energy estimation with mitigation
Combinatorial optimization on hardware
RB, QV, connectivity probes, characterization
[[4,2,2]] detection code, NN decoders
Three Chips, One Suite
Best small-scale fidelity
Highest Bell fidelity
Best VQE with TREX
Explore the Research
6 papers, 27 claims tested across 3 chips. Every claim documented.
100+ experiments with raw counts, analysis, and circuit details.
Tuna-9 vs Garnet vs Torino. Bell, GHZ, QV, VQE head-to-head.
349 prompts from 445 sessions. The 5-phase workflow that emerged.
14 posts: mitigation showdowns, topology maps, noise forensics.
The method that made all this possible.
Data & Reproducibility
All raw data, circuits, and analysis scripts are open on GitHub. Every result file uses schema-versioned JSON with SHA-256 checksums for raw counts and circuits.
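A checksum check of this kind could look roughly like the sketch below. The field names (`schema_version`, `raw_counts`, `raw_counts_sha256`) and the canonical JSON encoding are assumptions made for illustration; the actual schema in the repository may differ.

```python
import hashlib
import json


def canonical_sha256(payload):
    """Hash a JSON-serializable object using a stable encoding.

    Sorting keys and fixing separators makes the digest independent
    of key order and whitespace (an assumed convention).
    """
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def verify_result(record):
    """Return True if the stored checksum matches the raw counts."""
    return canonical_sha256(record["raw_counts"]) == record["raw_counts_sha256"]


# Hypothetical result record, built here only to exercise the check.
counts = {"00": 498, "01": 12, "10": 12, "11": 502}
record = {
    "schema_version": "1.0",
    "raw_counts": counts,
    "raw_counts_sha256": canonical_sha256(counts),
}

if __name__ == "__main__":
    print("checksum ok:", verify_result(record))
```

Hashing a canonical encoding rather than the file bytes means a result file can be reformatted without invalidating its checksum, while any change to the counts themselves is still detected.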