Testing Published Quantum Results
We reproduce published quantum computing results using AI-written circuits on modern hardware. Some are full replications at original scale; others are small-scale reproductions that verify the underlying mechanisms.
Each paper is tested on up to four backends: a noiseless emulator (correctness baseline), QI Tuna-9 (9 superconducting qubits), IQM Garnet (20 qubits), and IBM Torino (133 qubits). Claims are compared quantitatively against published values.
- **7** papers tested (1 more planned)
- **24** claims tested (across 7 backends)
- **96%** pass rate (23/24 claims)
- **7** backends: QI Emulator, QI Tuna-9, IBM Torino, IQM Garnet, ibm_marrakesh, tuna9_12edge, ibm_torino_9q_trex
Three Chips, One Suite
Same circuits, different hardware. Each metric tested on QI Tuna-9 (9 superconducting qubits), IQM Garnet (20 qubits), and IBM Torino (133 qubits).
| Metric | QI Tuna-9 (9q) | IQM Garnet (20q) | IBM Torino (133q) |
|---|---|---|---|
| Bell fidelity | 93.5% | 98.1% | 86.5% |
| GHZ-3 fidelity | 88.9% | 93.9% | 82.9% |
| GHZ-5 fidelity | 83.8% | 81.8% | 76.6% |
| GHZ-10 fidelity | n/a | 54.7% | 62.2% |
| Quantum Volume | 16 | 32 | 32 |
| RB gate fidelity | 99.82% | 99.82% | 99.99%* |
| VQE H2 (kcal/mol) | 0.92 | -- | 0.22 |
| Dominant noise | Dephasing | Dephasing | Depolarizing |
* IBM's RB value is inflated by the transpiler collapsing Clifford sequences; the Tuna-9 and IQM values reflect true gate fidelity. VQE error mitigation: IBM Torino uses TREX, Tuna-9 uses hybrid PS+REM. -- = not yet tested.
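Fidelities like those in the table are typically estimated from measurement counts. For GHZ states a standard estimator is F = (P + C)/2, where P is the population of the two all-equal bitstrings and C is the coherence extracted from parity oscillations. A minimal sketch of the population term (the coherence measurement is omitted, and the function name is illustrative):

```python
def ghz_population(counts):
    """Population term P for a GHZ fidelity estimate F = (P + C) / 2:
    the fraction of shots landing in |00...0> or |11...1>.

    counts: dict mapping measured bitstrings to shot counts.
    """
    shots = sum(counts.values())
    n = len(next(iter(counts)))  # number of qubits, from bitstring length
    return (counts.get("0" * n, 0) + counts.get("1" * n, 0)) / shots
```

For example, GHZ-3 counts of `{"000": 450, "111": 430, "001": 60, "110": 60}` give P = 0.88, which bounds the fidelity before the coherence term is folded in.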
Results by Backend
Which papers pass on which hardware?
| Paper | Emulator | Tuna-9 | IBM Torino | IQM Garnet | Type |
|---|---|---|---|---|---|
| Cross 2019 (3/3 claims) — same scale, different hardware | PASS | PASS | PASS | PASS | QV + RB |
| Sagastizabal 2019 (4/4 claims) — same scale, different hardware | PASS | PASS | PASS | -- | VQE + EM |
| Kandala 2017 (5/5 claims) — H2 only (omits LiH, BeH2) | PASS | PASS | PASS | -- | VQE |
| Peruzzo 2014 (6/8 claims) — superconducting, not photonic | PASS | PARTIAL | FAIL | -- | VQE |
| Harrigan 2021 (4/4 claims) — 3-6 qubits (original: 23) | PASS | PASS | -- | -- | QAOA |
| Kim 2023 (3/3 claims) — 9 qubits (original: 127) | PASS | PARTIAL | PARTIAL | -- | Ising |
PASS = all tested claims within published error bars. PARTIAL = some claims pass, some fail due to hardware noise. FAIL = no claims pass on hardware. -- = not yet tested. Notes show scope relative to original paper.
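The roll-up from per-claim outcomes to the paper-level labels in the table can be sketched as follows (the function name and input shape are illustrative):

```python
def paper_status(claim_results):
    """Roll per-claim outcomes up to a paper-level label.

    claim_results: list of booleans, True = claim within published
    error bars. An empty list means not yet tested on this backend.
    """
    if not claim_results:
        return "--"          # not yet tested
    passed = sum(claim_results)
    if passed == len(claim_results):
        return "PASS"        # all tested claims within published error bars
    if passed > 0:
        return "PARTIAL"     # some claims pass, some fail
    return "FAIL"            # no claims pass
```

So Peruzzo 2014 on Tuna-9, with a mix of passing and failing claims, maps to PARTIAL, while its Torino run with no passing claims maps to FAIL.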
Do published results hold up?
We extract quantitative claims from papers and test whether AI-generated circuits produce matching numbers on noiseless emulators and real hardware. Some tests are at original scale; others are smaller-scale checks of the underlying mechanisms.
Where does hardware noise break things?
Emulator runs pass consistently. Hardware runs reveal the noise floor: which claims survive real-world decoherence, and which are swamped?
Can AI close the gap?
Failure mode classification tells us whether the gap is noise (mitigable), circuit translation (fixable), or missing methodology (structural).
Completed Reports
Validating quantum computers using randomized model circuits
Cross et al. — Phys. Rev. A 100, 032328 (2019)
IBM Research | IBM superconducting (various)
Quantum approximate optimization of non-planar graph problems on a planar superconducting processor
Harrigan et al. — Nature Physics 17, 332-336 (2021)
Google AI Quantum | 53-qubit Sycamore (Google)
Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets
Kandala et al. — Nature 549, 242-246 (2017)
IBM Research | 6-qubit superconducting transmon
Evidence for the utility of quantum computing before fault tolerance
Kim et al. — Nature 618, 500-505 (2023)
IBM Quantum | 127-qubit Eagle (ibm_kyiv)
A variational eigenvalue solver on a photonic quantum processor
Peruzzo et al. — Nature Communications 5, 4213 (2014)
Various (Bristol, MIT, Google) | Photonic quantum processor
Error Mitigation by Symmetry Verification on a VQE
Sagastizabal et al. — Phys. Rev. A 100, 010302(R) (2019)
QuTech / TU Delft | 2-qubit transmon (Starmon-5)
A programmable two-qubit quantum processor in silicon
Watson et al. — Nature 555, 633-637 (2018)
QuTech / TU Delft | Si/SiGe spin qubits (2 qubits)
Paper Pipeline
Methodology
Claim extraction. Published claims are identified from paper text, figures, and supplementary material. Each claim has a published value, error bars (when available), and a reference figure.
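A claim record of this shape might look like the following sketch (field names are assumptions for illustration, not the project's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    """One extracted claim: a published value plus where it came from."""
    paper: str                    # e.g. "Cross 2019"
    name: str                     # e.g. "quantum_volume"
    published_value: float
    error_bar: Optional[float]    # None when the paper gives no error bars
    reference_figure: str         # figure/table the value was read from

# Hypothetical example entry:
qv = Claim("Cross 2019", "quantum_volume", 16.0, None, "Fig. 5")
```

Keeping the reference figure on each claim makes every pass/fail verdict traceable back to a specific spot in the paper.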
Circuit generation. An AI agent (Claude Opus 4.6) writes the quantum circuits, Hamiltonian construction, and measurement analysis code. The agent uses PennyLane, Qiskit, and OpenFermion depending on the paper's methodology.
Failure classification. Results are classified as: success (within published error bars), partial noise (qualitatively correct but degraded), noise dominated (hardware noise overwhelms the signal), or structural failure (circuit translation, parameter mismatch, missing methodology detail).
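The classification rule could be sketched as follows; the signature and the 50% degradation threshold are assumptions for illustration, not the project's actual criteria:

```python
def classify(measured, published, error_bar, emulator_ok,
             degradation_threshold=0.5):
    """Map one hardware result to a failure-mode category."""
    if not emulator_ok:
        # Fails even without noise: circuit translation, parameter
        # mismatch, or missing methodology detail.
        return "structural"
    deviation = abs(measured - published)
    if error_bar is not None and deviation <= error_bar:
        return "success"
    if published != 0 and deviation / abs(published) < degradation_threshold:
        return "partial_noise"    # qualitatively correct but degraded
    return "noise_dominated"      # noise overwhelms the signal
```

The emulator run acts as the control: a claim that fails on a noiseless emulator can never be blamed on hardware, which is what separates structural failures from the two noise categories.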
The research question. What do the gaps between published results and AI-reproduced results reveal about reproducibility in quantum computing? The finding is not the pass/fail — it's the pattern of where and why things break.