Can AI agents systematically replicate quantum computing experiments? We're finding out — running the same algorithms across four backends, testing every error mitigation technique we can find, and publishing everything.
The context for our empirical work: why AI-accelerated science matters, which papers define the field, and the data behind the hype.
GPT-5 runs 36,000 experiments, AI scientists publish papers, and a Nature study finds the field is shrinking
Landscape: Neural network error decoders, autonomous quantum agents, and AI circuit optimizers — a researcher's guide to the intersection
Landscape: Funding tables, government programs, and a curated reading list for researchers
Landscape: Bloch spheres, Q-spheres, circuit editors, and 12 interactive demos we built to make quantum intuitive
Addition, multiplication, Grover's search, and entanglement — six experiments on a 9-qubit superconducting chip. Every one returned the correct answer as the most common measurement.
Can a quantum computer do 2+3? Yes — and 5+3, 9+7, 3×2, Grover's search, and GHZ entanglement. We ran six experiments on Quantum Inspire's Tuna-9 superconducting chip. The simplest circuits hit 85% fidelity. The hardest (4-bit addition across all 9 qubits) still returned the correct answer 37% of the time. Fidelity tracks gate count exactly as theory predicts.
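"Tracks gate count exactly as theory predicts" refers to the standard exponential decay model: with per-gate fidelity f, an n-gate circuit returns the right answer with probability roughly f^n. A minimal sketch (the 0.97 per-gate fidelity and the gate counts are illustrative assumptions, not Tuna-9's published specs):

```python
# If each gate succeeds with probability f, an n-gate circuit should
# return the correct bitstring with probability roughly f**n.
def predicted_success(n_gates: float, gate_fidelity: float = 0.97) -> float:
    """Exponential fidelity decay model: F(n) = f**n."""
    return gate_fidelity ** n_gates

# A shallow circuit (~5 gates) vs. a deep adder (~33 gates, assumed):
print(predicted_success(5))   # ~0.86, near the simplest circuits' 85%
print(predicted_success(33))  # ~0.37, near the hardest experiment's 37%
```

The point of the plot in the post is that measured fidelities fall on exactly this kind of curve.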
We turned quantum chemistry eigenspectra into sound. Energy levels become harmonics, bond stretching becomes a pitch sweep, and dissociation sounds like a chord collapsing.
Map each energy eigenvalue to an audio oscillator. The ground state becomes a fundamental. Excited states become harmonics. Stretch the bond and hear the spectrum shift. Two molecules (H₂ and LiH), computed from first principles, sonified in real time.
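The mapping can be sketched in a few lines; the function name, frequency range, and scaling below are our illustrative choices, not the project's actual code:

```python
import numpy as np

# Pin the ground-state energy to a fundamental frequency and map each
# excited-state gap to a partial above it (scaling is illustrative).
def sonify(eigenvalues, fundamental_hz=220.0, duration_s=1.0, rate=44100):
    """Render an eigenspectrum as a sum of sine partials."""
    e = np.asarray(eigenvalues, dtype=float)
    gaps = e - e.min()                                  # energies above ground state
    span = gaps.max() if gaps.max() > 0 else 1.0
    freqs = fundamental_hz * (1.0 + 3.0 * gaps / span)  # fundamental .. 4x
    t = np.linspace(0.0, duration_s, int(rate * duration_s), endpoint=False)
    return sum(np.sin(2 * np.pi * f * t) for f in freqs) / len(freqs)

# Re-calling sonify at each bond length as the eigenvalues shift is what
# produces the pitch sweep; write the array out with any WAV or audio API.
```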
From molecular geometry to quantum hardware measurements in one automated pipeline. The emulator nailed it. IBM Fez tried its best.
We built a complete quantum chemistry pipeline from molecular integrals to qubit Hamiltonians to hardware measurements. H₂ on 2 qubits achieved chemical accuracy on the QI emulator (1.3 mHa error). LiH on 4 qubits needed 9 measurement circuits. The emulator nailed it (0.2 mHa). IBM Fez got the right quantum state but 354 mHa of noise. Noise scales faster than circuit depth.
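The final stage of such a pipeline, turning per-Pauli-term measurements into an energy, can be sketched as follows (the coefficients and expectation values are placeholders, not the real H₂ Hamiltonian):

```python
# A qubit Hamiltonian is a weighted sum of Pauli strings; the energy is the
# weighted sum of their measured expectation values: <H> = sum_P c_P <P>.
# Coefficients below are illustrative placeholders.
hamiltonian = {"II": -1.05, "ZI": 0.39, "IZ": 0.39, "ZZ": 0.01, "XX": 0.18}

def energy(expectations: dict[str, float]) -> float:
    """Combine per-term expectation values <P> into the total energy."""
    return sum(c * expectations[p] for p, c in hamiltonian.items())

# Each distinct measurement basis (here: Z-basis and X-basis) costs one
# circuit; a larger Hamiltonian like LiH's is why it needed 9 of them.
measured = {"II": 1.0, "ZI": 0.9, "IZ": 0.9, "ZZ": 0.82, "XX": -0.6}
```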
Honest benchmarks, fragile auth tokens, and why the hardware you trust is the hardware that runs your circuit as written.
We ran 50+ experiments on Quantum Inspire's Tuna-9, built an MCP server around the SDK, and automated a full experiment pipeline. The hardware surprised us — honest benchmarks, portable error mitigation, cross-platform parity on hard problems. The developer experience surprised us too, in less pleasant ways. Here's what we'd tell the QI team over coffee.
IBM's TREX (readout error correction) hit 0.22 kcal/mol. Tuna-9's best combo (readout mitigation + post-selection) averaged 2.52 kcal/mol. Zero-noise extrapolation made things worse. Here's what actually works for near-term quantum chemistry.
We compared 15+ error mitigation techniques across IBM Torino and Tuna-9 for hydrogen VQE (variational quantum eigensolver — an algorithm that finds molecular ground-state energies). IBM's TREX achieved chemical accuracy (0.22 kcal/mol) in a single shot. On Tuna-9, combining readout error mitigation with post-selection cut errors by 70% to 2.52 kcal/mol. But adding dynamical decoupling and Pauli twirling to TREX made IBM 45x worse. The lesson: understand your noise before stacking techniques.
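Readout error mitigation, one of the techniques compared here, is commonly implemented by calibrating a confusion matrix and inverting it. A single-qubit sketch with illustrative error rates (not measured device values):

```python
import numpy as np

# Calibrate: prepare |0> and |1>, record how often each is misread, then
# invert that confusion matrix to correct later measurement histograms.
p0_read_as_1 = 0.03   # prepared |0>, read "1" (illustrative)
p1_read_as_0 = 0.08   # prepared |1>, read "0" (illustrative)
confusion = np.array([
    [1 - p0_read_as_1, p1_read_as_0],
    [p0_read_as_1,     1 - p1_read_as_0],
])

raw_counts = np.array([880.0, 120.0])     # noisy histogram, 1000 shots
mitigated = np.linalg.solve(confusion, raw_counts)
mitigated = np.clip(mitigated, 0, None)   # clamp small negative artifacts
mitigated /= mitigated.sum()              # renormalize to probabilities
```

Note what this does and does not fix: it corrects measurement statistics only, which is exactly why stacking it onto a gate-error-dominated circuit buys nothing.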
We ran the same experiments on a noiseless emulator, IBM Torino (133q), Tuna-9 (9q), and IQM Garnet (20q). The answer: it matters a lot, but not always in the ways you expect.
We ran VQE (molecular energy estimation), quantum volume (a standard hardware benchmark), randomized benchmarking (gate accuracy testing), and error correction across 4 quantum backends. Benchmarks pass everywhere. VQE fails everywhere except the emulator. IQM Garnet achieves QV=32 while Tuna-9 manages QV=8. Error correction reveals the sharpest hardware differences. And IBM's 99.99% gate fidelity from randomized benchmarking is misleading.
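The last point comes from how randomized benchmarking works: it fits survival probability to an exponential A·p^m + B over sequence length m, and because the sequences are random, coherent errors largely average away, so the fitted fidelity can look superb while structured algorithm circuits still fail. A sketch of the fit on synthetic data (all numbers illustrative):

```python
import numpy as np

# Synthetic RB data: survival = A * p**m + B with A = B = 0.5, plus shot noise.
rng = np.random.default_rng(0)
m = np.array([1, 10, 50, 100, 200, 400])
true_p = 0.9998
survival = 0.5 + 0.5 * true_p ** m + rng.normal(0, 0.002, m.size)

# Linearize with the known floor B ~ 0.5: log(survival - B) = log(A) + m*log(p)
y = np.log(np.clip(survival - 0.5, 1e-9, None))
slope, intercept = np.polyfit(m, y, 1)
p_fit = np.exp(slope)
avg_gate_fidelity = 1 - (1 - p_fit) / 2   # single-qubit depolarizing conversion
```

The fitted fidelity is a property of averaged random sequences, not of your algorithm's specific gate pattern, which is how "99.99%" and "VQE fails" coexist.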
AI agents reproduced 14 published claims across emulator, IBM Torino, and Tuna-9 hardware. The gaps tell us more than the successes.
We used AI agents to replicate 4 landmark quantum computing papers on 3 different backends. Emulators largely matched published results (85% of claims passed). Real hardware told a different story: IBM Torino got within 9 kcal/mol on VQE, while Tuna-9 achieved Quantum Volume 8 but failed VQE entirely. The reproducibility gap is the finding.
We wasted days on HeH⁺ before realizing the energy model itself told us the answer. One ratio predicts everything.
After achieving chemical accuracy on H₂ (0.22 kcal/mol), we assumed HeH⁺ would be similar. Same circuit, same hardware, same error correction. It was 20x worse. Turns out you can predict this from one number in the molecular energy model — before running a single shot. Here's the pre-flight check we wish we'd known.
IBM's error correction went from 119x improvement to 1.3x when we changed circuits. A 30-second diagnostic would have told us why.
TREX (readout error correction) was our hero — 119x improvement on molecular energy estimation, chemical accuracy on the first try. So we used it on everything. Then we ran a deeper circuit and it barely helped (1.3x). Meanwhile ZNE (zero-noise extrapolation), which had failed before, would have given 14x. The mistake: we were fixing measurement errors on a circuit where gate errors dominated. Here's the 30-second test that tells you which fix to use.
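We can't reproduce the article's exact diagnostic here, but one plausible form of it: compare a measure-only probe (which isolates readout error) against a depth-folded version of the real circuit (which amplifies gate error). Everything below is a hypothetical sketch, not the post's actual test:

```python
# Hypothetical dominance check.
#   probe A: prepare |0...0>, measure immediately  -> isolates readout error
#   probe B: the real circuit folded to ~2x depth  -> amplifies gate error
def dominant_error(readout_only_fidelity: float,
                   base_fidelity: float,
                   folded_fidelity: float) -> str:
    readout_loss = 1.0 - readout_only_fidelity
    gate_loss = base_fidelity - folded_fidelity  # extra loss from added depth
    return "readout" if readout_loss > gate_loss else "gates"

# Shallow circuit: readout dominates, so TREX-style mitigation pays off.
print(dominant_error(0.94, 0.90, 0.89))   # -> readout
# Deep circuit: doubling depth costs far more than readout, so try ZNE.
print(dominant_error(0.94, 0.70, 0.45))   # -> gates
```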
From 63% to 80%. The bottleneck isn't intelligence — it's documentation.
We ran 151 quantum programming tasks against frontier LLMs. They scored 63%. The main failure wasn't bad quantum logic — it was outdated API knowledge. When we gave them current documentation, scores jumped to 71%. A multi-run ensemble hit 80%.
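A multi-run ensemble in its simplest form is majority voting over candidate outputs (a test-based pass@k filter is the other common variant). A minimal sketch, with made-up candidate strings:

```python
from collections import Counter

# Sample several candidate answers per task, keep the most common one.
def majority_vote(candidates: list[str]) -> str:
    return Counter(candidates).most_common(1)[0][0]

runs = ["qc.h(0)", "qc.hadamard(0)", "qc.h(0)"]  # three model attempts
print(majority_vote(runs))  # -> qc.h(0)
```

Voting on normalized outputs is what lets independently wrong runs cancel while the correct answer, if it is the mode, survives.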
MCP servers that let Claude Code generate random numbers from vacuum fluctuations (with Tuna-9 superconducting qubit fallback) and submit circuits to real quantum processors
We built two MCP servers that give Claude Code direct access to quantum resources: true random numbers with automatic fallback from ANU vacuum fluctuations to Tuna-9 superconducting qubits, plus circuit execution on Quantum Inspire hardware. Here's how they work and why this matters for AI-accelerated quantum research.
What happens when AI agents try to reproduce quantum computing experiments across different hardware?
We replicated 6 quantum computing papers across 4 hardware backends. 93% of claims reproduce successfully (25/27). Key finding: TREX (readout error correction) achieves 119x improvement for short molecular energy circuits but only 1.3x for deeper physics simulations — the error correction strategy must match the dominant error source.
Claude designed circuits, submitted them to three quantum backends, analyzed errors, and iterated — no human code required
We gave Claude direct access to quantum hardware through MCP tool calls. It designed a Bell state tomography experiment, submitted circuits to three backends, discovered that IBM's transpiler is as important as its hardware, and mapped how quickly each platform loses quantum coherence. No Python scripts. No human in the loop.
Claude autonomously discovered Tuna-9's topology, characterized its noise, and achieved 33% lower error rates through hardware-aware routing
We gave an AI agent access to a quantum processor it had never seen before and asked: can you figure out how it works and use that knowledge to run better circuits? In 33 hardware jobs, Claude discovered the full topology, identified the best and worst qubits, characterized noise types, and improved GHZ state fidelity by 5.8 percentage points.
Claude Opus 4.6 wrote 300 lines of molecular energy simulation code from a paper reference alone
We gave Claude Opus 4.6 a reference to Sagastizabal et al. (2019) — a QuTech paper on symmetry-verified molecular energy estimation for hydrogen — and asked it to replicate the experiment. It wrote the energy model, trial quantum state, noise model, and error mitigation from scratch.