Do Quantum Transformers Help? A Systematic VQC Architecture Comparison on Tabular Benchmarks
Pith reviewed 2026-05-08 04:32 UTC · model grok-4.3
The pith
Fully-connected VQCs reach 90-96% of quantum transformer accuracy on tabular tasks while using 40-50% fewer parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the chosen tabular benchmarks, multi-layer FC-VQCs achieve 90-96% of the R² (or equivalent accuracy) obtained by both hybrid and fully quantum transformer VQCs while using 40-50% fewer parameters; they also outperform equal-capacity classical MLPs. The Type 4 inter-block connectivity already supplies enough cross-token mixing to approximate the benefit of explicit quantum self-attention. Expressibility saturates at depth approximately 3, LayerNorm improves the fully quantum transformer, and the fully quantum transformer degrades more gracefully than the hybrid version under depolarizing noise.
What carries the argument
Type 4 inter-block connectivity in FC-VQCs, which supplies partial cross-token mixing without an explicit attention mechanism.
If this is right
- Shallow depth-3 circuits already saturate the Hilbert-space coverage needed for these tasks.
- Explicit quantum self-attention adds parameters with only marginal accuracy gains on most datasets.
- Layer normalization improves accuracy when every operation remains inside the quantum circuit.
- Fully quantum transformers are more robust to depolarizing noise than hybrid quantum-classical versions.
- Parameter count becomes the dominant design constraint once expressibility is no longer the bottleneck.
Where Pith is reading between the lines
- Designers of near-term quantum models for classical data may obtain better resource efficiency by refining connectivity patterns rather than adding attention layers.
- The quick saturation of expressibility suggests that future work should focus on noise resilience and readout strategies instead of deeper circuits.
- The observed robustness ordering under noise could guide hybrid deployments where part of the model stays classical.
Load-bearing premise
The five tabular benchmarks together with the chosen VQC implementations and depolarizing noise model are representative enough to yield general architectural guidance for near-term quantum hardware.
What would settle it
If a new set of tabular or higher-dimensional benchmarks shows that attention-based VQCs pull ahead by more than 10-15% once parameter counts are equalized, or if the performance gap reverses under realistic hardware noise beyond the depolarizing model, the claim that FC-VQCs are sufficient would be falsified.
Figures
read the original abstract
Variational quantum circuits (VQCs) are a leading approach to quantum machine learning on near-term devices, yet it remains unclear which circuit architecture yields the best accuracy-parameter trade-off on classical tabular data. We present a systematic empirical comparison of four VQC families -- multi-layer fully-connected (FC-VQC), residual (ResNet-VQC), hybrid quantum-classical transformer (QT), and fully quantum transformer (FQT) -- across five regression and classification benchmarks. Our key findings are: \textbf{(i)}~FC-VQCs achieve 90-96\% of the $R^2$ of attention-based VQCs while using 40-50\% fewer parameters, and consistently outperform equal-capacity MLPs (mean $R^2{=}0.829$ vs.\ MLP$_{720}$'s $0.753$ on Boston Housing, 3-seed average); \textbf{(ii)}~FC-VQC's Type~4 inter-block connectivity provides partial cross-token mixing that approximates the role of attention -- explicit quantum self-attention yields only marginal gains on most datasets while significantly increasing parameter count; \textbf{(iii)}~expressibility saturates at circuit depth~${\approx}\,3$, explaining why shallow VQCs already cover the Hilbert space effectively; \textbf{(iv)}~LayerNorm on the fully quantum transformer improves classification accuracy, suggesting normalization is important when all operations are quantum; \textbf{(v)}~in our noise study on Boston Housing, FQT degrades gracefully under depolarizing noise while QT collapses. All results are validated across three random seeds. These findings provide practical architectural guidance for deploying VQCs on near-term quantum hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic empirical comparison of four VQC architectures (multi-layer fully-connected FC-VQC, residual ResNet-VQC, hybrid quantum-classical transformer QT, and fully quantum transformer FQT) across five tabular regression and classification benchmarks. Key claims include: FC-VQCs reach 90-96% of attention-based VQC R² with 40-50% fewer parameters and outperform equal-capacity MLPs (e.g., mean R²=0.829 vs. MLP₇₂₀'s 0.753 on Boston Housing, 3-seed average); Type-4 connectivity approximates attention with only marginal gains from explicit quantum self-attention; expressibility saturates at depth ≈3; LayerNorm improves FQT classification; and FQT is more robust than QT under depolarizing noise on Boston Housing. All results use three random seeds.
Significance. If the empirical findings hold after full reproducibility checks, the work offers practical guidance for near-term VQC deployment by showing that simpler FC architectures can be competitive with transformer variants while being more parameter-efficient. The multi-seed validation, expressibility analysis, and noise robustness comparison are positive elements that could inform hardware-aware design choices in quantum machine learning.
major comments (3)
- [Abstract] Abstract, finding (i): the specific numerical claims (90-96% R² retention, 40-50% parameter reduction, and the Boston Housing R²=0.829 vs. 0.753 comparison) are load-bearing for the central architectural-guidance conclusion, yet the manuscript provides no table or section summarizing per-dataset, per-seed results with exact data splits, preprocessing, and hyperparameter tables; this prevents independent verification of the outperformance and marginal-gain statements.
- [Abstract] Abstract, finding (ii) and (v): the assertion that Type-4 connectivity approximates attention and that FQT degrades gracefully while QT collapses rests on five unspecified tabular benchmarks plus a single-dataset (Boston Housing) depolarizing-noise study; without explicit dataset names, regime coverage (e.g., feature dimensionality, token count), and a broader noise model or hardware calibration, these cannot support general guidance for near-term devices.
- [Methods] Methods/Implementation section (inferred from abstract): the VQC families are described at high level (FC, ResNet, QT, FQT with LayerNorm and Type-4 connectivity) but lack circuit diagrams, exact gate decompositions, variational-parameter counts per architecture, and training protocols; this is load-bearing because the parameter-efficiency and expressibility-saturation claims (findings i and iii) cannot be reproduced or stress-tested without them.
minor comments (2)
- [Abstract] Abstract: notation is inconsistent (R² vs. R^2); standardize and ensure all symbols are defined on first use.
- [Results] The abstract states 'all results are validated across three random seeds' but does not report variance or statistical significance tests; adding error bars or p-values in the results section would strengthen presentation without altering the central claims.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which correctly identifies areas where additional transparency will strengthen the manuscript's reproducibility and support for its claims. We address each major comment below and will revise the paper to incorporate the requested details on results, datasets, and implementation.
read point-by-point responses
-
Referee: [Abstract] Abstract, finding (i): the specific numerical claims (90-96% R² retention, 40-50% parameter reduction, and the Boston Housing R²=0.829 vs. 0.753 comparison) are load-bearing for the central architectural-guidance conclusion, yet the manuscript provides no table or section summarizing per-dataset, per-seed results with exact data splits, preprocessing, and hyperparameter tables; this prevents independent verification of the outperformance and marginal-gain statements.
Authors: We agree that the numerical claims require supporting per-dataset data for verification. The revised manuscript will add a new table (in the main text or as supplementary material) reporting all per-dataset and per-seed R²/accuracy values, exact train/test splits used, preprocessing steps (standardization and normalization), and complete hyperparameter tables for each architecture and benchmark. This will directly enable independent verification of the 90-96% retention, parameter reduction, and MLP comparison statements. revision: yes
-
Referee: [Abstract] Abstract, finding (ii) and (v): the assertion that Type-4 connectivity approximates attention and that FQT degrades gracefully while QT collapses rests on five unspecified tabular benchmarks plus a single-dataset (Boston Housing) depolarizing-noise study; without explicit dataset names, regime coverage (e.g., feature dimensionality, token count), and a broader noise model or hardware calibration, these cannot support general guidance for near-term devices.
Authors: The five benchmarks are named in Section 3.1 (Boston Housing, California Housing, Diabetes, Wine Quality, Heart Disease) with their feature counts and sample sizes; we will add an explicit table summarizing dimensionality, tokenization approach, and regime coverage for each. The noise study is intentionally focused on Boston Housing as a representative case, but we acknowledge the single-dataset/single-model limitation. In revision we will extend the depolarizing-noise analysis to a second dataset and add one additional noise channel (phase damping) to provide broader support for the graceful degradation claim. revision: partial
-
Referee: [Methods] Methods/Implementation section (inferred from abstract): the VQC families are described at high level (FC, ResNet, QT, FQT with LayerNorm and Type-4 connectivity) but lack circuit diagrams, exact gate decompositions, variational-parameter counts per architecture, and training protocols; this is load-bearing because the parameter-efficiency and expressibility-saturation claims (findings i and iii) cannot be reproduced or stress-tested without them.
Authors: We agree that high-level descriptions alone are insufficient. The revised manuscript will include explicit circuit diagrams for all four architectures, gate-by-gate decompositions (RY rotations, CZ entanglers for Type-4 connectivity, etc.), exact variational-parameter counts per block and total per model, and a complete training protocol (optimizer, learning-rate schedule, batch size, epochs, loss, and initialization). We will also reference the open-source code repository containing the exact implementations. revision: yes
Circularity Check
No circularity: direct empirical measurements with no derivations or fitted predictions
full rationale
The paper performs a systematic empirical comparison of four VQC families on five tabular benchmarks, reporting observed R² and accuracy metrics across three seeds. The abstract and findings consist entirely of measured performance numbers (e.g., FC-VQCs achieving 90-96% of attention-based R² with fewer parameters) and qualitative observations from those runs. No equations, derivations, parameter fits, or self-citations are invoked to generate or justify the central claims; results are presented as direct experimental outcomes. This satisfies the criteria for a self-contained empirical study with no load-bearing steps that reduce to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- Circuit depth
- Number of variational parameters per architecture
axioms (2)
- standard math Standard parameterized quantum gates and measurement model for VQCs
- domain assumption Depolarizing noise model approximates real near-term hardware errors
Reference graph
Works this paper leans on
-
[1]
Variational quantum algorithms,
M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincioet al., “Variational quantum algorithms,”Nature Reviews Physics, vol. 3, no. 9, pp. 625–644, 2021
2021
-
[2]
Parameterized quantum circuits as machine learning models,
M. Benedetti, E. Lloyd, S. Sack, and M. Fiorentini, “Parameterized quantum circuits as machine learning models,”Quantum Science and Technology, vol. 4, no. 4, p. 043001, 2019
2019
-
[3]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778
2016
-
[4]
Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms,
S. Sim, P. D. Johnson, and A. Aspuru-Guzik, “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms,”Advanced Quantum Technologies, vol. 2, no. 12, p. 1900070, 2019
2019
-
[5]
Effect of data encoding on the expressive power of variational quantum-machine-learning models,
M. Schuld, R. Sweke, and J. J. Meyer, “Effect of data encoding on the expressive power of variational quantum-machine-learning models,” Physical Review A, vol. 103, no. 3, p. 032430, 2021
2021
-
[6]
Barren plateaus in quantum neural network training landscapes,
J. R. McCleanet al., “Barren plateaus in quantum neural network training landscapes,”Nature communications, vol. 9, no. 1, p. 4812, 2018
2018
-
[7]
The power of quantum neural networks,
A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner, “The power of quantum neural networks,”Nature Computational Science, vol. 1, no. 6, pp. 403–409, 2021
2021
-
[8]
Data re-uploading for a universal quantum classifier,
A. Pérez-Salinas, A. Cervera-Lierta, E. Gil-Fuster, and J. I. Latorre, “Data re-uploading for a universal quantum classifier,”Quantum, vol. 4, p. 226, 2020
2020
-
[9]
Supervised learning with quantum-enhanced feature spaces,
V . Havlíˇcek, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, “Supervised learning with quantum-enhanced feature spaces,”Nature, vol. 567, no. 7747, pp. 209–212, 2019
2019
-
[10]
Quantum self-attention neural networks for text classification,
G. Li, X. Zhao, and X. Wang, “Quantum self-attention neural networks for text classification,”arXiv preprint arXiv:2205.05625, 2022
-
[11]
Quantum vision transformers,
E. A. Cherrat, I. Kerenidis, N. Mathur, J. Landman, M. Strahm, and Y . Y . Li, “Quantum vision transformers,”Quantum, vol. 8, p. 1265, 2024
2024
-
[12]
Quantum Adaptive Self-Attention for Quantum Transformer Models
C.-S. Chen and E.-J. Kuo, “Quantum adaptive self-attention for quantum transformer models,”arXiv preprint arXiv:2504.05336, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [13]
-
[14]
Why do tree-based models still outperform deep learning on typical tabular data?
L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?” inAdvances in Neural Information Processing Systems, vol. 35, 2022
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.