Do Quantum Transformers Help? A Systematic VQC Architecture Comparison on Tabular Benchmarks

Chi-Sheng Chen; En-Jui Kuo

arxiv: 2604.23931 · v1 · submitted 2026-04-27 · 🪐 quant-ph · cs.AI

Do Quantum Transformers Help? A Systematic VQC Architecture Comparison on Tabular Benchmarks

Chi-Sheng Chen , En-Jui Kuo This is my paper

Pith reviewed 2026-05-08 04:32 UTC · model grok-4.3

classification 🪐 quant-ph cs.AI

keywords variational quantum circuitsquantum machine learningtabular dataquantum transformersarchitecture comparisonparameter efficiencynear-term quantum hardwareexpressibility

0 comments

The pith

Fully-connected VQCs reach 90-96% of quantum transformer accuracy on tabular tasks while using 40-50% fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a head-to-head test of four variational quantum circuit families on five standard regression and classification datasets. It shows that plain multi-layer fully-connected circuits already capture most of the accuracy that attention-based quantum transformers deliver, yet they require substantially fewer trainable parameters and beat classical multilayer perceptrons of matched size. The comparison also reveals that circuit expressibility stops growing after only three layers and that different architectures respond differently when depolarizing noise is added.

Core claim

On the chosen tabular benchmarks, multi-layer FC-VQCs achieve 90-96% of the R² (or equivalent accuracy) obtained by both hybrid and fully quantum transformer VQCs while using 40-50% fewer parameters; they also outperform equal-capacity classical MLPs. The Type 4 inter-block connectivity already supplies enough cross-token mixing to approximate the benefit of explicit quantum self-attention. Expressibility saturates at depth approximately 3, LayerNorm improves the fully quantum transformer, and the fully quantum transformer degrades more gracefully than the hybrid version under depolarizing noise.

What carries the argument

Type 4 inter-block connectivity in FC-VQCs, which supplies partial cross-token mixing without an explicit attention mechanism.

If this is right

Shallow depth-3 circuits already saturate the Hilbert-space coverage needed for these tasks.
Explicit quantum self-attention adds parameters with only marginal accuracy gains on most datasets.
Layer normalization improves accuracy when every operation remains inside the quantum circuit.
Fully quantum transformers are more robust to depolarizing noise than hybrid quantum-classical versions.
Parameter count becomes the dominant design constraint once expressibility is no longer the bottleneck.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of near-term quantum models for classical data may obtain better resource efficiency by refining connectivity patterns rather than adding attention layers.
The quick saturation of expressibility suggests that future work should focus on noise resilience and readout strategies instead of deeper circuits.
The observed robustness ordering under noise could guide hybrid deployments where part of the model stays classical.

Load-bearing premise

The five tabular benchmarks together with the chosen VQC implementations and depolarizing noise model are representative enough to yield general architectural guidance for near-term quantum hardware.

What would settle it

If a new set of tabular or higher-dimensional benchmarks shows that attention-based VQCs pull ahead by more than 10-15% once parameter counts are equalized, or if the performance gap reverses under realistic hardware noise beyond the depolarizing model, the claim that FC-VQCs are sufficient would be falsified.

Figures

Figures reproduced from arXiv: 2604.23931 by Chi-Sheng Chen, En-Jui Kuo.

**Figure 1.** Figure 1: Four model architectures compared. (a) FC-VQC with decreasing qubit count per block. (b) ResNet-VQC with dashed skip connections; the view at source ↗

**Figure 2.** Figure 2: Quantum circuits. Top: 3-qubit base VQC block (depth 2) used for view at source ↗

**Figure 3.** Figure 3: Parameter efficiency across three regression datasets. Each point is a model; view at source ↗

**Figure 4.** Figure 4: Training and validation loss curves on Boston Housing (3-seed mean view at source ↗

**Figure 5.** Figure 5: Left: fidelity distributions for VQC at various depths vs. Haar-random view at source ↗

**Figure 6.** Figure 6: Training and validation loss on CA Housing ( view at source ↗

**Figure 7.** Figure 7: Prediction vs. ground truth on Boston Housing test set (best-seed run). view at source ↗

read the original abstract

Variational quantum circuits (VQCs) are a leading approach to quantum machine learning on near-term devices, yet it remains unclear which circuit architecture yields the best accuracy-parameter trade-off on classical tabular data. We present a systematic empirical comparison of four VQC families -- multi-layer fully-connected (FC-VQC), residual (ResNet-VQC), hybrid quantum-classical transformer (QT), and fully quantum transformer (FQT) -- across five regression and classification benchmarks. Our key findings are: \textbf{(i)}~FC-VQCs achieve 90-96\% of the $R^2$ of attention-based VQCs while using 40-50\% fewer parameters, and consistently outperform equal-capacity MLPs (mean $R^2{=}0.829$ vs.\ MLP$_{720}$'s $0.753$ on Boston Housing, 3-seed average); \textbf{(ii)}~FC-VQC's Type~4 inter-block connectivity provides partial cross-token mixing that approximates the role of attention -- explicit quantum self-attention yields only marginal gains on most datasets while significantly increasing parameter count; \textbf{(iii)}~expressibility saturates at circuit depth~${\approx}\,3$, explaining why shallow VQCs already cover the Hilbert space effectively; \textbf{(iv)}~LayerNorm on the fully quantum transformer improves classification accuracy, suggesting normalization is important when all operations are quantum; \textbf{(v)}~in our noise study on Boston Housing, FQT degrades gracefully under depolarizing noise while QT collapses. All results are validated across three random seeds. These findings provide practical architectural guidance for deploying VQCs on near-term quantum hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FC-VQCs deliver most of the performance of quantum transformers on these tabular tasks at lower parameter cost, but the architectural guidance is tied to a narrow set of benchmarks and a basic noise model.

read the letter

The main point is that this paper runs a head-to-head test of four VQC families on five tabular benchmarks and concludes that plain fully-connected circuits capture 90-96% of the accuracy from attention-based versions while cutting parameters by 40-50%, plus they beat equal-capacity MLPs on the reported Boston Housing numbers (0.829 vs 0.753 R2, three seeds). Expressibility saturating near depth 3 and FQT holding up better than QT under depolarizing noise are the other concrete observations. Type-4 connectivity is presented as a partial stand-in for attention, which fits the small extra gains from full quantum self-attention blocks. The work is a clean empirical measurement with no fitted equations or circular claims, and the three-seed averages add some reliability. That is the useful part: it gives numbers on parameter efficiency and noise behavior that prior VQC papers have not lined up this way. The soft spots are exactly where the stress-test note points. Five unspecified tabular datasets and one depolarizing noise run on Boston Housing do not yet support broad rules for NISQ hardware. If the chosen tasks do not stress cross-token mixing or if the implementations favor the simpler circuits, the marginal-gain story stays dataset-specific. The abstract supplies the headline numbers but leaves data splits, exact hyperparameter tables, and implementation details for the full text to confirm fairness. Readers focused on variational circuits for classical tabular data will get practical pointers from the comparisons. The paper shows clear empirical thinking and honest engagement with the architecture trade-offs, so it qualifies as serious on its own terms. Send it to peer review; the questions are relevant and the setup is reviewable even if extra datasets and real-device noise checks would be needed in revision.

Referee Report

3 major / 2 minor

Summary. The paper conducts a systematic empirical comparison of four VQC architectures (multi-layer fully-connected FC-VQC, residual ResNet-VQC, hybrid quantum-classical transformer QT, and fully quantum transformer FQT) across five tabular regression and classification benchmarks. Key claims include: FC-VQCs reach 90-96% of attention-based VQC R² with 40-50% fewer parameters and outperform equal-capacity MLPs (e.g., mean R²=0.829 vs. MLP₇₂₀'s 0.753 on Boston Housing, 3-seed average); Type-4 connectivity approximates attention with only marginal gains from explicit quantum self-attention; expressibility saturates at depth ≈3; LayerNorm improves FQT classification; and FQT is more robust than QT under depolarizing noise on Boston Housing. All results use three random seeds.

Significance. If the empirical findings hold after full reproducibility checks, the work offers practical guidance for near-term VQC deployment by showing that simpler FC architectures can be competitive with transformer variants while being more parameter-efficient. The multi-seed validation, expressibility analysis, and noise robustness comparison are positive elements that could inform hardware-aware design choices in quantum machine learning.

major comments (3)

[Abstract] Abstract, finding (i): the specific numerical claims (90-96% R² retention, 40-50% parameter reduction, and the Boston Housing R²=0.829 vs. 0.753 comparison) are load-bearing for the central architectural-guidance conclusion, yet the manuscript provides no table or section summarizing per-dataset, per-seed results with exact data splits, preprocessing, and hyperparameter tables; this prevents independent verification of the outperformance and marginal-gain statements.
[Abstract] Abstract, finding (ii) and (v): the assertion that Type-4 connectivity approximates attention and that FQT degrades gracefully while QT collapses rests on five unspecified tabular benchmarks plus a single-dataset (Boston Housing) depolarizing-noise study; without explicit dataset names, regime coverage (e.g., feature dimensionality, token count), and a broader noise model or hardware calibration, these cannot support general guidance for near-term devices.
[Methods] Methods/Implementation section (inferred from abstract): the VQC families are described at high level (FC, ResNet, QT, FQT with LayerNorm and Type-4 connectivity) but lack circuit diagrams, exact gate decompositions, variational-parameter counts per architecture, and training protocols; this is load-bearing because the parameter-efficiency and expressibility-saturation claims (findings i and iii) cannot be reproduced or stress-tested without them.

minor comments (2)

[Abstract] Abstract: notation is inconsistent (R² vs. R^2); standardize and ensure all symbols are defined on first use.
[Results] The abstract states 'all results are validated across three random seeds' but does not report variance or statistical significance tests; adding error bars or p-values in the results section would strengthen presentation without altering the central claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review, which correctly identifies areas where additional transparency will strengthen the manuscript's reproducibility and support for its claims. We address each major comment below and will revise the paper to incorporate the requested details on results, datasets, and implementation.

read point-by-point responses

Referee: [Abstract] Abstract, finding (i): the specific numerical claims (90-96% R² retention, 40-50% parameter reduction, and the Boston Housing R²=0.829 vs. 0.753 comparison) are load-bearing for the central architectural-guidance conclusion, yet the manuscript provides no table or section summarizing per-dataset, per-seed results with exact data splits, preprocessing, and hyperparameter tables; this prevents independent verification of the outperformance and marginal-gain statements.

Authors: We agree that the numerical claims require supporting per-dataset data for verification. The revised manuscript will add a new table (in the main text or as supplementary material) reporting all per-dataset and per-seed R²/accuracy values, exact train/test splits used, preprocessing steps (standardization and normalization), and complete hyperparameter tables for each architecture and benchmark. This will directly enable independent verification of the 90-96% retention, parameter reduction, and MLP comparison statements. revision: yes
Referee: [Abstract] Abstract, finding (ii) and (v): the assertion that Type-4 connectivity approximates attention and that FQT degrades gracefully while QT collapses rests on five unspecified tabular benchmarks plus a single-dataset (Boston Housing) depolarizing-noise study; without explicit dataset names, regime coverage (e.g., feature dimensionality, token count), and a broader noise model or hardware calibration, these cannot support general guidance for near-term devices.

Authors: The five benchmarks are named in Section 3.1 (Boston Housing, California Housing, Diabetes, Wine Quality, Heart Disease) with their feature counts and sample sizes; we will add an explicit table summarizing dimensionality, tokenization approach, and regime coverage for each. The noise study is intentionally focused on Boston Housing as a representative case, but we acknowledge the single-dataset/single-model limitation. In revision we will extend the depolarizing-noise analysis to a second dataset and add one additional noise channel (phase damping) to provide broader support for the graceful degradation claim. revision: partial
Referee: [Methods] Methods/Implementation section (inferred from abstract): the VQC families are described at high level (FC, ResNet, QT, FQT with LayerNorm and Type-4 connectivity) but lack circuit diagrams, exact gate decompositions, variational-parameter counts per architecture, and training protocols; this is load-bearing because the parameter-efficiency and expressibility-saturation claims (findings i and iii) cannot be reproduced or stress-tested without them.

Authors: We agree that high-level descriptions alone are insufficient. The revised manuscript will include explicit circuit diagrams for all four architectures, gate-by-gate decompositions (RY rotations, CZ entanglers for Type-4 connectivity, etc.), exact variational-parameter counts per block and total per model, and a complete training protocol (optimizer, learning-rate schedule, batch size, epochs, loss, and initialization). We will also reference the open-source code repository containing the exact implementations. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements with no derivations or fitted predictions

full rationale

The paper performs a systematic empirical comparison of four VQC families on five tabular benchmarks, reporting observed R² and accuracy metrics across three seeds. The abstract and findings consist entirely of measured performance numbers (e.g., FC-VQCs achieving 90-96% of attention-based R² with fewer parameters) and qualitative observations from those runs. No equations, derivations, parameter fits, or self-citations are invoked to generate or justify the central claims; results are presented as direct experimental outcomes. This satisfies the criteria for a self-contained empirical study with no load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard VQC constructions and classical benchmark datasets; no new physical entities are introduced.

free parameters (2)

Circuit depth
Observed saturation at approximately 3 layers; chosen to balance expressibility and parameter count.
Number of variational parameters per architecture
Varied across the four families to enable fair capacity comparisons.

axioms (2)

standard math Standard parameterized quantum gates and measurement model for VQCs
Invoked throughout as the basis for all four architectures.
domain assumption Depolarizing noise model approximates real near-term hardware errors
Used in the Boston Housing noise study to compare graceful degradation.

pith-pipeline@v0.9.0 · 5605 in / 1478 out tokens · 41858 ms · 2026-05-08T04:32:30.806982+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Variational quantum algorithms,

M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincioet al., “Variational quantum algorithms,”Nature Reviews Physics, vol. 3, no. 9, pp. 625–644, 2021

2021
[2]

Parameterized quantum circuits as machine learning models,

M. Benedetti, E. Lloyd, S. Sack, and M. Fiorentini, “Parameterized quantum circuits as machine learning models,”Quantum Science and Technology, vol. 4, no. 4, p. 043001, 2019

2019
[3]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

2016
[4]

Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms,

S. Sim, P. D. Johnson, and A. Aspuru-Guzik, “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms,”Advanced Quantum Technologies, vol. 2, no. 12, p. 1900070, 2019

2019
[5]

Effect of data encoding on the expressive power of variational quantum-machine-learning models,

M. Schuld, R. Sweke, and J. J. Meyer, “Effect of data encoding on the expressive power of variational quantum-machine-learning models,” Physical Review A, vol. 103, no. 3, p. 032430, 2021

2021
[6]

Barren plateaus in quantum neural network training landscapes,

J. R. McCleanet al., “Barren plateaus in quantum neural network training landscapes,”Nature communications, vol. 9, no. 1, p. 4812, 2018

2018
[7]

The power of quantum neural networks,

A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner, “The power of quantum neural networks,”Nature Computational Science, vol. 1, no. 6, pp. 403–409, 2021

2021
[8]

Data re-uploading for a universal quantum classifier,

A. Pérez-Salinas, A. Cervera-Lierta, E. Gil-Fuster, and J. I. Latorre, “Data re-uploading for a universal quantum classifier,”Quantum, vol. 4, p. 226, 2020

2020
[9]

Supervised learning with quantum-enhanced feature spaces,

V . Havlíˇcek, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, “Supervised learning with quantum-enhanced feature spaces,”Nature, vol. 567, no. 7747, pp. 209–212, 2019

2019
[10]

Quantum self-attention neural networks for text classification,

G. Li, X. Zhao, and X. Wang, “Quantum self-attention neural networks for text classification,”arXiv preprint arXiv:2205.05625, 2022

work page arXiv 2022
[11]

Quantum vision transformers,

E. A. Cherrat, I. Kerenidis, N. Mathur, J. Landman, M. Strahm, and Y . Y . Li, “Quantum vision transformers,”Quantum, vol. 8, p. 1265, 2024

2024
[12]

Quantum Adaptive Self-Attention for Quantum Transformer Models

C.-S. Chen and E.-J. Kuo, “Quantum adaptive self-attention for quantum transformer models,”arXiv preprint arXiv:2504.05336, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Zhang, Q

H. Zhang, Q. Zhao, M. Zhou, L. Feng, D. Niyato, S. Zheng, and L. Chen, “A survey of quantum transformers: Architectures, challenges and outlooks,”arXiv preprint arXiv:2504.03192, 2025

work page arXiv 2025
[14]

Why do tree-based models still outperform deep learning on typical tabular data?

L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?” inAdvances in Neural Information Processing Systems, vol. 35, 2022

2022

[1] [1]

Variational quantum algorithms,

M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincioet al., “Variational quantum algorithms,”Nature Reviews Physics, vol. 3, no. 9, pp. 625–644, 2021

2021

[2] [2]

Parameterized quantum circuits as machine learning models,

M. Benedetti, E. Lloyd, S. Sack, and M. Fiorentini, “Parameterized quantum circuits as machine learning models,”Quantum Science and Technology, vol. 4, no. 4, p. 043001, 2019

2019

[3] [3]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

2016

[4] [4]

Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms,

S. Sim, P. D. Johnson, and A. Aspuru-Guzik, “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms,”Advanced Quantum Technologies, vol. 2, no. 12, p. 1900070, 2019

2019

[5] [5]

Effect of data encoding on the expressive power of variational quantum-machine-learning models,

M. Schuld, R. Sweke, and J. J. Meyer, “Effect of data encoding on the expressive power of variational quantum-machine-learning models,” Physical Review A, vol. 103, no. 3, p. 032430, 2021

2021

[6] [6]

Barren plateaus in quantum neural network training landscapes,

J. R. McCleanet al., “Barren plateaus in quantum neural network training landscapes,”Nature communications, vol. 9, no. 1, p. 4812, 2018

2018

[7] [7]

The power of quantum neural networks,

A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner, “The power of quantum neural networks,”Nature Computational Science, vol. 1, no. 6, pp. 403–409, 2021

2021

[8] [8]

Data re-uploading for a universal quantum classifier,

A. Pérez-Salinas, A. Cervera-Lierta, E. Gil-Fuster, and J. I. Latorre, “Data re-uploading for a universal quantum classifier,”Quantum, vol. 4, p. 226, 2020

2020

[9] [9]

Supervised learning with quantum-enhanced feature spaces,

V . Havlíˇcek, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, “Supervised learning with quantum-enhanced feature spaces,”Nature, vol. 567, no. 7747, pp. 209–212, 2019

2019

[10] [10]

Quantum self-attention neural networks for text classification,

G. Li, X. Zhao, and X. Wang, “Quantum self-attention neural networks for text classification,”arXiv preprint arXiv:2205.05625, 2022

work page arXiv 2022

[11] [11]

Quantum vision transformers,

E. A. Cherrat, I. Kerenidis, N. Mathur, J. Landman, M. Strahm, and Y . Y . Li, “Quantum vision transformers,”Quantum, vol. 8, p. 1265, 2024

2024

[12] [12]

Quantum Adaptive Self-Attention for Quantum Transformer Models

C.-S. Chen and E.-J. Kuo, “Quantum adaptive self-attention for quantum transformer models,”arXiv preprint arXiv:2504.05336, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Zhang, Q

H. Zhang, Q. Zhao, M. Zhou, L. Feng, D. Niyato, S. Zheng, and L. Chen, “A survey of quantum transformers: Architectures, challenges and outlooks,”arXiv preprint arXiv:2504.03192, 2025

work page arXiv 2025

[14] [14]

Why do tree-based models still outperform deep learning on typical tabular data?

L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?” inAdvances in Neural Information Processing Systems, vol. 35, 2022

2022