
arxiv: 2605.27616 · v1 · pith:MZZRDBFLnew · submitted 2026-05-26 · 💻 cs.CV · cs.AI

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

Pith reviewed 2026-06-29 18:09 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords FP4 quantization · QAT recipes · anomaly segmentation · Swin Transformer · CNN · model architecture · brain tumor segmentation · quantization robustness

The pith

Architecture choice has the largest impact on FP4 quantization robustness for anomaly segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the combined effects of model architecture, scale, and different FP4 quantization-aware training recipes on brain tumor segmentation performance. It establishes that attention-based models maintain high quality regardless of the recipe chosen, while convolutional networks lose performance under certain recipes especially as they grow larger. This matters because real-time anomaly segmentation requires both accurate detection and fast low-precision computation, so knowing which models tolerate quantization best guides practical deployment.

Core claim

Architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNN degrades under gradient-quantizing recipes at larger scales. The Swin Transformer is robust to QAT recipe choice across all scales.

What carries the argument

The three-way interaction of architecture, scale, and FP4 QAT recipe evaluated on recall-critical brain tumor segmentation under a unified protocol.

If this is right

  • At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse.
  • At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality.
  • Five-fold patient-level cross-validation confirms these findings are robust to data partition.
  • The Swin Transformer is recommended for FP4-quantized anomaly segmentation.
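
The first bullet's "discretization" can be illustrated with a small simulation. The sketch below is an editorial illustration, not the paper's code: it fake-quantizes a softmax output onto the FP4 E2M1 value grid with one scale per 16-element block, as NVFP4 does. The names `round_to_grid` and `fake_quant_fp4` are hypothetical, the block scale is kept as a plain float rather than FP8, and which tensors the paper actually quantizes is not stated here.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 shares one scale across 16 consecutive elements


def round_to_grid(x: float) -> float:
    """Round to the nearest signed E2M1 value, saturating at +/-6.

    Ties resolve toward the smaller magnitude (a simplification).
    """
    mag = min(abs(x), E2M1_GRID[-1])
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def fake_quant_fp4(values: list[float]) -> list[float]:
    """Quantize then dequantize, using one scale per block of 16 elements.

    Assumption: the scale stays a Python float; real NVFP4 stores block
    scales in FP8 (E4M3) plus a per-tensor FP32 scale.
    """
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        out.extend(round_to_grid(v / scale) * scale for v in block)
    return out


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([0.1 * i for i in range(16)])  # 16 distinct probabilities
quantized = fake_quant_fp4(probs)
print(len(set(probs)), "distinct values before,", len(set(quantized)), "after")
```

Because every value in a block is a grid point times the shared scale, a block of non-negative probabilities can take at most eight distinct values after quantization, however smooth the input was.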

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The resilience pattern may appear in other medical imaging tasks that need low-precision inference.
  • Developers targeting FP4 should prioritize architecture selection before tuning QAT recipes.
  • Testing the same protocol on non-medical anomaly detection datasets would show whether the architecture effect is domain-specific.

Load-bearing premise

The observed three-way interactions between architecture, scale, and QAT recipe are assumed to be driven primarily by the model properties rather than by unexamined dataset-specific factors or implementation details of the unified evaluation protocol.

What would settle it

Repeating the experiments on a different anomaly segmentation dataset and finding that the architecture-dependent degradation patterns under gradient-quantizing recipes disappear would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.27616 by Oleg Rybakov, Zijian Du.

Figure 1. Architecture overview and quantization layout.
  … with ViT consistently underperforming regardless of optimizer. (§4.2, …
Figure 2. NVFP4 QAT data flow for a single quantized linear layer. Yellow blocks: quantization steps; green blocks: NVFP4 GEMM operations; orange: optimizer. Each QAT recipe (…
Figure 3. Validation loss curves (BCE + Tversky) for all architectures across 8 NVFP4 recipes (columns) and 3 model scales (rows: 500K, 4M, 15M). All architectures converge across recipes
Figure 4. AUPRC vs. model scale for all architectures. AUPRC increases with scale but with diminishing returns from 4M to 15M. Swin achieves the highest AUPRC at every scale.
  … gradually building global context without the quadratic cost of full self-attention. Standard ViT lacks this locality prior: it attends over all patches from the first layer, which is data-hungry and less effective on small medical datasets whe…
Figure 5. PRC curves and prediction distributions at 500K scale. Swin (left) shows discretized prediction probabilities under NVFP4 Full and Forward-Only, while CNN (right) remains smooth, resulting in lower AUPRC for Swin at this scale.
  … bare FP4 quantization introduces. Forward–backward consistency is sufficient without advanced techniques. Chain Rule also lacks RHT and SR yet loses only 0.003 AUPRC, because it re…
Figure 6. Normalized AUPRC (% of BF16 baseline) across all quantized recipes for Swin (top) and CNN (bottom) at three scales. Each dot is one of 10 random weight initialization seeds; bars show mean ± 95% CI. Shaded region highlights recipes incorporating SR, RHT, or 2D scaling
Figure 7. Cross-validation robustness. Swin outperforms CNN under both BF16 baseline and the best FP4 recipe across all five patient-level folds at 4M scale. Dots show individual folds; error bars show ±1 std.
  5. Conclusion: We studied NVFP4 QAT across three architectures, three matched scales, and eight recipes on recall-critical brain tumor segmentation. 1. Swin is the most FP4-robust architecture, achieving the hi…
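
The Figure 3 caption names the training objective as "BCE + Tversky". A minimal sketch of such a combined loss follows, assuming an unweighted sum of the two terms and the standard soft Tversky index TP / (TP + α·FP + β·FN). The function name `bce_tversky_loss` and the weights α = 0.3, β = 0.7 are illustrative choices, not values taken from the paper.

```python
import math


def bce_tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """Mean binary cross-entropy plus (1 - soft Tversky index).

    alpha weights false positives and beta weights false negatives, so
    beta > alpha penalizes missed foreground pixels more (favoring recall).
    The default weights are illustrative, not the paper's settings.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)

    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp so log() stays finite
        bce -= t * math.log(p) + (1 - t) * math.log(1 - p)
    bce /= len(probs)
    return bce + (1.0 - tversky)


targets = [1, 1, 0, 0]
missed = bce_tversky_loss([0.1, 0.9, 0.1, 0.1], targets)       # one false negative
false_alarm = bce_tversky_loss([0.9, 0.9, 0.9, 0.1], targets)  # one false positive
print("missed tumor pixel costs more:", missed > false_alarm)
```

In this toy case both predictions have the same cross-entropy, so the gap between them comes entirely from the Tversky term, which is what makes the objective recall-oriented.
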
Original abstract

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNN degrades under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.
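
The abstract credits advanced recipes with mitigating gradient quantization noise, and the figure captions list "SR" among their ingredients. Assuming SR stands for stochastic rounding, the sketch below shows the property usually cited for it: round-to-nearest on the FP4 E2M1 grid shifts a value by a fixed amount, while stochastic rounding is unbiased in expectation. This is an editorial illustration with hypothetical function names, not the paper's implementation.

```python
import math
import random

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def neighbors(mag):
    """Return the grid points just below and just above a magnitude in [0, 6]."""
    lo = max(g for g in E2M1_GRID if g <= mag)
    hi = min(g for g in E2M1_GRID if g >= mag)
    return lo, hi


def round_nearest(x):
    """Deterministic rounding: always pick the closer grid point."""
    mag = min(abs(x), 6.0)
    lo, hi = neighbors(mag)
    return math.copysign(lo if mag - lo <= hi - mag else hi, x)


def round_stochastic(x, rng):
    """Round up with probability equal to the fractional position between neighbors."""
    mag = min(abs(x), 6.0)
    lo, hi = neighbors(mag)
    if hi == lo:  # already on the grid
        return math.copysign(lo, x)
    p_up = (mag - lo) / (hi - lo)
    return math.copysign(hi if rng.random() < p_up else lo, x)


rng = random.Random(0)
g = 2.3  # hypothetical scaled gradient value between grid points 2.0 and 3.0
n = 20000
mean_sr = sum(round_stochastic(g, rng) for _ in range(n)) / n
print("nearest:", round_nearest(g), "stochastic mean:", round(mean_sr, 2))
```

Round-to-nearest maps every occurrence of 2.3 to 2.0, a systematic error of 0.3, whereas the stochastic average stays close to 2.3 at the cost of per-sample variance.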

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper examines the three-way interaction of architecture, model scale, and FP4 quantization-aware training (QAT) recipes for recall-critical anomaly segmentation on a brain tumor task. Under a unified evaluation protocol with multiple architectures and scales, it reports that architecture exerts the largest effect on quantization robustness: attention-based models exhibit resilience to recipe choice while CNNs degrade under gradient-quantizing recipes at larger scales. At low capacity FP4 can discretize softmax attention but advanced recipes avoid collapse; at scale advanced recipes mitigate gradient noise for CNNs. Five-fold patient-level cross-validation is used to confirm stability to data partitions. The Swin Transformer is identified as robust across scales and recommended for FP4-quantized anomaly segmentation.

Significance. If the reported interactions hold, the work supplies actionable guidance for architecture selection in low-precision real-time anomaly segmentation, particularly highlighting the practical advantage of attention-based models for FP4 deployment. The explicit use of 5-fold patient-level CV to demonstrate partition robustness is a methodological strength that supports the internal reliability of the empirical comparisons.
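
For readers unfamiliar with the term, "patient-level" cross-validation means every image from a given patient lands in the same fold, so no patient contributes to both training and validation. A minimal standard-library sketch follows, assuming each sample carries a patient identifier; `patient_level_folds` and the toy data are hypothetical and not the authors' splitting code.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which no patient appears on both sides."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # reproducible patient order

    for k in range(n_folds):
        val_patients = set(patients[k::n_folds])  # every n-th patient goes to fold k
        train_idx, val_idx = [], []
        for pid, indices in by_patient.items():
            (val_idx if pid in val_patients else train_idx).extend(indices)
        yield sorted(train_idx), sorted(val_idx)


# Hypothetical data: 10 patients with 3 image slices each.
slice_patients = [f"patient_{i // 3:02d}" for i in range(30)]
folds = list(patient_level_folds(slice_patients))
for train_idx, val_idx in folds:
    train_p = {slice_patients[i] for i in train_idx}
    val_p = {slice_patients[i] for i in val_idx}
    print(len(train_idx), "train slices,", len(val_idx), "val slices,",
          "leak-free:", train_p.isdisjoint(val_p))
```

Splitting by slice instead of by patient would let near-identical neighboring slices from one scan sit on both sides of the split, which tends to inflate validation scores.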

major comments (1)
  1. [Abstract] The central claim that 'architecture choice has the largest impact on quantization robustness' and the resulting recommendation of the Swin Transformer for FP4-quantized anomaly segmentation rest on experiments confined to a single brain tumor segmentation task under one unified protocol. While 5-fold patient-level CV addresses stability to data splits, the absence of additional datasets, tasks, or protocol variations means the observed three-way interactions (attention resilience vs. CNN degradation) could be driven by dataset statistics or implementation details rather than intrinsic architectural properties; this directly affects the load-bearing general recommendation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for highlighting the methodological strengths of the 5-fold patient-level cross-validation. We respond to the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'architecture choice has the largest impact on quantization robustness' and the resulting recommendation of the Swin Transformer for FP4-quantized anomaly segmentation rest on experiments confined to a single brain tumor segmentation task under one unified protocol. While 5-fold patient-level CV addresses stability to data splits, the absence of additional datasets, tasks, or protocol variations means the observed three-way interactions (attention resilience vs. CNN degradation) could be driven by dataset statistics or implementation details rather than intrinsic architectural properties; this directly affects the load-bearing general recommendation.

    Authors: We acknowledge that the experiments are confined to a single brain tumor segmentation task and dataset. The unified protocol across architectures and scales isolates the three-way interactions under controlled conditions, and the 5-fold patient-level CV demonstrates robustness to data partitions within this setting. However, we agree that the load-bearing general recommendation in the abstract exceeds the scope of the evidence. We will therefore revise the abstract to qualify the recommendation as applying to the studied brain tumor segmentation task and add an explicit limitations paragraph discussing the need for validation on additional datasets and tasks. This constitutes a partial revision that preserves the core empirical findings while addressing the generalizability concern.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison with no derivations or fitted predictions

Full rationale

The paper is an empirical study evaluating architecture-scale-QAT recipe interactions on brain tumor segmentation via 5-fold patient-level CV under a unified protocol. No equations, parameter fits, self-definitional claims, or load-bearing self-citations appear in the provided text. All reported outcomes are direct experimental measurements rather than reductions of inputs by construction. This matches the default case of a self-contained empirical paper (score 0-2).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation present; purely empirical comparison.


