
arxiv: 2511.05477 · v2 · submitted 2025-11-07 · 💻 cs.CV

GroupKAN: Efficient Kolmogorov-Arnold Networks via Grouped Spline Modeling

Pith reviewed 2026-05-17 23:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords: Kolmogorov-Arnold Networks, Medical Image Segmentation, Efficient Architectures, Spline Modeling, Grouped Interactions, Parameter Reduction, U-KAN

The pith

GroupKAN restricts spline interactions within channel groups to reduce parameter growth in Kolmogorov-Arnold Networks for medical segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Kolmogorov-Arnold Networks offer adaptive nonlinear transformations but their full spline mappings cause quadratic parameter growth with increasing channels, which becomes impractical for medical image segmentation where data is limited. The paper proposes GroupKAN to divide channels into groups and limit the spline mappings to occur only within each group. This structured constraint lowers the parameter scaling while still allowing the network to learn effective nonlinearities for segmentation. Tests on breast ultrasound, gland, and colorectal polyp datasets show GroupKAN reaching an average IoU of 79.80% with 3.02 million parameters, surpassing the previous KAN model that used 6.35 million parameters. The grouped design also leads to activation maps that more precisely highlight the target structures in the images.

Core claim

GroupKAN introduces two components: the Grouped KAN Transform, which computes splines only for intra-group channel pairs across a chosen number of groups, and the Grouped KAN Activation, which shares spline functions inside each group. Together they cut the parameter count to roughly half that of U-KAN (3.02M vs. 6.35M) while raising average IoU by 1.11% on three standard medical benchmarks.
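The sharing idea behind the Grouped KAN Activation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`piecewise_linear`, `grouped_activation`) are hypothetical, equal-sized contiguous channel groups are an assumption, and a degree-1 (piecewise-linear) spline stands in for the order-k B-splines the paper describes, purely to keep the example short.

```python
from bisect import bisect_right
from typing import List, Sequence


def piecewise_linear(x: float, grid: Sequence[float], values: Sequence[float]) -> float:
    """Evaluate a degree-1 spline with knots `grid` and knot values `values`.

    Inputs outside the grid are clamped to the nearest boundary value.
    """
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    i = bisect_right(grid, x) - 1              # index of the interval containing x
    t = (x - grid[i]) / (grid[i + 1] - grid[i])
    return (1 - t) * values[i] + t * values[i + 1]


def grouped_activation(
    token: Sequence[float],
    grid: Sequence[float],
    group_values: Sequence[Sequence[float]],
) -> List[float]:
    """Apply one shared spline per channel group to a single token vector.

    Assumption: channels are split into len(group_values) contiguous,
    equal-sized groups; every channel in a group uses the same spline.
    """
    channels, groups = len(token), len(group_values)
    if channels % groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    size = channels // groups
    return [
        piecewise_linear(x, grid, group_values[c // size])
        for c, x in enumerate(token)
    ]


if __name__ == "__main__":
    grid = [-1.0, 0.0, 1.0]
    # Two groups: a ReLU-like spline and an |x|-like spline (illustrative values).
    group_values = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
    print(grouped_activation([-0.5, 0.5, -0.5, 0.5], grid, group_values))
    # → [0.0, 0.5, 0.5, 0.5]
```

With 4 channels and 2 groups, only 2 sets of spline values are stored instead of 4, which is the kind of saving the shared-activation design aims for.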

What carries the argument

The Grouped KAN Transform (GKT), which restricts spline interactions to intra-group channel mappings across g groups and thereby reduces the spline-induced quadratic term from O(C²(G+k)) to O(C²((G+k)/g + 1)).
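To make the scaling concrete, the following sketch counts parameters for a dense C-to-C KAN layer and for a grouped one. It rests on two assumptions that the abstract does not spell out: each spline carries G + k coefficients, and the "+1" term corresponds to one dense C × C linear weight. The function names and the values of C, G, and k are illustrative only and are not taken from the paper.

```python
def full_kan_spline_params(C: int, G: int, k: int) -> int:
    """Spline coefficients of a dense C -> C KAN layer.

    One spline with (G + k) coefficients per (input, output) channel pair.
    """
    return C * C * (G + k)


def grouped_kan_params(C: int, G: int, k: int, g: int) -> int:
    """Assumed GKT count: intra-group splines plus one dense C x C weight.

    The dense term is our reading of the '+1' in O(C^2((G+k)/g + 1)).
    """
    if C % g != 0:
        raise ValueError("C must be divisible by g")
    per_group = (C // g) ** 2 * (G + k)    # splines inside a single group
    return g * per_group + C * C           # equals C^2 * ((G + k) / g + 1)


if __name__ == "__main__":
    C, G, k = 256, 5, 3                    # illustrative values, not from the paper
    full = full_kan_spline_params(C, G, k)
    for g in (1, 2, 4, 8):
        grouped = grouped_kan_params(C, G, k, g)
        print(f"g={g}: {grouped:,} params, {grouped / full:.3f} of the full spline count")
```

For these illustrative values the grouped count falls to a quarter of the full spline count at g = 8. Under the stated assumption, the dense C² term does not shrink with g, so the savings level off as g grows.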

If this is right

  • Models for medical image analysis can achieve better accuracy with reduced computational resources and memory footprint.
  • Activation maps become more localized and interpretable, aiding clinical review of model decisions.
  • The approach enables KAN layers to be incorporated into larger segmentation networks without exceeding hardware limits.
  • Overfitting risks decrease because the constrained interactions limit unnecessary functional complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Grouping strategies like this may help control complexity in other types of basis expansion networks beyond KANs.
  • Applying the same intra-group restriction to vision transformers or CNNs could yield similar efficiency gains in other tasks.
  • Future work might explore adaptive group sizes that depend on channel importance rather than fixed divisions.

Load-bearing premise

That restricting spline interactions to intra-group channel mappings preserves sufficient expressive capacity for medical segmentation without meaningful performance loss.

What would settle it

Measuring GroupKAN's IoU on the BUSI dataset as the number of groups g is increased until it equals the number of channels (one channel per group, the most restrictive setting). If the premise holds, IoU should stay close to that of the full unconstrained KAN; if it fails, IoU should fall clearly below it.

Figures

Figures reproduced from arXiv: 2511.05477 by Anh Nguyen, Anwar P.P. Abdul Majeed, Fan Zhang, Guojie Li, Muhammad Ateeq, Tianyi Liu.

Figure 1. Accuracy–complexity trade-off in medical segmentation: IoU (%) vs. number of parameters (M) among …
Figure 2. Overview of the GroupKAN pipeline. The encoder extracts features through convolutional blocks. In the …
Figure 3. Illustration of Grouped KAN Activation. Given token embeddings …
Figure 4. Illustration of Grouped KAN Transform. The input feature …
Figure 5. Qualitative comparison of segmentation results on BUSI, GlaS, and CVC-ClinicDB. GroupKAN produces …
Figure 6. 3D visualization of IoU and parameter count across …
Figure 7. Channel activation maps for explainability comparison. GroupKAN with GKT (bottom-left) shows the best …
Original abstract

Medical image segmentation demands models that achieve high accuracy while maintaining computational efficiency and clinical interpretability. While recent Kolmogorov-Arnold Networks (KANs) offer powerful adaptive non-linearities, their full-channel spline transformations incur a quadratic parameter growth of $\mathcal{O}(C^{2}(G+k))$ with respect to the channel dimension $C$, where $G$ and $k$ denote the number of grid intervals and spline polynomial order, respectively. Moreover, unconstrained spline mappings lack structural constraints, leading to excessive functional freedom, which may cause overfitting under limited medical annotations. To address these challenges, we propose GroupKAN (Grouped Kolmogorov-Arnold Networks), an efficient architecture driven by group-structured spline modeling. Specifically, we introduce: (1) Grouped KAN Transform (GKT), which restricts spline interactions to intra-group channel mappings across $g$ groups, effectively reducing the spline-induced quadratic expansion to $\mathcal{O}(C^{2}(\frac{G+k}{g} + 1))$, thereby significantly lowering the effective quadratic coefficient; and (2) Grouped KAN Activation (GKA), which applies shared spline functions within each group to enable efficient token-wise non-linearities. By imposing structured constraints on channel interactions, GroupKAN achieves a substantial reduction in parameter redundancy without sacrificing expressive capacity. Extensive evaluations on three medical benchmarks (BUSI, GlaS, and CVC) demonstrate that GroupKAN achieves an average IoU of 79.80%, outperforming the strong U-KAN baseline by +1.11% while requiring only 47.6% of the parameters (3.02M vs. 6.35M). Qualitative results further reveal that GroupKAN produces sharply localized activation maps that better align with the ground truth than MLPs and KANs, significantly enhancing clinical interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes GroupKAN for efficient medical image segmentation using Kolmogorov-Arnold Networks. It introduces the Grouped KAN Transform (GKT), which restricts spline interactions to within channel groups and reduces parameter scaling to O(C²((G+k)/g + 1)), and the Grouped KAN Activation (GKA), which shares spline activations within each group. On the BUSI, GlaS, and CVC benchmarks, it reports an average IoU of 79.80%, +1.11% over U-KAN, with 47.6% of the parameters (3.02M vs. 6.35M), and claims better interpretability.

Significance. GroupKAN addresses the quadratic parameter growth in KANs through structured grouping. If the modest performance gains prove robust, this could facilitate wider adoption of KANs in efficient and interpretable medical segmentation models. The approach provides a concrete mechanism for trading interaction density against computational savings while maintaining or improving accuracy on the tested datasets.

major comments (3)
  1. [Abstract] The reported average IoU of 79.80% and +1.11% improvement lack accompanying error bars, results from multiple random seeds, or p-values from statistical tests. Without these, the claim of outperforming U-KAN without sacrificing accuracy cannot be fully substantiated, as the central efficiency-without-loss assertion relies on these empirical results.
  2. [Methods (GKT definition)] The GKT partitions channels into g groups and applies intra-group splines, yielding the reduced complexity. However, this implicitly assumes later layers or GKA can recover necessary inter-group non-linearities for precise boundary segmentation in medical images. No analysis of cross-group feature interactions or sensitivity to g is provided, which is load-bearing for whether expressive capacity is preserved.
  3. [Experiments] No ablation experiments are described varying the group count g, despite it being a key free parameter that controls the efficiency-capacity trade-off. Including such ablations would directly test the weakest assumption that intra-group restriction does not lead to meaningful performance loss.
minor comments (2)
  1. [Abstract] The complexity expression uses G and k for grid and order; ensure these are consistently defined and referenced in the main text with equation numbers.
  2. [Qualitative results] The claim of 'sharply localized activation maps' would benefit from quantitative metrics comparing alignment with ground truth, such as activation overlap scores, rather than relying solely on qualitative description.
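One way to turn the qualitative claim in minor comment 2 into a number is sketched below. The source does not define an "activation overlap score"; this example assumes it is the IoU between a thresholded activation map and the binary ground-truth mask. The function name and the 0.5 threshold are hypothetical choices.

```python
from typing import Sequence


def activation_mask_iou(
    activation: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    threshold: float = 0.5,
) -> float:
    """IoU between a thresholded activation map and a binary mask.

    Assumption: `activation` is already normalised to [0, 1] and has the
    same height and width as `mask`. Returns 1.0 when both are empty.
    """
    intersection = 0
    union = 0
    for act_row, mask_row in zip(activation, mask):
        for a, m in zip(act_row, mask_row):
            predicted = a >= threshold     # pixel counted as "active"
            target = bool(m)               # pixel inside the ground truth
            intersection += predicted and target
            union += predicted or target
    return intersection / union if union else 1.0


if __name__ == "__main__":
    activation = [[0.9, 0.2], [0.7, 0.1]]  # toy 2x2 map, not real data
    mask = [[1, 0], [0, 0]]
    print(activation_mask_iou(activation, mask))  # → 0.5
```

Because the score depends on the threshold, reporting it at several thresholds would give a fairer comparison across models.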

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, agreeing where appropriate and outlining specific revisions to strengthen the empirical support and analysis.

  1. Referee: [Abstract] The reported average IoU of 79.80% and +1.11% improvement lack accompanying error bars, results from multiple random seeds, or p-values from statistical tests. Without these, the claim of outperforming U-KAN without sacrificing accuracy cannot be fully substantiated, as the central efficiency-without-loss assertion relies on these empirical results.

    Authors: We agree that statistical rigor is essential for validating the reported performance gains. In the revised manuscript, we will augment the abstract and results sections with mean IoU and standard deviations computed over multiple random seeds (at least five runs), along with p-values from paired statistical tests comparing GroupKAN against the U-KAN baseline. This will provide stronger substantiation for the efficiency-without-loss claim.
    revision: yes

  2. Referee: [Methods (GKT definition)] The GKT partitions channels into g groups and applies intra-group splines, yielding the reduced complexity. However, this implicitly assumes later layers or GKA can recover necessary inter-group non-linearities for precise boundary segmentation in medical images. No analysis of cross-group feature interactions or sensitivity to g is provided, which is load-bearing for whether expressive capacity is preserved.

    Authors: The referee correctly identifies a key implicit assumption in the GKT design. While the architecture relies on subsequent convolutional layers and the overall network depth to integrate cross-group information hierarchically, we will revise the Methods section to include a dedicated discussion of cross-group feature recovery. We will also add a sensitivity analysis subsection examining performance as a function of g to demonstrate that expressive capacity for boundary segmentation is preserved across reasonable group counts.
    revision: partial

  3. Referee: [Experiments] No ablation experiments are described varying the group count g, despite it being a key free parameter that controls the efficiency-capacity trade-off. Including such ablations would directly test the weakest assumption that intra-group restriction does not lead to meaningful performance loss.

    Authors: We fully agree that ablations on the group count g are required to rigorously validate the core design trade-off. In the revised Experiments section, we will add comprehensive ablation studies varying g (including values such as 1, 2, 4, and 8) and report the resulting IoU scores, parameter counts, and efficiency metrics on all three benchmarks. These results will directly address whether intra-group restrictions incur meaningful performance degradation.
    revision: yes
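The multi-seed reporting promised in the first response could look roughly like the sketch below. The rebuttal does not say which paired test would be used; an exact sign-flip permutation test is swapped in here because it needs only the standard library. The function name is hypothetical and the score lists are made-up placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence


def paired_sign_flip_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2^n sign assignments of the differences a[i] - b[i] and
    returns the fraction whose |mean| is at least the observed |mean|.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:   # tolerance for float ties
            hits += 1
    return hits / total


if __name__ == "__main__":
    # Placeholder per-seed IoU values for two models (NOT from the paper).
    model_a = [0.80, 0.81, 0.79, 0.82, 0.80]
    model_b = [0.78, 0.80, 0.78, 0.80, 0.79]
    print(f"A: {mean(model_a):.4f} +/- {stdev(model_a):.4f}")
    print(f"B: {mean(model_b):.4f} +/- {stdev(model_b):.4f}")
    print(f"p = {paired_sign_flip_p_value(model_a, model_b)}")  # → p = 0.0625
```

With five paired runs this test cannot return a p-value below 2/32 = 0.0625, even when one model wins on every seed, so more than five seeds would be needed to reach the conventional 0.05 level.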

Circularity Check

0 steps flagged

No significant circularity; efficiency scaling follows directly from explicit grouping definition and is validated empirically

full rationale

The paper introduces GKT and GKA by defining intra-group spline restrictions (g groups) and derives the reduced complexity O(C²((G+k)/g +1)) as a direct algebraic consequence of that architectural choice. Reported gains (+1.11% IoU, 47.6% parameters) are measured against external baselines (U-KAN) on independent datasets rather than obtained by fitting or renaming within the paper. No load-bearing step reduces to self-citation, fitted prediction, or self-definition; the derivation remains self-contained.
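The algebra referred to above can be sketched as follows. This is a reconstruction rather than the paper's own derivation: it assumes g equal groups of C/g channels, G + k coefficients per spline, and that the "+1" term comes from one dense C × C linear weight. The symbols P_full, P_spline, and P_GKT are introduced here for illustration.

```latex
\begin{align*}
P_{\text{full}} &= C \cdot C \cdot (G+k) = C^{2}(G+k), \\
P_{\text{spline}} &= g \cdot \left(\frac{C}{g}\right)^{2}(G+k) = \frac{C^{2}(G+k)}{g}, \\
P_{\text{GKT}} &= \frac{C^{2}(G+k)}{g} + C^{2} = C^{2}\left(\frac{G+k}{g} + 1\right), \\
\frac{P_{\text{GKT}}}{P_{\text{full}}} &= \frac{1}{g} + \frac{1}{G+k}.
\end{align*}
```

Under these assumptions the ratio is bounded below by 1/(G+k) however large g becomes, which is consistent with the abstract's wording that grouping lowers the quadratic coefficient rather than removing the quadratic term.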

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The design rests on the premise that channel grouping imposes useful structural constraints while retaining enough non-linearity; the number of groups is treated as a tunable hyperparameter.

free parameters (1)
  • number of groups g
    Hyperparameter that controls the degree of parameter reduction versus retained expressivity; its specific value is chosen to achieve the reported efficiency gains.
axioms (1)
  • domain assumption: Intra-group spline mappings suffice to model the channel interactions required for accurate medical segmentation
    Invoked to justify restricting interactions to within groups in the GKT definition.



