pith. machine review for the scientific record.

arxiv: 2604.28109 · v1 · submitted 2026-04-30 · 💻 cs.LG

Recognition: unknown

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 06:35 UTC · model grok-4.3

classification: 💻 cs.LG
keywords: model merging · task vector compression · dynamic merging · multi-task adaptation · parameter compression · learnable sparsification

The pith

Task vectors can be compressed into binary masks, sign vectors, and scalars to enable efficient dynamic model merging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dynamic model merging combines multiple fine-tuned models at inference time to handle many tasks without conflicts, but it normally requires storing a full set of parameters for each task. The paper demonstrates that the fine-tuning weight increments, called task vectors, have a sparse, impulse-like pattern that tolerates aggressive compression. Exploiting this, the authors decompose each task vector into a binary mask, a sign vector, and a scaling factor, then make the choices of what to sparsify and at what bit-width to quantize learnable through gating and bit selection. The result is Auto-FlexSwitch, which performs dynamic merging at much lower storage cost while keeping performance close to that of full task vectors.

Core claim

The authors establish that task vectors exhibit an impulse-like activation pattern and high robustness to low-bit representations. This allows their T-Switch method to decompose each task vector into a binary sparse mask, a sign vector, and a scalar scaling factor at high compression ratios. Auto-Switch then uses feature-similarity retrieval to automatically select and compose these compressed vectors. FlexSwitch makes the compression adaptive by jointly optimizing Learnable Gating Sparsification (LGS) and Bit-width Adaptive Selection (BAS) alongside a Sparsity-Aware Storage Strategy (SASS). The final Auto-FlexSwitch adds KNN inference with a learnable low-rank metric to support efficient dynamic merging across many tasks.
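
A minimal sketch of that retrieval step, assuming a low-rank projection A (shape r × d, r ≪ d) learned elsewhere; the function name, the stored anchor features, and the majority-vote rule are illustrative stand-ins, not the paper's confirmed procedure.

    import numpy as np

    def select_task(feature, anchors, labels, A, k=5):
        # k-NN under a learned low-rank metric: d(x, y) = ||A x - A y||_2.
        # feature: (d,) query; anchors: (n, d) stored features with known tasks;
        # labels: (n,) integer task ids; A: (r, d) projection with r << d.
        z = A @ feature                    # project the query into rank-r space
        Z = anchors @ A.T                  # project all anchors, shape (n, r)
        dists = np.linalg.norm(Z - z, axis=1)
        nearest = np.argsort(dists)[:k]    # indices of the k nearest anchors
        votes = np.bincount(labels[nearest])
        return int(np.argmax(votes))       # majority-vote task id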

What carries the argument

The decomposition of task vectors into a binary sparse mask, a sign vector, and a scalar scaling factor, extended by learnable gating for sparsification and adaptive bit-width selection.
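
As a concrete illustration, a minimal sketch of that decomposition: the top-k magnitude rule for the mask and the mean retained magnitude for the scalar are plausible readings of the abstract, not details the paper confirms.

    import numpy as np

    def compress_task_vector(delta, keep_ratio=0.1):
        # delta: flat fine-tuning increment, theta_task - theta_base.
        k = max(1, int(keep_ratio * delta.size))
        thresh = np.partition(np.abs(delta), -k)[-k]
        mask = np.abs(delta) >= thresh             # binary sparse mask
        sign = np.sign(delta).astype(np.int8)      # sign vector in {-1, 0, +1}
        scale = float(np.abs(delta[mask]).mean())  # one scalar per task vector
        return mask, sign, scale

    def decompress(mask, sign, scale):
        # Approximate reconstruction: every retained magnitude collapses to `scale`.
        return scale * sign * mask

Storing only the mask, the retained signs, and a single float per task vector is the source of the high compression ratios the abstract claims.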

If this is right

  • Storing parameters for many tasks becomes feasible with high compression ratios.
  • Dynamic merging can maintain high performance without full parameter storage per task.
  • The compression adapts automatically to different parts of the model.
  • Retrieval-based selection of task vectors during inference is enabled by feature similarity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Scaling this to hundreds of tasks could make multi-task models practical on resource-constrained hardware.
  • Similar compression might apply to other adaptation methods that use parameter deltas.
  • Testing on larger models or different domains would reveal if the impulse pattern is general.

Load-bearing premise

Task vectors show an impulse-like activation pattern and stay accurate even when represented with very few bits.
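
One cheap probe of this premise, assuming nothing beyond the abstract's claim: measure how much of a task vector's squared L2 mass sits in its largest-magnitude entries. The heavy-tailed synthetic stand-in below is illustrative only.

    import numpy as np

    def mass_concentration(delta, top_fraction=0.01):
        # Share of squared L2 mass carried by the top `top_fraction` of
        # entries by magnitude; a large share at a small fraction is
        # consistent with an impulse-like pattern.
        k = max(1, int(top_fraction * delta.size))
        mags = np.sort(np.abs(delta).ravel())[::-1]
        return float((mags[:k] ** 2).sum() / (mags ** 2).sum())

    rng = np.random.default_rng(0)
    heavy = rng.standard_t(df=1.5, size=100_000)   # heavy-tailed stand-in
    gauss = rng.standard_normal(100_000)
    print(mass_concentration(heavy))   # large share: impulse-like
    print(mass_concentration(gauss))   # small share: diffuse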

What would settle it

A large performance drop when original task vectors are replaced by their compressed versions on standard image-classification or NLP benchmarks would falsify the load-bearing premise, and with it the approach.

Figures

Figures reproduced from arXiv: 2604.28109 by Biqing Qi, Dazhi Zhang, Junqi Gao, Wangmeng Zuo, Yi Ran, Zhichang Guo.

Figure 1: Accuracy (%) trends of the three control strategies C1, C2, and C3 across the eight visual tasks on the ViT-B/32 model as a function of the pruning rate α. The horizontal dashed lines (Individual) represent the original fine-tuning accuracy for each task. The insets highlight the regions where specific tasks exceed the fine-tuning baseline by more than 0.2%.
Figure 2: Performance comparison of the ViT-B/32 model equipped with task vectors processed by P-Spar and B-Approx under different pruning rates.
Figure 3: Overview of the proposed methods. The left side illustrates the compression pipelines of T-Switch and FlexSwitch for constructing lightweight task vectors…
Figure 4: Heatmaps illustrating the sensitivity of different modules/layers to sparsification and quantization. The horizontal axis represents task names…
Figure 5: Line charts illustrating the performance degradation across different…
Figure 6: Accuracy of models equipped with task vectors sparsified by LGS with different performance-preserving losses and P-Spar across different tasks.
Figure 7: Storage comparison between the SASS and Indep schemes under different…
Figure 8: Comparison of performance and storage overhead (MB) between FlexSwitch and T-Switch across different tasks, using SASS as the unified storage…
Figure 10: Performance of Auto-Switch and Auto-FlexSwitch under varying…
Figure 12: Storage overhead analysis of SASS under different sparsity ratios…
Original abstract

Model merging has attracted attention as an effective path toward multi-task adaptation by integrating knowledge from multiple task-specific models. Among existing approaches, dynamic merging mitigates performance degradation caused by conflicting parameter updates across tasks by flexibly combining task-specific parameters at inference time, thereby maintaining high performance. However, these methods require storing independent parameters for each task, resulting in prohibitive storage overhead. To address this issue, we first experimentally demonstrate that the fine-tuned weight increments (referred to as task vectors) exhibit an impulse-like activation pattern and high robustness to low-bit representations. Driven by this insight, we propose T-Switch, which decomposes task vectors into three compact components: a binary sparse mask, a sign vector, and a scalar scaling factor, achieving high-fidelity approximation at high compression ratios. We then introduce Auto-Switch, a training-free merging scheme that automatically composes task vectors via feature similarity retrieval. Building on this, we develop Auto-Switch, a training-free merging scheme that automatically assembles task vectors through feature similarity retrieval. Furthermore, to transform task vector sparsification and quantization from static rules to adaptive learning, we propose FlexSwitch, a learnable framework which jointly optimizes the compression strategy for each model unit via Learnable Gating Sparsification (LGS) and Bit-width Adaptive Selection (BAS), while employing the Sparsity-Aware Storage Strategy (SASS) to select the optimal storage encoding structure. Finally, by incorporating a K-Nearest Neighbor (KNN) inference scheme with a learnable low-rank metric, we present Auto-FlexSwitch, a dynamic model merging approach that supports highly efficient task vector compression.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity check, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuned task vectors exhibit an impulse-like activation pattern and robustness to low-bit representations, enabling a T-Switch decomposition into a binary sparse mask, sign vector, and scalar scaling factor for high-fidelity compression. Building on this, it introduces Auto-Switch (training-free merging via feature similarity retrieval), FlexSwitch (learnable framework with LGS for sparsification, BAS for bit-width selection, and SASS for storage), and Auto-FlexSwitch (dynamic merging with KNN inference using a learnable low-rank metric) to achieve efficient dynamic model merging with reduced storage overhead.

Significance. If the impulse-like pattern generalizes and the compression maintains high fidelity as claimed, the work could meaningfully reduce storage costs in dynamic multi-task model merging, addressing a key practical limitation. The shift from static to learnable adaptive compression via LGS/BAS/SASS is a potentially useful direction for efficient deployment, though its impact depends on empirical validation beyond the initial observation.

major comments (2)
  1. [Abstract] The load-bearing empirical claim that task vectors 'exhibit an impulse-like activation pattern and high robustness to low-bit representations' is presented as the driver for T-Switch and all subsequent components, but the abstract (and by extension the demonstration) provides no quantitative metrics, baselines, ablation results, or error analysis to support 'high-fidelity approximation at high compression ratios'. This makes it difficult to assess whether the pattern is general or setup-specific, directly affecting the soundness of the efficiency claims.
  2. [Method (Auto-FlexSwitch and FlexSwitch)] The learnable low-rank metric in the KNN inference scheme of Auto-FlexSwitch and the additional parameters from LGS/BAS introduce free parameters (as noted in the axiom ledger). The manuscript should explicitly compare total storage and inference overhead against uncompressed task vectors and prior merging methods in the experimental section to confirm net efficiency gains.
minor comments (2)
  1. [Abstract] The description of Auto-Switch is duplicated with nearly identical wording ('We then introduce Auto-Switch... Building on this, we develop Auto-Switch...'), which is a drafting error that should be corrected for clarity.
  2. [Method] The notation for T-Switch components (binary sparse mask, sign vector, scalar) and the definitions of LGS, BAS, and SASS should be formalized with equations and pseudocode in the relevant method subsections to improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and constructive feedback. The comments help clarify how to better present the empirical foundations and efficiency analysis. We address each major comment point by point below, with planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The load-bearing empirical claim that task vectors 'exhibit an impulse-like activation pattern and high robustness to low-bit representations' is presented as the driver for T-Switch and all subsequent components, but the abstract (and by extension the demonstration) provides no quantitative metrics, baselines, ablation results, or error analysis to support 'high-fidelity approximation at high compression ratios'. This makes it difficult to assess whether the pattern is general or setup-specific, directly affecting the soundness of the efficiency claims.

    Authors: We acknowledge that the abstract is a concise summary and does not embed detailed quantitative metrics. The supporting experimental demonstration—including quantitative metrics on activation sparsity, bit-width robustness, baselines, ablations, and error analysis—is provided in Section 3.1 with associated figures. To address the concern and make the abstract more self-contained, we will revise it to include brief quantitative highlights (e.g., achieved compression ratios with fidelity retention) drawn from those results. This change will better substantiate the claims without altering the manuscript's core contributions or experimental content. revision: yes

  2. Referee: [Method (Auto-FlexSwitch and FlexSwitch)] The learnable low-rank metric in the KNN inference scheme of Auto-FlexSwitch and the additional parameters from LGS/BAS introduce free parameters (as noted in the axiom ledger). The manuscript should explicitly compare total storage and inference overhead against uncompressed task vectors and prior merging methods in the experimental section to confirm net efficiency gains.

    Authors: We agree that explicit net-efficiency comparisons are necessary given the learnable components. The current manuscript discusses compression benefits and includes overhead analysis via SASS, but we will add a dedicated table and subsection in the experimental results that directly quantifies total storage (including the low-rank metric and LGS/BAS parameters) and inference overhead against both uncompressed task vectors and prior merging methods. This will demonstrate that the added parameters are more than offset by the reductions from T-Switch decomposition and adaptive compression. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

Full rationale

The paper's pipeline starts from an independent experimental observation of impulse-like activation patterns and low-bit robustness in task vectors, which is presented as an empirical input rather than a derived claim. This observation directly motivates the T-Switch decomposition and the design of Auto-Switch, FlexSwitch (with LGS/BAS/SASS), and Auto-FlexSwitch (with KNN and learnable low-rank metric). No equations reduce claimed performance or compression ratios back to the method's own fitted outputs by construction; the learnable components are optimized on external data, similarity retrieval uses independent feature comparisons, and no self-citations or uniqueness theorems are invoked as load-bearing premises. The derivation remains independent of its own outputs and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 3 invented entities

The central claim rests on the observed statistical properties of task vectors and on the effectiveness of the newly introduced learnable compression modules; no external benchmarks or proofs are referenced in the abstract.

free parameters (2)
  • scalar scaling factor
    Introduced as one of the three compact components in T-Switch decomposition; its value is determined per task vector and directly affects reconstruction fidelity.
  • learnable low-rank metric parameters
    Used inside the KNN inference scheme of Auto-FlexSwitch; these are optimized during the learnable stage and control how similarity is measured for merging weights.
axioms (2)
  • domain assumption: Task vectors exhibit an impulse-like activation pattern
    Stated as the first experimental demonstration that motivates the entire compression approach.
  • domain assumption: Task vectors are highly robust to low-bit representations
    Used to justify the binary mask plus sign plus scalar decomposition and subsequent quantization steps.
invented entities (3)
  • Learnable Gating Sparsification (LGS) · no independent evidence
    purpose: Jointly optimize the compression strategy for each model unit by learning sparsity patterns
    New component inside FlexSwitch that turns static sparsification into an adaptive, learned process.
  • Bit-width Adaptive Selection (BAS) · no independent evidence
    purpose: Dynamically choose bit-width for each unit during quantization
    Part of the learnable framework that replaces fixed low-bit rules.
  • Sparsity-Aware Storage Strategy (SASS) · no independent evidence
    purpose: Select the optimal storage encoding structure for the compressed representation
    Invented to achieve efficient packing after sparsification and quantization; one possible encoding trade-off is sketched below.
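
The abstract does not specify which encoding structures SASS selects among, but the trade-off it plausibly arbitrates can be sketched: below some crossover sparsity, storing coordinate indices beats storing a dense bitmask. Both schemes and the bit accounting are assumptions for illustration, not the paper's scheme.

    import math

    def cheaper_mask_encoding(n_params, n_nonzero):
        # Dense bitmap: 1 bit per parameter, independent of sparsity.
        bitmap_bits = n_params
        # Coordinate list: ceil(log2 n_params) bits per nonzero index.
        coord_bits = n_nonzero * math.ceil(math.log2(n_params))
        if bitmap_bits <= coord_bits:
            return "bitmap", bitmap_bits
        return "coords", coord_bits

    # For a 10M-parameter unit: at 10% density the bitmap wins,
    # at 1% density the coordinate list wins.
    print(cheaper_mask_encoding(10_000_000, 1_000_000))  # ('bitmap', 10000000)
    print(cheaper_mask_encoding(10_000_000, 100_000))    # ('coords', 2400000)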

pith-pipeline@v0.9.0 · 5614 in / 1849 out tokens · 58771 ms · 2026-05-07T06:35:12.844534+00:00 · methodology

