
arxiv: 2606.11682 · v1 · pith:QRMTFCWXnew · submitted 2026-06-10 · 💻 cs.CV · cs.LG

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

Pith reviewed 2026-06-27 10:42 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords tabular-image multimodal learning · parameter-efficient fine-tuning · adapter tuning · TI-Adapter · multimodal adaptation · frozen encoders

The pith

TI-Adapter uses lightweight adapters to match full fine-tuning performance on tabular-image multimodal tasks with far fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TI-Adapter as a way to adapt pretrained encoders for combined tabular and image data without updating all parameters. It freezes the tabular encoder and adds an adapter after its embedding, while using two types of adapters on the image side. Experiments across 20 datasets indicate this method performs as well as or better than full fine-tuning. Readers would care because it lowers the computational cost of training multimodal models for tasks that combine structured data with images.

Core claim

TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters. On 20 tabular-image datasets, TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters.

What carries the argument

TI-Adapter, a modality-specific adapter-based fine-tuning framework that places adapters after embeddings or in bottleneck positions to enable efficient adaptation of frozen encoders.
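
As a minimal sketch of the general idea (a frozen encoder followed by a small trainable adapter on its embedding), the following pure-Python example implements a residual bottleneck adapter. The class `BottleneckAdapter`, the stand-in `frozen_encoder`, the bias-free layers, the ReLU nonlinearity, and the zero-initialised up-projection are all assumptions made for illustration; the summary above does not specify the paper's actual adapter design.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class BottleneckAdapter:
    """Residual bottleneck adapter: x + W_up * relu(W_down * x).

    Hypothetical design for illustration only; the paper's adapter may differ.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (dim -> bottleneck): small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection (bottleneck -> dim): zeros, so the adapter
        # starts out as the identity map on the frozen embedding.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]
        delta = matvec(self.w_up, hidden)
        return [xi + di for xi, di in zip(x, delta)]

    def num_trainable(self):
        """Count adapter weights; these are the only trainable values here."""
        return (sum(len(row) for row in self.w_down)
                + sum(len(row) for row in self.w_up))


def frozen_encoder(features):
    """Stand-in for a frozen pretrained encoder: a fixed map with no updates."""
    return [2.0 * f for f in features]


tab_embedding = frozen_encoder([0.5, -1.0, 0.25, 3.0])
adapter = BottleneckAdapter(dim=4, bottleneck=2)
adapted = adapter(tab_embedding)
print(adapted)
print(adapter.num_trainable())
```

With a zero-initialised up-projection the adapter leaves the frozen embedding unchanged before training, so optimisation starts from the pretrained representation. This is a common choice in adapter-style methods rather than something the summary confirms for TI-Adapter.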

If this is right

  • TI-Adapter reduces trainable parameters substantially compared to full fine-tuning.
  • Performance remains competitive or superior across diverse tabular-image datasets.
  • Ablation studies confirm that specific adapter placements are key to the efficiency-performance balance.
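
To make the parameter-efficiency point concrete, the sketch below counts the weights of one bottleneck adapter and compares them with a backbone that would be fully trainable under full fine-tuning. The embedding width, bottleneck width, and backbone size are hypothetical round numbers chosen for illustration, not figures reported in the paper, and the helper names are invented for this example.

```python
def linear_params(in_dim, out_dim, bias=True):
    """Parameter count of a dense layer: weights plus optional bias."""
    return in_dim * out_dim + (out_dim if bias else 0)


def bottleneck_adapter_params(dim, bottleneck):
    """Down-projection (dim -> bottleneck) plus up-projection (bottleneck -> dim)."""
    return linear_params(dim, bottleneck) + linear_params(bottleneck, dim)


def trainable_ratio(adapter_params, backbone_params):
    """Trainable parameters with a frozen backbone, relative to full fine-tuning."""
    return adapter_params / backbone_params


# Hypothetical values for illustration only.
EMBED_DIM = 512               # assumed embedding width
BOTTLENECK = 64               # assumed adapter bottleneck width
BACKBONE_PARAMS = 10_000_000  # assumed size of the image backbone

adapter_params = bottleneck_adapter_params(EMBED_DIM, BOTTLENECK)
ratio = trainable_ratio(adapter_params, BACKBONE_PARAMS)
print(f"adapter parameters: {adapter_params}")
print(f"trainable ratio vs full fine-tuning: {ratio:.4%}")
```

Under these assumed sizes the adapter holds 66,112 weights, well under one percent of the backbone. Any task head is left out of both counts, since it would be trained in either setting.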

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could allow fine-tuning of larger multimodal models on limited hardware.
  • Similar adapter strategies might apply to other combinations of data modalities.
  • Task-specific adaptation might be possible even when encoders are trained on very different domains.

Load-bearing premise

The representations from the frozen pretrained encoders are rich enough that the proposed adapter placements can recover task-specific performance.

What would settle it

A new tabular-image dataset on which TI-Adapter performs significantly worse than full fine-tuning would count against the claim of competitive performance at a fraction of the trainable parameters.

Figures

Figures reproduced from arXiv: 2606.11682 by Jiaqi Luo.

Figure 1: An overview of TI-Adapter. Blocks marked with red flames are trainable, while blocks marked with blue snowflakes remain frozen during training. The image branch supports two adaptation options: applying an embedding-level image adapter after the final ResNet representation, or inserting BCAdapters into selected ResNet layers.
Figure 2: Performance difference relative to full ResNet fine-tuning. (a) Accuracy gain on 12 classification datasets. (b) Relative MSE reduction on 8 regression datasets. Positive values indicate better performance than full fine-tuning, while negative values indicate worse performance. For compactness, BCA1 and BCA2 denote BCAdapter1 and BCAdapter2, respectively.
Figure 3: Performance–parameter efficiency trade-off across 20 tabular-image multimodal datasets. The x-axis denotes the unified mean rank, where a lower rank indicates better predictive performance. The y-axis denotes the average trainable parameter ratio relative to full fine-tuning. Methods closer to the bottom-left corner achieve a better trade-off between predictive performance and parameter efficiency.
Figure 4: Ablation study on the number of ResNet layers equipped with BCAdapters. For compactness, BCA1, BCA2, BCA3, and BCA4 denote BCAdapter1, BCAdapter2, BCAdapter3, and BCAdapter4, respectively. (a) and (b) show performance improvement over full fine-tuning, measured by accuracy gain and MSE reduction. (c) and (d) show the GPU memory ratio and trainable parameter ratio relative to full fine-tuning.
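
The quantities named in the Figure 2 and Figure 3 captions (accuracy gain, relative MSE reduction, and a mean rank across datasets) can be sketched as follows. The exact definitions used in the paper are not given here, so the formulas below are assumptions: relative MSE reduction is taken as (MSE_full - MSE_method) / MSE_full, and the mean rank averages per-dataset ranks over classification and regression datasets alike. All scores in the example are made-up placeholders, not results from the paper.

```python
def accuracy_gain(acc_method, acc_full):
    """Positive when the method beats full fine-tuning on accuracy."""
    return acc_method - acc_full


def relative_mse_reduction(mse_method, mse_full):
    """Positive when the method has lower MSE than full fine-tuning."""
    return (mse_full - mse_method) / mse_full


def ranks(scores, higher_is_better):
    """Rank methods on one dataset (1 = best); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over any run of methods tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for name in ordered[i:j + 1]:
            result[name] = average_rank
        i = j + 1
    return result


def mean_rank(datasets):
    """Average each method's rank over (scores, higher_is_better) pairs."""
    totals = {}
    for scores, higher_is_better in datasets:
        for name, rank in ranks(scores, higher_is_better).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(datasets) for name, total in totals.items()}


# Placeholder scores for illustration only (not taken from the paper).
datasets = [
    ({"full_ft": 0.90, "adapter": 0.92, "frozen": 0.85}, True),   # accuracy
    ({"full_ft": 1.20, "adapter": 1.10, "frozen": 1.50}, False),  # MSE
]
mean_ranks = mean_rank(datasets)
print(mean_ranks)
print(accuracy_gain(0.92, 0.90))
print(relative_mse_reduction(1.10, 1.20))
```

Ranking within each dataset before averaging puts accuracy and MSE on a common scale, which is one plausible reading of "unified" in the Figure 3 caption.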
Original abstract

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter framework for parameter-efficient fine-tuning of pretrained encoders in tabular-image multimodal learning. The tabular encoder is kept frozen with an adapter applied to its embeddings, while the image branch receives embedding-level and bottleneck-level adapters. The central empirical claim is that this yields competitive or superior predictive performance to full fine-tuning across 20 tabular-image datasets while using substantially fewer trainable parameters, with supporting ablations on adapter placement.

Significance. If the reported results hold under standard verification, the work demonstrates a practical efficiency gain for multimodal adaptation without sacrificing accuracy. The evaluation on 20 datasets and the ablation studies on placement constitute a reasonable empirical contribution for an applied methods paper; no machine-checked proofs or parameter-free derivations are present, but the direct comparison to full fine-tuning is the appropriate test for the stated claim.

minor comments (3)
  1. [Abstract] The claim that TI-Adapter 'achieves competitive or better predictive performance' would be strengthened by an explicit statement of the evaluation protocol (e.g., train/validation/test splits, number of runs), either in the abstract or in the first paragraph of the experiments section.
  2. The manuscript should include a table (or clear paragraph) listing the exact number of trainable parameters for TI-Adapter versus the full fine-tuning baseline on at least one representative dataset; this directly supports the efficiency claim.
  3. Figure or table captions for the ablation studies should explicitly state which adapter placements are being compared and on how many datasets the trends hold.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The referee's summary accurately captures the TI-Adapter framework and its empirical evaluation on 20 datasets.

Circularity Check

0 steps flagged

No significant circularity; empirical claim only

full rationale

The paper's central claim is an empirical statement: TI-Adapter matches or exceeds full fine-tuning performance on 20 tabular-image datasets while using fewer trainable parameters. This rests on direct experimental comparison and ablations, not on any derivation, equation, or prediction that reduces to fitted quantities or self-citations by construction. No load-bearing mathematical steps, self-definitional relations, or uniqueness theorems appear in the provided description or abstract. The method description (freezing encoders, inserting modality-specific adapters) is a design choice validated externally by results, not internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or implementation details, so no concrete free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5652 in / 964 out tokens · 32851 ms · 2026-06-27T10:42:26.611600+00:00 · methodology

