Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

Jiaqi Luo

arxiv: 2606.11682 · v1 · pith:QRMTFCWXnew · submitted 2026-06-10 · 💻 cs.CV · cs.LG

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

Jiaqi Luo This is my paper

Pith reviewed 2026-06-27 10:42 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords tabular-image multimodal learningparameter-efficient fine-tuningadapter tuningTI-Adaptermultimodal adaptationfrozen encoders

0 comments

The pith

TI-Adapter uses lightweight adapters to match full fine-tuning performance on tabular-image multimodal tasks with far fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TI-Adapter as a way to adapt pretrained encoders for combined tabular and image data without updating all parameters. It freezes the tabular encoder and adds an adapter after its embedding, while using two types of adapters on the image side. Experiments across 20 datasets indicate this method performs as well as or better than full fine-tuning. Readers would care because it lowers the computational cost of training multimodal models for tasks that combine structured data with images.

Core claim

TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters. On 20 tabular-image datasets, TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters.

What carries the argument

TI-Adapter, a modality-specific adapter-based fine-tuning framework that places adapters after embeddings or in bottleneck positions to enable efficient adaptation of frozen encoders.

If this is right

TI-Adapter reduces trainable parameters substantially compared to full fine-tuning.
Performance remains competitive or superior across diverse tabular-image datasets.
Ablation studies confirm that specific adapter placements are key to the efficiency-performance balance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could allow fine-tuning of larger multimodal models on limited hardware.
Similar adapter strategies might apply to other combinations of data modalities.
Task-specific adaptation might be possible even when encoders are trained on very different domains.

Load-bearing premise

The representations from the frozen pretrained encoders are rich enough that the proposed adapter placements can recover task-specific performance.

What would settle it

A new tabular-image dataset where TI-Adapter shows significantly worse performance than full fine-tuning despite using fewer parameters would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.11682 by Jiaqi Luo.

**Figure 1.** Figure 1: An overview of TI-Adapter. Blocks marked with red flames are trainable, while blocks marked with blue snowflakes remain frozen during training. The image branch supports two adaptation options: applying an embedding-level image adapter after the final ResNet representation, or inserting BCAdapters into selected ResNet layers. Our proposed model, shown in [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: Performance difference relative to full ResNet fine-tuning. (a) Accuracy gain on 12 classification datasets. (b) Relative MSE reduction on 8 regression datasets. Positive values indicate better performance than full fine-tuning, while negative values indicate worse performance. For compactness, BCA1 and BCA2 denote BCAdapter1 and BCAdapter2, respectively. trainable parameter ratio within each dataset rel… view at source ↗

**Figure 3.** Figure 3: Performance–parameter efficiency trade-off across 20 tabular-image multimodal datasets. The x-axis denotes the unified mean rank, where a lower rank indicates better predictive performance. The y-axis denotes the average trainable parameter ratio relative to full fine-tuning. Methods closer to the bottom-left corner achieve a better trade-off between predictive performance and parameter efficiency. 18 [PI… view at source ↗

**Figure 4.** Figure 4: Ablation study on the number of ResNet layers equipped with BCAdapters. For compactness, BCA1, BCA2, BCA3, and BCA4 denote BCAdapter1, BCAdapter2, BCAdapter3, and BCAdapter4,respectively. (a) and (b) show performance improvement over full fine-tuning, measured by accuracy gain and MSE reduction. (c) and (d) show the GPU memory ratio and trainable parameter ratio relative to full fine-tuning. 4.4. Ablation … view at source ↗

read the original abstract

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TI-Adapter is a straightforward modality-specific adapter pattern for tabular-image tasks that reportedly matches full fine-tuning with far fewer parameters on 20 datasets.

read the letter

The main point is that this paper takes established adapter tuning and applies it with a targeted placement for tabular plus image data. The tabular encoder stays frozen with an adapter added after its embedding, while the image side uses both embedding-level and bottleneck adapters. Experiments across 20 datasets plus placement ablations are the core of the contribution.

The work is clear on the practical goal: reduce trainable parameters without losing predictive performance. The ablations help show why the chosen locations matter for the efficiency trade-off, and the claim of competitive or better results than full fine-tuning is the kind of direct comparison that practitioners can test.

The soft spots are mostly around experimental transparency. The abstract gives no information on baseline implementations, run-to-run variance, statistical tests, or exact data splits, so the strength of the performance claims is difficult to judge from the summary alone. The underlying assumption that frozen pretrained encoders already carry enough signal for the adapters to recover task performance is reasonable for many datasets but could fail on more distant tasks.

This is the sort of paper that would interest groups working on efficient multimodal adaptation with mixed tabular and visual inputs. It is incremental rather than foundational, but the concrete pattern and the scale of the experiments make it worth checking.

I would send it to peer review. The idea is well scoped and the empirical focus is direct enough that referees can evaluate the setup and results without needing major new theory.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter framework for parameter-efficient fine-tuning of pretrained encoders in tabular-image multimodal learning. The tabular encoder is kept frozen with an adapter applied to its embeddings, while the image branch receives embedding-level and bottleneck-level adapters. The central empirical claim is that this yields competitive or superior predictive performance to full fine-tuning across 20 tabular-image datasets while using substantially fewer trainable parameters, with supporting ablations on adapter placement.

Significance. If the reported results hold under standard verification, the work demonstrates a practical efficiency gain for multimodal adaptation without sacrificing accuracy. The evaluation on 20 datasets and the ablation studies on placement constitute a reasonable empirical contribution for an applied methods paper; no machine-checked proofs or parameter-free derivations are present, but the direct comparison to full fine-tuning is the appropriate test for the stated claim.

minor comments (3)

[Abstract] Abstract: the statement that TI-Adapter 'achieves competitive or better predictive performance' would be strengthened by an explicit statement of the evaluation protocol (e.g., train/validation/test splits, number of runs) already in the abstract or first paragraph of the experiments section.
The manuscript should include a table (or clear paragraph) listing the exact number of trainable parameters for TI-Adapter versus the full fine-tuning baseline on at least one representative dataset; this directly supports the efficiency claim.
Figure or table captions for the ablation studies should explicitly state which adapter placements are being compared and on how many datasets the trends hold.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The referee's summary accurately captures the TI-Adapter framework and its empirical evaluation on 20 datasets.

Circularity Check

0 steps flagged

No significant circularity; empirical claim only

full rationale

The paper's central claim is an empirical statement: TI-Adapter matches or exceeds full fine-tuning performance on 20 tabular-image datasets while using fewer trainable parameters. This rests on direct experimental comparison and ablations, not on any derivation, equation, or prediction that reduces to fitted quantities or self-citations by construction. No load-bearing mathematical steps, self-definitional relations, or uniqueness theorems appear in the provided description or abstract. The method description (freezing encoders, inserting modality-specific adapters) is a design choice validated externally by results, not internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or implementation details, so no concrete free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5652 in / 964 out tokens · 32851 ms · 2026-06-27T10:42:26.611600+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 8 canonical work pages · 5 internal anchors

[1]

J. N. Acosta, G. J. Falcone, P. Rajpurkar, E. J. Topol, Multimodal biomedical ai, Nature medicine 28 (9) (2022) 1773–1784

2022
[2]

Huang, A

S.-C. Huang, A. Pareek, S. Seyyedi, I. Banerjee, M. P. Lungren, Fusion of medical imaging and electronic health records using deep learning: a 25 systematic review and implementation guidelines, NPJ digital medicine 3 (1) (2020) 136

2020
[3]

Borsos, C

B. Borsos, C. G. Allaart, A. van Halteren, Predicting stroke outcome: a case for multimodal deep learning methods with tabular and ct perfusion data, Artiﬁcial Intelligence in Medicine 147 (2024) 102719

2024
[4]

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

2016
[5]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

N. Hollmann, S. Müller, K. Eggensperger, F. Hutter, Tabpfn: A trans- former that solves small tabular classiﬁcation problems in a second, arXiv preprint arXiv:2207.01848 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Hollmann, S

N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeister, F. Hutter, Accurate predictions on small data with a tabular foundation model, Nature 637 (8045) (2025) 319–326

2025
[7]

TabPFN-3: Technical Report

L. Grinsztajn, K. Flöge, O. Key, F. Birkel, P. Jund, B. Roof, M. Manium, S. Bin, M. Bühler, A. Garg, et al., Tabpfn-3: Technical report, arXiv preprint arXiv:2605.13986 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

L. Xu, H. Xie, S. J. Qin, X. Tao, F. L. Wang, Parameter-eﬃcient ﬁne- tuning methods for pretrained language models: A critical review and assessment, IEEE Transactions on Pattern Analysis and Machine Intel- ligence (2026). 26

2026
[9]

Y. Xin, J. Yang, S. Luo, Y. Du, Q. Qin, K. Cen, Y. He, Z. Zhang, B. Fu, X. Yang, et al., Parameter-eﬃcient ﬁne-tuning for pre-trained vi- sion models: A survey and benchmark, arXiv preprint arXiv:2402.02242 (2024)

work page arXiv 2024
[10]

P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, Y. Qiao, Clip-adapter: Better vision-language models with feature adapters, In- ternational journal of computer vision 132 (2) (2024) 581–595

2024
[11]

H. Chen, R. Tao, H. Zhang, Y. Wang, X. Li, W. Ye, J. Wang, G. Hu, M. Savvides, Conv-adapter: Exploring parameter eﬃcient transfer learn- ing for convnets, in: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, 2024, pp. 1551–1561

2024
[12]

Y. Lei, Z. Li, Y. Shen, J. Zhang, H. Shan, Clip-lung: Textual knowledge- guided lung nodule malignancy prediction, in: International Confer- ence on Medical Image Computing and Computer-Assisted Intervention, Springer, 2023, pp. 403–412

2023
[13]

L. Sun, M. Zhang, Y. Lu, W. Zhu, Y. Yi, F. Yan, Nodule-clip: Lung nodule classiﬁcation based on multi-modal contrastive learning, Com- puters in Biology and Medicine 175 (2024) 108505

2024
[14]

Spasov, L

S. Spasov, L. Passamonti, A. Duggento, P. Lio, N. Toschi, A. D. N. Initiative, et al., A parameter-eﬃcient deep learning approach to pre- 27 dict conversion from mild cognitive impairment to alzheimer’s disease, Neuroimage 189 (2019) 276–287

2019
[15]

Y. Liu, Y. Yu, J. Ouyang, B. Jiang, G. Yang, S. Ostmeier, M. Winter- mark, P. Michel, D. S. Liebeskind, M. G. Lansberg, et al., Functional outcome prediction in acute ischemic stroke using a fused imaging and clinical deep learning model, Stroke 54 (9) (2023) 2316–2327

2023
[16]

Zheng, Z

H. Zheng, Z. Lin, Q. Zhou, X. Peng, J. Xiao, C. Zu, Z. Jiao, Y. Wang, Multi-transsp: Multimodal transformer for survival prediction of na- sopharyngeal carcinoma patients, in: International Conference on Med- ical Image Computing and Computer-Assisted Intervention, Springer, 2022, pp. 234–243

2022
[17]

C. Xue, S. S. Kowshik, D. Lteif, S. Puducheri, V. H. Jasodanand, O. T. Zhou, A. S. Walia, O. B. Guney, J. D. Zhang, S. T. Pham, et al., Ai- based diﬀerential diagnosis of dementia etiologies on multimodal data, Nature Medicine 30 (10) (2024) 2977–2989

2024
[18]

J. Luo, Y. Yuan, S. Xu, Time: Tabpfn-integrated multimodal engine for robust tabular-image learning, arXiv preprint arXiv:2506.00813 (2025)

work page arXiv 2025
[19]

T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowl- edge discovery and data mining, 2016, pp. 785–794. 28

2016
[20]

J. Luo, S. Xu, Ncart: Neural classiﬁcation and regression tree for tabular data, Pattern Recognition 154 (2024) 110578

2024
[21]

Gorishniy, I

Y. Gorishniy, I. Rubachev, V. Khrulkov, A. Babenko, Revisiting deep learning models for tabular data, Advances in Neural Information Pro- cessing Systems 34 (2021) 18932–18943

2021
[22]

W. Kim, C. Song, H. Kim, Multimodalpfn: Extending prior-data ﬁt- ted networks for multimodal tabular learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 30357–30367

2026
[23]

Hager, M

P. Hager, M. J. Menten, D. Rueckert, Best of both worlds: Multimodal contrastive learning with tabular and imaging data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2023, pp. 23924–23935

2023
[24]

W. Huang, Multimodal contrastive learning and tabular attention for automated alzheimer’s disease prediction, in: Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 2473– 2482

2023
[25]

S. Du, S. Zheng, Y. Wang, W. Bai, D. P. ORegan, C. Qin, Tip: Tabular- image pre-training for multimodal classiﬁcation with incomplete data, in: European Conference on Computer Vision, Springer, 2024, pp. 478– 496. 29

2024
[26]

S. Du, X. Luo, D. P. O’Regan, C. Qin, Stil: Semi-supervised tabular- image learning for comprehensive task-relevant information exploration in multimodal classiﬁcation, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15549–15559

2025
[27]

Houlsby, A

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-eﬃcient transfer learn- ing for nlp, in: International conference on machine learning, PMLR, 2019, pp. 2790–2799

2019
[28]

X. L. Li, P. Liang, Preﬁx-tuning: Optimizing continuous prompts for generation, in: Proceedings of the 59th Annual Meeting of the Asso- ciation for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597

2021
[29]

Lester, R

B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter- eﬃcient prompt tuning, in: Proceedings of the 2021 conference on em- pirical methods in natural language processing, 2021, pp. 3045–3059

2021
[30]

Zhang, R

R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, H. Li, Tip- adapter: Training-free clip-adapter for better vision-language modeling, arXiv preprint arXiv:2111.03930 (2021)

work page arXiv 2021
[31]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, 30 W. Chen, et al., Lora: Low-rank adaptation of large language models., Iclr 1 (2) (2022) 3

2022
[32]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, T. Zhao, Adalora: Adaptive budget allocation for parameter- eﬃcient ﬁne-tuning, arXiv preprint arXiv:2303.10512 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Dettmers, A

T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Eﬃcient ﬁnetuning of quantized llms, Advances in neural information processing systems 36 (2023) 10088–10115

2023
[34]

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

A. Arazi, E. Shapira, S. Grunblat, M. Ventura, E. Hoﬀer, G. Blayer, D. Holzmüller, L. Purucker, G. Varoquaux, F. Hutter, et al., Multabench: Benchmarking multimodal tabular learning with text and image, arXiv preprint arXiv:2605.10616 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

T. Chen, B. Xu, C. Zhang, C. Guestrin, Training deep nets with sub- linear memory cost, arXiv preprint arXiv:1604.06174 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[36]

Mercea, A

O.-B. Mercea, A. Gritsenko, C. Schmid, A. Arnab, Time-memory-and parameter-eﬃcient visual adaptation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5536–5545. 31

2024

[1] [1]

J. N. Acosta, G. J. Falcone, P. Rajpurkar, E. J. Topol, Multimodal biomedical ai, Nature medicine 28 (9) (2022) 1773–1784

2022

[2] [2]

Huang, A

S.-C. Huang, A. Pareek, S. Seyyedi, I. Banerjee, M. P. Lungren, Fusion of medical imaging and electronic health records using deep learning: a 25 systematic review and implementation guidelines, NPJ digital medicine 3 (1) (2020) 136

2020

[3] [3]

Borsos, C

B. Borsos, C. G. Allaart, A. van Halteren, Predicting stroke outcome: a case for multimodal deep learning methods with tabular and ct perfusion data, Artiﬁcial Intelligence in Medicine 147 (2024) 102719

2024

[4] [4]

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

2016

[5] [5]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

N. Hollmann, S. Müller, K. Eggensperger, F. Hutter, Tabpfn: A trans- former that solves small tabular classiﬁcation problems in a second, arXiv preprint arXiv:2207.01848 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Hollmann, S

N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeister, F. Hutter, Accurate predictions on small data with a tabular foundation model, Nature 637 (8045) (2025) 319–326

2025

[7] [7]

TabPFN-3: Technical Report

L. Grinsztajn, K. Flöge, O. Key, F. Birkel, P. Jund, B. Roof, M. Manium, S. Bin, M. Bühler, A. Garg, et al., Tabpfn-3: Technical report, arXiv preprint arXiv:2605.13986 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

L. Xu, H. Xie, S. J. Qin, X. Tao, F. L. Wang, Parameter-eﬃcient ﬁne- tuning methods for pretrained language models: A critical review and assessment, IEEE Transactions on Pattern Analysis and Machine Intel- ligence (2026). 26

2026

[9] [9]

Y. Xin, J. Yang, S. Luo, Y. Du, Q. Qin, K. Cen, Y. He, Z. Zhang, B. Fu, X. Yang, et al., Parameter-eﬃcient ﬁne-tuning for pre-trained vi- sion models: A survey and benchmark, arXiv preprint arXiv:2402.02242 (2024)

work page arXiv 2024

[10] [10]

P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, Y. Qiao, Clip-adapter: Better vision-language models with feature adapters, In- ternational journal of computer vision 132 (2) (2024) 581–595

2024

[11] [11]

H. Chen, R. Tao, H. Zhang, Y. Wang, X. Li, W. Ye, J. Wang, G. Hu, M. Savvides, Conv-adapter: Exploring parameter eﬃcient transfer learn- ing for convnets, in: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, 2024, pp. 1551–1561

2024

[12] [12]

Y. Lei, Z. Li, Y. Shen, J. Zhang, H. Shan, Clip-lung: Textual knowledge- guided lung nodule malignancy prediction, in: International Confer- ence on Medical Image Computing and Computer-Assisted Intervention, Springer, 2023, pp. 403–412

2023

[13] [13]

L. Sun, M. Zhang, Y. Lu, W. Zhu, Y. Yi, F. Yan, Nodule-clip: Lung nodule classiﬁcation based on multi-modal contrastive learning, Com- puters in Biology and Medicine 175 (2024) 108505

2024

[14] [14]

Spasov, L

S. Spasov, L. Passamonti, A. Duggento, P. Lio, N. Toschi, A. D. N. Initiative, et al., A parameter-eﬃcient deep learning approach to pre- 27 dict conversion from mild cognitive impairment to alzheimer’s disease, Neuroimage 189 (2019) 276–287

2019

[15] [15]

Y. Liu, Y. Yu, J. Ouyang, B. Jiang, G. Yang, S. Ostmeier, M. Winter- mark, P. Michel, D. S. Liebeskind, M. G. Lansberg, et al., Functional outcome prediction in acute ischemic stroke using a fused imaging and clinical deep learning model, Stroke 54 (9) (2023) 2316–2327

2023

[16] [16]

Zheng, Z

H. Zheng, Z. Lin, Q. Zhou, X. Peng, J. Xiao, C. Zu, Z. Jiao, Y. Wang, Multi-transsp: Multimodal transformer for survival prediction of na- sopharyngeal carcinoma patients, in: International Conference on Med- ical Image Computing and Computer-Assisted Intervention, Springer, 2022, pp. 234–243

2022

[17] [17]

C. Xue, S. S. Kowshik, D. Lteif, S. Puducheri, V. H. Jasodanand, O. T. Zhou, A. S. Walia, O. B. Guney, J. D. Zhang, S. T. Pham, et al., Ai- based diﬀerential diagnosis of dementia etiologies on multimodal data, Nature Medicine 30 (10) (2024) 2977–2989

2024

[18] [18]

J. Luo, Y. Yuan, S. Xu, Time: Tabpfn-integrated multimodal engine for robust tabular-image learning, arXiv preprint arXiv:2506.00813 (2025)

work page arXiv 2025

[19] [19]

T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowl- edge discovery and data mining, 2016, pp. 785–794. 28

2016

[20] [20]

J. Luo, S. Xu, Ncart: Neural classiﬁcation and regression tree for tabular data, Pattern Recognition 154 (2024) 110578

2024

[21] [21]

Gorishniy, I

Y. Gorishniy, I. Rubachev, V. Khrulkov, A. Babenko, Revisiting deep learning models for tabular data, Advances in Neural Information Pro- cessing Systems 34 (2021) 18932–18943

2021

[22] [22]

W. Kim, C. Song, H. Kim, Multimodalpfn: Extending prior-data ﬁt- ted networks for multimodal tabular learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 30357–30367

2026

[23] [23]

Hager, M

P. Hager, M. J. Menten, D. Rueckert, Best of both worlds: Multimodal contrastive learning with tabular and imaging data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2023, pp. 23924–23935

2023

[24] [24]

W. Huang, Multimodal contrastive learning and tabular attention for automated alzheimer’s disease prediction, in: Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 2473– 2482

2023

[25] [25]

S. Du, S. Zheng, Y. Wang, W. Bai, D. P. ORegan, C. Qin, Tip: Tabular- image pre-training for multimodal classiﬁcation with incomplete data, in: European Conference on Computer Vision, Springer, 2024, pp. 478– 496. 29

2024

[26] [26]

S. Du, X. Luo, D. P. O’Regan, C. Qin, Stil: Semi-supervised tabular- image learning for comprehensive task-relevant information exploration in multimodal classiﬁcation, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15549–15559

2025

[27] [27]

Houlsby, A

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-eﬃcient transfer learn- ing for nlp, in: International conference on machine learning, PMLR, 2019, pp. 2790–2799

2019

[28] [28]

X. L. Li, P. Liang, Preﬁx-tuning: Optimizing continuous prompts for generation, in: Proceedings of the 59th Annual Meeting of the Asso- ciation for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597

2021

[29] [29]

Lester, R

B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter- eﬃcient prompt tuning, in: Proceedings of the 2021 conference on em- pirical methods in natural language processing, 2021, pp. 3045–3059

2021

[30] [30]

Zhang, R

R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, H. Li, Tip- adapter: Training-free clip-adapter for better vision-language modeling, arXiv preprint arXiv:2111.03930 (2021)

work page arXiv 2021

[31] [31]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, 30 W. Chen, et al., Lora: Low-rank adaptation of large language models., Iclr 1 (2) (2022) 3

2022

[32] [32]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, T. Zhao, Adalora: Adaptive budget allocation for parameter- eﬃcient ﬁne-tuning, arXiv preprint arXiv:2303.10512 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Dettmers, A

T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Eﬃcient ﬁnetuning of quantized llms, Advances in neural information processing systems 36 (2023) 10088–10115

2023

[34] [34]

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

A. Arazi, E. Shapira, S. Grunblat, M. Ventura, E. Hoﬀer, G. Blayer, D. Holzmüller, L. Purucker, G. Varoquaux, F. Hutter, et al., Multabench: Benchmarking multimodal tabular learning with text and image, arXiv preprint arXiv:2605.10616 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

T. Chen, B. Xu, C. Zhang, C. Guestrin, Training deep nets with sub- linear memory cost, arXiv preprint arXiv:1604.06174 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[36] [36]

Mercea, A

O.-B. Mercea, A. Gritsenko, C. Schmid, A. Arnab, Time-memory-and parameter-eﬃcient visual adaptation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5536–5545. 31

2024