pith. machine review for the scientific record.

arxiv: 2605.10616 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 Lean theorem links

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords multimodal tabular learning · target-aware representations · benchmark datasets · embedding tuning · text-tabular · image-tabular · foundation models

The pith

Tuning text and image embeddings to the prediction target improves performance on multimodal tabular tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that established multimodal tabular benchmarks emphasize simple co-occurrence of modalities, which creates high variance and hides the real benefit of adjusting embeddings to the specific task. It introduces MulTaBench with 40 datasets split between image-tabular and text-tabular cases, chosen so that the unstructured data supplies complementary signals that generic frozen embeddings lose. Experiments then show that target-aware tuning of those embeddings produces consistent gains. The result matters because tabular foundation models currently treat text and image inputs as fixed add-ons rather than adaptable components.

Core claim

MulTaBench is a benchmark of 40 datasets that isolates multimodal tabular learning settings where text or images carry essential complementary information lost in generic embeddings. The central result is that tuning the embeddings to align with the prediction target yields performance improvements that hold across both modalities, multiple tabular learners, different encoder scales, and varying embedding dimensions.

What carries the argument

Target-aware representation tuning, which adapts pretrained text or image embeddings to the specific supervised task instead of leaving them frozen.
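
The review does not spell out the tuning mechanics, so the following is a minimal sketch of one plausible form of target-aware tuning: a pretrained Hugging Face encoder with a small projection and prediction head trained on the supervised target, after which the adapted embeddings replace the frozen ones as tabular features. The class name, pooling choice, and freeze flag are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of target-aware embedding tuning (one plausible form; the
# paper's exact procedure is not described in this review). Assumes a
# Hugging Face transformer encoder; names and settings are illustrative.
import torch
import torch.nn as nn

class TargetAwareTuner(nn.Module):
    def __init__(self, encoder, emb_dim, n_classes, freeze_encoder=False):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:  # head-only tuning vs. full fine-tuning
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.proj = nn.Linear(emb_dim, emb_dim)    # adapted representation
        self.head = nn.Linear(emb_dim, n_classes)  # supervised target head

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state
        z = torch.relu(self.proj(hidden[:, 0]))    # [CLS]-token pooling
        return self.head(z), z  # logits train the loss; z feeds the tabular learner
```

After training with a standard supervised loss on the prediction target, the `z` vectors would stand in for the generic frozen embeddings as input features to the downstream tabular model.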

Load-bearing premise

That the 40 chosen datasets genuinely contain complementary predictive signals from text or images that generic embeddings fail to capture.

What would settle it

Re-running the experiments on MulTaBench and finding no average improvement from target-aware tuning over frozen generic embeddings, or finding that the gains disappear outside the selected datasets.

Figures

Figures reproduced from arXiv:2605.10616 by Alan Arazi, David Holzmüller, Eilam Shapira, Elad Hoffer, Frank Hutter, Gaël Varoquaux, Gioia Blayer, Lennart Purucker, Mor Ventura, Roi Reichart, Shoham Grunblat.

Figure 1. The MulTaBench Curation Pipeline. Datasets are included if joint prediction outperforms …
Figure 2. Curation protocol over candidate datasets. Mean AUC per model and condition. …
Figure 3. Target-Aware Representation Gains over Frozen. Normalized scores for …
Figure 4. Tabular Learners Performance Analysis. Normalized scores for MulTaBench datasets, with …
Figure 5. Embedding Model Size Analysis. Normalized scores are computed with min-max scaling …
Figure 6. Embedding Dimension Analysis. Normalized scores are computed with min-max scaling at …
Figure 7. DINO-v3-small Attention Maps. Before (Frozen) and after (Target-Aware) finetuning on …
Figure 8. Curation Conditions for the Text-Tabular Pool. Normalized scores for …
Figure 9. Tabular Learners Performance Analysis for Classification Tasks. Normalized scores over …
Figure 10. Tabular Learners Performance Analysis for Regression Tasks. Normalized scores over …
Figure 11. Computation costs per run. Left: median runtime in seconds (log scale). Right: median …
Figure 12. Encoder Scale Analysis for Classification. Small and large encoder variants, frozen and …
Figure 13. Encoder Scale Analysis for Regression. Small and large encoder variants, frozen and TAR, …
Figure 14. No-PCA ablation on 33 datasets for CatBoost and LightGBM. Normalized scores are on …
Figure 15. CheXpert Attention Maps. The attention shifts from diffused edges to the lung.
Figure 16. PetFinder Attention Maps. Attention isolates the cat's ears and the dog's eyes.
Figure 17. Glaucoma Attention Maps. Frozen attention scatters randomly across the retina; TAR …
Figure 18. Celeb Attractiveness Attention Maps. Frozen attention disperses across accessories, …
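
Several captions above reference min-max normalized scores. As a point of reference, a minimal sketch of per-dataset min-max scaling across methods, which is the usual benchmark convention (the paper's exact normalization may differ):

```python
import numpy as np

def minmax_normalize(scores: np.ndarray) -> np.ndarray:
    """Per-dataset min-max scaling: within each dataset (row), the worst
    method maps to 0 and the best to 1, so datasets with different metric
    ranges become comparable. scores: (n_datasets, n_methods)."""
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    return (scores - lo) / np.maximum(hi - lo, 1e-12)  # guard zero range
```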
Original abstract

Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.
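
The abstract's frozen-embedding setup, together with the PCA step that Figure 14 ablates, suggests a pipeline along the following lines. This is a sketch under stated assumptions: the encoder name, PCA width, and learner are placeholders, not the paper's configuration.

```python
# Sketch of the frozen-embedding baseline the abstract describes: encode the
# unstructured column with a frozen pretrained model, optionally compress
# with PCA (the step Figure 14 ablates), then train a tabular learner on the
# concatenation with the structured features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sentence_transformers import SentenceTransformer

def frozen_multimodal_baseline(texts, X_tab, y):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic, never tuned
    Z = encoder.encode(list(texts))                    # (n, emb_dim)
    Z = PCA(n_components=32).fit_transform(Z)          # compress embeddings
    X = np.hstack([X_tab, Z])                          # join with tabular features
    return GradientBoostingClassifier().fit(X, y)
```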

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MulTaBench, a benchmark of 40 multimodal tabular datasets (20 image-tabular and 20 text-tabular) selected to emphasize tasks where text or image modalities supply complementary predictive signals beyond tabular features and where frozen generic embeddings discard critical task-relevant information. It contrasts this with existing benchmarks that focus on modality co-occurrence and exhibit high variance. Experiments demonstrate that target-aware tuning of embeddings yields performance gains that generalize across text and image modalities, multiple tabular learners, encoder scales, and embedding dimensions. The work positions MulTaBench as the largest image-tabular benchmarking effort to date, covering domains such as healthcare and e-commerce, to support research on joint modeling and target-aware multimodal tabular foundation models.

Significance. If the dataset selection and experimental claims hold, MulTaBench would provide a more targeted evaluation resource than prior co-occurrence-focused benchmarks, potentially accelerating development of architectures that incorporate target-aware representations. The scale (40 datasets) and domain coverage constitute a clear strength for reproducibility and comparability in multimodal tabular learning.

major comments (2)
  1. [Dataset construction / MulTaBench description] Dataset construction (as described in the abstract and implied methods): The claim that the 40 datasets were chosen such that 'generic embeddings lose critical information, necessitating Target-Aware Representations' is load-bearing for the generalization statement, yet no quantitative selection filter is provided (e.g., no reported performance gap between tabular-only baselines and frozen-multimodal models, no mutual-information estimates, or explicit threshold on complementary signal). Without this, gains from target-aware tuning could stem from dataset idiosyncrasies rather than the asserted necessity.
  2. [Experimental results] Experimental results (as asserted in the abstract): The generalization claim across modalities, learners, scales, and dimensions lacks supporting details on statistical significance testing, error bars, or variance analysis across the 40 datasets. This weakens the assertion that gains 'generalize' and makes it hard to assess robustness.
minor comments (1)
  1. [Abstract] The abstract refers to 'established Multimodal Tabular Learning benchmarks' without citing specific prior works or datasets; adding these references would improve context.
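
To make the referee's first request concrete, here is a minimal sketch of the kind of quantitative inclusion filter being asked for: keep a candidate dataset only when frozen multimodal embeddings improve cross-validated AUC over a tabular-only baseline by some margin. The margin, model, and CV setup are illustrative; this is not the authors' documented protocol.

```python
# Sketch of a quantitative dataset-inclusion filter: the unstructured
# modality counts as "complementary" only if adding its frozen embeddings
# to the tabular features lifts cross-validated AUC by at least `margin`.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def has_complementary_signal(X_tab, X_emb, y, margin=0.02):
    auc_tab = cross_val_score(LogisticRegression(max_iter=1000),
                              X_tab, y, scoring="roc_auc", cv=5).mean()
    X_joint = np.hstack([X_tab, X_emb])
    auc_joint = cross_val_score(LogisticRegression(max_iter=1000),
                                X_joint, y, scoring="roc_auc", cv=5).mean()
    return (auc_joint - auc_tab) >= margin
```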

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major point below and describe the revisions we will implement to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Dataset construction / MulTaBench description] Dataset construction (as described in the abstract and implied methods): The claim that the 40 datasets were chosen such that 'generic embeddings lose critical information, necessitating Target-Aware Representations' is load-bearing for the generalization statement, yet no quantitative selection filter is provided (e.g., no reported performance gap between tabular-only baselines and frozen-multimodal models, no mutual-information estimates, or explicit threshold on complementary signal). Without this, gains from target-aware tuning could stem from dataset idiosyncrasies rather than the asserted necessity.

    Authors: We agree that an explicit quantitative justification for dataset selection would strengthen the paper and improve reproducibility. While the current manuscript describes the focus on tasks with complementary signals (Section 3), the selection process relied on domain expertise and preliminary checks rather than a fully documented filter. In the revised version, we will add a new subsection under MulTaBench construction that reports performance gaps between tabular-only baselines and frozen multimodal models for the selected datasets, along with mutual information estimates between the unstructured modalities and the target where feasible. This will directly address the concern regarding potential idiosyncrasies. revision: yes

  2. Referee: [Experimental results] Experimental results (as asserted in the abstract): The generalization claim across modalities, learners, scales, and dimensions lacks supporting details on statistical significance testing, error bars, or variance analysis across the 40 datasets. This weakens the assertion that gains 'generalize' and makes it hard to assess robustness.

    Authors: We acknowledge that the current presentation of results would benefit from additional statistical details to support the generalization claims. The experiments demonstrate consistent gains, but variance and significance were not fully quantified across all 40 datasets. In the revised manuscript, we will update the experimental results section to include error bars (standard deviations), variance analysis across datasets, and statistical significance testing (e.g., paired Wilcoxon tests) for the reported improvements. These additions will be reflected in the tables and figures to better substantiate robustness across modalities, learners, scales, and dimensions. revision: yes
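
A minimal sketch of the paired test the rebuttal commits to, assuming one aggregate score per dataset under each condition across the 40 datasets:

```python
# Paired Wilcoxon signed-rank test over per-dataset scores: frozen vs.
# target-aware conditions, one-sided in the direction of a tuning gain.
from scipy.stats import wilcoxon

def tuning_gain_significant(frozen_scores, tuned_scores, alpha=0.05):
    stat, p = wilcoxon(tuned_scores, frozen_scores, alternative="greater")
    return p < alpha, p
```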

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or generalization claims

Full rationale

The paper introduces MulTaBench as an empirical benchmark of 40 datasets chosen to exhibit complementary multimodal signals, then reports direct experimental measurements of performance gains from target-aware embedding tuning across modalities, learners, and scales. These results are obtained from independent evaluations on the curated data rather than any self-referential derivation, fitted parameter renamed as prediction, or load-bearing self-citation that reduces the central claim to its own inputs by construction. Dataset selection criteria are stated descriptively without creating a definitional loop, and no equations or uniqueness theorems are invoked that would force the observed outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper rather than a theoretical derivation; no free parameters, axioms, or invented entities are introduced or required for the central claim in the abstract.

pith-pipeline@v0.9.0 · 5588 in / 1197 out tokens · 56628 ms · 2026-05-12T04:49:45.037705+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

cs.LG · 2026-05 · unverdicted · novelty 6.0

    A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and 14% lower error at K=16 interactions.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 1 Pith paper · 8 internal anchors
