MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

Guolin Ke; Haoyu Dong; Mingzhe Lu; Pengkun Zhang; Yanzhen Shen

arXiv: 2509.06806 · v6 · submitted 2025-09-08 · cs.CL · cs.AI

Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke

Pith reviewed 2026-05-18 18:05 UTC · model grok-4.3

keywords: in-context learning · continued pretraining · tabular classification · structural causal models · many-shot learning · large language models · out-of-distribution generalization

The pith

Continued pretraining on tasks synthesized from millions of structural causal models equips LLMs with robust many-shot in-context learning for tabular classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a continued-pretraining method that turns a general LLM into a model capable of performing machine learning tasks like tabular classification using only in-context examples, without any gradient updates at inference time. It generates millions of synthetic training tasks from structural causal models and distills strategies from a random-forest teacher to build numerical robustness, then serializes them with compact prompts that pack far more examples into the context window. This matters for readers because it shows LLMs can reach random-forest accuracy on out-of-distribution data across finance, physics, biology, and healthcare domains simply by scaling the number of demonstrations, while keeping general knowledge and reasoning intact. The work reveals a clear many-shot scaling law in which accuracy rises steadily as the number of in-context examples grows from 8 to 1024.

Core claim

MachineLearningLM, obtained by continued pretraining of Qwen-2.5-7B-Instruct with LoRA, outperforms strong LLM baselines such as GPT-5-mini by an average of about 15 percent on out-of-distribution tabular classification tasks across finance, physics, biology, and healthcare. It exhibits a many-shot scaling law in which accuracy increases monotonically with the number of in-context demonstrations from 8 to 1024 and reaches random-forest-level performance without any task-specific training, while preserving general capabilities at 75.4 percent on MMLU.

What carries the argument

A portable continued-pretraining framework that synthesizes ML tasks from millions of structural causal models and distills random-forest decision strategies into the LLM using token-efficient prompts.
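
To make the task-synthesis step concrete, here is a minimal sketch of drawing one classification task from a random structural causal model. It is an illustration under stated assumptions, not the paper's generator: the function name `sample_scm_task`, the edge probability of 0.5, the tanh-plus-Gaussian-noise structural function, and the median threshold for labels are all hypothetical choices.

```python
import math
import random

def sample_scm_task(n_features=4, n_rows=16, seed=0):
    """Draw one synthetic binary-classification task from a random SCM (illustrative)."""
    rng = random.Random(seed)
    n_nodes = n_features + 1  # the last node in topological order feeds the label
    # Random DAG: node v may only depend on nodes with a smaller index.
    parents = {v: [u for u in range(v) if rng.random() < 0.5] for v in range(n_nodes)}
    weights = {v: {u: rng.uniform(-1.0, 1.0) for u in parents[v]} for v in range(n_nodes)}

    rows, scores = [], []
    for _ in range(n_rows):
        x = []
        for v in range(n_nodes):
            noise = rng.gauss(0.0, 1.0)  # i.i.d. noise term for node v
            signal = sum(weights[v][u] * x[u] for u in parents[v])
            x.append(math.tanh(signal) + noise)  # assumed form of the structural function
        rows.append([round(val, 2) for val in x[:n_features]])
        scores.append(x[-1])
    # Threshold the label node at its median so the two classes are balanced.
    threshold = sorted(scores)[n_rows // 2]
    labels = [int(s >= threshold) for s in scores]
    return rows, labels

if __name__ == "__main__":
    rows, labels = sample_scm_task()
    for row, y in zip(rows[:3], labels[:3]):
        print(row, y)
```

Repeating this with different seeds, feature counts, and row counts gives a stream of distinct tasks, which is the general idea behind pretraining on many SCM-derived tasks.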

If this is right

LLMs reach random-forest-level accuracy on tabular classification using hundreds of in-context shots without task-specific training.
Accuracy on these tasks increases monotonically as the number of in-context demonstrations grows from 8 to 1024.
General chat capabilities remain intact, with 75.4 percent accuracy on MMLU after the continued pretraining.
Token-efficient serialization supports 3x to 6x more examples per context window and up to 50x amortized throughput via batch inference.
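
The serialization and batching claims can be pictured with a small sketch that contrasts a natural-language description style with a compact tabular style, then appends several queries to one shared block of demonstrations. The delimiters (`,`, `|`, `---`, `?`) and all function names are hypothetical and are not the paper's prompt template; character counts stand in as a rough proxy for token counts.

```python
def serialize_nl(rows, labels, feature_names):
    """Verbose style: one natural-language sentence per example."""
    lines = []
    for row, y in zip(rows, labels):
        desc = ", ".join(f"the {name} is {val}" for name, val in zip(feature_names, row))
        lines.append(f"For this example, {desc}. The label is {y}.")
    return "\n".join(lines)

def serialize_tabular(rows, labels):
    """Compact style: comma-separated values, then a separator and the label."""
    return "\n".join(",".join(str(v) for v in row) + f"|{y}" for row, y in zip(rows, labels))

def build_batch_prompt(rows, labels, queries):
    """Share one demonstration block across many unlabeled query rows."""
    demos = serialize_tabular(rows, labels)
    query_block = "\n".join(",".join(str(v) for v in q) + "|?" for q in queries)
    return f"{demos}\n---\n{query_block}"

if __name__ == "__main__":
    rows = [[0.12, 3.4, 7], [0.98, 1.1, 2]]
    labels = [0, 1]
    names = ["ratio", "score", "count"]
    print(len(serialize_nl(rows, labels, names)), "characters in NL style")
    print(len(serialize_tabular(rows, labels)), "characters in tabular style")
    print(build_batch_prompt(rows, labels, [[0.5, 2.0, 4], [0.3, 3.3, 6]]))
```

Because the demonstrations are written once and reused for every query in the batch, the per-query cost of the long context shrinks as the batch grows, which is the intuition behind the amortized-throughput figure.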

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthesis approach could be applied to other task types, such as regression or time-series forecasting, to test whether many-shot in-context learning generalizes beyond classification.
Models trained this way might serve as drop-in replacements for specialized ML pipelines in settings where collecting large labeled datasets is costly.
The observed scaling law suggests that further increases in context length could yield additional gains without any change to the pretraining recipe.

Load-bearing premise

The synthetic tasks generated from structural causal models produce training distributions representative enough of real-world tabular data in the target domains that in-context performance transfers to out-of-distribution test sets.

What would settle it

Measure whether the 15 percent average gain and the monotonic accuracy increase with shot count both disappear when the model is tested on real tabular datasets drawn from the same domains but whose statistical or causal structure differs markedly from the synthesized SCM tasks.
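
The two quantities in this test are simple to compute once per-dataset accuracies are available. The sketch below is one way to do it; the helper names and the numbers in the usage example are made up for illustration and are not results from the paper.

```python
def average_gain(model_acc, baseline_acc):
    """Mean accuracy difference over the datasets present in both dicts."""
    shared = sorted(set(model_acc) & set(baseline_acc))
    return sum(model_acc[d] - baseline_acc[d] for d in shared) / len(shared)

def is_monotonic(acc_by_shots, tolerance=0.0):
    """True if accuracy never drops by more than `tolerance` as shots increase."""
    shots = sorted(acc_by_shots)
    return all(
        acc_by_shots[b] >= acc_by_shots[a] - tolerance
        for a, b in zip(shots, shots[1:])
    )

if __name__ == "__main__":
    # Placeholder numbers, for illustration only.
    model = {"dataset_a": 0.80, "dataset_b": 0.70}
    baseline = {"dataset_a": 0.60, "dataset_b": 0.60}
    print(average_gain(model, baseline))
    print(is_monotonic({8: 0.55, 64: 0.63, 512: 0.71, 1024: 0.72}))
```

A small positive `tolerance` lets the check ignore dips that are within run-to-run noise.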

Figures

Figures reproduced from arXiv: 2509.06806 by Guolin Ke, Haoyu Dong, Mingzhe Lu, Pengkun Zhang, Yanzhen Shen.

**Figure 1.** MACHINELEARNINGLM on in-context ML tasks: (a) prompt template; (b) 512-shot accuracy across domains vs. Qwen-2.5-7B-Instruct; (c) many-shot scaling (2^3 to 2^10 shots) vs. LLMs. (arXiv:2509.06806v5 [cs.CL], 16 Sep 2025)

**Figure 2.** Heatmap of task density across feature and shot counts. We sample 100k tasks from the ...

**Figure 3.** A comparison between the NL description style and tabular style of many-shot examples.

Original abstract

Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets a 7B model to hit random-forest accuracy on real OOD tabular tasks via many-shot ICL after SCM-based continued pretraining, with clear scaling to 1024 shots.

The main takeaway is that continued pretraining on tasks drawn from millions of structural causal models, plus random-forest distillation and compact serialization, lets a modest 7B model do strong many-shot in-context learning on tabular classification. It reaches random-forest levels on out-of-distribution data from finance, physics, biology, and healthcare, beats GPT-5-mini by roughly 15% on average, and shows accuracy rising steadily as shots go from 8 to 1024. General capabilities stay intact at 75.4% on MMLU. That combination of synthesis, distillation, and token-efficient formatting is the concrete new piece; it is not just another ICL prompt tweak or standard continued pretraining run.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MachineLearningLM, a continued-pretraining framework that synthesizes millions of tabular ML tasks from structural causal models (SCMs) to equip a base LLM (Qwen-2.5-7B-Instruct, LoRA rank 8) with many-shot in-context learning capability. It reports an average ~15% improvement over strong LLM baselines such as GPT-5-mini on out-of-distribution real-world tabular classification tasks drawn from finance, physics, biology, and healthcare, together with a monotonic accuracy scaling law as the number of in-context demonstrations grows from 8 to 1,024 shots, reaching random-forest-level performance without any task-specific gradient updates, while preserving general chat performance (75.4% on MMLU).

Significance. If the transfer from SCM-synthesized pretraining to real OOD tabular data proves robust, the result would be significant: it would demonstrate a practical route to scaling in-context ML performance far beyond current LLM limits without sacrificing general capabilities. The token-efficient serialization (3-6x more examples per context) and the use of a random-forest teacher for distillation are concrete engineering contributions that could be adopted more broadly. The external OOD evaluation on never-seen real datasets supplies an independent test rather than a circular one.

major comments (2)

[Abstract] Abstract: the headline claim of an average 15% improvement over GPT-5-mini (and random-forest parity) is stated without error bars, per-domain standard deviations, number of runs, or any statistical test. Because the central empirical assertion rests on these aggregate numbers, the absence of uncertainty quantification is load-bearing for assessing whether the gains are reliable or could be explained by dataset selection or variance.
[Abstract] Abstract / pretraining description: the assumption that tasks drawn from millions of SCMs induce generalizable numerical ICL strategies (rather than artifacts of simplified noise models, bounded ranges, or missing heavy-tail / non-stationarity patterns) is not accompanied by any diagnostic that measures distribution shift between the synthetic pretraining distribution and the target real-world domains. This transfer step is load-bearing for the OOD performance claim.

minor comments (1)

[Abstract] The token-efficient prompt serialization is described as enabling 3x-6x more examples; a short illustrative example or token-count formula would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of empirical rigor and transfer validity. We address each major comment below and commit to revisions that strengthen the presentation of our results without altering the core claims.

Referee: [Abstract] Abstract: the headline claim of an average 15% improvement over GPT-5-mini (and random-forest parity) is stated without error bars, per-domain standard deviations, number of runs, or any statistical test. Because the central empirical assertion rests on these aggregate numbers, the absence of uncertainty quantification is load-bearing for assessing whether the gains are reliable or could be explained by dataset selection or variance.

Authors: We agree that uncertainty quantification is essential for the headline claims. In the revised manuscript we will report all aggregate results with error bars computed over at least five independent runs (different random seeds for both pretraining data sampling and evaluation), include per-domain standard deviations, and add paired statistical significance tests (e.g., Wilcoxon signed-rank) against the strongest baselines. These additions will appear in the abstract, Table 1, and the main results section. revision: yes
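
As a sketch of the reporting promised in this response, the block below computes a mean and sample standard deviation over runs and an exact two-sided Wilcoxon signed-rank p-value for paired scores. It uses only the standard library and enumerates all sign assignments, so it is practical for small samples only; the function names and the example numbers are hypothetical.

```python
from itertools import product
from statistics import mean, stdev

def mean_and_std(values):
    """Mean and sample standard deviation, e.g. over independent runs."""
    return mean(values), stdev(values)

def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples (small n)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # tied absolute differences share their average rank
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null, every sign pattern over the ranks is equally likely.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_obs:
            count += 1
    return count / 2 ** n

if __name__ == "__main__":
    # Placeholder accuracies in percent, for illustration only.
    model = [80, 75, 90, 70, 85]
    baseline = [70, 60, 82, 55, 80]
    print(mean_and_std(model))
    print(wilcoxon_exact(model, baseline))
```

One design point worth noting: with only five pairs, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, so pairing over datasets rather than over five seeds would give the test more room to detect a difference.
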
Referee: [Abstract] Abstract / pretraining description: the assumption that tasks drawn from millions of SCMs induce generalizable numerical ICL strategies (rather than artifacts of simplified noise models, bounded ranges, or missing heavy-tail / non-stationarity patterns) is not accompanied by any diagnostic that measures distribution shift between the synthetic pretraining distribution and the target real-world domains. This transfer step is load-bearing for the OOD performance claim.

Authors: We acknowledge that an explicit characterization of the synthetic-to-real distribution shift would further support the transfer argument. While the primary evidence remains the strong OOD performance on held-out real datasets spanning finance, physics, biology, and healthcare, we will add a new appendix section that quantifies shift via (i) Kolmogorov-Smirnov tests on marginal feature distributions, (ii) comparison of noise variance and range statistics, and (iii) a simple covariate-shift diagnostic using a domain classifier trained on synthetic vs. real features. This analysis will be included in the revision. revision: partial
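
The first diagnostic in this response can be sketched with a standard-library two-sample Kolmogorov-Smirnov statistic applied to one feature at a time. This computes only the statistic (the largest gap between the two empirical CDFs), not a p-value, and the function name and sample values are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

if __name__ == "__main__":
    # Placeholder values standing in for one synthetic and one real feature column.
    synthetic_feature = [0.1, 0.4, 0.5, 0.9, 1.2]
    real_feature = [0.3, 0.8, 1.5, 2.2, 9.0]
    print(ks_statistic(synthetic_feature, real_feature))
```

A value near 0 means the two marginals look alike; a value near 1 means they barely overlap. Running it per feature gives a quick picture of where synthetic and real data diverge.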

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent OOD real-world evaluation

full rationale

The paper's core results—15% average gains over GPT-5-mini, monotonic accuracy scaling from 8 to 1024 shots, and random-forest-level performance—are measured on out-of-distribution real tabular datasets from finance, physics, biology, and healthcare domains. These test sets are explicitly never seen during the SCM-synthesized continued pretraining, supplying an external benchmark rather than a quantity fitted or defined from the training distribution. The pretraining procedure (task synthesis from structural causal models, random-forest distillation, token-efficient serialization) is described as a method to induce generalizable ICL strategies, but the reported scaling laws and comparisons are not reduced to the inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the derivation. The evaluation setup is self-contained against external benchmarks, making this a standard non-circular empirical claim.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that synthetic tasks drawn from structural causal models are sufficiently diverse and realistic to induce transferable in-context learning behavior; the only explicit training hyperparameter disclosed is LoRA rank 8.

free parameters (1)

LoRA rank = 8
Chosen as part of the modest experimental setup on Qwen-2.5-7B-Instruct
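
For a sense of what rank 8 means, the standard LoRA parameterization of one adapted weight matrix is shown below. The source gives only the rank; the scaling factor \alpha and the choice of adapted matrices are not stated here and are left symbolic, and the dimensions in the example are illustrative rather than the model's actual sizes.

```latex
\[
W = W_0 + \frac{\alpha}{r}\, B A, \qquad
B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r = 8
\]
\[
\text{trainable parameters per matrix: } r\,(d + k)
\quad \text{versus} \quad d\,k \text{ for a full update of } W_0
\]
\[
\text{e.g. } d = k = 4096:\quad 8 \cdot 8192 = 65{,}536
\quad \text{versus} \quad 4096^2 = 16{,}777{,}216 \ (\approx 0.39\%)
\]
```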

axioms (1)

Domain assumption: structural causal models can generate representative ML tasks whose decision boundaries are learnable via in-context imitation of random-forest strategies
Invoked to justify synthesis of millions of training tasks spanning up to 1024 shots

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability... synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
cs.AI 2025-12 unverdicted novelty 6.0

Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.


[1] [2]

Ml-master: Towards ai-for-ai via integration of exploration and reasoning.arXiv preprint arXiv:2506.16499,

URLhttps://arxiv.org/abs/2506.16499. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page arXiv

[2] [3]

Cache me if you can: How many kvs do you need for effective long-context lms?arXiv preprint arXiv:2506.17121,

Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, and Danqi Chen. Cache me if you can: How many kvs do you need for effective long-context lms?arXiv preprint arXiv:2506.17121,

work page arXiv

[3] [4]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

work page 1901

[4] [6]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

URLhttps://arxiv.org/abs/2410.07095. Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou. How many demonstrations do you need for in-context learning? InThe 2023 Conference on Empirical Methods in Natural Language Processing. Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InKDD, pp. 785–794,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [7]

Batch prompting: Efficient inference with large language model apis

Zhoujun Cheng, Jungo Kasai, and Tao Yu. Batch prompting: Efficient inference with large language model apis. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 792–810,

work page 2023

[6] [8]

Interpreting Tree Ensembles with inTrees

Houtao Deng. Interpreting tree ensembles with intrees.arXiv preprint arXiv:1408.5456,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [9]

Spreadsheetllm: encoding spreadsheets for large language models

Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Junyu Xiong, Shiyu Xia, Mengyu Zhou, Yun Lin, José Cambronero, Yeye He, Shi Han, et al. Spreadsheetllm: encoding spreadsheets for large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20728–20748,

work page 2024

[8] [10]

doi: 10.1214/07-AOAS148.

Josh Gardner, Juan C Perdomo, and Ludwig Schmidt. Large scale transfer learning for tabular data via language modeling. Advances in Neural Information Processing Systems, 37:45155–45205,

[9] [11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Zhengyao Gu, Henry Peng Zou, Aiwei Liu, Yankai Chen, Weizhi Zhang, and Philip S Yu. Scaling laws for many-shot in-context learning with self-generated annotations. In ICML 2025 Workshop on Long-Context Foundation Models.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1...

[10] [12]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

[11] [13]

Many-shot in-context learning in multimodal foundation models

Yixing Jiang, Jeremy Andrew Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H Chen, and Andrew Y Ng. Many-shot in-context learning in multimodal foundation models. In ICML 2024 Workshop on In-Context Learning.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boost...

[12] [14]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

[13] [15]

Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents. CoRR, abs/2508.02085, 2025

Jianzhe Lin, Maurice Diesendruck, Liang Du, and Robin Abraham. Batchprompt: Accomplish more with less. In The Twelfth International Conference on Learning Representations.

Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Daxin Jiang, Binxing Jiao, Chen Hu, et al. Se-agent: Self-evolution trajectory optimization in multi-step ...

[14] [16]

Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114,

[15] [17]

arXiv preprint arXiv:2407.04057

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 11:157–173, 2024a.

Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and Han-Jia Ye. Talent: A tabular analytics and learning toolbox. arXiv ...

[16] [18]

Tabdpt: Scaling tabular foundation models on real data. arXiv preprint arXiv:2410.18164,

Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, and Anthony L Caterini. Tabdpt: Scaling tabular foundation models. arXiv preprint arXiv:2410.18164,

[17] [19]

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048–11064, Abu Dhabi, U...

[18] [20]

Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.759. URL https://aclanthology.org/2022.emnlp-main.759/.

Martin Mráz, Breenda Das, Anshul Gupta, Lennart Purucker, and Frank Hutter. Towards benchmarking foundation models for tabular data with text. arXiv preprint arXiv:2507.07829,

[19] [21]

Tabicl: A tabular foundation model for in-context learning on large data

Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabicl: A tabular foundation model for in-context learning on large data. In ICML 2025-Forty-Second International Conference on Machine Learning,

[20] [22]

Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J Smola. Benchmarking multimodal automl for tabular data with text fields. arXiv preprint arXiv:2111.02705,

[21] [23]

Aaditya K Singh and DJ Strouse. Tokenization counts: the impact of tokenization on arithmetic in frontier llms. arXiv preprint arXiv:2402.14903,

[22] [24]

Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Gang Chen, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, et al. Tablegpt2: A large multimodal model with tabular data integration. arXiv preprint arXiv:2411.02059,

[23] [25]

doi: 10.1145/3616855.3635752.

Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, and Ziwei Liu. Link-context learning for multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

[24] [26]

URL https://openaccess.thecvf.com/content/CVPR2024/papers/Tai_Link-Context_Learning_for_Multimodal_LLMs_CVPR_2024_paper.pdf.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th...

[25] [27]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Association for Computational Linguistics. doi: 10.18653/v1/D19-1534. URL https://aclanthology.org/D19-1534/.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291,

[26] [28]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, a.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, an...

[27] [29]

Agent workflow memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In Forty-second International Conference on Machine Learning, b.

Xumeng Wen, Shun Zheng, Zhen Xu, Yiming Sun, and Jiang Bian. Scalable in-context learning on tabular data via retrieval-augmented large language models. arXiv preprint arXiv:2502.03147,

[28] [30]

Effective long-context scaling of foundation models

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (...

[29] [31]

R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science, October 2025. https://arxiv.org/abs/2505.14738

Xu Yang, Xiao Yang, Shikai Fang, Bowen Xian, Yuante Li, Jian Wang, Minrui Xu, Haoran Pan, Xinpeng Hong, Weiqing Liu, et al. R&d-agent: Automating data-driven ai solution building through llm-powered automated research, development, and evolution. arXiv preprint arXiv:2505.14738,

[30] [32]

Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and De-Chuan Zhan. A closer look at deep learning methods on tabular datasets. arXiv preprint arXiv:2407.00956,

[31] [33]

More is not always better? enhancing many-shot in-context learning with differentiated and reweighting objectives

Xiaoqing Zhang, Ang Lv, Yuhan Liu, Flood Sung, Wei Liu, Jian Luan, Shuo Shang, Xiuying Chen, and Rui Yan. More is not always better? enhancing many-shot in-context learning with differentiated and reweighting objectives. In Proceedings of ACL 2025 (Long Papers),

[32] [34]

URL https://aclanthology.org/2025.acl-long.1475.pdf.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In International conference on machine learning, pp. 12697–12706. PMLR,

[33] [35]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

URL https://arxiv.org/abs/2403.13372.

Kaijian Zou, Muhammad Khalifa, and Lu Wang. On many-shot in-context learning for long-context evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 25605–2563...

[34] [36]

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.1245. URL https://aclanthology.org/2025.acl-long.1245/.

A LIST OF SYMBOLS

Symbol | Meaning | Range/Default
G | DAG of the structural causal model (SCM) | —
v | Node index in G | —
x_v | Variable value at node v | —
f_v | Structural function at node v | —
ε_v | i.i.d. noi...