MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Pith reviewed 2026-05-18 18:05 UTC · model grok-4.3
The pith
Continued pretraining on tasks synthesized from millions of structural causal models equips LLMs with robust many-shot in-context learning for tabular classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MachineLearningLM, obtained by continued pretraining of Qwen-2.5-7B-Instruct with LoRA, outperforms strong LLM baselines such as GPT-5-mini by an average of about 15 percent on out-of-distribution tabular classification tasks across finance, physics, biology, and healthcare. It exhibits a many-shot scaling law in which accuracy increases monotonically with the number of in-context demonstrations from 8 to 1024 and reaches random-forest-level performance without any task-specific training, while preserving general capabilities at 75.4 percent on MMLU.
What carries the argument
A portable continued-pretraining framework that synthesizes ML tasks from millions of structural causal models and distills random-forest decision strategies into the LLM using token-efficient prompts.
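To make the idea of synthesizing tasks from structural causal models concrete, here is a minimal sketch of one way such a task could be sampled. It is not the paper's generator: the function name `sample_scm_task`, the tanh structural function, the Gaussian noise scale, the edge probability, and the median-threshold labeling are all assumptions chosen for illustration.

```python
import math
import random


def sample_scm_task(n_features=4, n_rows=32, seed=0):
    """Sample one synthetic binary-classification task from a random SCM.

    Hypothetical illustration only; none of these choices come from the paper.
    """
    rng = random.Random(seed)
    n_nodes = n_features + 1  # the last node in topological order is the label
    # Random DAG: node j may depend only on nodes i < j, so no cycles can form.
    weights = {
        j: {i: rng.gauss(0.0, 1.0) for i in range(j) if rng.random() < 0.5}
        for j in range(n_nodes)
    }
    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            parent_sum = sum(w * values[i] for i, w in weights[j].items())
            # Assumed structural function: tanh of weighted parents plus noise.
            values.append(math.tanh(parent_sum) + rng.gauss(0.0, 0.3))
        rows.append(values)
    # Binarize the last node at its (upper) median to obtain class labels.
    target = sorted(r[-1] for r in rows)
    threshold = target[len(target) // 2]
    features = [r[:-1] for r in rows]
    labels = [int(r[-1] >= threshold) for r in rows]
    return features, labels


features, labels = sample_scm_task(seed=1)
```

Varying the seed yields a different graph, different structural weights, and therefore a different task, which is the property a large-scale synthesis pipeline would rely on.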
If this is right
- LLMs reach random-forest-level accuracy on tabular classification using hundreds of in-context shots without task-specific training.
- Accuracy on these tasks increases monotonically as the number of in-context demonstrations grows from 8 to 1024.
- General chat capabilities remain intact, with 75.4 percent accuracy on MMLU after the continued pretraining.
- Token-efficient serialization supports 3x to 6x more examples per context window and up to 50x amortized throughput via batch inference.
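The amortized-throughput point in the last item can be illustrated with simple prompt-token arithmetic. The sketch below assumes that batching means several query rows share one copy of the in-context demonstrations; the function name and the token counts in the usage line are hypothetical, and the model ignores output tokens and attention cost.

```python
def amortized_speedup(context_tokens, query_tokens, batch_size):
    """Ratio of per-query prompt cost without batching to the cost with batching.

    Assumption: a batched prompt carries the demonstrations once and then
    batch_size query rows, so the context cost is shared across the batch.
    """
    unbatched = context_tokens + query_tokens  # demonstrations re-sent per query
    batched = (context_tokens + batch_size * query_tokens) / batch_size
    return unbatched / batched


# Hypothetical sizes: 20,000 demonstration tokens, 20 tokens per query row.
speedup = amortized_speedup(context_tokens=20_000, query_tokens=20, batch_size=50)
```

With these placeholder sizes the ratio is about 47.7. Under this simple model the ratio can never exceed the batch size, and it approaches the batch size as the shared context grows relative to a single query row.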
Where Pith is reading between the lines
- The same synthesis approach could be applied to other task types, such as regression or time-series forecasting, to test whether many-shot in-context learning generalizes beyond classification.
- Models trained this way might serve as drop-in replacements for specialized ML pipelines in settings where collecting large labeled datasets is costly.
- The observed scaling law suggests that further increases in context length could yield additional gains without any change to the pretraining recipe.
Load-bearing premise
The synthetic tasks generated from structural causal models produce training distributions representative enough of real-world tabular data in the target domains that in-context performance transfers to out-of-distribution test sets.
What would settle it
Measure whether the 15 percent average gain and the monotonic accuracy increase with shot count both disappear when the model is tested on real tabular datasets drawn from the same domains but whose statistical or causal structure differs markedly from the synthesized SCM tasks.
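The two quantities in this test, the average gain over a baseline and monotonicity in shot count, are easy to compute once per-dataset accuracies are available. The following is a minimal sketch with hypothetical helper names; the numbers in the usage lines are placeholders, not results from the paper.

```python
def is_monotonic_non_decreasing(acc_by_shots, tolerance=0.0):
    """True if accuracy never drops by more than `tolerance` as shots increase."""
    shots = sorted(acc_by_shots)
    accs = [acc_by_shots[s] for s in shots]
    return all(later >= earlier - tolerance for earlier, later in zip(accs, accs[1:]))


def mean_gain(model_acc, baseline_acc):
    """Average accuracy difference over the datasets both systems were scored on."""
    datasets = sorted(set(model_acc) & set(baseline_acc))
    if not datasets:
        raise ValueError("no datasets in common")
    return sum(model_acc[d] - baseline_acc[d] for d in datasets) / len(datasets)


# Placeholder values for illustration only.
curve = {8: 0.60, 64: 0.70, 1024: 0.80}
holds = is_monotonic_non_decreasing(curve)
gain = mean_gain({"a": 0.8, "b": 0.7}, {"a": 0.6, "b": 0.6})
```

A nonzero `tolerance` is one way to keep run-to-run noise from flagging a curve as non-monotonic; what value is appropriate depends on the variance across seeds.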
Original abstract
Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MachineLearningLM, a continued-pretraining framework that synthesizes millions of tabular ML tasks from structural causal models (SCMs) to equip a base LLM (Qwen-2.5-7B-Instruct, LoRA rank 8) with many-shot in-context learning capability. It reports an average ~15% improvement over strong LLM baselines such as GPT-5-mini on out-of-distribution real-world tabular classification tasks drawn from finance, physics, biology, and healthcare, together with a monotonic accuracy scaling law as the number of in-context demonstrations grows from 8 to 1,024 shots, reaching random-forest-level performance without any task-specific gradient updates, while preserving general chat performance (75.4% on MMLU).
Significance. If the transfer from SCM-synthesized pretraining to real OOD tabular data proves robust, the result would be significant: it would demonstrate a practical route to scaling in-context ML performance far beyond current LLM limits without sacrificing general capabilities. The token-efficient serialization (3-6x more examples per context) and the use of a random-forest teacher for distillation are concrete engineering contributions that could be adopted more broadly. The external OOD evaluation on never-seen real datasets supplies an independent test rather than a circular one.
major comments (2)
- [Abstract] Abstract: the headline claim of an average 15% improvement over GPT-5-mini (and random-forest parity) is stated without error bars, per-domain standard deviations, number of runs, or any statistical test. Because the central empirical assertion rests on these aggregate numbers, the absence of uncertainty quantification is load-bearing for assessing whether the gains are reliable or could be explained by dataset selection or variance.
- [Abstract] Abstract / pretraining description: the assumption that tasks drawn from millions of SCMs induce generalizable numerical ICL strategies (rather than artifacts of simplified noise models, bounded ranges, or missing heavy-tail / non-stationarity patterns) is not accompanied by any diagnostic that measures distribution shift between the synthetic pretraining distribution and the target real-world domains. This transfer step is load-bearing for the OOD performance claim.
minor comments (1)
- [Abstract] The token-efficient prompt serialization is described as enabling 3x-6x more examples; a short illustrative example or token-count formula would improve reproducibility.
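As an illustration of the kind of example this comment asks for, the sketch below contrasts a verbose natural-language row with a compact row that min-max scales each feature to an integer bucket. Both formats, the function names, and the bucket count are assumptions, not the paper's actual prompt, and character length is used only as a crude stand-in for token count.

```python
def verbose_row(names, values, label):
    """Natural-language serialization of one labeled row (hypothetical format)."""
    parts = [f"The {n} is {v:.6f}." for n, v in zip(names, values)]
    return " ".join(parts) + f" The label is {label}."


def compact_row(values, label, lo, hi, buckets=1000):
    """Compact serialization: each value becomes an integer in [0, buckets - 1]."""
    cells = []
    for v, a, b in zip(values, lo, hi):
        frac = 0.0 if b == a else (v - a) / (b - a)
        frac = min(1.0, max(0.0, frac))  # clip values outside the observed range
        cells.append(str(min(buckets - 1, int(frac * buckets))))
    return ",".join(cells) + "|" + str(label)


long_form = verbose_row(["age", "income"], [0.5, 10.0], 1)
short_form = compact_row([0.5, 10.0], 1, lo=[0.0, 0.0], hi=[1.0, 10.0])
ratio = len(long_form) / len(short_form)
```

An actual token-count comparison would need the target model's tokenizer, since the number of tokens per digit or separator varies between tokenizers.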
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of empirical rigor and transfer validity. We address each major comment below and commit to revisions that strengthen the presentation of our results without altering the core claims.
- Referee: [Abstract] Abstract: the headline claim of an average 15% improvement over GPT-5-mini (and random-forest parity) is stated without error bars, per-domain standard deviations, number of runs, or any statistical test. Because the central empirical assertion rests on these aggregate numbers, the absence of uncertainty quantification is load-bearing for assessing whether the gains are reliable or could be explained by dataset selection or variance.
  Authors: We agree that uncertainty quantification is essential for the headline claims. In the revised manuscript we will report all aggregate results with error bars computed over at least five independent runs (different random seeds for both pretraining data sampling and evaluation), include per-domain standard deviations, and add paired statistical significance tests (e.g., Wilcoxon signed-rank) against the strongest baselines. These additions will appear in the abstract, Table 1, and the main results section.
  Revision: yes
- Referee: [Abstract] Abstract / pretraining description: the assumption that tasks drawn from millions of SCMs induce generalizable numerical ICL strategies (rather than artifacts of simplified noise models, bounded ranges, or missing heavy-tail / non-stationarity patterns) is not accompanied by any diagnostic that measures distribution shift between the synthetic pretraining distribution and the target real-world domains. This transfer step is load-bearing for the OOD performance claim.
  Authors: We acknowledge that an explicit characterization of the synthetic-to-real distribution shift would further support the transfer argument. While the primary evidence remains the strong OOD performance on held-out real datasets spanning finance, physics, biology, and healthcare, we will add a new appendix section that quantifies shift via (i) Kolmogorov-Smirnov tests on marginal feature distributions, (ii) comparison of noise variance and range statistics, and (iii) a simple covariate-shift diagnostic using a domain classifier trained on synthetic vs. real features. This analysis will be included in the revision.
  Revision: partial
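The two statistics named in these responses can be sketched in plain Python. The code below computes only the test statistics (the Wilcoxon signed-rank sums and the two-sample Kolmogorov-Smirnov distance), not p-values; the function names are hypothetical and the inputs in the usage lines are placeholders rather than data from the paper.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus) for paired samples, dropping zero differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next absolute difference is tied.
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


def ks_statistic(a, b):
    """Two-sample KS distance: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Placeholder inputs for illustration only.
w_plus, w_minus = wilcoxon_signed_rank([3, 5, 8, 10], [1, 6, 4, 10])
shift = ks_statistic([1, 2, 3], [4, 5, 6])
```

For p-values, a statistics library is the practical route; SciPy, for example, provides `scipy.stats.wilcoxon` and `scipy.stats.ks_2samp`.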
Circularity Check
No significant circularity; claims rest on independent OOD real-world evaluation
The paper's core results—15% average gains over GPT-5-mini, monotonic accuracy scaling from 8 to 1024 shots, and random-forest-level performance—are measured on out-of-distribution real tabular datasets from finance, physics, biology, and healthcare domains. These test sets are explicitly never seen during the SCM-synthesized continued pretraining, supplying an external benchmark rather than a quantity fitted or defined from the training distribution. The pretraining procedure (task synthesis from structural causal models, random-forest distillation, token-efficient serialization) is described as a method to induce generalizable ICL strategies, but the reported scaling laws and comparisons are not reduced to the inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the derivation. The evaluation setup is self-contained against external benchmarks, making this a standard non-circular empirical claim.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank = 8
axioms (1)
- Domain assumption: Structural causal models can generate representative ML tasks whose decision boundaries are learnable via in-context imitation of random-forest strategies.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability... synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
  Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.