Recognition: 2 theorem links
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3
The pith
Pioneer Agent automates the engineering loop for adapting small language models from task descriptions or failure logs while preventing regressions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a closed-loop system called Pioneer Agent can automate the entire adaptation lifecycle of small language models. In cold-start mode it receives only a natural-language task description, acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode it receives a deployed model together with labeled failures, diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. The approach is tested on eight cold-start benchmarks across reasoning, math, code generation, summarization, and classification, on the new AdaptFT-Bench of synthetic inference logs with progressively increasing noise, and on two production-style deployments built from public benchmark tasks.
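The production-mode loop this claim describes (diagnose error patterns, synthesize targeted data, retrain, verify against a regression bound) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; every function name and the acceptance rule are assumptions.

```python
from collections import Counter

def diagnose(failures):
    """Rank labeled failures by error tag, most frequent first (stand-in
    for the agent's error-pattern diagnosis)."""
    return [tag for tag, _ in Counter(f["error"] for f in failures).most_common()]

def synthesize(patterns, per_pattern=2):
    """Stand-in for targeted data synthesis: a few examples per diagnosed pattern."""
    return [{"pattern": p, "example": i} for p in patterns for i in range(per_pattern)]

def adaptation_step(orig_score, failures, retrain, eval_original, epsilon=0.01):
    """One loop iteration: diagnose -> synthesize -> retrain -> verify.

    The candidate is accepted only if its score on the original task does
    not drop more than epsilon below the deployed model's score.
    """
    candidate = retrain(synthesize(diagnose(failures)))
    if eval_original(candidate) >= orig_score - epsilon:
        return candidate, True   # accept: no regression beyond tolerance
    return None, False           # reject: regression constraint violated
```

A candidate that regresses on the original task is discarded, which is the behavior the page contrasts with naive retraining.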
What carries the argument
Pioneer Agent, a closed-loop system that automates data acquisition, error-pattern diagnosis, curriculum synthesis, and regression-constrained retraining for small language models.
If this is right
- The agent can begin from a natural-language task description alone and produce specialized models for reasoning, math, code generation, summarization, and classification tasks.
- In adaptation scenarios with progressively noisier logs, the agent improves or preserves performance across all tested cases while standard retraining produces large drops.
- The regression constraints prevent the performance losses that occur when models are retrained without them.
- In simulated production deployments the agent raises intent classification accuracy and entity recognition quality on public benchmark tasks.
- The optimization process inside the agent can surface effective training choices such as chain-of-thought supervision and data-quality focus without being told to look for them.
Where Pith is reading between the lines
- Organizations could maintain fleets of small specialized models that update themselves as user patterns shift, lowering the human cost of ongoing maintenance.
- The error-diagnosis step may transfer to other machine-learning systems where logs of failures are available but expert review is limited.
- If the constraints are made tunable, the same loop could balance adaptation to new tasks against retention of old ones in multi-task or continual-learning settings.
Load-bearing premise
That automated diagnosis of error patterns from labeled failures reliably identifies fixable issues, and that the synthesized training data plus regression constraints produce net gains without introducing hidden regressions or requiring human intervention in complex real deployments.
What would settle it
A production deployment in which the agent receives labeled failures, follows its diagnosis and synthesis steps, applies the regression constraints, yet the resulting model shows clear degradation on the original task compared with the starting model.
read the original abstract
Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Pioneer Agent, a closed-loop automated system for adapting and continually improving small language models. In cold-start mode, given only a natural-language task description, the agent acquires data, builds evaluation sets, and iteratively trains by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given labeled failures from a deployed model, it diagnoses error patterns, synthesizes targeted training data, and retrains under explicit regression constraints. The authors introduce AdaptFT-Bench, a synthetic benchmark of inference logs with progressively increasing noise to evaluate the full adaptation loop. They report gains of 1.6-83.8 points over base models on eight cold-start benchmarks (reasoning, math, code, summarization, classification), improvement or preservation on all seven AdaptFT-Bench scenarios (vs. naive retraining degrading up to 43 points), and large gains on two production-style tasks constructed from public benchmarks (intent classification 84.9% to 99.3%, Entity F1 0.345 to 0.810). The system is also shown to discover effective strategies such as chain-of-thought supervision purely from downstream feedback.
Significance. If the results hold, the work has substantial practical significance for SLM deployment by automating the engineering loop of data curation, diagnosis, and regression avoidance that currently demands extensive human effort. The introduction of AdaptFT-Bench is a clear positive contribution as a standardized testbed for adaptation systems. The empirical demonstration that effective training strategies can be discovered from feedback alone is noteworthy and could reduce expert intervention. The paper earns credit for its focus on reproducible empirical evaluation across diverse tasks and for highlighting the contrast with naive retraining.
major comments (3)
- [§5.3] The two reported production-style deployments are constructed from public benchmark tasks rather than real production failure logs; this does not test the diagnosis step against the long-tail, multi-cause, or distribution-shifted failures that are central to the production-mode claim of net gains without hidden regressions or human intervention.
- [§4] While AdaptFT-Bench uses synthetic logs with progressively increasing noise, no independent accuracy metric is reported for the LLM-based error-pattern diagnosis against the ground-truth causes; without one, it is impossible to determine whether the observed gains (or preservation) are attributable to reliable diagnosis or to other system components.
- [§3.2] The explicit regression constraints are described as preventing degradation (contrasted with naive retraining losses of up to 43 points), but their precise formulation (loss terms, penalty application, or verification on held-out original-task distributions) is not provided, making it impossible to verify that they guard against regressions beyond the reported metrics.
minor comments (3)
- [§5.1] The abstract and §5.1 report a wide range of improvements (1.6-83.8 points) but do not include per-benchmark breakdowns, variance across runs, or statistical significance tests in the summary tables.
- [§3] A system diagram illustrating the closed-loop flow (data acquisition, diagnosis, synthesis, constrained retraining, verification) would improve clarity in the method section.
- [§5.4] The discussion of discovered strategies (chain-of-thought, task-specific optimization, quality-focused curation) would benefit from concrete examples of synthesized prompts or data in an appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.
read point-by-point responses
-
Referee: [§5.3] The two reported production-style deployments are constructed from public benchmark tasks rather than real production failure logs; this does not test the diagnosis step against the long-tail, multi-cause, or distribution-shifted failures that are central to the production-mode claim of net gains without hidden regressions or human intervention.
Authors: We acknowledge that the two production-style tasks are constructed from public benchmarks (intent classification and entity recognition) rather than proprietary production logs. This design choice enables full reproducibility and controlled injection of failure patterns while still exercising the diagnosis, synthesis, and regression-constrained retraining pipeline. The constructed logs incorporate multi-cause noise and distribution shifts to approximate production conditions, as described in §5.3. We agree that real long-tail production logs would provide stronger external validation. In the revision we will expand §5.3 to explicitly state this limitation, detail how the synthetic failures emulate multi-cause and shifted distributions, and note that future work will seek access to anonymized production logs for additional evaluation. revision: partial
-
Referee: [§4] While AdaptFT-Bench uses synthetic logs with progressively increasing noise, no independent accuracy metric is reported for the LLM-based error-pattern diagnosis against the ground-truth causes; without one, it is impossible to determine whether the observed gains (or preservation) are attributable to reliable diagnosis or to other system components.
Authors: The referee correctly notes the absence of a standalone diagnosis accuracy metric. The current evaluation in §4 reports end-to-end adaptation performance across noise levels, which implicitly depends on diagnosis quality. To isolate this component, the revised manuscript will add a new table and accompanying text in §4 that reports the LLM-based diagnosis accuracy against the ground-truth injected error patterns (e.g., precision/recall per noise type and overall F1). This will allow readers to attribute gains to diagnosis reliability versus other pipeline elements. revision: yes
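The proposed standalone metric could take the form of set-overlap precision/recall between the predicted and ground-truth injected error patterns per log, aggregated into F1. A minimal sketch under that assumption (the paper's actual metric definition is not given here):

```python
def diagnosis_scores(predicted, truth):
    """Precision/recall/F1 of predicted error patterns vs. ground-truth
    injected patterns for one log; both arguments are sets of labels."""
    tp = len(predicted & truth)  # correctly diagnosed patterns
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```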
-
Referee: [§3.2] The explicit regression constraints are described as preventing degradation (contrasted with naive retraining losses of up to 43 points), but their precise formulation (loss terms, penalty application, or verification on held-out original-task distributions) is not provided, making it impossible to verify that they guard against regressions beyond the reported metrics.
Authors: We agree that the precise mathematical formulation of the regression constraints was insufficiently detailed in §3.2. The constraints combine the task-specific loss with a penalty term that measures deviation from the original model's behavior on a held-out subset of the initial task distribution, implemented via KL divergence on output logits (with a tunable coefficient). Verification occurs by evaluating the retrained model on the same held-out original-task set after each iteration. The revised manuscript will include the exact loss equation, hyperparameter values for the penalty, and pseudocode for the constrained optimization loop. revision: yes
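The constraint as described (task loss plus a KL penalty on output distributions against the original model on held-out original-task inputs) can be written down directly. The sketch below uses plain-Python softmax/KL for a single example; the KL direction and the coefficient `lam` are assumptions, since the exact formulation is deferred to the revision.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between the categorical distributions induced by two logit vectors."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def constrained_loss(task_loss, orig_logits, new_logits, lam=0.1):
    """Task loss plus a penalty for the retrained model drifting from the
    original model's output distribution on a held-out original-task example."""
    return task_loss + lam * kl_divergence(orig_logits, new_logits)
```

With identical logits the penalty vanishes and the total reduces to the task loss; the more the retrained model's distribution drifts, the larger the penalty, which is the mechanism the rebuttal attributes to the regression constraint.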
Circularity Check
No circularity: empirical system evaluation on external benchmarks
full rationale
The paper is a systems description of an automated adaptation loop for small language models, evaluated via cold-start benchmarks, AdaptFT-Bench (synthetic logs), and two production-style deployments built from public tasks. All reported gains (e.g., 1.6-83.8 points, intent classification 84.9%→99.3%) are measured against independent external test sets rather than any internal equations, fitted parameters, or self-referential predictions. No derivations, uniqueness theorems, or ansatzes are invoked; the work contains no load-bearing self-citations and does not reduce outputs to inputs by construction. Claims rest on observed benchmark performance against external data.
Axiom & Free-Parameter Ledger
free parameters (1)
- jointly optimized data, hyperparameters, and learning strategy
axioms (1)
- domain assumption: Small language models can be improved by targeted fine-tuning on automatically curated data without catastrophic forgetting when regression constraints are applied
invented entities (1)
- Pioneer Agent (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
We formalize autonomous fine-tuning as search over a structured configuration space... π = (D, H, S)... optimization objective π* = arg max_π f(π) subject to r(π; R) ≤ ε
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Pioneer Agent... closed-loop system that automates this lifecycle... failure diagnosis, curriculum synthesis, retraining under explicit regression constraints
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction
GLiNER2-PII achieves the highest span-level F1 on the SPY benchmark by fine-tuning a small GLiNER2 model on a 4,910-example multilingual synthetic PII corpus.
Reference graph
Works this paper leans on
-
[2]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning, 2026. URL https://arx...
work page internal anchor Pith review arXiv 2026
-
[3]
AutoIntent: AutoML for text classification
Ilya Alekseev, Roman Solomatin, Darina Rustamova, and Denis Kuznetsov. AutoIntent: AutoML for text classification. In Ivan Habernal, Peter Schulam, and Jörg Tiedemann, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 707--716, Suzhou, China, November 2025. Association fo...
-
[4]
Claude sonnet 4.6 system card
Anthropic. Claude sonnet 4.6 system card. https://www.anthropic.com/claude-sonnet-4-6-system-card, 2026
2026
-
[5]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
From concept drift to model degradation: An overview on performance-aware drift detectors
Firas Bayram, Bestoun S. Ahmed, and Andreas Kassler. From concept drift to model degradation: An overview on performance-aware drift detectors. Knowledge-Based Systems, 245:108632, 2022
2022
-
[7]
Curriculum learning
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009
2009
-
[8]
Random search for hyper-parameter optimization
James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, February 2012. ISSN 1532-4435
2012
-
[9]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[10]
Language models are few-shot learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[11]
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024
work page Pith review arXiv 2024
-
[12]
Humans or llms as the judge? A study on judgement biases
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669, 2024
-
[13]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[14]
Avo: Agentic variation operators for autonomous evolutionary search
Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, et al. Avo: Agentic variation operators for autonomous evolutionary search. arXiv preprint arXiv:2603.24517, 2026
-
[15]
Fast abstractive summarization with reinforce-selected sentence rewriting
Yen-Chun Chen and Mohit Bansal. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080, 2018
-
[16]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[17]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bing-Li Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Dama...
2025
-
[19]
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023
work page internal anchor Pith review arXiv 2023
-
[20]
Automlgen: Navigating fine-grained optimization for coding agents
Shangheng Du, Xiangchao Yan, Dengyang Jiang, Jiakang Yuan, Yusong Hu, Xin Li, Liang He, Bo Zhang, and Lei Bai. Automlgen: Navigating fine-grained optimization for coding agents. arXiv preprint arXiv:2510.08511, 2025
-
[21]
Neural architecture search: A survey
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 2019
2019
-
[22]
KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024
work page internal anchor Pith review arXiv 2024
-
[23]
BOHB: Robust and efficient hyperparameter optimization at scale
Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, 2018
2018
-
[24]
Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, Zijie Guo, Zhijie Zhong, Shangheng Du, Weida Wang, Jinxin Shi, Yuhao Zhou, Xiaohan He, Zhiyin Yu, Fangchen Yu, Bihao Zhan, Qihao Zheng, Jiamin Wu, Mianxin Liu, Chi Zhang, Shaowei Hou, Shuya Li, Yankai Jiang, Wenjie Lou, Lil...
-
[25]
Promptbreeder: Self-referential self-improvement via prompt evolution
Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023
-
[26]
AutoML: Methods, Systems, Challenges
Matthias Feurer and Frank Hutter. AutoML: Methods, Systems, Challenges. Springer, 2019
2019
-
[27]
Efficient and robust automated machine learning
Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (NeurIPS), 2015
2015
- [28]
-
[29]
Nemotron-flash: Towards latency-optimal hybrid small language models
Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Nemotron-flash: Towards latency-optimal hybrid small language models. In The Thirty-ninth Annual Conference on Neural Information Processing Syste...
2025
-
[30]
A survey on concept drift adaptation
João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1--44:37, 2014
2014
-
[31]
Data shapley: Equitable valuation of data for machine learning
Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In Proceedings of the 36th International Conference on Machine Learning, 2019
2019
-
[32]
Samsum corpus: A human-annotated dialogue dataset for abstractive summarization
Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70--79, 2019
2019
-
[33]
Towards an ai co-scientist
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025
work page internal anchor Pith review arXiv 2025
-
[35]
On calibration of modern neural networks
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. ArXiv, abs/1706.04599, 2017. URL https://api.semanticscholar.org/CorpusID:28671436
-
[36]
Llm-as-a-judge: Reassessing the performance of llms in extractive qa
Xanh Ho, Jiahao Huang, Florian Boudin, and Akiko Aizawa. Llm-as-a-judge: Reassessing the performance of llms in extractive qa. arXiv preprint arXiv:2504.11972, 2025
-
[37]
A practical guide to support vector classification
Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide to support vector classification. 2003
2003
-
[39]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021b. URL https://arxiv.org/abs/2106.09685
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [40]
-
[41]
Mlagentbench: Evaluating language agents on machine learning experimentation
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023
-
[42]
Quality or quantity? on data scale and diversity in adapting large language models for low-resource translation
Vivek Iyer, Bhavitvya Malik, Pavel Stepachev, Pinzhen Chen, Barry Haddow, and Alexandra Birch. Quality or quantity? on data scale and diversity in adapting large language models for low-resource translation. In Conference on Machine Translation, 2024. URL https://api.semanticscholar.org/CorpusID:271947083
2024
-
[43]
A survey on llm-as-a-judge
Xuhui Jiang et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Aide: AI-driven exploration in the space of code
Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138, 2025
-
[45]
Dsbench: How far are data science agents to becoming data science experts?
Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. Dsbench: How far are data science agents to becoming data science experts? arXiv preprint arXiv:2409.07703, 2024
-
[46]
Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601--1611, 2017
2017
-
[47]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[48]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Parth Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saumya Vardhaman, Matei Zaharia, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023
work page internal anchor Pith review arXiv 2023
-
[49]
Prometheus: Inducing fine-grained evaluation capability in language models
Seungone Kim, Jamin Shin, Yejin Choi, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024
2024
-
[50]
Overcoming catastrophic forgetting in neural networks
James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:35...
2016
-
[51]
Overcoming catastrophic forgetting in neural networks
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwińska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521--3526, 2017
2017
-
[52]
Self-paced learning for latent variable models
M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems (NeurIPS), 2010
2010
-
[53]
An evaluation dataset for intent classification and out-of-scope prediction
Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of EMNLP-IJCNLP, pages 1311--1316, 2019
2019
-
[54]
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
Haitao Li et al. Llms-as-judges: A comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024
work page internal anchor Pith review arXiv 2024
-
[55]
Hyperband: A novel bandit-based approach to hyperparameter optimization
Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. In Advances in Neural Information Processing Systems, 2017
2017
-
[56]
Ft-dojo: Towards autonomous llm fine-tuning with language agents
Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, and Jiang Bian. Ft-dojo: Towards autonomous llm fine-tuning with language agents. arXiv preprint arXiv:2603.01712, 2026
-
[57]
The llama 3 herd of models
Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
Gradient episodic memory for continual learning
David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017
2017
-
[59]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024
work page internal anchor Pith review arXiv 2024
-
[60]
Teacher--student curriculum learning
Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher--student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732--3740, 2019
2019
-
[61]
Dataperf: Benchmarks for data-centric ai development
Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, et al. Dataperf: Benchmarks for data-centric ai development. In Advances in Neural Information Processing Systems (NeurIPS Datasets and Benchmarks Track), 2023
2023
-
[62]
Mle-star: Machine learning engineering agent via search and targeted refinement
Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, and Tomas Pfister. Mle-star: Machine learning engineering agent via search and targeted refinement. arXiv preprint arXiv:2506.15692, 2025
2025
-
[63]
Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
2018
-
[64]
Sculpting subspaces: Constrained full fine-tuning in llms for continual learning
Nikhil Shivakumar Nayak, Krishnateja Killamsetty, Ligong Han, Abhishek Bhandwaldar, Prateek Chanda, Kai Xu, Hao Wang, Aldo Pareja, Oleg Silkin, Mustafa Eyceoz, et al. Sculpting subspaces: Constrained full fine-tuning in llms for continual learning. arXiv preprint arXiv:2504.07097, 2025
2025
-
[65]
Towards llms robustness to changes in prompt format styles
Lilian Ngweta, Kiran Kate, Jason Tsay, and Yara Rizk. Towards llms robustness to changes in prompt format styles. In North American Chapter of the Association for Computational Linguistics, 2025. URL https://api.semanticscholar.org/CorpusID:277634322
2025
-
[66]
Confident learning: Estimating uncertainty in dataset labels
Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research (JAIR), 2021
2021
-
[67]
Axolotl
OpenAccess-AI-Collective. Axolotl. https://github.com/OpenAccess-AI-Collective/axolotl, 2024
2024
-
[68]
Optimizing instructions and demonstrations for multi-stage language model programs
Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
2024
-
[69]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022a
2022
-
[70]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022b
2022
-
[71]
Unveiling the secret recipe: A guide for supervised fine-tuning small llms
Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, Wenlong Zhao, Seungwook Han, Abhishek Bhandwaldar, Guangxuan Xu, Kai Xu, et al. Unveiling the secret recipe: A guide for supervised fine-tuning small llms. arXiv preprint arXiv:2412.13337, 2024
2024
-
[72]
Continual lifelong learning with neural networks: A review
German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113: 54--71, 2019
2019
-
[73]
Dataset shift in machine learning
Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset shift in machine learning. 2009. URL https://api.semanticscholar.org/CorpusID:61294087
2009
-
[75]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 2023
2023
-
[76]
Posttrainbench: Can llm agents automate llm post-training?
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. Posttrainbench: Can llm agents automate llm post-training? arXiv preprint arXiv:2603.08640, 2026
2026
-
[77]
Data programming: Creating large training sets, quickly
Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, 2016
2016
-
[78]
Snorkel: Rapid training data creation with weak supervision
Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. The VLDB Journal, 29: 709--730, 2020
2020
-
[79]
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625: 468--475, 2024
2024
-
[80]
A neural attention model for abstractive sentence summarization
Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379--389, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL http://dx.doi.org/10.18653/v1/D15-1044
2015
-
[81]
Term-weighting approaches in automatic text retrieval
Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag., 24: 513--523, 1988. URL https://api.semanticscholar.org/CorpusID:7725217
1988
-
[82]
Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements
Jürgen Schmidhuber. Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements. Artificial General Intelligence, pages 199--226, 2006
2006
-
[83]
Active learning literature survey
Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin--Madison, 2009
2009