pith. machine review for the scientific record.

arxiv: 2604.09791 · v1 · submitted 2026-04-10 · 💻 cs.AI · cs.CL · cs.LG · cs.MA

Recognition: 2 Lean theorem links

Pioneer Agent: Continual Improvement of Small Language Models in Production

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.MA
keywords: small language models · continual improvement · automated adaptation · closed-loop system · error diagnosis · regression constraints · production deployment · AdaptFT-Bench

The pith

Pioneer Agent automates the engineering loop for adapting small language models from task descriptions or failure logs while preventing regressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Small language models offer low-cost inference, but adapting them to a specific task requires repeated manual choices about data curation, error diagnosis, evaluation design, and preventing regressions on prior capabilities. Pioneer Agent replaces that loop with a closed system that operates in cold-start mode from a plain-language task description and in production mode from labeled failures. It gathers or creates data, identifies error patterns, synthesizes targeted training examples, and retrains under explicit constraints that protect earlier performance. A reader would care because the surrounding decisions, not the training step itself, are the main barrier to using small models reliably in changing production settings. The paper demonstrates that this automation succeeds on standard task benchmarks and on a new test of noisy adaptation logs where unconstrained retraining fails.

Core claim

The paper claims that a closed-loop system called Pioneer Agent can automate the entire adaptation lifecycle of small language models. In cold-start mode it receives only a natural-language task description, acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode it receives a deployed model together with labeled failures, diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. The approach is tested on eight cold-start benchmarks across reasoning, math, code generation, summarization, and classification; on the new AdaptFT-Bench of synthetic inference logs with progressively increasing noise; and on two production-style deployments built from public benchmark tasks.
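To make the two-mode claim concrete, here is a minimal, runnable sketch of how such a closed loop could be wired together. Every function name and the simple accept/reject regression gate are this review's illustration, not the authors' implementation, which the paper does not publish.

```python
"""Toy skeleton of a Pioneer-Agent-style loop; all components are stubs
standing in for machinery the paper describes but does not release."""
from dataclasses import dataclass, field

@dataclass
class Model:
    score: float = 0.5                 # stand-in for original-task performance
    history: list = field(default_factory=list)

def diagnose(failures):
    # Production mode, step 1: cluster labeled failures into error patterns.
    return sorted({f["pattern"] for f in failures})

def synthesize(patterns):
    # Step 2: build targeted training examples for each diagnosed pattern.
    return [f"targeted example for: {p}" for p in patterns]

def retrain(model, data, original_eval):
    # Step 3: train a candidate, then verify it on the held-out original
    # task; accept it only if prior performance is preserved. This reduces
    # the paper's regression constraints to a simple accept/reject gate.
    candidate = Model(score=model.score + 0.05 * len(data),
                      history=model.history + [len(data)])
    return candidate if original_eval(candidate) >= original_eval(model) else model

def production_step(model, failures, original_eval):
    return retrain(model, synthesize(diagnose(failures)), original_eval)

failures = [{"pattern": "date parsing"}, {"pattern": "negation"}]
model = production_step(Model(), failures, original_eval=lambda m: m.score)
print(model)
```

Cold-start mode would replace diagnose/synthesize with data acquisition and evaluation-set construction from the task description; the gate at the end is where the paper's explicit constraints would live.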

What carries the argument

Pioneer Agent, a closed-loop system that automates data acquisition, error-pattern diagnosis, curriculum synthesis, and regression-constrained retraining for small language models.

If this is right

  • The agent can begin from a natural-language task description alone and produce specialized models for reasoning, math, code generation, summarization, and classification tasks.
  • In adaptation scenarios with progressively noisier logs, the agent improves or preserves performance across all tested cases while standard retraining produces large drops.
  • The regression constraints prevent the performance losses that occur when models are retrained without them.
  • In simulated production deployments the agent raises intent classification accuracy and entity recognition quality on public benchmark tasks.
  • The optimization process inside the agent can surface effective training choices such as chain-of-thought supervision and data-quality focus without being told to look for them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations could maintain fleets of small specialized models that update themselves as user patterns shift, lowering the human cost of ongoing maintenance.
  • The error-diagnosis step may transfer to other machine-learning systems where logs of failures are available but expert review is limited.
  • If the constraints are made tunable, the same loop could balance adaptation to new tasks against retention of old ones in multi-task or continual-learning settings.

Load-bearing premise

That automated diagnosis of error patterns from labeled failures will reliably identify fixable issues and that the synthesized training data plus regression constraints will produce net gains without introducing hidden regressions or requiring human intervention in complex real deployments.

What would settle it

A production deployment in which the agent receives labeled failures, follows its diagnosis and synthesis steps, applies the regression constraints, yet the resulting model shows clear degradation on the original task compared with the starting model.

Original abstract

Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Pioneer Agent, a closed-loop automated system for adapting and continually improving small language models. In cold-start mode, given only a natural-language task description, the agent acquires data, builds evaluation sets, and iteratively trains by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given labeled failures from a deployed model, it diagnoses error patterns, synthesizes targeted training data, and retrains under explicit regression constraints. The authors introduce AdaptFT-Bench, a synthetic benchmark of inference logs with progressively increasing noise to evaluate the full adaptation loop. They report gains of 1.6-83.8 points over base models on eight cold-start benchmarks (reasoning, math, code, summarization, classification), improvement or preservation on all seven AdaptFT-Bench scenarios (vs. naive retraining degrading up to 43 points), and large gains on two production-style tasks constructed from public benchmarks (intent classification 84.9% to 99.3%, Entity F1 0.345 to 0.810). The system is also shown to discover effective strategies such as chain-of-thought supervision purely from downstream feedback.

Significance. If the results hold, the work has substantial practical significance for SLM deployment by automating the engineering loop of data curation, diagnosis, and regression avoidance that currently demands extensive human effort. The introduction of AdaptFT-Bench is a clear positive contribution as a standardized testbed for adaptation systems. The empirical demonstration that effective training strategies can be discovered from feedback alone is noteworthy and could reduce expert intervention. The paper earns credit for its focus on reproducible empirical evaluation across diverse tasks and for highlighting the contrast with naive retraining.

major comments (3)
  1. [§5.3] Production-style deployments: the two reported deployments are constructed from public benchmark tasks rather than real production failure logs; this does not test the diagnosis step against the long-tail, multi-cause, or distribution-shifted failures that are central to the production-mode claim of achieving net gains without hidden regressions or human intervention.
  2. [§4] AdaptFT-Bench: while the benchmark uses synthetic logs with progressively increasing noise, no independent accuracy metric is reported for the LLM-based error-pattern diagnosis against ground-truth causes; without this, it is impossible to determine whether the observed gains (or preservation) are attributable to reliable diagnosis or to other system components.
  3. [§3.2] Regression constraints: the explicit regression constraints are described as preventing degradation (contrasted with naive retraining losses of up to 43 points), but their precise formulation (loss terms, penalty application, or verification on held-out original-task distributions) is not provided, making it impossible to verify that they guard against regressions beyond the reported metrics.
minor comments (3)
  1. [§5.1] The abstract and §5.1 report a wide range of improvements (1.6-83.8 points) but do not include per-benchmark breakdowns, variance across runs, or statistical significance tests in the summary tables.
  2. [§3] A system diagram illustrating the closed-loop flow (data acquisition, diagnosis, synthesis, constrained retraining, verification) would improve clarity in the method section.
  3. [§5.4] The discussion of discovered strategies (chain-of-thought, task-specific optimization, quality-focused curation) would benefit from concrete examples of synthesized prompts or data in an appendix.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.

Point-by-point responses
  1. Referee: [§5.3] Production-style deployments: the two reported deployments are constructed from public benchmark tasks rather than real production failure logs; this does not test the diagnosis step against the long-tail, multi-cause, or distribution-shifted failures that are central to the production-mode claim of achieving net gains without hidden regressions or human intervention.

    Authors: We acknowledge that the two production-style tasks are constructed from public benchmarks (intent classification and entity recognition) rather than proprietary production logs. This design choice enables full reproducibility and controlled injection of failure patterns while still exercising the diagnosis, synthesis, and regression-constrained retraining pipeline. The constructed logs incorporate multi-cause noise and distribution shifts to approximate production conditions, as described in §5.3. We agree that real long-tail production logs would provide stronger external validation. In the revision we will expand §5.3 to explicitly state this limitation, detail how the synthetic failures emulate multi-cause and shifted distributions, and note that future work will seek access to anonymized production logs for additional evaluation. revision: partial

  2. Referee: [§4] AdaptFT-Bench: while the benchmark uses synthetic logs with progressively increasing noise, no independent accuracy metric is reported for the LLM-based error-pattern diagnosis against ground-truth causes; without this, it is impossible to determine whether the observed gains (or preservation) are attributable to reliable diagnosis or to other system components.

    Authors: The referee correctly notes the absence of a standalone diagnosis accuracy metric. The current evaluation in §4 reports end-to-end adaptation performance across noise levels, which implicitly depends on diagnosis quality. To isolate this component, the revised manuscript will add a new table and accompanying text in §4 that reports the LLM-based diagnosis accuracy against the ground-truth injected error patterns (e.g., precision/recall per noise type and overall F1). This will allow readers to attribute gains to diagnosis reliability versus other pipeline elements. revision: yes

  3. Referee: [§3.2] Regression constraints: the explicit regression constraints are described as preventing degradation (contrasted with naive retraining losses of up to 43 points), but their precise formulation (loss terms, penalty application, or verification on held-out original-task distributions) is not provided, making it impossible to verify that they guard against regressions beyond the reported metrics.

    Authors: We agree that the precise mathematical formulation of the regression constraints was insufficiently detailed in §3.2. The constraints combine the task-specific loss with a penalty term that measures deviation from the original model's behavior on a held-out subset of the initial task distribution, implemented via KL divergence on output logits (with a tunable coefficient). Verification occurs by evaluating the retrained model on the same held-out original-task set after each iteration. The revised manuscript will include the exact loss equation, hyperparameter values for the penalty, and pseudocode for the constrained optimization loop. revision: yes
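Three editorial sketches follow, one per response above; all are illustrations by this review, not code from the paper. First, for response 1: a toy picture of constructed failure logs with controllable noise. The corruption scheme, rates, and field names are invented here, not the benchmark's actual construction.

```python
import random

def make_logs(examples, noise_rate, seed=0):
    """Toy constructed-log generator: pairs each input with a label and
    corrupts a fraction noise_rate of the labels, mimicking progressively
    noisier adaptation scenarios. Illustrative only."""
    rng = random.Random(seed)
    labels = sorted({y for _, y in examples})
    logs = []
    for x, y in examples:
        if rng.random() < noise_rate:
            y = rng.choice([l for l in labels if l != y])  # injected failure
        logs.append({"input": x, "label": y})
    return logs

examples = [("turn on the lights", "device_control"),
            ("what's the weather", "weather"),
            ("play some jazz", "media")]
for rate in (0.0, 0.2, 0.4):                 # progressively increasing noise
    print(rate, make_logs(examples, rate))
```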
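Second, the diagnosis-accuracy table promised in response 2 could be computed along the following lines; the pairing of log IDs with ground-truth patterns is a hypothetical data layout, not the paper's.

```python
from collections import Counter

def diagnosis_metrics(gold, predicted):
    """Per-pattern precision/recall/F1 for error-pattern diagnosis,
    comparing predicted (log_id, pattern) pairs against ground truth."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = Counter(p for _, p in gold_set & pred_set)   # correct diagnoses
    fp = Counter(p for _, p in pred_set - gold_set)   # spurious patterns
    fn = Counter(p for _, p in gold_set - pred_set)   # missed patterns
    metrics = {}
    for pattern in {p for _, p in gold_set | pred_set}:
        prec = tp[pattern] / (tp[pattern] + fp[pattern]) if tp[pattern] + fp[pattern] else 0.0
        rec = tp[pattern] / (tp[pattern] + fn[pattern]) if tp[pattern] + fn[pattern] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[pattern] = {"precision": prec, "recall": rec, "f1": f1}
    return metrics

gold = [(1, "date parsing"), (2, "negation")]
pred = [(1, "date parsing"), (2, "tokenization")]  # one hit, one miss, one spurious
print(diagnosis_metrics(gold, pred))
```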
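Third, a minimal sketch of the constraint described in response 3: task loss plus a KL penalty tying the retrained model's output distribution to the frozen original's on held-out original-task inputs. The KL direction and the coefficient value are assumptions on our part, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def constrained_loss(task_loss, student_logits, ref_logits, beta=0.1):
    """Task loss plus a KL penalty, computed here as KL(reference || student)
    on output distributions; beta is the tunable coefficient the rebuttal
    mentions. Direction and value are this review's assumptions."""
    log_p = F.log_softmax(student_logits, dim=-1)    # retrained model
    q = F.softmax(ref_logits.detach(), dim=-1)       # frozen original model
    penalty = F.kl_div(log_p, q, reduction="batchmean")
    return task_loss + beta * penalty

# Toy usage on random logits over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
reference = torch.randn(4, 10)
loss = constrained_loss(torch.tensor(1.0), student, reference)
loss.backward()
```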

Circularity Check

0 steps flagged

No circularity: empirical system evaluation on external benchmarks

Full rationale

The paper is a systems description of an automated adaptation loop for small language models, evaluated via cold-start benchmarks, AdaptFT-Bench (synthetic logs), and two production-style deployments built from public tasks. All reported gains (e.g., 1.6-83.8 points, intent classification 84.9%→99.3%) are measured against independent external test sets rather than any internal equations, fitted parameters, or self-referential predictions. No derivations, uniqueness theorems, or ansatzes are invoked; the work contains no load-bearing self-citations and does not reduce outputs to inputs by construction. Claims rest on observed benchmark performance, grounding the evaluation in external data rather than in the system's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that an agent can autonomously perform effective diagnosis and data synthesis; the abstract does not detail the internal mechanisms or prove they generalize beyond the tested scenarios.

free parameters (1)
  • jointly optimized data, hyperparameters, and learning strategy
    The agent selects these during each iteration; their specific values are not fixed in advance but discovered from downstream feedback.
axioms (1)
  • domain assumption: Small language models can be improved by targeted fine-tuning on automatically curated data without catastrophic forgetting when regression constraints are applied.
    Invoked throughout the description of both cold-start and production modes.
invented entities (1)
  • Pioneer Agent (no independent evidence)
    purpose: Closed-loop automation of model adaptation and continual improvement
    The core new system introduced to handle data curation, diagnosis, and retraining.

pith-pipeline@v0.9.0 · 5629 in / 1662 out tokens · 46053 ms · 2026-05-10T17:44:04.829353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    GLiNER2-PII achieves the highest span-level F1 on the SPY benchmark by fine-tuning a small GLiNER2 model on a 4,910-example multilingual synthetic PII corpus.

Reference graph

Works this paper leans on

109 extracted references · 58 canonical work pages · cited by 1 Pith paper · 31 internal anchors
