pith. machine review for the scientific record.

arxiv: 2604.09791 · v1 · submitted 2026-04-10 · 💻 cs.AI · cs.CL · cs.LG · cs.MA

Recognition: 2 Lean theorem links

Pioneer Agent: Continual Improvement of Small Language Models in Production

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.MA
keywords: small language models · continual improvement · automated adaptation · closed-loop system · error diagnosis · regression constraints · production deployment · AdaptFT-Bench

The pith

Pioneer Agent automates the engineering loop for adapting small language models from task descriptions or failure logs while preventing regressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Small language models offer low-cost inference, but adapting them to a specific task requires repeated manual choices about data curation, error diagnosis, evaluation design, and preventing regressions on prior capabilities. Pioneer Agent replaces that loop with a closed system that operates in cold-start mode from a plain-language task description and in production mode from labeled failures. It gathers or creates data, identifies error patterns, synthesizes targeted training examples, and retrains under explicit constraints that protect earlier performance. A reader would care because the surrounding decisions, not the training step itself, are the main barrier to using small models reliably in changing production settings. The paper demonstrates that this automation succeeds on standard task benchmarks and on a new test of noisy adaptation logs where unconstrained retraining fails.

Core claim

The paper claims that a closed-loop system called Pioneer Agent can automate the entire adaptation lifecycle of small language models. In cold-start mode it receives only a natural-language task description, acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode it receives a deployed model together with labeled failures, diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. The approach is tested on eight cold-start benchmarks across reasoning, math, code generation, summarization, and classification; on the new AdaptFT-Bench of synthetic inference logs with progressively increasing noise; and on two production-style deployments built from public benchmark tasks.
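To make the two-mode claim concrete, here is a minimal, runnable sketch of how such a closed loop could be wired together. Every function name and the simple accept/reject regression gate are this review's illustration, not the authors' implementation, which the paper does not publish.

```python
"""Toy skeleton of a Pioneer-Agent-style loop; all components are stubs
standing in for machinery the paper describes but does not release."""
from dataclasses import dataclass, field

@dataclass
class Model:
    score: float = 0.5                 # stand-in for original-task performance
    history: list = field(default_factory=list)

def diagnose(failures):
    # Production mode, step 1: cluster labeled failures into error patterns.
    return sorted({f["pattern"] for f in failures})

def synthesize(patterns):
    # Step 2: build targeted training examples for each diagnosed pattern.
    return [f"targeted example for: {p}" for p in patterns]

def retrain(model, data, original_eval):
    # Step 3: train a candidate, then verify it on the held-out original
    # task; accept it only if prior performance is preserved. This reduces
    # the paper's regression constraints to a simple accept/reject gate.
    candidate = Model(score=model.score + 0.05 * len(data),
                      history=model.history + [len(data)])
    return candidate if original_eval(candidate) >= original_eval(model) else model

def production_step(model, failures, original_eval):
    return retrain(model, synthesize(diagnose(failures)), original_eval)

failures = [{"pattern": "date parsing"}, {"pattern": "negation"}]
model = production_step(Model(), failures, original_eval=lambda m: m.score)
print(model)
```

Cold-start mode would replace diagnose/synthesize with data acquisition and evaluation-set construction from the task description; the gate at the end is where the paper's explicit constraints would live.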

What carries the argument

Pioneer Agent, a closed-loop system that automates data acquisition, error-pattern diagnosis, curriculum synthesis, and regression-constrained retraining for small language models.

If this is right

  • The agent can begin from a natural-language task description alone and produce specialized models for reasoning, math, code generation, summarization, and classification tasks.
  • In adaptation scenarios with progressively noisier logs, the agent improves or preserves performance across all tested cases while standard retraining produces large drops.
  • The regression constraints prevent the performance losses that occur when models are retrained without them.
  • In simulated production deployments the agent raises intent classification accuracy and entity recognition quality on public benchmark tasks.
  • The optimization process inside the agent can surface effective training choices such as chain-of-thought supervision and data-quality focus without being told to look for them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations could maintain fleets of small specialized models that update themselves as user patterns shift, lowering the human cost of ongoing maintenance.
  • The error-diagnosis step may transfer to other machine-learning systems where logs of failures are available but expert review is limited.
  • If the constraints are made tunable, the same loop could balance adaptation to new tasks against retention of old ones in multi-task or continual-learning settings.

Load-bearing premise

That automated diagnosis of error patterns from labeled failures will reliably identify fixable issues and that the synthesized training data plus regression constraints will produce net gains without introducing hidden regressions or requiring human intervention in complex real deployments.

What would settle it

A production deployment in which the agent receives labeled failures, follows its diagnosis and synthesis steps, applies the regression constraints, yet the resulting model shows clear degradation on the original task compared with the starting model.

Original abstract

Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Pioneer Agent, a closed-loop automated system for adapting and continually improving small language models. In cold-start mode, given only a natural-language task description, the agent acquires data, builds evaluation sets, and iteratively trains by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given labeled failures from a deployed model, it diagnoses error patterns, synthesizes targeted training data, and retrains under explicit regression constraints. The authors introduce AdaptFT-Bench, a synthetic benchmark of inference logs with progressively increasing noise to evaluate the full adaptation loop. They report gains of 1.6-83.8 points over base models on eight cold-start benchmarks (reasoning, math, code, summarization, classification), improvement or preservation on all seven AdaptFT-Bench scenarios (vs. naive retraining degrading up to 43 points), and large gains on two production-style tasks constructed from public benchmarks (intent classification 84.9% to 99.3%, Entity F1 0.345 to 0.810). The system is also shown to discover effective strategies such as chain-of-thought supervision purely from downstream feedback.

Significance. If the results hold, the work has substantial practical significance for SLM deployment by automating the engineering loop of data curation, diagnosis, and regression avoidance that currently demands extensive human effort. The introduction of AdaptFT-Bench is a clear positive contribution as a standardized testbed for adaptation systems. The empirical demonstration that effective training strategies can be discovered from feedback alone is noteworthy and could reduce expert intervention. The paper earns credit for its focus on reproducible empirical evaluation across diverse tasks and for highlighting the contrast with naive retraining.

major comments (3)
  1. [§5.3] Production-style deployments: the two reported deployments are constructed from public benchmark tasks rather than real production failure logs; this does not test the diagnosis step against the long-tail, multi-cause, or distribution-shifted failures that are central to the production-mode claim of achieving net gains without hidden regressions or human intervention.
  2. [§4] AdaptFT-Bench: while the benchmark uses synthetic logs with progressively increasing noise, no independent accuracy metric is reported for the LLM-based error-pattern diagnosis against ground-truth causes; without this, it is impossible to determine whether the observed gains (or preservation) are attributable to reliable diagnosis or to other system components.
  3. [§3.2] Regression constraints: the explicit regression constraints are described as preventing degradation (contrasted with naive retraining losses of up to 43 points), but their precise formulation (loss terms, penalty application, or verification on held-out original-task distributions) is not provided, making it impossible to verify that they guard against regressions beyond the reported metrics.
minor comments (3)
  1. [§5.1] The abstract and §5.1 report a wide range of improvements (1.6-83.8 points) but do not include per-benchmark breakdowns, variance across runs, or statistical significance tests in the summary tables.
  2. [§3] A system diagram illustrating the closed-loop flow (data acquisition, diagnosis, synthesis, constrained retraining, verification) would improve clarity in the method section.
  3. [§5.4] The discussion of discovered strategies (chain-of-thought, task-specific optimization, quality-focused curation) would benefit from concrete examples of synthesized prompts or data in an appendix.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.

Point-by-point responses
  1. Referee: [§5.3] Production-style deployments: the two reported deployments are constructed from public benchmark tasks rather than real production failure logs; this does not test the diagnosis step against the long-tail, multi-cause, or distribution-shifted failures that are central to the production-mode claim of achieving net gains without hidden regressions or human intervention.

    Authors: We acknowledge that the two production-style tasks are constructed from public benchmarks (intent classification and entity recognition) rather than proprietary production logs. This design choice enables full reproducibility and controlled injection of failure patterns while still exercising the diagnosis, synthesis, and regression-constrained retraining pipeline. The constructed logs incorporate multi-cause noise and distribution shifts to approximate production conditions, as described in §5.3. We agree that real long-tail production logs would provide stronger external validation. In the revision we will expand §5.3 to explicitly state this limitation, detail how the synthetic failures emulate multi-cause and shifted distributions, and note that future work will seek access to anonymized production logs for additional evaluation. revision: partial

  2. Referee: [§4] AdaptFT-Bench: while the benchmark uses synthetic logs with progressively increasing noise, no independent accuracy metric is reported for the LLM-based error-pattern diagnosis against ground-truth causes; without this, it is impossible to determine whether the observed gains (or preservation) are attributable to reliable diagnosis or to other system components.

    Authors: The referee correctly notes the absence of a standalone diagnosis accuracy metric. The current evaluation in §4 reports end-to-end adaptation performance across noise levels, which implicitly depends on diagnosis quality. To isolate this component, the revised manuscript will add a new table and accompanying text in §4 that reports the LLM-based diagnosis accuracy against the ground-truth injected error patterns (e.g., precision/recall per noise type and overall F1). This will allow readers to attribute gains to diagnosis reliability versus other pipeline elements. revision: yes

  3. Referee: [§3.2] Regression constraints: the explicit regression constraints are described as preventing degradation (contrasted with naive retraining losses of up to 43 points), but their precise formulation (loss terms, penalty application, or verification on held-out original-task distributions) is not provided, making it impossible to verify that they guard against regressions beyond the reported metrics.

    Authors: We agree that the precise mathematical formulation of the regression constraints was insufficiently detailed in §3.2. The constraints combine the task-specific loss with a penalty term that measures deviation from the original model's behavior on a held-out subset of the initial task distribution, implemented via KL divergence on output logits (with a tunable coefficient). Verification occurs by evaluating the retrained model on the same held-out original-task set after each iteration. The revised manuscript will include the exact loss equation, hyperparameter values for the penalty, and pseudocode for the constrained optimization loop. revision: yes
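Three editorial sketches follow, one per response above; all are illustrations by this review, not code from the paper. First, for response 1: a toy picture of constructed failure logs with controllable noise. The corruption scheme, rates, and field names are invented here, not the benchmark's actual construction.

```python
import random

def make_logs(examples, noise_rate, seed=0):
    """Toy constructed-log generator: pairs each input with a label and
    corrupts a fraction noise_rate of the labels, mimicking progressively
    noisier adaptation scenarios. Illustrative only."""
    rng = random.Random(seed)
    labels = sorted({y for _, y in examples})
    logs = []
    for x, y in examples:
        if rng.random() < noise_rate:
            y = rng.choice([l for l in labels if l != y])  # injected failure
        logs.append({"input": x, "label": y})
    return logs

examples = [("turn on the lights", "device_control"),
            ("what's the weather", "weather"),
            ("play some jazz", "media")]
for rate in (0.0, 0.2, 0.4):                 # progressively increasing noise
    print(rate, make_logs(examples, rate))
```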
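Second, the diagnosis-accuracy table promised in response 2 could be computed along the following lines; the pairing of log IDs with ground-truth patterns is a hypothetical data layout, not the paper's.

```python
from collections import Counter

def diagnosis_metrics(gold, predicted):
    """Per-pattern precision/recall/F1 for error-pattern diagnosis,
    comparing predicted (log_id, pattern) pairs against ground truth."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = Counter(p for _, p in gold_set & pred_set)   # correct diagnoses
    fp = Counter(p for _, p in pred_set - gold_set)   # spurious patterns
    fn = Counter(p for _, p in gold_set - pred_set)   # missed patterns
    metrics = {}
    for pattern in {p for _, p in gold_set | pred_set}:
        prec = tp[pattern] / (tp[pattern] + fp[pattern]) if tp[pattern] + fp[pattern] else 0.0
        rec = tp[pattern] / (tp[pattern] + fn[pattern]) if tp[pattern] + fn[pattern] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[pattern] = {"precision": prec, "recall": rec, "f1": f1}
    return metrics

gold = [(1, "date parsing"), (2, "negation")]
pred = [(1, "date parsing"), (2, "tokenization")]  # one hit, one miss, one spurious
print(diagnosis_metrics(gold, pred))
```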
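Third, a minimal sketch of the constraint described in response 3: task loss plus a KL penalty tying the retrained model's output distribution to the frozen original's on held-out original-task inputs. The KL direction and the coefficient value are assumptions on our part, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def constrained_loss(task_loss, student_logits, ref_logits, beta=0.1):
    """Task loss plus a KL penalty, computed here as KL(reference || student)
    on output distributions; beta is the tunable coefficient the rebuttal
    mentions. Direction and value are this review's assumptions."""
    log_p = F.log_softmax(student_logits, dim=-1)    # retrained model
    q = F.softmax(ref_logits.detach(), dim=-1)       # frozen original model
    penalty = F.kl_div(log_p, q, reduction="batchmean")
    return task_loss + beta * penalty

# Toy usage on random logits over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
reference = torch.randn(4, 10)
loss = constrained_loss(torch.tensor(1.0), student, reference)
loss.backward()
```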

Circularity Check

0 steps flagged

No circularity: empirical system evaluation on external benchmarks

Full rationale

The paper is a systems description of an automated adaptation loop for small language models, evaluated via cold-start benchmarks, AdaptFT-Bench (synthetic logs), and two production-style deployments built from public tasks. All reported gains (e.g., 1.6-83.8 points, intent classification 84.9%→99.3%) are measured against independent external test sets rather than any internal equations, fitted parameters, or self-referential predictions. No derivations, uniqueness theorems, or ansatzes are invoked; the work contains no load-bearing self-citations and does not reduce outputs to inputs by construction. Claims rest on observed benchmark performance, grounding the evaluation in external data rather than in the system's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that an agent can autonomously perform effective diagnosis and data synthesis; the abstract does not detail the internal mechanisms or prove they generalize beyond the tested scenarios.

free parameters (1)
  • jointly optimized data, hyperparameters, and learning strategy
    The agent selects these during each iteration; their specific values are not fixed in advance but discovered from downstream feedback.
axioms (1)
  • domain assumption: Small language models can be improved by targeted fine-tuning on automatically curated data without catastrophic forgetting when regression constraints are applied.
    Invoked throughout the description of both cold-start and production modes.
invented entities (1)
  • Pioneer Agent (no independent evidence)
    purpose: Closed-loop automation of model adaptation and continual improvement
    The core new system introduced to handle data curation, diagnosis, and retraining.

pith-pipeline@v0.9.0 · 5629 in / 1662 out tokens · 46053 ms · 2026-05-10T17:44:04.829353+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    GLiNER2-PII achieves the highest span-level F1 on the SPY benchmark by fine-tuning a small GLiNER2 model on a 4,910-example multilingual synthetic PII corpus.

Reference graph

Works this paper leans on

109 extracted references · 58 canonical work pages · cited by 1 Pith paper · 31 internal anchors
