Can Generalist Agents Automate Data Curation?

Adam Nguyen; Dawn Song; Feiyang Kang; Frederic Sala; Hanze Li; Jiaqi W. Ma; Mahavir Dabas; Ruoxi Jia

arxiv: 2606.04261 · v1 · pith:QLSOI33Ynew · submitted 2026-06-02 · 💻 cs.AI · cs.CL· cs.CV· cs.ET· cs.LG

Can Generalist Agents Automate Data Curation?

Feiyang Kang , Hanze Li , Adam Nguyen , Mahavir Dabas , Jiaqi W. Ma , Frederic Sala , Dawn Song , Ruoxi Jia This is my paper

Pith reviewed 2026-06-28 09:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CVcs.ETcs.LG

keywords data curationgeneralist agentsdata selectionagent benchmarksinstruction tuningvision-language modelsautomated research

0 comments

The pith

Scaffolded generalist agents autonomously compose data-selection policies that outperform published baselines at one-tenth the data budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether generalist coding agents can take over the iterative work of proposing, implementing, and revising data-selection policies for training AI models. It builds Curation-Bench to give agents direct access to inspect data, write policies, run a fixed training pipeline, and receive benchmark feedback. Out-of-the-box agents reach existing baselines quickly but mostly tweak local variants. Adding a scaffold that forces each step to cite, instantiate, and adapt prior published methods shifts the agent toward broader exploration. The result is a policy the agent built without human design input that beats strong baselines while using only one-tenth the data volume.

Core claim

In the vision-language instruction-tuning instantiation of Curation-Bench, the scaffolded agent composes a data-selection policy that outperforms strong published baselines at one-tenth their data budget, with no human design input supplied during the agent's iterations.

What carries the argument

The scaffold requiring each iteration to cite, instantiate, and adapt a prior method, which converts open-ended prompting into method-guided exploration inside the fixed Curation-Bench loop.

If this is right

Agents supplied with the scaffold can run complete data-curation loops without constant human oversight.
Method-guided adaptation produces data policies that exceed the performance of hand-designed baselines at substantially lower data volume.
The execution-research gap can be narrowed by requiring explicit citation and reuse of prior work rather than free-form invention.
Open-sourcing Curation-Bench makes it possible to test whether the same scaffold pattern transfers to other model families and tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the scaffold pattern holds, labs could shift human effort from writing selection rules to writing the scaffolds themselves.
The same loop might surface entirely new policy families that human researchers have not yet enumerated.
Testing at larger model scales or noisier real-world data sources would show whether the one-tenth budget advantage survives increased complexity.

Load-bearing premise

The single vision-language instruction-tuning setup with its fixed training recipe and evaluation suite stands in for the wider range of noisy, cross-domain data-curation loops that practitioners actually run.

What would settle it

Running the identical scaffolded agent on a different domain such as text-only pretraining and finding that it no longer beats the published baselines even at the reduced data budget.

Figures

Figures reproduced from arXiv: 2606.04261 by Adam Nguyen, Dawn Song, Feiyang Kang, Frederic Sala, Hanze Li, Jiaqi W. Ma, Mahavir Dabas, Ruoxi Jia.

**Figure 1.** Figure 1: Agentic data curation requires more than openended prompting. (a) We formulate data curation as a policysearch loop: the agent proposes a data policy, constructs training data, observes feedback from fixed training and evaluation, and revises the policy. (b) On 10k-example selection from LLaVA-665K dataset for LLaVA-1.5-7B training, open-ended prompting improves over random selection and human-designed… view at source ↗

**Figure 2.** Figure 2: Architecture of CURATION-BENCH. A coding agent inspects the candidate pool, implements data policies, and submits curated datasets. The harness validates each submission and scores it with a fixed training–evaluation pipeline. how current generalist coding agents interact with repositories: inspecting files, writing scripts, running commands, debugging failures, and reading logs. (P3) Contamination contr… view at source ↗

**Figure 3.** Figure 3: Agent data-curation results (open-prompt). The bars and the centerline show the average scores and the range of standard deviation, and the upper/lower wicks showing the min/max outcome of a session. Agents run 3 sessions each with 10 iterations, and the baseline is repeated for 10 independent runs. 100% iterations are successfully executed. (a). Fine-tuning pre-trained LLaVA-1.5-7B on curated subsets of 1… view at source ↗

**Figure 4.** Figure 4: and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation on the number of iterations (10-50) in each agent session (Claude Code). result is notable: average performance continues to improve within this range rather than clearly plateauing. Under open-prompting, the gains accumulate gradually, consistent with extended local search over source mixtures and related heuristics. Under the Heavy II scaffold, the best score is already found within the first … view at source ↗

read the original abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces Curation-Bench and shows scaffolded agents can compose a data policy that beats baselines at 1/10 data use in one vision-language setup, but the result stays narrow.

read the letter

The main thing here is a new benchmark that lets agents run a full data curation loop with fixed training and eval, plus the finding that open-ended agents mostly do local tweaks while scaffolds that force method citation and adaptation produce a policy outperforming published baselines on less data.

They do a clean job setting up the environment so agents get command-line access to inspect data and submit policies. The trajectory analysis that identifies the execution-research gap is straightforward and useful. Open-sourcing the benchmark and code is the part that actually lets others build on it. The scaffold result is concrete: the agent composes something better without human-designed policy input inside their test case.

The soft spot is exactly what the stress-test note flags. All the performance numbers and the gap observation come from a single vision-language instruction-tuning instantiation with one model, one recipe, and one evaluation suite. Nothing tests whether the same agent behavior or the 10x data efficiency holds when the domain changes, the model scale shifts, or the feedback becomes noisier. That limits how far the automation claim travels.

This is for people working on agent scaffolds for automated ML pipelines or data work. A reader who wants a standardized testbed for agent-driven curation will get value from the benchmark itself. The empirical result is worth checking but stays preliminary.

I would send it to peer review. The benchmark is new and the scaffold comparison is worth referee scrutiny even with the narrow scope.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while allowing agents command-line access to inspect data, implement policies, and iterate via a fixed pipeline. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published baselines within ten iterations, but trajectory analysis identifies an execution-research gap where agents tune local variants rather than explore new policy families. Scaffolds that require citing, instantiating, and adapting prior methods shift behavior toward method-guided exploration. The scaffolded agent autonomously composes a data-selection policy outperforming strong baselines at one-tenth their data budget. The work concludes that current agents can run the curation loop but reliable data research requires scaffolded method adaptation, and open-sources the benchmark and code.

Significance. If the empirical results hold under the reported controls, the paper contributes a reproducible benchmark and concrete evidence that generalist coding agents with targeted scaffolding can automate consequential steps in AI development pipelines. The open-sourcing of code and benchmark is a clear strength that supports follow-on work. The identification of the execution-research gap provides a useful diagnostic for agent capabilities in research-like tasks.

major comments (2)

[Abstract] Abstract: The central claim that the scaffolded agent demonstrates general automation of data curation by autonomously composing a superior policy (outperforming baselines at 1/10 data budget) rests on experiments confined to a single vision-language instruction-tuning instantiation with fixed model, recipe, and evaluation suite. This narrow scope is load-bearing for the broader conclusion that 'reliable data research requires scaffolded method adaptation,' as the execution-research gap and scaffold benefits are measured only inside this benchmark; different domains, model scales, or noisier feedback loops could change how data policies interact with training dynamics. A concrete test would be to instantiate the same agent and scaffolds on at least one additional domain or scale.
[Abstract] Abstract / results section: The outperformance result is presented without reported details on variance across runs, statistical significance tests, or ablation of the fixed training pipeline, which is needed to establish that the 1/10 data-budget advantage is robust rather than an artifact of the specific instantiation.

minor comments (1)

[Abstract] The abstract and conclusion could more explicitly qualify the scope of the generalist-agent claim to the tested instantiation to avoid overgeneralization.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the scaffolded agent demonstrates general automation of data curation by autonomously composing a superior policy (outperforming baselines at 1/10 data budget) rests on experiments confined to a single vision-language instruction-tuning instantiation with fixed model, recipe, and evaluation suite. This narrow scope is load-bearing for the broader conclusion that 'reliable data research requires scaffolded method adaptation,' as the execution-research gap and scaffold benefits are measured only inside this benchmark; different domains, model scales, or noisier feedback loops could change how data policies interact with training dynamics. A concrete test would be to instantiate the same agent and scaffolds on at least one additional domain or scale.

Authors: Curation-Bench is intentionally designed with a fixed model, training recipe, and evaluation suite precisely to isolate agent behavior in the data-curation loop and to enable reproducible measurement of the execution-research gap. The manuscript already qualifies results as applying to this vision-language instruction-tuning instantiation. We will revise the abstract and conclusion to state the scope more explicitly and to frame the scaffold benefit and gap as observations within this controlled setting rather than as a general claim across all domains. revision: partial
Referee: [Abstract] Abstract / results section: The outperformance result is presented without reported details on variance across runs, statistical significance tests, or ablation of the fixed training pipeline, which is needed to establish that the 1/10 data-budget advantage is robust rather than an artifact of the specific instantiation.

Authors: We agree that reporting variance, statistical tests, and explicit clarification of the fixed pipeline would strengthen the presentation. In the revision we will add these details (variance across runs and significance tests for the reported outperformance) and state that the pipeline is deliberately fixed by benchmark design to control variables other than the data policy itself. revision: yes

standing simulated objections not resolved

Instantiating the agent and scaffolds on at least one additional domain or scale, which would require new benchmark setups, data, and compute not available for the current revision.

Circularity Check

0 steps flagged

No circularity; empirical benchmark results are self-contained

full rationale

The paper introduces Curation-Bench and reports direct empirical outcomes from running agents on a fixed vision-language instruction-tuning setup. No equations, fitted parameters, or derivations are present. The central claim (scaffolded agent composing an outperforming policy) is a measured experimental result on the open benchmark, not a reduction to prior inputs or self-citations. Self-citations, if any, are not load-bearing for the reported performance deltas. This matches the default case of an honest empirical paper with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; ledger is minimal and provisional.

axioms (1)

domain assumption The vision-language instruction-tuning task with fixed pipeline is a valid testbed for general data-curation automation.
The paper instantiates the benchmark on this task.

pith-pipeline@v0.9.1-grok · 5775 in / 1068 out tokens · 13794 ms · 2026-06-28T09:30:23.470823+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 5 canonical work pages · 2 internal anchors

[1]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Machine learning data practices through a data curation lens: An evaluation framework. InPro- ceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1055–1067. Dustin Brunner. 2023. Datacomp challenge. Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksi...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Mlagentbench: Evaluating language agents on machine learning experi- mentation

Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302. Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wen- lin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. 2024. Dsbench: How far are data science agents from becoming data science experts?arXiv preprint arXiv:2409.07703. Feiyang Kang, ...

work page arXiv 2024
[3]

Andrej Karpathy

Adadedup: Adaptive hybrid data pruning for efficient large-scale object detection training.arXiv preprint arXiv:2507.00049. Andrej Karpathy. 2026. autoresearch: Ai agents running research on single-gpu nanochat training automatically. https://github.com/karpathy/ autoresearch. Accessed: 2026-04-29. Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto,...

work page arXiv 2026
[4]

AgentBench: Evaluating LLMs as Agents

Datacomp-lm: In search of the next gener- ation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200– 14282. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024a. Improved baselines with visual instruc- tion tuning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pa...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Advances in Neural Information Processing Systems, 36:50358–50376

Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358–50376. Andrew Ng. 2021. A chat with andrew on mlops: From model-centric to data-centric ai. YouTube video, DeepLearning.AI. Accessed: 2026-05-05. Curtis Northcutt, Lu Jiang, and Isaac Chuang. 2021. Confident learning: Estimating uncertainty in dataset lab...

work page arXiv 2021
[6]

higher quality

/ DSBench (Jing et al., 2024) / TML- Bench (Pinchuk, 2026) Data science and tabular ML Analyze data, produce notebooks, train tabu- lar models No Data is an input to analysis/modeling rather than the artifact being optimized. Data-centric AI / data se- lection (Ng, 2021; Paul et al., 2021; Sorscher et al., 2022) Dataset engineering Usually non-agentic met...

2024
[7]

Keep” rows improved the composite accuracy; “discard

with vLLM (Kwon et al., 2023). Judge-based metrics use a separate OpenAI-compatible Qwen3.5- 27B endpoint (Team, 2026) for all baseline runs. The server-side judge configuration is summarized in Table 16. Table 16:Judge-server configuration for judge-based VLMEvalKit metrics. This endpoint is separate from candidate-model evaluation inference. Setting Val...

2023
[8]

Keep MMMU stable (no TextVQA change, OCR_VQA change is small)
[9]

Improve visual diversity (200 more unique COCO images)
[10]

Potentially improve MMVet/LLaV ABench (+0.3–0.5 from more detailed COCO content)
[11]

– 1 paper -> 1 skill

Minor OCRBench impact (−3 to−5 from slightly less OCR training) Expected net: +0.001 to +0.003 accuracy ## Minimal Change After computing proportional allocations: – ocr_vqa: 1486→1286 (−200) – coco: 6765→6965 (+200) – gqa: 1340 (unchanged) – textvqa: 408 (unchanged) Instructionsfor automated compilation of skill cards compiled from research papers ## Cha...
[12]

**One-line decision** — when to use / when not to (single sentence each)
[13]

‘operational-method‘), ‘actionability‘ (high/med/low), ‘evidence quality‘ (‘full_paper‘)

**Skill metadata** — ‘skill type‘, ‘paper kind‘ (e.g. ‘operational-method‘), ‘actionability‘ (high/med/low), ‘evidence quality‘ (‘full_paper‘)
[14]

**Goal** — one paragraph operational restatement
[15]

**Problem signature** — modality, data state, scale regime, model requirement
[16]

**Use when** / **Do not use when** — bullet pairs
[17]

**Required inputs** / **Optional inputs** / **Outputs** — bolded named slots with one-line descriptions
[18]

**Assumptions and prerequisites**
[19]

**Procedure** — numbered steps, each with ‘Action: . . . / Why: . . . / Note: See paper for details.‘
[20]

**Parameters to set** — Role / How to set / Default-or-range / Effect
[21]

**Validation checks**, **Failure modes**
[22]

**Adaptation notes for VLM training** — always present, ties the paper back to VLM data work even when the source paper is not VLM-specific
[23]

**Implementation notes**
[24]

**Evidence from the paper** — 3–5 bullets with the load-bearing claims
[25]

Use this skill when

**Source paper** — title, year, venue, paper ID, URL, arXiv ID. **Tone / philosophy** – Decision-first ("Use this skill when. . . " / "Avoid it when. . . ") rather than paper summary. – Procedures are skeletons — the *shape* of the method, not full reproduction. Each step ends with "See paper for details." – Knowledge-cards optimized for a downstream agen...

[1] [1]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Machine learning data practices through a data curation lens: An evaluation framework. InPro- ceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1055–1067. Dustin Brunner. 2023. Datacomp challenge. Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksi...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Mlagentbench: Evaluating language agents on machine learning experi- mentation

Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302. Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wen- lin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. 2024. Dsbench: How far are data science agents from becoming data science experts?arXiv preprint arXiv:2409.07703. Feiyang Kang, ...

work page arXiv 2024

[3] [3]

Andrej Karpathy

Adadedup: Adaptive hybrid data pruning for efficient large-scale object detection training.arXiv preprint arXiv:2507.00049. Andrej Karpathy. 2026. autoresearch: Ai agents running research on single-gpu nanochat training automatically. https://github.com/karpathy/ autoresearch. Accessed: 2026-04-29. Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto,...

work page arXiv 2026

[4] [4]

AgentBench: Evaluating LLMs as Agents

Datacomp-lm: In search of the next gener- ation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200– 14282. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024a. Improved baselines with visual instruc- tion tuning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pa...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Advances in Neural Information Processing Systems, 36:50358–50376

Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358–50376. Andrew Ng. 2021. A chat with andrew on mlops: From model-centric to data-centric ai. YouTube video, DeepLearning.AI. Accessed: 2026-05-05. Curtis Northcutt, Lu Jiang, and Isaac Chuang. 2021. Confident learning: Estimating uncertainty in dataset lab...

work page arXiv 2021

[6] [6]

higher quality

/ DSBench (Jing et al., 2024) / TML- Bench (Pinchuk, 2026) Data science and tabular ML Analyze data, produce notebooks, train tabu- lar models No Data is an input to analysis/modeling rather than the artifact being optimized. Data-centric AI / data se- lection (Ng, 2021; Paul et al., 2021; Sorscher et al., 2022) Dataset engineering Usually non-agentic met...

2024

[7] [7]

Keep” rows improved the composite accuracy; “discard

with vLLM (Kwon et al., 2023). Judge-based metrics use a separate OpenAI-compatible Qwen3.5- 27B endpoint (Team, 2026) for all baseline runs. The server-side judge configuration is summarized in Table 16. Table 16:Judge-server configuration for judge-based VLMEvalKit metrics. This endpoint is separate from candidate-model evaluation inference. Setting Val...

2023

[8] [8]

Keep MMMU stable (no TextVQA change, OCR_VQA change is small)

[9] [9]

Improve visual diversity (200 more unique COCO images)

[10] [10]

Potentially improve MMVet/LLaV ABench (+0.3–0.5 from more detailed COCO content)

[11] [11]

– 1 paper -> 1 skill

Minor OCRBench impact (−3 to−5 from slightly less OCR training) Expected net: +0.001 to +0.003 accuracy ## Minimal Change After computing proportional allocations: – ocr_vqa: 1486→1286 (−200) – coco: 6765→6965 (+200) – gqa: 1340 (unchanged) – textvqa: 408 (unchanged) Instructionsfor automated compilation of skill cards compiled from research papers ## Cha...

[12] [12]

**One-line decision** — when to use / when not to (single sentence each)

[13] [13]

‘operational-method‘), ‘actionability‘ (high/med/low), ‘evidence quality‘ (‘full_paper‘)

**Skill metadata** — ‘skill type‘, ‘paper kind‘ (e.g. ‘operational-method‘), ‘actionability‘ (high/med/low), ‘evidence quality‘ (‘full_paper‘)

[14] [14]

**Goal** — one paragraph operational restatement

[15] [15]

**Problem signature** — modality, data state, scale regime, model requirement

[16] [16]

**Use when** / **Do not use when** — bullet pairs

[17] [17]

**Required inputs** / **Optional inputs** / **Outputs** — bolded named slots with one-line descriptions

[18] [18]

**Assumptions and prerequisites**

[19] [19]

**Procedure** — numbered steps, each with ‘Action: . . . / Why: . . . / Note: See paper for details.‘

[20] [20]

**Parameters to set** — Role / How to set / Default-or-range / Effect

[21] [21]

**Validation checks**, **Failure modes**

[22] [22]

**Adaptation notes for VLM training** — always present, ties the paper back to VLM data work even when the source paper is not VLM-specific

[23] [23]

**Implementation notes**

[24] [24]

**Evidence from the paper** — 3–5 bullets with the load-bearing claims

[25] [25]

Use this skill when

**Source paper** — title, year, venue, paper ID, URL, arXiv ID. **Tone / philosophy** – Decision-first ("Use this skill when. . . " / "Avoid it when. . . ") rather than paper summary. – Procedures are skeletons — the *shape* of the method, not full reproduction. Each step ends with "See paper for details." – Knowledge-cards optimized for a downstream agen...