arxiv: 2606.04261 · v1 · pith:QLSOI33Ynew · submitted 2026-06-02 · 💻 cs.AI · cs.CL · cs.CV · cs.ET · cs.LG

Can Generalist Agents Automate Data Curation?

Pith reviewed 2026-06-28 09:30 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.ET · cs.LG
keywords data curation · generalist agents · data selection · agent benchmarks · instruction tuning · vision-language models · automated research

The pith

Scaffolded generalist agents autonomously compose data-selection policies that outperform published baselines at one-tenth the data budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether generalist coding agents can take over the iterative work of proposing, implementing, and revising data-selection policies for training AI models. It builds Curation-Bench to give agents direct access to inspect data, write policies, run a fixed training pipeline, and receive benchmark feedback. Out-of-the-box agents reach existing baselines quickly but mostly tweak local variants. Adding a scaffold that forces each step to cite, instantiate, and adapt prior published methods shifts the agent toward broader exploration. The result is a policy the agent built without human design input that beats strong baselines while using only one-tenth the data volume.
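
The propose, build, evaluate, revise loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Curation-Bench's code: `run_curation_loop`, `propose_threshold_policy`, `mean_quality`, and the integer `quality` field are hypothetical names, and the evaluator simply averages a synthetic score where the benchmark would run its fixed training and evaluation pipeline.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, int]
Policy = Callable[[List[Example], int], List[Example]]
History = List[Tuple[int, float]]


def run_curation_loop(
    pool: List[Example],
    budget: int,
    propose: Callable[[History], Policy],
    evaluate: Callable[[List[Example]], float],
    iterations: int = 10,
) -> Tuple[float, List[Example], History]:
    """Propose a policy, build a subset, score it, and keep the best result."""
    history: History = []
    best_score, best_subset = float("-inf"), []
    for step in range(iterations):
        policy = propose(history)          # agent sees past feedback
        subset = policy(pool, budget)      # policy constructs the training data
        if len(subset) > budget:           # harness-style validation of the submission
            raise ValueError("policy exceeded the data budget")
        score = evaluate(subset)           # stand-in for fixed training + evaluation
        history.append((step, score))
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset, history


def propose_threshold_policy(history: History) -> Policy:
    """Hypothetical proposer: raise a quality threshold by one each iteration."""
    threshold = len(history)

    def policy(pool: List[Example], budget: int) -> List[Example]:
        kept = [ex for ex in pool if ex["quality"] >= threshold]
        return kept[:budget]

    return policy


def mean_quality(subset: List[Example]) -> float:
    """Synthetic evaluator (assumption): average of the toy quality field."""
    if not subset:
        return 0.0
    return sum(ex["quality"] for ex in subset) / len(subset)


pool = [{"id": i, "quality": i % 10} for i in range(100)]
best_score, best_subset, history = run_curation_loop(
    pool, budget=10, propose=propose_threshold_policy, evaluate=mean_quality
)
```

The proposer here uses the feedback only to count iterations; the behavior the paper studies is what an agent does with `history` when it is free to revise the policy itself.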

Core claim

In the vision-language instruction-tuning instantiation of Curation-Bench, the scaffolded agent composes a data-selection policy that outperforms strong published baselines at one-tenth their data budget, with no human design input supplied during the agent's iterations.

What carries the argument

The scaffold requiring each iteration to cite, instantiate, and adapt a prior method, which converts open-ended prompting into method-guided exploration inside the fixed Curation-Bench loop.
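
One way to picture the scaffold is as a gate on each iteration's proposal. The sketch below is an assumption about how such a gate could look, not the paper's implementation: the `Proposal` record, its field names, and `scaffold_violations` are hypothetical, and a real scaffold would presumably check far more than non-empty text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Proposal:
    """One iteration's plan, split into the three steps the scaffold requires."""
    cited_method: str   # which prior method the iteration builds on
    instantiation: str  # how that method is applied to the current data pool
    adaptation: str     # what is changed relative to the cited method


def scaffold_violations(proposal: Proposal) -> List[str]:
    """Return the names of required steps that the proposal leaves empty."""
    required = {
        "cite": proposal.cited_method,
        "instantiate": proposal.instantiation,
        "adapt": proposal.adaptation,
    }
    return [name for name, text in required.items() if not text.strip()]


complete = Proposal(
    cited_method="prior method A (placeholder)",
    instantiation="score every candidate with method A's criterion",
    adaptation="apply the criterion per data source instead of globally",
)
incomplete = Proposal(cited_method="", instantiation="tweak source ratios", adaptation=" ")
```

A harness could refuse to run training for any proposal where `scaffold_violations` is non-empty, which is the sense in which the scaffold turns open-ended prompting into method-guided exploration.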

If this is right

  • Agents supplied with the scaffold can run complete data-curation loops without constant human oversight.
  • Method-guided adaptation produces data policies that exceed the performance of hand-designed baselines at substantially lower data volume.
  • The execution-research gap can be narrowed by requiring explicit citation and reuse of prior work rather than free-form invention.
  • Open-sourcing Curation-Bench makes it possible to test whether the same scaffold pattern transfers to other model families and tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaffold pattern holds, labs could shift human effort from writing selection rules to writing the scaffolds themselves.
  • The same loop might surface entirely new policy families that human researchers have not yet enumerated.
  • Testing at larger model scales or noisier real-world data sources would show whether the one-tenth budget advantage survives increased complexity.

Load-bearing premise

The single vision-language instruction-tuning setup with its fixed training recipe and evaluation suite stands in for the wider range of noisy, cross-domain data-curation loops that practitioners actually run.

What would settle it

Running the identical scaffolded agent on a different domain such as text-only pretraining and finding that it no longer beats the published baselines even at the reduced data budget.

Figures

Figures reproduced from arXiv: 2606.04261 by Adam Nguyen, Dawn Song, Feiyang Kang, Frederic Sala, Hanze Li, Jiaqi W. Ma, Mahavir Dabas, Ruoxi Jia.

Figure 1. Agentic data curation requires more than open-ended prompting. (a) We formulate data curation as a policy-search loop: the agent proposes a data policy, constructs training data, observes feedback from fixed training and evaluation, and revises the policy. (b) On 10k-example selection from LLaVA-665K dataset for LLaVA-1.5-7B training, open-ended prompting improves over random selection and human-designed…
Figure 2. Architecture of CURATION-BENCH. A coding agent inspects the candidate pool, implements data policies, and submits curated datasets. The harness validates each submission and scores it with a fixed training–evaluation pipeline.
… how current generalist coding agents interact with repositories: inspecting files, writing scripts, running commands, debugging failures, and reading logs. (P3) Contamination contr…
Figure 3. Agent data-curation results (open-prompt). The bars and the centerline show the average scores and the range of standard deviation, and the upper/lower wicks showing the min/max outcome of a session. Agents run 3 sessions each with 10 iterations, and the baseline is repeated for 10 independent runs. 100% iterations are successfully executed. (a). Fine-tuning pre-trained LLaVA-1.5-7B on curated subsets of 1…
Figure 4
Figure 5. Ablation on the number of iterations (10-50) in each agent session (Claude Code).
… result is notable: average performance continues to improve within this range rather than clearly plateauing. Under open-prompting, the gains accumulate gradually, consistent with extended local search over source mixtures and related heuristics. Under the Heavy II scaffold, the best score is already found within the first …
Original abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
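
To make "data policy" concrete, here is a minimal sketch of the kind of local variant the abstract says out-of-the-box agents tend to tune: a source-mixture policy under a fixed example budget. Everything named here is an assumption for illustration (`allocate_quotas`, `select_by_mixture`, the `source` field, and the source labels); it does not reproduce any policy from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, List


def allocate_quotas(weights: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split an integer budget across sources in proportion to positive weights.

    Uses largest-remainder rounding so the quotas always sum to the budget.
    """
    total = sum(weights.values())
    raw = {source: budget * weight / total for source, weight in weights.items()}
    quotas = {source: int(share) for source, share in raw.items()}
    leftover = budget - sum(quotas.values())
    by_remainder = sorted(raw, key=lambda s: raw[s] - quotas[s], reverse=True)
    for source in by_remainder[:leftover]:
        quotas[source] += 1
    return quotas


def select_by_mixture(
    pool: List[dict], weights: Dict[str, float], budget: int, seed: int = 0
) -> List[dict]:
    """Sample each source's quota uniformly at random, without replacement."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for example in pool:
        by_source[example["source"]].append(example)
    selected: List[dict] = []
    for source, quota in allocate_quotas(weights, budget).items():
        candidates = by_source.get(source, [])
        # A source with fewer examples than its quota contributes all it has.
        selected.extend(rng.sample(candidates, min(quota, len(candidates))))
    return selected


toy_pool = [{"id": i, "source": "a" if i < 50 else "b"} for i in range(100)]
picked = select_by_mixture(toy_pool, {"a": 3, "b": 1}, budget=8)
```

Nudging the weights between iterations is the local search the paper contrasts with exploring a new policy family, such as replacing uniform sampling with a scoring rule adapted from a published method.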

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while allowing agents command-line access to inspect data, implement policies, and iterate via a fixed pipeline. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published baselines within ten iterations, but trajectory analysis identifies an execution-research gap where agents tune local variants rather than explore new policy families. Scaffolds that require citing, instantiating, and adapting prior methods shift behavior toward method-guided exploration. The scaffolded agent autonomously composes a data-selection policy outperforming strong baselines at one-tenth their data budget. The work concludes that current agents can run the curation loop but reliable data research requires scaffolded method adaptation, and open-sources the benchmark and code.

Significance. If the empirical results hold under the reported controls, the paper contributes a reproducible benchmark and concrete evidence that generalist coding agents with targeted scaffolding can automate consequential steps in AI development pipelines. The open-sourcing of code and benchmark is a clear strength that supports follow-on work. The identification of the execution-research gap provides a useful diagnostic for agent capabilities in research-like tasks.

major comments (2)
  1. [Abstract] Abstract: The central claim that the scaffolded agent demonstrates general automation of data curation by autonomously composing a superior policy (outperforming baselines at 1/10 data budget) rests on experiments confined to a single vision-language instruction-tuning instantiation with fixed model, recipe, and evaluation suite. This narrow scope is load-bearing for the broader conclusion that 'reliable data research requires scaffolded method adaptation,' as the execution-research gap and scaffold benefits are measured only inside this benchmark; different domains, model scales, or noisier feedback loops could change how data policies interact with training dynamics. A concrete test would be to instantiate the same agent and scaffolds on at least one additional domain or scale.
  2. [Abstract] Abstract / results section: The outperformance result is presented without reported details on variance across runs, statistical significance tests, or ablation of the fixed training pipeline, which is needed to establish that the 1/10 data-budget advantage is robust rather than an artifact of the specific instantiation.
minor comments (1)
  1. [Abstract] The abstract and conclusion could more explicitly qualify the scope of the generalist-agent claim to the tested instantiation to avoid overgeneralization.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the scaffolded agent demonstrates general automation of data curation by autonomously composing a superior policy (outperforming baselines at 1/10 data budget) rests on experiments confined to a single vision-language instruction-tuning instantiation with fixed model, recipe, and evaluation suite. This narrow scope is load-bearing for the broader conclusion that 'reliable data research requires scaffolded method adaptation,' as the execution-research gap and scaffold benefits are measured only inside this benchmark; different domains, model scales, or noisier feedback loops could change how data policies interact with training dynamics. A concrete test would be to instantiate the same agent and scaffolds on at least one additional domain or scale.

    Authors: Curation-Bench is intentionally designed with a fixed model, training recipe, and evaluation suite precisely to isolate agent behavior in the data-curation loop and to enable reproducible measurement of the execution-research gap. The manuscript already qualifies results as applying to this vision-language instruction-tuning instantiation. We will revise the abstract and conclusion to state the scope more explicitly and to frame the scaffold benefit and gap as observations within this controlled setting rather than as a general claim across all domains. revision: partial

  2. Referee: [Abstract] Abstract / results section: The outperformance result is presented without reported details on variance across runs, statistical significance tests, or ablation of the fixed training pipeline, which is needed to establish that the 1/10 data-budget advantage is robust rather than an artifact of the specific instantiation.

    Authors: We agree that reporting variance, statistical tests, and explicit clarification of the fixed pipeline would strengthen the presentation. In the revision we will add these details (variance across runs and significance tests for the reported outperformance) and state that the pipeline is deliberately fixed by benchmark design to control variables other than the data policy itself. revision: yes
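
As an illustration of the reporting promised in the second response, the sketch below summarizes per-run scores (mean, standard deviation, min, max, the quantities the Figure 3 caption says are plotted) and runs a two-sided permutation test on the difference in means. The permutation test is this page's choice of example, not a method the authors name, and `summarize`, `permutation_p_value`, and all scores are hypothetical.

```python
import random
import statistics
from typing import Dict, List


def summarize(scores: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range of a set of run scores."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def permutation_p_value(
    a: List[float], b: List[float], n_resamples: int = 2000, seed: int = 0
) -> float:
    """Monte Carlo two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of runs into two groups
        diff = abs(
            statistics.mean(pooled[: len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Synthetic scores, for illustration only (not results from the paper).
agent_runs = [10.0, 11.0, 12.0, 13.0, 14.0]
baseline_runs = [0.0, 1.0, 2.0, 3.0, 4.0]
p_separated = permutation_p_value(agent_runs, baseline_runs)
p_identical = permutation_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

With only three agent sessions per configuration, as the Figure 3 caption describes, a test like this has little power, which is part of why the referee asks for variance to be reported explicitly.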

standing simulated objections not resolved
  • Instantiating the agent and scaffolds on at least one additional domain or scale, which would require new benchmark setups, data, and compute not available for the current revision.

Circularity Check

0 steps flagged

No circularity; empirical benchmark results are self-contained

full rationale

The paper introduces Curation-Bench and reports direct empirical outcomes from running agents on a fixed vision-language instruction-tuning setup. No equations, fitted parameters, or derivations are present. The central claim (scaffolded agent composing an outperforming policy) is a measured experimental result on the open benchmark, not a reduction to prior inputs or self-citations. Self-citations, if any, are not load-bearing for the reported performance deltas. This matches the default case of an honest empirical paper with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; ledger is minimal and provisional.

axioms (1)
  • domain assumption: The vision-language instruction-tuning task with fixed pipeline is a valid testbed for general data-curation automation.
    The paper instantiates the benchmark on this task.


