
arxiv: 2606.05250 · v1 · pith:TMG27UOPnew · submitted 2026-06-03 · 💻 cs.SE

Towards Persistent Case-Based Memory for Autonomous Data Science: A CBR-Augmented R&D-Agent with a Locally Deployable Small Language Model

Pith reviewed 2026-06-28 05:16 UTC · model grok-4.3

classification 💻 cs.SE
keywords case-based reasoning · autonomous data science agents · small language models · persistent memory · Kaggle competitions · R&D agent framework · Gemma model

The pith

CBR layer added to R&D agent with local SLM yields directional accuracy gains and lower variance on Kaggle tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper integrates a persistent case-based reasoning layer into an existing autonomous data-science agent framework, using structured cases that pair symbolic records with executable code. This CBR component overrides selected phases of the agent's loop via a toggleable subclass and applies a five-gate quality filter plus heuristic reuse detection based on embedding similarity and code fingerprints. Evaluation across two Kaggle competitions with multiple seeds shows the CBR version reaching 0.8147 accuracy versus 0.8098 for the baseline on Spaceship Titanic, accompanied by substantially reduced variance, while reuse events exhibit high semantic relevance. The work also provides the first published end-to-end test of Gemma 4 31B as the agent's backbone model.

Core claim

Overriding three phases of the R&D loop with a CBR layer that stores structured cases containing executable code snapshots and quality metadata, then retrieves them via a five-gate filter and reuse-detection heuristic combining embedding similarity (mean 0.882) and code-fingerprint overlap (mean 0.305), produces directionally higher accuracy and markedly lower variance than the CBR-disabled baseline on the Spaceship Titanic task.
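
The two signals named above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the paper's fingerprint algorithm is not described here, so token shingles with Jaccard overlap stand in for it, and the function names and thresholds are hypothetical rather than the authors' values.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def code_fingerprint(source, k=3):
    """Set of k-token shingles (hypothetical stand-in for the paper's fingerprint)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def fingerprint_overlap(src_a, src_b, k=3):
    """Jaccard overlap of the two shingle sets, in [0, 1]."""
    fa, fb = code_fingerprint(src_a, k), code_fingerprint(src_b, k)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def classify_reuse(embedding_sim, fingerprint_sim,
                   sem_threshold=0.8, struct_threshold=0.6):
    """Label a retrieval event; both thresholds are illustrative assumptions."""
    if embedding_sim >= sem_threshold and fingerprint_sim >= struct_threshold:
        return "near-verbatim reuse"
    if embedding_sim >= sem_threshold:
        return "conceptual guidance"
    return "no detected reuse"
```

Under these assumed thresholds, an event at the reported means (embedding similarity 0.882, fingerprint similarity 0.305) would be labelled conceptual guidance, which matches the paper's reading. A different threshold choice could change that label.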

What carries the argument

The CBR layer, implemented as a surgical subclass toggled by an environment variable, that stores cases as structured records with executable code and quality metadata and retrieves them through a five-gate quality filter and heuristic reuse detection using embedding similarity plus code-fingerprint overlap.
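
A rough sketch of the toggled-subclass pattern follows. It assumes a five-phase loop in which phases 1, 2 and 5 are overridden, as the Figure 3 caption describes. The class names, method names, and the `CBR_ENABLED` variable are all hypothetical and do not reflect the real R&D-Agent API, and a single placeholder check stands in for the five-gate filter, whose gates are not specified here.

```python
import os


class BaseLoop:
    """Hypothetical stand-in for the host agent's five-phase loop."""

    def phase1_propose(self, task):
        return f"hypothesis for {task}"

    def phase2_code(self, hypothesis):
        return f"# code for: {hypothesis}"

    def phase3_run(self, code):
        return {"score": 0.5}  # placeholder result, not a measured value

    def phase4_feedback(self, result):
        return f"score={result['score']}"

    def phase5_record(self, task, code, result):
        return None  # the baseline keeps no cross-session memory

    def iterate(self, task):
        hypothesis = self.phase1_propose(task)
        code = self.phase2_code(hypothesis)
        result = self.phase3_run(code)
        self.phase4_feedback(result)
        self.phase5_record(task, code, result)
        return hypothesis, code, result


class CBRLoop(BaseLoop):
    """Overrides phases 1, 2 and 5 only; phases 3 and 4 are inherited unchanged."""

    def __init__(self, case_base=None):
        self.case_base = [] if case_base is None else case_base

    def phase1_propose(self, task):
        hypothesis = super().phase1_propose(task)
        if self.case_base:
            hypothesis += f" (informed by {len(self.case_base)} stored case(s))"
        return hypothesis

    def phase2_code(self, hypothesis):
        # Append only the most recent snapshot, to keep the sketch small.
        snapshots = [case["code"] for case in self.case_base[-1:]]
        return "\n".join(snapshots + [super().phase2_code(hypothesis)])

    def phase5_record(self, task, code, result):
        # Placeholder for the five-gate quality filter (gates not specified here).
        if result.get("score") is not None:
            self.case_base.append({"task": task, "code": code, "score": result["score"]})


def build_loop(env=None):
    """Pick the loop class from an environment variable (the name is an assumption)."""
    env = os.environ if env is None else env
    return CBRLoop() if env.get("CBR_ENABLED", "0") == "1" else BaseLoop()
```

The appeal of this design is that the CBR-disabled baseline is the unmodified parent class, so the two conditions differ only in the overridden phases.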

If this is right

  • Persistent, quality-controlled case memory can be added to existing agent frameworks without replacing the core loop.
  • Small language models such as Gemma 4 31B can function as locally deployable backbones for full autonomous data-science pipelines.
  • Heuristic reuse detection supports conceptual guidance from prior cases rather than verbatim code reuse.
  • Lower variance across random seeds indicates more stable improvement trajectories when CBR memory is active.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same CBR pattern could be ported to other agent scaffolds that lack native long-term memory.
  • Variable code-fingerprint similarity alongside high embedding similarity points to a hybrid symbolic-neural memory design that may generalize beyond the current tasks.
  • Testing the five-gate filter on tasks with noisier or less structured code artefacts would clarify the limits of the current reuse heuristic.

Load-bearing premise

The five-gate quality filter and heuristic reuse-detection mechanism correctly identify transferable knowledge without introducing selection bias or false positives that inflate apparent gains.

What would settle it

Re-running the eight-loop evaluation on Spaceship Titanic with the reuse-detection heuristic disabled or replaced by random retrieval and finding that the accuracy gap and variance reduction both disappear.
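
One way to set up such a control is to swap the retriever while leaving everything else fixed, as in the sketch below. The case structure, the dot-product ranking, and the function names are illustrative assumptions, not the paper's retrieval code.

```python
import random


def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query, cases, k):
    """Similarity-ranked retrieval: the condition under test."""
    ranked = sorted(cases, key=lambda c: dot(query, c["embedding"]), reverse=True)
    return ranked[:k]


def retrieve_random_k(cases, k, rng):
    """Random-retrieval control: same number of cases, no relevance signal."""
    return rng.sample(cases, min(k, len(cases)))


def make_retriever(mode, seed=0):
    """Return a retriever with a common signature so the loop code is unchanged."""
    rng = random.Random(seed)
    if mode == "similarity":
        return lambda query, cases, k: retrieve_top_k(query, cases, k)
    if mode == "random":
        return lambda query, cases, k: retrieve_random_k(cases, k, rng)
    raise ValueError(f"unknown retrieval mode: {mode}")
```

Keeping the number of injected cases equal across the two modes matters: it separates the effect of relevant retrieval from the effect of simply adding more context to the prompt.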

Figures

Figures reproduced from arXiv: 2606.05250 by Felix Stocker.

Figure 1. Cumulative-best score trajectories across eight autonomous …
Figure 2. Heuristic reuse-detection scatter: embedding cosine similarity ( …
Figure 3. Appendix G: CBR-augmented R&D-Agent loop (one iteration). Dark grey boxes mark the three phases overridden by the CBR layer (Phases 1, 2, 5); light grey boxes mark unchanged R&D-Agent phases (Phases 3, 4). Phase 1 performs two-stage CBR retrieval and injects retrieved cases and Failure-Tracker patterns into the hypothesis; Phase 2 appends code snapshots to the coding prompt; Phase 5 applies the five-gate Q…
Original abstract

Most top-performing autonomous data-science agents rely on frontier cloud models and lack persistent, cross-session memory. This paper addresses two open gaps: (1) the underexplored use of formally structured, quality-controlled Case-Based Reasoning (CBR) case bases coupling symbolic case records with executable code artefacts; and (2) the untested viability of Small Language Models (SLMs) as locally deployable agent backbones. We present CBR-augmented R&D-Agent, integrating a persistent CBR layer into Microsoft's R&D-Agent framework with a custom backend for Gemma 4 31B Dense -- the first published end-to-end evaluation of Gemma 4 as an autonomous data-science agent backbone. The CBR layer overrides three R&D loop phases via a surgical subclass toggled by a single environment variable. Cases are stored as structured records with executable code snapshots and quality metadata; a five-gate quality filter and a heuristic reuse-detection mechanism assess knowledge transfer by combining embedding similarity, code-fingerprint overlap, and injection provenance. Evaluated on two Kaggle competitions (NOMAD 2018, Spaceship Titanic) with four seeds over eight improvement loops each, CBR achieves directionally higher accuracy than the CBR-disabled baseline on Spaceship Titanic (0.8147 vs. 0.8098, d = -1.41) with substantially lower variance. Heuristic reuse detection across 108 retrieval events shows high semantic relevance (mean embedding similarity 0.882) alongside variable structural proximity (mean code-fingerprint similarity 0.305), consistent with conceptual guidance rather than verbatim code copying.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a CBR-augmented R&D-Agent that integrates a persistent, structured Case-Based Reasoning layer (with quality-controlled case records coupling symbolic metadata and executable code) into Microsoft's R&D-Agent framework. It uses Gemma 4 31B Dense as the locally deployable SLM backbone and evaluates the system on two Kaggle competitions (NOMAD 2018 and Spaceship Titanic) across eight improvement loops with four random seeds. The central claim is that the CBR layer produces directionally higher accuracy (0.8147 vs. 0.8098) and substantially lower variance on Spaceship Titanic, with supporting analysis of 108 retrieval events showing high embedding similarity but variable code-fingerprint overlap.

Significance. If the reported accuracy and variance improvements can be substantiated with adequate statistical controls and larger sample sizes, the work would demonstrate a practical route to persistent, cross-session memory in autonomous data-science agents while using only locally deployable small models. The five-gate quality filter and combined embedding-plus-fingerprint reuse heuristic constitute a concrete, inspectable mechanism for controlled knowledge transfer that could be adopted or extended by other agent frameworks.

major comments (2)
  1. [Abstract] The central empirical claim of directional superiority plus lower variance rests on only four random seeds. No per-seed accuracy values, standard deviations, p-values, bootstrap confidence intervals, or hypothesis tests are reported, so it is impossible to determine whether the 0.49 pp difference (or the cited d = -1.41) exceeds what would be expected from seed-to-seed fluctuation under an otherwise identical agent.
  2. [Abstract] The effect size d = -1.41 is stated without definition, formula, or reference to the underlying per-seed data; this prevents verification of the sign, magnitude, or appropriateness of the statistic for the variance comparison.
minor comments (1)
  1. [Abstract] The phrase 'Gemma 4 31B Dense' should be clarified with the exact model identifier or citation, as it is not a standard public release name.
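
For orientation, one common definition of the statistic in question is Cohen's d with a pooled standard deviation. This is an assumption about what the paper computes, since the abstract does not define d. The subscripts and the subtraction order below are illustrative.

```latex
d = \frac{\bar{x}_{\mathrm{base}} - \bar{x}_{\mathrm{CBR}}}{s_p},
\qquad
s_p = \sqrt{\frac{(n_{\mathrm{base}} - 1)\, s_{\mathrm{base}}^{2} + (n_{\mathrm{CBR}} - 1)\, s_{\mathrm{CBR}}^{2}}{n_{\mathrm{base}} + n_{\mathrm{CBR}} - 2}}
```

Under this reading, the negative sign would simply reflect subtracting the higher CBR mean from the lower baseline mean, and d = -1.41 with a mean gap of 0.0049 would imply a pooled standard deviation of roughly 0.0035. Neither point can be confirmed from the abstract alone.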

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying gaps in the statistical reporting of the abstract. We address each major comment below and will revise the manuscript to improve transparency and verifiability.

Point-by-point responses
  1. Referee: [Abstract] The central empirical claim of directional superiority plus lower variance rests on only four random seeds. No per-seed accuracy values, standard deviations, p-values, bootstrap confidence intervals, or hypothesis tests are reported, so it is impossible to determine whether the 0.49 pp difference (or the cited d = -1.41) exceeds what would be expected from seed-to-seed fluctuation under an otherwise identical agent.

    Authors: We agree that four seeds constitute a small sample and that aggregate means alone do not permit readers to judge whether the 0.49 pp difference exceeds typical seed-to-seed fluctuation. In the revised manuscript we will add a table or explicit listing of the four per-seed accuracies for both the CBR-augmented and baseline conditions on Spaceship Titanic, report the standard deviation across seeds, and include the result of an appropriate paired test (e.g., Wilcoxon signed-rank) together with its p-value and a bootstrap confidence interval. These additions will be placed in both the abstract and a new short results subsection.
    Revision: yes

  2. Referee: [Abstract] The effect size d = -1.41 is stated without definition, formula, or reference to the underlying per-seed data; this prevents verification of the sign, magnitude, or appropriateness of the statistic for the variance comparison.

    Authors: We will insert an explicit definition of the reported effect size, the formula employed, and the per-seed values used in its calculation. The revised text will also clarify whether the statistic is intended to quantify the accuracy difference or the variance reduction and will cite the standard reference for the chosen effect-size measure.
    Revision: yes
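
The analysis promised in the first response can be sketched with the standard library alone. The sketch assumes the input is a list of paired per-seed differences (CBR minus baseline) with no zeros and no tied magnitudes. It contains no values from the paper, and the function names are hypothetical.

```python
import itertools
import random
import statistics


def wilcoxon_exact_p(diffs):
    """Two-sided exact Wilcoxon signed-rank p-value (assumes no zeros, no ties)."""
    ranked = sorted(diffs, key=abs)
    ranks = list(range(1, len(ranked) + 1))
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, ranked) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Enumerate all 2^n sign assignments, equally likely under the null.
    extreme = 0
    for signs in itertools.product([0, 1], repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** len(ranks)


def bootstrap_ci(diffs, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

With four paired seeds the exact test has only 16 equally likely sign assignments, so its two-sided p-value cannot fall below 2/16 = 0.125 even when every seed favours the same condition. A paired Wilcoxon test on four seeds therefore cannot reach the conventional 0.05 level, which supports the referee's request for larger samples.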

Circularity Check

0 steps flagged

No circularity; empirical ablation is direct and self-contained

Full rationale

The paper reports a straightforward empirical comparison of CBR-enabled versus CBR-disabled runs of the same R&D-Agent on identical Kaggle tasks (NOMAD 2018, Spaceship Titanic) across four seeds. No equations, derivations, or predictions are presented that reduce reported accuracies (0.8147 vs 0.8098) to fitted parameters defined by the authors. The five-gate filter and reuse-detection mechanism are described as implementation details whose correctness is evaluated externally via observed similarities, not assumed by construction. No self-citation chains or uniqueness theorems are invoked to justify the central claim. The evaluation is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that a small language model can function as a capable autonomous agent backbone and that the custom CBR layer can be surgically inserted without breaking the original loop; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption: Small language models such as Gemma 4 31B can serve as viable backbones for autonomous data-science agents
    The paper positions the model as the first published test case for this role.
  • domain assumption: Structured case records with executable code and quality metadata can be reused across sessions via embedding and fingerprint similarity
    This is the core premise of the CBR layer described in the abstract.

pith-pipeline@v0.9.1-grok · 5821 in / 1401 out tokens · 27557 ms · 2026-06-28T05:16:42.283781+00:00 · methodology

