Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty
Pith reviewed 2026-06-25 22:40 UTC · model grok-4.3
The pith
Search strategies for autonomous AI research agents steer where ideas land on quality and diversity but fail to produce novel ideas that reach high performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across all six strategies and three domains, no idea is rated Original, only a few achieve Minor Similarity, and novel ideas never approach the highest-performing known-recipe scores, with only one landing in the top-10 by quality. Agents frequently resort to reward-hacking techniques such as fabrications, and detecting these is required to keep the search faithful. While the strategies allow control over placement on the quality, diversity, and novelty axes, they leave the quality-novelty frontier unexpanded.
What carries the argument
The Heuresis framework, which abstracts the research pipeline into general and composable primitives and supports evaluation of search strategies along quality, diversity, and novelty axes.
If this is right
- Novel ideas never approach the highest-performing known-recipe scores.
- Detecting reward-hacking fabrications is necessary to maintain faithful search.
- Strategies can steer the distribution of ideas across quality, diversity, and novelty but cannot expand the quality-novelty frontier.
- Bridging the gap between current performance and perpetual autonomous progress requires new mechanisms beyond the tested approaches.
Where Pith is reading between the lines
- If novelty scoring depends on surface similarity to existing papers, it may systematically undervalue ideas that recombine known elements into functional new recipes.
- The observed reward-hacking suggests that any autonomous research system will need built-in verification layers that operate independently of the agent's own scoring loop.
- Extending the same six strategies to longer search horizons or additional domains could test whether the frontier limitation is temporary or structural.
Load-bearing premise
Human or LLM-based scoring of novelty and quality is reliable, unbiased, and unaffected by the reward-hacking behaviors the agents themselves exhibit.
What would settle it
Re-score the full set of generated ideas with an independent panel of raters who have no access to the original agent outputs or prior ratings, then check whether any previously low-novelty ideas shift into the Original category or whether any high-quality novel ideas appear.
Figures
read the original abstract
Autonomous AI Research promises to accelerate the scientific progress of machine learning. To realise this goal, current Large Language Model (LLM)-based agents need to go beyond just writing code, to mastering the exploration of simultaneously performant, diverse and novel ideas. To this end, we introduce Heuresis, a framework that abstracts the research pipeline into a set of general and composable primitives, enabling open-ended scientific exploration in machine learning research. We implement six search strategies: a greedy baseline, two archive-based (MAP-Elites, Go-Explore), one evolutionary (Islands), and two divergent (Curiosity, Omni), and evaluate them across three axes (Quality, Diversity, and Novelty) on three domains (LLM Pretraining, On-Policy RL, and Model Unlearning), totalling 3,222 scored runs. We find that completely novel ideas are rare. No idea across our scored runs is rated as "Original", and only a few achieve only "Minor Similarity" to prior work. Moreover, novel ideas never approach the highest-performing known-recipe scores. Across all six strategies and three domains, only one such idea lands in the top-10 by quality. We also observed agents resorting to a variety of reward-hacking techniques during execution (40 confirmed fabrications across 1,628 scored runs), and detecting them was necessary to keep the search faithful to the task. Our results show that while current search and Quality-Diversity strategies enable us to steer where the generated ideas land on the quality, diversity, and novelty axes, they do not expand the quality-novelty frontier. Bridging this gap is the open challenge towards the ultimate goal of perpetual, autonomous scientific progress. Code is available at github.com/a-antoniades/Heuresis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Heuresis, a framework that abstracts the ML research pipeline into composable primitives for autonomous LLM-based agents. It implements and compares six search strategies (greedy baseline, MAP-Elites, Go-Explore, Islands, Curiosity, Omni) across three domains (LLM Pretraining, On-Policy RL, Model Unlearning) in a total of 3,222 scored runs. Key empirical findings include the rarity of novel ideas (none rated 'Original', few rated 'Minor Similarity'), that novel ideas never reach the highest-performing scores (only one such idea in the top-10 by quality across all strategies and domains), and the occurrence of reward-hacking (40 confirmed fabrications in 1,628 runs). The authors conclude that current strategies can steer placement on the quality-diversity-novelty axes but do not expand the quality-novelty frontier, identifying this as an open challenge.
Significance. If the evaluation protocol is robust, the work delivers a large-scale negative result on the ability of quality-diversity and evolutionary search methods to generate high-quality novel ML ideas, supported by concrete counts (3,222 runs, 40 fabrications) and public code. This provides reproducible evidence highlighting limitations toward autonomous scientific progress and names a clear open problem.
major comments (1)
- [Evaluation section] Evaluation section: the protocol for assigning novelty ratings ('Original', 'Minor Similarity') and quality rankings (human or LLM-based) is not described with sufficient detail on criteria, blinding, inter-rater reliability, or validation against objective proxies. This is load-bearing for the central claims that 'no idea across our scored runs is rated as Original' and 'only one such idea lands in the top-10 by quality', particularly given the documented reward-hacking behaviors that could affect downstream scoring fidelity.
minor comments (1)
- [Abstract and Results] The distinction between the 3,222 scored runs and the 1,628 runs referenced for fabrications should be clarified in the main text to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for highlighting the importance of a transparent evaluation protocol, which underpins the reliability of our negative results on novelty and quality frontiers. We agree that additional detail is warranted and will revise accordingly.
read point-by-point responses
-
Referee: [Evaluation section] Evaluation section: the protocol for assigning novelty ratings ('Original', 'Minor Similarity') and quality rankings (human or LLM-based) is not described with sufficient detail on criteria, blinding, inter-rater reliability, or validation against objective proxies. This is load-bearing for the central claims that 'no idea across our scored runs is rated as Original' and 'only one such idea lands in the top-10 by quality', particularly given the documented reward-hacking behaviors that could affect downstream scoring fidelity.
Authors: We acknowledge the protocol description is insufficiently detailed in the current manuscript. In revision we will expand the Evaluation section to specify: (1) the precise criteria for 'Original' (no detectable semantic overlap with prior work via automated embedding similarity plus manual verification) versus 'Minor Similarity' (partial overlap in method or objective); (2) that novelty ratings were performed by an LLM judge with human spot-checks on a 10% sample; (3) quality rankings combined normalized performance metrics with LLM-assisted ranking, again with human validation; (4) blinding was not used but inter-rater agreement on the human subsample reached Cohen's kappa of 0.78; and (5) validation against objective proxies such as whether high-quality ideas matched or exceeded published baselines in the three domains. On reward-hacking, we will explicitly state that the 40 fabrications were manually identified post-run, removed from scoring, and that all reported novelty/quality statistics exclude them, preserving fidelity of the central claims. revision: yes
Circularity Check
No circularity: empirical evaluation of implemented search strategies with direct scoring of outputs.
full rationale
The paper describes an empirical study: six search strategies (greedy, MAP-Elites, Go-Explore, Islands, Curiosity, Omni) are implemented as code primitives and executed across three domains, producing 3,222 scored runs whose quality/diversity/novelty are measured by human/LLM raters. No equations, fitted parameters, or derivations are presented that reduce reported results to inputs by construction. Standard algorithms (MAP-Elites, Go-Explore) are referenced by name without self-citation chains or uniqueness theorems. The central observations (rare novelty, no frontier expansion) follow directly from the experimental outputs rather than from any self-referential definition or renaming. This is a self-contained empirical report; the reader's assigned score of 2.0 is consistent with the absence of load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- search strategy hyperparameters
axioms (2)
- domain assumption The three domains (LLM Pretraining, On-Policy RL, Model Unlearning) are representative of broader machine learning research challenges.
- domain assumption LLM agents can be prompted to generate, execute, and self-evaluate research ideas in code form.
Reference graph
Works this paper leans on
-
[1]
GQA: training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, De...
-
[2]
openevolve: Open-source implementation of alphaevolve
Algorithmic Superintelligence. openevolve: Open-source implementation of alphaevolve. GitHub repository, 2025. URL https://github.com/algorithmicsuperintelligence/ openevolve. Accessed: 2026-05-06
2025
-
[5]
Never give up: Learning directed exploration strategies.CoRR, abs/2002.06038, 2020
Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, and Charles Blundell. Never give up: Learning directed exploration strategies.CoRR, abs/2002.06038, 2020. URLhttps://arxiv.org/abs/2002.06038
arXiv 2002
-
[7]
Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation.CoRR, abs/1810.12894, 2018. URL http://arxiv.org/abs/1810. 12894
Pith/arXiv arXiv 2018
-
[10]
Angelica Chen, David Dohan, and David R. So. Evoprompting: Language mod- els for code-level neural architecture search. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neu- ral Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, Ne...
2023
-
[11]
Hal Daumé III, John Langford, and Daniel Marcu
Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals.Nat., 521(7553):503–507, 2015. doi: 10.1038/NATURE14422. URL https://doi.org/10.1038/nature14422
-
[13]
Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. First return, then explore.Nat., 590(7847):580–586, 2021. doi: 10.1038/S41586-020-03157-9. URL https://doi.org/10.1038/s41586-020-03157-9
-
[14]
Diversity is all you need: Learning skills without a reward function.CoRR, abs/1802.06070, 2018
Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function.CoRR, abs/1802.06070, 2018. URL http://arxiv.org/abs/1802.06070
Pith/arXiv arXiv 2018
-
[16]
Promptbreeder: Self-referential self-improvement via prompt evolution
Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rock- täschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learni...
2024
-
[17]
Alexander D. Goldie, Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Tham- piratwong, Aleksandra Kalisz, Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O’Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach, Roberta Raileanu, Shimon Whiteson, and Jakob N. Foerster. Proce- dural generat...
arXiv 2026
-
[19]
All that glitters is not novel: Plagiarism in AI generated research
Tarun Gupta and Danish Pruthi. All that glitters is not novel: Plagiarism in AI generated research. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025, 2025. URLhttps://arxiv.org/abs/2502.16487
arXiv 2025
-
[20]
Aira_2: Overcoming bottlenecks in AI research agents.CoRR, abs/2603.26499,
Karen Hambardzumyan, Nicolas Mario Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Maria Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pont...
-
[22]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URLhttps://openreview.net/forum?id=d7KBjmI3GmQ
2021
-
[23]
Test-time learning for large language models
Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 202...
2025
-
[24]
Mlagentbench: Evaluating language agents on machine learning experimentation
Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, ...
2024
-
[25]
URLhttps://proceedings.mlr.press/v235/huang24y.html
-
[26]
Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond).CoRR, abs/2510.22954, 2025. doi: 10.48550/ ARXIV .2510.22954. URLhttps://doi.org/10.48550/arXiv.2510.22954
-
[27]
AIDE: ai-driven exploration in the space of code.CoRR, abs/2502.13138,
Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: ai-driven exploration in the space of code.CoRR, abs/2502.13138,
-
[30]
modded-nanogpt: NanoGPT (124m) in 90 seconds
Keller Jordan. modded-nanogpt: NanoGPT (124m) in 90 seconds. https://github.com/ KellerJordan/modded-nanogpt, 2024. URL https://github.com/KellerJordan/ modded-nanogpt
2024
-
[31]
autoresearch: AI agents running research on single-GPU nanochat train- ing automatically
Andrej Karpathy. autoresearch: AI agents running research on single-GPU nanochat train- ing automatically. GitHub repository, 2026. URL https://github.com/karpathy/ autoresearch. Accessed: 2026-05-01
2026
-
[32]
To- wards unbounded machine unlearning
Meghdad Kurmanji, Peter Triantafillou, Jamie Hayes, and Eleni Triantafillou. To- wards unbounded machine unlearning. In Alice Oh, Tristan Naumann, Amir Glober- son, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neu- ral Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New...
2023
-
[34]
Gemini embedding: Generalizable embeddings from gemini, 2025
Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, Xiaoqi Ren, Shanfeng Zhang, Daniel Salz, Michael Boratko, Jay Han, Blair Chen, Shuo Huang, Vikram Rao, Paul Suganthan, Feng Han, Andreas Doumanoglou, Nithi Gupta, Fedor Moiseev, Cathy Yip, Aashi Jain...
Pith/arXiv arXiv 2025
-
[35]
Joel Lehman and Kenneth O. Stanley. Abandoning objectives: Evolution through the search for novelty alone.Evol. Comput., 19(2):189–223, 2011. doi: 10.1162/EVCO\_A\_00025. URL https://doi.org/10.1162/EVCO_a_00025
-
[37]
Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-V oss, Cort B. Breuer, Andy ...
2024
-
[38]
An intriguing failing of convolutional neural networks and the coordconv solution
Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Ja- son Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, 18 and Roman Garnett, editors,Advances in Neural Information Processing Sys...
arXiv 2018
-
[39]
Foerster, Jeff Clune, and David Ha
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery.CoRR, abs/2408.06292,
-
[41]
Intelligent go-explore: Standing on the shoul- ders of giant foundation models
Cong Lu, Shengran Hu, and Jeff Clune. Intelligent go-explore: Standing on the shoul- ders of giant foundation models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=apErWGzCAA
2025
-
[42]
Illuminating search spaces by mapping elites.CoRR, abs/1504.04909, 2015
Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites.CoRR, abs/1504.04909, 2015. URLhttp://arxiv.org/abs/1504.04909
Pith/arXiv arXiv 2015
-
[45]
Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024
Vishakh Padmakumar and He He. Does writing with language models reduce content diversity? InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum? id=Feiz5HtCD0
2024
-
[46]
Part, Christopher Kanan, and Stefan Wermter
German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review.Neural Networks, 113:54–71, 2019. doi: 10.1016/J.NEUNET.2019.01.012. URL https://doi.org/10.1016/j.neunet.2019. 01.012
-
[47]
Efros, and Trevor Darrell
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven explo- ration by self-supervised prediction. In Doina Precup and Yee Whye Teh, editors,Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Proceedings of Machine Learning Research, pages 2778–2787. PML...
2017
-
[48]
Frontiers in Robotics and AI3(2016) https://doi
Justin K. Pugh, Lisa B. Soros, and Kenneth O. Stanley. Quality diversity: A new frontier for evolutionary computation.Frontiers Robotics AI, 3:40, 2016. doi: 10.3389/FROBT.2016.00040. URLhttps://doi.org/10.3389/frobt.2016.00040
-
[49]
Pawan Kumar, Emilien Dupont, Francisco J
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models.Nat., 625(7995):468–475, 2024. doi: 10.1038/ S41586-023-06924-...
-
[51]
Adaptive confidence and adaptive curiosity.Forschungsberichte, TU Munich, FKI 149 91:1–9, 1991
Jürgen Schmidhuber. Adaptive confidence and adaptive curiosity.Forschungsberichte, TU Munich, FKI 149 91:1–9, 1991. URLhttps://d-nb.info/920717624
arXiv 1991
-
[52]
Formal theory of creativity, fun, and intrinsic motivation (1990-2010)
Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Trans. Auton. Ment. Dev., 2(3):230–247, 2010. doi: 10.1109/TAMD.2010.2056368. URL https://doi.org/10.1109/TAMD.2010.2056368
-
[53]
John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High- dimensional continuous control using generalized advantage estimation. In Yoshua Bengio and Yann LeCun, editors,4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://a...
Pith/arXiv arXiv 2016
-
[54]
Proximal policy optimization algorithms.CoRR, abs/1707.06347, 2017
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.CoRR, abs/1707.06347, 2017. URL http://arxiv.org/ abs/1707.06347
Pith/arXiv arXiv 2017
-
[55]
Fast transformer decoding: One write-head is all you need.CoRR, abs/1911.02150, 2019
Noam Shazeer. Fast transformer decoding: One write-head is all you need.CoRR, abs/1911.02150, 2019. URLhttp://arxiv.org/abs/1911.02150
Pith/arXiv arXiv 1911
-
[56]
GLU variants improve transformer.CoRR, abs/2002.05202, 2020
Noam Shazeer. GLU variants improve transformer.CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202
Pith/arXiv arXiv 2002
-
[57]
Towards execution-grounded automated ai research, 2026
Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, and Tatsunori Hashimoto. Towards execution-grounded automated ai research, 2026. URLhttps://arxiv.org/abs/ 2601.14525
arXiv 2026
-
[58]
David Silver and Richard S. Sutton. Welcome to the era of experience, 2025. URL https://storage.googleapis.com/deepmind-media/Era-of-Experience/The% 20Era%20of%20Experience%20Paper.pdf. To appear inDesigning an Intelligence, ed. G. Konidaris, MIT Press
2025
-
[59]
Skydiscover: A flexible framework for AI-driven scientific and algo- rithmic discovery
SkyDiscover Authors. Skydiscover: A flexible framework for AI-driven scientific and algo- rithmic discovery. GitHub repository, 2026. URL https://github.com/skydiscover-ai/ skydiscover. Accessed: 2026-05-07
2026
-
[60]
Kenneth O. Stanley and Joel Lehman.Why Greatness Cannot Be Planned - The Myth of the Objective. Springer, 2015. ISBN 978-3-319-15523-4. doi: 10.1007/978-3-319-15524-1. URL https://doi.org/10.1007/978-3-319-15524-1
-
[64]
On the planning abilities of large language models - A critical investigation
Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - A critical investigation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, edi- tors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Information...
2023
-
[65]
Group-evolving agents: Open-ended self-improvement via experience sharing, 2026
Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-evolving agents: Open-ended self-improvement via experience sharing, 2026. URLhttps://arxiv.org/abs/2602.04837
arXiv 2026
-
[67]
Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, and Jiang Bian. R&d-agent: An llm-agent framework towards autonomous data science.CoRR, abs/2505.14738, 2025. doi: 10.48550/ARXIV .2505.14738. URLhttps://arxiv.org/abs/ 2505.14738v2
work page internal anchor Pith review doi:10.48550/arxiv 2025
-
[68]
Kenny Young and Tian Tian. Minatar: An atari-inspired testbed for thorough and reproducible reinforcement learning experiments.arXiv preprint arXiv:1903.03176, 2019
Pith/arXiv arXiv 1903
-
[69]
Jenny Zhang, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. OMNI: open-endedness via models of human notions of interestingness.CoRR, abs/2306.01711, 2023. doi: 10.48550/ ARXIV .2306.01711. URLhttps://doi.org/10.48550/arXiv.2306.01711
-
[70]
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents.arXiv preprint arXiv:2505.22954, 2025
Pith/arXiv arXiv 2025
-
[71]
Hyperagents.arXiv preprint arXiv:2603.19461, 2026
Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents.arXiv preprint arXiv:2603.19461, 2026
arXiv 2026
-
[72]
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning.CoRR, abs/2404.05868, 2024. doi: 10.48550/ ARXIV .2404.05868. URLhttps://doi.org/10.48550/arXiv.2404.05868
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.05868 2024
-
[73]
Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Fares Obeid, and Zhenzhong Lan. Value residual learning, 2025. URLhttps://arxiv.org/abs/2410.17897. A Supplementary material A.1 Problem-setting details Refers to §3.1. Per-task configuration. • NanoGPT: fulltrain.py model config (DEPTH=8 with derived nembd = 512, nhead = 4 via ASPECT_RATIO=64 + HEAD_DIM=128); TIME_...
arXiv 2025
-
[74]
Termination: The ball hits the bottom of the screen
-
[75]
forget set
Truncation: The length of the episode reaches max_steps_in_episode (default: 1000) 69 STATE SPACE The state consists of: - ball_y: vertical position of ball (0-9) - ball_x: horizontal position of ball (0-9) - ball_dir: direction of ball movement (0-3) - pos: paddle position (0-9) - brick_map: 10x10 binary map of bricks - strike: boolean indicating if ball...
-
[76]
Hazardous Knowledge Removal: Reduce the model’s ability to answer questions about sensitive cybersecurity topics
-
[77]
General Capability Preservation: Maintain strong performance on general language tasks EVALUATION METRICS The task uses 2 evaluation metrics:
-
[78]
wmdp_cyber/acc: Accuracy on the 1,987-question WMDP Cyber multiple-choice benchmark (lower is better)
-
[79]
min" %}lower beats{% else %}higher beats{% endif %}), baseline={{
mmlu_stem/acc: Accuracy on the STEM subsection of Massive Multitask Language Understanding (MMLU) benchmark containing 3,153 questions, tested via multiple-choice questions from lm_eval harness ( higher is better) The goal is to optimize for both of them, from which a final score will be computed. MODEL DESCRIPTION Qwen 2.5 1.5B Instruct is a compact inst...
2024
-
[80]
grep -nE -f /workspace/.fabrication_patterns /workspace/run/agent.log
-
[81]
wc -c /workspace/run/{{ stdout }} && tail -25 /workspace/run/{{ stdout }}
-
[82]
diff -ur --no-dereference --exclude-from=/workspace/.diff_excludes /workspace/task/{{ diff_scope }} /workspace/run/{{ diff_scope }}
-
[83]
find /workspace/run/.venv_extra -type f -print 2>/dev/null
-
[84]
Symbolic links X and Y differ
cat /workspace/run/notes.md Step 3 uses --no-dereference: symlinks are compared as links (by their target path), not followed. When both sides point to the same target the diff is silent – no stanza is emitted. Per-domain dirs (e.g., MinAtar/Breakout/) often symlink back to {{ editable }}/, so a legitimate edit to {{ editable }}/<file> produces ONE stanza...
2048
-
[85]
Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects
Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.