pith. machine review for the scientific record.

arxiv: 2604.12999 · v1 · submitted 2026-04-14 · 💻 cs.CV


Agentic Discovery with Active Hypothesis Exploration for Visual Recognition


Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords neural architecture search · agentic AI · large language models · hypothesis exploration · evolutionary branching · visual recognition · CIFAR-10 · MedMNIST

The pith

HypoExplore turns neural architecture search into an active process of proposing, testing, and refining scientific hypotheses about design choices using language model agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HypoExplore to discover neural architectures for visual recognition by treating the search as hypothesis-driven scientific inquiry. Large language models generate new architecture ideas by extending selected parent hypotheses, guided by a strategy that weighs both proven ideas and open uncertainties. A Trajectory Tree records every lineage while a Hypothesis Memory Bank maintains confidence scores that multiple feedback agents update after each experiment from varied analytical angles. Tests on CIFAR-10 show accuracy rising from an 18.91 percent baseline to 94.11 percent in the best evolved model, with generalization to CIFAR-100 and Tiny-ImageNet, and state-of-the-art results on MedMNIST medical imaging. The work further demonstrates that accumulated evidence makes confidence scores more reliable at forecasting performance and that extracted principles carry across separate search runs.

Core claim

HypoExplore formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. The framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence.
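The dual selection strategy can be illustrated with a small scoring rule in the spirit of an upper-confidence-bound bandit. The field names, the weight `c`, and the UCB form itself are illustrative assumptions; the paper does not specify this rule.

```python
import math

def select_parent(hypotheses, c=1.0):
    """Pick the parent hypothesis to extend next.

    Each hypothesis is a dict with an estimated 'confidence' in [0, 1]
    and a count of 'trials' (experiments that touched it). The score
    balances exploiting validated ideas (high confidence) against
    resolving uncertain ones (few trials), UCB-style. The exact rule
    and the weight c are illustrative assumptions, not the paper's.
    """
    total = sum(h["trials"] for h in hypotheses) + 1

    def score(h):
        exploit = h["confidence"]
        uncertainty = c * math.sqrt(math.log(total) / (h["trials"] + 1))
        return exploit + uncertainty

    return max(hypotheses, key=score)

# Hypothetical bank contents, named after motifs mentioned in the figures.
bank = [
    {"name": "depthwise-separable stem", "confidence": 0.8, "trials": 12},
    {"name": "global shape tokens", "confidence": 0.6, "trials": 2},
    {"name": "octave downsampling", "confidence": 0.4, "trials": 9},
]
chosen = select_parent(bank)
```

With these numbers the under-tested "global shape tokens" hypothesis wins despite its lower confidence, which is the intended trade-off between exploiting validated principles and resolving uncertain ones.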

What carries the argument

The Hypothesis Memory Bank, which records and updates confidence scores for each hypothesis based on consolidated multi-perspective agent feedback after every experiment to guide which parent hypotheses to extend next.
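As a concrete, hypothetical reading of that loop, a confidence update might consolidate per-agent verdicts into a single bounded step. The mean consolidation and the shrinking step size below are assumptions; the paper describes the mechanism only qualitatively.

```python
def update_confidence(confidence, trials, agent_signals, lr=0.3):
    """Consolidate multi-perspective feedback into one confidence update.

    agent_signals: per-agent verdicts in [-1, 1] (e.g. from accuracy,
    efficiency, stability, and novelty analyses). The mean consolidation
    and the shrinking learning rate are illustrative assumptions.
    """
    consolidated = sum(agent_signals) / len(agent_signals)
    # Shrink the step size as evidence accumulates, so early noisy
    # experiments move the score more than late confirmatory ones.
    step = lr / (1 + trials)
    new_conf = min(1.0, max(0.0, confidence + step * consolidated))
    return new_conf, trials + 1

# Three experiments, four feedback agents each (hypothetical verdicts).
conf, n = 0.5, 0
for signals in [(+1, +1, 0, +1), (+1, 0, +1, +1), (-1, +1, +1, 0)]:
    conf, n = update_confidence(conf, n, signals)
```

Here three rounds of mostly positive feedback lift the score from 0.5 toward 0.86, with each round counting for less than the one before.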

If this is right

  • Stronger lightweight vision architectures can be discovered on CIFAR-10, CIFAR-100, and Tiny-ImageNet starting from weak baselines.
  • Hypothesis confidence scores become increasingly accurate predictors of future performance as more experimental evidence is collected.
  • Design principles identified during search transfer across independent evolutionary lineages and across different datasets.
  • The same process applies to specialized domains such as medical imaging on MedMNIST and reaches state-of-the-art results there.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The confidence-tracking mechanism could be repurposed to build an explicit, queryable map of which architectural motifs work well under which conditions.
  • If the learned principles prove robust, the framework might be extended to automate model design in non-vision tasks such as sequence modeling or control policies.
  • Replacing some feedback agents with human experts could create a hybrid loop that accelerates discovery while still reducing manual coding effort.

Load-bearing premise

Large language models can reliably generate valid, implementable neural architecture code that improves when guided by high-level experimental feedback.

What would settle it

A direct comparison run in which the same starting hypotheses are evolved once using the full confidence-updating agents and once using random or fixed scores, checking whether the full system produces measurably higher final accuracies.
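Such an ablation could be harnessed as a paired, seed-matched comparison. The `run_search` stub below is a stand-in for the real search loop, with a simulated per-iteration advantage when confidence updates are enabled so the example executes; nothing here is the authors' code.

```python
import random

def run_search(seed, use_confidence_updates, iterations=50):
    """Stand-in for one evolutionary search run.

    A real harness would call the full HypoExplore loop; this stub only
    illustrates the paired protocol. The simulated advantage for
    confidence updates is an assumption made purely so the code runs.
    """
    rng = random.Random(seed)
    best = 18.91  # the reported root-baseline accuracy on CIFAR-10
    for _ in range(iterations):
        gain = rng.uniform(0, 2.0) + (0.5 if use_confidence_updates else 0.0)
        best = min(100.0, best + gain * rng.random())
    return best

# Same seeds for both arms, so each pair shares its random draws.
seeds = range(5)
full = [run_search(s, True) for s in seeds]
ablated = [run_search(s, False) for s in seeds]
delta = sum(full) / len(full) - sum(ablated) / len(ablated)
```

Seed-matching makes the comparison paired: a positive `delta` across seeds (ideally with a significance test) is the signal the referee asks for.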

Figures

Figures reproduced from arXiv: 2604.12999 by Chen Wei, Hanjie Chen, Jaywon Koo, Jefferson Hernandez, Ruozhen He, Vicente Ordonez.

Figure 1
Figure 1. High-level overview of HypoExplore. Starting from a research direction, HypoExplore initializes a discovery state with a Trajectory Tree Memory and Hypothesis Memory Bank (Step 0 → Step 1). At each subsequent step, the current discovery state selects a parent node and hypothesis to guide the Research Cycle, producing an updated discovery state with enriched memory (Step t → Step t+1).
Figure 2
Figure 2. HypoExplore finds a lightweight Global Shape Token Network (GSTN) that introduces a small bank of learned global vectors. This network closely matches or surpasses other manually engineered networks while using fewer parameters.
Figure 3
Figure 3. Overview of the per-node Research Cycle. The Idea Agent proposes a neural architecture, which the Coding Agent implements with iterative hyperparameter tuning. A Redundancy Filtering Agent checks against the Tree Memory to prevent re-generation of concepts already explored. The Executor trains and evaluates each architecture, and the results are analyzed by four specialized Feedback Agents (right).
Figure 4
Figure 4. HypoExplore discovers high-performing architectures via hypothesis-guided evolutionary branching.
Figure 5
Figure 5. Accumulated best accuracy over 50 iterations on CIFAR-10.
Figure 6
Figure 6. Analysis of hypothesis memory over 50 iterations on CIFAR-10. Left: Hypothesis prediction …
Figure 7
Figure 7. Cross-lineage hypothesis applications succeed at a comparable rate to within-lineage ones, indicating transferable design principles.
Figure 8
Figure 8. HypoExplore's run using Gemini-3.1-pro.
read the original abstract

We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HypoExplore, an agentic framework that treats neural architecture discovery for visual recognition as hypothesis-driven scientific inquiry. It uses LLMs to ideate, implement, and evolve architectures via evolutionary branching from a root baseline, guided by a Trajectory Tree for lineages and a Hypothesis Memory Bank for tracking confidence scores updated by multi-perspective feedback agents. Experiments on CIFAR-10 report accuracy rising from 18.91% to 94.11%, with generalization to CIFAR-100 and Tiny-ImageNet, SOTA results on MedMNIST, and evidence that confidence scores become predictive while extracted principles transfer across lineages.

Significance. If the core claims hold under rigorous validation, the work could advance automated NAS by shifting from black-box search to interpretable, principle-extracting inquiry, with potential for more transferable design knowledge in CV. The reported accuracy gains and cross-dataset generalization would be notable if supported by proper controls, but the absence of statistical rigor and ablations limits immediate impact.

major comments (3)
  1. [Abstract] Abstract (results paragraph): The central claim that HypoExplore builds 'genuine understanding of the design space' via predictive confidence scores and cross-lineage principle transfer rests on internal observations within the same closed evolutionary loop; no ablation isolating the Hypothesis Memory Bank and multi-agent feedback from base LLM-driven search is reported, leaving open that gains may arise from stochasticity or capacity rather than causal transferable principles.
  2. [Abstract] Abstract (experimental claims): The CIFAR-10 result (18.91% to 94.11%) and generalization to CIFAR-100/Tiny-ImageNet plus SOTA on MedMNIST are presented without specifying number of independent runs, error bars, statistical significance tests, or comparisons to strong NAS baselines (e.g., DARTS, NASNet), which is load-bearing for validating superiority and robustness of the discovered architectures.
  3. [Method] Method (Hypothesis Memory Bank and feedback agents): The confidence update mechanism from multi-perspective agents is described qualitatively but lacks a formal equation, pseudocode, or sensitivity analysis to sampling parameters; this directly affects the reproducibility of the 'increasingly predictive' scores and the claim that updates reflect unbiased design principles rather than trajectory artifacts.
minor comments (2)
  1. [Abstract] The root baseline accuracy of 18.91% on CIFAR-10 should be explicitly compared to standard simple CNNs to clarify the starting point of the evolutionary process.
  2. [Method] Notation for the Trajectory Tree and Hypothesis Memory Bank could be formalized with a diagram or pseudocode for clarity in the method description.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for strengthening the rigor and reproducibility of our work. We address each major comment below and commit to revisions that enhance the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract (results paragraph): The central claim that HypoExplore builds 'genuine understanding of the design space' via predictive confidence scores and cross-lineage principle transfer rests on internal observations within the same closed evolutionary loop; no ablation isolating the Hypothesis Memory Bank and multi-agent feedback from base LLM-driven search is reported, leaving open that gains may arise from stochasticity or capacity rather than causal transferable principles.

    Authors: We acknowledge that an explicit ablation isolating the Hypothesis Memory Bank and multi-perspective feedback would provide stronger causal evidence. In the revised manuscript, we will add an ablation study comparing the full framework against a base LLM-driven evolutionary search without the memory bank or consolidated feedback. This will quantify the incremental benefit and address potential stochasticity concerns. We note that the cross-lineage principle transfer is already demonstrated by seeding a new independent run with principles extracted from a prior lineage, yielding faster convergence; however, we will clarify this distinction and expand the analysis to better separate internal loop effects from transferable design knowledge. revision: yes

  2. Referee: [Abstract] Abstract (experimental claims): The CIFAR-10 result (18.91% to 94.11%) and generalization to CIFAR-100/Tiny-ImageNet plus SOTA on MedMNIST are presented without specifying number of independent runs, error bars, statistical significance tests, or comparisons to strong NAS baselines (e.g., DARTS, NASNet), which is load-bearing for validating superiority and robustness of the discovered architectures.

    Authors: We agree that statistical rigor and baseline comparisons are essential for validating the reported gains. In the revision, we will report results aggregated over multiple independent runs (with means, standard deviations, and error bars), include appropriate statistical tests (e.g., paired t-tests), and add direct comparisons to established NAS methods such as DARTS and NASNet on CIFAR-10 under comparable compute budgets. The 94.11% figure represents the best architecture from the primary run; we will emphasize robustness metrics and note any limitations in direct apples-to-apples comparisons arising from differing search paradigms. revision: yes

  3. Referee: [Method] Method (Hypothesis Memory Bank and feedback agents): The confidence update mechanism from multi-perspective agents is described qualitatively but lacks a formal equation, pseudocode, or sensitivity analysis to sampling parameters; this directly affects the reproducibility of the 'increasingly predictive' scores and the claim that updates reflect unbiased design principles rather than trajectory artifacts.

    Authors: We will strengthen the Method section by introducing a formal equation for the confidence update rule that aggregates multi-agent feedback into hypothesis scores. We will also include pseudocode for the full update and consolidation process. Additionally, a sensitivity analysis on parameters such as the number of feedback agents, sampling temperature, and evidence weighting will be added to demonstrate that the predictive improvement in confidence scores is robust and not an artifact of specific trajectory choices. revision: yes

Circularity Check

1 steps flagged

Claim of 'genuine understanding' reduces to internal confidence updates and lineage transfer within the same closed loop

specific steps
  1. fitted input presented as prediction [Abstract]
    "We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space."

    Confidence scores are explicitly updated after each experiment via the multi-perspective feedback agents using the results from the same runs; demonstrating that these scores 'grow increasingly predictive' therefore uses the fitted updates on the data that generated them. The cross-lineage transfer is likewise measured inside the Trajectory Tree and Hypothesis Memory Bank produced by the framework itself, with no external benchmark or ablation separating extracted principles from search artifacts.

full rationale

The paper's empirical accuracy results (e.g., 94.11% on CIFAR-10) are independent measurements and not circular. However, the load-bearing interpretive claim that the framework builds genuine understanding of the design space is supported only by showing that hypothesis confidence scores (updated from the same multi-agent feedback on experimental outcomes) become increasingly predictive and that principles transfer across lineages generated inside the identical evolutionary process. This makes the 'understanding' assertion a re-description of the search trajectory's internal statistics rather than an externally validated extraction of causal principles.
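One external check of the "increasingly predictive" claim would freeze confidence scores before a set of held-out nodes is trained, then rank-correlate them with the accuracies measured afterwards. A minimal sketch with illustrative numbers; the data and the held-out protocol are assumptions, not something the paper reports.

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation between two equal-length score lists."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical held-out check: confidences frozen *before* these nodes
# were trained, paired with accuracies measured afterwards. A high tau
# on nodes the memory bank never saw would be external evidence of
# predictiveness; these numbers are illustrative only.
frozen_confidence = [0.3, 0.7, 0.5, 0.9, 0.2]
later_accuracy = [61.0, 88.5, 79.2, 93.4, 55.1]
tau = kendall_tau(frozen_confidence, later_accuracy)
```

Because the scores are frozen before the outcomes exist, a high correlation here cannot be a re-description of the fitting loop, which is exactly what the circularity flag asks for.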

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 3 invented entities

The framework introduces several new conceptual components and depends on unverified assumptions about LLM capabilities and the validity of internally generated confidence metrics.

free parameters (2)
  • confidence update mechanism
    Rules for how experimental results translate into hypothesis confidence scores are central but unspecified in detail.
  • LLM prompt templates and sampling parameters
    Prompt engineering choices for ideation and feedback directly control hypothesis generation.
axioms (2)
  • domain assumption Large language models can generate syntactically valid and functionally improvable neural network code from textual descriptions and performance feedback.
    Invoked throughout the ideation, implementation, and improvement steps.
  • domain assumption Performance on CIFAR-10 and similar small datasets provides reliable signals for updating hypothesis confidence that generalize to other tasks.
    Basis for all confidence updates and transfer claims.
invented entities (3)
  • Trajectory Tree no independent evidence
    purpose: Records the evolutionary lineage of all proposed architectures
    New data structure for tracking branching hypotheses.
  • Hypothesis Memory Bank no independent evidence
    purpose: Actively tracks and updates confidence scores for hypotheses based on evidence
    Central storage for evidence accumulation.
  • feedback agents no independent evidence
    purpose: Analyze experimental results from multiple perspectives and consolidate findings
    Multiple agents for updating confidence.




  48. [48]

    Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory, 2025

    Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory, 2025

  49. [49]

    An Image Patch Is a Wave: Phase-Aware Vision MLP

    Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, and Yunhe Wang. An Image Patch Is a Wave: Phase-Aware Vision MLP. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10935–10944, June 2022

  50. [50]

    FAIRCodeGenteam,JadeCopet,QuentinCarbonneaux,GalCohen,JonasGehring,JacobKahn,JannikKossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabian Gloeckle, Al...

  51. [51]

    HypER: Literature-grounded hypothesis generation and distillation with provenance

    Rosni Vasu, Chandrayee Basu, Bhavana Dalvi Mishra, Cristina Sarasua, Peter Clark, and Abraham Bern- stein. HypER: Literature-grounded hypothesis generation and distillation with provenance. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Con- ference on Empirical Methods in Natural Languag...

  52. [52]

    Association for Computational Linguistics

  53. [53]

    On the failure to eliminate hypotheses in a conceptual task.Quarterly journal of experimental psychology, 12(3):129–140, 1960

    Peter C Wason. On the failure to eliminate hypotheses in a conceptual task.Quarterly journal of experimental psychology, 12(3):129–140, 1960

  54. [54]

    Neural Predictor for Neural Architecture Search

    Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural Predictor for Neural Architecture Search. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 660–676, Cham, 2020. Springer International Publishing

  55. [55]

    Group- Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026

    Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group- Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026

  56. [56]

    Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific data, 10(1):41, 2023

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific data, 10(1):41, 2023

  57. [57]

    R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science, 2025

    Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, and Jiang Bian. R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science, 2025

  58. [58]

    Nader: Neural architecture design via multi-agent collaboration

    Zekang Yang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, and Wentao Liu. Nader: Neural architecture design via multi-agent collaboration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4452–4461, 2025

  59. [59]

    MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses, 2025

    Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, and Dongzhan Zhou. MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses, 2025. 15 Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

  60. [60]

    TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents, 2025

    Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, and Jiaxuan You. TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents, 2025

  61. [61]

    AlphaResearch: Accelerating New Algorithm Discovery with Language Models, 2025

    Zhaojian Yu, Kaiyue Feng, Yilun Zhao, Shilin He, Xiao-Ping Zhang, and Arman Cohan. AlphaResearch: Accelerating New Algorithm Discovery with Language Models, 2025

  62. [62]

    arXiv preprint arXiv:2403.03849 (2024)

    Yubiao Yue and Zhenzhang Li. Medmamba: Vision mamba for medical image classification.arXiv preprint arXiv:2403.03849, 2024

  63. [63]

    Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents, 2025

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents, 2025

  64. [64]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, 2026

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, 2026

  65. [65]

    Med-lego: Editing and adapting toward generalist medical image diagnosis

    Yitao Zhu, Yuan Yin, Jiaming Li, Mengjie Xu, Zihao Zhao, Honglin Xiong, Sheng Wang, and Qian Wang. Med-lego: Editing and adapting toward generalist medical image diagnosis. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 438–447. Springer, 2025. 16 Agentic Discovery with Active Hypothesis Exploration for Vis...

{parent_architecture}: the parent node's full brainstorming output (title, description, intuition, novelty, architecture spec) plus performance metrics (accuracy, training time, novelty score)

{feedback_summary}: concatenated outputs from all feedback agents, including per-agent reasoning, actionable recommendations, hypothesis updates with current confidence from ℳ𝑡, and newly proposed hypotheses

{hypothesis_memory}: compiled hypothesis context from ℳ𝑡, grouped into confirmed (𝑐 > 0.75), refuted (𝑐 < 0.25), and uncertain patterns, each with a full evidence log and agent attribution
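The confidence thresholds above imply a simple bucketing rule when the memory bank is compiled into prompt context. A minimal sketch, assuming a plain list-of-dicts memory bank (the field names and example entries are illustrative, not the paper's exact schema):

```python
# Sketch: compiling {hypothesis_memory} from a memory bank M_t by confidence.
# Thresholds follow the template above (c > 0.75 confirmed, c < 0.25 refuted);
# the record fields and example ids are invented for illustration.

def compile_hypothesis_memory(bank):
    """Group hypothesis records into confirmed / refuted / uncertain buckets."""
    groups = {"confirmed": [], "refuted": [], "uncertain": []}
    for hyp in bank:
        c = hyp["confidence"]
        if c > 0.75:
            groups["confirmed"].append(hyp)
        elif c < 0.25:
            groups["refuted"].append(hyp)
        else:
            groups["uncertain"].append(hyp)
    return groups

bank = [
    {"id": "hyp_13", "confidence": 0.82, "evidence": ["node_4: supports"]},
    {"id": "hyp_18", "confidence": 0.60, "evidence": ["node_7: weak support"]},
    {"id": "hyp_27", "confidence": 0.10, "evidence": ["node_2: contradicts"]},
]
groups = compile_hypothesis_memory(bank)
```

Each bucket keeps its full records, so the evidence log and agent attribution can still be rendered into the prompt.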

Selected hypothesis ℎ⋆ to test.

Output: a single evolved architecture containing:

• Structured reasoning trace: parent_analysis (what worked), failure_analysis (what failed), hypothesis_usage (which hypotheses guide the design), proposed_changes (targeted modifications)
• Architecture fields: title, description, architecture_spec, etc.
• existing_hypotheses: referenced h...
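The evolved-architecture output above can be sketched as a plain dictionary with a completeness check on the reasoning trace. Key names mirror the listed fields; all values are invented placeholders:

```python
# Sketch of the evolved-architecture output described above, as a Python dict.
# Keys mirror the listed fields; every value here is an invented placeholder.

REQUIRED_TRACE_FIELDS = {
    "parent_analysis",    # what worked in the parent
    "failure_analysis",   # what failed
    "hypothesis_usage",   # which hypotheses guide the design
    "proposed_changes",   # targeted modifications
}

def check_evolved_output(node):
    """Return True if the node carries the full structured reasoning trace."""
    return REQUIRED_TRACE_FIELDS <= set(node.get("reasoning_trace", {}))

node = {
    "reasoning_trace": {
        "parent_analysis": "...",
        "failure_analysis": "...",
        "hypothesis_usage": "...",
        "proposed_changes": "...",
    },
    "title": "...",
    "description": "...",
    "architecture_spec": "...",
    "existing_hypotheses": ["hyp_3"],
}
```

A validator like this would reject nodes whose trace omits any of the four required reasoning fields.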

IF a small number (𝐺 ≤ 4) of stage-local Global Shape Tokens (GSTs) are added that aggregate low-frequency token summaries via sparse top-𝑘 routing and broadcast compact shape-aware updates back to tokens IN NPIN-Guard-style particle+slot backbones on CIFAR-scale tasks, THEN overall shape-sensitive accuracy will increase and texture-driven confusions will dec...

Tags: global-routing, sparse-routing, shape-aggregation. Initial confidence: 0.55. Connected: hyp_44, hyp_27

IF super-particle anchor messages are L2-normalized per-band before FiLM blending IN multi-band NPIN-style backbones, THEN the incidence of anchor-driven prototype-takeover misclassifications will decrease and shape-sensitive per-class recall will improve, BECAUSE bounding anchor magnitudes prevents high-frequency anchors from numerically overwhelming token iden...

IF token-to-hub routing is implemented as a temperature-scaled soft assignment combined with a small learnable residual blend weight (𝛼) in the token update IN mid/high network stages that perform global mixing, THEN the model will reduce high-confidence wrong predictions and improve top-1 accuracy without harming top-5, BECAUSE soft routing preserves per-token uncer...

IF small per-scale, per-channel-group additive phase-offset parameters are added to the local depthwise separable conv sampling locations IN early/mid stages of the hierarchical backbone, THEN shift-equivariant failure modes will be reduced and shape-sensitive class accuracy will improve, BECAUSE tiny phase offsets break exact translation symmetry in filter ...

IF per-band usage-normalized FiLM controllers replace naive/shared FiLM controllers IN multi-scale wavelet-style token-mixing stages for CIFAR-scale hierarchical backbones, THEN low-pass (shape) channels will retain sufficient forward activation mass and controller responsiveness so that shape-sensitive per-class accuracy improves while avoiding high-frequency proto...

Tags: band-normalization, FiLM, multi-scale. Initial confidence: 0.6. Connected: hyp_18, hyp_13

IF gated cross-band residual connectors (low-pass → band-pass multiplicative gates with tiny gating MLPs) are added to band-pass processors IN WTM stages, THEN the model will reduce texture-triggered false positives and improve discrimination for visually-similar classes, BECAUSE low-pass summaries provide spatially-coherent priors that multiplicatively suppress ...
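The hypothesis records above follow a fixed IF/IN/THEN/BECAUSE schema with tags, an initial confidence, and links to related hypotheses. A minimal sketch of such a record; the field names are illustrative, and the confidence-update rule here is a simple interpolation of my own, not the paper's actual rule:

```python
# Sketch of one Hypothesis Memory Bank entry following the IF/IN/THEN/BECAUSE
# schema shown above. Field names and the update rule are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    text: str          # IF <change> IN <scope> THEN <prediction> BECAUSE <mechanism>
    tags: list
    confidence: float  # initial confidence, revised by feedback agents
    connected: list = field(default_factory=list)  # related hypothesis ids

    def update(self, strength: float, supports: bool):
        """Nudge confidence toward 1 (supporting) or 0 (refuting) evidence."""
        target = 1.0 if supports else 0.0
        self.confidence += strength * (target - self.confidence)

hyp = Hypothesis(
    text="IF per-band usage-normalized FiLM controllers replace shared ones ...",
    tags=["band-normalization", "FiLM", "multi-scale"],
    confidence=0.6,
    connected=["hyp_18", "hyp_13"],
)
hyp.update(strength=0.5, supports=True)  # one piece of supporting evidence
```

With this interpolation, repeated supporting evidence drives the confidence asymptotically toward 1 without ever overshooting it.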

Deduplicate Hypothesis Updates. If multiple agents update the SAME hypothesis: combine into ONE update with synthesized reasoning. Synthesize strength by weighting how direct each agent's evidence is. If agents disagree on evidence type, explain the disagreement and choose the most justified type. Always cite contributing agents. Misattributed Updates Chec...

Deduplicate New Hypothesis Candidates. Merge overlapping proposals from different agents. Check against existing hypotheses; if already captured, propose an UPDATE instead. Maximum 2 new hypotheses per node. Contradiction Check (Critical). Before creating any new hypothesis, check if it proposes the OPPOSITE outcome of an existing hypothesis for the SAME mec...
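The deduplication rule above merges per-agent updates that target the same hypothesis into one strength-weighted update that cites all contributing agents. A minimal sketch; the directness-weighted mean is my illustrative reading of "weighting how direct each agent's evidence is", not the paper's exact formula:

```python
# Sketch of the deduplication rule: multiple agent updates to the SAME
# hypothesis merge into ONE, strength-weighted by evidence directness,
# citing contributing agents. The weighting scheme is illustrative.

def deduplicate_updates(updates):
    """Merge per-agent updates that target the same hypothesis id."""
    merged = {}
    for u in updates:
        hid = u["hyp_id"]
        if hid not in merged:
            merged[hid] = {"hyp_id": hid, "strength": 0.0,
                           "weight": 0.0, "agents": []}
        m = merged[hid]
        w = u["directness"]           # how direct this agent's evidence is
        m["strength"] += w * u["strength"]
        m["weight"] += w
        m["agents"].append(u["agent"])
    for m in merged.values():
        m["strength"] /= m["weight"]  # directness-weighted mean strength
        del m["weight"]
    return list(merged.values())

updates = [
    {"hyp_id": "hyp_3", "agent": "diagnostic", "strength": 0.9, "directness": 1.0},
    {"hyp_id": "hyp_3", "agent": "novelty", "strength": 0.5, "directness": 0.5},
]
merged = deduplicate_updates(updates)
```

Here the two agents' updates collapse into a single entry whose strength leans toward the more direct (diagnostic) evidence.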

Quality Requirements. Every new hypothesis must satisfy all 7 quality dimensions [see Section B.5].

Implementation Notes. Pass through from the diagnostic agent without modification.

Output Format (JSON):

• "hypothesis_updates": [{"hyp_id", "evidence_type", "strength", "reasoning"}]
• "new_hypotheses": [{"text", "scope", "prediction", "falsification_criteria", "tags", "initial_confidence", "reasoning", "connected_hypotheses"}]
• "implementation_notes": [{"hyp...
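The JSON output format above can be illustrated with a concrete instance. All field values below are invented placeholders; only the key names come from the format specification:

```python
# An illustrative instance of the JSON output format above; values are
# invented placeholders, with elided fields kept as "..." stubs.
import json

output = {
    "hypothesis_updates": [
        {"hyp_id": "hyp_3", "evidence_type": "supports",
         "strength": 0.7, "reasoning": "accuracy rose after the change"}
    ],
    "new_hypotheses": [
        {"text": "IF ... IN ... THEN ... BECAUSE ...", "scope": "...",
         "prediction": "...", "falsification_criteria": "...",
         "tags": ["example"], "initial_confidence": 0.5,
         "reasoning": "...", "connected_hypotheses": []}
    ],
    "implementation_notes": [],
}
payload = json.dumps(output)  # what the aggregating agent would emit
```

Serializing through the json module gives a cheap structural check that the emitted object round-trips with exactly the three top-level keys the format requires.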