Agentic Discovery with Active Hypothesis Exploration for Visual Recognition
Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3
The pith
HypoExplore turns neural architecture search into an active process of proposing, testing, and refining scientific hypotheses about design choices using language model agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HypoExplore formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. The framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence.
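The selection rule behind this "dual strategy" is not spelled out here; a minimal sketch of one plausible realization, Thompson-style sampling over per-hypothesis success counts (all names and counts below are hypothetical), would look like:

```python
import random

# Hypothetical sketch: pick a parent hypothesis by sampling a Beta
# posterior per hypothesis, so well-validated hypotheses are exploited
# while uncertain ones (little evidence either way) are still explored.
def select_parent(hypotheses):
    """hypotheses: list of dicts with 'id', 'successes', 'failures'."""
    best_id, best_draw = None, -1.0
    for h in hypotheses:
        # Beta(1 + successes, 1 + failures): wide when evidence is scarce,
        # concentrated once a hypothesis is clearly validated or refuted.
        draw = random.betavariate(1 + h["successes"], 1 + h["failures"])
        if draw > best_draw:
            best_id, best_draw = h["id"], draw
    return best_id

pool = [
    {"id": "hyp_residual", "successes": 8, "failures": 1},  # validated
    {"id": "hyp_octave",   "successes": 1, "failures": 1},  # uncertain
]
print(select_parent(pool))  # usually hyp_residual, sometimes hyp_octave
```

The Beta-sampling choice is one standard way to trade off exploiting validated principles against resolving uncertain ones; the paper may implement the balance entirely in the LLM's prompt instead.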
What carries the argument
The Hypothesis Memory Bank, which records and updates confidence scores for each hypothesis based on consolidated multi-perspective agent feedback after every experiment to guide which parent hypotheses to extend next.
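As a concrete illustration, a minimal memory-bank sketch might pair each hypothesis with a confidence score nudged by consolidated feedback. The confirmed (c > 0.75) and refuted (c < 0.25) bands appear in the framework's own prompt templates; the fixed-step update rule is an illustrative assumption, not the paper's mechanism:

```python
# Hypothetical sketch of a Hypothesis Memory Bank: one confidence score
# per hypothesis, moved toward 1.0 or 0.0 by consolidated agent feedback.
# The fixed step size is an assumption for illustration.
class HypothesisMemoryBank:
    def __init__(self, step=0.1):
        self.scores = {}    # hypothesis id -> confidence in [0, 1]
        self.evidence = {}  # hypothesis id -> list of feedback records
        self.step = step

    def update(self, hyp_id, supported, note=""):
        c = self.scores.get(hyp_id, 0.5)  # new hypotheses start uncertain
        c += self.step if supported else -self.step
        self.scores[hyp_id] = min(1.0, max(0.0, c))
        self.evidence.setdefault(hyp_id, []).append((supported, note))

    def status(self, hyp_id):
        c = self.scores.get(hyp_id, 0.5)
        if c > 0.75:
            return "confirmed"
        if c < 0.25:
            return "refuted"
        return "uncertain"

bank = HypothesisMemoryBank()
for _ in range(4):
    bank.update("hyp_residual", supported=True, note="accuracy rose")
print(bank.status("hyp_residual"))  # "confirmed" (score ~= 0.9)
```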
If this is right
- Stronger lightweight vision architectures can be discovered on CIFAR-10, CIFAR-100, and Tiny-ImageNet starting from weak baselines.
- Hypothesis confidence scores become increasingly accurate predictors of future performance as more experimental evidence is collected.
- Design principles identified during search transfer across independent evolutionary lineages and across different datasets.
- The same process applies to specialized domains such as medical imaging on MedMNIST and reaches state-of-the-art results there.
Where Pith is reading between the lines
- The confidence-tracking mechanism could be repurposed to build an explicit, queryable map of which architectural motifs work well under which conditions.
- If the learned principles prove robust, the framework might be extended to automate model design in non-vision tasks such as sequence modeling or control policies.
- Replacing some feedback agents with human experts could create a hybrid loop that accelerates discovery while still reducing manual coding effort.
Load-bearing premise
Large language models can reliably generate valid, implementable neural architecture code that improves when guided by high-level experimental feedback.
What would settle it
A direct comparison run in which the same starting hypotheses are evolved once using the full confidence-updating agents and once using random or fixed scores, checking whether the full system produces measurably higher final accuracies.
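Such a control could be organized as a paired comparison over matched seeds. In the sketch below, evolve_run is a hypothetical stub standing in for a full evolutionary run, with a simulated effect size rather than real measurements:

```python
import random
import statistics

# Hypothetical harness for the proposed control: evolve the same seeds
# once with learned confidence scores and once with random/fixed scores,
# then compare final accuracies. evolve_run is a stub for the framework;
# the 0.90 / 0.88 baselines are simulated, not reported numbers.
def evolve_run(seed, use_learned_confidence):
    rng = random.Random(seed)
    base = 0.90 if use_learned_confidence else 0.88  # assumed effect
    return base + rng.gauss(0, 0.01)

seeds = range(10)
full_system  = [evolve_run(s, True)  for s in seeds]
fixed_scores = [evolve_run(s, False) for s in seeds]
print(statistics.mean(full_system) > statistics.mean(fixed_scores))
```

Pairing by seed removes run-to-run noise from the comparison, which is exactly what the missing ablation would need before attributing gains to the confidence machinery.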
Original abstract
We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HypoExplore, an agentic framework that treats neural architecture discovery for visual recognition as hypothesis-driven scientific inquiry. It uses LLMs to ideate, implement, and evolve architectures via evolutionary branching from a root baseline, guided by a Trajectory Tree for lineages and a Hypothesis Memory Bank for tracking confidence scores updated by multi-perspective feedback agents. Experiments on CIFAR-10 report accuracy rising from 18.91% to 94.11%, with generalization to CIFAR-100 and Tiny-ImageNet, SOTA results on MedMNIST, and evidence that confidence scores become predictive while extracted principles transfer across lineages.
Significance. If the core claims hold under rigorous validation, the work could advance automated NAS by shifting from black-box search to interpretable, principle-extracting inquiry, with potential for more transferable design knowledge in CV. The reported accuracy gains and cross-dataset generalization would be notable if supported by proper controls, but the absence of statistical rigor and ablations limits immediate impact.
major comments (3)
- [Abstract] Abstract (results paragraph): The central claim that HypoExplore builds 'genuine understanding of the design space' via predictive confidence scores and cross-lineage principle transfer rests on internal observations within the same closed evolutionary loop; no ablation isolating the Hypothesis Memory Bank and multi-agent feedback from base LLM-driven search is reported, leaving open that gains may arise from stochasticity or capacity rather than causal transferable principles.
- [Abstract] Abstract (experimental claims): The CIFAR-10 result (18.91% to 94.11%) and generalization to CIFAR-100/Tiny-ImageNet plus SOTA on MedMNIST are presented without specifying number of independent runs, error bars, statistical significance tests, or comparisons to strong NAS baselines (e.g., DARTS, NASNet), which is load-bearing for validating superiority and robustness of the discovered architectures.
- [Method] Method (Hypothesis Memory Bank and feedback agents): The confidence update mechanism from multi-perspective agents is described qualitatively but lacks a formal equation, pseudocode, or sensitivity analysis to sampling parameters; this directly affects the reproducibility of the 'increasingly predictive' scores and the claim that updates reflect unbiased design principles rather than trajectory artifacts.
minor comments (2)
- [Abstract] The root baseline accuracy of 18.91% on CIFAR-10 should be explicitly compared to standard simple CNNs to clarify the starting point of the evolutionary process.
- [Method] Notation for the Trajectory Tree and Hypothesis Memory Bank could be formalized with a diagram or pseudocode for clarity in the method description.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for strengthening the rigor and reproducibility of our work. We address each major comment below and commit to revisions that enhance the manuscript without altering its core contributions.
Point-by-point responses
-
Referee: [Abstract] Abstract (results paragraph): The central claim that HypoExplore builds 'genuine understanding of the design space' via predictive confidence scores and cross-lineage principle transfer rests on internal observations within the same closed evolutionary loop; no ablation isolating the Hypothesis Memory Bank and multi-agent feedback from base LLM-driven search is reported, leaving open that gains may arise from stochasticity or capacity rather than causal transferable principles.
Authors: We acknowledge that an explicit ablation isolating the Hypothesis Memory Bank and multi-perspective feedback would provide stronger causal evidence. In the revised manuscript, we will add an ablation study comparing the full framework against a base LLM-driven evolutionary search without the memory bank or consolidated feedback. This will quantify the incremental benefit and address potential stochasticity concerns. We note that the cross-lineage principle transfer is already demonstrated by seeding a new independent run with principles extracted from a prior lineage, yielding faster convergence; however, we will clarify this distinction and expand the analysis to better separate internal loop effects from transferable design knowledge. revision: yes
-
Referee: [Abstract] Abstract (experimental claims): The CIFAR-10 result (18.91% to 94.11%) and generalization to CIFAR-100/Tiny-ImageNet plus SOTA on MedMNIST are presented without specifying number of independent runs, error bars, statistical significance tests, or comparisons to strong NAS baselines (e.g., DARTS, NASNet), which is load-bearing for validating superiority and robustness of the discovered architectures.
Authors: We agree that statistical rigor and baseline comparisons are essential for validating the reported gains. In the revision, we will report results aggregated over multiple independent runs (with means, standard deviations, and error bars), include appropriate statistical tests (e.g., paired t-tests), and add direct comparisons to established NAS methods such as DARTS and NASNet on CIFAR-10 under comparable compute budgets. The 94.11% figure represents the best architecture from the primary run; we will emphasize robustness metrics and note any limitations in direct apples-to-apples comparisons arising from differing search paradigms. revision: yes
-
Referee: [Method] Method (Hypothesis Memory Bank and feedback agents): The confidence update mechanism from multi-perspective agents is described qualitatively but lacks a formal equation, pseudocode, or sensitivity analysis to sampling parameters; this directly affects the reproducibility of the 'increasingly predictive' scores and the claim that updates reflect unbiased design principles rather than trajectory artifacts.
Authors: We will strengthen the Method section by introducing a formal equation for the confidence update rule that aggregates multi-agent feedback into hypothesis scores. We will also include pseudocode for the full update and consolidation process. Additionally, a sensitivity analysis on parameters such as the number of feedback agents, sampling temperature, and evidence weighting will be added to demonstrate that the predictive improvement in confidence scores is robust and not an artifact of specific trajectory choices. revision: yes
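The promised update rule is not given in the available text; one plausible illustrative form (an assumption, not the authors' equation) that matches the description of consolidating K agents' feedback into a confidence score is an exponential moving average:

```latex
% Illustrative only: consolidating K feedback agents' assessments
% s_k(h) \in [0,1] into the confidence c_t(h) of hypothesis h.
c_{t+1}(h) \;=\; (1-\alpha)\, c_t(h) \;+\; \alpha \cdot \frac{1}{K}\sum_{k=1}^{K} s_k(h),
\qquad \alpha \in (0,1].
```

Here the mixing rate alpha and the per-agent scores s_k are the sampling parameters whose sensitivity the rebuttal commits to analyzing.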
Circularity Check
Claim of 'genuine understanding' reduces to internal confidence updates and lineage transfer within the same closed loop
specific steps
-
fitted input called prediction
[Abstract]
"We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space."
Confidence scores are explicitly updated after each experiment via the multi-perspective feedback agents using the results from the same runs; demonstrating that these scores 'grow increasingly predictive' therefore uses the fitted updates on the data that generated them. The cross-lineage transfer is likewise measured inside the Trajectory Tree and Hypothesis Memory Bank produced by the framework itself, with no external benchmark or ablation separating extracted principles from search artifacts.
full rationale
The paper's empirical accuracy results (e.g., 94.11% on CIFAR-10) are independent measurements and not circular. However, the load-bearing interpretive claim that the framework builds genuine understanding of the design space is supported only by showing that hypothesis confidence scores (updated from the same multi-agent feedback on experimental outcomes) become increasingly predictive and that principles transfer across lineages generated inside the identical evolutionary process. This makes the 'understanding' assertion a re-description of the search trajectory's internal statistics rather than an externally validated extraction of causal principles.
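One way to break that circularity externally would be to freeze the confidence scores at some step t and test them only against experiments run strictly after t, so the scores are never evaluated on the evidence that produced them. A sketch of such a held-out check, with toy data and a hand-rolled Spearman correlation (no tie handling):

```python
# Hypothetical external check on "increasingly predictive" confidence:
# correlate scores frozen at step t with outcomes of experiments run
# after t. All numbers below are illustrative, not reported results.
def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# confidence frozen at step t  vs  accuracy of later child experiments
frozen_conf    = [0.8, 0.3, 0.6, 0.9, 0.2]
later_accuracy = [0.93, 0.71, 0.88, 0.94, 0.65]
print(round(spearman(frozen_conf, later_accuracy), 2))  # 1.0 (monotone toy data)
```

A rising frozen-score correlation over successive t would support the "increasingly predictive" claim without reusing the fitted updates on their own training evidence.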
Axiom & Free-Parameter Ledger
free parameters (2)
- confidence update mechanism
- LLM prompt templates and sampling parameters
axioms (2)
- domain assumption: Large language models can generate syntactically valid and functionally improvable neural network code from textual descriptions and performance feedback.
- domain assumption: Performance on CIFAR-10 and similar small datasets provides reliable signals for updating hypothesis confidence that generalize to other tasks.
invented entities (3)
- Trajectory Tree (no independent evidence)
- Hypothesis Memory Bank (no independent evidence)
- feedback agents (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
DhruvAgarwal,BodhisattwaPrasadMajumder,ReeceAdamson,MeghaChakravorty,SatvikaReddyGavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, et al. Autodis- covery: Open-ended scientific discovery via bayesian surprise.arXiv preprint arXiv:2507.00310, 2025
-
[2]
Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning, 2026
2026
-
[3]
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models, 2025
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models, 2025
2025
-
[4]
Emerging Properties in Self-Supervised Vision Transformers
MathildeCaron, HugoTouvron, IshanMisra, HervéJégou, JulienMairal, PiotrBojanowski, andArmandJoulin. Emerging Properties in Self-Supervised Vision Transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, October 2021
2021
-
[5]
The method of multiple working hypotheses.Science, (366):92–96, 1890
Thomas C Chamberlin. The method of multiple working hypotheses.Science, (366):92–96, 1890
-
[6]
RevoNAD: Reflective Evolutionary Exploration for Neural Architecture Design, 2025
Gyusam Chang, Jeongyoon Yoon, Shin han yi, JaeHyeok Lee, Sujin Jang, and Sangpil Kim. RevoNAD: Reflective Evolutionary Exploration for Neural Architecture Design, 2025
2025
-
[7]
An empirical evaluation of thompson sampling.Advances in neural information processing systems, 24, 2011
Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling.Advances in neural information processing systems, 24, 2011
2011
-
[8]
MLE-bench: Evaluating machine learning agents on machine learning engineering
Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, and Jinsung Yoon. MARS: Modular Agent with Reflective Search for Automated AI Research.arXiv preprint arXiv:2602.02660, 2026
-
[9]
Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution
Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3435–3444, 2019
2019
-
[10]
Language modeling by language models.arXiv preprint arXiv:2506.20249, 2025
Junyan Cheng, Peter Clark, and Kyle Richardson. Language modeling by language models.arXiv preprint arXiv:2506.20249, 2025
-
[11]
Med-former: A transformer based architecture for medical image classification
G Jignesh Chowdary and Zhaozheng Yin. Med-former: A transformer based architecture for medical image classification. InInternational conference on medical image computing and computer-assisted intervention, pages 448–457. Springer, 2024
2024
-
[12]
A tutorial on thompson sampling.Foundations and Trends®in Machine Learning, 11(1):1–99, 2018
J Russo Daniel, Van Roy Benjamin, Kazerouni Abbas, Osband Ian, and Wen Zheng. A tutorial on thompson sampling.Foundations and Trends®in Machine Learning, 11(1):1–99, 2018
2018
-
[13]
Vision transformers need registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In International Conference on Learning Representations (ICLR), 2024
2024
-
[14]
AnImageisWorth16x16Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, MostafaDehghani, MatthiasMinderer, GHeigold, SGelly, etal. AnImageisWorth16x16Words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations (ICLR), 2021. 12 Agentic Discovery with Active Hypothesis Explora...
2021
-
[15]
Bayes-Entropy Collaborative Driven Agents for Research Hypotheses Generation and Optimization, 2025
Shiyang Duan, Yuan Tian, Qi Bing, and Xiaowei Shao. Bayes-Entropy Collaborative Driven Agents for Research Hypotheses Generation and Optimization, 2025
2025
-
[16]
Interactive debugging and steering of multi-agent ai systems
Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang Zhu, and Saleema Amershi. Interactive debugging and steering of multi-agent ai systems. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2025
2025
-
[17]
Vision GNN: An Image is Worth Graph of Nodes
Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision GNN: An Image is Worth Graph of Nodes. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 8291–8303. Curran Associates, Inc., 2022
2022
-
[18]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
2016
-
[19]
Bayesian active learning for classification and preferenc e learning,
Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning.arXiv preprint arXiv:1112.5745, 2011
-
[20]
Li, Emmanuel Candès, and Jure Leskovec
Kexin Huang, Ying Jin, Ryan Li, Michael Y. Li, Emmanuel Candès, and Jure Leskovec. Automated Hypothesis Validation with Agentic Sequential Falsifications, 2025
2025
-
[21]
MLAgentBench: evaluating language agents on machine learning experimentation
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: evaluating language agents on machine learning experimentation. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024
2024
-
[22]
Prada: Protecting and detecting dataset abuse for open-source medical dataset
Jinhyeok Jang, Hong Joo Lee, Nassir Navab, and Seong Tae Kim. Prada: Protecting and detecting dataset abuse for open-source medical dataset. InInternational Conference on Medical Image Computing and Computer- Assisted Intervention, pages 463–473. Springer, 2025
2025
-
[23]
Peter Jansen, Peter Clark, Doug Downey, and Daniel S. Weld. Generating Literature-Driven Scientific Theories at Scale, 2026
2026
-
[24]
BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation
Yujing Ke, Kevin George, Kathan Pandya, David Blumenthal, Maximilian Sprang, Gerrit Großmann, Sebastian Vollmer, and David Antony Selby. BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation
-
[25]
Segment anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023
2023
-
[26]
The division of cognitive labor.The journal of philosophy, 87(1):5–22, 1990
Philip Kitcher. The division of cognitive labor.The journal of philosophy, 87(1):5–22, 1990
1990
-
[27]
Dual space search during scientific reasoning.Cognitive science, 12(1):1–48, 1988
David Klahr and Kevin Dunbar. Dual space search during scientific reasoning.Cognitive science, 12(1):1–48, 1988
1988
-
[28]
Proptest: Automatic property testing for improved visual programming
Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, and Vicente Ordonez. Proptest: Automatic property testing for improved visual programming. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 8241–8256, 2024
2024
-
[29]
Set transformer: A framework for attention-based permutation-invariant neural networks
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. InInternational conference on machine learning, pages 3744–3753. PMLR, 2019
2019
-
[30]
Alphago moment for model architecture discovery.arXiv preprint arXiv:2507.18074, 2025
Yixiu Liu, Yang Nan, Weixian Xu, Xiangkun Hu, Lyumanshan Ye, Zhen Qin, and Pengfei Liu. Alphago moment for model architecture discovery.arXiv preprint arXiv:2507.18074, 2025
-
[31]
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, October 2021. 13 Agentic Discovery with Active Hypothesis Exploration for Visual Recognition
2021
-
[32]
Object-Centric Learning with Slot Attention
Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-Centric Learning with Slot Attention. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 11525–11538. Curran A...
2020
-
[33]
Medvit: a robust vision transformer for generalized medical image classification.Computers in biology and medicine, 157:106791, 2023
Omid Nejati Manzari, Hamid Ahmadabadi, Hossein Kashiani, Shahriar B Shokouhi, and Ahmad Ayatollahi. Medvit: a robust vision transformer for generalized medical image classification.Computers in biology and medicine, 157:106791, 2023
2023
-
[34]
Medical image classification with kan-integrated transformers and dilated neighborhood attention.Applied Soft Computing, page 114045, 2025
Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, and Hassan Rivaz. Medical image classification with kan-integrated transformers and dilated neighborhood attention.Applied Soft Computing, page 114045, 2025
2025
-
[35]
Exploration and exploitation in organizational learning.Organization science, 2(1):71–87, 1991
James G March. Exploration and exploitation in organizational learning.Organization science, 2(1):71–87, 1991
1991
-
[36]
Mednns: supernet-based medical task-adaptive neural network search
Lotfi Abdelkrim Mecharbat, Ibrahim Almakky, Martin Takac, and Mohammad Yaqub. Mednns: supernet-based medical task-adaptive neural network search. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 448–458. Springer, 2025
2025
-
[37]
Landsness, Daniel L
Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagora...
2025
-
[38]
Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...
2025
-
[39]
Introducing gpt-5 for developers, August 2025
OpenAI. Introducing gpt-5 for developers, August 2025. OpenAI product announcement, accessed March 5, 2026
2025
-
[40]
Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister
Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory, 2025
2025
-
[41]
Film: Visual reasoning with a general conditioning layer
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018
2018
-
[42]
Strong inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others.science, 146(3642):347–353, 1964
John R Platt. Strong inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others.science, 146(3642):347–353, 1964
1964
-
[43]
Nqnn: Noise-aware quantum neural networks for medical image classification
Maqsudur Rahman and Jun Zhuang. Nqnn: Noise-aware quantum neural networks for medical image classification. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 433–442. Springer, 2025
2025
-
[44]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment Anything in Images and Videos. InThe Thirteenth Internation...
2025
-
[45]
A comprehensive survey of neural architecture search: Challenges and solutions.ACM Computing Surveys (CSUR), 54(4):1–34, 2021
Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. A comprehensive survey of neural architecture search: Challenges and solutions.ACM Computing Surveys (CSUR), 54(4):1–34, 2021
2021
-
[46]
Active learning literature survey
Burr Settles. Active learning literature survey. 2009
2009
-
[47]
Towards execution-grounded automated ai research, 2026
Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, and Tatsunori Hashimoto. Towards execution-grounded automated ai research.arXiv preprint arXiv:2601.14525, 2026
-
[48]
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory, 2025
Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory, 2025
2025
-
[49]
An Image Patch Is a Wave: Phase-Aware Vision MLP
Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, and Yunhe Wang. An Image Patch Is a Wave: Phase-Aware Vision MLP. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10935–10944, June 2022
2022
-
[50]
FAIRCodeGenteam,JadeCopet,QuentinCarbonneaux,GalCohen,JonasGehring,JacobKahn,JannikKossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabian Gloeckle, Al...
2025
-
[51]
HypER: Literature-grounded hypothesis generation and distillation with provenance
Rosni Vasu, Chandrayee Basu, Bhavana Dalvi Mishra, Cristina Sarasua, Peter Clark, and Abraham Bern- stein. HypER: Literature-grounded hypothesis generation and distillation with provenance. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Con- ference on Empirical Methods in Natural Languag...
2025
-
[52]
Association for Computational Linguistics
-
[53]
On the failure to eliminate hypotheses in a conceptual task.Quarterly journal of experimental psychology, 12(3):129–140, 1960
Peter C Wason. On the failure to eliminate hypotheses in a conceptual task.Quarterly journal of experimental psychology, 12(3):129–140, 1960
1960
-
[54]
Neural Predictor for Neural Architecture Search
Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural Predictor for Neural Architecture Search. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 660–676, Cham, 2020. Springer International Publishing
2020
-
[55]
Group- Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026
Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group- Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026
2026
-
[56]
• {parent_architecture}: parent node's full brainstorming output (title, description, intuition, novelty, architecture spec) + performance metrics (accuracy, training time, novelty score)
• {feedback_summary}: concatenated outputs from all feedback agents, including per-agent reasoning, actionable recommendations, hypothesis updates with current confidence from ℳ_t, and newly proposed hypotheses
• {hypothesis_memory}: compiled hypothesis context from ℳ_t, grouped into confirmed (c > 0.75), refuted (c < 0.25), and uncertain patterns, each with full evidence log and agent attribution
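The confidence-based grouping described for {hypothesis_memory} can be sketched as follows. The thresholds (c > 0.75 confirmed, c < 0.25 refuted) come from the text above; the record layout and function name are illustrative assumptions, not the paper's code:

```python
# Sketch of compiling the {hypothesis_memory} context from a memory bank M_t.
# Thresholds (0.75 / 0.25) follow the text; the record fields are assumed.

def compile_hypothesis_context(memory_bank):
    """Group hypotheses into confirmed / refuted / uncertain by confidence c."""
    groups = {"confirmed": [], "refuted": [], "uncertain": []}
    for hyp in memory_bank:
        c = hyp["confidence"]
        if c > 0.75:
            bucket = "confirmed"
        elif c < 0.25:
            bucket = "refuted"
        else:
            bucket = "uncertain"
        # Each entry keeps its full evidence log and agent attribution.
        groups[bucket].append({
            "id": hyp["id"],
            "text": hyp["text"],
            "confidence": c,
            "evidence_log": hyp.get("evidence_log", []),
            "agents": hyp.get("agents", []),
        })
    return groups

bank = [
    {"id": "hyp_1", "text": "sparse routing helps", "confidence": 0.8},
    {"id": "hyp_2", "text": "wider stems hurt", "confidence": 0.1},
    {"id": "hyp_3", "text": "GSTs improve shape accuracy", "confidence": 0.55},
]
ctx = compile_hypothesis_context(bank)
print([h["id"] for h in ctx["confirmed"]])   # ['hyp_1']
print([h["id"] for h in ctx["uncertain"]])   # ['hyp_3']
```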
Selected hypothesis ℎ⋆ to test.
Output: a single evolved architecture containing:
• Structured reasoning trace: parent_analysis (what worked), failure_analysis (what failed), hypothesis_usage (which hypotheses guide design), proposed_changes (targeted modifications)
• Architecture fields: title, description, architecture_spec, etc.
• existing_hypotheses: referenced h...
Tags: global-routing, sparse-routing, shape-aggregation. Initial confidence: 0.55. Connected: hyp_44, hyp_27.
IF a small number (G ≤ 4) of stage-local Global Shape Tokens (GSTs) are added that aggregate low-frequency token summaries via sparse top-k routing and broadcast compact shape-aware updates back to tokens IN NPIN-Guard-style particle+slot backbones on CIFAR-scale tasks, THEN overall shape-sensitive accuracy will increase and texture-driven confusions will dec...
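Hypothesis entries like the one above share a fixed IF/IN/THEN/BECAUSE structure plus metadata (tags, an initial confidence, and links to connected hypotheses). A minimal sketch of such a record, with field names inferred from the entry text and the class itself an illustrative assumption:

```python
# Minimal record for a structured hypothesis as shown above.
# Field names mirror the entry text; this class is an assumption, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    hyp_id: str
    condition: str      # IF ...  (the proposed design change)
    scope: str          # IN ...  (where it applies)
    prediction: str     # THEN ... (expected measurable effect)
    rationale: str      # BECAUSE ... (mechanistic justification)
    tags: list = field(default_factory=list)
    confidence: float = 0.5
    connected: list = field(default_factory=list)

h = Hypothesis(
    hyp_id="hyp_3",
    condition="add stage-local Global Shape Tokens with sparse top-k routing",
    scope="particle+slot backbones on CIFAR-scale tasks",
    prediction="shape-sensitive accuracy up, texture-driven confusions down",
    rationale="compact global summaries inject shape context into local tokens",
    tags=["global-routing", "sparse-routing", "shape-aggregation"],
    confidence=0.55,
    connected=["hyp_44", "hyp_27"],
)
print(h.tags[0])  # global-routing
```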
IF super-particle anchor messages are L2-normalized per-band before FiLM blending IN multi-band NPIN-style backbones, THEN the incidence of anchor-driven prototype-takeover misclassifications will decrease and shape-sensitive per-class recall will improve, BECAUSE bounding anchor magnitudes prevents high-frequency anchors from numerically overwhelming token iden...
IF token-to-hub routing is implemented as a temperature-scaled soft-assignment combined with a small learnable residual blend weight (α) in the token update IN mid/high network stages that perform global mixing, THEN the model will reduce high-confidence wrong predictions and improve top-1 accuracy without harming top-5, BECAUSE soft routing preserves per-token uncer...
IF small per-scale, per-channel-group additive phase-offset parameters are added to the local depthwise separable conv sampling locations IN early/mid stages of the hierarchical backbone, THEN shift-equivariant failure modes will be reduced and shape-sensitive class accuracy will improve, BECAUSE tiny phase offsets break exact translation symmetry in filter ...
Tags: band-normalization, FiLM, multi-scale. Initial confidence: 0.6. Connected: hyp_18, hyp_13.
IF per-band usage-normalized FiLM controllers replace naive/shared FiLM controllers IN multi-scale wavelet-style token-mixing stages for CIFAR-scale hierarchical backbones, THEN low-pass (shape) channels will retain sufficient forward activation mass and controller responsiveness so that shape-sensitive per-class accuracy improves while avoiding high-frequency proto...
IF gated cross-band residual connectors (low-pass → band-pass multiplicative gates with tiny gating MLPs) are added to band-pass processors IN WTM stages, THEN the model will reduce texture-triggered false positives and improve discrimination for visually-similar classes, BECAUSE low-pass summaries provide spatially-coherent priors that multiplicatively suppress ...
Deduplicate Hypothesis Updates. If multiple agents update the SAME hypothesis: combine into ONE update with synthesized reasoning. Synthesize strength by weighting how direct each agent's evidence is. If agents disagree on evidence type, explain the disagreement and choose the most justified type. Always cite contributing agents. Misattributed Updates Chec...
Deduplicate New Hypothesis Candidates. Merge overlapping proposals from different agents. Check against existing hypotheses: if already captured, propose an UPDATE instead. Maximum 2 new hypotheses per node.
Contradiction Check (Critical). Before creating any new hypothesis, check if it proposes the OPPOSITE outcome of an existing hypothesis for the SAME mec...
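The deduplication rule above (merge same-hypothesis updates into one, weight strength by how direct each agent's evidence is, always cite contributing agents) might be realized as follows; the directness weights and field names are illustrative assumptions:

```python
# Sketch of merging multiple agent updates to the SAME hypothesis into one.
# The directness weighting scheme and field names are assumptions.
from collections import defaultdict

DIRECTNESS = {"direct": 1.0, "indirect": 0.5, "circumstantial": 0.25}

def deduplicate_updates(updates):
    """Combine per-agent updates into one update per hypothesis id."""
    by_hyp = defaultdict(list)
    for u in updates:
        by_hyp[u["hyp_id"]].append(u)

    merged = []
    for hyp_id, group in by_hyp.items():
        weights = [DIRECTNESS[u["evidence_type"]] for u in group]
        # Strength is a directness-weighted average of the agents' strengths.
        strength = sum(w * u["strength"] for w, u in zip(weights, group)) / sum(weights)
        merged.append({
            "hyp_id": hyp_id,
            "strength": round(strength, 3),
            "agents": sorted({u["agent"] for u in group}),  # always cite contributors
        })
    return merged

updates = [
    {"hyp_id": "hyp_7", "agent": "diagnostic", "evidence_type": "direct", "strength": 0.8},
    {"hyp_id": "hyp_7", "agent": "novelty", "evidence_type": "indirect", "strength": 0.4},
]
print(deduplicate_updates(updates))
# [{'hyp_id': 'hyp_7', 'strength': 0.667, 'agents': ['diagnostic', 'novelty']}]
```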
Quality Requirements. Every new hypothesis must satisfy all 7 quality dimensions [see Section B.5].
Implementation Notes. Pass through from diagnostic agent without modification.
Output Format (JSON):
• "hypothesis_updates": [{"hyp_id", "evidence_type", "strength", "reasoning"}]
• "new_hypotheses": [{"text", "scope", "prediction", "falsification_criteria", "tags", "initial_confidence", "reasoning", "connected_hypotheses"}]
• "implementation_notes": [{"hyp...
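The JSON output format listed above can be checked with a small validator. The required keys and the two-new-hypotheses cap come from the text; the validation logic itself is a sketch, not the paper's implementation:

```python
# Sketch: validate consolidated-feedback JSON against the keys listed above.
import json

REQUIRED_UPDATE_KEYS = {"hyp_id", "evidence_type", "strength", "reasoning"}
REQUIRED_NEW_KEYS = {"text", "scope", "prediction", "falsification_criteria",
                     "tags", "initial_confidence", "reasoning", "connected_hypotheses"}

def validate_feedback(raw):
    doc = json.loads(raw)
    for upd in doc.get("hypothesis_updates", []):
        missing = REQUIRED_UPDATE_KEYS - upd.keys()
        if missing:
            raise ValueError(f"update missing keys: {sorted(missing)}")
    # At most 2 new hypotheses per node, per the deduplication rules.
    new = doc.get("new_hypotheses", [])
    if len(new) > 2:
        raise ValueError("at most 2 new hypotheses per node")
    for hyp in new:
        missing = REQUIRED_NEW_KEYS - hyp.keys()
        if missing:
            raise ValueError(f"new hypothesis missing keys: {sorted(missing)}")
    return doc

raw = json.dumps({
    "hypothesis_updates": [{"hyp_id": "hyp_7", "evidence_type": "direct",
                            "strength": 0.6, "reasoning": "accuracy rose"}],
    "new_hypotheses": [],
    "implementation_notes": [],
})
doc = validate_feedback(raw)
print(len(doc["hypothesis_updates"]))  # 1
```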