DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.
MARS: Modular Agent with Reflective Search for Automated AI Research
7 Pith papers cite this work. Polarity classification is still indexing.
years
2026 7representative citing papers
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
HypoExplore uses LLMs for hypothesis-driven evolutionary search with a Trajectory Tree and Hypothesis Memory Bank to discover lightweight vision architectures, reaching 94.11% accuracy on CIFAR-10 from an 18.91% baseline and generalizing to other datasets including state-of-the-art on MedMNIST.
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
citing papers explorer
-
DrugSAGE:Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
AIBuildAI: An AI Agent for Automatically Building AI Models
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
-
Agentic Discovery with Active Hypothesis Exploration for Visual Recognition
HypoExplore uses LLMs for hypothesis-driven evolutionary search with a Trajectory Tree and Hypothesis Memory Bank to discover lightweight vision architectures, reaching 94.11% accuracy on CIFAR-10 from an 18.91% baseline and generalizing to other datasets including state-of-the-art on MedMNIST.
-
AIRA_2: Overcoming Bottlenecks in AI Research Agents
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
- FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
- Toward Autonomous Long-Horizon Engineering for ML Research