Camyla: Scaling Autonomous Research in Medical Image Segmentation

· 2026 · cs.AI · arXiv 2604.10696

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

We present Camyla, a system for fully autonomous research within the scientific domain of medical image segmentation. Camyla transforms raw datasets into literature-grounded research proposals, executable experiments, and complete manuscripts without human intervention. Autonomous experimentation over long horizons poses three interrelated challenges: search effort drifts toward unpromising directions, knowledge from earlier trials degrades as context accumulates, and recovery from failures collapses into repetitive incremental fixes. To address these challenges, the system combines three coupled mechanisms: Quality-Weighted Branch Exploration for allocating effort across competing proposals, Layered Reflective Memory for retaining and compressing cross-trial knowledge at multiple granularities, and Divergent Diagnostic Feedback for diversifying recovery after underperforming trials. The system is evaluated on CamylaBench, a contamination-free benchmark of 31 datasets constructed exclusively from 2025 publications, under a strict zero-intervention protocol across two independent runs within a total of 28 days on an 8-GPU cluster. Across the two runs, Camyla generates more than 2,700 novel model implementations and 40 complete manuscripts, and surpasses the strongest per-dataset baseline selected from 14 established architectures, including nnU-Net, on 22 and 18 of 31 datasets under identical training budgets, respectively (union: 24/31). Senior human reviewers score the generated manuscripts at the T1/T2 boundary of contemporary medical imaging journals. Relative to automated baselines, Camyla outperforms AutoML and NAS systems on aggregate segmentation performance and exceeds six open-ended research agents on both task completion and baseline-surpassing frequency. These results suggest that domain-scale autonomous research is achievable in medical image segmentation.

representative citing papers

One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution

cs.AI · 2026-06-30 · unverdicted · novelty 6.0

SAGE with MHFA improves failure recovery in autonomous research agents, raising metrics-bearing outputs from 42% to 92% on a 12-topic benchmark versus single-reflection baselines.

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

cs.AI · 2026-06-30 · unverdicted · novelty 6.0

HealthAgentBench is a new benchmark of 54 healthcare agent tasks where even the strongest frontier AI agent reaches only about 42% success rate on end-to-end clinical workflows.

citing papers explorer

Showing 2 of 2 citing papers.

One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution cs.AI · 2026-06-30 · unverdicted · none · ref 84 · internal anchor
SAGE with MHFA improves failure recovery in autonomous research agents, raising metrics-bearing outputs from 42% to 92% on a 12-topic benchmark versus single-reflection baselines.
HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents cs.AI · 2026-06-30 · unverdicted · none · ref 9 · internal anchor
HealthAgentBench is a new benchmark of 54 healthcare agent tasks where even the strongest frontier AI agent reaches only about 42% success rate on end-to-end clinical workflows.

Camyla: Scaling Autonomous Research in Medical Image Segmentation

fields

years

verdicts

representative citing papers

citing papers explorer