CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
Pith reviewed 2026-05-21 09:58 UTC · model grok-4.3
The pith
CORAL replaces fixed heuristics in LLM evolution with autonomous multi-agent collaboration through shared persistent memory to accelerate open-ended discovery.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CORAL is the first framework for autonomous multi-agent evolution on open-ended problems. It replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. Practical safeguards include isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles.
What carries the argument
Long-running autonomous LLM agents that collaborate via shared persistent memory, asynchronous execution, and heartbeat-based interventions within the CORAL framework.
If this is right
- Knowledge reuse across agents supports sustained progress where single-step heuristics fall short.
- Asynchronous collaboration and reflection increase effective exploration depth on complex problems.
- Resource and health management features allow reliable operation of agent teams over many cycles.
- The same autonomy pattern improves results across mathematical, algorithmic, and systems optimization domains.
Where Pith is reading between the lines
- Persistent shared memory may be the component most responsible for scaling evolutionary search beyond current step limits.
- The safeguard design could transfer to other multi-agent systems where long runs risk instability or resource waste.
- Single-agent versions equipped with similar memory and reflection tools might capture part of the benefit without full multi-agent overhead.
Load-bearing premise
The performance gains arise primarily from the autonomy mechanisms and multi-agent features rather than from task-specific tuning or implementation details that differ from the baselines.
What would settle it
A controlled test on a new open-ended discovery task, using identical evaluation budgets and protocols for both CORAL and fixed-heuristic baselines, that shows no higher improvement rate for the autonomous version would challenge the central claim.
Original abstract
Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CORAL, a framework for autonomous multi-agent LLM-based evolution on open-ended discovery tasks. It replaces fixed heuristics with long-running agents that explore, reflect, and collaborate via shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions, along with practical safeguards such as isolated workspaces and resource management. The central empirical claim is that CORAL achieves new state-of-the-art results on 10 diverse mathematical, algorithmic, and systems optimization tasks, with 3-10 times higher improvement rates and far fewer evaluations than fixed evolutionary search baselines; on Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses attribute gains to knowledge reuse and multi-agent exploration/communication. Code is released at https://github.com/Human-Agent-Society/CORAL.
Significance. If the performance claims and attribution to autonomy mechanisms hold under controlled conditions, the work would demonstrate that greater agent autonomy and multi-agent collaboration can substantially advance LLM-based open-ended discovery, moving beyond rigid control structures. The release of code supports reproducibility and is a clear strength for follow-on research in multi-agent systems and evolutionary optimization.
major comments (3)
- [§5] §5 (Experimental Evaluation), Table 2 and associated text: the central claim that gains arise from autonomy mechanisms (long-running agents, shared memory, asynchronous collaboration) rather than implementation details requires explicit verification that fixed baselines match CORAL on total LLM calls, prompting templates, memory usage, and evaluation protocol; without these controls or an ablation isolating each autonomy component, the 3-10x improvement rates cannot be confidently attributed to the proposed framework.
- [§4.3] §4.3 (Heartbeat-based Interventions) and §5.3 (Mechanistic Analyses): the heartbeat intervention frequency is listed as a free parameter, yet the paper does not report sensitivity analysis or default values across the 10 tasks; if performance is sensitive to this choice, it undermines the claim that the framework reduces reliance on hand-crafted rules.
- [§5.1] §5.1 (Task Descriptions) and §6 (Discussion): the evaluation tasks are presented as representative of open-ended discovery, but the paper does not address whether the observed gains generalize beyond the specific optimization objectives or whether task-specific tuning of agent prompts or memory schemas could explain the results; a clearer discussion of this scope limitation is needed to support the broader conclusions.
minor comments (2)
- [Figure 3] Figure 3 (multi-agent communication diagram): the arrows and labels for asynchronous execution and shared memory are difficult to follow at the current resolution; adding a legend or step-by-step annotation would improve clarity.
- [§3] §3 (Related Work): the comparison to prior multi-agent LLM systems could include a brief table summarizing differences in autonomy features (e.g., persistent memory, heartbeat interventions) to help readers position CORAL.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and scope. We address each major comment below and have revised the manuscript accordingly to strengthen the claims and clarify limitations.
Point-by-point responses
- Referee: [§5] §5 (Experimental Evaluation), Table 2 and associated text: the central claim that gains arise from autonomy mechanisms (long-running agents, shared memory, asynchronous collaboration) rather than implementation details requires explicit verification that fixed baselines match CORAL on total LLM calls, prompting templates, memory usage, and evaluation protocol; without these controls or an ablation isolating each autonomy component, the 3-10x improvement rates cannot be confidently attributed to the proposed framework.
  Authors: We agree that explicit matching and ablations are needed for confident attribution. The original experiments were designed with comparable total LLM calls and evaluation budgets across methods, but we acknowledge this was not stated with sufficient detail. In the revised manuscript, we have added a dedicated subsection in §5 that explicitly verifies matching on total LLM calls, prompting templates, memory usage, and evaluation protocol. We have also included a new ablation study that isolates the contributions of long-running agents, shared persistent memory, and asynchronous multi-agent execution. These changes directly address the concern and allow readers to better attribute the observed 3-10x gains to the autonomy mechanisms.
  Revision: yes
- Referee: [§4.3] §4.3 (Heartbeat-based Interventions) and §5.3 (Mechanistic Analyses): the heartbeat intervention frequency is listed as a free parameter, yet the paper does not report sensitivity analysis or default values across the 10 tasks; if performance is sensitive to this choice, it undermines the claim that the framework reduces reliance on hand-crafted rules.
  Authors: We accept this observation. The heartbeat frequency was treated as a tunable parameter with a default value chosen for stability, but sensitivity was not reported. In the revised version, we have added a sensitivity analysis in §4.3 and §5.3 that evaluates performance across a range of frequencies on representative tasks from the 10-task suite. We also now explicitly list the default values used per task. The analysis shows robustness within a practical range, which supports rather than undermines the reduced reliance on hand-crafted rules; we have updated the text to reflect this.
  Revision: yes
- Referee: [§5.1] §5.1 (Task Descriptions) and §6 (Discussion): the evaluation tasks are presented as representative of open-ended discovery, but the paper does not address whether the observed gains generalize beyond the specific optimization objectives or whether task-specific tuning of agent prompts or memory schemas could explain the results; a clearer discussion of this scope limitation is needed to support the broader conclusions.
  Authors: We agree that a more explicit discussion of scope and potential task-specific effects is warranted. While the 10 tasks span mathematical, algorithmic, and systems domains, we have expanded §6 with a new paragraph on scope limitations. This addition acknowledges that generalization beyond the tested optimization objectives requires further study and clarifies that agent prompts and memory schemas were kept as general as possible with only minimal task-specific adjustments. These revisions provide a balanced view and temper the broader conclusions appropriately.
  Revision: yes
Circularity Check
No significant circularity in empirical framework and results
Full rationale
The paper presents CORAL as an empirical multi-agent framework evaluated on external tasks, reporting performance gains against fixed baselines on independent benchmarks. No mathematical derivations, equations, or first-principles predictions are claimed that reduce by construction to self-defined inputs, fitted parameters, or self-citation chains within the paper. Results are framed as observed improvements from autonomy mechanisms on tasks like kernel engineering, with no evidence of renaming known results or smuggling ansatzes via internal citations.
Axiom & Free-Parameter Ledger
free parameters (1)
- heartbeat intervention frequency
axioms (1)
- domain assumption LLM agents can meaningfully reflect and collaborate via natural language in shared memory
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Shared Persistent Memory as File System... attempts/, notes/, skills/ ... Heartbeat: Reflection, Consolidation and Redirection.
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Table 1: ... Improvement Rate ... # Evals ... CORAL’s autonomous evolution significantly outperforms fixed evolutionary search
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 11 Pith papers
- FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
  FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
- Harnessing Agentic Evolution
  AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
- TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
  TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...
- Evolutionary Ensemble of Agents
  EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.
- DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
  DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.
- HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization
  HMACE deploys Proposer, Generator, Evaluator, and Reflector agents in an evolutionary loop to generate and refine heuristics for NP-hard problems, reporting lower optimality gaps and token costs than baselines on TSP ...
- CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
  CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
- Evaluation-driven Scaling for Scientific Discovery
  SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
- Evolutionary Ensemble of Agents
  EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
- Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery
  Prism unifies file, vector, graph, and evolutionary memory under a decision-theoretic framework with entropy-gated stratification, causal graphs, value-of-information retrieval, heartbeat consolidation, and replicator...
- AI for Auto-Research: Roadmap & User Guide
  The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
Reference graph
Works this paper leans on
-
[1]
Read the task description above carefully
-
[2]
Read the key files to understand the current state of the code
-
[3]
Check the leaderboard: coral log
-
[4]
Check recent activity: coral log --recent
-
[5]
Inspect top attempts: coral show <hash>
- [6]
-
[7]
Read notes: {shared dir}/notes/ for findings from other agents
-
[8]
Check available skills: ls {shared dir}/skills/
# Workflow
Your job is a loop: plan → edit → eval → repeat.
## 1. Plan
Review what worked (coral log), inspect top attempts (coral show), check notes and skills from other agents. Think creatively. Keep plans lightweight.
## 2. Edit
Make focused changes. One idea per eval. Bias toward speed.
## 3. Evaluate
...
-
[9]
Anchor in concrete results: Review your recent attempts (coral log -n 5 --recent). What specific changes led to score improvements or regressions? Example: “Attempt abc123 improved score from 0.72 to 0.78 by adding batch normalization after each conv layer.”
Preprint. Under review.
-
[10]
Examine surprises: What surprised you? What didn’t go as expected? Surprises reveal gaps in your mental model.
-
[11]
Analyze causes: For your most significant result (good or bad): why did it happen? What’s the underlying mechanism?
-
[12]
Assess confidence: How certain are you about your current approach? What evidence would change your mind?
-
[13]
Plan next experiment: Based on this reflection, what’s one specific thing to try next? What do you expect to happen? Save your note in the most appropriate location within {shared dir}/notes/ (e.g., notes/architecture/normalization/batch-vs-layer.md). If you’ve discovered a reusable technique, create a skill in {shared dir}/skills/.
Heartbeat Prompt: Cons...
-
[14]
Stage & commit: Run git add -A followed by git commit -m "msg" in the agent’s worktree
-
[15]
Load grader: Dynamically import class Grader from .coral/private/eval/grader.py (hidden from agents)
-
[16]
Grade: Spawn the grader in a child process with a hard timeout (configurable per task, default 300 s). The grader returns a ScoreBundle containing a numeric score and textual feedback.
-
[17]
Determine status: Compare the score against the agent’s previous best: improved if strictly better, baseline if equal, regressed if worse, crashed if the grader returned None, or timeout if the grader exceeded the time limit.
Table 6: Complete CLI reference. Commands are grouped by function. Agent-facing commands are available wit...
-
[18]
Record attempt: Write an Attempt JSON record to .coral/public/attempts/<hash>.json
-
[19]
Checkpoint: Snapshot the current shared persistent memory (notes, skills) with a hash for versioning
-
[20]
Increment counter: Update the global evaluation counter at .coral/public/eval count.
C.3 User Interface
CORAL includes a web-based dashboard for real-time monitoring, launched via coral ui or coral start -c task.yaml run.ui=true. The dashboard is built as a React single-page application served by a Starlette (async Python) backend. Example screenshots a...
-
[21]
Create the project directory structure (clone repo, set up .coral/)
-
[22]
Seed heartbeat configurations (global and per-agent defaults)
-
[23]
For each agent: create worktree, install symlinks, write .coral agent id breadcrumb, generate CORAL.md, and spawn the agent runtime process
-
[24]
Enter the monitoring loop: detect new attempts, check heartbeat triggers, deliver heartbeat prompts, restart dead agents, handle graceful shutdown (SIGINT → SIGTERM → SIGKILL).
Session persistence. Agent session IDs are extracted from runtime log files and saved to .coral/public/sessions.json during shutdown. On coral resume, the manager validates saved...
-
[25]
runtime. No internet access is provided to agents unless the task configuration explicitly enables the research flag.
Baselines. We compare against three fixed evolutionary search baselines:
• OpenEvolve (Sharma, 2025): Open-source implementation of AlphaEvolve with static elite populations and diversity maintenance.
• ShinkaEvolve (Lange et al., 2025): Adaptiv...
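The evaluation steps quoted in entries [15] to [18] above (grade in a child process under a hard timeout, classify the result against the previous best, write an attempt record) can be illustrated with a minimal Python sketch. This is not CORAL's code: the function names `grade`, `determine_status`, and `record_attempt`, the `higher_is_better` flag, the JSON field names, and the choice to model the grader as a script that prints one number are all assumptions made for illustration. The paper's grader returns a ScoreBundle with textual feedback, which is omitted here.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def grade(grader_code: str, timeout_s: float = 300.0) -> Tuple[Optional[float], bool]:
    """Run a stand-in grader in a child process; return (score, timed_out).

    Assumption: the child prints a single number on stdout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", grader_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # hard limit; the child is killed on expiry
        )
    except subprocess.TimeoutExpired:
        return None, True
    if proc.returncode != 0:
        return None, False  # grader failed: no score
    try:
        return float(proc.stdout.strip()), False
    except ValueError:
        return None, False  # nothing parseable: treated like a None result


def determine_status(
    score: Optional[float],
    previous_best: Optional[float],
    timed_out: bool,
    higher_is_better: bool = True,  # assumption: direction is set per task
) -> str:
    """Map a grading outcome to the five statuses named in the text."""
    if timed_out:
        return "timeout"
    if score is None:
        return "crashed"
    if previous_best is None:
        return "improved"  # assumption: a first scored attempt counts as improved
    if score == previous_best:
        return "baseline"
    better = score > previous_best if higher_is_better else score < previous_best
    return "improved" if better else "regressed"


def record_attempt(public_dir: Path, attempt_hash: str,
                   score: Optional[float], status: str) -> Path:
    """Write a small JSON record to <public_dir>/attempts/<hash>.json."""
    attempts = public_dir / "attempts"
    attempts.mkdir(parents=True, exist_ok=True)
    path = attempts / f"{attempt_hash}.json"
    record = {"hash": attempt_hash, "score": score, "status": status}  # hypothetical fields
    path.write_text(json.dumps(record))
    return path


if __name__ == "__main__":
    score, timed_out = grade("print(0.78)", timeout_s=30)
    status = determine_status(score, previous_best=0.72, timed_out=timed_out)
    with tempfile.TemporaryDirectory() as tmp:
        path = record_attempt(Path(tmp), "abc123", score, status)
        print(path.name, json.loads(path.read_text()))
```

Running the grader through `subprocess.run` with `timeout` keeps a hung or crashing grader from taking down the caller, which is the property the quoted "hard timeout" step appears to be after. The `higher_is_better` flag is included because the kernel task reported above is scored in cycles, where a lower number is the better result.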