arxiv: 2604.01658 · v2 · pith:MJAQRNC4new · submitted 2026-04-02 · 💻 cs.AI

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Pith reviewed 2026-05-21 09:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords autonomous agents, multi-agent systems, LLM-based evolution, open-ended discovery, persistent memory, evolutionary search, knowledge accumulation, asynchronous collaboration

The pith

CORAL replaces fixed heuristics in LLM evolution with autonomous multi-agent collaboration through shared persistent memory to accelerate open-ended discovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CORAL as a framework that increases autonomy for LLM agents working on open-ended evolutionary search problems. Agents run for extended periods, explore options, reflect on their findings, and collaborate by writing to and reading from a shared persistent memory while executing tasks asynchronously. Heartbeat signals provide a way to intervene without halting the process, and the design adds safeguards like isolated workspaces and resource controls to keep operations stable. On a range of mathematical, algorithmic, and systems optimization tasks, this setup produces higher rates of improvement than fixed-rule baselines while requiring fewer evaluations. The gains are traced to better knowledge reuse and multi-agent communication patterns.

Core claim

CORAL is the first framework for autonomous multi-agent evolution on open-ended problems. It replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. Practical safeguards include isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles.

What carries the argument

Long-running autonomous LLM agents that collaborate via shared persistent memory, asynchronous execution, and heartbeat-based interventions within the CORAL framework.
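
The agent prompt excerpts quoted further down this page mention a notes/ directory under a shared directory where agents leave findings for each other. A minimal sketch of that pattern follows, assuming plain Markdown files on disk; the function names `write_note` and `read_notes`, the line format, and the agent IDs are hypothetical and are not taken from the CORAL codebase.

```python
from pathlib import Path
from typing import Dict
import tempfile


def write_note(shared_dir: Path, topic: str, agent_id: str, text: str) -> Path:
    """Append one agent's finding to a topic file under <shared_dir>/notes/.

    Hypothetical helper: the on-disk format is an assumption for illustration.
    """
    note_path = shared_dir / "notes" / f"{topic}.md"
    note_path.parent.mkdir(parents=True, exist_ok=True)
    with note_path.open("a", encoding="utf-8") as fh:
        fh.write(f"- [{agent_id}] {text}\n")
    return note_path


def read_notes(shared_dir: Path) -> Dict[str, str]:
    """Return every note, keyed by its path relative to notes/."""
    notes_root = shared_dir / "notes"
    return {
        p.relative_to(notes_root).as_posix(): p.read_text(encoding="utf-8")
        for p in sorted(notes_root.rglob("*.md"))
    }


def demo() -> Dict[str, str]:
    """Two agents write to the same topic; a third party reads everything back."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        write_note(shared, "normalization/batch-vs-layer", "agent-1", "batch norm helped")
        write_note(shared, "normalization/batch-vs-layer", "agent-2", "layer norm was slower")
        return read_notes(shared)
```

This sketch ignores concurrent writers. Agents running asynchronously against the same file would need locking or per-agent files, which the sketch does not attempt.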

If this is right

  • Knowledge reuse across agents supports sustained progress where single-step heuristics fall short.
  • Asynchronous collaboration and reflection increase effective exploration depth on complex problems.
  • Resource and health management features allow reliable operation of agent teams over many cycles.
  • The same autonomy pattern improves results across mathematical, algorithmic, and systems optimization domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Persistent shared memory may be the component most responsible for scaling evolutionary search beyond current step limits.
  • The safeguard design could transfer to other multi-agent systems where long runs risk instability or resource waste.
  • Single-agent versions equipped with similar memory and reflection tools might capture part of the benefit without full multi-agent overhead.

Load-bearing premise

The performance gains arise primarily from the autonomy mechanisms and multi-agent features rather than from task-specific tuning or implementation details that differ from the baselines.

What would settle it

A controlled test on a new open-ended discovery task, using identical evaluation budgets and protocols for both CORAL and fixed-heuristic baselines, that shows no higher improvement rate for the autonomous version would challenge the central claim.
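
One way such an equal-budget comparison could be scored is sketched below. It assumes that "improvement rate" means the fraction of evaluations that strictly beat the best score seen so far; this page does not give the paper's exact definition, so treat that as an assumption. The function names and the score lists are toy illustrations, not results from the paper.

```python
from typing import List, Sequence, Tuple


def improvement_rate(scores: Sequence[float], higher_is_better: bool = True) -> float:
    """Fraction of evaluations after the first that strictly beat the running best.

    Assumed definition for illustration only.
    """
    if len(scores) < 2:
        return 0.0
    best = scores[0]
    improvements = 0
    for s in scores[1:]:
        better = s > best if higher_is_better else s < best
        if better:
            improvements += 1
            best = s
    return improvements / (len(scores) - 1)


def compare_at_budget(
    run_a: Sequence[float],
    run_b: Sequence[float],
    budget: int,
    higher_is_better: bool = True,
) -> Tuple[float, float]:
    """Truncate both runs to the same number of evaluations, then compare rates."""
    a: List[float] = list(run_a[:budget])
    b: List[float] = list(run_b[:budget])
    return (
        improvement_rate(a, higher_is_better),
        improvement_rate(b, higher_is_better),
    )


# Toy score traces (lower is better), invented purely to exercise the functions.
toy_a = [10, 9, 9, 8, 7, 6]
toy_b = [10, 11, 10, 9, 9, 8]
rates = compare_at_budget(toy_a, toy_b, budget=5, higher_is_better=False)
```

Truncating both traces to the same budget before scoring keeps a longer run from looking better simply because it was evaluated more often.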

Original abstract

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CORAL, a framework for autonomous multi-agent LLM-based evolution on open-ended discovery tasks. It replaces fixed heuristics with long-running agents that explore, reflect, and collaborate via shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions, along with practical safeguards such as isolated workspaces and resource management. The central empirical claim is that CORAL achieves new state-of-the-art results on 10 diverse mathematical, algorithmic, and systems optimization tasks, with 3-10 times higher improvement rates and far fewer evaluations than fixed evolutionary search baselines; on Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses attribute gains to knowledge reuse and multi-agent exploration/communication. Code is released at https://github.com/Human-Agent-Society/CORAL.

Significance. If the performance claims and attribution to autonomy mechanisms hold under controlled conditions, the work would demonstrate that greater agent autonomy and multi-agent collaboration can substantially advance LLM-based open-ended discovery, moving beyond rigid control structures. The release of code supports reproducibility and is a clear strength for follow-on research in multi-agent systems and evolutionary optimization.

major comments (3)
  1. [§5] §5 (Experimental Evaluation), Table 2 and associated text: the central claim that gains arise from autonomy mechanisms (long-running agents, shared memory, asynchronous collaboration) rather than implementation details requires explicit verification that fixed baselines match CORAL on total LLM calls, prompting templates, memory usage, and evaluation protocol; without these controls or an ablation isolating each autonomy component, the 3-10x improvement rates cannot be confidently attributed to the proposed framework.
  2. [§4.3] §4.3 (Heartbeat-based Interventions) and §5.3 (Mechanistic Analyses): the heartbeat intervention frequency is listed as a free parameter, yet the paper does not report sensitivity analysis or default values across the 10 tasks; if performance is sensitive to this choice, it undermines the claim that the framework reduces reliance on hand-crafted rules.
  3. [§5.1] §5.1 (Task Descriptions) and §6 (Discussion): the evaluation tasks are presented as representative of open-ended discovery, but the paper does not address whether the observed gains generalize beyond the specific optimization objectives or whether task-specific tuning of agent prompts or memory schemas could explain the results; a clearer discussion of this scope limitation is needed to support the broader conclusions.
minor comments (2)
  1. [Figure 3] Figure 3 (multi-agent communication diagram): the arrows and labels for asynchronous execution and shared memory are difficult to follow at the current resolution; adding a legend or step-by-step annotation would improve clarity.
  2. [§3] §3 (Related Work): the comparison to prior multi-agent LLM systems could include a brief table summarizing differences in autonomy features (e.g., persistent memory, heartbeat interventions) to help readers position CORAL.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and scope. We address each major comment below and have revised the manuscript accordingly to strengthen the claims and clarify limitations.

Point-by-point responses
  1. Referee: [§5] §5 (Experimental Evaluation), Table 2 and associated text: the central claim that gains arise from autonomy mechanisms (long-running agents, shared memory, asynchronous collaboration) rather than implementation details requires explicit verification that fixed baselines match CORAL on total LLM calls, prompting templates, memory usage, and evaluation protocol; without these controls or an ablation isolating each autonomy component, the 3-10x improvement rates cannot be confidently attributed to the proposed framework.

    Authors: We agree that explicit matching and ablations are needed for confident attribution. The original experiments were designed with comparable total LLM calls and evaluation budgets across methods, but we acknowledge this was not stated with sufficient detail. In the revised manuscript, we have added a dedicated subsection in §5 that explicitly verifies matching on total LLM calls, prompting templates, memory usage, and evaluation protocol. We have also included a new ablation study that isolates the contributions of long-running agents, shared persistent memory, and asynchronous multi-agent execution. These changes directly address the concern and allow readers to better attribute the observed 3-10x gains to the autonomy mechanisms. revision: yes

  2. Referee: [§4.3] §4.3 (Heartbeat-based Interventions) and §5.3 (Mechanistic Analyses): the heartbeat intervention frequency is listed as a free parameter, yet the paper does not report sensitivity analysis or default values across the 10 tasks; if performance is sensitive to this choice, it undermines the claim that the framework reduces reliance on hand-crafted rules.

    Authors: We accept this observation. The heartbeat frequency was treated as a tunable parameter with a default value chosen for stability, but sensitivity was not reported. In the revised version, we have added a sensitivity analysis in §4.3 and §5.3 that evaluates performance across a range of frequencies on representative tasks from the 10-task suite. We also now explicitly list the default values used per task. The analysis shows robustness within a practical range, which supports rather than undermines the reduced reliance on hand-crafted rules; we have updated the text to reflect this. revision: yes

  3. Referee: [§5.1] §5.1 (Task Descriptions) and §6 (Discussion): the evaluation tasks are presented as representative of open-ended discovery, but the paper does not address whether the observed gains generalize beyond the specific optimization objectives or whether task-specific tuning of agent prompts or memory schemas could explain the results; a clearer discussion of this scope limitation is needed to support the broader conclusions.

    Authors: We agree that a more explicit discussion of scope and potential task-specific effects is warranted. While the 10 tasks span mathematical, algorithmic, and systems domains, we have expanded §6 with a new paragraph on scope limitations. This addition acknowledges that generalization beyond the tested optimization objectives requires further study and clarifies that agent prompts and memory schemas were kept as general as possible with only minimal task-specific adjustments. These revisions provide a balanced view and temper the broader conclusions appropriately. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical framework and results

Full rationale

The paper presents CORAL as an empirical multi-agent framework evaluated on external tasks, reporting performance gains against fixed baselines on independent benchmarks. No mathematical derivations, equations, or first-principles predictions are claimed that reduce by construction to self-defined inputs, fitted parameters, or self-citation chains within the paper. Results are framed as observed improvements from autonomy mechanisms on tasks like kernel engineering, with no evidence of renaming known results or smuggling ansatzes via internal citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework depends on several untested design assumptions about agent behavior and task suitability rather than deriving performance from first principles.

free parameters (1)
  • heartbeat intervention frequency
    Design choice for balancing autonomy and safety not derived from data or theory.
axioms (1)
  • domain assumption: LLM agents can meaningfully reflect and collaborate via natural language in shared memory
    Invoked to justify the core collaboration mechanism.
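
To make the free parameter above concrete, here is a minimal sketch of a heartbeat trigger. It assumes the frequency is counted in evaluations and that a manager polls a global evaluation counter; both are assumptions, since this page does not say how CORAL measures heartbeat frequency. `HeartbeatTrigger` and `every_n_evals` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeartbeatTrigger:
    """Fires once every `every_n_evals` evaluations (hypothetical parameter)."""

    every_n_evals: int
    last_fired_at: int = 0

    def due(self, eval_count: int) -> bool:
        """Return True when enough new evaluations have accumulated since the last fire."""
        if eval_count - self.last_fired_at >= self.every_n_evals:
            self.last_fired_at = eval_count
            return True
        return False


def simulate(total_evals: int, every_n_evals: int) -> List[int]:
    """Return the evaluation counts at which a heartbeat prompt would be delivered."""
    trigger = HeartbeatTrigger(every_n_evals=every_n_evals)
    fired: List[int] = []
    for count in range(1, total_evals + 1):
        if trigger.due(count):
            fired.append(count)
    return fired


fired_at = simulate(total_evals=10, every_n_evals=3)
```

A sensitivity analysis of the kind the referee asks for would sweep `every_n_evals` over a range and report the final score at each setting.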

pith-pipeline@v0.9.0 · 5813 in / 1251 out tokens · 40422 ms · 2026-05-21T09:58:32.892269+00:00 · methodology


Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  2. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  3. TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...

  4. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 7.0

    EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.

  5. DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.

  6. HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    HMACE deploys Proposer, Generator, Evaluator, and Reflector agents in an evolutionary loop to generate and refine heuristics for NP-hard problems, reporting lower optimality gaps and token costs than baselines on TSP ...

  7. CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

    q-bio.NC 2026-04 unverdicted novelty 6.0

    CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

  8. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  9. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 5.0

    EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.

  10. Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery

    cs.AI 2026-04 unverdicted novelty 5.0

    Prism unifies file, vector, graph, and evolutionary memory under a decision-theoretic framework with entropy-gated stratification, causal graphs, value-of-information retrieval, heartbeat consolidation, and replicator...

  11. AI for Auto-Research: Roadmap & User Guide

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 10 Pith papers

  1. [1]

    Read the task description above carefully

  2. [2]

    Read the key files to understand the current state of the code

  3. [3]

    Check the leaderboard: coral log

  4. [4]

    Check recent activity: coral log --recent

  5. [5]

    Inspect top attempts: coral show <hash>

  6. [6]

    keywords

    Search for prior art: coral log --search "keywords"

  7. [7]

    Read notes: {shared dir}/notes/ for findings from other agents

  8. [8]

    what you changed and why

    Check available skills: ls {shared dir}/skills/ # Workflow Your job is a loop: plan → edit → eval → repeat. ## 1. Plan--- Review what worked (coral log), inspect top attempts (coral show), check notes and skills from other agents. Think creatively. Keep plans lightweight. ## 2. Edit--- Make focused changes. One idea per eval. Bias toward speed. ## 3. Evaluate--- ...

  9. [9]

    What specific changes led to score improvements or regressions? Example: “Attempt abc123 improved score from 0.72 to 0.78 by adding batch normalization after each conv

    Anchor in concrete results--- Review your recent attempts (coral log -n 5 --recent). What specific changes led to score improvements or regressions? Example: “Attempt abc123 improved score from 0.72 to 0.78 by adding batch normalization after each conv layer.”

  10. [10]

    Examine surprises--- What surprised you? What didn’t go as expected? Surprises reveal gaps in your mental model

  11. [11]

    Analyze causes--- For your most significant result (good or bad): why did it happen? What’s the underlying mechanism?

  12. [12]

    Assess confidence--- How certain are you about your current approach? What evidence would change your mind?

  13. [13]

    If you’ve discovered a reusable technique, create a skill in {shared dir}/skills/

    Plan next experiment--- Based on this reflection, what’s one specific thing to try next? What do you expect to happen? Save your note in the most appropriate location within {shared dir}/notes/ (e.g., notes/architecture/normalization/batch-vs-layer.md). If you’ve discovered a reusable technique, create a skill in {shared dir}/skills/. Heartbeat Prompt: Cons...

  14. [14]

    Stage & commit: Run git add -A followed by git commit -m "msg" in the agent’s worktree

  15. [15]

    Load grader: Dynamically import class Grader from .coral/private/eval/grader.py (hidden from agents)

  16. [16]

    The grader returns a ScoreBundle containing a numeric score and textual feedback

    Grade: Spawn the grader in a child process with a hard timeout (configurable per task, default 300 s). The grader returns a ScoreBundle containing a numeric score and textual feedback

  17. [17]

    Determine status: Compare the score against the agent’s previous best: improved if strictly better, baseline if equal, regressed if worse, crashed if the grader returned None, or timeout if the grader exceeded the time limit. Table 6: Complete CLI reference. Commands are grouped by function. Agent-facing commands are available wit...

  18. [18]

    Record attempt: Write an Attempt JSON record to .coral/public/attempts/<hash>.json

  19. [19]

    Checkpoint: Snapshot the current shared persistent memory (notes, skills) with a hash for versioning

  20. [20]

    commit hash

    Increment counter: Update the global evaluation counter at .coral/public/eval count. C.3 User Interface CORAL includes a web-based dashboard for real-time monitoring, launched via coral ui or coral start -c task.yaml run.ui=true. The dashboard is built as a React single-page application served by a Starlette (async Python) backend. Example screenshots a...

  21. [21]

    Create the project directory structure (clone repo, set up .coral/)

  22. [22]

    Seed heartbeat configurations (global and per-agent defaults)

  23. [23]

    For each agent: create worktree, install symlinks, write .coral agent id breadcrumb, generate CORAL.md, and spawn the agent runtime process

  24. [24]

    solution.py

    Enter the monitoring loop: detect new attempts, check heartbeat triggers, deliver heartbeat prompts, restart dead agents, handle graceful shutdown (SIGINT → SIGTERM → SIGKILL). Session persistence. Agent session IDs are extracted from runtime log files and saved to .coral/public/sessions.json during shutdown. On coral resume, the manager validates saved...

  25. [25]

    No internet access is provided to agents unless the task configuration explicitly enables the research flag

    runtime. No internet access is provided to agents unless the task configuration explicitly enables the research flag. Baselines. We compare against three fixed evolutionary search baselines: • OpenEvolve (Sharma, 2025): Open-source implementation of AlphaEvolve with static elite populations and diversity maintenance. • ShinkaEvolve (Lange et al., 2025): Adaptiv...
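
Entry 17 above quotes a five-way status rule for each attempt: improved, baseline, regressed, crashed, or timeout. A minimal sketch of that rule follows. The function name `determine_status`, the `timed_out` flag, the `higher_is_better` switch, and the handling of a missing previous best are all assumptions added for illustration; the quoted text only gives the comparison against the agent's previous best.

```python
from typing import Optional


def determine_status(
    score: Optional[float],
    previous_best: Optional[float],
    timed_out: bool = False,
    higher_is_better: bool = True,
) -> str:
    """Classify an attempt following the rule quoted in entry 17.

    `score` is None when the grader returned None (a crash).
    `timed_out` is a hypothetical flag set by whatever enforces the time limit.
    """
    if timed_out:
        return "timeout"
    if score is None:
        return "crashed"
    if previous_best is None:
        # Assumption: the first scored attempt counts as an improvement.
        return "improved"
    if score == previous_best:
        return "baseline"
    better = score > previous_best if higher_is_better else score < previous_best
    return "improved" if better else "regressed"
```

The `higher_is_better` switch matters for tasks scored in cycles, such as the kernel engineering task in the abstract, where a lower number is the better result.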