CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Ao Qu; Bryan Kian Hsiang Low; Cathy Wu; Chonghe Jiang; Fenglu Hong; Han Zheng; Jiacheng Zhu; Jinhua Zhao; Kaichen Zhou; Minwei Kong

arxiv: 2604.01658 · v2 · pith:MJAQRNC4new · submitted 2026-04-02 · 💻 cs.AI

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

Pith reviewed 2026-05-21 09:58 UTC · model grok-4.3

keywords: autonomous agents, multi-agent systems, LLM-based evolution, open-ended discovery, persistent memory, evolutionary search, knowledge accumulation, asynchronous collaboration

The pith

CORAL replaces fixed heuristics in LLM evolution with autonomous multi-agent collaboration through shared persistent memory to accelerate open-ended discovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CORAL as a framework that increases autonomy for LLM agents working on open-ended evolutionary search problems. Agents run for extended periods, explore options, reflect on their findings, and collaborate by writing to and reading from a shared persistent memory while executing tasks asynchronously. Heartbeat signals provide a way to intervene without halting the process, and the design adds safeguards like isolated workspaces and resource controls to keep operations stable. On a range of mathematical, algorithmic, and systems optimization tasks, this setup produces higher rates of improvement than fixed-rule baselines while requiring fewer evaluations. The gains are traced to better knowledge reuse and multi-agent communication patterns.

Core claim

CORAL is the first framework for autonomous multi-agent evolution on open-ended problems. It replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. Practical safeguards include isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles.

What carries the argument

Long-running autonomous LLM agents that collaborate via shared persistent memory, asynchronous execution, and heartbeat-based interventions within the CORAL framework.
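
To make the shared-memory idea concrete, here is a minimal sketch of agents exchanging findings through a plain directory of markdown notes. It only assumes what the page quotes from the paper (a shared directory with a `notes/` folder); the helper names `write_note` and `read_notes`, the author comment header, and the note contents are hypothetical and are not CORAL's actual API.

```python
from pathlib import Path
import tempfile


def write_note(shared_dir: Path, rel_path: str, agent_id: str, body: str) -> Path:
    """Save a markdown note under shared_dir/notes/, creating folders as needed."""
    note_path = shared_dir / "notes" / rel_path
    note_path.parent.mkdir(parents=True, exist_ok=True)
    # The author header is an illustrative convention, not something the paper specifies.
    note_path.write_text(f"<!-- author: {agent_id} -->\n{body}\n", encoding="utf-8")
    return note_path


def read_notes(shared_dir: Path) -> dict[str, str]:
    """Return every markdown note, keyed by its path relative to notes/."""
    notes_root = shared_dir / "notes"
    return {
        p.relative_to(notes_root).as_posix(): p.read_text(encoding="utf-8")
        for p in sorted(notes_root.rglob("*.md"))
    }


# Two hypothetical agents write to the same store; either can later read both notes.
shared = Path(tempfile.mkdtemp())
write_note(shared, "architecture/normalization/batch-vs-layer.md", "agent-1",
           "Batch normalization after each conv layer helped.")
write_note(shared, "search/step-size.md", "agent-2", "Smaller steps regressed.")
notes = read_notes(shared)
```

Because the store is just files, agents can write asynchronously without a coordinator, which is one plausible reason a file-system layout suits long-running, loosely coupled agents.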

If this is right

Knowledge reuse across agents supports sustained progress where single-step heuristics fall short.
Asynchronous collaboration and reflection increase effective exploration depth on complex problems.
Resource and health management features allow reliable operation of agent teams over many cycles.
The same autonomy pattern improves results across mathematical, algorithmic, and systems optimization domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Persistent shared memory may be the component most responsible for scaling evolutionary search beyond current step limits.
The safeguard design could transfer to other multi-agent systems where long runs risk instability or resource waste.
Single-agent versions equipped with similar memory and reflection tools might capture part of the benefit without full multi-agent overhead.

Load-bearing premise

The performance gains arise primarily from the autonomy mechanisms and multi-agent features rather than from task-specific tuning or implementation details that differ from the baselines.

What would settle it

A controlled test on a new open-ended discovery task, using identical evaluation budgets and protocols for both CORAL and fixed-heuristic baselines, that shows no higher improvement rate for the autonomous version would challenge the central claim.

Original abstract

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.
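
The kernel engineering figures can be turned into relative terms with a few lines of arithmetic. This is a worked calculation only, assuming that a lower cycle count is better and that both scores measure the same workload; the variable names are illustrative.

```python
# Scores quoted in the abstract (cycles; lower is assumed to be better).
baseline_cycles = 1363
coral_cycles = 1103

saved = baseline_cycles - coral_cycles   # absolute reduction in cycles
reduction = saved / baseline_cycles      # fraction of cycles removed
speedup = baseline_cycles / coral_cycles # equivalent speedup factor

print(f"{saved} cycles saved, {reduction:.1%} fewer cycles, {speedup:.3f}x speedup")
# prints "260 cycles saved, 19.1% fewer cycles, 1.236x speedup"
```

Note that this figure is separate from the 3-10x claim, which concerns improvement rates across tasks rather than the size of any single score change.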

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CORAL gets better numbers on optimization tasks with long-running agents and shared memory, but the gains could easily trace to unmatched prompting or implementation details rather than the autonomy features themselves.

The paper's core move is replacing fixed heuristics in LLM evolution with agents that keep running, share a persistent memory store, execute asynchronously, and receive heartbeat-based interventions when needed. That architecture plus the listed safeguards (isolated workspaces, separate evaluators, session management) is the actual novelty over the fixed-heuristic baselines cited in the abstract.

They release code, which lets others check the details, and they run mechanistic checks showing knowledge reuse and cross-agent communication contribute to the reported gains. On the ten tasks they cover, the abstract states 3-10x higher improvement rates with fewer evaluations, including a concrete improvement on the kernel engineering benchmark from 1363 to 1103 cycles (lower is better). Those are the parts worth paying attention to if you work on agent-based discovery pipelines.

The soft spot is the baseline comparison. The stress-test note is right to flag that the abstract does not confirm the fixed baselines received equivalent prompting quality, memory access, total LLM calls, or evaluation protocol. Without those controls documented, the performance edge cannot be confidently pinned on the long-running or multi-agent aspects rather than on unstated implementation differences. The tasks themselves are mostly closed optimization problems, so the leap to “open-ended discovery” rests on how well those benchmarks stand in for the harder case.

Readers who build multi-agent LLM systems will find the practical architecture and code useful even if the attribution needs tightening. The work shows clear thinking on the engineering side and honest empirical reporting, so it deserves a serious referee who can press on the baseline matching and task scope. I would send it to review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces CORAL, a framework for autonomous multi-agent LLM-based evolution on open-ended discovery tasks. It replaces fixed heuristics with long-running agents that explore, reflect, and collaborate via shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions, along with practical safeguards such as isolated workspaces and resource management. The central empirical claim is that CORAL achieves new state-of-the-art results on 10 diverse mathematical, algorithmic, and systems optimization tasks, with 3-10 times higher improvement rates and far fewer evaluations than fixed evolutionary search baselines; on Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses attribute gains to knowledge reuse and multi-agent exploration/communication. Code is released at https://github.com/Human-Agent-Society/CORAL.

Significance. If the performance claims and attribution to autonomy mechanisms hold under controlled conditions, the work would demonstrate that greater agent autonomy and multi-agent collaboration can substantially advance LLM-based open-ended discovery, moving beyond rigid control structures. The release of code supports reproducibility and is a clear strength for follow-on research in multi-agent systems and evolutionary optimization.

major comments (3)

[§5] §5 (Experimental Evaluation), Table 2 and associated text: the central claim that gains arise from autonomy mechanisms (long-running agents, shared memory, asynchronous collaboration) rather than implementation details requires explicit verification that fixed baselines match CORAL on total LLM calls, prompting templates, memory usage, and evaluation protocol; without these controls or an ablation isolating each autonomy component, the 3-10x improvement rates cannot be confidently attributed to the proposed framework.
[§4.3] §4.3 (Heartbeat-based Interventions) and §5.3 (Mechanistic Analyses): the heartbeat intervention frequency is listed as a free parameter, yet the paper does not report sensitivity analysis or default values across the 10 tasks; if performance is sensitive to this choice, it undermines the claim that the framework reduces reliance on hand-crafted rules.
[§5.1] §5.1 (Task Descriptions) and §6 (Discussion): the evaluation tasks are presented as representative of open-ended discovery, but the paper does not address whether the observed gains generalize beyond the specific optimization objectives or whether task-specific tuning of agent prompts or memory schemas could explain the results; a clearer discussion of this scope limitation is needed to support the broader conclusions.

minor comments (2)

[Figure 3] Figure 3 (multi-agent communication diagram): the arrows and labels for asynchronous execution and shared memory are difficult to follow at the current resolution; adding a legend or step-by-step annotation would improve clarity.
[§3] §3 (Related Work): the comparison to prior multi-agent LLM systems could include a brief table summarizing differences in autonomy features (e.g., persistent memory, heartbeat interventions) to help readers position CORAL.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and scope. We address each major comment below and have revised the manuscript accordingly to strengthen the claims and clarify limitations.

Referee: [§5] §5 (Experimental Evaluation), Table 2 and associated text: the central claim that gains arise from autonomy mechanisms (long-running agents, shared memory, asynchronous collaboration) rather than implementation details requires explicit verification that fixed baselines match CORAL on total LLM calls, prompting templates, memory usage, and evaluation protocol; without these controls or an ablation isolating each autonomy component, the 3-10x improvement rates cannot be confidently attributed to the proposed framework.

Authors: We agree that explicit matching and ablations are needed for confident attribution. The original experiments were designed with comparable total LLM calls and evaluation budgets across methods, but we acknowledge this was not stated with sufficient detail. In the revised manuscript, we have added a dedicated subsection in §5 that explicitly verifies matching on total LLM calls, prompting templates, memory usage, and evaluation protocol. We have also included a new ablation study that isolates the contributions of long-running agents, shared persistent memory, and asynchronous multi-agent execution. These changes directly address the concern and allow readers to better attribute the observed 3-10x gains to the autonomy mechanisms. revision: yes
Referee: [§4.3] §4.3 (Heartbeat-based Interventions) and §5.3 (Mechanistic Analyses): the heartbeat intervention frequency is listed as a free parameter, yet the paper does not report sensitivity analysis or default values across the 10 tasks; if performance is sensitive to this choice, it undermines the claim that the framework reduces reliance on hand-crafted rules.

Authors: We accept this observation. The heartbeat frequency was treated as a tunable parameter with a default value chosen for stability, but sensitivity was not reported. In the revised version, we have added a sensitivity analysis in §4.3 and §5.3 that evaluates performance across a range of frequencies on representative tasks from the 10-task suite. We also now explicitly list the default values used per task. The analysis shows robustness within a practical range, which supports rather than undermines the reduced reliance on hand-crafted rules; we have updated the text to reflect this. revision: yes
Referee: [§5.1] §5.1 (Task Descriptions) and §6 (Discussion): the evaluation tasks are presented as representative of open-ended discovery, but the paper does not address whether the observed gains generalize beyond the specific optimization objectives or whether task-specific tuning of agent prompts or memory schemas could explain the results; a clearer discussion of this scope limitation is needed to support the broader conclusions.

Authors: We agree that a more explicit discussion of scope and potential task-specific effects is warranted. While the 10 tasks span mathematical, algorithmic, and systems domains, we have expanded §6 with a new paragraph on scope limitations. This addition acknowledges that generalization beyond the tested optimization objectives requires further study and clarifies that agent prompts and memory schemas were kept as general as possible with only minimal task-specific adjustments. These revisions provide a balanced view and temper the broader conclusions appropriately. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical framework and results

The paper presents CORAL as an empirical multi-agent framework evaluated on external tasks, reporting performance gains against fixed baselines on independent benchmarks. No mathematical derivations, equations, or first-principles predictions are claimed that reduce by construction to self-defined inputs, fitted parameters, or self-citation chains within the paper. Results are framed as observed improvements from autonomy mechanisms on tasks like kernel engineering, with no evidence of renaming known results or smuggling ansatzes via internal citations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework depends on several untested design assumptions about agent behavior and task suitability rather than deriving performance from first principles.

free parameters (1)

heartbeat intervention frequency
Design choice for balancing autonomy and safety not derived from data or theory.
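
As a rough illustration of why this parameter matters, the sketch below fires a heartbeat once a fixed number of evaluations has passed since the last one. The page does not say how CORAL actually triggers heartbeats, so the evaluation-count trigger, the `HeartbeatSchedule` class, and the `every_n_evals` knob are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatSchedule:
    """Hypothetical trigger: fire after every_n_evals evaluations since the last heartbeat."""
    every_n_evals: int
    last_fired_at: int = 0

    def due(self, eval_count: int) -> bool:
        return eval_count - self.last_fired_at >= self.every_n_evals

    def fire(self, eval_count: int) -> None:
        # In a real system this is where a reflection prompt would be delivered.
        self.last_fired_at = eval_count


def simulate(schedule: HeartbeatSchedule, total_evals: int) -> list[int]:
    """Return the evaluation counts at which a heartbeat fires."""
    fired = []
    for eval_count in range(1, total_evals + 1):
        if schedule.due(eval_count):
            schedule.fire(eval_count)
            fired.append(eval_count)
    return fired


fired_at = simulate(HeartbeatSchedule(every_n_evals=5), total_evals=12)
print(fired_at)  # prints [5, 10]
```

A sensitivity analysis of the kind the referee requests would sweep a knob like `every_n_evals` and report the final score for each setting.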

axioms (1)

domain assumption: LLM agents can meaningfully reflect and collaborate via natural language in shared memory
Invoked to justify the core collaboration mechanism.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Shared Persistent Memory as File System... attempts/, notes/, skills/ ... Heartbeat: Reflection, Consolidation and Redirection.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: Table 1: ... Improvement Rate ... # Evals ... CORAL’s autonomous evolution significantly outperforms fixed evolutionary search

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
cs.LG · 2026-05 · conditional · novelty 7.0

FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

Harnessing Agentic Evolution
cs.AI · 2026-05 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
cs.CL · 2026-05 · unverdicted · novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...

Evolutionary Ensemble of Agents
cs.NE · 2026-05 · unverdicted · novelty 7.0

EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.

DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
cs.LG · 2026-05 · unverdicted · novelty 6.0

DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.

HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization
cs.AI · 2026-05 · unverdicted · novelty 6.0

HMACE deploys Proposer, Generator, Evaluator, and Reflector agents in an evolutionary loop to generate and refine heuristics for NP-hard problems, reporting lower optimality gaps and token costs than baselines on TSP ...

CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC · 2026-04 · unverdicted · novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

Evaluation-driven Scaling for Scientific Discovery
cs.LG · 2026-04 · unverdicted · novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

Evolutionary Ensemble of Agents
cs.NE · 2026-05 · unverdicted · novelty 5.0

EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.

Prism: An Evolutionary Memory Substrate for Multi-Agent Open-Ended Discovery
cs.AI · 2026-04 · unverdicted · novelty 5.0

Prism unifies file, vector, graph, and evolutionary memory under a decision-theoretic framework with entropy-gated stratification, causal graphs, value-of-information retrieval, heartbeat consolidation, and replicator...

AI for Auto-Research: Roadmap & User Guide
cs.AI · 2026-05 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 10 Pith papers

[1] Read the task description above carefully

[2] Read the key files to understand the current state of the code

[3] Check the leaderboard: coral log

[4] Check recent activity: coral log --recent

[5] Inspect top attempts: coral show <hash>

[6] Search for prior art: coral log --search "keywords"

[7] Read notes: {shared dir}/notes/ for findings from other agents

[8] Check available skills: ls {shared dir}/skills/ # Workflow Your job is a loop: plan → edit → eval → repeat. ## 1. Plan--- Review what worked (coral log), inspect top attempts (coral show), check notes and skills from other agents. Think creatively. Keep plans lightweight. ## 2. Edit--- Make focused changes. One idea per eval. Bias toward speed. ## 3. Evaluate--- ...

[9] Anchor in concrete results--- Review your recent attempts (coral log -n 5 --recent). What specific changes led to score improvements or regressions? Example: “Attempt abc123 improved score from 0.72 to 0.78 by adding batch normalization after each conv layer.”

[10] Examine surprises--- What surprised you? What didn’t go as expected? Surprises reveal gaps in your mental model

[11] Analyze causes--- For your most significant result (good or bad): why did it happen? What’s the underlying mechanism?

[12] Assess confidence--- How certain are you about your current approach? What evidence would change your mind?

[13] Plan next experiment--- Based on this reflection, what’s one specific thing to try next? What do you expect to happen? Save your note in the most appropriate location within {shared dir}/notes/ (e.g., notes/architecture/normalization/batch-vs-layer.md). If you’ve discovered a reusable technique, create a skill in {shared dir}/skills/. Heartbeat Prompt: Cons...

[14] Stage & commit: Run git add -A followed by git commit -m "msg" in the agent’s worktree

[15] Load grader: Dynamically import class Grader from .coral/private/eval/grader.py (hidden from agents)

[16] Grade: Spawn the grader in a child process with a hard timeout (configurable per task, default 300 s). The grader returns a ScoreBundle containing a numeric score and textual feedback

[17] Determine status: Compare the score against the agent’s previous best: improved if strictly better, baseline if equal, regressed if worse, crashed if the grader returned None, or timeout if the grader exceeded the time limit. Table 6: Complete CLI reference. Commands are grouped by function. Agent-facing commands are available wit...

[18] Record attempt: Write an Attempt JSON record to .coral/public/attempts/<hash>.json

[19] Checkpoint: Snapshot the current shared persistent memory (notes, skills) with a hash for versioning

[20] Increment counter: Update the global evaluation counter at .coral/public/eval count. C.3 User Interface CORAL includes a web-based dashboard for real-time monitoring, launched via coral ui or coral start -c task.yaml run.ui=true. The dashboard is built as a React single-page application served by a Starlette (async Python) backend. Example screenshots a...

[21] Create the project directory structure (clone repo, set up .coral/)

[22] Seed heartbeat configurations (global and per-agent defaults)

[23] For each agent: create worktree, install symlinks, write .coral agent id breadcrumb, generate CORAL.md, and spawn the agent runtime process

[24] Enter the monitoring loop: detect new attempts, check heartbeat triggers, deliver heartbeat prompts, restart dead agents, handle graceful shutdown (SIGINT → SIGTERM → SIGKILL). Session persistence. Agent session IDs are extracted from runtime log files and saved to .coral/public/sessions.json during shutdown. On coral resume, the manager validates saved...

[25] No internet access is provided to agents unless the task configuration explicitly enables the research flag. Baselines. We compare against three fixed evolutionary search baselines: • OpenEvolve (Sharma, 2025): Open-source implementation of AlphaEvolve with static elite populations and diversity maintenance. • ShinkaEvolve (Lange et al., 2025): Adaptiv...
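
The status rule quoted in the extracted evaluation steps (improved if strictly better, baseline if equal, regressed if worse, crashed if the grader returned None, timeout if the grader exceeded the time limit) can be written as a small pure function. This is a sketch under stated assumptions, not CORAL's code: the `Status` enum, the `determine_status` signature, the `higher_is_better` switch, and the handling of a first attempt with no previous best are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Status(str, Enum):
    IMPROVED = "improved"
    BASELINE = "baseline"
    REGRESSED = "regressed"
    CRASHED = "crashed"
    TIMEOUT = "timeout"


def determine_status(
    score: Optional[float],
    previous_best: Optional[float],
    timed_out: bool = False,
    higher_is_better: bool = True,  # assumption: score direction is set per task
) -> Status:
    """Classify an attempt against the agent's previous best score."""
    if timed_out:
        return Status.TIMEOUT
    if score is None:  # the grader returned None
        return Status.CRASHED
    if previous_best is None:
        # Assumption: the first scored attempt counts as an improvement.
        return Status.IMPROVED
    if score == previous_best:
        return Status.BASELINE
    better = score > previous_best if higher_is_better else score < previous_best
    return Status.IMPROVED if better else Status.REGRESSED


print(determine_status(0.78, 0.72).value)                          # prints improved
print(determine_status(1103, 1363, higher_is_better=False).value)  # prints improved
```

Checking the timeout before the score keeps a timed-out run from being misreported as a crash, since a grader that is killed at the time limit would also have no score to return.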
