CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

Baiqing Li; Boer Zhang; Guanyu Liu; Peiyu Zhang; Tianyu Shi; Weiyi Kong; Zeyu Wang

arxiv: 2606.14581 · v2 · pith:7K2OBH7Ynew · submitted 2026-06-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 04:50 UTC · model grok-4.3

keywords: LLM policy control · auditable evidence gate · high-throughput experimentation · incumbent optimizer · challenger ranking policy · scientific experiment optimization · audit log

The pith

CARE keeps a non-LLM optimizer as default and lets LLMs propose challenger policies only when a public-evidence gate authorizes the change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CARE to let LLMs assist in optimizing high-throughput scientific experiments without granting them direct control over costly actions. A non-LLM incumbent remains the default path while LLMs generate challenger ranking policies. Before any outcome is known, the public-evidence intervention gate checks whether pre-selection data supports switching to the challenger and records the decision in an audit log. Relative to the public incumbent, the method raises final-best performance from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex. The authors conclude that LLM self-evolution becomes more reliable when it expands the proposal space under this controlled gate rather than selecting experiments directly.

Core claim

CARE is an auditable controller for high-throughput experimentation optimization that retains a non-LLM incumbent optimizer as the default action path and restricts LLMs to revising challenger ranking policies. Before each outcome is revealed, the public-evidence intervention gate compares the challenger with the incumbent and authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. Relative to the public incumbent, the controller improves final-best scores from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex.

What carries the argument

The public-evidence intervention gate that authorizes an LLM challenger policy change only when pre-selection evidence supports the change.
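
A minimal sketch of how such a gate could be wired, assuming a hypothetical scalar `support` score computed from pre-selection evidence and a hypothetical fixed threshold `min_support`. The text above does not give the paper's actual criteria, so `GateDecision`, `evidence_gate`, and every field name here are illustrative only.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass(frozen=True)
class GateDecision:
    """One audit-log record, written before the outcome is revealed."""
    round_index: int
    incumbent_pick: str
    challenger_pick: str
    support: float       # hypothetical pre-selection evidence score
    min_support: float   # hypothetical authorization threshold
    authorized: bool
    selected: str


def evidence_gate(round_index: int, incumbent_pick: str, challenger_pick: str,
                  support: float, min_support: float,
                  audit_log: List[Dict]) -> GateDecision:
    """Default to the incumbent; switch only if support clears the threshold."""
    authorized = challenger_pick != incumbent_pick and support >= min_support
    selected = challenger_pick if authorized else incumbent_pick
    decision = GateDecision(round_index, incumbent_pick, challenger_pick,
                            support, min_support, authorized, selected)
    # The record is appended before any outcome exists for this round.
    audit_log.append(asdict(decision))
    return decision


audit_log: List[Dict] = []
d0 = evidence_gate(0, "cand_A", "cand_B", support=0.2, min_support=0.5,
                   audit_log=audit_log)
d1 = evidence_gate(1, "cand_C", "cand_D", support=0.8, min_support=0.5,
                   audit_log=audit_log)
print(d0.selected, d1.selected)  # cand_A cand_D
```

The point of the sketch is the ordering: the decision record depends only on inputs available before selection, and the incumbent's pick is what happens whenever authorization fails.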

If this is right

LLM-generated policies can be used to revise experiment selection without granting LLMs direct control over irreversible actions.
Benchmark performance improves when LLM proposals expand the search space under the gate rather than when LLMs choose experiments directly.
All policy changes remain traceable through the audit log of gate decisions.
Defaulting to the non-LLM incumbent reduces exposure to unsafe exploration while still allowing LLM creativity in challenger proposals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gate structure could be applied to other automated systems where an LLM proposes changes to a running policy.
The method might be tested by varying the type or quantity of evidence the gate receives before each decision.
In real laboratory settings the audit log could support regulatory review of automated experiment sequences.
Multiple simultaneous challengers could be ranked by the same evidence gate to increase the rate of useful policy updates.

Load-bearing premise

The public-evidence intervention gate can be implemented so that it determines, reliably and without bias, whether pre-selection evidence supports authorizing an LLM challenger policy change.

What would settle it

A run in which the gate authorizes a challenger policy change on the basis of pre-selection evidence yet the final performance is worse than the incumbent path that would have been taken without the change.
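
In a replay setting, one way to look for such a run is to scan matched seeds for cases where the gate authorized at least one switch and the gated run still finished below the incumbent-only run. The sketch below assumes a hypothetical per-seed record layout (`seed`, `n_authorized`, `final_best_gated`, `final_best_incumbent`); the values are placeholders for illustration, not results from the paper.

```python
from typing import Dict, List


def find_counterexamples(runs: List[Dict]) -> List[int]:
    """Return seeds where the gate authorized a switch yet the gated run
    ended below the matched incumbent-only run."""
    return [
        r["seed"] for r in runs
        if r["n_authorized"] > 0
        and r["final_best_gated"] < r["final_best_incumbent"]
    ]


# Placeholder values for illustration only; not taken from the paper.
runs = [
    {"seed": 0, "n_authorized": 3, "final_best_gated": 0.90, "final_best_incumbent": 0.85},
    {"seed": 1, "n_authorized": 0, "final_best_gated": 0.80, "final_best_incumbent": 0.80},
    {"seed": 2, "n_authorized": 2, "final_best_gated": 0.70, "final_best_incumbent": 0.75},
]
print(find_counterexamples(runs))  # [2]
```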

Figures

Figures reproduced from arXiv: 2606.14581 by Baiqing Li, Boer Zhang, Guanyu Liu, Peiyu Zhang, Tianyu Shi, Weiyi Kong, Zeyu Wang.

**Figure 2.** Anytime replay trajectories. Curves show normalized simple regret averaged over 30 matched …

**Figure 3.** Replay-protocol robustness over 10 matched seeds per setting. Cells report mean CARE deltas relative to the public incumbent and to the w/o-intervention-gate ablation under four replay protocols that vary initial observations and reveal budget. Darker cells indicate larger CARE gains; all evaluated cells show positive final-best and AUC deltas.

**Figure 4.** ChemLex public-incumbent case study. The figure shows the pre-reveal audit record used by the intervention gate to assign action authority. All ranks, support counts, gains, and risks are computed before reveal; the conversion is reported only post hoc and is not available to the gate.

Original abstract

Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CARE adds a logged evidence gate that lets LLMs propose policy changes in HTE only when pre-outcome public evidence backs them, defaulting to a non-LLM incumbent.

The main thing to know is that CARE keeps a non-LLM incumbent optimizer as the default while letting LLMs generate challenger ranking policies. Any switch is routed through a public-evidence gate that authorizes the change only if supporting evidence exists before the outcome is revealed, and the decision is logged for audit.

What is new is the concrete design of that intervention gate and the way it records the authorization step. The paper reports that this setup lifts final-best performance from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex relative to the public incumbent. It also shows that LLM self-evolution is more stable when it expands the proposal space under the controller rather than taking direct control of experiments. That framing of the safety-creativity tradeoff is clear and the benchmark numbers give a concrete starting point for comparison.

The work does a reasonable job making the mechanism auditable and keeping the incumbent in charge by default, which addresses a practical concern in automated labs where experiments are costly or irreversible.

The soft spots are in the limited information on implementation. The abstract gives performance figures but supplies no detail on how the evidence gate actually checks support, what counts as public evidence, how bias is avoided in the check, or the number of runs and statistical tests behind the gains. Without those, it is hard to judge whether the reported edge is robust or sensitive to the exact gate rules. If the full paper has thin methods or missing ablations, that would be the main issue to flag.

This paper is for researchers building or evaluating LLM-assisted optimization pipelines in chemistry and materials science. A reader who needs a workable safety layer for high-throughput experimentation would get the most from it.

It deserves a serious referee because the core architecture is testable and the motivation is grounded in real experimental constraints. I would send it to peer review so the gate details and experimental protocol can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces CARE, an auditable controller for LLM-generated policies in high-throughput experimentation (HTE) optimization. It retains a non-LLM incumbent optimizer as the default while allowing LLMs to propose challenger ranking policies; a public-evidence intervention gate authorizes a challenger only when pre-selection evidence supports the change, with the decision logged. The abstract reports that CARE outperforms all evaluated methods on the Minerva/Olympus and ChemLex benchmarks, improving final-best performance from 80.0 to 88.5 and from 83.9 to 92.1, respectively, and concludes that LLM self-evolution is more reliable under such an auditable controller.

Significance. If the reported gains and the gate implementation prove robust, the work could offer a concrete mechanism for safely incorporating LLM proposal generation into costly scientific experiments while preserving auditability and an incumbent fallback. This addresses a practical tension between exploration and risk in automated experimentation.

major comments (2)

[Abstract] The central performance claims (final-best improving from 80.0 to 88.5 on Minerva/Olympus and 83.9 to 92.1 on ChemLex) are stated without any information on baselines, number of runs, statistical tests, variance, or how the public-evidence intervention gate is operationalized; these omissions make the outperformance claim impossible to verify from the provided text.
[Abstract] The description of the public-evidence intervention gate ("compares the challenger with the incumbent" and "authorizes the challenger's selection only when the evidence available before selection supports the change") supplies no criteria, data sources, or decision procedure, which is load-bearing for both the safety and the claimed reliability advantages.

minor comments (1)

[Abstract] The benchmarks Minerva/Olympus and ChemLex are referenced without definitions, citations, or descriptions of the tasks or metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the abstract's self-containment. We address both major comments below and will revise the abstract in the next version to improve verifiability while preserving brevity.

Referee: [Abstract] The central performance claims (final-best improving from 80.0 to 88.5 on Minerva/Olympus and 83.9 to 92.1 on ChemLex) are stated without any information on baselines, number of runs, statistical tests, variance, or how the public-evidence intervention gate is operationalized; these omissions make the outperformance claim impossible to verify from the provided text.

Authors: We agree the abstract should reference key experimental parameters for immediate verifiability. The manuscript reports results relative to the public incumbent optimizer, averaged across 10 independent runs with standard deviations and paired statistical tests provided in Section 4 and the supplement. The gate is operationalized in Section 3.2. In revision we will append a concise clause to the abstract noting the run count, variance reporting, and that full gate details appear in the methods.

revision: yes

Referee: [Abstract] The description of the public-evidence intervention gate ("compares the challenger with the incumbent" and "authorizes the challenger's selection only when the evidence available before selection supports the change") supplies no criteria, data sources, or decision procedure, which is load-bearing for both the safety and the claimed reliability advantages.

Authors: The abstract intentionally remains high-level; the precise criteria (evidence strength from pre-selection public logs), data sources (prior HTE outcomes), and decision procedure (threshold comparison logged before any switch) are defined in Section 3.2. We will revise the abstract to include one additional sentence specifying that authorization occurs only when public evidence before selection favors the challenger under an auditable rule.

revision: yes
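
The simulated rebuttal refers to paired tests over matched runs. As a rough illustration of what a paired comparison computes, the sketch below takes hypothetical per-seed final-best scores for CARE and the incumbent, forms per-seed differences, and reports their mean, sample standard deviation, and paired t-statistic. The function name `paired_summary` and the input values are assumptions for illustration; they are not the paper's procedure or data.

```python
import math
import statistics
from typing import Dict, List


def paired_summary(care: List[float], incumbent: List[float]) -> Dict[str, float]:
    """Summarize per-seed differences between two matched sets of runs."""
    if len(care) != len(incumbent) or len(care) < 2:
        raise ValueError("need two equally sized lists with at least 2 seeds")
    deltas = [c - i for c, i in zip(care, incumbent)]
    mean_delta = statistics.mean(deltas)
    sd_delta = statistics.stdev(deltas)  # sample standard deviation
    # Paired t-statistic: mean difference over its standard error.
    if sd_delta > 0:
        t_stat = mean_delta / (sd_delta / math.sqrt(len(deltas)))
    else:
        t_stat = float("nan")  # undefined when all differences are identical
    return {"n": len(deltas), "mean_delta": mean_delta,
            "sd_delta": sd_delta, "t_stat": t_stat}


# Placeholder scores for illustration only; not results from the paper.
summary = paired_summary([88.0, 90.0, 87.0, 89.0], [80.0, 81.0, 79.0, 80.0])
print(summary["n"], summary["mean_delta"])  # 4 8.5
```

Pairing by seed matters because both methods see the same initial observations in a matched replay, so the per-seed difference removes variance that the two runs share.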

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and context contain no equations, parameter-fitting procedures, self-citations, or derivation steps that reduce any claimed result to an input by construction. The central claim is an empirical outperformance result on external benchmarks (Minerva/Olympus, ChemLex) under the CARE controller; this is presented as an experimental finding rather than a self-referential definition or renamed prior result. No load-bearing uniqueness theorems, ansatzes, or fitted-input predictions appear in the text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are identifiable from the provided text.

Reference graph

Works this paper leans on

29 extracted references · 2 linked inside Pith

[1]

Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G

Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256. Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G. Wilson, and Eytan Bakshy. 2020. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. In Advances in Neural Information Processing Systems, volume 33, pages ...

Pith/arXiv arXiv 2020
[2]

BoFire: Bayesian optimization framework intended for real experiments. Preprint, arXiv:2408.05040. Kobi C. Felton, Jan G. Rittig, and Alexei A. Lapkin

arXiv
[3]

Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel

Summit: Benchmarking machine learning methods for reaction optimisation. Chemistry–Methods, 1(2):116–122. Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2024. Promptbreeder: Self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning,...

2024
[4]

Donald R

Springer Berlin Heidelberg. Donald R. Jones, Matthias Schonlau, and William J. Welch. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492. Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Mo...

1998
[5]

gradient descent

DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations. Agustinus Kristiadi, Felix Strieth-Kalthoff, Marta Skreta, Pascal Poupart, Alán Aspuru-Guzik, and Geoff Pleiss. 2024. A sober look at LLMs for material discovery: Are they actually good for bayesian optim...

2024
[6]

Benjamin J

An entropy search portfolio for Bayesian optimization. Preprint, arXiv:1406.4625. Benjamin J. Shields, Jason Stevens, Jun Li, Marvin Parasram, Farhan Damani, Jesus I. Martinez Alvarado, Jacob M. Janey, Ryan P. Adams, and Abigail G. Doyle. 2021. Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590(7844):89–96. Mohit Shridhar, Xing...

Pith/arXiv arXiv 2021
[7]

Gary Tom, Stefan P

An autonomous laboratory for the accelerated synthesis of inorganic materials. Nature, 624(7990):86–91. Gary Tom, Stefan P. Schmid, Sterling G. Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M. Rajaonson, Marta Skreta, Naruki Yoshikawa, Samantha Corapi, Gun Deniz Akkoc, Felix Strieth-Kalthoff, Martin Seifrid, and Alá...

2024
[8]

best_ligand_near_best

ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net. Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. 2025. Optimizing generative AI by backpropagating language model feedback. Nature, 639(8...

2023
[9]

Plan only one next edit/action for a language-generated optimizer skill
[10]

Use only public views and revealed objective values; do not seek evaluator-private labels, answer keys, baseline artifacts, files, network, or credentials
[11]

Prefer low-risk deterministic code edits that compile to rank_candidates
[12]

The final deployed tool must score the full remaining candidate_df, not a menu
[13]

Reuse is allowed only when the active skill is both deployable and still improving or diversifying selected outcomes
[14]

Do not patch solely because one round failed after a recent best-observed improvement; reuse once unless the last selected objective value was clearly poor
[15]

Plan patch_skill mainly after two or more consecutive no-improvement rounds, or after a revealed selection far below the current best
[16]

If patching after stagnation, blend the active scoring rule with bounded diversity/novelty; do not replace a previously improving policy with an unrelated scorer
[17]

If the active skill repeatedly selects lower-yield near-duplicates after a first improvement, patch it to avoid over-exploiting that local region
[18]

schema_version

Return JSON with task_plan and self_reported_forbidden_info_used=false. Planner output schema. {"schema_version":"self_evolving_task_plan_v1","run_id":"[RUN_ID]","round_index":"[ROUND_INDEX]", "task_plan":{"action":"create_skill|patch_skill|reuse_active_skill","skill_family":"ranker|constraint|exploration|calibrator|fallback", "objective":"short public...
[19]

Write normal Python source defining exactly rank_candidates(observed_df, candidate_df, memory=None, tool_state=None)
[20]

Return a dictionary, never a DataFrame: {'ranked_candidates': list, 'tool_state': dict, 'tool_diagnostics': dict}
[21]

It must cover every candidate_df row with unique positive ranks and finite scores

Each ranked row must include candidate_id, rank, score, reason_code, and evidence_refs. It must cover every candidate_df row with unique positive ranks and finite scores. evidence_refs is empty unless exact observation_id values are copied from observed_df; never invent evidence strings
[22]

Use only public observed_y for observed rows and public candidate features
[23]

Do not read files, call networks, import disallowed modules, inspect DataFrame attrs, or reference evaluator-private labels, answer keys, baseline artifacts, private provenance fields, or credentials
[24]

Do not use double-underscore names or strings in temporary columns, helper variables, imports, or escape hatches
[25]

Do not call getattr, setattr, hasattr, eval, exec, compile, globals, locals, vars, open, or __import__
[26]

Make ranking row-order invariant: do not use enumerate index, original row order, DataFrame index, or insertion order in scores or tie-breaks
[27]

Sort with public score first and candidate_id as the only deterministic tie-breaker; never tie-break by row position
[28]

Avoid creating helper columns such as _row_order, _index, __lig, or _position for ordering
[29]

schema_version

Return JSON with skill_artifact.source and self_reported_forbidden_info_used=false. Policy-synthesis output schema. {"schema_version":"self_evolving_skill_artifact_v1","run_id":"[RUN_ID]","round_index":"[ROUND_INDEX]", "skill_artifact":{"skill_id":"short id","family":"[TASK_PLAN_SKILL_FAMILY]","source":"Python source string defining rank_candidates","ra...
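
Entries [19] through [28] read as rules for a generated `rank_candidates` function rather than as bibliography. A minimal sketch consistent with those rules follows, under loud assumptions: plain lists of dicts stand in for `observed_df` and `candidate_df` so that the example needs no third-party library, the `ligand` column is hypothetical, and the scoring rule (best observed value for the same ligand) is a toy chosen only to show the output shape and the tie-breaking rule.

```python
def rank_candidates(observed_df, candidate_df, memory=None, tool_state=None):
    # Assumption: both inputs are lists of dicts standing in for DataFrame rows.
    # Toy rule: score a candidate by the best public observed_y for its ligand.
    best_by_ligand = {}
    for row in observed_df:
        current = best_by_ligand.get(row["ligand"])
        if (current is None
                or row["observed_y"] > current["observed_y"]
                or (row["observed_y"] == current["observed_y"]
                    and row["observation_id"] < current["observation_id"])):
            best_by_ligand[row["ligand"]] = row

    scored = []
    for cand in candidate_df:
        match = best_by_ligand.get(cand["ligand"])
        score = float(match["observed_y"]) if match else 0.0
        # evidence_refs holds only observation_id values copied from observed rows.
        refs = [match["observation_id"]] if match else []
        scored.append((score, cand["candidate_id"], refs))

    # Highest score first; candidate_id is the only tie-breaker, so the
    # result does not depend on the input row order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    ranked = [
        {"candidate_id": cid, "rank": position, "score": score,
         "reason_code": "ligand_match" if refs else "no_evidence",
         "evidence_refs": refs}
        for position, (score, cid, refs) in enumerate(scored, start=1)
    ]
    return {"ranked_candidates": ranked,
            "tool_state": dict(tool_state or {}),
            "tool_diagnostics": {"n_candidates": len(ranked)}}


# Placeholder rows for illustration only.
observed = [
    {"observation_id": "o1", "ligand": "L1", "observed_y": 0.9},
    {"observation_id": "o2", "ligand": "L2", "observed_y": 0.4},
]
candidates = [
    {"candidate_id": "c3", "ligand": "L2"},
    {"candidate_id": "c1", "ligand": "L1"},
    {"candidate_id": "c2", "ligand": "L3"},
]
result = rank_candidates(observed, candidates)
print([r["candidate_id"] for r in result["ranked_candidates"]])  # ['c1', 'c3', 'c2']
```

Here `enumerate` is used only to number the already-sorted rows, so ranks are unique and positive without row position entering the score or the tie-break.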
