CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

Baiqing Li; Boer Zhang; Guanyu Liu; Peiyu Zhang; Tianyu Shi; Weiyi Kong; Zeyu Wang

arxiv: 2606.14581 · v3 · pith:7K2OBH7Ynew · submitted 2026-06-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-30 10:44 UTC · model grok-4.3

keywords LLM policy control · auditable experimentation · high-throughput optimization · evidence-based authorization · challenger-incumbent comparison · scientific experiment control

The pith

An auditable evidence gate lets LLMs revise scientific experiment policies only when pre-outcome data supports the change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARE as a controller that keeps a non-LLM incumbent optimizer as the default path for high-throughput experimentation while letting LLMs propose revisions to challenger ranking policies. A public-evidence intervention gate reviews only the evidence available before any outcome is revealed, authorizes the challenger selection solely when that evidence supports the switch, and records the decision in an audit log. On the Minerva/Olympus benchmark the final-best score rises from 80.0 to 88.5 and on ChemLex from 83.9 to 92.1 relative to the public incumbent. The work claims that LLM self-evolution is more reliable when the proposal space expands under this auditable controller than when LLMs choose experiments directly.

Core claim

CARE controls LLM-generated policies in scientific experimentation by maintaining a non-LLM incumbent as the default action path and using LLMs only to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent and authorizes the change only when the pre-selection evidence supports it, with the decision recorded in an audit log. This yields higher performance on the Minerva/Olympus and ChemLex benchmarks.

What carries the argument

The public-evidence intervention gate that compares challenger and incumbent using only evidence available before any outcome is revealed and authorizes selection only when the evidence supports the change.
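
As a rough illustration of the control flow described above, the sketch below keeps the incumbent's pick as the default and hands authority to the challenger only when a pre-outcome check passes. The paper's actual comparison metric and authorization predicate are not given in the text reviewed here, so the `support_margin` quantity, the `min_margin` threshold, and the `GateDecision` record are hypothetical stand-ins rather than CARE's rule.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List

# Hypothetical types: evidence maps candidate_id -> pre-outcome support score,
# and a policy maps that public evidence to one candidate id.
Evidence = Dict[str, float]
Policy = Callable[[Evidence], str]


@dataclass
class GateDecision:
    round_index: int
    incumbent_pick: str
    challenger_pick: str
    support_margin: float
    authorized: bool


def gate_step(round_index: int, evidence: Evidence, incumbent: Policy,
              challenger: Policy, min_margin: float,
              audit_log: List[dict]) -> str:
    """Return the candidate to run; the incumbent's pick is the default."""
    inc = incumbent(evidence)
    cha = challenger(evidence)
    # Assumed predicate: the challenger's pick must beat the incumbent's pick
    # on pre-outcome support by at least min_margin. No outcome is read here.
    margin = evidence[cha] - evidence[inc]
    authorized = cha != inc and margin >= min_margin
    # Every decision is logged, whether or not the switch was authorized.
    audit_log.append(asdict(GateDecision(round_index, inc, cha, margin, authorized)))
    return cha if authorized else inc


# Usage with toy policies and made-up support scores.
evidence = {"a": 0.2, "b": 0.5, "c": 0.9}
log: List[dict] = []
incumbent = lambda ev: "b"
challenger = lambda ev: max(sorted(ev), key=ev.get)
chosen = gate_step(0, evidence, incumbent, challenger, min_margin=0.3, audit_log=log)
blocked = gate_step(1, evidence, incumbent, challenger, min_margin=0.5, audit_log=log)
```

With these toy numbers the first call switches to the challenger's pick and the second falls back to the incumbent, and both calls leave a record in `log`. The point of the design is that the fallback path never depends on the LLM.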

If this is right

LLM self-evolution produces more reliable policies when proposals expand under the auditable controller rather than when LLMs select experiments directly.
Final-best scores on Minerva/Olympus and ChemLex rise to 88.5 and 92.1 respectively when the gate is active.
All policy changes are traceable through the audit log of gate decisions.
The non-LLM incumbent remains the default unless pre-outcome evidence explicitly supports replacement.
Unsafe exploration is limited because the gate blocks changes unsupported by available evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gate structure could be tested in other irreversible decision settings such as automated materials synthesis or clinical trial design.
Performance stability might degrade if the gate is removed or if the evidence used for comparison becomes noisier than in the reported benchmarks.
The audit log could serve as training data for refining the gate's own decision criteria in subsequent iterations.

Load-bearing premise

The public-evidence intervention gate can make reliable authorization decisions by comparing challenger and incumbent using only evidence available before any outcome is revealed, without the gate itself introducing selection bias or requiring post-hoc adjustments.

What would settle it

A run in which the gate authorizes a challenger that later underperforms the incumbent on the identical pre-outcome evidence set, or in which removing the gate eliminates the reported score gains on Minerva/Olympus and ChemLex.
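
One way to operationalize the first test, assuming a replay benchmark where the incumbent's counterfactual outcome can be looked up after the fact, is to scan the audit log for authorized switches that did worse than the incumbent's pick would have. The record fields used below (`authorized`, `challenger_outcome`, `incumbent_outcome`) are hypothetical; the paper's log schema is not shown in the reviewed text.

```python
from typing import Dict, List


def authorized_regressions(audit_log: List[Dict], tolerance: float = 0.0) -> List[Dict]:
    """Return authorized records whose challenger outcome fell below the
    incumbent's counterfactual outcome by more than `tolerance`.

    Outcomes are attached post hoc for analysis only; the gate never sees them.
    """
    return [
        rec for rec in audit_log
        if rec["authorized"]
        and rec["challenger_outcome"] < rec["incumbent_outcome"] - tolerance
    ]


# Toy log with made-up numbers, purely for illustration.
toy_log = [
    {"round": 0, "authorized": True, "challenger_outcome": 71.0, "incumbent_outcome": 64.0},
    {"round": 1, "authorized": False, "challenger_outcome": 40.0, "incumbent_outcome": 66.0},
    {"round": 2, "authorized": True, "challenger_outcome": 58.0, "incumbent_outcome": 63.0},
]
bad = authorized_regressions(toy_log)
```

Here only round 2 is flagged: round 1 would also have regressed, but the gate blocked it, so it does not count against the gate.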

Figures

Figures reproduced from arXiv: 2606.14581 by Baiqing Li, Boer Zhang, Guanyu Liu, Peiyu Zhang, Tianyu Shi, Weiyi Kong, Zeyu Wang.

**Figure 2.** Anytime replay trajectories. Curves show normalized simple regret averaged over 30 matched …

**Figure 3.** Replay-protocol robustness over 10 matched seeds per setting. Cells report mean CARE deltas relative to the public incumbent and to the w/o-intervention-gate ablation under four replay protocols that vary initial observations and reveal budget. Darker cells indicate larger CARE gains; all evaluated cells show positive final-best and AUC deltas.

**Figure 4.** ChemLex public-incumbent case study. The figure shows the pre-reveal audit record used by the intervention gate to assign action authority. All ranks, support counts, gains, and risks are computed before reveal; the conversion is reported only post hoc and is not available to the gate.
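
Figure 2 reports normalized simple regret, but the normalization is not spelled out in the caption. A minimal sketch under one common convention, assuming a maximization task on a replay pool whose best and worst values are known, is given below; the function names and the choice of `(best - worst)` as the normalizer are assumptions, not the paper's definition.

```python
from typing import List, Sequence


def normalized_simple_regret(observed: Sequence[float], best: float, worst: float) -> List[float]:
    """Anytime curve: (best - best_so_far) / (best - worst) after each reveal.

    Assumes maximization with known pool extremes, as in a replay benchmark.
    """
    if best <= worst:
        raise ValueError("best must exceed worst")
    curve, best_so_far = [], float("-inf")
    for y in observed:
        best_so_far = max(best_so_far, y)
        curve.append((best - best_so_far) / (best - worst))
    return curve


def mean_curve(curves: Sequence[Sequence[float]]) -> List[float]:
    """Pointwise mean over seeds; all curves must share one length."""
    n = len(curves)
    return [sum(step) / n for step in zip(*curves)]


# Made-up yields for two seeds on a pool whose values span 0 to 100.
seed_a = normalized_simple_regret([50.0, 40.0, 80.0], best=100.0, worst=0.0)
seed_b = normalized_simple_regret([60.0, 90.0, 70.0], best=100.0, worst=0.0)
avg = mean_curve([seed_a, seed_b])
```

Because the curve tracks the best value seen so far, it is non-increasing in the number of reveals, which is what makes an "anytime" comparison between methods meaningful at every budget.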

Original abstract

Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CARE adds a pre-outcome evidence gate and audit log for LLM policy changes in HTE, but the benchmark gains cannot be assessed without the gate's exact rule and run details.

The paper's main move is to keep a standard optimizer as the default and let an LLM propose challenger ranking policies, then route the switch through a gate that looks only at evidence available before any outcome is seen and records the call. That architecture is the concrete new piece.

It does a clean job stating the practical tension: full LLM control risks unsafe or unstable runs on real hardware, while banning LLMs loses whatever optimization headroom they might add. The audit log requirement is also a useful constraint for lab settings where decisions need to be reviewable later.

The reported lifts (80.0 to 88.5 on Minerva/Olympus, 83.9 to 92.1 on ChemLex) are presented as evidence that the controlled version is more reliable, yet the abstract supplies no comparison metric, no authorization predicate, no number of runs, and no statistical test. Without those, the numbers cannot be read as support for the claim. The stress-test worry about selection bias is live: if the gate's rule ends up favoring trajectories that later improve, the measured gain is partly an artifact of the filter rather than proof that the controller itself improves reliability.

The rest of the work is standard framing and benchmark citation; nothing else stands out as formally verified or independently reproducible from what is shown.

This is for groups building automated labs or LLM agents that touch physical experiments. A reader who needs a worked example of an auditable intervention layer will find the high-level design useful even if the evaluation stays thin. It is worth sending to referees because the safety problem is real and the proposed structure is straightforward to implement and test; the current draft simply needs the methods and controls filled in before the performance claims can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The paper introduces CARE, an auditable controller for LLM-generated policies in high-throughput scientific experimentation (HTE) optimization. It retains a non-LLM incumbent optimizer as the default while using LLMs to propose challenger ranking policies; a public-evidence intervention gate compares challenger and incumbent using only pre-outcome evidence and authorizes a change only when that evidence supports it, with the decision logged for audit. The central empirical claim is that CARE outperforms all evaluated methods on the Minerva/Olympus and ChemLex benchmarks, raising final-best performance from 80.0 to 88.5 and from 83.9 to 92.1 respectively, and that LLM self-evolution is more reliable when operating under such an auditable controller rather than directly selecting experiments.

Significance. If the reported gains prove robust and the gate is shown to be free of post-hoc or outcome-correlated selection, the work would supply a concrete, auditable mechanism for safely incorporating LLM creativity into costly, irreversible scientific optimization loops. The emphasis on pre-outcome authorization and public logging directly addresses a practical barrier to LLM use in experimental design.

major comments (2)

[Abstract] Abstract: the headline performance claims (final-best rising from 80.0 to 88.5 on Minerva/Olympus and 83.9 to 92.1 on ChemLex) are presented without any reference to number of runs, statistical tests, variance, or explicit baseline definitions. These omissions are load-bearing because the central claim is that CARE outperforms all other evaluated methods; without this information the magnitude and reliability of the lift cannot be assessed.
[Methods (intervention gate)] Methods section describing the public-evidence intervention gate: the comparison metric and authorization predicate are not specified, nor is any argument or procedure given that the predicate is fixed before outcomes are observed. This is load-bearing for the claim that the measured gains arise from the auditable controller rather than from an outcome-correlated selection rule that the gate itself may embed.

minor comments (1)

[Abstract] Abstract: the term 'final-best' is used without definition; it should be clarified whether it denotes the single best observed value at the end of the run or an average over final policies.
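
To make the requests above concrete, the sketch below takes the first reading of 'final-best' (the single best observed value at the end of a run) and summarizes per-seed differences on matched seeds, which is the kind of run count and variance information the major comment asks for. The function names and all trajectories are hypothetical and are not the paper's data or protocol.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def final_best(trajectory: Sequence[float]) -> float:
    """Best observed value at the end of one run (first reading above)."""
    return max(trajectory)


def paired_summary(method: Dict[int, List[float]],
                   baseline: Dict[int, List[float]]) -> Dict[str, float]:
    """Mean and spread of per-seed final-best differences on matched seeds."""
    seeds = sorted(set(method) & set(baseline))
    deltas = [final_best(method[s]) - final_best(baseline[s]) for s in seeds]
    return {
        "n_seeds": len(seeds),
        "mean_delta": mean(deltas),
        "std_delta": stdev(deltas) if len(deltas) > 1 else 0.0,
        "wins": sum(d > 0 for d in deltas),
    }


# Made-up trajectories keyed by seed; not the paper's data.
care = {0: [60.0, 85.0, 88.0], 1: [55.0, 70.0, 90.0], 2: [62.0, 80.0, 86.0]}
incumbent = {0: [60.0, 78.0, 80.0], 1: [55.0, 82.0, 82.0], 2: [62.0, 79.0, 81.0]}
summary = paired_summary(care, incumbent)
```

Pairing on seeds removes the variance that comes from shared initial observations, so the spread of the deltas is usually more informative than the spread of either method's scores alone.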

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important areas for clarification in the abstract and methods sections. We address each point below and plan to incorporate revisions to strengthen the presentation of our results and the description of the intervention gate.

Referee: [Abstract] Abstract: the headline performance claims (final-best rising from 80.0 to 88.5 on Minerva/Olympus and 83.9 to 92.1 on ChemLex) are presented without any reference to number of runs, statistical tests, variance, or explicit baseline definitions. These omissions are load-bearing because the central claim is that CARE outperforms all other evaluated methods; without this information the magnitude and reliability of the lift cannot be assessed.

Authors: We agree that the abstract would benefit from including references to the experimental setup details such as the number of runs, statistical tests, variance, and explicit baseline definitions. These details are provided in the main text (e.g., in the experimental results section), but to make the abstract self-contained and address the load-bearing nature of the claims, we will revise the abstract to incorporate summary statistics and references to the evaluation protocol.
revision: yes

Referee: [Methods (intervention gate)] Methods section describing the public-evidence intervention gate: the comparison metric and authorization predicate are not specified, nor is any argument or procedure given that the predicate is fixed before outcomes are observed. This is load-bearing for the claim that the measured gains arise from the auditable controller rather than from an outcome-correlated selection rule that the gate itself may embed.

Authors: We acknowledge that the current description of the public-evidence intervention gate in the methods section lacks explicit specification of the comparison metric and authorization predicate, as well as a clear statement that the predicate is determined prior to observing outcomes. In the revised manuscript, we will provide a detailed specification of the metric (e.g., based on pre-outcome evidence such as historical performance or public benchmarks) and the predicate, along with an argument and procedure ensuring it is fixed ex ante to prevent outcome-correlated selection. This will reinforce that the gains are attributable to the auditable controller.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description contain no equations, derivations, fitted parameters, or self-citations that reduce any claim to its inputs by construction. The core assertions are empirical benchmark results (outperformance on Minerva/Olympus and ChemLex) and a descriptive account of the CARE controller and public-evidence gate; these are presented as measured outcomes rather than first-principles predictions or self-referential definitions. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5740 in / 1040 out tokens · 29977 ms · 2026-06-30T10:44:13.189262+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Portfolio Allocation for Bayesian Optimization

Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256. Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham, Andrew G. Wilson, and Eytan Bakshy. 2020. Botorch: A framework for efficient monte-carlo bayesian optimization. In Advances in Neural Information Processing Systems, volume 33, pages ...

2020
[2]

Bofire: Bayesian optimization framework intended for real experiments. Preprint, arXiv:2408.05040. Kobi C. Felton, Jan G. Rittig, and Alexei A. Lapkin
[3]

Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel

Summit: Benchmarking machine learning methods for reaction optimisation. Chemistry–Methods, 1(2):116–122. Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2024. Promptbreeder: Self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning,...

2024
[4]

Donald R

Springer Berlin Heidelberg. Donald R. Jones, Matthias Schonlau, and William J. Welch. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492. Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Mo...

1998
[5]

gradient descent

DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations. Agustinus Kristiadi, Felix Strieth-Kalthoff, Marta Skreta, Pascal Poupart, Alán Aspuru-Guzik, and Geoff Pleiss. 2024. A sober look at LLMs for material discovery: Are they actually good for bayesian optim...

2024
[6]

An Entropy Search Portfolio for Bayesian Optimization

An entropy search portfolio for bayesian optimization. Preprint, arXiv:1406.4625. Benjamin J. Shields, Jason Stevens, Jun Li, Marvin Parasram, Farhan Damani, Jesus I. Martinez Alvarado, Jacob M. Janey, Ryan P. Adams, and Abigail G. Doyle. 2021. Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590(7844):89–96. Mohit Shridhar, Xing...

2021
[7]

Gary Tom, Stefan P

An autonomous laboratory for the accelerated synthesis of inorganic materials. Nature, 624(7990):86–91. Gary Tom, Stefan P. Schmid, Sterling G. Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M. Rajaonson, Marta Skreta, Naruki Yoshikawa, Samantha Corapi, Gun Deniz Akkoc, Felix Strieth-Kalthoff, Martin Seifrid, and Alá...

2024
[8]

best_ligand_near_best

ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net. Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. 2025. Optimizing generative AI by backpropagating language model feedback. Nature, 639(8...

2023
[9]

Plan only one next edit/action for a language-generated optimizer skill
[10]

Use only public views and revealed objective values; do not seek evaluator-private labels, answer keys, baseline artifacts, files, network, or credentials
[11]

Prefer low-risk deterministic code edits that compile to rank_candidates
[12]

The final deployed tool must score the full remaining candidate_df, not a menu
[13]

Reuse is allowed only when the active skill is both deployable and still improving or diversifying selected outcomes
[14]

Do not patch solely because one round failed after a recent best-observed improvement; reuse once unless the last selected objective value was clearly poor
[15]

Plan patch_skill mainly after two or more consecutive no-improvement rounds, or after a revealed selection far below the current best
[16]

If patching after stagnation, blend the active scoring rule with bounded diversity/novelty; do not replace a previously improving policy with an unrelated scorer
[17]

If the active skill repeatedly selects lower-yield near-duplicates after a first improvement, patch it to avoid over-exploiting that local region
[18]

schema_version

Return JSON with task_plan and self_reported_forbidden_info_used=false. Planner output schema. {"schema_version":"self_evolving_task_plan_v1","run_id":"[RUN_ID]","round_index":"[ROUND_INDEX]", "task_plan":{"action":"create_skill|patch_skill|reuse_active_skill","skill_family":"ranker|constraint|exploration|calibrator|fallback", "objective":"short public...
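
The planner rules quoted above amount to a small decision procedure over the three `action` values in the schema. A minimal sketch of that procedure follows; the rules do not quantify "far below the current best", so the `poor_gap` threshold is a hypothetical parameter, and the deployability check mentioned in the reuse rule is omitted for brevity.

```python
from typing import List


def plan_action(revealed: List[float], has_active_skill: bool,
                poor_gap: float = 20.0, stagnation_rounds: int = 2) -> str:
    """Pick one of create_skill, patch_skill, reuse_active_skill.

    `revealed` holds the objective values revealed so far under the active
    skill, oldest first. Thresholds are illustrative assumptions.
    """
    if not has_active_skill:
        return "create_skill"
    if not revealed:
        return "reuse_active_skill"
    # Count trailing rounds that failed to raise the best-observed value.
    best, stagnant = float("-inf"), 0
    for y in revealed:
        if y > best:
            best, stagnant = y, 0
        else:
            stagnant += 1
    far_below_best = best - revealed[-1] >= poor_gap
    # Patch after repeated stagnation or one clearly poor selection;
    # a single mild miss right after an improvement is tolerated.
    if stagnant >= stagnation_rounds or far_below_best:
        return "patch_skill"
    return "reuse_active_skill"
```

For example, `plan_action([50.0, 60.0, 58.0], True)` reuses the skill after one mild miss, while a second consecutive miss or a drop of 20 or more below the best triggers a patch.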
[19]

Write normal Python source defining exactly rank_candidates(observed_df, candidate_df, memory=None, tool_state=None)
[20]

Return a dictionary, never a DataFrame: {'ranked_candidates': list, 'tool_state': dict, 'tool_diagnostics': dict}
[21]

It must cover every candidate_df row with unique positive ranks and finite scores

Each ranked row must include candidate_id, rank, score, reason_code, and evidence_refs. evidence_refs is empty unless exact observation_id values are copied from observed_df; never invent evidence strings
[22]

Use only public observed_y for observed rows and public candidate features
[23]

Do not read files, call networks, import disallowed modules, inspect DataFrame attrs, or reference evaluator-private labels, answer keys, baseline artifacts, private provenance fields, or credentials
[24]

Do not use double-underscore names or strings in temporary columns, helper variables, imports, or escape hatches
[25]

Do not call getattr, setattr, hasattr, eval, exec, compile, globals, locals, vars, open, or __import__
[26]

Make ranking row-order invariant: do not use enumerate index, original row order, DataFrame index, or insertion order in scores or tie-breaks
[27]

Sort with public score first and candidate_id as the only deterministic tie-breaker; never tie-break by row position
[28]

Avoid creating helper columns such as _row_order, _index, __lig, or _position for ordering
[29]

schema_version

Return JSON with skill_artifact.source and self_reported_forbidden_info_used=false. Policy-synthesis output schema. {"schema_version":"self_evolving_skill_artifact_v1","run_id":"[RUN_ID]","round_index":"[ROUND_INDEX]", "skill_artifact":{"skill_id":"short id","family":"[TASK_PLAN_SKILL_FAMILY]","source":"Python source string defining rank_candidates","ra...
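
The output rules for `rank_candidates` quoted above can be checked mechanically. The sketch below shows a row-order-invariant ranking and a validator for the stated constraints. It works on plain dictionaries rather than the `observed_df` and `candidate_df` DataFrames named in the rules, to stay dependency-free; `order_invariant_ranking`, `contract_violations`, and the `"public_score"` reason code are hypothetical names, not part of the paper's tooling.

```python
import math
from typing import Dict, Iterable, List

REQUIRED_KEYS = {"candidate_id", "rank", "score", "reason_code", "evidence_refs"}


def order_invariant_ranking(scores: Dict[str, float]) -> List[dict]:
    """Rank by score (descending), tie-breaking only on candidate_id."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Ranks are assigned after sorting, so input order never affects them.
    return [
        {"candidate_id": cid, "rank": rank, "score": s,
         "reason_code": "public_score", "evidence_refs": []}
        for rank, (cid, s) in zip(range(1, len(ordered) + 1), ordered)
    ]


def contract_violations(ranked: List[dict], candidate_ids: Iterable[str],
                        observation_ids: Iterable[str]) -> List[str]:
    """List breaches of the output rules quoted above; empty means none found."""
    problems = []
    expected, known_obs = set(candidate_ids), set(observation_ids)
    for row in ranked:
        missing = REQUIRED_KEYS - set(row)
        if missing:
            problems.append(f"missing keys {sorted(missing)}")
            continue
        if not math.isfinite(row["score"]):
            problems.append(f"non-finite score for {row['candidate_id']}")
        if not set(row["evidence_refs"]) <= known_obs:
            problems.append(f"invented evidence_refs for {row['candidate_id']}")
    complete = [r for r in ranked if REQUIRED_KEYS <= set(r)]
    if {r["candidate_id"] for r in complete} != expected or len(complete) != len(expected):
        problems.append("ranking does not cover every candidate exactly once")
    ranks = [r["rank"] for r in complete]
    if len(set(ranks)) != len(ranks) or any(not isinstance(k, int) or k < 1 for k in ranks):
        problems.append("ranks are not unique positive integers")
    return problems


# Made-up scores with a tie; reversing the input must not change the ranking.
scores = {"c2": 0.7, "c1": 0.7, "c3": 0.1}
ranked = order_invariant_ranking(scores)
shuffled = order_invariant_ranking(dict(reversed(list(scores.items()))))
issues = contract_violations(ranked, scores.keys(), observation_ids=["obs-1"])
```

A validator of this kind only checks the shape of the output. It cannot confirm that scores were computed from public evidence alone, which is the property the gate and the source-level restrictions are meant to protect.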
