CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation
Pith reviewed 2026-06-30 10:44 UTC · model grok-4.3
The pith
An auditable evidence gate lets LLMs revise scientific experiment policies only when pre-outcome data supports the change.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CARE controls LLM-generated policies in scientific experimentation by maintaining a non-LLM incumbent as the default action path and using LLMs only to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent and authorizes the change only when the pre-selection evidence supports it, with the decision recorded in an audit log. The paper reports that this setup outperforms all other evaluated methods on the Minerva/Olympus and ChemLex benchmarks.
What carries the argument
The public-evidence intervention gate that compares challenger and incumbent using only evidence available before any outcome is revealed and authorizes selection only when the evidence supports the change.
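The gate structure described above can be sketched roughly as follows. The paper as summarized here does not specify the comparison metric or the authorization predicate, so `evidence_score`, the `margin` threshold, and the `GateDecision` record are hypothetical placeholders chosen for illustration, not CARE's actual rule.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GateDecision:
    """One audit-log entry (hypothetical record layout)."""
    round_index: int
    incumbent_pick: str
    challenger_pick: str
    incumbent_evidence: float
    challenger_evidence: float
    authorized: bool


def evidence_gate(
    round_index: int,
    incumbent_pick: str,
    challenger_pick: str,
    evidence_score: Callable[[str], float],
    audit_log: List[Dict],
    margin: float = 0.05,
) -> str:
    """Return the pick to run next, using only pre-outcome evidence.

    Assumption: the challenger is authorized only if its pre-outcome
    evidence beats the incumbent's by at least `margin`. The real
    predicate is not given in the source.
    """
    inc = evidence_score(incumbent_pick)
    cha = evidence_score(challenger_pick)
    authorized = cha >= inc + margin
    # Every decision is logged, whether or not the change is authorized.
    audit_log.append(asdict(GateDecision(
        round_index, incumbent_pick, challenger_pick, inc, cha, authorized
    )))
    return challenger_pick if authorized else incumbent_pick


# Hypothetical pre-outcome evidence for three candidates.
public_scores = {"cand_a": 0.60, "cand_b": 0.62, "cand_c": 0.80}
score = lambda cid: public_scores[cid]

log: List[Dict] = []
pick_round_0 = evidence_gate(0, "cand_a", "cand_b", score, log)  # gap too small
pick_round_1 = evidence_gate(1, "cand_a", "cand_c", score, log)  # gap exceeds margin
```

In this sketch the incumbent stays the default: the challenger has to clear the margin to be selected, and the log records rejected proposals as well as accepted ones.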
If this is right
- LLM self-evolution produces more reliable policies when proposals expand under the auditable controller rather than when LLMs select experiments directly.
- Final-best scores rise from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent, when the gate is active.
- All policy changes are traceable through the audit log of gate decisions.
- The non-LLM incumbent remains the default unless pre-outcome evidence explicitly supports replacement.
- Unsafe exploration is limited because the gate blocks changes unsupported by available evidence.
Where Pith is reading between the lines
- The same gate structure could be tested in other irreversible decision settings such as automated materials synthesis or clinical trial design.
- Performance stability might degrade if the gate is removed or if the evidence used for comparison becomes noisier than in the reported benchmarks.
- The audit log could serve as training data for refining the gate's own decision criteria in subsequent iterations.
Load-bearing premise
The public-evidence intervention gate can make reliable authorization decisions by comparing challenger and incumbent using only evidence available before any outcome is revealed, without the gate itself introducing selection bias or requiring post-hoc adjustments.
What would settle it
A run in which the gate authorizes a challenger that later underperforms the incumbent on the identical pre-outcome evidence set, or in which removing the gate eliminates the reported score gains on Minerva/Olympus and ChemLex.
Original abstract
Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CARE, an auditable controller for LLM-generated policies in high-throughput scientific experimentation (HTE) optimization. It retains a non-LLM incumbent optimizer as the default while using LLMs to propose challenger ranking policies; a public-evidence intervention gate compares challenger and incumbent using only pre-outcome evidence and authorizes a change only when that evidence supports it, with the decision logged for audit. The central empirical claim is that CARE outperforms all evaluated methods on the Minerva/Olympus and ChemLex benchmarks, raising final-best performance from 80.0 to 88.5 and from 83.9 to 92.1 respectively, and that LLM self-evolution is more reliable when operating under such an auditable controller rather than directly selecting experiments.
Significance. If the reported gains prove robust and the gate is shown to be free of post-hoc or outcome-correlated selection, the work would supply a concrete, auditable mechanism for safely incorporating LLM creativity into costly, irreversible scientific optimization loops. The emphasis on pre-outcome authorization and public logging directly addresses a practical barrier to LLM use in experimental design.
major comments (2)
- [Abstract] Abstract: the headline performance claims (final-best rising from 80.0 to 88.5 on Minerva/Olympus and 83.9 to 92.1 on ChemLex) are presented without any reference to number of runs, statistical tests, variance, or explicit baseline definitions. These omissions are load-bearing because the central claim is that CARE outperforms all other evaluated methods; without this information the magnitude and reliability of the lift cannot be assessed.
- [Methods (intervention gate)] Methods section describing the public-evidence intervention gate: the comparison metric and authorization predicate are not specified, nor is any argument or procedure given that the predicate is fixed before outcomes are observed. This is load-bearing for the claim that the measured gains arise from the auditable controller rather than from an outcome-correlated selection rule that the gate itself may embed.
minor comments (1)
- [Abstract] Abstract: the term 'final-best' is used without definition; it should be clarified whether it denotes the single best observed value at the end of the run or an average over final policies.
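The ambiguity flagged in the minor comment can be made concrete with a small sketch. The run data below is invented purely for illustration and does not come from the paper; `final_best_per_run` is a hypothetical helper, and averaging over runs is only one possible reading of an aggregated "final-best".

```python
from statistics import mean, stdev
from typing import List


def final_best_per_run(runs: List[List[float]]) -> List[float]:
    """Best objective value observed in each run by its final round."""
    return [max(run) for run in runs]


# Hypothetical objective values revealed round by round in three runs.
runs = [
    [70.0, 82.0, 79.0],
    [65.0, 71.0, 88.0],
    [60.0, 85.0, 84.0],
]

per_run = final_best_per_run(runs)

# Reading (a): the single best value observed across all runs.
single_best = max(per_run)

# Reading (b): an average of each run's final-best, with its spread.
mean_final_best = mean(per_run)
spread = stdev(per_run)
```

The two readings give different headline numbers on the same data, which is why the definition, the number of runs, and the variance all need to be stated alongside the score.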
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. The comments highlight important areas for clarification in the abstract and methods sections. We address each point below and plan to incorporate revisions to strengthen the presentation of our results and the description of the intervention gate.
- Referee: [Abstract] Abstract: the headline performance claims (final-best rising from 80.0 to 88.5 on Minerva/Olympus and 83.9 to 92.1 on ChemLex) are presented without any reference to number of runs, statistical tests, variance, or explicit baseline definitions. These omissions are load-bearing because the central claim is that CARE outperforms all other evaluated methods; without this information the magnitude and reliability of the lift cannot be assessed.
  Authors: We agree that the abstract would benefit from including references to the experimental setup details such as the number of runs, statistical tests, variance, and explicit baseline definitions. These details are provided in the main text (e.g., in the experimental results section), but to make the abstract self-contained and address the load-bearing nature of the claims, we will revise the abstract to incorporate summary statistics and references to the evaluation protocol. Revision: yes.
- Referee: [Methods (intervention gate)] Methods section describing the public-evidence intervention gate: the comparison metric and authorization predicate are not specified, nor is any argument or procedure given that the predicate is fixed before outcomes are observed. This is load-bearing for the claim that the measured gains arise from the auditable controller rather than from an outcome-correlated selection rule that the gate itself may embed.
  Authors: We acknowledge that the current description of the public-evidence intervention gate in the methods section lacks explicit specification of the comparison metric and authorization predicate, as well as a clear statement that the predicate is determined prior to observing outcomes. In the revised manuscript, we will provide a detailed specification of the metric (e.g., based on pre-outcome evidence such as historical performance or public benchmarks) and the predicate, along with an argument and procedure ensuring it is fixed ex ante to prevent outcome-correlated selection. This will reinforce that the gains are attributable to the auditable controller. Revision: yes.
Circularity Check
No significant circularity detected
The provided abstract and description contain no equations, derivations, fitted parameters, or self-citations that reduce any claim to its inputs by construction. The core assertions are empirical benchmark results (outperformance on Minerva/Olympus and ChemLex) and a descriptive account of the CARE controller and public-evidence gate; these are presented as measured outcomes rather than first-principles predictions or self-referential definitions. No load-bearing step matches any of the enumerated circularity patterns.
- Plan only one next edit/action for a language-generated optimizer skill.
- Use only public views and revealed objective values; do not seek evaluator-private labels, answer keys, baseline artifacts, files, network, or credentials.
- Prefer low-risk deterministic code edits that compile to rank_candidates.
- The final deployed tool must score the full remaining candidate_df, not a menu.
- Reuse is allowed only when the active skill is both deployable and still improving or diversifying selected outcomes.
- Do not patch solely because one round failed after a recent best-observed improvement; reuse once unless the last selected objective value was clearly poor.
- Plan patch_skill mainly after two or more consecutive no-improvement rounds, or after a revealed selection far below the current best.
- If patching after stagnation, blend the active scoring rule with bounded diversity/novelty; do not replace a previously improving policy with an unrelated scorer.
- If the active skill repeatedly selects lower-yield near-duplicates after a first improvement, patch it to avoid over-exploiting that local region.
- Return JSON with task_plan and self_reported_forbidden_info_used=false.
Planner output schema. {"schema_version":"self_evolving_task_plan_v1","run_id":"[RUN_ID]","round_index":"[ROUND_INDEX]","task_plan":{"action":"create_skill|patch_skill|reuse_active_skill","skill_family":"ranker|constraint|exploration|calibrator|fallback","objective":"short public...
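The reuse-versus-patch rules above can be read as a small decision procedure. The sketch below is one loose interpretation, not the paper's planner: the action names come from the schema, but `plan_action`, `trailing_no_improvement`, and the `far_below_fraction` threshold for "far below the current best" are hypothetical, and the "still improving or diversifying" condition is not modeled.

```python
from typing import List


def trailing_no_improvement(selected_values: List[float]) -> int:
    """Count the most recent consecutive rounds that did not raise the best-observed value."""
    best = float("-inf")
    streak = 0
    for value in selected_values:
        if value > best:
            best = value
            streak = 0
        else:
            streak += 1
    return streak


def plan_action(
    selected_values: List[float],
    has_active_skill: bool,
    far_below_fraction: float = 0.5,
) -> str:
    """Pick one planner action from the revealed objective values so far.

    Assumption: objective values are positive, and "far below the current
    best" means below `far_below_fraction` times the best observed value.
    """
    if not has_active_skill or not selected_values:
        return "create_skill"
    best = max(selected_values)
    last = selected_values[-1]
    clearly_poor = last < far_below_fraction * best
    # Patch after two or more stagnant rounds, or after a clearly poor pick.
    if trailing_no_improvement(selected_values) >= 2 or clearly_poor:
        return "patch_skill"
    # A single miss after a recent improvement is tolerated: reuse once.
    return "reuse_active_skill"
```

For example, `plan_action([50.0, 60.0, 58.0], True)` tolerates the single miss and reuses the active skill, while a second stagnant round or a pick far below the best triggers a patch.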
- Write normal Python source defining exactly rank_candidates(observed_df, candidate_df, memory=None, tool_state=None).
- Return a dictionary, never a DataFrame: {'ranked_candidates': list, 'tool_state': dict, 'tool_diagnostics': dict}.
- Each ranked row must include candidate_id, rank, score, reason_code, and evidence_refs. It must cover every candidate_df row with unique positive ranks and finite scores.
- evidence_refs is empty unless exact observation_id values are copied from observed_df; never invent evidence strings.
- Use only public observed_y for observed rows and public candidate features.
- Do not read files, call networks, import disallowed modules, inspect DataFrame attrs, or reference evaluator-private labels, answer keys, baseline artifacts, private provenance fields, or credentials.
- Do not use double-underscore names or strings in temporary columns, helper variables, imports, or escape hatches.
- Do not call getattr, setattr, hasattr, eval, exec, compile, globals, locals, vars, open, or __import__.
- Make ranking row-order invariant: do not use enumerate index, original row order, DataFrame index, or insertion order in scores or tie-breaks.
- Sort with public score first and candidate_id as the only deterministic tie-breaker; never tie-break by row position.
- Avoid creating helper columns such as _row_order, _index, __lig, or _position for ordering.
- Return JSON with skill_artifact.source and self_reported_forbidden_info_used=false.
Policy-synthesis output schema. {"schema_version":"self_evolving_skill_artifact_v1","run_id":"[RUN_ID]","round_index":"[ROUND_INDEX]","skill_artifact":{"skill_id":"short id","family":"[TASK_PLAN_SKILL_FAMILY]","source":"Python source string defining rank_candidates","ra...
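The output contract and the row-order rule above can be illustrated with a dependency-free sketch. The real `rank_candidates` takes DataFrames; here plain dictionaries stand in for them, and `rank_rows`, `validate_result`, the `"public_score"` reason code, and the choice that higher scores rank first are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

REQUIRED_ROW_KEYS = {"candidate_id", "rank", "score", "reason_code", "evidence_refs"}


def rank_rows(scored: Dict[str, float]) -> List[dict]:
    """Rank by score (descending, an assumption), with candidate_id as the only tie-breaker."""
    ordered = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    # Ranks are assigned after sorting, so input order never affects them.
    return [
        {"candidate_id": cid, "rank": position, "score": score,
         "reason_code": "public_score", "evidence_refs": []}
        for position, (cid, score) in enumerate(ordered, start=1)
    ]


def validate_result(result: dict, candidate_ids: List[str]) -> List[str]:
    """Return a list of contract violations; an empty list means the result passes."""
    if not isinstance(result, dict):
        return ["result must be a dictionary"]
    problems = []
    for key in ("ranked_candidates", "tool_state", "tool_diagnostics"):
        if key not in result:
            problems.append(f"missing key: {key}")
    rows = result.get("ranked_candidates", [])
    for row in rows:
        missing = REQUIRED_ROW_KEYS - set(row)
        if missing:
            problems.append(f"row missing fields: {sorted(missing)}")
    if Counter(row.get("candidate_id") for row in rows) != Counter(candidate_ids):
        problems.append("ranking must cover every candidate exactly once")
    ranks = [row.get("rank") for row in rows]
    if len(set(ranks)) != len(ranks) or any(not isinstance(r, int) or r < 1 for r in ranks):
        problems.append("ranks must be unique positive integers")
    if any(not isinstance(row.get("score"), (int, float)) or not math.isfinite(row["score"])
           for row in rows):
        problems.append("scores must be finite numbers")
    return problems


# Same hypothetical scores in two insertion orders: the ranking must not change.
rows_a = rank_rows({"c2": 0.4, "c1": 0.9, "c3": 0.4})
rows_b = rank_rows({"c3": 0.4, "c2": 0.4, "c1": 0.9})

good = {"ranked_candidates": rows_a, "tool_state": {}, "tool_diagnostics": {}}
incomplete = {"ranked_candidates": rows_a[:2], "tool_state": {}, "tool_diagnostics": {}}
```

Sorting on the pair (score, candidate_id) is what makes the ranking independent of row order: two candidates with equal scores are always separated by their identifiers rather than by where they happened to appear.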