From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data
Pith reviewed 2026-05-25 05:35 UTC · model grok-4.3
The pith
LLM agents anchored to base-model residuals on tabular data generate explicit correction terms that transfer across experimental batches when the underlying mechanism is unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Multi-Agent Residual In-Context Learning produces correction terms whose validity can be checked by freezing the formulas learned on one batch and applying them unchanged to held-out batches. Within the same reagent protocol these formulas improve predictions in over 92 percent of cases; across protocols they fail systematically. The success boundary aligns with the biochemistry rather than with batch count or other superficial factors.
What carries the argument
Multi-Agent Residual In-Context Learning (MARICL), an agentic loop in which LLMs receive high-residual examples in context, hypothesize missing functional structure, and output explicit correction terms that are iteratively refined by textual gradient steps.
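The residual-anchoring step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper `high_residual_examples`, the toy base model, the toy data, and the hand-written interaction term that stands in for an LLM-proposed correction are all hypothetical, and no LLM is called.

```python
def high_residual_examples(rows, targets, base_predict, k):
    """Return the k examples the base model gets most wrong,
    as (residual, row, target) tuples sorted by |residual|."""
    scored = [(y - base_predict(x), x, y) for x, y in zip(rows, targets)]
    scored.sort(key=lambda item: abs(item[0]), reverse=True)
    return scored[:k]


def base_predict(x):
    # Hypothetical additive base model that misses the a*b interaction.
    return x["a"] + x["b"]


def correction(x):
    # Hypothetical explicit correction term of the kind an agent might propose.
    return 2 * x["a"] * x["b"]


# Toy data (made up for illustration): target = a + b + 2*a*b.
rows = [{"a": 0, "b": 0}, {"a": 1, "b": 0}, {"a": 0, "b": 1},
        {"a": 1, "b": 1}, {"a": 2, "b": 1}]
targets = [0, 1, 1, 4, 7]

# These are the examples that would be placed in the LLM's context.
top = high_residual_examples(rows, targets, base_predict, k=2)

# Residuals after adding the correction term to the base prediction.
corrected_residuals = [y - (base_predict(x) + correction(x))
                       for x, y in zip(rows, targets)]
```

In this toy case the two largest residuals come from the rows where both features are non-zero, which is exactly where an interaction term is missing.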
If this is right
- The method raises predictive accuracy over the base model on every one of the nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic tabular data.
- Formulas learned on one experimental batch improve predictions on held-out batches inside the identical reagent protocol in more than 92 percent of cases.
- The identical formulas fail systematically when the reagent protocol changes.
- The boundary between transfer success and failure coincides with biochemical protocol differences, not with the number of batches seen.
- The output of the process consists of explicit, human-readable correction terms rather than opaque feature attributions.
Where Pith is reading between the lines
- The same residual-to-correction workflow could be applied to any tabular scientific domain where batch or protocol shifts are common, offering a route to mechanistic models without requiring new labeled data.
- If the generated corrections prove stable under further protocol variations, they could serve as candidate mechanistic hypotheses for targeted wet-lab validation.
- The multi-turn refinement loop suggests a general pattern for turning black-box residuals into iteratively improvable symbolic models in other high-stakes tabular settings.
Load-bearing premise
That the correction terms hypothesized by the LLM from high-residual examples reflect real underlying mechanisms rather than dataset-specific noise or model artifacts.
What would settle it
Observing that the same frozen formulas improve predictions at comparable rates when applied to batches from a different reagent protocol would falsify the mechanistic-generalization claim.
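The comparison that would settle the question can be sketched as an improvement rate computed separately per protocol. The sketch assumes that a "case" is one held-out batch scored by mean squared error; the source does not define a case at this level of detail, and the numbers below are made-up placeholders, not results from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def improvement_rate(cases):
    """Fraction of cases where the frozen-formula prediction beats the base model.

    Each case is (y_true, base_pred, corrected_pred) for one held-out batch
    (an assumed definition of "case").
    """
    improved = sum(mse(y, corr) < mse(y, base) for y, base, corr in cases)
    return improved / len(cases)


# Placeholder batches, invented purely to exercise the function.
same_protocol = [
    ([1.0, 2.0], [1.5, 2.5], [1.1, 2.1]),
    ([0.0, 1.0], [0.4, 0.6], [0.1, 0.9]),
]
other_protocol = [
    ([1.0, 2.0], [1.2, 2.2], [2.0, 3.0]),
    ([0.0, 1.0], [0.1, 0.9], [0.05, 0.95]),
]

rate_same = improvement_rate(same_protocol)
rate_other = improvement_rate(other_protocol)
```

The mechanistic reading predicts a high rate for the same protocol and a clearly lower one for a different protocol; comparable rates in both groups would count against it.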
Original abstract
A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count; direct evidence of mechanistic generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Multi-Agent Residual In-Context Learning (MARICL), an agentic LLM framework that anchors on a base model, supplies high-residual examples in context, and elicits explicit correction terms via multi-turn textual gradient optimization. It reports consistent predictive gains over base models on nine tabular benchmarks and presents a batch-transfer experiment on the Cell-Free Protein dataset in which formulas learned on one batch are frozen and applied to held-out batches, yielding >92% improvement within the same reagent protocol but systematic failure across protocols, interpreted as evidence of mechanistic generalization.
Significance. If the transfer results hold under the requested controls, the work supplies a concrete route to joint prediction and mechanism inference on tabular scientific data by narrowing the LLM query to residuals rather than direct prediction. The held-out batch evaluation is a clear methodological strength, supplying external grounding that distinguishes the approach from pure in-sample fitting.
major comments (2)
- [§4.3] §4.3 (Cell-Free Protein transfer test): the central claim that success aligns with biochemistry rather than protocol-correlated statistical patterns rests on aggregate improvement rates; the section does not exhibit the explicit correction terms, nor does it map any term to a known biochemical rate law or interaction, so the mechanistic interpretation is not yet isolated from alternative explanations.
- [§4.1–4.2] §4.1–4.2 (nine-benchmark evaluation): the abstract and main results supply no base-model specifications, prompt templates, number of LLM calls per correction, or per-dataset variance/statistical tests, which directly affects the load-bearing claim of consistent improvement across scientific, biomedical, and socioeconomic settings.
minor comments (2)
- [Figure 1] The multi-agent architecture diagram (Figure 1) would benefit from explicit labeling of which agent performs residual analysis versus term refinement.
- [§3.2] Notation for the textual gradient step is introduced without a reference to prior LLM-as-optimizer literature; a short related-work sentence would clarify novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. We address each major comment below, proposing targeted revisions to improve clarity and evidentiary strength while preserving the core claims supported by the experiments.
-
Referee: [§4.3] §4.3 (Cell-Free Protein transfer test): the central claim that success aligns with biochemistry rather than protocol-correlated statistical patterns rests on aggregate improvement rates; the section does not exhibit the explicit correction terms, nor does it map any term to a known biochemical rate law or interaction, so the mechanistic interpretation is not yet isolated from alternative explanations.
Authors: We agree that displaying the explicit correction terms would strengthen the mechanistic claim. In the revision we will add representative correction formulas generated by the agents for the Cell-Free Protein batches, together with a brief discussion of their alignment with known reagent interactions. At the same time, the transfer design itself supplies isolating evidence: formulas learned on one batch improve held-out batches under the identical protocol (>92% of cases) yet fail systematically under a different protocol. This success boundary tracks the biochemical change (reagent protocol) rather than batch statistics or sample size, which would be unlikely if the terms captured only protocol-correlated noise. A complete mapping of every term onto an explicit rate law lies beyond the paper’s scope, but the provided examples and transfer results together narrow alternative statistical explanations. revision: partial
-
Referee: [§4.1–4.2] §4.1–4.2 (nine-benchmark evaluation): the abstract and main results supply no base-model specifications, prompt templates, number of LLM calls per correction, or per-dataset variance/statistical tests, which directly affects the load-bearing claim of consistent improvement across scientific, biomedical, and socioeconomic settings.
Authors: We accept that these implementation details are necessary for reproducibility and for assessing the strength of the reported gains. The revised manuscript will (i) state the exact base models and hyperparameters for each of the nine benchmarks, (ii) include the prompt templates used by the multi-agent system, (iii) report the typical number of LLM calls per correction, and (iv) add per-dataset standard deviations together with paired statistical tests (e.g., Wilcoxon signed-rank) comparing MARICL against the base model. These additions will be placed in §4.1–4.2 and the appendix. revision: yes
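A paired test of the kind the authors mention can be sketched without external libraries. The following is a minimal exact Wilcoxon signed-rank test that enumerates all sign assignments, which is feasible for nine benchmarks (512 assignments). The function names are hypothetical and the error values are invented placeholders, not numbers from the paper; in practice a library routine would normally be preferred.

```python
from itertools import product


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(a, b):
    """Return (W_plus, two-sided exact p-value) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)


# Placeholder per-benchmark errors (lower is better), invented for illustration.
base_err = [0.30, 0.42, 0.25, 0.51, 0.38, 0.47, 0.29, 0.60, 0.33]
maricl_err = [0.29, 0.40, 0.22, 0.47, 0.33, 0.41, 0.22, 0.52, 0.24]

w_plus, p_value = wilcoxon_signed_rank_exact(base_err, maricl_err)
```

With nine paired benchmarks, the smallest two-sided p-value this test can return is 2/512, reached only when the method wins on every dataset.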
Circularity Check
No circularity detected; held-out transfer test supplies independent validation.
The paper's key claim rests on an empirical protocol: LLM-derived correction formulas are generated from residuals on one experimental batch, then applied frozen (no retraining, no further LLM calls) to held-out batches. Within-protocol improvement (>92%) versus cross-protocol failure is presented as evidence that the corrections capture structure aligned with biochemistry rather than batch noise. This test set is statistically independent of the batch used to produce the formulas, so the reported success rate does not reduce to the input data by construction. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the abstract or described method. The result is therefore an ordinary out-of-distribution empirical check rather than a closed loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can infer mechanistic structure from high-residual examples in tabular data when prompted appropriately.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
MARICL ... hypothesize missing structure from high-residual examples ... produce explicit correction terms ... frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
the correction encodes NAD-spermidine cofactor synergy ... Michaelis-Menten saturation term for folinic acid
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
First edition 1979; covers Michaelis-Menten kinetics, steady-state kinetics, and enzyme mechanisms.
C Nick Pace, Bret A Shirley, Marsha McNutt, and Ketan Gajiwala. Forces contributing to the conformational stability of proteins. The FASEB Journal, 10(1):75–83, 1996. doi: 10.1096/fasebj.10.1.8566551.
Ralph A DeFronzo, Ele Ferrannini, Leif Groop, Robert R H...
-
[2]
Target plate loading. The target plate CSV is loaded and split into 80/20 train/test using quantile-stratified sampling (mirroring the training protocol of Script 018)
-
[3]
18 depending on the –ml_source setting
ML mechanism. The ML model (Linear or XGBoost, as used in the source run) is either (a) transferred directly from the source plate or (b) retrained on the target plate’s train split,
Footnote 1: These counts and averages are computed under the headline averaged +blend configuration (row 1 of Table 10); the base-model trend reproduces qualitatively under each of the f...
-
[4]
Feature scaling. All features are scaled to [0.01, 0.99] using MinMaxScaler010 fit on the train split of whichever plate the ML model was trained on, then applied to the target test set
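The fit-on-train, apply-to-test pattern in this step can be sketched as follows. This is a dependency-free stand-in for a min-max scaler, not the pipeline's code: `fit_min_max` and `scale` are hypothetical helpers, the values are invented, and the sketch assumes a non-constant training column.

```python
def fit_min_max(train_values):
    """Learn the scaling bounds from the training split only."""
    return min(train_values), max(train_values)


def scale(values, lo, hi, out_lo=0.01, out_hi=0.99):
    """Map [lo, hi] linearly onto [out_lo, out_hi]; assumes hi > lo."""
    span = hi - lo
    return [out_lo + (v - lo) / span * (out_hi - out_lo) for v in values]


train = [2.0, 4.0, 6.0]   # placeholder training column
test = [4.0, 8.0]         # placeholder target-plate test column

lo, hi = fit_min_max(train)           # bounds come from train, never from test
scaled_train = scale(train, lo, hi)
scaled_test = scale(test, lo, hi)
```

Because the bounds come from the training plate, a test value outside the training range (8.0 here) lands outside [0.01, 0.99] unless it is clipped afterwards.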
-
[5]
Formula evaluation. The extracted formula string is evaluated via Python’s eval() with the test feature matrix injected as local variables. NumPy operations (np.clip, np.exp, etc.) are available. Outputs are clipped to [0,1] for stability
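A minimal sketch of this evaluation step is given below. To stay dependency-free it evaluates one row at a time and exposes a small stand-in object under the name `np` (only `clip` and `exp`, built on the standard library) instead of real NumPy; the formula string, feature names, and `evaluate_formula` helper are hypothetical. Emptying `__builtins__` limits what a formula can reach but does not make `eval()` safe for untrusted input.

```python
import math
from types import SimpleNamespace


def _clip(x, lo, hi):
    """Scalar clip, standing in for np.clip in this sketch."""
    return max(lo, min(hi, x))


# Stand-in for the NumPy namespace (assumption: only clip and exp are needed).
NP_STANDIN = SimpleNamespace(clip=_clip, exp=math.exp)


def evaluate_formula(formula, rows):
    """Evaluate a formula string on each row, with features as local variables."""
    code = compile(formula, "<formula>", "eval")
    outputs = []
    for row in rows:
        value = eval(code, {"__builtins__": {}, "np": NP_STANDIN}, dict(row))
        outputs.append(_clip(float(value), 0.0, 1.0))  # clip outputs to [0, 1]
    return outputs


formula = "0.2 + 0.9 * x1 * x2"  # hypothetical extracted formula
rows = [{"x1": 0.5, "x2": 0.5}, {"x1": 0.99, "x2": 0.99}]
preds = evaluate_formula(formula, rows)
```

The second row evaluates to a raw value above 1 and is clipped, which is the stabilising behaviour the step describes.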
-
[6]
Prediction blending. Final predictions are a 50/50 blend of the ML mechanism output and the average formula output across all transferred LLM mechanisms: $\hat{y} = 0.5 \cdot \hat{y}_{\mathrm{ML}} + 0.5 \cdot \frac{1}{|M_{\mathrm{LLM}}|} \sum_{m \in M_{\mathrm{LLM}}} \hat{y}_m$. We adopt this blend for transfer because (i) treating formula outputs as absolute predictions on the bounded [0,1] scale is more robust than treating them as re...
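The blend is simple enough to state directly in code. The sketch below follows the stated 50/50 rule; the function name `blend_predictions` and the sample values are hypothetical.

```python
def blend_predictions(y_ml, formula_outputs, weight=0.5):
    """Blend ML predictions with the mean of the LLM formula predictions.

    y_ml            -- list of ML-model predictions, one per test row
    formula_outputs -- list of prediction lists, one list per transferred formula
    weight          -- weight on the ML output (0.5 gives the 50/50 blend)
    """
    n_formulas = len(formula_outputs)
    mean_formula = [
        sum(out[i] for out in formula_outputs) / n_formulas
        for i in range(len(y_ml))
    ]
    return [weight * ml + (1 - weight) * f for ml, f in zip(y_ml, mean_formula)]


# Placeholder predictions for two test rows and two transferred formulas.
y_ml = [0.2, 0.8]
formula_outputs = [[0.4, 0.6], [0.6, 1.0]]
blended = blend_predictions(y_ml, formula_outputs)
```

Averaging the formulas first and blending second means adding more transferred formulas never changes the total weight given to the LLM side.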
-
[7]
Encoder calls: For each correction $k$, encoding requires $\lceil |D_{\text{high-res}}| / B \rceil$ LLM calls when batched encoding is used (Eq. 7). With $K$ corrections, this totals $K \cdot \lceil |D_{\text{high-res}}| / B \rceil$ encoder calls
-
[8]
Corr” = correction only; “Full
Decoder and refinement calls: Initial decoding requires $K$ calls (Eq. 8). Each refinement iteration requires one critique generation (Eq. 11) and one correction refinement (Eq. 13) per correction, totaling $2KT$ calls over $T$ iterations. Combined: $K(1 + 2T)$ decoder/refinement calls. The total number of LLM calls is therefore: $N_{\text{calls}} = K \cdot \lceil |D_{\text{high-res}}| / B \rceil + K(1 + 2$...
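The call budget can be computed by adding the encoder count to the decoder/refinement count, both as stated in the text. The function name and the example values below are hypothetical.

```python
import math


def total_llm_calls(n_high_res, batch_size, k, t):
    """Total LLM calls for K corrections refined over T iterations.

    n_high_res -- number of high-residual examples, |D_high-res|
    batch_size -- encoder batch size B
    k          -- number of corrections K
    t          -- number of refinement iterations T
    """
    encoder_calls = k * math.ceil(n_high_res / batch_size)
    decoder_refine_calls = k * (1 + 2 * t)  # 1 initial decode + 2 calls per iteration
    return encoder_calls + decoder_refine_calls


# Example with made-up settings: 50 examples, B = 8, K = 3, T = 2.
n_calls = total_llm_calls(n_high_res=50, batch_size=8, k=3, t=2)
```

For these settings the encoder needs 3 × 7 = 21 calls and decoding plus refinement needs 3 × 5 = 15, for 36 in total.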
-
[9]
SAMPLE PATTERNS (direct pattern learning perspective):
- What direct relationships do you see between features and target values in these samples?
- How do feature values relate to target values? (e.g., "When age is high and BMI is high, target is typically high")
- What prediction rules would work based on the sample patterns themselves?
- What feature c...
-
[10]
The mechanism must be descriptive and textual (not just math)
-
[11]
The mechanism must explicitly include nonlinearities AND interactions
-
[12]
The mechanism must introduce intermediate combinatory concepts (named, explained) that capture nonlinear interactions
-
[13]
The final formula must be executable and appear as a SINGLE LINE starting with "Formula:" so it can be extracted programmatically.
WHAT TO INCLUDE:
- Named intermediate concepts (1–3): define them briefly, but do NOT rely on them in the final Formula line. Instead, inline/expand them in the final Formula expression.
- Nonlinear transforms: at least one (s...