From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

Mohammad R. Rezaei; Rahul G. Krishnan

arxiv: 2605.22897 · v1 · pith:OBBCCRM5new · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-25 05:35 UTC · model grok-4.3

classification 💻 cs.LG

keywords residual analysis · LLM agents · mechanism inference · tabular data · correction terms · batch transfer · scientific machine learning · multi-agent optimization

The pith

LLM agents anchored to base-model residuals on tabular data generate explicit correction terms that transfer across experimental batches when the underlying mechanism is unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework in which LLM agents are given high-residual examples from a base statistical model and asked to propose explicit correction formulas rather than predict targets outright. These formulas are refined through multi-turn interaction and then tested for transfer. On the Cell-Free Protein dataset the frozen formulas improve predictions on new batches within the same reagent protocol in over 92 percent of cases yet fail systematically when the protocol itself changes. The success boundary tracks biochemical differences rather than batch identity, indicating that the corrections capture genuine mechanisms. This approach is shown to raise accuracy over the base model on nine diverse tabular benchmarks while producing human-readable terms.

Core claim

The central claim is that Multi-Agent Residual In-Context Learning produces correction terms whose validity can be verified by freezing the formulas learned on one batch and applying them unchanged to held-out batches; within the same reagent protocol these formulas improve predictions in over 92 percent of cases while across protocols they fail systematically, with the success boundary aligning with the biochemistry rather than batch count or other superficial factors.

What carries the argument

Multi-Agent Residual In-Context Learning (MARICL), an agentic loop in which LLMs receive high-residual examples in context, hypothesize missing functional structure, and output explicit correction terms that are iteratively refined by textual gradient steps.
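
The residual-selection step that feeds this loop can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "high-residual" means the top κ fraction of samples ranked by absolute residual (the Figure 3 caption calls κ the residual fraction), and the function name `select_high_residual` and the toy data are hypothetical.

```python
def select_high_residual(y_true, y_pred, kappa=0.3):
    """Return indices of the top-kappa fraction of samples by absolute residual.

    Assumption: the paper's residual fraction is read here as a simple
    top-fraction cut; the actual selection rule may differ.
    """
    residuals = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n_keep = max(1, round(kappa * len(residuals)))
    # Sort sample indices from largest to smallest absolute residual.
    order = sorted(range(len(residuals)), key=lambda i: residuals[i], reverse=True)
    return order[:n_keep]


# Toy targets and base-model predictions (made up for illustration).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y_pred = [1.1, 2.0, 4.5, 4.0, 5.2, 3.0, 7.0, 8.1, 9.0, 12.0]
idx = select_high_residual(y_true, y_pred, kappa=0.3)
print(idx)  # indices of the samples that would be shown to the LLM in context
```

Only the selected rows would be placed in the LLM context, which is what narrows the question from "predict the target" to "what is the base model missing here".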

If this is right

The method raises predictive accuracy over the base model on every one of the nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic tabular data.
Formulas learned on one experimental batch improve predictions on held-out batches inside the identical reagent protocol in more than 92 percent of cases.
The identical formulas fail systematically when the reagent protocol changes.
The boundary between transfer success and failure coincides with biochemical protocol differences, not with the number of batches seen.
The output of the process consists of explicit, human-readable correction terms rather than opaque feature attributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same residual-to-correction workflow could be applied to any tabular scientific domain where batch or protocol shifts are common, offering a route to mechanistic models without requiring new labeled data.
If the generated corrections prove stable under further protocol variations, they could serve as candidate mechanistic hypotheses for targeted wet-lab validation.
The multi-turn refinement loop suggests a general pattern for turning black-box residuals into iteratively improvable symbolic models in other high-stakes tabular settings.

Load-bearing premise

That the correction terms hypothesized by the LLM from high-residual examples reflect real underlying mechanisms rather than dataset-specific noise or model artifacts.

What would settle it

Observing that the same frozen formulas improve predictions at comparable rates when applied to batches from a different reagent protocol would falsify the mechanistic-generalization claim.
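
The falsification check above amounts to comparing improvement rates between two groups of held-out batches. A minimal sketch, assuming that a "case" is one held-out batch and that "improves" means lower mean absolute error than the base model; both readings are assumptions, and all names and numbers below are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error over one batch."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def improvement_rates(cases):
    """Map each group label to the fraction of its cases that the frozen formula improves."""
    wins, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        base_err = mae(case["y_true"], case["base_pred"])
        corrected_err = mae(case["y_true"], case["corrected_pred"])
        totals[case["group"]] += 1
        wins[case["group"]] += corrected_err < base_err  # True counts as 1
    return {group: wins[group] / totals[group] for group in totals}


# Made-up held-out batches; "corrected_pred" stands for base model plus frozen formula.
cases = [
    {"group": "same_protocol", "y_true": [0.5, 0.7],
     "base_pred": [0.3, 0.5], "corrected_pred": [0.45, 0.65]},
    {"group": "same_protocol", "y_true": [0.2, 0.9],
     "base_pred": [0.4, 0.6], "corrected_pred": [0.25, 0.8]},
    {"group": "different_protocol", "y_true": [0.6, 0.4],
     "base_pred": [0.5, 0.5], "corrected_pred": [0.9, 0.1]},
]
rates = improvement_rates(cases)
print(rates)
```

Under this reading, comparable rates in the two groups would count against the mechanistic-generalization claim, while a high same-protocol rate paired with a low cross-protocol rate is the pattern the paper reports.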

Figures

Figures reproduced from arXiv: 2605.22897 by Mohammad R. Rezaei, Rahul G. Krishnan.

**Figure 1.** MARICL framework overview: (1-2) a base model generates predictions, (3) residual analysis selects high-error examples, (4-5) an LLM encoder produces structured hypotheses z_k that a decoder converts into explanations T_k and executable formulas, (6) textual gradient optimization refines corrections via critique feedback, and (7-8) query-aware aggregation. A natural baseline is LLM-ICL: place the entir…

**Figure 2.** Performance across regression and classification benchmarks (…

**Figure 3.** Joint sensitivity of K (corrections) and κ (residual fraction) on Cell-Free Protein (R²). Performance is stable across K ∈ [1, 3] and κ ∈ [0.2, 0.4]. Practitioner Guidance. Based on our experiments, we recommend: (1) Start with K = 2 corrections, increasing only if performance plateaus; (2) Set κ = 0.3 as default, adjusting downward for noisy data or upward for systematic model failures; (3) Use early st…

Abstract

A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base-model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count; direct evidence of mechanistic generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MARICL anchors LLMs to base-model residuals to produce explicit correction terms and tests them via frozen-formula transfer on one biochemical dataset, but the abstract leaves implementation details and alternative explanations unaddressed.

The main thing to know is that this paper tries to get LLMs to output usable correction formulas rather than just explanations or direct predictions. They start with a base model, feed high-residual examples to LLM agents, and let the agents propose and refine explicit terms through multi-turn interaction. The frozen-formula test on the Cell-Free Protein data is the clearest piece of evidence: corrections learned on one batch improve held-out batches in the same reagent protocol in over 92% of cases but fail on a different protocol, with the cutoff matching the biochemistry rather than batch size alone.

Referee Report

2 major / 2 minor

Summary. The paper introduces Multi-Agent Residual In-Context Learning (MARICL), an agentic LLM framework that anchors on a base model, supplies high-residual examples in context, and elicits explicit correction terms via multi-turn textual gradient optimization. It reports consistent predictive gains over base models on nine tabular benchmarks and presents a batch-transfer experiment on the Cell-Free Protein dataset in which formulas learned on one batch are frozen and applied to held-out batches, yielding >92% improvement within the same reagent protocol but systematic failure across protocols, interpreted as evidence of mechanistic generalization.

Significance. If the transfer results hold under the requested controls, the work supplies a concrete route to joint prediction and mechanism inference on tabular scientific data by narrowing the LLM query to residuals rather than direct prediction. The held-out batch evaluation is a clear methodological strength, supplying external grounding that distinguishes the approach from pure in-sample fitting.

major comments (2)

[§4.3] Cell-Free Protein transfer test: the central claim that success aligns with biochemistry rather than protocol-correlated statistical patterns rests on aggregate improvement rates; the section does not exhibit the explicit correction terms, nor does it map any term to a known biochemical rate law or interaction, so the mechanistic interpretation is not yet isolated from alternative explanations.
[§4.1–4.2] Nine-benchmark evaluation: the abstract and main results supply no base-model specifications, prompt templates, number of LLM calls per correction, or per-dataset variance/statistical tests, which directly affects the load-bearing claim of consistent improvement across scientific, biomedical, and socioeconomic settings.

minor comments (2)

[Figure 1] The multi-agent architecture diagram (Figure 1) would benefit from explicit labeling of which agent performs residual analysis versus term refinement.
[§3.2] Notation for the textual gradient step is introduced without a reference to prior LLM-as-optimizer literature; a short related-work sentence would clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below, proposing targeted revisions to improve clarity and evidentiary strength while preserving the core claims supported by the experiments.

Referee: [§4.3] Cell-Free Protein transfer test: the central claim that success aligns with biochemistry rather than protocol-correlated statistical patterns rests on aggregate improvement rates; the section does not exhibit the explicit correction terms, nor does it map any term to a known biochemical rate law or interaction, so the mechanistic interpretation is not yet isolated from alternative explanations.

Authors: We agree that displaying the explicit correction terms would strengthen the mechanistic claim. In the revision we will add representative correction formulas generated by the agents for the Cell-Free Protein batches, together with a brief discussion of their alignment with known reagent interactions. At the same time, the transfer design itself supplies isolating evidence: formulas learned on one batch improve held-out batches under the identical protocol (>92% of cases) yet fail systematically under a different protocol. This success boundary tracks the biochemical change (reagent protocol) rather than batch statistics or sample size, which would be unlikely if the terms captured only protocol-correlated noise. A complete mapping of every term onto an explicit rate law lies beyond the paper’s scope, but the provided examples and transfer results together narrow alternative statistical explanations.

revision: partial

Referee: [§4.1–4.2] Nine-benchmark evaluation: the abstract and main results supply no base-model specifications, prompt templates, number of LLM calls per correction, or per-dataset variance/statistical tests, which directly affects the load-bearing claim of consistent improvement across scientific, biomedical, and socioeconomic settings.

Authors: We accept that these implementation details are necessary for reproducibility and for assessing the strength of the reported gains. The revised manuscript will (i) state the exact base models and hyperparameters for each of the nine benchmarks, (ii) include the prompt templates used by the multi-agent system, (iii) report the typical number of LLM calls per correction, and (iv) add per-dataset standard deviations together with paired statistical tests (e.g., Wilcoxon signed-rank) comparing MARICL against the base model. These additions will be placed in §4.1–4.2 and the appendix.

revision: yes
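
For orientation, the paired test named in item (iv) can be sketched with the standard library alone. This is an illustrative exact two-sided Wilcoxon signed-rank test that enumerates all sign patterns, which is feasible for nine benchmarks (2^9 = 512 patterns). The scores are invented placeholders, not results from the paper, and the function names are hypothetical.

```python
from itertools import product


def signed_ranks(diffs):
    """Rank nonzero differences by absolute value, averaging ranks over ties."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return nonzero, ranks


def wilcoxon_exact(diffs):
    """Return (statistic, two-sided exact p-value) for paired differences."""
    nonzero, ranks = signed_ranks(diffs)
    total = sum(ranks)
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    observed = min(w_plus, total - w_plus)
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for s, r in zip(signs, ranks) if s)
        hits += min(wp, total - wp) <= observed
    return observed, hits / 2 ** len(ranks)


# Placeholder per-benchmark scores (higher is better); NOT values from the paper.
base = [0.61, 0.72, 0.55, 0.80, 0.47, 0.66, 0.90, 0.58, 0.74]
maricl = [0.64, 0.77, 0.56, 0.88, 0.49, 0.70, 0.96, 0.65, 0.83]
observed, p_value = wilcoxon_exact([m - b for m, b in zip(maricl, base)])
print(observed, p_value)
```

With nine pairs, the smallest attainable two-sided p-value is 2/512, reached only when every benchmark moves in the same direction. A claim of improvement on all nine datasets therefore sits exactly at that floor, which is why the per-dataset variance in item (iv) matters alongside the test.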

Circularity Check

0 steps flagged

No circularity detected; held-out transfer test supplies independent validation.

The paper's key claim rests on an empirical protocol: LLM-derived correction formulas are generated from residuals on one experimental batch, then applied frozen (no retraining, no further LLM calls) to held-out batches. Within-protocol improvement (>92%) versus cross-protocol failure is presented as evidence that the corrections capture structure aligned with biochemistry rather than batch noise. This test set is statistically independent of the batch used to produce the formulas, so the reported success rate does not reduce to the input data by construction. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the abstract or described method. The result is therefore an ordinary out-of-distribution empirical check rather than a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unproven premise that LLMs can extract mechanistic structure from residual patterns in tabular data when given high-error examples in context; no independent evidence for this premise is supplied beyond the single-dataset transfer result.

axioms (1)

domain assumption: LLMs can infer mechanistic structure from high-residual examples in tabular data when prompted appropriately.
The entire pipeline rests on this capability without separate validation or formal justification.

pith-pipeline@v0.9.0 · 5780 in / 1435 out tokens · 26239 ms · 2026-05-25T05:35:39.918939+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

MARICL ... hypothesize missing structure from high-residual examples ... produce explicit correction terms ... frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · echoes

the correction encodes NAD-spermidine cofactor synergy ... Michaelis-Menten saturation term for folinic acid

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

High-residual reactions share elevated NAD and spermidine. The linear model misses their interaction. Proposed mechanism: multiplicative synergy drives underprediction

First edition 1979; covers Michaelis-Menten kinetics, steady-state kinetics, and enzyme mechanisms. C Nick Pace, Bret A Shirley, Marsha McNutt, and Ketan Gajiwala. Forces contributing to the conformational stability of proteins. The FASEB Journal, 10(1):75–83, 1996. doi: 10.1096/fasebj.10.1.8566551. Ralph A DeFronzo, Ele Ferrannini, Leif Groop, Robert R H...

work page doi:10.1096/fasebj 1979
[2]

Target plate loading. The target plate CSV is loaded and split into 80/20 train/test using quantile-stratified sampling (mirroring the training protocol of Script 018)

work page
[3]

18 depending on the --ml_source setting

ML mechanism. The ML model (Linear or XGBoost, as used in the source run) is either (a) transferred directly from the source plate or (b) retrained on the target plate’s train split, ¹ These counts and averages are computed under the headline averaged +blend configuration (row 1 of Table 10); the base-model trend reproduces qualitatively under each of the f...

work page
[4]

Feature scaling. All features are scaled to [0.01, 0.99] using MinMaxScaler010 fit on the train split of whichever plate the ML model was trained on, then applied to the target test set

work page
[5]

NumPy operations (np.clip, np.exp, etc.) are available

Formula evaluation. The extracted formula string is evaluated via Python’s eval() with the test feature matrix injected as local variables. NumPy operations (np.clip, np.exp, etc.) are available. Outputs are clipped to [0, 1] for stability

work page
[6]

Prediction blending. Final predictions are a 50/50 blend of the ML mechanism output and the average formula output across all transferred LLM mechanisms: $\hat{y} = 0.5\,\hat{y}_{\mathrm{ML}} + 0.5 \cdot \frac{1}{|\mathcal{M}_{\mathrm{LLM}}|} \sum_{m \in \mathcal{M}_{\mathrm{LLM}}} \hat{y}_m$. We adopt this blend for transfer because (i) treating formula outputs as absolute predictions on the bounded [0,1] scale is more robust than treating them as re...

work page
[7]

Encoder calls: For each correction $k$, encoding requires $\lceil |D_{\text{high-res}}| / B \rceil$ LLM calls when batched encoding is used (Eq. 7). With $K$ corrections, this totals $K \cdot \lceil |D_{\text{high-res}}| / B \rceil$ encoder calls

work page
[8]

Corr” = correction only; “Full

Decoder and refinement calls: Initial decoding requires $K$ calls (Eq. 8). Each refinement iteration requires one critique generation (Eq. 11) and one correction refinement (Eq. 13) per correction, totaling $2KT$ calls over $T$ iterations. Combined: $K(1 + 2T)$ decoder/refinement calls. The total number of LLM calls is therefore: $N_{\text{calls}} = K \cdot \lceil |D_{\text{high-res}}| / B \rceil + K(1 + 2$...

work page arXiv 2012
[9]

When age is high and BMI is high, target is typically high

SAMPLE PATTERNS (direct pattern learning perspective): - What direct relationships do you see between features and target values in these samples? - How do feature values relate to target values? (e.g., "When age is high and BMI is high, target is typically high") - What prediction rules would work based on the sample patterns themselves? - What feature c...

work page
[10]

The mechanism must be descriptive and textual (not just math)

work page
[11]

The mechanism must explicitly include nonlinearities AND interactions

work page
[12]

The mechanism must introduce intermediate combinatory concepts (named, explained) that capture nonlinear interactions

work page
[13]

Formula:

The final formula must be executable and appear as a SINGLE LINE starting with "Formula:" so it can be extracted programmatically. WHAT TO INCLUDE: - Named intermediate concepts (1–3): define them briefly, but do NOT rely on them in the final Formula line. Instead, inline/expand them in the final Formula expression. - Nonlinear transforms: at least one (s...

work page 2020
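
The transfer steps quoted in entries [5], [6], and [13] above (extract the single "Formula:" line, evaluate it with features as local variables, clip to [0, 1], then blend 50/50 with the ML output) can be sketched end to end. This is a hedged illustration, not the authors' script. The source evaluates over a NumPy feature matrix; here a scalar stand-in named `np_standin` replaces NumPy so the sketch runs on the standard library, and the LLM outputs, feature names, and coefficients are hypothetical.

```python
import math
from types import SimpleNamespace

# Assumption: scalar stand-in for the NumPy functions the source names
# (np.clip, np.exp), used only so this sketch needs no third-party packages.
np_standin = SimpleNamespace(
    clip=lambda x, lo, hi: max(lo, min(hi, x)),
    exp=math.exp,
)


def extract_formula(llm_text):
    """Return the expression on the single line that starts with 'Formula:'."""
    for line in llm_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Formula:"):
            return stripped[len("Formula:"):].strip()
    raise ValueError("no 'Formula:' line found")


def evaluate_formula(expression, row):
    """Evaluate one formula on one sample, with features injected as local names."""
    scope = {"np": np_standin, **row}
    # eval() runs arbitrary code; emptying __builtins__ is not a real sandbox.
    value = eval(expression, {"__builtins__": {}}, scope)
    return max(0.0, min(1.0, float(value)))  # clip output to [0, 1]


def blend(ml_prediction, formula_predictions):
    """50/50 blend of the ML output and the mean of the formula outputs."""
    mean_formula = sum(formula_predictions) / len(formula_predictions)
    return 0.5 * ml_prediction + 0.5 * mean_formula


# Hypothetical LLM outputs and scaled feature values, for illustration only.
llm_outputs = [
    "Mechanism: NAD and spermidine act together.\n"
    "Formula: 0.2 + 1.5 * nad * spermidine",
    "Mechanism: saturating response to folinic acid.\n"
    "Formula: np.clip(folinic_acid / (0.5 + folinic_acid), 0, 1)",
]
row = {"nad": 0.9, "spermidine": 0.8, "folinic_acid": 0.5}

formulas = [extract_formula(text) for text in llm_outputs]
formula_preds = [evaluate_formula(f, row) for f in formulas]
final = blend(0.6, formula_preds)  # 0.6 is a made-up ML prediction
print(formulas, formula_preds, final)
```

Because eval() executes whatever the LLM wrote, a real pipeline would presumably validate or parse the expression first; the empty `__builtins__` mapping here only removes the most obvious names.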

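
The call-count accounting in entries [7] and [8] can be checked with a small worked example. The sketch below simply adds the two stated totals, K·⌈|D_high-res|/B⌉ encoder calls and K(1 + 2T) decoder/refinement calls. The function name and the input values are hypothetical and chosen only to make the arithmetic visible.

```python
import math


def total_llm_calls(n_high_res, batch_size, n_corrections, n_iterations):
    """Return (encoder, decoder_refine, total) LLM call counts.

    encoder        = K * ceil(|D_high-res| / B)
    decoder_refine = K * (1 + 2T): K initial decodes plus one critique and
                     one refinement per correction per iteration
    """
    encoder = n_corrections * math.ceil(n_high_res / batch_size)
    decoder_refine = n_corrections * (1 + 2 * n_iterations)
    return encoder, decoder_refine, encoder + decoder_refine


# Hypothetical setting: 30 high-residual examples, batches of 8,
# K = 2 corrections, T = 3 refinement iterations.
encoder, decoder_refine, total = total_llm_calls(
    n_high_res=30, batch_size=8, n_corrections=2, n_iterations=3
)
print(encoder, decoder_refine, total)
```

In this setting the encoder accounts for 2 · ⌈30/8⌉ = 8 calls and decoding plus refinement for 2 · (1 + 6) = 14, so the refinement loop dominates the budget unless the high-residual set is large relative to the batch size.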