
arxiv: 2605.22897 · v1 · pith:OBBCCRM5new · submitted 2026-05-21 · 💻 cs.LG

From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

Pith reviewed 2026-05-25 05:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords residual analysis · LLM agents · mechanism inference · tabular data · correction terms · batch transfer · scientific machine learning · multi-agent optimization

The pith

LLM agents anchored to base-model residuals on tabular data generate explicit correction terms that transfer across experimental batches when the underlying mechanism is unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework in which LLM agents are given high-residual examples from a base statistical model and asked to propose explicit correction formulas rather than predict targets outright. These formulas are refined through multi-turn interaction and then tested for transfer. On the Cell-Free Protein dataset the frozen formulas improve predictions on new batches within the same reagent protocol in over 92 percent of cases yet fail systematically when the protocol itself changes. The success boundary tracks biochemical differences rather than batch identity, indicating that the corrections capture genuine mechanisms. This approach is shown to raise accuracy over the base model on nine diverse tabular benchmarks while producing human-readable terms.

Core claim

The central claim is that Multi-Agent Residual In-Context Learning produces correction terms whose validity can be verified by freezing the formulas learned on one batch and applying them unchanged to held-out batches. Within the same reagent protocol these formulas improve predictions in over 92 percent of cases, while across protocols they fail systematically; the success boundary aligns with the biochemistry rather than batch count or other superficial factors.

What carries the argument

Multi-Agent Residual In-Context Learning (MARICL), an agentic loop in which LLMs receive high-residual examples in context, hypothesize missing functional structure, and output explicit correction terms that are iteratively refined by textual gradient steps.
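
The residual-selection step that feeds this loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the least-squares base model, the function name `select_high_residual`, the synthetic data, and the reading of κ as the fraction of examples with the largest absolute residuals are all assumptions made for the example.

```python
import numpy as np

def select_high_residual(X, y, kappa=0.3):
    """Fit a linear base model and return indices of the highest-residual rows.

    Assumption: kappa is the fraction of examples, ranked by absolute
    residual, that would be shown to the LLM in context.
    """
    X1 = np.column_stack([np.ones(len(X)), X])      # add an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # ordinary least squares
    residuals = y - X1 @ coef
    n_select = max(1, int(round(kappa * len(y))))
    order = np.argsort(-np.abs(residuals))          # largest |residual| first
    return order[:n_select], residuals

# Synthetic data with an interaction the linear base model cannot represent.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = X[:, 0] + X[:, 1] + 2.0 * X[:, 0] * X[:, 1]

selected, residuals = select_high_residual(X, y, kappa=0.3)
```

The selected rows are the ones an agent would inspect when hypothesizing missing structure; in this toy case they concentrate where the omitted product term is largest or smallest.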

If this is right

  • The method raises predictive accuracy over the base model on every one of the nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic tabular data.
  • Formulas learned on one experimental batch improve predictions on held-out batches inside the identical reagent protocol in more than 92 percent of cases.
  • The identical formulas fail systematically when the reagent protocol changes.
  • The boundary between transfer success and failure coincides with biochemical protocol differences, not with the number of batches seen.
  • The output of the process consists of explicit, human-readable correction terms rather than opaque feature attributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual-to-correction workflow could be applied to any tabular scientific domain where batch or protocol shifts are common, offering a route to mechanistic models without requiring new labeled data.
  • If the generated corrections prove stable under further protocol variations, they could serve as candidate mechanistic hypotheses for targeted wet-lab validation.
  • The multi-turn refinement loop suggests a general pattern for turning black-box residuals into iteratively improvable symbolic models in other high-stakes tabular settings.

Load-bearing premise

That the correction terms hypothesized by the LLM from high-residual examples reflect real underlying mechanisms rather than dataset-specific noise or model artifacts.

What would settle it

Observing that the same frozen formulas improve predictions at comparable rates when applied to batches from a different reagent protocol would falsify the mechanistic-generalization claim.
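
A rough sketch of how such a check could be tallied is given below. It is illustrative only: the page does not say what counts as a "case", so the example assumes one case is one held-out batch and that "improves" means a lower mean absolute error than the base model. The function names and the toy numbers are hypothetical.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def transfer_success_rate(cases):
    """Fraction of held-out batches where the frozen correction beats the base model.

    Each case is a dict with keys: same_protocol (bool), y, base, corrected.
    Returns {True: rate within protocol, False: rate across protocols}.
    """
    wins = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for case in cases:
        key = case["same_protocol"]
        totals[key] += 1
        if mae(case["y"], case["corrected"]) < mae(case["y"], case["base"]):
            wins[key] += 1
    return {k: wins[k] / totals[k] for k in totals if totals[k]}

# Toy placeholder values, not results from the paper.
cases = [
    {"same_protocol": True, "y": [0.5, 0.7], "base": [0.4, 0.6], "corrected": [0.48, 0.69]},
    {"same_protocol": True, "y": [0.2, 0.9], "base": [0.3, 0.8], "corrected": [0.22, 0.88]},
    {"same_protocol": False, "y": [0.5, 0.7], "base": [0.45, 0.65], "corrected": [0.2, 0.95]},
]
rates = transfer_success_rate(cases)
```

Under this reading, the falsifying observation would be a cross-protocol rate (`rates[False]`) comparable to the within-protocol rate (`rates[True]`).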

Figures

Figures reproduced from arXiv: 2605.22897 by Mohammad R. Rezaei, Rahul G. Krishnan.

Figure 1: MARICL framework overview: (1-2) a base-model generates predictions, (3) residual analysis selects high-error examples, (4-5) an LLM encoder produces structured hypotheses z_k that a decoder converts into explanations T_k and executable formulas, (6) textual gradient optimization refines corrections via critique feedback, and (7-8) query-aware aggregation. A natural baseline is LLM-ICL: place the entir…
Figure 2: Performance across regression and classification benchmarks (…
Figure 3: Joint sensitivity of K (corrections) and κ (residual fraction) on Cell-Free Protein (R²). Performance is stable across K ∈ [1, 3] and κ ∈ [0.2, 0.4]. Practitioner Guidance. Based on our experiments, we recommend: (1) Start with K = 2 corrections, increasing only if performance plateaus; (2) Set κ = 0.3 as default, adjusting downward for noisy data or upward for systematic model failures; (3) Use early st…
Original abstract

A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base-model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count; direct evidence of mechanistic generalization.
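
The multi-turn refinement described in the abstract can be pictured as a critique-then-revise loop. The sketch below is a skeleton under stated assumptions, not the paper's algorithm: the LLM critique and revision steps are replaced by injected callables (`critique_fn`, `refine_fn`), a correction is reduced to a single coefficient, and the accept-only-if-better rule is an assumption of this example.

```python
def refine_correction(initial, critique_fn, refine_fn, score_fn, n_iters=3):
    """Iteratively revise a correction, keeping a candidate only if it scores lower."""
    best, best_score = initial, score_fn(initial)
    history = [(initial, best_score)]
    for _ in range(n_iters):
        feedback = critique_fn(best, best_score)   # stands in for an LLM critique
        candidate = refine_fn(best, feedback)      # stands in for an LLM revision
        cand_score = score_fn(candidate)
        history.append((candidate, cand_score))
        if cand_score < best_score:
            best, best_score = candidate, cand_score
    return best, best_score, history

# Toy setup: base-model residuals follow r = 2 * x, and the correction is c * x.
xs = [1.0, 2.0, 3.0]
residuals = [2.0, 4.0, 6.0]

def score(c):
    """Mean squared error left over after applying the correction c * x."""
    return sum((r - c * x) ** 2 for x, r in zip(xs, residuals)) / len(xs)

def critique(c, current_score):
    """Deterministic stub: say which direction the coefficient should move."""
    signed_error = sum(r - c * x for x, r in zip(xs, residuals))
    return "increase" if signed_error > 0 else "decrease"

def refine(c, feedback):
    """Deterministic stub: take a fixed step in the suggested direction."""
    return c + 0.5 if feedback == "increase" else c - 0.5

best, best_score, history = refine_correction(0.5, critique, refine, score, n_iters=3)
```

In a real pipeline the feedback would be free text and the candidate an executable formula string; only the control flow is meant to carry over.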

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Multi-Agent Residual In-Context Learning (MARICL), an agentic LLM framework that anchors on a base model, supplies high-residual examples in context, and elicits explicit correction terms via multi-turn textual gradient optimization. It reports consistent predictive gains over base models on nine tabular benchmarks and presents a batch-transfer experiment on the Cell-Free Protein dataset in which formulas learned on one batch are frozen and applied to held-out batches, yielding >92% improvement within the same reagent protocol but systematic failure across protocols, interpreted as evidence of mechanistic generalization.

Significance. If the transfer results hold under the requested controls, the work supplies a concrete route to joint prediction and mechanism inference on tabular scientific data by narrowing the LLM query to residuals rather than direct prediction. The held-out batch evaluation is a clear methodological strength, supplying external grounding that distinguishes the approach from pure in-sample fitting.

major comments (2)
  1. [§4.3] Cell-Free Protein transfer test: the central claim that success aligns with biochemistry rather than protocol-correlated statistical patterns rests on aggregate improvement rates; the section does not exhibit the explicit correction terms, nor does it map any term to a known biochemical rate law or interaction, so the mechanistic interpretation is not yet isolated from alternative explanations.
  2. [§4.1–4.2] Nine-benchmark evaluation: the abstract and main results supply no base-model specifications, prompt templates, number of LLM calls per correction, or per-dataset variance/statistical tests, which directly affects the load-bearing claim of consistent improvement across scientific, biomedical, and socioeconomic settings.
minor comments (2)
  1. [Figure 1] The multi-agent architecture diagram (Figure 1) would benefit from explicit labeling of which agent performs residual analysis versus term refinement.
  2. [§3.2] Notation for the textual gradient step is introduced without a reference to prior LLM-as-optimizer literature; a short related-work sentence would clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below, proposing targeted revisions to improve clarity and evidentiary strength while preserving the core claims supported by the experiments.

Point-by-point responses
  1. Referee: [§4.3] Cell-Free Protein transfer test: the central claim that success aligns with biochemistry rather than protocol-correlated statistical patterns rests on aggregate improvement rates; the section does not exhibit the explicit correction terms, nor does it map any term to a known biochemical rate law or interaction, so the mechanistic interpretation is not yet isolated from alternative explanations.

    Authors: We agree that displaying the explicit correction terms would strengthen the mechanistic claim. In the revision we will add representative correction formulas generated by the agents for the Cell-Free Protein batches, together with a brief discussion of their alignment with known reagent interactions. At the same time, the transfer design itself supplies isolating evidence: formulas learned on one batch improve held-out batches under the identical protocol (>92% of cases) yet fail systematically under a different protocol. This success boundary tracks the biochemical change (reagent protocol) rather than batch statistics or sample size, which would be unlikely if the terms captured only protocol-correlated noise. A complete mapping of every term onto an explicit rate law lies beyond the paper's scope, but the provided examples and transfer results together narrow alternative statistical explanations. revision: partial

  2. Referee: [§4.1–4.2] Nine-benchmark evaluation: the abstract and main results supply no base-model specifications, prompt templates, number of LLM calls per correction, or per-dataset variance/statistical tests, which directly affects the load-bearing claim of consistent improvement across scientific, biomedical, and socioeconomic settings.

    Authors: We accept that these implementation details are necessary for reproducibility and for assessing the strength of the reported gains. The revised manuscript will (i) state the exact base models and hyperparameters for each of the nine benchmarks, (ii) include the prompt templates used by the multi-agent system, (iii) report the typical number of LLM calls per correction, and (iv) add per-dataset standard deviations together with paired statistical tests (e.g., Wilcoxon signed-rank) comparing MARICL against the base model. These additions will be placed in §4.1–4.2 and the appendix. revision: yes
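
For readers unfamiliar with the paired test named in this response, a small standard-library sketch of an exact two-sided Wilcoxon signed-rank test is shown below. It is an illustration only: the function name is made up for this example, the scores are placeholder values and not results from the paper, and a real analysis would more likely call a statistics library.

```python
from itertools import product

def wilcoxon_signed_rank_exact(after, before):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Returns (W_plus, p_value). Zero differences are dropped; tied absolute
    differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(after, before) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1          # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Enumerate all 2^n sign assignments to build the exact null distribution.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w_obs:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Placeholder scores for nine datasets (hypothetical, for illustration only).
base_scores = [0.60, 0.55, 0.72, 0.81, 0.43, 0.66, 0.58, 0.77, 0.69]
maricl_scores = [0.62, 0.60, 0.73, 0.85, 0.50, 0.69, 0.66, 0.83, 0.78]

w_plus, p_value = wilcoxon_signed_rank_exact(maricl_scores, base_scores)
```

With nine pairs that all move in the same direction, the smallest achievable two-sided p-value is 2/2⁹, which is why per-dataset variance matters as much as the sign of the gain.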

Circularity Check

0 steps flagged

No circularity detected; held-out transfer test supplies independent validation.

Full rationale

The paper's key claim rests on an empirical protocol: LLM-derived correction formulas are generated from residuals on one experimental batch, then applied frozen (no retraining, no further LLM calls) to held-out batches. Within-protocol improvement (>92%) versus cross-protocol failure is presented as evidence that the corrections capture structure aligned with biochemistry rather than batch noise. This test set is statistically independent of the batch used to produce the formulas, so the reported success rate does not reduce to the input data by construction. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the abstract or described method. The result is therefore an ordinary out-of-distribution empirical check rather than a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unproven premise that LLMs can extract mechanistic structure from residual patterns in tabular data when given high-error examples in context; no independent evidence for this premise is supplied beyond the single-dataset transfer result.

axioms (1)
  • domain assumption: LLMs can infer mechanistic structure from high-residual examples in tabular data when prompted appropriately.
    The entire pipeline rests on this capability without separate validation or formal justification.

pith-pipeline@v0.9.0 · 5780 in / 1435 out tokens · 26239 ms · 2026-05-25T05:35:39.918939+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    MARICL ... hypothesize missing structure from high-residual examples ... produce explicit correction terms ... frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count.

  • IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the correction encodes NAD-spermidine cofactor synergy ... Michaelis-Menten saturation term for folinic acid

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    High-residual reactions share elevated NAD and spermidine. The linear model misses their interaction. Proposed mechanism: multiplicative synergy drives underprediction

    First edition 1979; covers Michaelis-Menten kinetics, steady-state kinetics, and enzyme mechanisms. C Nick Pace, Bret A Shirley, Marsha McNutt, and Ketan Gajiwala. Forces contributing to the conformational stability of proteins. The FASEB Journal, 10(1):75–83, 1996. doi: 10.1096/fasebj.10.1.8566551. Ralph A DeFronzo, Ele Ferrannini, Leif Groop, Robert R H...

  2. [2]

    Target plate loading. The target plate CSV is loaded and split into 80/20 train/test using quantile-stratified sampling (mirroring the training protocol of Script 018)

  3. [3]

    18 depending on the --ml_source setting

    ML mechanism. The ML model (Linear or XGBoost, as used in the source run) is either (a) transferred directly from the source plate or (b) retrained on the target plate's train split, ¹ These counts and averages are computed under the headline averaged +blend configuration (row 1 of Table 10); the base-model trend reproduces qualitatively under each of the f...

  4. [4]

    Feature scaling. All features are scaled to [0.01, 0.99] using MinMaxScaler010 fit on the train split of whichever plate the ML model was trained on, then applied to the target test set

  5. [5]

    NumPy operations (np.clip, np.exp, etc.) are available

    Formula evaluation. The extracted formula string is evaluated via Python's eval() with the test feature matrix injected as local variables. NumPy operations (np.clip, np.exp, etc.) are available. Outputs are clipped to [0,1] for stability

  6. [6]

    Prediction blending. Final predictions are a 50/50 blend of the ML mechanism output and the average formula output across all transferred LLM mechanisms: $\hat{y} = 0.5\,\hat{y}_{\mathrm{ML}} + 0.5 \cdot \frac{1}{|M_{\mathrm{LLM}}|} \sum_{m \in M_{\mathrm{LLM}}} \hat{y}_m$. We adopt this blend for transfer because (i) treating formula outputs as absolute predictions on the bounded [0,1] scale is more robust than treating them as re...

  7. [7]

    Encoder calls: For each correction k, encoding requires $\lceil |D_{\text{high-res}}| / B \rceil$ LLM calls when batched encoding is used (Eq. 7). With K corrections, this totals $K \cdot \lceil |D_{\text{high-res}}| / B \rceil$ encoder calls

  8. [8]

    Corr” = correction only; “Full

    Decoder and refinement calls: Initial decoding requires K calls (Eq. 8). Each refinement iteration requires one critique generation (Eq. 11) and one correction refinement (Eq. 13) per correction, totaling 2KT calls over T iterations. Combined: K(1 + 2T) decoder/refinement calls. The total number of LLM calls is therefore: $N_{\text{calls}} = K \cdot \lceil |D_{\text{high-res}}| / B \rceil + K(1 + 2T)$ ...

  9. [9]

    When age is high and BMI is high, target is typically high

    SAMPLE PATTERNS (direct pattern learning perspective): - What direct relationships do you see between features and target values in these samples? - How do feature values relate to target values? (e.g., "When age is high and BMI is high, target is typically high") - What prediction rules would work based on the sample patterns themselves? - What feature c...

  10. [10]

    The mechanism must be descriptive and textual (not just math)

  11. [11]

    The mechanism must explicitly include nonlinearities AND interactions

  12. [12]

    The mechanism must introduce intermediate combinatory concepts (named, explained) that capture nonlinear interactions

  13. [13]

    Formula:

    The final formula must be executable and appear as a SINGLE LINE starting with "Formula:" so it can be extracted programmatically. WHAT TO INCLUDE: - Named intermediate concepts (1–3): define them briefly, but do NOT rely on them in the final Formula line. Instead, inline/expand them in the final Formula expression. - Nonlinear transforms: at least one (s...
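
The formula-evaluation, clipping, blending, and call-count passages extracted in items 5 to 8 above can be tied together in a short sketch. This is an approximation under assumptions, not the authors' code: the function names, the feature names `nad` and `spermidine`, the two formula strings, and all numeric inputs are hypothetical. Emptying `__builtins__` only narrows what the evaluated string can reach; it is not a sandbox, and `eval` on untrusted LLM output remains unsafe.

```python
import math
import numpy as np

def evaluate_formula(formula, features):
    """Evaluate a formula string with feature columns injected as local names."""
    local_vars = {name: np.asarray(col, dtype=float) for name, col in features.items()}
    local_vars["np"] = np                      # expose np.clip, np.exp, etc.
    raw = eval(formula, {"__builtins__": {}}, local_vars)
    return np.clip(np.asarray(raw, dtype=float), 0.0, 1.0)   # clip to [0, 1]

def blend_predictions(y_ml, formula_outputs):
    """50/50 blend of the ML output and the mean of the formula outputs."""
    return 0.5 * np.asarray(y_ml, dtype=float) + 0.5 * np.mean(formula_outputs, axis=0)

def total_llm_calls(n_high_res, batch_size, k, t):
    """K * ceil(|D_high-res| / B) encoder calls plus K * (1 + 2T) decoder/refinement calls."""
    encoder_calls = k * math.ceil(n_high_res / batch_size)
    decoder_calls = k * (1 + 2 * t)
    return encoder_calls + decoder_calls

# Hypothetical scaled features and formula strings.
features = {"nad": [0.2, 0.8], "spermidine": [0.5, 0.9]}
formulas = ["0.1 + nad * spermidine", "np.clip(2.0 * nad, 0, 5)"]

outputs = [evaluate_formula(f, features) for f in formulas]
y_hat = blend_predictions([0.5, 0.7], outputs)
n_calls = total_llm_calls(n_high_res=30, batch_size=10, k=2, t=3)
```

Because each formula output is clipped before averaging, a single runaway expression cannot move the blended prediction by more than 0.5 divided by the number of formulas.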