Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding

Benoit L. Marteau; J. Ben Tamo; May D. Wang; Micky C. Nnamdi; Yishan Zhong

arxiv: 2604.07692 · v2 · submitted 2026-04-09 · 💻 cs.LG


Pith reviewed 2026-05-10 18:04 UTC · model grok-4.3

keywords: multimodal interpretability · evidence search · clinical prediction · model auditing · beam search · faithful grounding · inference-time optimization · discrete evidence selection

The pith

An inference-time beam search finds small sets of evidence units that faithfully reproduce multimodal model predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes Tree-of-Evidence as a way to make the reasoning of large multimodal models more transparent by searching for the minimal set of data pieces that the model relies on. It scores groups of data such as vital sign windows or text sentences using simple bottlenecks and then runs a beam search to pick the best compact collection. A reader should care because existing tools like attention maps often do not match what the model actually uses, especially when mixing time series and text. The method keeps almost all of the original prediction accuracy even when limited to five evidence pieces in medical and other tasks. It also shows better agreement with the full model than alternatives when evidence is scarce.

Core claim

The central discovery is that framing the search for faithful evidence as a discrete optimization problem solved via lightweight Evidence Bottlenecks and beam search allows identification of compact evidence sets that reproduce the model's predictions with high fidelity, producing auditable traces across multiple clinical and non-clinical multimodal tasks.

What carries the argument

The Tree-of-Evidence algorithm, which uses Evidence Bottlenecks to evaluate coarse data groups and beam search to optimize the selection of evidence units needed to match the full model's output.
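As a rough illustration of the search described above, the following Python sketch runs a beam search over subsets of evidence units. It is not the authors' implementation: `beam_search_evidence`, `score_fn`, `toy_model`, the unit weights, and the 0.01 per-unit penalty are all hypothetical stand-ins, and the toy scorer simply rewards subsets whose output probability stays close to the full-input probability.

```python
import math
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Subset = FrozenSet[int]


def beam_search_evidence(
    unit_ids: Sequence[int],
    score_fn: Callable[[Subset], float],
    budget: int,
    beam_width: int = 3,
) -> Tuple[Subset, float]:
    """Return the best-scoring subset found with at most `budget` units."""
    empty: Subset = frozenset()
    beam: List[Tuple[Subset, float]] = [(empty, score_fn(empty))]
    best = beam[0]
    for _ in range(budget):
        # Expand every subset on the beam by one unused unit.
        candidates: Dict[Subset, float] = {}
        for subset, _score in beam:
            for u in unit_ids:
                grown = subset | {u}
                if u not in subset and grown not in candidates:
                    candidates[grown] = score_fn(grown)
        if not candidates:
            break
        # Keep the top `beam_width` subsets; ties are broken by unit ids.
        ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
        beam = ranked[:beam_width]
        if beam[0][1] > best[1]:
            best = beam[0]
    return best


# Hypothetical toy "model": a logistic score over per-unit weights.
WEIGHTS = {0: 2.0, 1: -0.1, 2: 1.5, 3: 0.05, 4: -1.0}


def toy_model(subset: Subset) -> float:
    z = sum(WEIGHTS[u] for u in subset)
    return 1.0 / (1.0 + math.exp(-z))


P_FULL = toy_model(frozenset(WEIGHTS))


def toy_score(subset: Subset) -> float:
    # Reward closeness to the full-input probability, penalize set size.
    return -abs(P_FULL - toy_model(subset)) - 0.01 * len(subset)


best_set, best_score = beam_search_evidence(list(WEIGHTS), toy_score, budget=3)
greedy_set, _ = beam_search_evidence(list(WEIGHTS), toy_score, budget=3, beam_width=1)
print(sorted(best_set), sorted(greedy_set))
```

In this toy setup the width-3 search ends on units 0, 2, and 4, while a width-1 (greedy) search stays at unit 0 alone, because the useful three-unit set is only reachable through a pair that is not the top-ranked one. That is the kind of case where keeping several partial sets alive can matter.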

If this is right

Auditable evidence traces are produced for each prediction without much loss in performance.
Over 0.98 of the full-model AUROC is retained using as few as five evidence units in all tested settings.
Higher decision agreement and lower probability fidelity error occur compared to other methods when evidence budgets are sparse.
The search strategy adapts by relying mainly on vital signs for clear cases and adding text when signals are ambiguous.
The approach works across four clinical tasks on MIMIC-IV, cross-center on eICU, and fault detection on LEMMA-RCA.
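The "fraction of full-model AUROC retained" figure can be read as a simple ratio. The sketch below is one plausible way to compute it, using a small pairwise (Mann-Whitney style) AUROC in pure Python; the function names and all label and score values are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_retention(
    labels: Sequence[int],
    full_scores: Sequence[float],
    sparse_scores: Sequence[float],
) -> float:
    """Ratio of sparse-evidence AUROC to full-input AUROC."""
    return auroc(labels, sparse_scores) / auroc(labels, full_scores)


# Illustrative values only.
labels = [1, 1, 1, 0, 0, 0]
full_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]      # model on all evidence
sparse_scores = [0.85, 0.55, 0.3, 0.6, 0.2, 0.1]  # model on k selected units

retention = auroc_retention(labels, full_scores, sparse_scores)
print(f"retention = {retention:.3f}")
```

A ratio near 1 says the sparse-evidence ranking is almost as good as the full-input ranking; it does not by itself say that individual predictions agree, which is why the list above reports agreement and fidelity separately.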

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This discrete approach could help in settings where continuous attention weights are hard to audit, such as regulatory reviews of AI in medicine.
It raises the possibility of using similar search methods to debug model failures by examining which evidence units are selected or omitted.
Testing on additional modalities like images could show if the bottleneck scoring generalizes beyond time series and text.
If the evidence sets align with human expert judgment, it might improve trust in high-stakes decisions.

Load-bearing premise

The lightweight Evidence Bottlenecks accurately score groups of data without overlooking key interactions across different data types like time series and text.

What would settle it

The claim would be falsified by new examples on which the model, restricted to the five selected evidence units, yields a materially different predicted probability or decision than the full model.
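A check of this kind could be scripted along the following lines. This is only a sketch: `find_counterexamples`, the 0.5 decision threshold, and the 0.05 probability tolerance are assumptions chosen for illustration, and the probabilities are invented.

```python
from typing import List, Sequence, Tuple


def find_counterexamples(
    full_probs: Sequence[float],
    sparse_probs: Sequence[float],
    threshold: float = 0.5,   # hypothetical decision threshold
    tolerance: float = 0.05,  # hypothetical allowed probability drift
) -> List[Tuple[int, bool, float]]:
    """Return (index, decision_flipped, drift) for examples that break fidelity."""
    flagged = []
    for i, (p_full, p_sparse) in enumerate(zip(full_probs, sparse_probs)):
        flipped = (p_full >= threshold) != (p_sparse >= threshold)
        drift = abs(p_full - p_sparse)
        if flipped or drift > tolerance:
            flagged.append((i, flipped, drift))
    return flagged


# Invented probabilities: full-input model vs. model on five selected units.
full_probs = [0.90, 0.52, 0.10, 0.70]
sparse_probs = [0.88, 0.47, 0.12, 0.55]

for index, flipped, drift in find_counterexamples(full_probs, sparse_probs):
    print(index, "flip" if flipped else "drift", round(drift, 3))
```

Example 1 is flagged because the decision flips across the threshold, and example 3 because the probability moves by more than the tolerance even though the decision is unchanged.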

Figures

Figures reproduced from arXiv: 2604.07692 by Benoit L. Marteau, J. Ben Tamo, May D. Wang, Micky C. Nnamdi, Yishan Zhong.

**Figure 1.** Overview of the Tree-of-Evidence (ToE) Framework. Phase I: Modality-specific classifiers are trained independently, with BioClinicalBERT (Alsentzer et al., 2019) encoding notes and contextual data (CXR/ECG) concatenated as fixed priors. Phase II: Lightweight MLP selectors learn to score evidence units using Straight-Through Estimator (STE) top-k masking with frozen encoders. Phase III: At inference, beam s…

**Figure 2.** The Faithfulness-Sparsity Frontier. Performance across evidence budgets k on MIMIC-IV (E1: In-Hospital Mortality, 5 seeds). (a) Sufficiency: ToE (Red ⋆) matches the full model’s predictive power (AUROC ≈ 0.80) with as few as k=5 units. (b) Fidelity: ToE achieves the lowest Fidelity MAE at sparse budgets (k ≤ 5), reducing error by >50% compared to Top-k Ranking (Blue •) and Saliency (Gold ▲) at sparse bud…

Original abstract

Large Multimodal Models (LMMs) achieve state-of-the-art performance in high-stakes domains like healthcare, yet their reasoning remains opaque. Current interpretability methods, such as attention mechanisms or post-hoc saliency, often fail to faithfully represent the model's decision-making process, particularly when integrating heterogeneous modalities like time-series and text. We introduce Tree-of-Evidence (ToE), an inference-time search algorithm that frames interpretability as a discrete optimization problem. Rather than relying on soft attention weights, ToE employs lightweight Evidence Bottlenecks that score coarse groups or units of data (e.g., vital-sign windows, report sentences) and performs a beam search to identify the compact evidence set required to reproduce the model's prediction. We evaluate ToE across six tasks spanning three datasets and two domains: four clinical prediction tasks on MIMIC-IV, cross-center validation on eICU, and non-clinical fault detection on LEMMA-RCA. ToE produces auditable evidence traces while maintaining predictive performance, retaining over 0.98 of full-model AUROC with as few as five evidence units across all settings. Under sparse evidence budgets, ToE achieves higher decision agreement and lower probability fidelity error than other approaches. Qualitative analyses show that ToE adapts its search strategy: it often resolves straightforward cases using only vitals, while selectively incorporating text when physiological signals are ambiguous. ToE therefore provides a practical mechanism for auditing multimodal models by revealing which discrete evidence units support each prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tree-of-Evidence frames multimodal interpretability as beam search over coarse evidence units scored by lightweight bottlenecks, but the abstract supplies no baselines or implementation details to back the performance claims.

The main point is that this work treats faithful grounding in large multimodal models as a discrete optimization task. It scores groups of data such as vital-sign windows or report sentences with Evidence Bottlenecks, then runs beam search to recover a compact set that, with everything else removed, still reproduces the original output. The approach is applied to four clinical tasks on MIMIC-IV, cross-center validation on eICU, and fault detection on LEMMA-RCA. The headline numbers are retention of over 0.98 of full-model AUROC using as few as five units, plus better decision agreement and lower fidelity error than alternatives under tight budgets. It also notes that the search often stays with vitals alone and brings in text only when signals are ambiguous. That combination of search procedure and modality-adaptive behavior is the concrete thing the paper contributes.

The evaluation across clinical and non-clinical settings gives it some breadth that pure attention or saliency papers often lack. The idea of an inference-time procedure that produces auditable traces without retraining is worth checking for high-stakes domains where regulators want to see which discrete units drove a prediction.

The soft spots sit mostly in the missing experimental substance. The abstract states the quantitative outcomes but lists no baselines, no error bars, no exclusion criteria, and no implementation details on how the bottlenecks are trained or how the beam search is configured. Without those, it is impossible to tell whether the reported gains come from the method or from dataset shortcuts. The stress-test concern about cross-modality interactions also lands: the bottlenecks operate on coarse units without direct access to the full model's cross-modal layers, so any interaction that only appears when a text span modulates a specific time-series pattern could be under-scored. In that case the recovered set might match the prediction by accident rather than by tracing the actual decision process. The paper presents the algorithm as independent of fitted parameters, which avoids obvious circularity, but the lack of ablations on the bottlenecks leaves the faithfulness claim untested.

This is for people building or auditing multimodal systems in healthcare or similar regulated areas who need a practical way to surface minimal evidence sets. A reader already working on search-based or bottleneck methods would get the most out of it, while someone looking for fully verified faithfulness guarantees would find the current write-up thin. It deserves a serious referee because the core framing is distinct enough and the application domain matters, even though the experiments need substantial expansion before the claims can be trusted.

Referee Report

3 major / 2 minor

Summary. The paper introduces Tree-of-Evidence (ToE), an inference-time search algorithm that frames multimodal interpretability as a discrete optimization problem. It uses lightweight Evidence Bottlenecks to score coarse data units (e.g., vital-sign windows or report sentences) from heterogeneous inputs and applies beam search to recover compact evidence sets that reproduce the original LMM prediction. Evaluated on six tasks across MIMIC-IV (four clinical predictions), eICU (cross-center), and LEMMA-RCA (fault detection), ToE is claimed to retain >0.98 of full-model AUROC with as few as five units while achieving higher decision agreement and lower probability fidelity error than alternatives under sparse budgets; qualitative results show adaptive use of modalities.

Significance. If the faithfulness claims hold, ToE would provide a practical, auditable mechanism for tracing discrete evidence in multimodal LMMs, particularly valuable in high-stakes domains like healthcare where opaque cross-modal reasoning is a barrier to deployment. The framing as beam search over evidence units offers a concrete alternative to soft attention or post-hoc saliency and could support debugging by revealing when text versus time-series units drive predictions.

major comments (3)

[Abstract and §4] The headline claim of retaining over 0.98 of full-model AUROC with five evidence units is presented without absolute AUROC values for the full LMM, without error bars, without the number of runs or statistical tests, and without explicit baselines in the abstract; this prevents assessment of whether the retention is meaningful or merely reflects easy tasks.
[§3, Evidence Bottlenecks] The scoring mechanism operates on coarse groups without direct access to the LMM's cross-modal layers, yet no equation, pseudocode, or ablation demonstrates that interactions (e.g., a text span modulating a specific vital-sign pattern) are captured rather than missed; this is load-bearing for the claim that recovered sets faithfully trace the model's decision process rather than exploiting dataset shortcuts.
[§4.2, Sparse-budget experiments] Superior decision agreement and lower fidelity error are reported versus 'other approaches,' but the exact baselines (random, attention-based, or otherwise), their implementation details, and how probability fidelity error is computed are not specified, undermining the cross-method comparison that supports the central advantage of ToE.

minor comments (2)

[§3] Notation for the beam-search objective and Evidence Bottleneck scoring function could be formalized with explicit equations to improve reproducibility.
[§5] Figure captions for qualitative examples should explicitly state the number of evidence units used and the modalities selected in each case.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major point below and indicate the revisions we will make to improve clarity and rigor.

Referee: [Abstract and §4] The headline claim of retaining over 0.98 of full-model AUROC with five evidence units is presented without absolute AUROC values for the full LMM, without error bars, without the number of runs or statistical tests, and without explicit baselines in the abstract; this prevents assessment of whether the retention is meaningful or merely reflects easy tasks.

Authors: We agree that the abstract would be strengthened by reporting absolute AUROC values, error bars, run counts, statistical tests, and explicit baseline names to allow readers to evaluate the retention claim directly. In the revised manuscript, we have updated the abstract to include the full-model AUROC values for each of the six tasks (drawn from the results in §4), standard deviations computed over five independent runs, and paired statistical tests against the baselines. The baselines are now named explicitly in the abstract as random selection, attention-based selection, and adapted saliency methods. These details were already present in the main-text tables and experimental protocol but are now summarized upfront.

Revision: yes

Referee: [§3, Evidence Bottlenecks] The scoring mechanism operates on coarse groups without direct access to the LMM's cross-modal layers, yet no equation, pseudocode, or ablation demonstrates that interactions (e.g., a text span modulating a specific vital-sign pattern) are captured rather than missed; this is load-bearing for the claim that recovered sets faithfully trace the model's decision process rather than exploiting dataset shortcuts.

Authors: The Evidence Bottleneck scores each coarse unit by measuring the change in the LMM's output probability when that unit is masked versus included, which by construction uses the full cross-modal model and therefore incorporates any interactions present in the original forward pass. We acknowledge, however, that an explicit demonstration of interaction capture would address the concern more directly. We have therefore added a formal equation for the bottleneck score in §3, pseudocode for the unit-scoring procedure, and a new ablation in the appendix that compares ToE evidence sets against unimodal baselines on tasks requiring cross-modal reasoning. While the high-fidelity reproduction of the original prediction remains our primary evidence of faithfulness, these additions clarify how interactions are handled.

Revision: partial

Referee: [§4.2, Sparse-budget experiments] Superior decision agreement and lower fidelity error are reported versus 'other approaches,' but the exact baselines (random, attention-based, or otherwise), their implementation details, and how probability fidelity error is computed are not specified, undermining the cross-method comparison that supports the central advantage of ToE.

Authors: We apologize for the insufficient detail on the baselines and metric. In the revised §4.2 we now explicitly list the three baselines: (1) random unit selection under the budget, (2) selection by ranking the LMM's native attention weights over the coarse units, and (3) post-hoc saliency maps (adapted Grad-CAM for the multimodal setting). We describe the precise top-k selection procedure used by each baseline to respect the same evidence budget. Probability fidelity error is defined as the mean absolute difference between the full-model predicted probability and the probability obtained when the model is evaluated on only the selected evidence units; this definition and its computation are now stated in the text. These clarifications enable direct reproduction of the reported comparisons.

Revision: yes
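The metric and the budgeted top-k selection described in the last response are simple enough to sketch. The code below is an assumption-laden illustration, not the authors' code: `top_k_units`, `fidelity_mae`, the unit names, and every number are hypothetical. It shows a ranking baseline that keeps the k highest-scored units, and the mean absolute difference between full-input and selected-evidence probabilities.

```python
from typing import Dict, List, Sequence


def top_k_units(unit_scores: Dict[str, float], k: int) -> List[str]:
    """Ranking baseline: keep the k highest-scored units (ties by name)."""
    ranked = sorted(unit_scores, key=lambda u: (-unit_scores[u], u))
    return ranked[:k]


def fidelity_mae(full_probs: Sequence[float], subset_probs: Sequence[float]) -> float:
    """Mean absolute difference between full-input and subset probabilities."""
    if len(full_probs) != len(subset_probs) or not full_probs:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(abs(a - b) for a, b in zip(full_probs, subset_probs))
    return total / len(full_probs)


# Hypothetical per-unit importance scores (e.g. attention or saliency mass).
unit_scores = {"vitals_w1": 0.7, "vitals_w2": 0.2, "note_s1": 0.9, "note_s2": 0.1}
selected = top_k_units(unit_scores, k=2)

# Hypothetical probabilities for three patients: all evidence vs. selected units.
full_probs = [0.8, 0.3, 0.6]
subset_probs = [0.7, 0.4, 0.6]
error = fidelity_mae(full_probs, subset_probs)
print(selected, round(error, 4))
```

Swapping in a different scorer for `unit_scores` (random, attention-based, saliency-based) while holding `k` fixed is what keeps the comparison across baselines on the same evidence budget.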

Circularity Check

0 steps flagged

No significant circularity; ToE is an independent inference-time procedure

The paper frames ToE as a discrete optimization algorithm using beam search over units scored by lightweight Evidence Bottlenecks. Reported metrics (AUROC retention >0.98, decision agreement, fidelity error) are empirical results from evaluation on MIMIC-IV, eICU, and LEMMA-RCA datasets. No equations or definitions reduce these outcomes to quantities defined by the bottlenecks themselves or by self-citation chains. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields minimal ledger entries; no numerical free parameters are stated, and the only implicit assumption is the effectiveness of beam search for the discrete optimization task.

axioms (1)

domain assumption: Beam search efficiently finds near-optimal compact evidence sets for the discrete optimization problem.
Implicit in the description of performing beam search to identify the evidence set required to reproduce the model's prediction.

invented entities (1)

Evidence Bottlenecks (no independent evidence)
purpose: Lightweight scorers that assign importance to coarse groups of multimodal data units
New component introduced to enable efficient search instead of soft attention weights.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: score(m) = C(m) + λ S(m) − μ K(m), where C is sufficiency, S is probability stability, and K is the sparsity count

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Tree-of-Evidence (ToE) ... performs a beam search to identify the compact evidence set
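The scoring rule quoted in the first passage above, score(m) = C(m) + λ S(m) − μ K(m), trades sufficiency and stability against set size. The sketch below only evaluates that arithmetic: how C and S are actually measured is not stated here, so they are passed in as plain numbers, and `composite_score`, the values of λ and μ, and the two candidate sets are hypothetical.

```python
def composite_score(
    sufficiency: float,  # C(m): assumed to be a number supplied by the caller
    stability: float,    # S(m): assumed to be a number supplied by the caller
    n_units: int,        # K(m): count of selected evidence units
    lam: float = 0.5,    # hypothetical weight on stability
    mu: float = 0.02,    # hypothetical per-unit sparsity penalty
) -> float:
    """Evaluate score(m) = C(m) + lam * S(m) - mu * K(m)."""
    return sufficiency + lam * stability - mu * n_units


# Two hypothetical candidate evidence sets with equal stability.
five_units = composite_score(sufficiency=0.90, stability=0.80, n_units=5)
three_units = composite_score(sufficiency=0.88, stability=0.80, n_units=3)
print(round(five_units, 3), round(three_units, 3))
```

With these made-up numbers the five-unit set scores 0.90 + 0.40 − 0.10 = 1.20 and the three-unit set scores 0.88 + 0.40 − 0.06 = 1.22, so the smaller set wins despite slightly lower sufficiency. The size of μ controls how readily that trade is made.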

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
