Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding
Pith reviewed 2026-05-10 18:04 UTC · model grok-4.3
The pith
An inference-time beam search finds small sets of evidence units that faithfully reproduce multimodal model predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Framing the search for faithful evidence as a discrete optimization problem, solved with lightweight Evidence Bottlenecks and beam search, identifies compact evidence sets that reproduce the model's predictions with high fidelity. The result is an auditable trace for each prediction across multiple clinical and non-clinical multimodal tasks.
What carries the argument
The Tree-of-Evidence algorithm, which uses Evidence Bottlenecks to evaluate coarse data groups and beam search to optimize the selection of evidence units needed to match the full model's output.
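To make the mechanism concrete, here is a minimal sketch of beam search over evidence units. It assumes a black-box `predict` function that returns a probability for any subset of units, and it takes the objective to be the absolute gap to the full-input probability. The names `beam_search_evidence`, `toy_predict`, and `WEIGHTS`, the unit labels, and the objective itself are illustrative assumptions, not the paper's implementation. In particular, the sketch queries the toy model directly instead of going through Evidence Bottlenecks.

```python
import math
from typing import Callable, FrozenSet, Sequence, Tuple


def beam_search_evidence(
    units: Sequence[str],
    predict: Callable[[FrozenSet[str]], float],
    budget: int = 5,
    beam_width: int = 3,
) -> Tuple[FrozenSet[str], float]:
    """Return (evidence set, gap) with gap = |p(evidence) - p(all units)|."""
    target = predict(frozenset(units))

    def gap(subset: FrozenSet[str]) -> float:
        return abs(predict(subset) - target)

    empty: FrozenSet[str] = frozenset()
    beam = [empty]
    best, best_gap = empty, gap(empty)
    for _ in range(budget):
        # Expand every set in the beam by one unit it does not yet contain.
        candidates = {s | {u} for s in beam for u in units if u not in s}
        if not candidates:
            break
        # Rank by gap; break ties deterministically by unit names.
        ranked = sorted(candidates, key=lambda s: (gap(s), sorted(s)))
        beam = ranked[:beam_width]
        # Strict improvement only, so smaller sets win ties.
        if gap(beam[0]) < best_gap:
            best, best_gap = beam[0], gap(beam[0])
    return best, best_gap


# Hypothetical toy model: a logistic score over whichever units are present.
WEIGHTS = {
    "hr_window_3": 2.0,
    "bp_window_1": 1.5,
    "note_sent_2": 0.1,
    "spo2_window_4": -0.05,
    "note_sent_7": 0.0,
}


def toy_predict(subset: FrozenSet[str]) -> float:
    z = -1.0 + sum(WEIGHTS[u] for u in subset)
    return 1.0 / (1.0 + math.exp(-z))


evidence, residual = beam_search_evidence(
    list(WEIGHTS), toy_predict, budget=2, beam_width=2
)
print(sorted(evidence), round(residual, 4))
```

With a budget of two, the toy search selects the two heavily weighted vital-sign windows, which is the kind of compact set the algorithm is described as recovering.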
If this is right
- Auditable evidence traces are produced for each prediction without much loss in performance.
- Over 0.98 of the full-model AUROC is retained using as few as five evidence units in all tested settings.
- Under sparse evidence budgets, decision agreement is higher and probability fidelity error is lower than with other methods.
- The search strategy adapts by relying mainly on vital signs for clear cases and adding text when signals are ambiguous.
- The approach works across four clinical tasks on MIMIC-IV, cross-center on eICU, and fault detection on LEMMA-RCA.
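The retention figure in the list above can be read as a ratio: AUROC computed from predictions made on the selected evidence only, divided by AUROC computed from the full input. A minimal sketch of that reading follows. The `auroc` helper uses the rank-based (Mann-Whitney) formulation, `auroc_retention` is a hypothetical name, and the labels and scores are made-up numbers for illustration, not results from the paper.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_retention(
    labels: Sequence[int],
    full_scores: Sequence[float],
    evidence_scores: Sequence[float],
) -> float:
    """Evidence-only AUROC as a fraction of full-input AUROC."""
    return auroc(labels, evidence_scores) / auroc(labels, full_scores)


# Made-up example: the evidence-only model misranks one positive/negative pair.
labels = [1, 1, 1, 0, 0, 0]
full_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
evidence_scores = [0.9, 0.7, 0.35, 0.4, 0.3, 0.1]
print(auroc_retention(labels, full_scores, evidence_scores))
```

A ratio like this says nothing about the absolute AUROC of the full model, which is why the referee report asks for both.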
Where Pith is reading between the lines
- This discrete approach could help in settings where continuous attention weights are hard to audit, such as regulatory reviews of AI in medicine.
- It raises the possibility of using similar search methods to debug model failures by examining which evidence units are selected or omitted.
- Testing on additional modalities like images could show if the bottleneck scoring generalizes beyond time series and text.
- If the evidence sets align with human expert judgment, the method might improve trust in high-stakes decisions.
Load-bearing premise
The lightweight Evidence Bottlenecks accurately score groups of data without overlooking key interactions across different data types like time series and text.
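A toy case shows why this premise matters. Assume a model whose output rises only when a vital-sign unit and a text unit are both present. Scoring each unit in isolation gives both a score of zero, while scoring each unit by removing it from the full input exposes the dependence. Everything here, including `toy_model`, the two scoring functions, and the unit names, is a hypothetical illustration and not the paper's bottleneck.

```python
PAIR = frozenset({"lactate_window", "note_sepsis_sentence"})


def toy_model(units) -> float:
    """Hypothetical model: risk is high only when both units co-occur."""
    return 0.9 if PAIR <= frozenset(units) else 0.1


def isolated_score(unit: str) -> float:
    """Effect of adding one unit to an otherwise empty input."""
    return toy_model({unit}) - toy_model(set())


def leave_one_out_score(unit: str, full=PAIR) -> float:
    """Effect of removing one unit from the full input."""
    return toy_model(full) - toy_model(full - {unit})


for unit in sorted(PAIR):
    print(unit, isolated_score(unit), round(leave_one_out_score(unit), 2))
```

In this toy case, the context in which a unit is scored determines whether a cross-modal interaction is visible at all.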
What would settle it
The claim would be falsified by new examples on which the five selected evidence units yield a prediction probability or decision that differs from the full model's.
Original abstract
Large Multimodal Models (LMMs) achieve state-of-the-art performance in high-stakes domains like healthcare, yet their reasoning remains opaque. Current interpretability methods, such as attention mechanisms or post-hoc saliency, often fail to faithfully represent the model's decision-making process, particularly when integrating heterogeneous modalities like time-series and text. We introduce Tree-of-Evidence (ToE), an inference-time search algorithm that frames interpretability as a discrete optimization problem. Rather than relying on soft attention weights, ToE employs lightweight Evidence Bottlenecks that score coarse groups or units of data (e.g., vital-sign windows, report sentences) and performs a beam search to identify the compact evidence set required to reproduce the model's prediction. We evaluate ToE across six tasks spanning three datasets and two domains: four clinical prediction tasks on MIMIC-IV, cross-center validation on eICU, and non-clinical fault detection on LEMMA-RCA. ToE produces auditable evidence traces while maintaining predictive performance, retaining over 0.98 of full-model AUROC with as few as five evidence units across all settings. Under sparse evidence budgets, ToE achieves higher decision agreement and lower probability fidelity error than other approaches. Qualitative analyses show that ToE adapts its search strategy: it often resolves straightforward cases using only vitals, while selectively incorporating text when physiological signals are ambiguous. ToE therefore provides a practical mechanism for auditing multimodal models by revealing which discrete evidence units support each prediction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Tree-of-Evidence (ToE), an inference-time search algorithm that frames multimodal interpretability as a discrete optimization problem. It uses lightweight Evidence Bottlenecks to score coarse data units (e.g., vital-sign windows or report sentences) from heterogeneous inputs and applies beam search to recover compact evidence sets that reproduce the original LMM prediction. Evaluated on six tasks across MIMIC-IV (four clinical predictions), eICU (cross-center), and LEMMA-RCA (fault detection), ToE is claimed to retain >0.98 of full-model AUROC with as few as five units while achieving higher decision agreement and lower probability fidelity error than alternatives under sparse budgets; qualitative results show adaptive use of modalities.
Significance. If the faithfulness claims hold, ToE would provide a practical, auditable mechanism for tracing discrete evidence in multimodal LMMs, particularly valuable in high-stakes domains like healthcare where opaque cross-modal reasoning is a barrier to deployment. The framing as beam search over evidence units offers a concrete alternative to soft attention or post-hoc saliency and could support debugging by revealing when text versus time-series units drive predictions.
major comments (3)
- [Abstract and §4] The headline claim of retaining over 0.98 of full-model AUROC with five evidence units is presented without absolute AUROC values for the full LMM, without error bars, without the number of runs or statistical tests, and without explicit baselines in the abstract; this prevents assessment of whether the retention is meaningful or merely reflects easy tasks.
- [§3, Evidence Bottlenecks] The scoring mechanism operates on coarse groups without direct access to the LMM's cross-modal layers, yet no equation, pseudocode, or ablation demonstrates that interactions (e.g., a text span modulating a specific vital-sign pattern) are captured rather than missed; this is load-bearing for the claim that recovered sets faithfully trace the model's decision process rather than exploiting dataset shortcuts.
- [§4.2, Sparse-budget experiments] Superior decision agreement and lower fidelity error are reported versus 'other approaches,' but the exact baselines (random, attention-based, or otherwise), their implementation details, and how probability fidelity error is computed are not specified, undermining the cross-method comparison that supports the central advantage of ToE.
minor comments (2)
- [§3] Notation for the beam-search objective and Evidence Bottleneck scoring function could be formalized with explicit equations to improve reproducibility.
- [§5] Figure captions for qualitative examples should explicitly state the number of evidence units used and the modalities selected in each case.
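One way the formalization requested in the first minor comment might look is sketched below. It is an assumption written for illustration, not an equation from the paper. Let $f$ be the multimodal model, $x$ the full input partitioned into a set of units $U$, $x_E$ the input restricted to an evidence set $E \subseteq U$, and $k$ the evidence budget.

```latex
E^{*} = \arg\min_{E \subseteq U,\; |E| \le k} \bigl| f(x_E) - f(x) \bigr|
```

Under this reading, beam search approximates $E^{*}$ by growing candidate sets one unit at a time and keeping only the best few candidates at each depth.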
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We address each major point below and indicate the revisions we will make to improve clarity and rigor.
- Referee: [Abstract and §4] The headline claim of retaining over 0.98 of full-model AUROC with five evidence units is presented without absolute AUROC values for the full LMM, without error bars, without the number of runs or statistical tests, and without explicit baselines in the abstract; this prevents assessment of whether the retention is meaningful or merely reflects easy tasks.
  Authors: We agree that the abstract would be strengthened by reporting absolute AUROC values, error bars, run counts, statistical tests, and explicit baseline names to allow readers to evaluate the retention claim directly. In the revised manuscript, we have updated the abstract to include the full-model AUROC values for each of the six tasks (drawn from the results in §4), standard deviations computed over five independent runs, and paired statistical tests against the baselines. The baselines are now named explicitly in the abstract as random selection, attention-based selection, and adapted saliency methods. These details were already present in the main-text tables and experimental protocol but are now summarized upfront.
  Revision: yes
- Referee: [§3, Evidence Bottlenecks] The scoring mechanism operates on coarse groups without direct access to the LMM's cross-modal layers, yet no equation, pseudocode, or ablation demonstrates that interactions (e.g., a text span modulating a specific vital-sign pattern) are captured rather than missed; this is load-bearing for the claim that recovered sets faithfully trace the model's decision process rather than exploiting dataset shortcuts.
  Authors: The Evidence Bottleneck scores each coarse unit by measuring the change in the LMM's output probability when that unit is masked versus included, which by construction uses the full cross-modal model and therefore incorporates any interactions present in the original forward pass. We acknowledge, however, that an explicit demonstration of interaction capture would address the concern more directly. We have therefore added a formal equation for the bottleneck score in §3, pseudocode for the unit-scoring procedure, and a new ablation in the appendix that compares ToE evidence sets against unimodal baselines on tasks requiring cross-modal reasoning. While the high-fidelity reproduction of the original prediction remains our primary evidence of faithfulness, these additions clarify how interactions are handled.
  Revision: partial
- Referee: [§4.2, Sparse-budget experiments] Superior decision agreement and lower fidelity error are reported versus 'other approaches,' but the exact baselines (random, attention-based, or otherwise), their implementation details, and how probability fidelity error is computed are not specified, undermining the cross-method comparison that supports the central advantage of ToE.
  Authors: We apologize for the insufficient detail on the baselines and metric. In the revised §4.2 we now explicitly list the three baselines: (1) random unit selection under the budget, (2) selection by ranking the LMM's native attention weights over the coarse units, and (3) post-hoc saliency maps (adapted Grad-CAM for the multimodal setting). We describe the precise top-k selection procedure used by each baseline to respect the same evidence budget. Probability fidelity error is defined as the mean absolute difference between the full-model predicted probability and the probability obtained when the model is evaluated on only the selected evidence units; this definition and its computation are now stated in the text. These clarifications enable direct reproduction of the reported comparisons.
  Revision: yes
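The two metrics discussed in the last response can be sketched as follows. Probability fidelity error uses the definition quoted there: the mean absolute difference between full-input and evidence-only probabilities. Decision agreement is assumed here to be the fraction of examples whose thresholded decisions match. The 0.5 threshold, the function names, and the sample numbers are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def probability_fidelity_error(
    full_probs: Sequence[float], evidence_probs: Sequence[float]
) -> float:
    """Mean absolute difference between full and evidence-only probabilities."""
    if not full_probs or len(full_probs) != len(evidence_probs):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(f - e) for f, e in zip(full_probs, evidence_probs)]
    return sum(diffs) / len(diffs)


def decision_agreement(
    full_probs: Sequence[float],
    evidence_probs: Sequence[float],
    threshold: float = 0.5,  # assumed decision threshold
) -> float:
    """Fraction of examples where both inputs give the same decision."""
    if not full_probs or len(full_probs) != len(evidence_probs):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(
        (f >= threshold) == (e >= threshold)
        for f, e in zip(full_probs, evidence_probs)
    )
    return matches / len(full_probs)


# Made-up probabilities: the third example flips its decision.
full_probs = [0.92, 0.15, 0.55, 0.40]
evidence_probs = [0.90, 0.20, 0.45, 0.38]
print(probability_fidelity_error(full_probs, evidence_probs))
print(decision_agreement(full_probs, evidence_probs))
```

The third example shows why both metrics are reported: a probability shift of only 0.10 is enough to flip a decision that sits near the threshold.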
Circularity Check
No significant circularity; ToE is an independent inference-time procedure
The paper frames ToE as a discrete optimization algorithm using beam search over units scored by lightweight Evidence Bottlenecks. Reported metrics (AUROC retention >0.98, decision agreement, fidelity error) are empirical results from evaluation on MIMIC-IV, eICU, and LEMMA-RCA datasets. No equations or definitions reduce these outcomes to quantities defined by the bottlenecks themselves or by self-citation chains. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: beam search efficiently finds near-optimal compact evidence sets for the discrete optimization problem.
invented entities (1)
- Evidence Bottlenecks: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Paper passage: score(m) = C(m) + λ S(m) − μ K(m), where C is sufficiency, S is probability stability, and K is sparsity count
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat (tag: unclear)
  Paper passage: Tree-of-Evidence (ToE) ... performs a beam search to identify the compact evidence set
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.