OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning
Pith reviewed 2026-05-18 21:20 UTC · model grok-4.3
The pith
A benchmark with balanced reasoning paths across text, vision, and speech reveals large performance gaps and path sensitivity in current models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OMHBench consists of 6,144 questions with balanced reasoning paths that are jointly grounded across text, vision, and speech. Evaluation of 13 state-of-the-art models shows a large performance gap between proprietary and open-source models and high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding, with models struggling especially on the speech modality.
What carries the argument
The balanced reasoning paths in OMHBench that require joint grounding across text, vision, and speech without allowing modality shortcuts.
If this is right
- A large performance gap exists between proprietary and open-source models on these tasks.
- Even proprietary models remain highly sensitive to changes in reasoning path order or emphasis.
- This sensitivity produces asymmetric grounding, so models do not draw on modalities evenly.
- Current models struggle particularly when the speech modality must be used.
- Existing evaluation frameworks with biased paths fail to measure true omni-modal capability.
Where Pith is reading between the lines
- Training could include more examples that require the same answer regardless of reasoning path order.
- The benchmark could guide targeted fixes for weak modality integration in future models.
- Similar balance requirements might strengthen tests in other mixed-input domains such as video with narration.
Load-bearing premise
The 6,144 questions achieve truly balanced reasoning paths that are jointly grounded across the three modalities without introducing shortcuts or new biases.
What would settle it
Demonstrating that models achieve consistent performance across different reasoning path variations on the benchmark questions.
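One way to make this test concrete is to score a model separately on each reasoning path and look at the spread. The sketch below assumes three modalities, hence 3! = 6 orderings, and a hypothetical list of (path, correct) records. The function name `path_sensitivity` and the record layout are illustrative and do not come from the paper, whose own path metric may be defined differently.

```python
from itertools import permutations
from statistics import mean, pstdev

MODALITIES = ("text", "vision", "speech")

def path_sensitivity(records):
    """Return per-path accuracy, its macro-average, and its std dev across paths.

    `records` is a hypothetical iterable of (path, correct) pairs, where
    `path` is a tuple ordering the three modalities and `correct` is a bool.
    """
    paths = list(permutations(MODALITIES))  # 3! = 6 orderings
    hits = {p: [] for p in paths}
    for path, correct in records:
        hits[tuple(path)].append(1.0 if correct else 0.0)
    # Accuracy per path; paths with no records are skipped, not counted as zero.
    per_path = {p: mean(v) for p, v in hits.items() if v}
    accs = list(per_path.values())
    # Population std dev: the observed paths are treated as the full set.
    return per_path, mean(accs), pstdev(accs)
```

A standard deviation near zero alongside a high macro-average would be the consistent behavior described above; a large spread would point to path sensitivity.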
Original abstract
Multimodal Large Language Models (MLLMs) have increasingly supported omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs and (2) even proprietary models exhibit high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OMHBench, a benchmark consisting of 6,144 questions designed to evaluate omni-modal multi-hop reasoning in MLLMs across text, vision, and speech with balanced and jointly grounded reasoning paths. It reports results from evaluating 13 state-of-the-art models, claiming a large performance gap between proprietary and open-source MLLMs, high sensitivity to reasoning path variations even among proprietary models that produces asymmetric omni-modal grounding, and particular difficulties when processing the speech modality.
Significance. If the benchmark's questions are verifiably balanced and free of modality shortcuts or new biases, the work would offer a useful advance in rigorous evaluation of omni-modal capabilities, helping to diagnose current model limitations in multi-hop cross-modal reasoning and motivating improvements in speech handling and path robustness.
major comments (2)
- [Abstract] Abstract: the assertion that the 6,144 questions have 'balanced reasoning paths that are jointly grounded across all three modalities' without modality shortcuts is presented without any supporting metrics (e.g., per-modality contribution frequencies, path-length distributions, or ablation results when individual modalities are masked), which is load-bearing for the central claims about performance gaps and asymmetric grounding.
- [Evaluation] Evaluation section: the reported high sensitivity to reasoning path variations and asymmetric omni-modal grounding in proprietary models rests on the assumption that the benchmark paths are balanced and jointly grounded; without explicit construction details or validation that no modality shortcuts were introduced, these findings risk being artifacts of the dataset design rather than robust evidence of omni-modal deficits.
minor comments (2)
- [Dataset Construction] Consider adding a dedicated subsection or appendix with quantitative statistics on reasoning path balance and modality grounding to strengthen the benchmark description.
- [Experiments] Ensure all model names, versions, and prompting details are listed consistently in the experimental setup for reproducibility.
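The balance statistics requested in the first minor comment could start from a count of questions per reasoning path and the largest relative deviation from an even split. This is a minimal sketch under assumptions: each question carries a hypothetical path label, and `n_paths=6` reflects six orderings of three modalities. An even split of 6,144 questions over six paths would put 1,024 in each, but whether the dataset is divided that way is not stated in this review.

```python
from collections import Counter

def path_balance(path_labels, n_paths=6):
    """Count questions per path and report the max relative deviation from uniform.

    `path_labels` is a hypothetical list with one path label per question,
    e.g. "text>vision>speech". `n_paths` is the number of paths expected.
    """
    counts = Counter(path_labels)
    expected = len(path_labels) / n_paths  # uniform share per path
    devs = [abs(c - expected) / expected for c in counts.values()]
    # A path with no questions at all deviates from its share by 100%.
    if len(counts) < n_paths:
        devs.append(1.0)
    return dict(counts), max(devs)
```

A maximum deviation of 0.0 means every expected path holds exactly the same number of questions.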
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on OMHBench. The feedback highlights important points about the need for more explicit quantitative support for the benchmark's balance and grounding properties. We address each major comment below and indicate the revisions we will incorporate.
- Referee: [Abstract] Abstract: the assertion that the 6,144 questions have 'balanced reasoning paths that are jointly grounded across all three modalities' without modality shortcuts is presented without any supporting metrics (e.g., per-modality contribution frequencies, path-length distributions, or ablation results when individual modalities are masked), which is load-bearing for the central claims about performance gaps and asymmetric grounding.
  Authors: The abstract is intentionally concise. Section 3 of the manuscript describes the benchmark construction pipeline, including the multi-stage process used to generate questions with balanced modality contributions and to verify joint grounding without introducing shortcuts (e.g., through iterative human verification and cross-modal consistency checks). To directly address the concern and make this evidence more accessible, we will revise the abstract to reference key supporting statistics and add a new table in the main text reporting per-modality contribution frequencies, path-length distributions, and results from modality-masking ablations. These additions will provide the requested empirical validation for the performance gap and asymmetric grounding observations.
  Revision: yes
- Referee: [Evaluation] Evaluation section: the reported high sensitivity to reasoning path variations and asymmetric omni-modal grounding in proprietary models rests on the assumption that the benchmark paths are balanced and jointly grounded; without explicit construction details or validation that no modality shortcuts were introduced, these findings risk being artifacts of the dataset design rather than robust evidence of omni-modal deficits.
  Authors: The evaluation in Section 4 controls for path variations by generating alternative orderings and modality emphases over the same underlying facts, with grounding verified during dataset creation. We acknowledge that additional quantitative validation would strengthen the interpretation. In the revision we will expand the Evaluation section with further construction details and include new analyses (such as information contribution metrics per modality and shortcut detection ablations) to demonstrate that the observed sensitivities are not artifacts of the design. This will better isolate model limitations in omni-modal reasoning.
  Revision: partial
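A modality-masking ablation of the kind promised in these responses could be summarized as a shortcut rate: the fraction of questions a model still answers correctly when one modality is withheld. The sketch below is an assumption about how such results might be tabulated, with hypothetical field names `masked` and `correct`; it is not the authors' procedure.

```python
def shortcut_rates(results):
    """Fraction of questions still answered correctly with a modality masked.

    `results` is a hypothetical list of dicts such as
    {"masked": "speech", "correct": True}, one per (question, masked modality).
    """
    totals, hits = {}, {}
    for r in results:
        m = r["masked"]
        totals[m] = totals.get(m, 0) + 1
        hits[m] = hits.get(m, 0) + int(r["correct"])
    # A high rate for a modality suggests questions answerable without it.
    return {m: hits[m] / totals[m] for m in totals}
```

If the questions have a fixed answer set, each rate is better read against the chance level than against zero.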
Circularity Check
No circularity: benchmark proposal with independent dataset construction and external evaluations
full rationale
The paper introduces OMHBench as a new dataset of 6,144 questions for omni-modal multi-hop reasoning and reports empirical results from evaluating 13 MLLMs. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. Claims about performance gaps and sensitivity rest directly on the new benchmark construction and model runs rather than reducing to prior self-citations, ansatzes, or inputs by construction. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or rename known results in a circular fashion.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The constructed questions provide balanced reasoning paths that are jointly grounded across text, vision, and speech without modality shortcuts.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: CMR-SPB ... balanced reasoning paths that are jointly grounded across all three modalities ... six reasoning paths
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat induction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Path Balance Score (PBS) ... macro-average and standard deviation over N! reasoning paths
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.