Unified Multimodal Uncertain Inference

2026 · cs.CV · arXiv 2604.08701

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.

representative citing papers

TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

TRACE builds structured text timelines from videos via OCR and detection, then applies text-only LLM evidence localization before LVLM claim generation, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR.

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.
