Lvomnibench: Pioneering long audio-video understanding evaluation for omnimodal llms.arXiv preprint arXiv:2603.19217

Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, et al · 2026 · arXiv 2603.19217

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

cs.CV · 2026-05-19 · unverdicted · novelty 5.0

SEATS adaptively selects and removes non-text tokens before and inside the LLM layers of omni-modal models, yielding 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while keeping 96.3% performance.

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

cs.CR · 2026-05-09

citing papers explorer

Showing 5 of 5 citing papers.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 21
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 19
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 51
LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
Stage-adaptive Token Selection for Efficient Omni-modal LLMs cs.CV · 2026-05-19 · unverdicted · none · ref 27
SEATS adaptively selects and removes non-text tokens before and inside the LLM layers of omni-modal models, yielding 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while keeping 96.3% performance.
PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks cs.CR · 2026-05-09 · unreviewed · ref 39

Lvomnibench: Pioneering long audio-video understanding evaluation for omnimodal llms.arXiv preprint arXiv:2603.19217

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer