RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
WhisperX: Time-accurate speech transcription of long-form audio
10 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (10)
representative citing papers
CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence than prior methods.
SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.
Multimodal LLM analysis correlates better with TRUST-Pathos than acoustic SER models in a case study of one Bundestag speech, while acoustic features help with arousal.
WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.
MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical dictation.
Large-print editions of layout-based documents outperform gesture-based magnification by 18% in reading speed and 30% in target location speed while restoring natural reading strategies and reducing workload.
citing papers explorer
- Quantifying the Cost of Manual Navigation: A Comparison of Gesture-Based Magnification versus Direct Access Reading in Digital Layout-based Documents