WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman · 2023 · arXiv 2303.00747

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · method 1

citation-polarity summary

background 1 · use method 1

representative citing papers

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

cs.CL · 2026-04-25 · unverdicted · novelty 7.0

Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.

A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence than prior methods.

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Multimodal LLM analysis correlates better with TRUST-Pathos than acoustic SER models in a case study of one Bundestag speech, while acoustic features help with arousal.

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.

Scaling Properties of Continuous Diffusion Spoken Language Models

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

cs.SD · 2026-04-08 · unverdicted · novelty 5.0

AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

MedASR: An Open-Source Model for High-Accuracy Medical Dictation

eess.AS · 2026-05-15 · unverdicted · novelty 4.0

MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical dictation.

Quantifying the Cost of Manual Navigation: A Comparison of Gesture-Based Magnification versus Direct Access Reading in Digital Layout-based Documents

cs.HC · 2026-04-29 · unverdicted · novelty 4.0

Large-print editions of layout-based documents outperform gesture-based magnification by 18% in reading speed and 30% in target location speed while restoring natural reading strategies and reducing workload.

citing papers explorer

Showing 10 of 10 citing papers.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
cs.CV · 2026-04-17 · unverdicted · none · ref 4
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
cs.CL · 2026-04-25 · unverdicted · none · ref 17
Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.

A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation
cs.CV · 2026-04-12 · unverdicted · none · ref 8
CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence than prior methods.

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
cs.CV · 2026-05-20 · unverdicted · none · ref 3
SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
cs.AI · 2026-05-21 · unverdicted · none · ref 11
Multimodal LLM analysis correlates better with TRUST-Pathos than acoustic SER models in a case study of one Bundestag speech, while acoustic features help with arousal.

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition
cs.CL · 2026-04-28 · unverdicted · none · ref 26
WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.

Scaling Properties of Continuous Diffusion Spoken Language Models
cs.CL · 2026-04-27 · unverdicted · none · ref 54
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
cs.SD · 2026-04-08 · unverdicted · none · ref 1
AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

MedASR: An Open-Source Model for High-Accuracy Medical Dictation
eess.AS · 2026-05-15 · unverdicted · none · ref 18
MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical dictation.

Quantifying the Cost of Manual Navigation: A Comparison of Gesture-Based Magnification versus Direct Access Reading in Digital Layout-based Documents
cs.HC · 2026-04-29 · unverdicted · none · ref 8
Large-print editions of layout-based documents outperform gesture-based magnification by 18% in reading speed and 30% in target location speed while restoring natural reading strategies and reducing workload.
