
arxiv: 2601.06199 · v3 · pith:AWEE722Bnew · submitted 2026-01-08 · 📡 eess.AS · cs.AI · cs.SD

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

classification 📡 eess.AS cs.AI cs.SD
keywords fastslm, long-form, temporal, acoustic, compression, extreme, hierarchical, models

Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second (a 97% reduction) without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.
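
To make the compression figures concrete, the arithmetic can be sketched as below. The abstract gives only the compressed rate (1.67 tokens per second) and the reduction (97%); it does not state the uncompressed rate. The 50 tokens-per-second baseline used here is an assumption chosen for illustration, as are the function names and the 30-minute clip length.

```python
# Illustrative arithmetic only. The baseline rate is an ASSUMPTION;
# the abstract states just the compressed rate and the 97% reduction.

COMPRESSED_RATE = 1.67        # tokens per second, from the abstract
STATED_REDUCTION = 0.97       # fraction of tokens removed, from the abstract
ASSUMED_BASELINE_RATE = 50.0  # tokens per second, hypothetical


def token_count(duration_s: float, tokens_per_second: float) -> float:
    """Number of tokens produced for a clip of the given duration."""
    return duration_s * tokens_per_second


def reduction(baseline_rate: float, compressed_rate: float) -> float:
    """Fraction of tokens removed when going from baseline to compressed."""
    return 1.0 - compressed_rate / baseline_rate


# Baseline rate implied by the two stated figures alone.
implied_baseline_rate = COMPRESSED_RATE / (1.0 - STATED_REDUCTION)

if __name__ == "__main__":
    clip_s = 30 * 60  # a hypothetical 30-minute recording
    print(f"implied baseline: {implied_baseline_rate:.1f} tokens/s")
    print(f"reduction vs assumed baseline: "
          f"{reduction(ASSUMED_BASELINE_RATE, COMPRESSED_RATE):.1%}")
    print(f"tokens at assumed baseline: "
          f"{token_count(clip_s, ASSUMED_BASELINE_RATE):.0f}")
    print(f"tokens at compressed rate: "
          f"{token_count(clip_s, COMPRESSED_RATE):.0f}")
```

Under the assumed baseline, the reduction comes out at roughly 96.7%, which agrees with the rounded 97% in the abstract. Taken on their own, the two stated figures imply a baseline of about 56 tokens per second.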


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

    eess.AS 2026-04 unverdicted novelty 7.0

    LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.