A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

2025 · cs.CL · arXiv 2512.07571

abstract

This paper presents a simple method for enhancing textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classical issue when fusing many audio embeddings with text is that the audio sequence is much longer than the text sequence. Our method builds on an existing speech tokenizer trained for Automatic Speech Recognition, which outputs long sequences of tokens from a large vocabulary, making it difficult to integrate at low cost into a large language model. By applying a simple lasso-based feature selection to a multimodal Bag-of-Words representation, we retain only the audio tokens most important for the task and adapt the language model to them with a self-supervised language modeling objective before fine-tuning it on the downstream task. We show that this improves performance compared with a unimodal model, a bigger SpeechLM, or integrating audio via a learned representation. We demonstrate its effectiveness on Argumentative Fallacy Detection and Classification tasks, where audio was previously believed to be counterproductive, and on affective computing tasks on a widely used dataset. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
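The selection step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give the exact lasso formulation, so this example assumes an L1-penalized logistic regression fitted by proximal gradient descent. The function names, the hyperparameters, and the toy audio tokens such as `<a7>` are all hypothetical.

```python
import math
from collections import Counter

def bag_of_words(docs, vocab):
    """Turn each token list into a count vector over `vocab`."""
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for tokens in docs:
        row = [0.0] * len(vocab)
        for tok, count in Counter(tokens).items():
            row[index[tok]] = float(count)
        rows.append(row)
    return rows

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)

def l1_logistic(X, y, lam=0.05, lr=0.1, steps=500):
    """Assumed stand-in for the lasso step: L1-penalized logistic
    regression fitted with proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err / n
            for j, xj in enumerate(row):
                grad_w[j] += err * xj / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def select_audio_tokens(text_docs, audio_docs, labels, lam=0.05):
    """Keep the audio tokens whose weight survives the L1 penalty."""
    merged = [t + a for t, a in zip(text_docs, audio_docs)]
    vocab = sorted({tok for doc in merged for tok in doc})
    X = bag_of_words(merged, vocab)
    w, _ = l1_logistic(X, labels, lam=lam)
    audio_vocab = {tok for doc in audio_docs for tok in doc}
    return [tok for tok, wj in zip(vocab, w)
            if tok in audio_vocab and wj != 0.0]

# Toy data (hypothetical): the text is identical across classes, so only
# some audio tokens carry label information.
text_docs = [["we", "must", "act"]] * 4
audio_docs = [["<a7>", "<a3>"], ["<a7>", "<a9>"],
              ["<a3>", "<a5>"], ["<a9>", "<a5>"]]
labels = [1, 1, 0, 0]

selected = select_audio_tokens(text_docs, audio_docs, labels)
print(selected)
```

In this toy setup the tokens `<a7>` and `<a5>` each occur in only one class, while `<a3>` and `<a9>` occur equally in both, so the L1 penalty drives the weights of the latter pair to zero. Under the approach the abstract outlines, only the retained tokens would then be added to the language model before the adaptation and fine-tuning stages.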

representative citing papers

MultiLinguahah : A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

An unsupervised multilingual laughter segmentation method using Isolation Forest on BYOL-A audio representations outperforms existing supervised methods on non-English datasets.
