A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

Jose Guillen; Nicolas Calbucura; Valentin Barriere

arXiv: 2512.07571 · v2 · submitted 2025-12-08 · cs.CL · cs.MM

Pith reviewed 2026-05-17 00:40 UTC · model grok-4.3

keywords: audio, language, large, method, model, speech, classification, simple

The pith

Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The core idea starts with a speech tokenizer that turns audio into many discrete tokens from a large vocabulary. These tokens are turned into a simple bag-of-words count vector that is combined with the text. Lasso regression then picks out only the audio tokens that matter most for the target classification task, throwing away the rest to keep the input short. The language model is next adapted by training it to predict the selected audio tokens in a self-supervised way, so it learns to treat them as part of its vocabulary. Finally the model is fine-tuned on the actual task such as spotting fallacies in arguments or classifying emotions. The authors also test a random token selection baseline and find it still helps the text-only model. This pipeline is tested on fallacy detection datasets and a standard affective computing benchmark, showing gains over text-only models, larger speech-language models, and other ways of adding audio features.
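The first two steps described above (bag-of-words counts, then sparse selection) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy samples, the `txt:`/`aud:` prefixes, the audio token ids, and the helper names are all hypothetical, and the ℓ1-regularized logistic regression is fitted here with a plain proximal-gradient loop rather than whatever solver the paper uses.

```python
import math
from collections import Counter


def bag_of_words(text_tokens, audio_tokens, vocab):
    """Count text and audio tokens in one shared (multimodal) vector."""
    counts = Counter([f"txt:{t}" for t in text_tokens] +
                     [f"aud:{a}" for a in audio_tokens])
    return [float(counts[v]) for v in vocab]


def soft_threshold(value, threshold):
    """Proximal operator of the l1 penalty: shrink towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, steps=500):
    """l1-regularized logistic regression (no intercept) via proximal gradient."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy data: (text tokens, audio token ids, label).
samples = [
    (["so", "wrong"], [17, 17, 3], 1),
    (["so", "right"], [42, 42, 3], 0),
]
vocab = sorted({f"txt:{t}" for text, _, _ in samples for t in text} |
               {f"aud:{a}" for _, audio, _ in samples for a in audio})
X = [bag_of_words(text, audio, vocab) for text, audio, _ in samples]
y = [label for _, _, label in samples]

weights = fit_l1_logistic(X, y)
# Keep only the audio tokens whose weight was not shrunk to zero.
selected_audio = [v for v, w in zip(vocab, weights)
                  if v.startswith("aud:") and w != 0.0]
print(selected_audio)
```

In this toy setup the audio token that occurs equally often in both classes (`aud:3`) ends with a weight of exactly zero and is dropped, while the two class-specific audio tokens survive.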

Core claim

By applying a simple lasso-based feature selection on multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve the performances compared to an unimodal model, to a bigger SpeechLM or to integrating audio via a learned representation.

Load-bearing premise

That the lasso-selected subset of audio tokens plus self-supervised adaptation is sufficient to capture the speech information that actually helps the classification task without discarding critical cues or adding noise that hurts performance.

Figures

Figures reproduced from arXiv: 2512.07571 by Jose Guillen, Nicolas Calbucura, Valentin Barriere.

**Figure 1.** Step 1 of the method: an audio token selection pipeline based on an ℓ1 logistic regression over a Bag-of-Words representation, which results in fewer audio tokens being selected for a specific task. The experiments cover Argumentative Fallacy Detection (AFD) tasks using datasets from Mancini et al. (2024a), with the results of Mancini et al. (2025), which showed a strong text dominance over audio, as the baseline.

**Figure 2.** Steps 2 and 3 of the method: pretraining of the audio token embeddings, followed by fine-tuning of the multimodal LLM on the downstream task. The pretraining uses a causal language modeling cross-entropy loss; all weights of the LLM stay frozen except the embeddings of the audio tokens, so as not to disturb the model's initial representations.

**Figure 3.** Task prompt for the Qwen2-Audio model for in-context learning.
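Step 2 in Figure 2 trains only the embeddings of the newly added audio tokens while the rest of the LLM stays frozen. The sketch below illustrates that idea with a hand-written SGD step on a toy embedding table. The table, the gradients, and the function name `sgd_step_audio_only` are hypothetical; a real implementation would obtain the gradients from the causal language modeling loss inside a deep learning framework.

```python
def sgd_step_audio_only(embeddings, grads, audio_token_ids, lr=0.1):
    """Return a new embedding table where only the audio-token rows are updated.

    Rows whose index is not in `audio_token_ids` are copied unchanged,
    which mimics freezing the original text-token embeddings.
    """
    trainable = set(audio_token_ids)
    updated = []
    for token_id, (row, grad_row) in enumerate(zip(embeddings, grads)):
        if token_id in trainable:
            updated.append([v - lr * g for v, g in zip(row, grad_row)])
        else:
            updated.append(list(row))
    return updated


# Hypothetical table: rows 0-1 are text tokens, rows 2-3 are new audio tokens.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

new_embeddings = sgd_step_audio_only(embeddings, grads, audio_token_ids=[2, 3])
print(new_embeddings)
```

Masking the update per row keeps the sketch framework-free; the same effect is usually obtained by zeroing the gradient of the frozen rows before the optimizer step.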

Original abstract

This paper presents a simple method that allows to easily enhance textual pre-trained large language models with speech information, when fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that output long sequences of tokens from a large vocabulary, making it difficult to integrate it at low cost in a large language model. By applying a simple lasso-based feature selection on multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve the performances compared to an unimodal model, to a bigger SpeechLM or to integrating audio via a learned representation. We demonstrate its effectiveness on Argumentative Fallacy Detection and Classification tasks where audio was previously believed counterproductive, and affective computing tasks on a widely-used dataset. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhancing the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lasso on ASR tokens plus self-supervised adaptation gives gains on fallacy and affective tasks, but the bag-of-words step risks dropping order and timing cues that matter for speech.

The main point is that this paper shows a workable way to add speech information to a text-only LLM for specific classification jobs. The authors use a speech tokenizer trained for ASR to get discrete tokens, turn them into a multimodal bag-of-words, use lasso to keep only the task-relevant ones, adapt the model to those tokens with a language modeling objective, and then fine-tune. On argumentative fallacy detection and affective computing they report better numbers than the plain text baseline, a larger SpeechLM, or learned audio representations. The random-selection ablation is useful because it suggests the adaptation step itself contributes something beyond the exact token choice. Code is released, which makes the pipeline easy to inspect and reuse.

That combination of lasso selection and adaptation, on tasks where audio was previously thought unhelpful, is the actual new piece. The method stays simple and avoids the length mismatch that usually makes direct fusion expensive.

The soft spot is the bag-of-words representation itself. Converting the token sequence to counts throws away order, co-occurrence, and timing, which often carry prosody or discourse structure. The stress-test concern lands: if the gains come mainly from injecting extra tokens or from the adaptation rather than from the lasso-chosen content, then the selection step may not be doing the heavy lifting claimed. The abstract gives no dataset sizes, significance tests, or baseline details, so the full paper needs to show those controls clearly before the improvements can be taken as settled.

This work is aimed at applied multimodal NLP people who need a low-cost way to test whether speech helps a downstream classifier. Readers who already work with speech tokenizer outputs and want a lightweight fusion trick will get the most out of it. It is worth sending to peer review because the method is straightforward, the tasks are relevant, and the open code lets referees check the results directly.

Referee Report

2 major / 2 minor

Summary. The paper claims that a simple pipeline—converting ASR token sequences to a multimodal Bag-of-Words count vector, applying lasso feature selection to retain task-relevant audio tokens, performing self-supervised language-model adaptation on the selected tokens, and then fine-tuning—improves classification performance over a text-only baseline, a larger SpeechLM, and learned audio representations. The approach is evaluated on argumentative fallacy detection (where audio was previously thought counterproductive) and affective computing tasks, with an additional analysis showing that even random audio-token selection yields gains over the unimodal model. Code is released.

Significance. If the empirical gains are robust, the method offers a lightweight way to inject speech information into existing text LLMs without managing long audio sequences or training new fusion modules, which could be practically useful for classification tasks that benefit from paralinguistic cues. The public code release and the observation that random selection also helps are positive contributions that facilitate reproducibility and further analysis.

major comments (2)

[Abstract and §4 (Experiments)] Performance improvements are asserted without reporting dataset sizes, number of runs, statistical significance tests, or full baseline implementation details (e.g., how the larger SpeechLM and learned-representation baselines were trained or adapted). These omissions make it impossible to verify that the reported gains are reliable and attributable to the proposed pipeline rather than implementation differences.
[§3 (Method) and analysis section] The lasso operates on a Bag-of-Words representation that discards token order, co-occurrence, and timing. The paper itself reports that random token selection also improves over the unimodal baseline; this raises the possibility that gains arise from the adaptation step or from simply adding any additional tokens rather than from the lasso-selected speech content. A direct ablation comparing lasso-selected tokens against random and against frequency-based selection on the same downstream metrics is needed to support the central claim that lasso retains the “most important” audio tokens.
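The ablation requested in the second major comment compares three ways of choosing the same number of audio tokens. A minimal sketch of the three selectors is given below. The token ids, weights, and counts are invented for illustration, the function names are hypothetical, and ranking lasso tokens by absolute weight to match a fixed budget `k` is an assumption made here so that the three strategies are comparable; it is not taken from the paper.

```python
import random


def select_lasso(weights, k):
    """Top-k tokens by absolute l1-model weight, skipping zero weights."""
    nonzero = [t for t, w in weights.items() if w != 0.0]
    return sorted(nonzero, key=lambda t: (-abs(weights[t]), t))[:k]


def select_frequency(counts, k):
    """Top-k tokens by raw corpus frequency."""
    return sorted(counts, key=lambda t: (-counts[t], t))[:k]


def select_random(vocab, k, seed=0):
    """k tokens drawn uniformly without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(vocab), k))


# Hypothetical audio-token statistics.
weights = {17: 1.4, 42: -1.1, 3: 0.0, 256: 0.2, 900: 0.0}
counts = {17: 12, 42: 9, 3: 140, 256: 30, 900: 75}
k = 2

lasso_tokens = select_lasso(weights, k)
frequency_tokens = select_frequency(counts, k)
random_tokens = select_random(counts.keys(), k)
print(lasso_tokens, frequency_tokens, random_tokens)
```

Each token set would then go through the same adaptation and fine-tuning steps, so that any difference in the downstream metric can be attributed to the selection rule alone.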

minor comments (2)

[§3] Notation for the multimodal BoW vector and the lasso objective should be introduced with explicit equations rather than prose descriptions.
[§4] Figure captions and table headers should explicitly state the evaluation metric (accuracy, F1, etc.) and whether results are averaged over multiple seeds.
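As an illustration of the kind of explicit notation the first minor comment asks for, one common way to write an ℓ1-regularized logistic regression over a multimodal Bag-of-Words vector is shown below. The symbols are assumptions introduced here (the paper's own notation is not reproduced on this page), and the binary form is used for simplicity.

```latex
% Assumed notation (not taken from the paper):
%   x_i \in \mathbb{R}^{d} : multimodal Bag-of-Words count vector of sample i
%   y_i \in \{-1, +1\}     : class label of sample i
%   \lambda > 0            : regularization strength
%   \mathcal{A}            : indices of the audio-token dimensions
\hat{w} = \arg\min_{w \in \mathbb{R}^{d}} \;
  \frac{1}{N} \sum_{i=1}^{N} \log\!\left(1 + \exp\!\left(-y_i \, w^{\top} x_i\right)\right)
  + \lambda \lVert w \rVert_{1},
\qquad
\mathcal{S} = \left\{\, j \in \mathcal{A} \;:\; \hat{w}_j \neq 0 \,\right\}
```

Under this formulation the retained audio tokens are the set S, and a larger λ typically makes S smaller.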

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Referee: [Abstract and §4 (Experiments)] Performance improvements are asserted without reporting dataset sizes, number of runs, statistical significance tests, or full baseline implementation details (e.g., how the larger SpeechLM and learned-representation baselines were trained or adapted). These omissions make it impossible to verify that the reported gains are reliable and attributable to the proposed pipeline rather than implementation differences.

Authors: We agree that these experimental details are essential for assessing reliability and reproducibility. In the revised manuscript we will add explicit dataset sizes to §4, report all results as averages over five random seeds with standard deviations, include statistical significance tests (paired t-tests and McNemar’s test where appropriate), and expand the baseline descriptions to specify training procedures, hyperparameters, and adaptation steps for the larger SpeechLM and learned-representation baselines.

Revision: yes

Referee: [§3 (Method) and analysis section] The lasso operates on a Bag-of-Words representation that discards token order, co-occurrence, and timing. The paper itself reports that random token selection also improves over the unimodal baseline; this raises the possibility that gains arise from the adaptation step or from simply adding any additional tokens rather than from the lasso-selected speech content. A direct ablation comparing lasso-selected tokens against random and against frequency-based selection on the same downstream metrics is needed to support the central claim that lasso retains the “most important” audio tokens.

Authors: We acknowledge that the existing analysis already demonstrates gains from random selection, which suggests that the self-supervised adaptation step itself contributes to performance. To isolate the contribution of lasso selection and strengthen the central claim, we will add a direct ablation study comparing lasso-selected tokens, random selection, and frequency-based selection on identical downstream metrics. We will also clarify in §3 that while the Bag-of-Words representation discards order and timing, the subsequent language-model adaptation allows contextual modeling of the retained tokens; this limitation will be discussed explicitly.

Revision: yes
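The rebuttal mentions McNemar's test for comparing two classifiers on the same test set. A minimal sketch of the exact (binomial) version follows. The gold labels and predictions are made-up toy data, and the function names are hypothetical.

```python
from math import comb


def discordant_counts(gold, preds_a, preds_b):
    """Count items where exactly one of the two systems is correct."""
    b = sum(1 for g, pa, pb in zip(gold, preds_a, preds_b) if pa == g and pb != g)
    c = sum(1 for g, pa, pb in zip(gold, preds_a, preds_b) if pa != g and pb == g)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis each discordant item is equally likely to
    favour either system, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy example: system A is text-only, system B is text plus audio tokens.
gold       = [1, 0, 1, 1, 0, 0, 1, 0]
text_only  = [1, 0, 0, 0, 0, 1, 0, 0]
multimodal = [1, 0, 1, 1, 0, 0, 1, 1]

b, c = discordant_counts(gold, text_only, multimodal)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)
```

With only five discordant items the test cannot reach conventional significance levels, which is one reason dataset sizes matter when reading such comparisons.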

Circularity Check

0 steps flagged

No circularity: standard empirical pipeline with independent evaluation

The paper describes a straightforward sequence of standard components—ASR tokenization into a large vocabulary, conversion to multimodal Bag-of-Words counts, lasso-based feature selection on those counts, self-supervised language modeling adaptation on the selected tokens, and downstream fine-tuning—followed by empirical comparisons to unimodal baselines, larger SpeechLMs, and learned audio representations. No equations or derivations are presented that reduce the reported performance gains to a fitted parameter or self-referential definition by construction. The additional analysis that random token selection also yields gains is offered as an empirical observation rather than a core claim that collapses into the selection step. All load-bearing assertions remain externally falsifiable through the reported task accuracies on fallacy detection and affective computing datasets.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the standard assumption that lasso regression reliably identifies task-relevant features from high-dimensional bag-of-words vectors and that self-supervised language modeling on the selected tokens produces useful multimodal representations.

free parameters (1)

Lasso regularization strength: controls how many audio tokens are retained; the value is chosen to balance relevance and sequence length.
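The link between regularization strength and the number of retained tokens can be illustrated with the one case where the lasso has a closed form: for a least-squares lasso with an orthonormal design, each coefficient is the soft-thresholded least-squares coefficient. The paper uses an ℓ1 logistic regression, for which no such closed form exists, so the sketch below only illustrates the general trend; the coefficient values are invented.

```python
def soft_threshold(value, lam):
    """Shrink a coefficient towards zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def count_retained(coefficients, lam):
    """Number of coefficients that stay non-zero at regularization strength lam."""
    return sum(1 for c in coefficients if soft_threshold(c, lam) != 0.0)


# Hypothetical unregularized coefficients, one per audio token.
coefficients = [2.0, -1.2, 0.7, 0.3, -0.1, 0.05]

retained = {lam: count_retained(coefficients, lam)
            for lam in (0.0, 0.2, 0.5, 1.0, 2.5)}
print(retained)
```

Sweeping the strength in this way is a simple means of choosing a value that keeps the audio sequence short without discarding every token.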

axioms (1)

[standard math] Lasso regression selects the most predictive features from a high-dimensional multimodal bag-of-words vector. Invoked when reducing the long audio token sequence to a small set of important tokens.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8 / 8-tick periodicity · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: By applying a simple lasso-based feature selection on multimodal Bag-of-Words representation, we retain only the most important audio tokens... vocabulary size 1024 at each layer level l, which gives a total of 8,196 different tokens... 8 layers

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel / J-cost uniqueness · unclear

Relation between the paper passage and the cited Recognition theorem.

Paper passage: lasso-based feature selection... Bag-of-Word representations... ℓ1-regularized logistic regression

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method
cs.CL · 2026-05 · unverdicted · novelty 6.0

An unsupervised multilingual laughter segmentation method using Isolation Forest on BYOL-A audio representations outperforms existing supervised methods on non-English datasets.

MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method
cs.CL · 2026-05 · unverdicted · novelty 5.0

An unsupervised multilingual laughter segmentation technique using Isolation Forest on BYOL-A representations outperforms state-of-the-art supervised detectors on non-English audio datasets.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, and 3 others. 2016. TensorFlow: A system for large-scale machine learning ...

[2] Jean Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, and 7 others. 2022. Flamingo: a Visual Language Model fo...

[3] Valentin Barriere and Alexandra Balahur. 2023. Multilingual Multi-target Stance Recognition in Online Public Consultations. MDPI Mathematics, Special issue on Human Language Technology, 11(9):2161. https://www.mdpi.com/2227-7390/11/9/2161

[4] Valentin Barriere and Guillaume Jacquet. 2022. CoFE: A New Dataset of Intra-Multilingual Multi-target Stance Classification from an Online European Participatory Democracy Platform. AACL-IJCNLP.

[5] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2023. AudioLM: a language modeling approach to audio generation. Preprint, arXiv:2209.03143. https://arxiv.org/abs/2209.03143

[6] Eva Cantín and Adriana Chust. 2025. Argumentative Fallacy Detection in Political Debates. In Proceedings of the 12th Argument mining Workshop, pages 369–373, Vienna, Austria. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.argmining-1.36

[7] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. 2024. Qwen2-Audio Technical Report. pages 1–16. http://arxiv.org/abs/2407.10759

[8] Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sravan Bodapati, Sundararajan Srinivasan, Kyu J Han, and Katrin Kirchhoff. 2024. SpeechVerse: A Large-scale Generalizable Audio... http://arxiv.org/abs/2405.08295

[9] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: a speech-text foundation model for real-time dialogue. pages 1–67. http://arxiv.org/abs/2410.00037

[10] Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP - A collaborative voice analysis repository for speech technologies. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pages 960–964. https://doi.org/10.1109/ICASSP.2014.6853739

[11] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High fidelity neural audio compression. Preprint, arXiv:2210.13438. https://arxiv.org/abs/2210.13438

[12] Marlen Fröhlich, Christine Sievers, Simon W. Townsend, Thibaud Gruber, and Carel P. van Schaik. 2019. Multimodal communication and language origins: integrating gestures and vocalizations. Biological Reviews, 94(5):1809–1829. https://doi.org/10.1111/brv.12535

[13] Pierpaolo Goffredo, Shohreh Haddadan, Vorakit Vorakitphan, Elena Cabrio, and Serena Villata. 2022. Fallacious Argument Classification in Political Debates. IJCAI International Joint Conference on Artificial Intelligence, pages 4143–4149. https://doi.org/10.24963/ijcai.2022/575

[14] R. Gray. 1984. Vector quantization. IEEE ASSP Magazine, 1(2):4–29. https://doi.org/10.1109/MASSP.1984.1162229

[15] Shohreh Haddadan, Elena Cabrio, and Serena Villata. 2019. Yes, we can! Mining arguments in 50 years of US presidential campaign debates. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 4684–4690. https://doi.org/10.18653/v1/p19-1463

[16] Wei Ning Hsu, Benjamin Bolte, Yao Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio Speech and Language Processing, 29(Cv):3451–3460. https://doi.org/10.1109/TASLP.2021.3122291

[17] Youngmin Kim, Jiwan Chung, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, and Youngjae Yu. 2025. Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues. pages 2247–2265. https://doi.org/10.18653/v1/2025.acl-long.112

[18] Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. 2024. Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities. Proceedings of Machine Learning Research, 235:25125–25148.

[19] Marco Lippi and Paolo Torroni. 2016. Argument mining from speech: Detecting claims in political debates. 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pages 2979–2985. https://doi.org/10.1609/aaai.v30i1.10384

[20] Wenrui Liu, Qian Chen, Wen Wang, Yafeng Chen, Jin Xu, Zhifang Guo, Guanrou Yang, Weiqin Li, Xiaoda Yang, Tao Jin, Minghui Fang, Jialong Zuo, Bai Jionghao, and Zemin Liu. 2025. Speech token prediction via compressed-to-fine language modeling for speech generation. Preprint, arXiv:2505.24496. https://arxiv.org/abs/2505.24496

[21] Eleonora Mancini, Federico Ruggeri, Stefano Colamonaco, Andrea Zecca, Samuele Marro, and Paolo Torroni. 2024a. MAMKit: A Comprehensive Multimodal Argument Mining Toolkit. In Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024), pages 69–82. https://github.com/lt-nlp-lab-unibo/mamkit

[22] Eleonora Mancini, Federico Ruggeri, Andrea Galassi, and Paolo Torroni. 2022. Multimodal Argument Mining: A Case Study in Political Debates. Proceedings of the 9th Workshop on Argument Mining, pages 158–170. https://aclanthology.org/2022.argmining-1.15/

[23] Eleonora Mancini, Federico Ruggeri, and Paolo Torroni. 2024b. Multimodal Fallacy Classification in Political Debates. EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, 2:170–178.

[24] Eleonora Mancini, Federico Ruggeri, Serena Villata, and Paolo Torroni. 2025. Overview of MM-ArgFallacy2025 on Multimodal Argumentative Fallacy Detection and Classification in Political Debates. In Proceedings of the 12th Argument mining Workshop, pages 358–368. https://doi.org/10.18653/v1/2025.argmining-1.35

[25] Rafael Mestre, Stuart E. Middleton, Matt Ryan, Masood Gheasi, Timothy J. Norman, and Jiatong Zhu. 2023. Augmenting pre-trained language models with audio feature embedding for argumentation mining in political debates. EACL 2023 - 17th Conference of the European Chapter of the Association for Computationa... https://doi.org/10.18653/v1/2023.findings-eacl.21

[26] Rafael Mestre, Razvan Milicin, Stuart E. Middleton, Matt Ryan, Jiatong Zhu, and Timothy J. Norman. 2021. M-Arg: Multimodal Argument Mining Dataset for Political Debates with Audio and Transcripts. 8th Workshop on Argument Mining, ArgMining 2021 - Proceedings, (2014):78–88. https://doi.org/10.18653/v1/2021.argmining-1.8

[27] Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino. 2023. BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations. IEEE/ACM Transactions on Audio Speech and Language Processing, 31:137–151. https://doi.org/10.1109/TASLP.2022.3221007

[28] Alessio Pittiglio. 2025. Leveraging Context for Multimodal Fallacy Classification in Political Debates. In Proceedings of the 12th Argument mining Workshop, pages 388–397, Vienna, Austria. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.argmining-1.39

[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. http://arxiv.org/abs/2103.00020

[30] Abdullah Tahir, Imaan Ibrar, Huma Ameer, Mehwish Fatima, and Seemab Latif. 2025. Prompt-Guided Augmentation and Multi-modal Fusion for Argumentative Fallacy Classification in Political Debates. In Proceedings of the 12th Argument mining Workshop, pages 381–387, Vienna, Austria. Association for Computationa... https://doi.org/10.18653/v1/2025.argmining-1.38

[31] Hugo Thimonier, Antony Perzo, and Renaud Seguier. 2025. EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition. pages 1–18. http://arxiv.org/abs/2508.14130

[32] A. Vasuki and P.T. Vanathi. 2006. A review of vector quantization techniques. IEEE Potentials, 25(4):39–47. https://doi.org/10.1109/MP.2006.1664069

[33] Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal processing: Survey of an emerging domain. Image and Vision Computing, 27(12):1743–1759. https://doi.org/10.1016/j.imavis.2008.11.007

[34] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. http://arxiv.org/abs/1910.03771

[35] Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. Proceedings of ACL, pages 2236–2246. http://aclweb.org/anthology/P18-1208

[36] Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. 2024. SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models. In 12th International Conference on Learning Representations, ICLR 2024, pages 1–21.
[37]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page
[38]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, and 3 others

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, and 3 others. 2016. TensorFlow: A system for large-scale machine learning ...

work page 2016

[2] [2]

Jean Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, and 7 others. 2022. Flamingo: a Visual Language Model fo...

work page 2022

[3] [3]

Valentin Barriere and Alexandra Balahur. 2023. https://www.mdpi.com/2227-7390/11/9/2161 Multilingual Multi-target Stance Recognition in Online Public Consultations . MDPI Mathematics -- Special issue on Human Language Technollogy, 11(9):2161

work page 2023

[4] [4]

Valentin Barriere and Guillaume Jacquet. 2022. CoFE : A New Dataset of Intra-Multilingual Multi-target Stance Classification from an Online European Participatory Democracy Platform . AACL-IJCNLP

work page 2022

[5] [5]

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2023. https://arxiv.org/abs/2209.03143 Audiolm: a language modeling approach to audio generation . Preprint, arXiv:2209.03143

work page arXiv 2023

[6] [6]

Eva Cant \' i n and Adriana Chust. 2025. https://doi.org/10.18653/v1/2025.argmining-1.36 Argumentative Fallacy Detection in Political Debates . In Proceedings of the 12th Argument mining Workshop, pages 369--373, Vienna, Austria. Association for Computational Linguistics

work page doi:10.18653/v1/2025.argmining-1.36 2025

[7] [7]

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. 2024. http://arxiv.org/abs/2407.10759 Qwen2-Audio Technical Report . pages 1--16

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sravan Bodapati, Sundararajan Srinivasan, Kyu J Han, and Katrin Kirchhoff. 2024. http://arxiv.org/abs/2405.08295 SpeechVerse: A Large-scale Generalizable Audio...

[9]

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. http://arxiv.org/abs/2410.00037 Moshi: a speech-text foundation model for real-time dialogue . pages 1--67

[10]

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. https://doi.org/10.1109/ICASSP.2014.6853739 COVAREP - A collaborative voice analysis repository for speech technologies . In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pages 960--964

[11]

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. https://arxiv.org/abs/2210.13438 High fidelity neural audio compression . Preprint, arXiv:2210.13438

[12]

Marlen Fröhlich, Christine Sievers, Simon W. Townsend, Thibaud Gruber, and Carel P. van Schaik. 2019. https://doi.org/10.1111/brv.12535 Multimodal communication and language origins: integrating gestures and vocalizations . Biological Reviews, 94(5):1809--1829

[13]

Pierpaolo Goffredo, Shohreh Haddadan, Vorakit Vorakitphan, Elena Cabrio, and Serena Villata. 2022. https://doi.org/10.24963/ijcai.2022/575 Fallacious Argument Classification in Political Debates . IJCAI International Joint Conference on Artificial Intelligence, pages 4143--4149

[14]

R. Gray. 1984. https://doi.org/10.1109/MASSP.1984.1162229 Vector quantization . IEEE ASSP Magazine, 1(2):4--29

[15]

Shohreh Haddadan, Elena Cabrio, and Serena Villata. 2019. https://doi.org/10.18653/v1/p19-1463 Yes, we can! Mining arguments in 50 years of US presidential campaign debates . ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 4684--4690

[16]

Wei Ning Hsu, Benjamin Bolte, Yao Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. https://doi.org/10.1109/TASLP.2021.3122291 HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units . IEEE/ACM Transactions on Audio Speech and Language Processing, 29(Cv):3451--3460

[17]

Youngmin Kim, Jiwan Chung, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, and Youngjae Yu. 2025. https://doi.org/10.18653/v1/2025.acl-long.112 Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues . pages 2247--2265

[18]

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. 2024. Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities . Proceedings of Machine Learning Research, 235:25125--25148

[19]

Marco Lippi and Paolo Torroni. 2016. https://doi.org/10.1609/aaai.v30i1.10384 Argument mining from speech: Detecting claims in political debates . 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pages 2979--2985

[20]

Wenrui Liu, Qian Chen, Wen Wang, Yafeng Chen, Jin Xu, Zhifang Guo, Guanrou Yang, Weiqin Li, Xiaoda Yang, Tao Jin, Minghui Fang, Jialong Zuo, Bai Jionghao, and Zemin Liu. 2025. https://arxiv.org/abs/2505.24496 Speech token prediction via compressed-to-fine language modeling for speech generation . Preprint, arXiv:2505.24496

[21]

Eleonora Mancini, Federico Ruggeri, Stefano Colamonaco, Andrea Zecca, Samuele Marro, and Paolo Torroni. 2024a. https://github.com/lt-nlp-lab-unibo/mamkit MAMKit: A Comprehensive Multimodal Argument Mining Toolkit . In Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024), pages 69--82

[22]

Eleonora Mancini, Federico Ruggeri, Andrea Galassi, and Paolo Torroni. 2022. https://aclanthology.org/2022.argmining-1.15/ Multimodal Argument Mining: A Case Study in Political Debates . Proceedings of the 9th Workshop on Argument Mining, pages 158--170

[23]

Eleonora Mancini, Federico Ruggeri, and Paolo Torroni. 2024b. Multimodal Fallacy Classification in Political Debates . EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, 2:170--178

[24]

Eleonora Mancini, Federico Ruggeri, Serena Villata, and Paolo Torroni. 2025. https://doi.org/10.18653/v1/2025.argmining-1.35 Overview of MM-ArgFallacy2025 on Multimodal Argumentative Fallacy Detection and Classification in Political Debates . In Proceedings of the 12th Argument mining Workshop, pages 358--368

[25]

Rafael Mestre, Stuart E. Middleton, Matt Ryan, Masood Gheasi, Timothy J. Norman, and Jiatong Zhu. 2023. https://doi.org/10.18653/v1/2023.findings-eacl.21 Augmenting pre-trained language models with audio feature embedding for argumentation mining in political debates . EACL 2023 - 17th Conference of the European Chapter of the Association for Computationa...

[26]

Rafael Mestre, Razvan Milicin, Stuart E. Middleton, Matt Ryan, Jiatong Zhu, and Timothy J. Norman. 2021. https://doi.org/10.18653/v1/2021.argmining-1.8 M-Arg: Multimodal Argument Mining Dataset for Political Debates with Audio and Transcripts . 8th Workshop on Argument Mining, ArgMining 2021 - Proceedings, (2014):78--88

[27]

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino. 2023. https://doi.org/10.1109/TASLP.2022.3221007 BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations . IEEE/ACM Transactions on Audio Speech and Language Processing, 31:137--151

[28]

Alessio Pittiglio. 2025. https://doi.org/10.18653/v1/2025.argmining-1.39 Leveraging Context for Multimodal Fallacy Classification in Political Debates . In Proceedings of the 12th Argument mining Workshop, pages 388--397, Vienna, Austria. Association for Computational Linguistics

[29]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. http://arxiv.org/abs/2103.00020 Learning Transferable Visual Models From Natural Language Supervision

[30]

Abdullah Tahir, Imaan Ibrar, Huma Ameer, Mehwish Fatima, and Seemab Latif. 2025. https://doi.org/10.18653/v1/2025.argmining-1.38 Prompt-Guided Augmentation and Multi-modal Fusion for Argumentative Fallacy Classification in Political Debates . In Proceedings of the 12th Argument mining Workshop, pages 381--387, Vienna, Austria. Association for Computationa...

[31]

Hugo Thimonier, Antony Perzo, and Renaud Seguier. 2025. http://arxiv.org/abs/2508.14130 EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition . pages 1--18

[32]

A. Vasuki and P.T. Vanathi. 2006. https://doi.org/10.1109/MP.2006.1664069 A review of vector quantization techniques . IEEE Potentials, 25(4):39--47

[33]

Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. https://doi.org/10.1016/j.imavis.2008.11.007 Social signal processing: Survey of an emerging domain . Image and Vision Computing, 27(12):1743--1759

[34]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. http://arxiv.org/abs/1910.03771 HuggingFace's Transformers: State-of-the-art Natural Language Processing

[35]

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. http://aclweb.org/anthology/P18-1208 Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph . Proceedings of ACL, pages 2236--2246

[36]

Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. 2024. SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models . In 12th International Conference on Learning Representations, ICLR 2024, pages 1--21
