Aligning MusicLLM with Emotion using Instruction Tuning and Feedback-Driven Alignment

Takuya Hasumi; Welly Naptali

arxiv: 2606.24123 · v1 · pith:WQVP6ULNnew · submitted 2026-06-23 · 💻 cs.SD

Aligning MusicLLM with Emotion using Instruction Tuning and Feedback-Driven Alignment

Takuya Hasumi , Welly Naptali This is my paper

Pith reviewed 2026-06-25 22:54 UTC · model grok-4.3

classification 💻 cs.SD

keywords music large language modelsemotion regressioninstruction tuningfeedback-driven alignmentarousalvalenceMusicQA

0 comments

The pith

Feedback-driven alignment with a verifiable numerical reward substantially improves MusicLLM performance on arousal and valence regression over instruction tuning alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether music large language models can be aligned to predict emotional content in music through explicit training. It compares two strategies: task-aware instruction tuning, which yields limited accuracy on emotion regression, and feedback-driven alignment, which uses a verifiable numerical reward. Experiments demonstrate that the feedback approach delivers substantial gains on both arousal and valence dimensions while preserving the model's MusicQA capability.

Core claim

Task-aware instruction tuning enables MusicLLMs to predict emotion levels to some extent, although accuracy remains limited. Applying feedback-driven alignment with a verifiable numerical reward substantially improves performance on both arousal and valence over instruction tuning alone, while maintaining MusicQA capability.

What carries the argument

Feedback-driven alignment using a verifiable numerical reward to refine emotion regression outputs.

If this is right

MusicLLMs gain usable emotion regression capability after feedback-driven alignment.
Gains appear on both arousal and valence prediction tasks.
Music question-answering performance stays intact after the alignment step.
Instruction tuning by itself yields only partial emotion regression ability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-based alignment loop could be tested on other continuous music attributes such as tempo or key.
Verifiable numerical rewards may reduce reliance on human preference data for music model alignment.
Models aligned this way might transfer to downstream applications like playlist generation by emotional contour.

Load-bearing premise

The numerical reward used in feedback-driven alignment accurately measures and improves true emotion regression quality without introducing new biases or overfitting to the specific evaluation sets.

What would settle it

Evaluating the feedback-aligned model on a fresh, held-out collection of music clips with independent human arousal and valence annotations and observing no performance gain over the instruction-tuned version would falsify the central improvement claim.

Figures

Figures reproduced from arXiv: 2606.24123 by Takuya Hasumi, Welly Naptali.

**Figure 1.** Figure 1: The overview of the music large language model (MusicLLM) for emotion regression. gression task requires predicting continuous values. Therefore, direct optimization of the score prediction error is necessary to produce reliable emotion level predictions. To investigate whether MusicLLMs can learn to capture emotional levels from audio through training, we train them on pairs of music tracks and correspon… view at source ↗

**Figure 2.** Figure 2: Training pipeline of MusicLLM in our work. First, the MusicLLM is trained with instruction tuning to learn the response format and a coarse emotion level. Then, the model is further fine-tuned by feedback-driven alignment to capture the fine-grained emotion level [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

This paper investigates whether music large language models (MusicLLMs) can be aligned for emotion regression. While MusicLLMs have shown strong performance in music information retrieval tasks, their ability to predict arousal and valence scores remains limited, since emotion regression has not been an explicit training objective. To examine whether MusicLLMs can be aligned with emotion, we train MusicLLMs on emotion regression and compare two strategies: instruction tuning and feedback-driven alignment. Our experiments show that task-aware instruction tuning enables MusicLLMs to predict emotion levels to some extent, although the accuracy remains limited. Applying feedback-driven alignment with a verifiable numerical reward substantially improves performance on both arousal and valence over instruction tuning alone. We further show that our approach improves emotion regression performance while maintaining MusicQA capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Feedback alignment beats instruction tuning for MusicLLM emotion regression but the reward construction is the part that needs verification.

read the letter

The paper tests two ways to add arousal and valence prediction to MusicLLMs: plain instruction tuning and then a follow-up feedback alignment step that uses a numerical reward. The main result is that the second step lifts performance on both emotion dimensions while the model keeps its MusicQA ability.

What the work actually does is take an existing MusicLLM and show that a standard alignment recipe can be applied to a regression task that was not in the original training. The comparison between the two strategies is direct, and checking that the original capability survives is a sensible control. That part is useful for anyone who already runs these models and wants to extend them without retraining from scratch.

The soft spot is the reward itself. The abstract calls it verifiable and numerical, yet supplies no description of how it is computed, what data it draws from, or how the authors avoided overlap with the evaluation sets. If the reward re-uses or correlates strongly with the same annotated scores used for final testing, the reported gains could be an artifact of the objective rather than better modeling. The stress-test note flags exactly this risk, and the abstract alone does not resolve it.

This is the sort of targeted application paper that would interest people in music information retrieval or affective computing who already have MusicLLM infrastructure. A reader looking for a new theoretical framework will not find one here, but someone who wants a practical recipe for adding emotion regression might.

I would send it to peer review. The question is clear, the experimental contrast is worth checking, and the authors appear to be engaging the actual limitation rather than overstating the method. Once the reward details and data splits are visible, a referee can judge whether the improvement holds up.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates whether MusicLLMs can be aligned for emotion regression tasks. It trains models using instruction tuning on emotion regression and compares this to feedback-driven alignment with a verifiable numerical reward, claiming that the latter substantially improves performance on arousal and valence prediction while preserving MusicQA capability. The abstract asserts that task-aware instruction tuning enables limited emotion prediction and that feedback-driven alignment yields substantial gains.

Significance. If the reported improvements are supported by independent evaluation and non-circular reward construction, the work would demonstrate a practical method for adapting MusicLLMs to regression objectives not present in pretraining. This could be relevant for music information retrieval applications involving continuous emotion labels, provided the gains generalize beyond the specific alignment setup.

major comments (2)

[Abstract] Abstract: the central claim that 'feedback-driven alignment with a verifiable numerical reward substantially improves performance on both arousal and valence over instruction tuning alone' is stated without any accompanying metrics, datasets, baselines, error bars, or implementation details. This absence prevents verification of whether the improvement is load-bearing or reproducible.
[Abstract] Abstract (and implied methods): the 'verifiable numerical reward' used in feedback-driven alignment is not described. If this reward is derived from or correlates with the same annotated arousal/valence scores used in final evaluation, the reported gains risk being an artifact of the alignment objective rather than improved modeling; the manuscript must specify reward construction, data partitioning, and controls for label leakage.

minor comments (1)

[Abstract] Abstract: the phrase 'MusicQA capability' is used without definition or reference to prior work; clarify what tasks or benchmarks this encompasses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The concerns about the abstract and reward construction are valid, and we will revise the manuscript to include the requested details and clarifications. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'feedback-driven alignment with a verifiable numerical reward substantially improves performance on both arousal and valence over instruction tuning alone' is stated without any accompanying metrics, datasets, baselines, error bars, or implementation details. This absence prevents verification of whether the improvement is load-bearing or reproducible.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revision, we will incorporate specific metrics (e.g., MSE or Pearson correlation improvements on arousal and valence), the primary dataset used, a brief mention of the instruction-tuning baseline, and reference to error bars from repeated runs. This addresses verifiability while respecting abstract length constraints. revision: yes
Referee: [Abstract] Abstract (and implied methods): the 'verifiable numerical reward' used in feedback-driven alignment is not described. If this reward is derived from or correlates with the same annotated arousal/valence scores used in final evaluation, the reported gains risk being an artifact of the alignment objective rather than improved modeling; the manuscript must specify reward construction, data partitioning, and controls for label leakage.

Authors: The reward is a numerical function based on the absolute deviation from ground-truth labels in the alignment phase. To prevent leakage, alignment uses a distinct training partition with no overlap to the held-out evaluation set. We will add a dedicated paragraph in the methods section detailing the exact reward formula, data partitioning strategy, and explicit controls confirming no label leakage between stages. This will clarify that gains arise from the alignment process rather than circular evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation self-contained with no self-referential reductions or fitted inputs renamed as predictions

full rationale

The provided abstract and description contain no equations, parameter-fitting steps, or self-citations that reduce the claimed improvements to inputs by construction. Instruction tuning and feedback-driven alignment are presented as distinct strategies whose outcomes are compared empirically on arousal/valence regression. The numerical reward is described only at a high level as 'verifiable' without any indication that it reuses evaluation labels or is fitted to the target metrics. No uniqueness theorems, ansatzes, or renamings of known results appear. The central claim of performance gains therefore rests on experimental comparison rather than definitional equivalence or self-citation chains. This is the expected honest non-finding for an abstract-level description lacking load-bearing internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; assessment limited to surface claims.

pith-pipeline@v0.9.1-grok · 5656 in / 936 out tokens · 18692 ms · 2026-06-25T22:54:35.874707+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 16 canonical work pages · 12 internal anchors

[1]

In the audio domain, integrating an audio encoder with an LLM has been explored for speech recognition, speech understanding, and general audio captioning [1–5]

Introduction Multimodal large language models (multimodal LLMs) have rapidly advanced thanks to the recent growth of LLMs. In the audio domain, integrating an audio encoder with an LLM has been explored for speech recognition, speech understanding, and general audio captioning [1–5]. Specifically, some Musi- cLLMs have been developed to address music info...
[2]

Aligning MusicLLM with Emotion using Instruction Tuning and Feedback-Driven Alignment

Preliminary: Music Emotion Recognition Music emotion recognition (MER) [13,14] aims to estimate the subjective emotion of a music track. This is essential to cre- ate mood-based playlist generation [15, 16] and emotion-based arXiv:2606.24123v1 [cs.SD] 23 Jun 2026 music generation [17–19]. Specifically, regression-based MER [14] predicts arousal and valenc...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

I would assign a 3.6 on the 1-9 scale

Aligning MusicLLM with Emotion 3.1. Overview of MusicLLM Our MusicLLM consists of three modules: a music encoder, a projector, and a text decoder (LLM), as illustrated in Fig. 1. Given a raw waveform, the music encoder produces a sequence of frame-level embeddings. Then, the projector downsamples the sequence and maps it into the LLM token space, yielding...
[4]

Datasets To evaluate whether MusicLLMs can learn emotion levels from music, we used two publicly available emotion regression datasets: DEAM [25] and MERGE [24]

Experiments 4.1. Datasets To evaluate whether MusicLLMs can learn emotion levels from music, we used two publicly available emotion regression datasets: DEAM [25] and MERGE [24]. DEAM consists of about 45-second clips with arousal and valence scores. We fol- low the training/validation/test split used in [23], resulting in 1261/271/270 samples. MERGE cons...
[5]

We systematically compared in- struction tuning and feedback-driven alignment

Conclusion In this work, we investigated whether MusicLLMs can learn emotion levels from music. We systematically compared in- struction tuning and feedback-driven alignment. In our ex- periments, instruction tuning yielded only limited improve- ments in emotion regression performance. However, feedback- driven alignment with verifiable numerical feedback...
[6]

These tools were not used to generate scientific ideas, develop the methodology, design experiments, or produce experimental results

Generative AI Use Disclosure The authors used generative AI tools only to assist with gram- mar correction, improving the clarity of author-written text, and minor code debugging. These tools were not used to generate scientific ideas, develop the methodology, design experiments, or produce experimental results. All AI-assisted text and code were reviewed...
[7]

An Embar- rassingly Simple Approach for LLM with Strong ASR Capacity,

Ziyang Ma,et al., “An embarrassingly simple approach for LLM with strong ASR capacity,”arXiv preprint arXiv:2402.08846, 2024

work page arXiv 2024
[8]

Listen, Think, and Understand,

Yuan Gong,et al., “Listen, Think, and Understand,” inICLR, 2024

2024
[9]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang,et al., “SALMONN: Towards generic hearing abili- ties for large language models,”arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Qwen2-Audio Technical Report

Yunfei Chu,et al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Audio flamingo: A novel audio lan- guage model with few-shot learning and dialogue abilities,

Zhifeng Kong,et al., “Audio flamingo: A novel audio lan- guage model with few-shot learning and dialogue abilities,” in International Conference on Machine Learning. PMLR, 2024, pp. 25125–25148

2024
[12]

LP-MusicCaps: LLM-based pseudo mu- sic captioning,

SeungHeon Doh,et al., “LP-MusicCaps: LLM-based pseudo mu- sic captioning,” inProc. ISMIR, 2023, pp. 409–416

2023
[13]

Music understanding Llama: Advancing text-to-music generation with question answering and caption- ing,

Shansong Liu,et al., “Music understanding Llama: Advancing text-to-music generation with question answering and caption- ing,” inProc. ICASSP, 2024, pp. 286–290

2024
[14]

Musilingo: Bridging music and text with pre-trained language models for music captioning and query re- sponse,

Zihao Deng,et al., “Musilingo: Bridging music and text with pre-trained language models for music captioning and query re- sponse,” inProc. NAACL, 2024, pp. 3643–3655

2024
[15]

Llark: A multimodal instruction-following language model for music,

Josh Gardner,et al., “Llark: A multimodal instruction-following language model for music,” inProc. ICML, 2024, pp. 15037– 15082

2024
[16]

CMI-Bench: A comprehensive bench- mark for evaluating music instruction following,

Yinghao Ma,et al., “CMI-Bench: A comprehensive bench- mark for evaluating music instruction following,”arXiv preprint arXiv:2506.12285, 2025

work page arXiv 2025
[17]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei,et al., “Finetuned language models are zero-shot learn- ers,”arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

Deep reinforcement learning from hu- man preferences,

Paul F Christiano,et al., “Deep reinforcement learning from hu- man preferences,”Advances in neural information processing sys- tems, vol. 30, 2017

2017
[19]

Perceptual and cognitive applications in music information retrieval,

David Huron, “Perceptual and cognitive applications in music information retrieval,”Cognition, vol. 10, no. 1, pp. 83–92, 2000

2000
[20]

A regression approach to music emotion recognition,

Yi-Hsuan Yang,et al., “A regression approach to music emotion recognition,”IEEE trans. ASLP, vol. 16, no. 2, pp. 448–457, 2008

2008
[21]

Flow moods: Recommending music by moods on deezer,

Th ´eo Bontempelli,et al., “Flow moods: Recommending music by moods on deezer,” inProc. RecSys, 2022, pp. 452–455

2022
[22]

Music mood prediction and playlist recommendation based on facial expressions,

Harshvardhan Sanjay Kendre,et al., “Music mood prediction and playlist recommendation based on facial expressions,” inProc. ICIMMI, 2023, pp. 1–6

2023
[23]

Sentimozart: Music generation based on emotions.,

Rishi Madhok,et al., “Sentimozart: Music generation based on emotions.,” inProc. ICAART, 2018, pp. 501–506

2018
[24]

Generating music with sentiment using transformer-GANs,

Pedro Neves,et al., “Generating music with sentiment using transformer-GANs,” inProc. ISMIR, 2022, pp. 717–725

2022
[25]

Symbolic music generation conditioned on continuous-valued emotions,

Serkan Sulun,et al., “Symbolic music generation conditioned on continuous-valued emotions,”IEEE Access, vol. 10, pp. 44617– 44626, 2022

2022
[26]

A circumplex model of affect.,

James A Russell, “A circumplex model of affect.,”JPSP, vol. 39, no. 6, pp. 1161, 1980

1980
[27]

CNN based music emotion classification

Xin Liu,et al., “CNN based music emotion classification,”arXiv preprint arXiv:1704.05665, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Musical instrument emotion recogni- tion using deep recurrent neural network,

Sangeetha Rajesh,et al., “Musical instrument emotion recogni- tion using deep recurrent neural network,”Procedia Computer Science, vol. 167, pp. 16–25, 2020

2020
[29]

Towards unified music emotion recogni- tion across dimensional and categorical models,

Jaeyong Kang,et al., “Towards unified music emotion recogni- tion across dimensional and categorical models,”arXiv preprint arXiv:2502.03979, 2025

work page arXiv 2025
[30]

MERGE–a bimodal dataset for static music emotion recognition,

Pedro Lima Louro,et al., “MERGE–a bimodal dataset for static music emotion recognition,”arXiv preprint arXiv:2407.06060, 2024

work page arXiv 2024
[31]

Developing a benchmark for emotional analysis of music,

Anna Aljanakiet al., “Developing a benchmark for emotional analysis of music,”PLOS ONE, vol. 12, no. 3, pp. 1–22, 03 2017

2017
[32]

GPT-4o System Card

Aaron Hurstet al., “GPT-4o system card,”arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao,et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun,et al., “Aligning large multimodal models with factually augmented RLHF,”arXiv preprint arXiv:2309.14525, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Proximal Policy Optimization Algorithms

John Schulman,et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

Direct preference optimization: Your language model is secretly a reward model,

Rafael Rafailov,et al., “Direct preference optimization: Your language model is secretly a reward model,”Proc. NeurIPS, 2023, pp. 53728–53741

2023
[37]

Norms of valence, arousal, and dom- inance for 13,915 english lemmas,

Amy Beth Warriner,et al., “Norms of valence, arousal, and dom- inance for 13,915 english lemmas,”Behavior research methods, vol. 45, no. 4, pp. 1191–1207, 2013

2013
[38]

A foundation model for music informatics,

Minz Won,et al., “A foundation model for music informatics,” in Proc. ICASSP, 2024, pp. 1226–1230

2024
[39]

The million song dataset,

Thierry Bertin-Mahieux,et al., “The million song dataset,” in Proc. ISMIR, 2011, pp. 591–596

2011
[40]

Vicuna: An open-source chatbot im- pressing GPT-4 with 90%* ChatGPT quality,

Wei-Lin Chiang,et al., “Vicuna: An open-source chatbot im- pressing GPT-4 with 90%* ChatGPT quality,” 2023

2023
[41]

LoRA: Low-rank adaptation of large lan- guage models.,

Edward J Hu,et al., “LoRA: Low-rank adaptation of large lan- guage models.,” inProc. ICLR, 2022, 13 pages

2022
[42]

Decoupled Weight Decay Regularization

Ilya Loshchilov,et al., “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

J. Hu,et al., “Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model,”arXiv preprint arXiv:2503.24290, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu,et al., “DAPO: An open-source LLM reinforcement learn- ing system at scale,”arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

The MTG-Jamendo dataset for auto- matic music tagging,

Dmitry Bogdanov,et al., “The MTG-Jamendo dataset for auto- matic music tagging,” inProc. ICML, 2019

2019
[46]

1000 songs for emotional analysis of mu- sic,

M. Soleymani,et al., “1000 songs for emotional analysis of mu- sic,” inProceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia, 2013, pp. 1–6

2013
[47]

The PMEmo dataset for music emotion recogni- tion,

K. Zhang,et al., “The PMEmo dataset for music emotion recogni- tion,” inProceedings of the 2018 acm on international conference on multimedia retrieval, 2018, pp. 135–142

2018
[48]

BLEU: a method for automatic evalua- tion of machine translation,

Kishore Papineni,et al., “BLEU: a method for automatic evalua- tion of machine translation,” inProc. ACL, 2002, pp. 311–318

2002
[49]

Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,

Satanjeev Banerjee,et al., “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation (MT/- Sum), 2005, pp. 65–72

2005
[50]

Rouge: A package for automatic evaluation of summaries,

Chin-Yew Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, 2004, pp. 74– 81

2004

[1] [1]

In the audio domain, integrating an audio encoder with an LLM has been explored for speech recognition, speech understanding, and general audio captioning [1–5]

Introduction Multimodal large language models (multimodal LLMs) have rapidly advanced thanks to the recent growth of LLMs. In the audio domain, integrating an audio encoder with an LLM has been explored for speech recognition, speech understanding, and general audio captioning [1–5]. Specifically, some Musi- cLLMs have been developed to address music info...

[2] [2]

Aligning MusicLLM with Emotion using Instruction Tuning and Feedback-Driven Alignment

Preliminary: Music Emotion Recognition Music emotion recognition (MER) [13,14] aims to estimate the subjective emotion of a music track. This is essential to cre- ate mood-based playlist generation [15, 16] and emotion-based arXiv:2606.24123v1 [cs.SD] 23 Jun 2026 music generation [17–19]. Specifically, regression-based MER [14] predicts arousal and valenc...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

I would assign a 3.6 on the 1-9 scale

Aligning MusicLLM with Emotion 3.1. Overview of MusicLLM Our MusicLLM consists of three modules: a music encoder, a projector, and a text decoder (LLM), as illustrated in Fig. 1. Given a raw waveform, the music encoder produces a sequence of frame-level embeddings. Then, the projector downsamples the sequence and maps it into the LLM token space, yielding...

[4] [4]

Datasets To evaluate whether MusicLLMs can learn emotion levels from music, we used two publicly available emotion regression datasets: DEAM [25] and MERGE [24]

Experiments 4.1. Datasets To evaluate whether MusicLLMs can learn emotion levels from music, we used two publicly available emotion regression datasets: DEAM [25] and MERGE [24]. DEAM consists of about 45-second clips with arousal and valence scores. We fol- low the training/validation/test split used in [23], resulting in 1261/271/270 samples. MERGE cons...

[5] [5]

We systematically compared in- struction tuning and feedback-driven alignment

Conclusion In this work, we investigated whether MusicLLMs can learn emotion levels from music. We systematically compared in- struction tuning and feedback-driven alignment. In our ex- periments, instruction tuning yielded only limited improve- ments in emotion regression performance. However, feedback- driven alignment with verifiable numerical feedback...

[6] [6]

These tools were not used to generate scientific ideas, develop the methodology, design experiments, or produce experimental results

Generative AI Use Disclosure The authors used generative AI tools only to assist with gram- mar correction, improving the clarity of author-written text, and minor code debugging. These tools were not used to generate scientific ideas, develop the methodology, design experiments, or produce experimental results. All AI-assisted text and code were reviewed...

[7] [7]

An Embar- rassingly Simple Approach for LLM with Strong ASR Capacity,

Ziyang Ma,et al., “An embarrassingly simple approach for LLM with strong ASR capacity,”arXiv preprint arXiv:2402.08846, 2024

work page arXiv 2024

[8] [8]

Listen, Think, and Understand,

Yuan Gong,et al., “Listen, Think, and Understand,” inICLR, 2024

2024

[9] [9]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang,et al., “SALMONN: Towards generic hearing abili- ties for large language models,”arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Qwen2-Audio Technical Report

Yunfei Chu,et al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Audio flamingo: A novel audio lan- guage model with few-shot learning and dialogue abilities,

Zhifeng Kong,et al., “Audio flamingo: A novel audio lan- guage model with few-shot learning and dialogue abilities,” in International Conference on Machine Learning. PMLR, 2024, pp. 25125–25148

2024

[12] [12]

LP-MusicCaps: LLM-based pseudo mu- sic captioning,

SeungHeon Doh,et al., “LP-MusicCaps: LLM-based pseudo mu- sic captioning,” inProc. ISMIR, 2023, pp. 409–416

2023

[13] [13]

Music understanding Llama: Advancing text-to-music generation with question answering and caption- ing,

Shansong Liu,et al., “Music understanding Llama: Advancing text-to-music generation with question answering and caption- ing,” inProc. ICASSP, 2024, pp. 286–290

2024

[14] [14]

Musilingo: Bridging music and text with pre-trained language models for music captioning and query re- sponse,

Zihao Deng,et al., “Musilingo: Bridging music and text with pre-trained language models for music captioning and query re- sponse,” inProc. NAACL, 2024, pp. 3643–3655

2024

[15] [15]

Llark: A multimodal instruction-following language model for music,

Josh Gardner,et al., “Llark: A multimodal instruction-following language model for music,” inProc. ICML, 2024, pp. 15037– 15082

2024

[16] [16]

CMI-Bench: A comprehensive bench- mark for evaluating music instruction following,

Yinghao Ma,et al., “CMI-Bench: A comprehensive bench- mark for evaluating music instruction following,”arXiv preprint arXiv:2506.12285, 2025

work page arXiv 2025

[17] [17]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei,et al., “Finetuned language models are zero-shot learn- ers,”arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[18] [18]

Deep reinforcement learning from hu- man preferences,

Paul F Christiano,et al., “Deep reinforcement learning from hu- man preferences,”Advances in neural information processing sys- tems, vol. 30, 2017

2017

[19] [19]

Perceptual and cognitive applications in music information retrieval,

David Huron, “Perceptual and cognitive applications in music information retrieval,”Cognition, vol. 10, no. 1, pp. 83–92, 2000

2000

[20] [20]

A regression approach to music emotion recognition,

Yi-Hsuan Yang,et al., “A regression approach to music emotion recognition,”IEEE trans. ASLP, vol. 16, no. 2, pp. 448–457, 2008

2008

[21] [21]

Flow moods: Recommending music by moods on deezer,

Th ´eo Bontempelli,et al., “Flow moods: Recommending music by moods on deezer,” inProc. RecSys, 2022, pp. 452–455

2022

[22] [22]

Music mood prediction and playlist recommendation based on facial expressions,

Harshvardhan Sanjay Kendre,et al., “Music mood prediction and playlist recommendation based on facial expressions,” inProc. ICIMMI, 2023, pp. 1–6

2023

[23] [23]

Sentimozart: Music generation based on emotions.,

Rishi Madhok,et al., “Sentimozart: Music generation based on emotions.,” inProc. ICAART, 2018, pp. 501–506

2018

[24] [24]

Generating music with sentiment using transformer-GANs,

Pedro Neves,et al., “Generating music with sentiment using transformer-GANs,” inProc. ISMIR, 2022, pp. 717–725

2022

[25] [25]

Symbolic music generation conditioned on continuous-valued emotions,

Serkan Sulun,et al., “Symbolic music generation conditioned on continuous-valued emotions,”IEEE Access, vol. 10, pp. 44617– 44626, 2022

2022

[26] [26]

A circumplex model of affect.,

James A Russell, “A circumplex model of affect.,”JPSP, vol. 39, no. 6, pp. 1161, 1980

1980

[27] [27]

CNN based music emotion classification

Xin Liu,et al., “CNN based music emotion classification,”arXiv preprint arXiv:1704.05665, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

Musical instrument emotion recogni- tion using deep recurrent neural network,

Sangeetha Rajesh,et al., “Musical instrument emotion recogni- tion using deep recurrent neural network,”Procedia Computer Science, vol. 167, pp. 16–25, 2020

2020

[29] [29]

Towards unified music emotion recogni- tion across dimensional and categorical models,

Jaeyong Kang,et al., “Towards unified music emotion recogni- tion across dimensional and categorical models,”arXiv preprint arXiv:2502.03979, 2025

work page arXiv 2025

[30] [30]

MERGE–a bimodal dataset for static music emotion recognition,

Pedro Lima Louro,et al., “MERGE–a bimodal dataset for static music emotion recognition,”arXiv preprint arXiv:2407.06060, 2024

work page arXiv 2024

[31] [31]

Developing a benchmark for emotional analysis of music,

Anna Aljanakiet al., “Developing a benchmark for emotional analysis of music,”PLOS ONE, vol. 12, no. 3, pp. 1–22, 03 2017

2017

[32] [32]

GPT-4o System Card

Aaron Hurstet al., “GPT-4o system card,”arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao,et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun,et al., “Aligning large multimodal models with factually augmented RLHF,”arXiv preprint arXiv:2309.14525, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Proximal Policy Optimization Algorithms

John Schulman,et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[36] [36]

Direct preference optimization: Your language model is secretly a reward model,

Rafael Rafailov,et al., “Direct preference optimization: Your language model is secretly a reward model,”Proc. NeurIPS, 2023, pp. 53728–53741

2023

[37] [37]

Norms of valence, arousal, and dom- inance for 13,915 english lemmas,

Amy Beth Warriner,et al., “Norms of valence, arousal, and dom- inance for 13,915 english lemmas,”Behavior research methods, vol. 45, no. 4, pp. 1191–1207, 2013

2013

[38] [38]

A foundation model for music informatics,

Minz Won,et al., “A foundation model for music informatics,” in Proc. ICASSP, 2024, pp. 1226–1230

2024

[39] [39]

The million song dataset,

Thierry Bertin-Mahieux,et al., “The million song dataset,” in Proc. ISMIR, 2011, pp. 591–596

2011

[40] [40]

Vicuna: An open-source chatbot im- pressing GPT-4 with 90%* ChatGPT quality,

Wei-Lin Chiang,et al., “Vicuna: An open-source chatbot im- pressing GPT-4 with 90%* ChatGPT quality,” 2023

2023

[41] [41]

LoRA: Low-rank adaptation of large lan- guage models.,

Edward J Hu,et al., “LoRA: Low-rank adaptation of large lan- guage models.,” inProc. ICLR, 2022, 13 pages

2022

[42] [42]

Decoupled Weight Decay Regularization

Ilya Loshchilov,et al., “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[43] [43]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

J. Hu,et al., “Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model,”arXiv preprint arXiv:2503.24290, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu,et al., “DAPO: An open-source LLM reinforcement learn- ing system at scale,”arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

The MTG-Jamendo dataset for auto- matic music tagging,

Dmitry Bogdanov,et al., “The MTG-Jamendo dataset for auto- matic music tagging,” inProc. ICML, 2019

2019

[46] [46]

1000 songs for emotional analysis of mu- sic,

M. Soleymani,et al., “1000 songs for emotional analysis of mu- sic,” inProceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia, 2013, pp. 1–6

2013

[47] [47]

The PMEmo dataset for music emotion recogni- tion,

K. Zhang,et al., “The PMEmo dataset for music emotion recogni- tion,” inProceedings of the 2018 acm on international conference on multimedia retrieval, 2018, pp. 135–142

2018

[48] [48]

BLEU: a method for automatic evalua- tion of machine translation,

Kishore Papineni,et al., “BLEU: a method for automatic evalua- tion of machine translation,” inProc. ACL, 2002, pp. 311–318

2002

[49] [49]

Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,

Satanjeev Banerjee,et al., “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation (MT/- Sum), 2005, pp. 65–72

2005

[50] [50]

Rouge: A package for automatic evaluation of summaries,

Chin-Yew Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, 2004, pp. 74– 81

2004