{"total":55,"items":[{"citing_arxiv_id":"2605.29202","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Auditing Training Data in Generative Music Models via Black-Box Membership Inference","primary_cat":"cs.LG","submitted_at":"2026-05-28T00:28:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Black-box membership inference on text-to-music models reaches up to 98.6% accuracy by training an auditor on semantic alignment patterns extracted from shadow-model generations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27741","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization","primary_cat":"cs.CL","submitted_at":"2026-05-26T22:34:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25967","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio","primary_cat":"cs.LG","submitted_at":"2026-05-25T15:43:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio 
modifications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":122,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22717","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators","primary_cat":"cs.SD","submitted_at":"2026-05-21T16:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training alignment without RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21081","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model","primary_cat":"cs.SD","submitted_at":"2026-05-20T12:16:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces Musical Attention, an attention variant 
that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18072","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MusicDET: Zero-Shot AI-Generated Music Detection","primary_cat":"cs.SD","submitted_at":"2026-05-18T08:54:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17414","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation","primary_cat":"eess.AS","submitted_at":"2026-05-17T12:22:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2Accompanist is a 402M-parameter semantic-aware diffusion model that achieves SOTA on the ATTM Grand Challenge benchmark for music accompaniment generation via automated data processing and structure-guided VAE fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16181","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ARIA: A Diagnostic Framework for Music Training Data Attribution","primary_cat":"cs.SD","submitted_at":"2026-05-15T17:00:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARIA decomposes music training 
data attribution into musical aspects and supplies reliability diagnostics from similarity metrics and score matrix analysis, with validation on symbolic models using counterfactual retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15831","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation","primary_cat":"cs.SD","submitted_at":"2026-05-15T10:35:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13404","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering","primary_cat":"cs.SD","submitted_at":"2026-05-13T11:59:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Codec-based audio generation.Recent audio generation systems increasingly rely on learned low-rate intermediate representations. SoundStream, EnCodec, and DAC establish neural-codec representations with residual vector quantization and high-fidelity decoding [8, 20, 33]. 
AudioLM and MusicLM showed that neural-codec representations can support broad audio and music generation [2, 4]. AudioLDM 2 demonstrates diffusion over learned audio representations for broad audio- generation tasks [21], while Stable Audio and long-form music diffusion show that timing-aware latent diffusion can scale to long-duration, high-resolution synthesis [10, 11]. These systems establish the viability of learned audio representations, but they are not designed specifically for event-faithful"},{"citing_arxiv_id":"2605.11866","ref_index":19,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling","primary_cat":"cs.SD","submitted_at":"2026-05-12T09:46:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AuDirector proposes a self-reflective closed-loop multi-agent framework with identity-aware pre-production, collaborative synthesis-correction, and human-guided refinement for coherent immersive audio storytelling.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Audio [17] and CLAP [18] for quality assessment, followed by pydub for final composition. Detailed system prompts will be available in our open-source repository. 3.2. Evaluation Data Our evaluation dataset comprises 100 diverse scenarios catego- rized into two primary genres: Podcasts (40 topics) and Radio Dramas (60 stories). Podcasts:We select a subset from Vicuna [ 19], focusing on four categories:Generic,Knowledge,Common-sense, andCoun- terfactual. 
Each category contains 10 topics aimed at evaluating Figure 1:Overview of the AuDirector framework: 1)Identity-aware pre-productionfor script-driven voice casting; 2)Collaborative synthesis and correctionfeaturing Critic-led quality auditing and self-correction; and 3)Human-guided interactive refinementfor"},{"citing_arxiv_id":"2605.10281","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs","primary_cat":"cs.SD","submitted_at":"2026-05-11T09:40:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10228","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries","primary_cat":"cs.MM","submitted_at":"2026-05-11T09:06:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"• We comprehensively evaluate 15 baselines and reveal differences and deficiencies at two levels- caption vs. queryandsingle-modal vs. unified audiovisual-highlighting the limitations of current research. 
2 Related Work 2.1 Single-Modality Retrieval Benchmarks Earlier video-text and audio-text retrieval benchmarks, such as MSR-VTT [ 34], MSVD [5], and V ATEX [30] on the visual side, and AudioCaps [16], Clotho [10], MusicCaps [2], and WavCaps [20] on the audio side, were largely built on independent short clips, with subsequent improvements mainly focusing on data scale and diversity. Such settings lack long-video context as a source of difficulty and are not well aligned with realistic search scenarios. With the rapid development of multimodal LLMs, recent work represented by LoVR [3] extends visual retrieval to long videos and"},{"citing_arxiv_id":"2605.10203","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration","primary_cat":"cs.SD","submitted_at":"2026-05-11T08:49:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Case 2: Self and LoA Attention (Source Interpolation).For Source Interpolation (Eq. 10), the attention energy matrix resides in RLz×Lz. We implement arow-wise per querybroadcasting mechanism. Let gi be the i-th element of the flattened acoustic priorg∈R Lz. The mixed energy for queryiand keyjis computed as: (Emix)i,j = (1−g i)(Esrc)i,j +g i(Ecurr)i,j (15) The selection of the row-wise broadcasting dimension is a strict requirement for Polyphonia's structural integrity. This ensures that the decision to preserve or edit is governed by thequery'sspatial-spectral location (i.e., where the model is currently synthesizing features). C.5. 
Hyperparameter and Scheduling Configuration Regarding the temporal and architectural application of the proposed mechanisms, we adopt a static configuration to ensure"},{"citing_arxiv_id":"2605.08750","ref_index":3,"ref_count":4,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Communicating Sound Through Natural Language","primary_cat":"cs.LG","submitted_at":"2026-05-09T07:25:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lexical acoustic coding lets LLMs transmit audio waveforms as editable natural-language sentences that another LLM can parse and reconstruct into sound.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"coordinate has an acoustically meaningful interpretation that can be named and transmitted. We compute d= 47 features, organized as 7 temporal, 7 spectral, 7 harmonic, and 26 psychoacoustic ones. Appendix B reports their short description. Lexical code.Each coordinate is lexicalized independently: ℓ= (ℓ 1, . . . , ℓd), ℓ i =E i(xi).(4) The mapE i :R→ A i is implemented as a feature-specific interval table. For example, an RMS value in [0.10,0.30) maps to mid-power. The full feature set and vocabulary mapping are reported in Appendices B and D. We treat these choices as specific instantiations of the LAC framework rather than core contributions: they were generated by an agent on a best-effort basis and are not optimized; we leave the search for better feature subsets and mappings to future work. 3.3 Sentence transport The ordered codeℓis not sent as a comma-separated list. It is converted into an English sentence q=V(ℓ)(5) that contains alldlexical terms in recoverable form. The verbalizer V may add ordinary grammatical material, but it may not delete, merge, paraphrase, or ambiguously rename any term. 
This is what distinguishes the transmitted sentence from a loose prose caption: it is readable English, but it remains aninjectivecarrier for the acoustic code. The inverse map is a parserUsuch that, for allℓ∈ L: U(V(ℓ)) =ℓ .(6) 5 Example.Consider a short sequenceℓwithd= 3: ℓ= (thunderous,swift-onset,short-decay). This can be written in a sentenceq: \"A thunderous sound with a swift onset and a short decay .\" Since the sentence is written by the sender LLM, it might differ at each run in terms of prose. The receiver applies the inverse parser ℓ=U(q)∈ L(7) to recover the d-slot lexical code before synthesis. Thus the per-sound payload is the sentence q, while the recoverable object carried by that sentence is the full lexical codeℓ. Remark (finite-rate bottleneck).While the sentence q may be verbose, its recoverable acoustic content cannot be hidden in the pro"},{"citing_arxiv_id":"2605.08608","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation","primary_cat":"eess.AS","submitted_at":"2026-05-09T02:07:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"tions into the LM hidden space, and the projected sequences are concatenated along the feature dimension to form the conditioning prefixp. Conditioned onp, the LM autoregressively predicts the clean WavCodec token sequence, which is then decoded into the enhanced waveform. 
Because WavCodec employs RVQ with multiple codebooks, we serialize the target tokens using a flattening scheme [1] and train the LM with a shared output head. This provides a simple and effective interface for autoregressive token modeling in our setting. During training, we use teacher forcing [55] strategy and NTP loss: LLM =− 1 𝑇 𝐾 𝑇 𝐾∑︁ 𝑛=1 log𝑝 𝜙 (𝑦𝑛 |𝑦 <𝑛,p ) ,(2) where 𝜙 denotes the trainable parameters of the language model,p denotes the concatenated prefix embeddings, andy =(𝑦 1, ."},{"citing_arxiv_id":"2605.06582","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:11:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"symbolic representations, while PairAlign prioritizes explicit sequence-level symbolic structure and retains a practical bridge to time-aware applications through approximate post-hoc grounding. 2.6 Neural Sequence Transduction, CTC, RNN-T, wav2tok, and PairAlign Neural sequence transduction.Neural sequence transduction studies how a model maps an input sequence X= [x 1,...,xT ] to an output sequence Y= [y 1,...,yU], often withU̸=Tand with unknown correspondence between input and output positions. This problem appears in speech recognition, handwriting recognition, machine translation, transliteration, speech synthe- sis, and related sequence-to-sequence tasks. 
Different neural transduction models mainly differ in how they"},{"citing_arxiv_id":"2605.04505","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions","primary_cat":"eess.AS","submitted_at":"2026-05-06T05:18:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"both synthetic and real audio, including BVCC [ 38], Qual- iSpeech [21], SpeechEval [22], and UrgentMOS [39]. 2. Pseudo-Labeled Data for Scale Extension:To expand our training corpus, we collected over 80,000 utterances from public datasets (LibriTTS [ 40], Expresso [ 41], Com- monV oice [42], EARS [ 43], AudioSet [ 44], FreeSound [ 45], MusicCaps [46], MUSDB18 [ 47]). We then utilized the public 3https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct AES model [ 13] to generate pseudo-labels across dimensions such as CE, CU, PC, and PQ for each utterance. 3. 
Proxy Data for Broad Generalization:To prevent the model from overfitting to a narrow definition of \"quality,\" we incorporate detection proxy tasks 4."},{"citing_arxiv_id":"2605.03929","ref_index":34,"ref_count":3,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PHALAR: Phasors for Learned Musical Audio Representations","primary_cat":"cs.SD","submitted_at":"2026-05-05T16:19:58+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01235","ref_index":15,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention","primary_cat":"cs.SD","submitted_at":"2026-05-02T04:15:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MindMelody combines real-time EEG emotion decoding with an LLM for intervention planning and a hierarchical controller for generating affect-aware music in a continuous feedback loop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20719","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence","primary_cat":"cs.SD","submitted_at":"2026-04-22T16:06:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Figure 1: Task formulation of Omnimodal Notation 
Process- ing and evaluation framework spatial-temporal alignment across auditory, visual, and symbolic representations (see Figure 1). Unlike standard text, musical nota- tion demands strict adherence to multi-dimensional physical and temporal constraints simultaneously. Recent advancements have shifted focus from audio synthesis [2, 6] to generating code-based Data :https://huggingface.co/datasets/Weisiqing123/ONOTE Code : https://github.com/T12knightally/ONOTE arXiv:2604.20719v1 [cs.SD] 22 Apr 2026 Conference'17, July 2017, Washington, DC, USA Menghe Ma *, Siqing Wei*, Yuecheng Xing*, Yaheng Wang, Fanhong Meng, Peijun Han, Luu Anh Tuan, and Haoran Luo † Figure 2: This framework establishes a deterministic evaluation metric for ONP by benchmarking OLLMs across three notation"},{"citing_arxiv_id":"2604.17986","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Latent Fourier Transform","primary_cat":"cs.SD","submitted_at":"2026-04-20T09:08:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"N N−1X k=0 X[k]w k (IDFT) The inverse DFT is also called the \"synthesis\" equation, since it expresses xas a weighted sum of complex sinusoids. To provide more concrete intu- ition, ifxis real-valued, we can expressxas the sum ofrealsinusoids with various frequencies k N , amplitudesA k, and phase shiftsϕ k: x[n] = ⌊N/2⌋X k=0 Ak cos \u0012 2π k N n+ϕ k \u0013 (1) WhereA k andϕ k are both derived from the coefficientX[k], as shown in Appendix D.1. In words, the DFT can decompose arealsignal into a sum ofrealsinusoids of different frequencies, all of which are mutually orthogonal. 
We show this decomposition for an example signal in Fig. 1. Diffusion Autoencoders.The diffusion autoencoder was proposed by (Preechakul et al."},{"citing_arxiv_id":"2604.16254","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics","primary_cat":"cs.SD","submitted_at":"2026-04-17T17:14:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10632","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences","primary_cat":"cs.SD","submitted_at":"2026-04-12T13:18:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Music-flavor correspondences transfer from small human-annotated collections to large synthetic FMA datasets, with computational targets showing significant alignment to human listener ratings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09054","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment 
Generation","primary_cat":"cs.SD","submitted_at":"2026-04-10T07:27:55+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08184","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan","primary_cat":"cs.SD","submitted_at":"2026-04-09T12:38:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"ples in the training and development sets are generated via singing voice conversion, with strictly non-overlapping source and target singers to avoid identity leakage. The fake samples in the evalua- tion set are produced by 5unseendeepfake methods, allowing a comprehensive evaluation of cross-model generalization. Music.The music subset is derived from the MusicCaps [ 1] dataset. We first divide the audio samples in MusicCaps, which la- beled as real samples, into non-overlapping training, development, and evaluation sets. The synthetic samples in the training and de- velopment sets are generated by TTM models conditioned on the corresponding textual descriptions of the real music. 
For the evalua- tion set, the fake samples are generated from the remaining textual"},{"citing_arxiv_id":"2604.07895","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues","primary_cat":"cs.AI","submitted_at":"2026-04-09T07:06:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07612","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP","primary_cat":"cs.SD","submitted_at":"2026-04-08T21:30:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06489","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Language-Guided Multimodal Texture Authoring via Generative Models","primary_cat":"cs.HC","submitted_at":"2026-04-07T21:47:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A language-driven system generates semantically consistent multimodal textures from text prompts by linking autoregressive haptic models and diffusion-based visuals 
through a shared latent representation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01929","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Woosh: A Sound Effects Foundation Model","primary_cat":"cs.SD","submitted_at":"2026-04-02T11:49:00+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"provide a detailed pseudocode for this training process in Algorithm 1 in Appendix A. We set the number of distillation fine-tuning epochs to 50, with a learning rate of5·10 −6. To stabilize training in the first epochs, the MeanFlow joint(t, r)embeddings were fine-tuned for 1000 steps to match thetembeddings in the teacher network. CFG is uniformly sampled in the range [1,9] with a condition dropout rate of 0.1. The discriminator uses 4 convolutional heads. The adversarial loss weight is set to 0.5. For inference, a generic ODE solver, namely DOPRI5, is used with target absolute and relative tolerances of 10−3. 
4.5 Evaluation We compare audio fidelity and semantic alignment for the Woosh-Flow and Woosh-DFlow models with"},{"citing_arxiv_id":"2603.03190","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity","primary_cat":"cs.AI","submitted_at":"2026-03-03T17:47:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.22029","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline","primary_cat":"cs.SD","submitted_at":"2026-02-24T06:43:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.11910","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TADA! 
Tuning Audio Diffusion Models through Activation Steering","primary_cat":"cs.SD","submitted_at":"2026-02-12T13:07:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Activation steering at a semantic bottleneck in audio diffusion models achieves state-of-the-art control over musical attributes such as instruments, vocals, and genres.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.09448","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization","primary_cat":"cs.SD","submitted_at":"2026-01-14T12:51:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs using in-context learning and fine-tuning on listener experiment data generate equalization settings that align better with population preferences than random sampling or static presets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.02954","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models","primary_cat":"cs.SD","submitted_at":"2026-01-06T11:54:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.02731","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Omni2Sound: Towards Unified Video-Text-to-Audio 
Generation","primary_cat":"cs.SD","submitted_at":"2026-01-06T05:49:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01537","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Two-Dimensional Quantization for Geometry-Aware Audio Coding","primary_cat":"cs.SD","submitted_at":"2025-12-01T11:06:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.04776","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid","primary_cat":"cs.CY","submitted_at":"2025-11-06T19:52:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"G-TRACE provides region-aware estimates of GenAI carbon emissions including 4309 MWh and 2068 tCO2 for a 2024-2025 image generation trend, paired with a seven-level AI Sustainability Pyramid for policy 
guidance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.19127","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Steering Autoregressive Music Generation with Recursive Feature Machines","primary_cat":"cs.LG","submitted_at":"2025-10-21T23:23:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23727","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AudioMoG: Guiding Audio Generation with Mixture-of-Guidance","primary_cat":"cs.SD","submitted_at":"2025-09-28T08:12:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20641","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Investigating Modality Contribution in Audio LLMs for Music","primary_cat":"cs.LG","submitted_at":"2025-09-25T00:56:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adapts MM-SHAP to quantify modality contributions in two Audio LLMs on MuChoMusic, showing text dominance alongside limited audio localization of key 
events.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.08128","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2025-07-10T19:40:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743, 2025. [2] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016. [3] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023. [4] P. Anastassiou, J. Chen, J. Chen, Y. Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gao, et al. 
Seed-tts: A family of high-quality versatile speech generation models."},{"citing_arxiv_id":"2504.08528","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On The Landscape of Spoken Language Models: A Comprehensive Survey","primary_cat":"cs.CL","submitted_at":"2025-04-11T13:40:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05139","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound","primary_cat":"cs.SD","submitted_at":"2025-02-07T18:15:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.04230","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"XAttnMark: Learning Robust Audio Watermarking with Cross-Attention","primary_cat":"cs.SD","submitted_at":"2025-02-06T17:15:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution 
robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.15913","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms","primary_cat":"cs.SD","submitted_at":"2024-11-24T16:53:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Stylus achieves training-free music style transfer on Mel-spectrograms by repurposing image diffusion models via style-key injection in self-attention plus phase-preserving reconstruction, outperforming baselines by 34.1% in content preservation and 25.7% in perceptual quality per 2,925 human raters","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.10759","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen2-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2024-07-15T14:38:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"chat benchmark. 2 Methodology Model Architecture The training process of Qwen2-Audio is depicted in Figure 2, which contains an audio encoder and a large language model. 
Given the paired data (a, x), where the a and x denote the audio sequences and text sequences, the training objective is to maximize the next text token probability as P_θ(x_t | x_{<t}, Encoder_ϕ(a)), (1) conditioning on audio representations and previous text sequences x_{<t}, where θ and ϕ denote the trainable parameters of the LLM and audio encoder respectively. Different from Qwen-Audio, the initialization of the audio encoder of Qwen2-Audio is based on the Whisper-large-v3 model (Radford et al., 2023). To preprocess the audio data, we resamples it to a frequency of 16kHz"},{"citing_arxiv_id":"2406.14294","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DASB - Discrete Audio and Speech Benchmark","primary_cat":"cs.SD","submitted_at":"2024-06-20T13:23:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.07476","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2024-06-11T17:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}