arxiv: 2606.17404 · v1 · pith:KC6IGOLCnew · submitted 2026-06-16 · 📡 eess.AS · cs.SD

ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

Pith reviewed 2026-06-26 23:14 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords text-to-audio evaluation · reference-free metric · acoustic event alignment · fine-grained assessment · semantic alignment · human correlation · TTA benchmarks · CLAP-based metrics

The pith

ELSA evaluates text-to-audio generation by checking each acoustic event named in the prompt against the generated audio, instead of matching entire clips.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current reference-free metrics for text-to-audio generation compare full audio clips to text descriptions at a coarse level, which often misses the fine details that determine whether listeners find the output faithful. ELSA addresses this by first extracting distinct acoustic events from the text query, then decomposing the generated audio accordingly and scoring how well each event matches its textual counterpart. Experiments on four standard TTA benchmarks show that these event-level scores track human subjective ratings more closely than earlier whole-clip methods. A reader would care because improved automatic metrics reduce the need for repeated expensive human listening tests when developing better generation models.

Core claim

ELSA is a reference-free evaluation metric for fine-grained text-audio alignment that decomposes generated audio guided by distinct acoustic events derived from the text query and assesses event-level alignment, revealing higher correlation with human subjective ratings than prior CLAP-based metrics across four TTA benchmarks.

What carries the argument

Acoustic event-level semantic alignment, which extracts distinct events from the text, decomposes the audio along those events, and computes separate alignment scores for each.
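
One way to picture this machinery is the minimal Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the embeddings are hypothetical hand-written vectors standing in for a shared text-audio embedding model, the per-event audio is assumed to have been separated already, and averaging the per-event cosine similarities is an assumed aggregation that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as "no match"
    return dot / (norm_u * norm_v)


def event_level_scores(event_text_embs, event_audio_embs):
    """Score each event named in the prompt against its separated audio.

    Both arguments map an event phrase to an embedding. In a real system these
    would come from a text parser, an audio separator, and a shared text-audio
    encoder; here they are hypothetical placeholders.
    """
    return {
        event: cosine(text_emb, event_audio_embs[event])
        for event, text_emb in event_text_embs.items()
    }


def aggregate(scores):
    """Assumed aggregation: the plain mean of the per-event scores."""
    return sum(scores.values()) / len(scores)


# Toy example: "dog barking" is present in the audio, "rain falling" is not.
text_embs = {"dog barking": [1.0, 0.0], "rain falling": [0.0, 1.0]}
audio_embs = {"dog barking": [2.0, 0.0], "rain falling": [3.0, 0.0]}

scores = event_level_scores(text_embs, audio_embs)
print(scores, aggregate(scores))
```

Keeping the per-event scores next to the aggregate means a single missing event stays visible instead of being averaged away.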

If this is right

  • TTA models can be iterated using automatic scores that better reflect human perception of specific sound events.
  • Evaluation can now flag mismatches at individual events rather than only reporting overall similarity.
  • Reference-free assessment becomes viable for fine-grained intent capture without paired audio references.
  • Development of TTA systems gains a tool that distinguishes precise event matches from coarse overall quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same event-decomposition idea could be tested on text-to-video or text-to-music tasks where fine-grained alignment also matters.
  • Performance may drop on prompts that lack clearly separable acoustic events, such as abstract or overlapping sounds.
  • If the decomposition step proves stable, future work could explore learned rather than rule-based event extraction.

Load-bearing premise

Decomposing generated audio according to acoustic events taken from the text and scoring their individual alignments will produce numbers that track human judgments of fine-grained text-audio match.

What would settle it

New human ratings, collected on the same four benchmarks, with which ELSA scores correlate no better than prior whole-clip metrics do.
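
The comparison described above can be sketched as follows. The abstract does not name the correlation coefficient, so Spearman's rank correlation is an assumption here, and every score and rating in the example is a made-up toy value, not a result from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only: human ratings and two hypothetical metrics on four clips.
human = [1.0, 3.0, 2.0, 5.0]
event_level_metric = [0.10, 0.40, 0.35, 0.80]
whole_clip_metric = [0.80, 0.10, 0.40, 0.35]

print(spearman(event_level_metric, human), spearman(whole_clip_metric, human))
```

On real data the two coefficients would be compared per benchmark, ideally with a significance test, which is the detail the referee report below asks the authors to state.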

Figures

Figures reproduced from arXiv: 2606.17404 by Daichi Yashima, Kanon Amemiya, Kento Tokura, Komei Sugiura, Shinnosuke Takamichi, Shuntaro Suzuki.

Figure 1
Figure 1: Overview of ELSA. ELSA enables acoustic event-wise, fine-grained text–audio alignment for reference-free TTA evaluation, yielding high correlation with human subjective ratings.
Figure 2
Figure 2: Architecture of ELSA. ELSA hierarchically evaluates global text–audio matching and fine-grained acoustic-event alignment by combining shared text–audio embeddings with event-level representations extracted via a text parser and a language-queried audio source separation (LASS) model.
Figure 3
Figure 3: Sensitivity analysis of metric–REL correlation with respect to the number of acoustic events in the text query. Clotho [26] benchmark.
Figure 4
original abstract

Text-to-audio (TTA) generation, synthesizing audio from natural language, has been widely studied for its ability to capture precise user intent. To effectively advance TTA models, it is essential to reliably evaluate generated audio without relying on costly human subjective ratings, motivating the development of automatic evaluation metrics that correlate well with human judgments. While recent CLAP-based metrics provide practical reference-free solutions, their coarse-grained text-audio similarity matching often correlates poorly with human ratings. To address this, we propose ELSA, a reference-free evaluation metric for fine-grained text-audio alignment. ELSA decomposes generated audio guided by distinct acoustic events derived from the text query and assesses event-level alignment. Experiments across four TTA benchmarks show that ELSA reveals a higher correlation with human subjective ratings than prior metrics, highlighting its effectiveness for reliable TTA evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes ELSA, a reference-free metric for fine-grained text-to-audio (TTA) evaluation. It decomposes generated audio guided by distinct acoustic events extracted from the text query and measures event-level semantic alignment. Experiments on four TTA benchmarks are claimed to demonstrate higher correlation with human subjective ratings than prior CLAP-based metrics.

Significance. If the central claim holds after detailed verification, ELSA could offer a practical improvement over coarse-grained reference-free metrics for TTA model development. However, the absence of any methodological, experimental, or statistical details in the abstract precludes assessment of whether the result is robust or reproducible.

major comments (2)
  1. [Abstract] The claim that ELSA 'reveals a higher correlation with human subjective ratings than prior metrics' supplies no information on the correlation measure employed, the statistical tests performed, sample sizes, or data handling procedures. This information is load-bearing for the central empirical claim and its absence prevents verification that the data support the assertion.
  2. [Abstract] No description is provided of the text-to-event extraction process, the audio decomposition method, or how event-level alignment is quantified. These steps are load-bearing for the weakest assumption that event-level decomposition will reliably track human judgments of fine-grained match.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that greater specificity in the abstract would strengthen the presentation of the central claims and will revise the abstract accordingly while ensuring it remains concise.

point-by-point responses
  1. Referee: [Abstract] The claim that ELSA 'reveals a higher correlation with human subjective ratings than prior metrics' supplies no information on the correlation measure employed, the statistical tests performed, sample sizes, or data handling procedures. This information is load-bearing for the central empirical claim and its absence prevents verification that the data support the assertion.

    Authors: We agree that the abstract would benefit from indicating the correlation measure, statistical testing approach, and evaluation scale. The manuscript body (experimental section) provides these details for the four benchmarks, including the specific correlation coefficient used, significance testing, and the number of rated samples per benchmark. We will revise the abstract to incorporate a brief statement of the correlation measure and evaluation scope.

    Revision: yes

  2. Referee: [Abstract] No description is provided of the text-to-event extraction process, the audio decomposition method, or how event-level alignment is quantified. These steps are load-bearing for the weakest assumption that event-level decomposition will reliably track human judgments of fine-grained match.

    Authors: The abstract is intentionally high-level, but the full methodology for text-to-event extraction, audio decomposition, and event-level alignment quantification is described in the dedicated methods section of the manuscript. We acknowledge that a short outline of these components in the abstract would improve accessibility and will revise the abstract to include one-sentence descriptions of each step.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract presents ELSA as a proposed metric that decomposes generated audio using acoustic events extracted from the text query and evaluates event-level alignment, with the central claim being higher correlation to human ratings than prior CLAP-based metrics across four benchmarks. No equations, parameter-fitting procedures, self-citations, or derivations are described that would reduce the reported correlation or the metric itself to its inputs by construction. The human correlation serves as external validation on separate benchmarks rather than a fitted or self-defined quantity. The paper's claim is therefore self-contained against external benchmarks with no load-bearing circular steps detectable from the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted from abstract only; no free parameters, axioms, or invented entities are identifiable from the provided text.

Reference graph

Works this paper leans on

46 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Introduction Audio generation conditioned on user intent, encompassing speech, sound effects, and music, has been widely studied [1, 2]. This interest is motivated by diverse applications such as augmented reality audio environments and media sound generation [3, 4], with text-to-audio (TTA) generation gaining particular attention for directly synth...

  2. [2]

    ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

    Related Works Automatic evaluation metrics for TTA generation [14, 11] have been comprehensively reviewed in Lan et al. [15] and Su et al. [16]. Among these, reference-free metrics such as PAM [9] and CLAPScore [12] have gained prominence due to their broad applicability, avoiding the need for reference audio by directly aligning text and audio representa...

  3. [3]

    dog barking

    Method An overview of the proposed metric is shown in Figure 2. ELSA is inspired by prior evaluation metrics for video and image captioning [18, 23], particularly those that hierarchically assess visual–language alignment [20, 19]. 3.1. Event-Level Representation Extraction Fine-grained text–audio comparison is achieved by hierarchically extracting ...

  4. [4]

    Experiments 4.1. Datasets and Data Pre-processing We evaluated the correlation between the proposed metric and human subjective ratings using four TTA benchmarks: AudioCaps [25], Clotho [26], MusicCaps [27], and RELATE [28]. For each benchmark, audio samples are generated from text queries using multiple TTA models (e.g., AudioLDM [29]) and annotated ...

  5. [5]

    text” in the table) and audio retrieval from text (“audio

    Results and Analysis 5.1. Correlation with Human Subjective Ratings Table 1 shows the correlation between human subjective ratings and the proposed metric, along with baseline metrics, across four benchmarks (AudioCaps [25], Clotho [26], Mus-... [footnotes: 6 https://github.com/soham97/PAM/tree/main, 7 https://github.com/lourson1091/audiobertscore] Table 2: Performance com...

  6. [6]

    Limitation As a primary limitation, ELSA does not explicitly model the temporal order of acoustic events. Although ELSA outperforms baseline metrics on order-sensitive benchmarks, such as OS on RELATE [28] and CompA-order on CompA [30] (Table 2), explicitly modeling event duration and sequential structure remains a promising direction for future improvement

  7. [7]

    Experimental results show that ELSA consistently outperforms existing metrics, including both reference-based and reference-free approaches

    Conclusion In this paper, we proposed ELSA, a reference-free evaluation metric for TTA generation that enables fine-grained text–audio comparison. Experimental results show that ELSA consistently outperforms existing metrics, including both reference-based and reference-free approaches. Furthermore, ablation and sensitivity analyses show both component-...

  8. [8]

    Honda Bridge Project,

    Acknowledgments Part of this study was executed in the “Honda Bridge Project,” a collaborative research and education program between the Faculty of Science and Technology at Keio University and Honda Motor Co., Ltd

  9. [9]

    They were not involved in the research design, nor did they contribute to the development, implementation, or evaluation of the proposed methods

    Generative AI Use Disclosure Generative AIs were used solely for auxiliary purposes, such as language refinement, manuscript formatting, and the implementation of standard algorithms. They were not involved in the research design, nor did they contribute to the development, implementation, or evaluation of the proposed methods. Accordingly, Generati...

  10. [10]

    Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning,

    X. Liu, T. Iqbal, J. Zhao, Q. Huang et al., “Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning,” in MLSP, 2021, pp. 1–6

  11. [11]

    Full-band General Audio Synthesis with Score-based Diffusion,

    S. Pascual, G. Bhattacharya, C. Yeh, J. Pons, and J. Serrà, “Full-band General Audio Synthesis with Score-based Diffusion,” in ICASSP, 2023, pp. 1–5

  12. [12]

    Sound synthesis for impact sounds in video games,

    D. B. Lloyd, N. Raghuvanshi, and N. K. Govindaraju, “Sound synthesis for impact sounds in video games,” in I3D, 2011, pp. 55–62

  13. [13]

    Riffusion: Stable Diffusion for Real-Time Music Generation,

    S. Forsgren and H. Martiros, “Riffusion: Stable Diffusion for Real-Time Music Generation,” URL: https://riffusion.com/about, 2022

  14. [14]

    AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining,

    H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y. Wang et al., “AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining,” TASLP, vol. 32, pp. 2871–2883, 2024

  15. [15]

    AudioX: Diffusion Transformer for Anything-to-Audio Generation,

    Z. Tian, Y. Jin, Z. Liu, R. Yuan, X. Tan, Q. Chen, W. Xue, and Y. Guo, “AudioX: Diffusion Transformer for Anything-to-Audio Generation,” in ICLR, 2026

  16. [16]

    TangoFlux: Text to Audio Generation with CLAP-Ranked Preference Optimization,

    C.-Y. Hung, N. Majumder, Z. Kong, A. Mehrish, A. Zadeh, C. Li, R. Valle, B. Catanzaro, and S. Poria, “TangoFlux: Text to Audio Generation with CLAP-Ranked Preference Optimization,” in ICLR, 2026

  17. [17]

    AudioGen: Textually Guided Audio Generation,

    F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi, “AudioGen: Textually Guided Audio Generation,” in ICLR, 2023

  18. [18]

    PAM: Prompting Audio-Language Models for Audio Quality Assessment,

    S. Deshmukh, D. Alharthi, B. Elizalde, H. Gamper, M. Al Ismail, R. Singh, B. Raj, and H. Wang, “PAM: Prompting Audio-Language Models for Audio Quality Assessment,” in Interspeech, 2024, pp. 3320–3324

  19. [19]

    AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences,

    M. Kishi, R. Sakai, S. Takamichi, Y. Kanamori, and Y. Okamoto, “AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences,” in AAAI Audio-Centric AI Workshop, 2026

  20. [20]

    SDR – Half-baked or Well Done?

    J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in ICASSP, 2019, pp. 626–630

  21. [21]

    A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining,

    F. Xiao, J. Guan, Q. Zhu, X. Liu et al., “A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining,” in DCASE Workshop, 2024

  22. [22]

    Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining,

    T. Takano, Y. Okamoto, Y. Kanamori, Y. Saito, R. Nagase, and H. Saruwatari, “Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining,” in APSIPA ASC, 2025, pp. 131–136

  23. [23]

    Performance measurement in blind audio source separation,

    E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” TASLP, vol. 14, no. 4, pp. 1462–1469, 2006

  24. [24]

    A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations,

    T. Lan, Y.-H. Zhou, Z.-A. Ma, F. Sun, R.-Q. Sun, J. Luo et al., “A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations,” arXiv preprint arXiv:2506.10019, 2025

  25. [25]

    Audio-language models for audio-centric tasks: A survey,

    Y. Su, J. Bai, Q. Xu, K. Xu, and Y. Dou, “Audio-Language Models for Audio-Centric Tasks: A survey,” arXiv preprint arXiv:2501.15177, 2025

  26. [26]

    CLAP Learning Audio Concepts from Natural Language Supervision,

    B. Elizalde, S. Deshmukh, M. A. Ismail, and H. Wang, “CLAP Learning Audio Concepts from Natural Language Supervision,” in ICASSP, 2023, pp. 1–5

  27. [27]

    VELA: An LLM-hybrid-as-a-judge approach for evaluating long image captions,

    K. Matsuda, Y. Wada, S. Hirano, S. Otsuki, and K. Sugiura, “VELA: An LLM-hybrid-as-a-judge approach for evaluating long image captions,” in EMNLP, 2025, pp. 8680–8696

  28. [28]

    EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching,

    Y. Shi, X. Yang, H. Xu, C. Yuan, B. Li, W. Hu, and Z. Zha, “EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching,” in CVPR, 2022

  29. [29]

    HICEScore: A Hierarchical Metric for Image Captioning Evaluation,

    Z. Zeng, J. Sun, H. Zhang, T. Wen, Y. Su, Y. Xie, Z. Wang, and B. Chen, “HICEScore: A Hierarchical Metric for Image Captioning Evaluation,” in ACM-MM, 2024, pp. 866–875

  30. [30]

    Advancing multi-grained alignment for contrastive language-audio pre-training,

    Y. Li, Z. Guo, X. Wang, and H. Liu, “Advancing multi-grained alignment for contrastive language-audio pre-training,” in ACM MM, 2024, pp. 7356–7365

  31. [31]

    Finelap: Taming heterogeneous supervision for fine-grained language-audio pretraining,

    X. Li, X. Xu, Z. Ma, W. Chen, H. He, Q. Kong, and X. Chen, “Finelap: Taming heterogeneous supervision for fine-grained language-audio pretraining,” ACL, 2026

  32. [32]

    LLM-Free Image Captioning Evaluation in Reference-Flexible Settings,

    S. Hirano, Y. Wada, K. Matsuda, S. Otsuki, and K. Sugiura, “LLM-Free Image Captioning Evaluation in Reference-Flexible Settings,” in AAAI, 2026

  33. [33]

    SAM Audio: Segment Anything in Audio,

    B. Shi, A. Tjandra, J. Hoffman, H. Wang, Y.-C. Wu, L. Gao, J. Richter, M. Le, A. Vyas, S. Chen et al., “SAM Audio: Segment Anything in Audio,” arXiv preprint arXiv:2512.18099, 2025

  34. [34]

    AudioCaps: Generating Captions for Audios in The Wild,

    C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating Captions for Audios in The Wild,” in NAACL-HLT, 2019

  35. [35]

    Clotho: an Audio Captioning Dataset,

    K. Drossos, S. Lipping, and T. Virtanen, “Clotho: an Audio Captioning Dataset,” in ICASSP, 2020, pp. 736–740

  36. [36]

    MusicLM: Generating Music From Text

    A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “MusicLM: Generating Music From Text,” arXiv:2301.11325, 2023

  37. [37]

    RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio,

    Y. Kanamori, Y. Okamoto, T. Takano, S. Takamichi, Y. Saito, and H. Saruwatari, “RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audio,” in Interspeech, 2025, pp. 3155–3159

  38. [38]

    AudioLDM: Text-to-audio generation with latent diffusion models,

    H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” in ICML, vol. 202, 23–29 Jul 2023, pp. 21450–21474

  39. [39]

    CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models,

    S. Ghosh, A. Seth, S. Kumar, U. Tyagi, C. K. R. Evuru, R. S, S. Sakshi, O. Nieto et al., “CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models,” in ICLR, 2024

  40. [40]

    Fast Timing-Conditioned Latent Audio Diffusion,

    Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast Timing-Conditioned Latent Audio Diffusion,” in ICML, 2024

  41. [41]

    Efficient Training of Audio Transformers with Patchout,

    K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Efficient Training of Audio Transformers with Patchout,” in Interspeech, 2022, pp. 2753–2757

  42. [42]

    Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings,

    A. L. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello, “Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings,” in ICASSP, 2019, pp. 3852–3856

  43. [43]

    Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation,

    Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation,” in ICASSP, 2023, pp. 1–5

  44. [44]

    Separate Anything You Describe,

    X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu, R. Xia, Y. Wang, M. D. Plumbley, and W. Wang, “Separate Anything You Describe,” TASLP, 2024

  45. [45]

    SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer,

    H. Wang, J. Hai, Y.-J. Lu, K. Thakkar, M. Elhilali, and N. Dehak, “SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer,” in ICASSP, 2025, pp. 1–5

  46. [46]

    Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization,

    N. Majumder, C.-Y. Hung, D. Ghosal, W.-N. Hsu et al., “Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization,” in ACM MM, 2024, pp. 564–572