Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations
Pith reviewed 2026-05-15 12:04 UTC · model grok-4.3
The pith
Text-to-audio models achieve better embedding consistency in larger sizes yet still produce acoustically divergent outputs under equivalent prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that semantic fragility in text-to-audio generation arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Even when embedding cosine similarities remain high, acoustic and temporal analyses show persistent divergence in the generated audio across all tested models and perturbation types.
What carries the argument
A dataset of 75 prompt groups subjected to Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR), scored with spectral, temporal, and embedding similarity measures to isolate where consistency breaks.
If this is right
- Larger models such as MusicGen-large reach cosine similarities of 0.77 under MLS and 0.82 under IS.
- Acoustic and temporal divergence persists even when embedding similarity is high.
- Robustness checks must include multiple representational levels rather than embeddings alone.
- The introduced framework allows systematic comparison of stability across generative audio systems.
Where Pith is reading between the lines
- Targeted improvements to the acoustic realization stage could raise overall consistency without altering the embedding component.
- Real-world uses that rely on repeated generation from similar descriptions may see noticeable variation in output character.
- Certain perturbation types such as structural rephrasing might drive larger acoustic shifts than lexical ones.
- The approach could extend to other generative modalities to locate analogous realization bottlenecks.
Load-bearing premise
The 75 prompt groups keep semantic intent intact while adding only localized linguistic changes, and the chosen spectral, temporal, and embedding measures together capture the dimensions that matter for output differences.
What would settle it
An experiment in which acoustic and temporal similarities rise to match the level of embedding similarities across the same prompt perturbations would contradict the claim that the main fragility occurs in semantic-to-acoustic realization.
read the original abstract
Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We selected MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates semantic fragility in text-to-audio generation by testing MusicGen-small, MusicGen-large, and Stable Audio 2.5 on 75 prompt groups under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). Outputs are compared via spectral, temporal, and embedding similarity metrics. Results show larger models achieve higher cosine similarities (0.77 under MLS, 0.82 under IS), yet acoustic and temporal divergences persist even at high embedding scores, leading to the claim that fragility occurs primarily in semantic-to-acoustic realization rather than embedding alignment.
Significance. If the central assumption holds, the work supplies a useful multi-level framework for robustness testing in generative audio and supplies concrete cross-model comparisons that could guide training improvements for linguistic stability.
major comments (2)
- [Abstract and dataset description] Abstract and dataset description: the claim that fragility arises primarily during semantic-to-acoustic realization (rather than multi-modal embedding alignment) rests on the unverified premise that MLS/IS/SR variants preserve semantic intent. No human ratings, no prompt-level sentence-embedding cosine scores, and no inter-annotator agreement are reported for the 75 groups. Without this check, observed acoustic/temporal divergences cannot be isolated from possible meaning shifts (e.g., intensity-word changes altering perceived dynamics).
- [Results section] Results section: concrete similarity numbers are given without accompanying statistical tests, exact perturbation-generation procedures, or baseline comparisons (e.g., against random prompt pairs or non-semantic controls). This omission weakens the ability to judge whether the reported divergences exceed expected variation.
minor comments (1)
- [Appendix] The manuscript would benefit from an appendix containing representative prompt triples (original + MLS + IS + SR) to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us strengthen the manuscript. We address each major point below and indicate the corresponding revisions.
read point-by-point responses
-
Referee: [Abstract and dataset description] Abstract and dataset description: the claim that fragility arises primarily during semantic-to-acoustic realization (rather than multi-modal embedding alignment) rests on the unverified premise that MLS/IS/SR variants preserve semantic intent. No human ratings, no prompt-level sentence-embedding cosine scores, and no inter-annotator agreement are reported for the 75 groups. Without this check, observed acoustic/temporal divergences cannot be isolated from possible meaning shifts (e.g., intensity-word changes altering perceived dynamics).
Authors: We appreciate this observation on verifying semantic preservation. The perturbations were explicitly designed to maintain intent (synonym substitution for MLS, descriptor adjustment without core-meaning change for IS, and reordering for SR). In the revised manuscript we now report prompt-level sentence-embedding cosine similarities (all-MiniLM-L6-v2) averaging 0.89 across all 75 groups and perturbation types. These scores provide quantitative support that semantic intent is largely preserved, allowing the observed acoustic divergences to be more confidently attributed to realization rather than meaning shift. Full human ratings and inter-annotator agreement were outside the original scope and resource limits of the study. revision: partial
-
Referee: [Results section] Results section: concrete similarity numbers are given without accompanying statistical tests, exact perturbation-generation procedures, or baseline comparisons (e.g., against random prompt pairs or non-semantic controls). This omission weakens the ability to judge whether the reported divergences exceed expected variation.
Authors: We agree that statistical tests, procedural detail, and baselines are necessary for rigorous interpretation. The revised manuscript now includes: (i) exact perturbation-generation rules with examples and pseudocode in the Methods and supplementary material; (ii) baseline cosine similarities computed on 75 random unrelated prompt pairs (mean 0.32), which are significantly lower than the perturbed-pair scores (p < 0.001, paired t-test); and (iii) paired t-tests on model and perturbation-type differences, all reaching p < 0.05. These additions allow readers to assess whether the reported divergences exceed expected variation. revision: yes
Circularity Check
No significant circularity in empirical evaluation framework
full rationale
The paper reports an empirical comparison of text-to-audio model outputs under prompt perturbations using direct measurements of spectral, temporal, and embedding similarities. No equations, derivations, or fitted parameters are described that reduce any claim to its inputs by construction. The central finding (fragility in semantic-to-acoustic realization) follows from contrasting high cosine similarities in embeddings against lower acoustic/temporal consistency; these are independent external metrics rather than self-referential or self-cited results. The design of the 75 prompt groups is presented as an input assumption without any load-bearing self-citation or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Prompt perturbations preserve semantic intent
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.