arxiv: 2603.13824 · v2 · submitted 2026-03-14 · 💻 cs.SD · cs.AI

Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations

Pith reviewed 2026-05-15 12:04 UTC · model grok-4.3

keywords text-to-audio generation · semantic fragility · prompt perturbations · robustness evaluation · MusicGen · audio similarity measures · embedding consistency

The pith

Larger text-to-audio models achieve better embedding consistency, yet still produce acoustically divergent outputs under semantically equivalent prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how small linguistic changes in prompts affect the outputs of three text-to-audio models: MusicGen-small, MusicGen-large, and Stable Audio 2.5. It applies three types of controlled variation (minimal lexical substitution, intensity shifts, and structural rephrasing) to 75 prompt groups designed to keep overall meaning fixed. Results indicate that larger models reach higher embedding cosine similarities, with MusicGen-large reaching 0.82 under intensity shifts, but spectral and temporal features continue to differ substantially across all models. This pattern suggests that the main source of inconsistency arises when semantic representations are rendered as audio signals, rather than during the initial alignment of text and audio embeddings. The work supplies a multi-level measurement approach that uses complementary similarity metrics to support more targeted robustness checks.

Core claim

The central claim is that semantic fragility in text-to-audio generation arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Even when embedding cosine similarities remain high, acoustic and temporal analyses show persistent divergence in the generated audio across all tested models and perturbation types.

What carries the argument

A dataset of 75 prompt groups subjected to Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR), scored with spectral, temporal, and embedding similarity measures to isolate where consistency breaks.
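The idea of scoring one pair of outputs at more than one representational level can be sketched as follows. This is a minimal illustration and not the paper's implementation: the summary does not name the exact metrics, so the RMS-envelope correlation used here as a temporal measure, the function names, and the `frame_size` parameter are all assumptions, and a spectral measure is left out for brevity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def rms_envelope(samples, frame_size):
    """Root-mean-square energy of each non-overlapping frame."""
    starts = range(0, len(samples) - frame_size + 1, frame_size)
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_size]) / frame_size)
        for s in starts
    ]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(x)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def compare_outputs(emb_a, emb_b, audio_a, audio_b, frame_size=4):
    """Score one output pair at the embedding level and the temporal level."""
    env_a = rms_envelope(audio_a, frame_size)
    env_b = rms_envelope(audio_b, frame_size)
    n = min(len(env_a), len(env_b))  # compare only the overlapping frames
    return {
        "embedding_cosine": cosine_similarity(emb_a, emb_b),
        "temporal_envelope_corr": pearson(env_a[:n], env_b[:n]),
    }


# Toy inputs (invented): identical embeddings, opposite loudness contours.
rising = [0.1] * 4 + [0.5] * 4 + [0.9] * 4
falling = list(reversed(rising))
print(compare_outputs([1.0, 0.0, 1.0], [1.0, 0.0, 1.0], rising, falling))
```

The toy pair is built so that the two levels disagree, which is the pattern the paper reports: the embeddings match while the loudness contours move in opposite directions.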

If this is right

  • Larger models such as MusicGen-large reach cosine similarities of 0.77 under MLS and 0.82 under IS.
  • Acoustic and temporal divergence persists even when embedding similarity is high.
  • Robustness checks must include multiple representational levels rather than embeddings alone.
  • The introduced framework allows systematic comparison of stability across generative audio systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted improvements to the acoustic realization stage could raise overall consistency without altering the embedding component.
  • Real-world uses that rely on repeated generation from similar descriptions may see noticeable variation in output character.
  • Certain perturbation types such as structural rephrasing might drive larger acoustic shifts than lexical ones.
  • The approach could extend to other generative modalities to locate analogous realization bottlenecks.

Load-bearing premise

The 75 prompt groups keep semantic intent intact while adding only localized linguistic changes, and the chosen spectral, temporal, and embedding measures together capture the dimensions that matter for output differences.

What would settle it

An experiment in which acoustic and temporal similarities rise to match the level of embedding similarities across the same prompt perturbations would contradict the claim that the main fragility occurs in semantic-to-acoustic realization.
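One way to make this criterion operational is to track the gap between embedding-level and acoustic-level similarity per prompt group. The sketch below is hypothetical: it assumes both similarities have been scaled to a comparable [0, 1] range, which the paper summary does not guarantee, and the field names, the `tolerance` value, and the numbers are placeholders rather than reported results.

```python
from statistics import mean


def realization_gap(groups):
    """Mean of (embedding similarity - acoustic similarity) over prompt groups."""
    return mean(g["embedding_sim"] - g["acoustic_sim"] for g in groups)


def claim_contradicted(groups, tolerance=0.05):
    """True if acoustic similarity has caught up with embedding similarity.

    `tolerance` is an arbitrary illustrative threshold, not a value from the paper.
    """
    return realization_gap(groups) <= tolerance


# Placeholder scores (invented for illustration only).
toy_groups = [
    {"embedding_sim": 0.8, "acoustic_sim": 0.4},
    {"embedding_sim": 0.7, "acoustic_sim": 0.3},
]
print(realization_gap(toy_groups), claim_contradicted(toy_groups))
```

A persistent positive gap is consistent with the paper's reading; a gap that closes across the same perturbations would count against it.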

Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We selected MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates semantic fragility in text-to-audio generation by testing MusicGen-small, MusicGen-large, and Stable Audio 2.5 on 75 prompt groups under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). Outputs are compared via spectral, temporal, and embedding similarity metrics. Results show larger models achieve higher cosine similarities (0.77 under MLS, 0.82 under IS), yet acoustic and temporal divergences persist even at high embedding scores, leading to the claim that fragility occurs primarily in semantic-to-acoustic realization rather than embedding alignment.

Significance. If the central assumption holds, the work offers a useful multi-level framework for robustness testing in generative audio and provides concrete cross-model comparisons that could guide training improvements for linguistic stability.

major comments (2)
  1. [Abstract and dataset description] The claim that fragility arises primarily during semantic-to-acoustic realization (rather than multi-modal embedding alignment) rests on the unverified premise that MLS/IS/SR variants preserve semantic intent. No human ratings, no prompt-level sentence-embedding cosine scores, and no inter-annotator agreement are reported for the 75 groups. Without this check, observed acoustic/temporal divergences cannot be isolated from possible meaning shifts (e.g., intensity-word changes altering perceived dynamics).
  2. [Results section] Concrete similarity numbers are given without accompanying statistical tests, exact perturbation-generation procedures, or baseline comparisons (e.g., against random prompt pairs or non-semantic controls). This omission weakens the ability to judge whether the reported divergences exceed expected variation.
minor comments (1)
  1. [Appendix] The manuscript would benefit from an appendix containing representative prompt groups (original + MLS + IS + SR) to support reproducibility.
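A prompt group of the kind the referee asks to see could be recorded as one small structure per group. This is only a sketch: the class name, field names, and the example prompts are invented for illustration and do not come from the paper's dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptGroup:
    """One original prompt with its three perturbed variants (hypothetical layout)."""
    original: str
    mls: str  # Minimal Lexical Substitution variant
    is_: str  # Intensity Shift variant ("is" is a reserved word in Python)
    sr: str   # Structural Rephrasing variant

    def pairs(self):
        """Return (label, original, variant) triples for pairwise comparison."""
        return [
            ("MLS", self.original, self.mls),
            ("IS", self.original, self.is_),
            ("SR", self.original, self.sr),
        ]


# Invented example prompts, not taken from the paper.
example = PromptGroup(
    original="a calm piano melody with soft strings",
    mls="a calm piano tune with soft strings",
    is_="a very calm piano melody with soft strings",
    sr="soft strings accompanying a calm piano melody",
)

for label, base, variant in example.pairs():
    print(f"{label}: {base!r} -> {variant!r}")
```

Keeping the original alongside every variant makes each comparison pair explicit, which is what a reader needs in order to judge whether intent was preserved.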

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the manuscript. We address each major point below and indicate the corresponding revisions.

  1. Referee: [Abstract and dataset description] The claim that fragility arises primarily during semantic-to-acoustic realization (rather than multi-modal embedding alignment) rests on the unverified premise that MLS/IS/SR variants preserve semantic intent. No human ratings, no prompt-level sentence-embedding cosine scores, and no inter-annotator agreement are reported for the 75 groups. Without this check, observed acoustic/temporal divergences cannot be isolated from possible meaning shifts (e.g., intensity-word changes altering perceived dynamics).

    Authors: We appreciate this observation on verifying semantic preservation. The perturbations were explicitly designed to maintain intent (synonym substitution for MLS, descriptor adjustment without core-meaning change for IS, and reordering for SR). In the revised manuscript we now report prompt-level sentence-embedding cosine similarities (all-MiniLM-L6-v2) averaging 0.89 across all 75 groups and perturbation types. These scores provide quantitative support that semantic intent is largely preserved, allowing the observed acoustic divergences to be more confidently attributed to realization rather than meaning shift. Full human ratings and inter-annotator agreement were outside the original scope and resource limits of the study.
    revision: partial

  2. Referee: [Results section] Concrete similarity numbers are given without accompanying statistical tests, exact perturbation-generation procedures, or baseline comparisons (e.g., against random prompt pairs or non-semantic controls). This omission weakens the ability to judge whether the reported divergences exceed expected variation.

    Authors: We agree that statistical tests, procedural detail, and baselines are necessary for rigorous interpretation. The revised manuscript now includes: (i) exact perturbation-generation rules with examples and pseudocode in the Methods and supplementary material; (ii) baseline cosine similarities computed on 75 random unrelated prompt pairs (mean 0.32), which are significantly lower than the perturbed-pair scores (p < 0.001, paired t-test); and (iii) paired t-tests on model and perturbation-type differences, all reaching p < 0.05. These additions allow readers to assess whether the reported divergences exceed expected variation.
    revision: yes
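The paired t-test mentioned in the rebuttal reduces to a statistic on per-item differences. The sketch below computes only that statistic and its degrees of freedom with the standard library; it assumes the two score lists are aligned item by item, and the function name is hypothetical. Obtaining a p-value additionally requires the t-distribution, for example via `scipy.stats.ttest_rel`, which is not used here.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for two aligned lists of scores.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the per-item differences
    and sd is the sample standard deviation.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long, aligned score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Toy scores (invented): differences are 1, 2, 3.
t_value, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(t_value, dof)
```

A large positive statistic indicates that the first list is consistently higher than the second across the paired items.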

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation framework

The paper reports an empirical comparison of text-to-audio model outputs under prompt perturbations using direct measurements of spectral, temporal, and embedding similarities. No equations, derivations, or fitted parameters are described that reduce any claim to its inputs by construction. The central finding (fragility in semantic-to-acoustic realization) follows from contrasting high cosine similarities in embeddings against lower acoustic/temporal consistency; these are independent external metrics rather than self-referential or self-cited results. The design of the 75 prompt groups is presented as an input assumption without any load-bearing self-citation or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the prompt variants preserve intent and that the chosen similarity metrics are valid proxies; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Prompt perturbations preserve semantic intent.
    Invoked in the construction of the MLS, IS, and SR variants and in the interpretation of embedding similarities.

pith-pipeline@v0.9.0 · 5536 in / 1104 out tokens · 61479 ms · 2026-05-15T12:04:52.389325+00:00 · methodology
