Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models

Bin Ma; Chongjia Ni; Chong Zhang; Eng Siong Chng; Yi-Wen Chao; Yizhou Peng; Yukun Ma

arxiv: 2510.13293 · v3 · pith:F5776M4Znew · submitted 2025-10-15 · 💻 cs.CL

Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, Bin Ma, Eng Siong Chng

Pith reviewed 2026-05-18 07:39 UTC · model grok-4.3

keywords: text-to-speech, emotion control, classifier-free guidance, auto-regressive models, style mismatch, adaptive guidance, natural language inference

The pith

An adaptive guidance scheme detects and compensates for mismatches between desired emotions and text meaning to enable better emotional control in auto-regressive text-to-speech models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-speech systems struggle when a requested emotion clashes with what the text actually says, often producing unnatural audio. The paper shows that measuring this clash with language models and then varying the strength of classifier-free guidance accordingly yields more emotionally expressive speech. This approach keeps the output intelligible and high-quality even when prompts and content conflict. A sympathetic reader would care because reliable emotion control is key to natural-sounding synthetic voices in applications like audiobooks and virtual assistants.

Core claim

The authors propose an adaptive classifier-free guidance (CFG) scheme for auto-regressive TTS models that adjusts the guidance strength based on the level of mismatch between the emotion style prompt and the semantic content of the text, as detected by large language models or natural language inference models. Through analysis of CFG's impact on emotional expressiveness, they demonstrate that this adaptive method improves expressiveness while preserving audio quality and intelligibility.

What carries the argument

The mismatch-aware adaptive CFG scheme, which scales guidance strength according to quantified mismatch between prompt emotion and text semantics.
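
To make "scales guidance strength" concrete, here is a minimal sketch of how a guidance scale could act on next-token logits. It assumes the common CFG form `uncond + scale * (cond - uncond)`; the page does not confirm that the paper uses this exact parameterization, and the function name `cfg_logits` and the toy logits are hypothetical.

```python
from typing import List


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Combine conditional and unconditional logits with a guidance scale.

    Assumed form (not confirmed by the source): uncond + scale * (cond - uncond).
    scale = 0 returns the unconditional logits, scale = 1 the conditional ones,
    and scale > 1 pushes further in the direction the condition suggests.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy logits over a two-token vocabulary (hypothetical values).
    cond = [2.0, 0.0]
    uncond = [1.0, 0.0]
    # A larger scale amplifies the gap introduced by the emotion prompt.
    print(cfg_logits(cond, uncond, 3.0))
    print(cfg_logits(cond, uncond, 2.0))
```

Under this form, an adaptive scheme only has to choose `scale` per utterance; the TTS model itself is untouched.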

If this is right

Emotional expressiveness increases in AR TTS models under mismatched conditions.
Audio quality and intelligibility remain stable across varying mismatch levels.
The method provides robust control without requiring changes to the underlying model architecture.
CFG application to AR TTS benefits from dynamic rather than fixed strength adjustment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar adaptive guidance could extend to other style controls like speaker identity or prosody in TTS.
Testing the method on diverse languages or real-world dialogue datasets would reveal its generalizability.
If detection models improve, the overall system performance could increase further without retraining the TTS model.

Load-bearing premise

Mismatch between the desired emotion style prompt and the semantic content of the text can be reliably detected and quantified by large language models or natural language inference models in a manner that permits effective, quality-preserving adaptation of CFG strength.

What would settle it

A set of test cases with known high emotion-text mismatch where the adaptive scheme produces speech no more expressive or natural than a fixed high CFG strength baseline.

Original abstract

While Text-to-Speech (TTS) systems enable emotional control via natural-language instructions, expressiveness, naturalness, and speech quality degrade when the target emotion conflicts with the textual semantics. We propose a Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG) method with dynamic scales based on the degree of inconsistency between the text emotion and the explicit speech emotion, replacing the dropout condition with the text emotion. We also distill the CCG-CFG guidance signal using a hard-sample mining strategy, improving the TTS model's emotional alignment capability. Evaluations on five emotional corpora and two TTS benchmarks show that our approaches applied to CosyVoice2 achieve up to a 12% absolute improvement in emotion-recognition accuracy and a 10% relative improvement in subjective scores, outperforming baselines including HierSpeech++, Qwen3-TTS, and original CosyVoice2, while preserving intelligibility, naturalness, and high speech quality.
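
The abstract mixes an absolute gain (percentage points) with a relative gain (a fraction of the baseline), which are easy to conflate. The sketch below shows the arithmetic only. The baseline values are hypothetical placeholders chosen for illustration, not numbers from the paper, and the helper names are assumptions.

```python
def apply_absolute_gain(baseline_pct: float, gain_points: float) -> float:
    """Absolute improvement: add percentage points to a percentage."""
    return baseline_pct + gain_points


def apply_relative_gain(baseline: float, gain_fraction: float) -> float:
    """Relative improvement: scale the baseline by (1 + fraction)."""
    return baseline * (1.0 + gain_fraction)


if __name__ == "__main__":
    # HYPOTHETICAL baselines, for illustration only (not reported in the paper).
    baseline_accuracy_pct = 60.0
    baseline_subjective_score = 3.5

    # "12% absolute" means +12 percentage points on the accuracy itself.
    print(apply_absolute_gain(baseline_accuracy_pct, 12.0))
    # "10% relative" means the score grows by one tenth of its baseline value.
    print(round(apply_relative_gain(baseline_subjective_score, 0.10), 2))
```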

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Adaptive mismatch-aware CFG for AR TTS is a practical engineering tweak that targets prompt conflicts but rests on thin validation of the LLM/NLI detector.

The main point is that this paper adds an adaptive layer to classifier-free guidance in auto-regressive TTS. It runs the desired emotion prompt and the text through an LLM or NLI model to score their mismatch, then scales the guidance strength up or down based on that score. The goal is to get stronger emotional alignment when the prompt fits the content and to back off when it does not, avoiding the unnatural prosody that fixed high guidance often produces in AR models.
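
The score-then-scale step described in the letter can be sketched as follows. This is an illustration under stated assumptions: the mismatch score is taken to be a number in [0, 1] (for example an NLI contradiction probability), and the thresholds 1/3 and 2/3 and the function names are hypothetical. Only the three scales (3.0, 2.5, 2.0 for low, medium, high mismatch) come from the passage quoted further down this page.

```python
from typing import Dict

# Scales quoted on this page for the [low, medium, high] mismatch levels.
CFG_SCALE_BY_LEVEL: Dict[str, float] = {"low": 3.0, "medium": 2.5, "high": 2.0}


def mismatch_level(score: float,
                   medium_from: float = 1 / 3,
                   high_from: float = 2 / 3) -> str:
    """Bucket a mismatch score in [0, 1] into low / medium / high.

    The thresholds are HYPOTHETICAL; the source does not state them.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= high_from:
        return "high"
    if score >= medium_from:
        return "medium"
    return "low"


def guidance_scale(score: float) -> float:
    """Back off the CFG scale as the detected mismatch grows."""
    return CFG_SCALE_BY_LEVEL[mismatch_level(score)]


if __name__ == "__main__":
    for s in (0.1, 0.5, 0.9):
        print(s, mismatch_level(s), guidance_scale(s))
```

A step function keeps the behavior easy to audit; a continuous mapping would be an alternative, but nothing on this page indicates which the authors prefer beyond the three quoted levels.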

Referee Report

2 major / 1 minor

Summary. The paper proposes an adaptive Classifier-Free Guidance (CFG) scheme for auto-regressive TTS models to address style-content mismatches between emotional style prompts and input text semantics. Mismatch is detected and quantified using LLMs or NLI models, which then modulates CFG strength to improve emotional expressiveness while preserving audio quality and intelligibility. The central claim rests on an analysis of CFG effects in SOTA AR TTS models and results showing the adaptive approach outperforms fixed CFG.

Significance. If the empirical claims hold with proper validation, the work would address a practical limitation in prompt-driven emotional TTS by enabling robust, mismatch-aware control without quality trade-offs. This could advance deployment of fine-grained style control in AR models. The external-detector approach is a reasonable engineering response to the problem, but its value depends on demonstrating that the mismatch scalar reliably correlates with perceptual outcomes and maps to safe CFG adjustments.

major comments (2)

[Abstract] The headline claim that the adaptive CFG scheme 'improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility' is presented without any quantitative metrics, baselines, statistical tests, or description of how mismatch scores are computed and mapped to CFG scales. This absence prevents assessment of effect size or comparison to standard CFG.
[Experimental evaluation] (Inferred from abstract and method description.) No validation or calibration results are reported for the LLM/NLI mismatch detectors against human perceptual mismatch judgments, nor is there evidence that the chosen mapping from mismatch score to CFG strength was tuned on held-out data to avoid compounding artifacts in sequential AR generation. This is load-bearing for the claim that expressiveness increases without raising WER or lowering MOS.

minor comments (1)

[Method] The description of how mismatch scores are normalized or thresholded before scaling CFG could be made more precise, ideally with a short equation or pseudocode block.
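
As an illustration of the kind of short equation this comment asks for, one possible form is given below. It is a sketch, not the authors' formulation: the symbols ($\ell_t^{\mathrm{c}}$ and $\ell_t^{\mathrm{u}}$ for conditional and unconditional logits at step $t$, $m$ for the mismatch level) are assumed notation, and the combination rule is the common CFG form. Only the three scale values come from the passage quoted later on this page. The `cases` environment requires `amsmath`.

```latex
\[
\hat{\ell}_t = \ell_t^{\mathrm{u}} + w(m)\,\bigl(\ell_t^{\mathrm{c}} - \ell_t^{\mathrm{u}}\bigr),
\qquad
w(m) =
\begin{cases}
3.0, & m = \text{low} \\
2.5, & m = \text{medium} \\
2.0, & m = \text{high}
\end{cases}
\]
```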

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We agree that additional quantitative details and validation would strengthen the presentation of our adaptive CFG approach. We address each major comment below and will incorporate the necessary revisions.

Referee: [Abstract] The headline claim that the adaptive CFG scheme 'improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility' is presented without any quantitative metrics, baselines, statistical tests, or description of how mismatch scores are computed and mapped to CFG scales. This absence prevents assessment of effect size or comparison to standard CFG.

Authors: We acknowledge that the abstract is high-level and omits specific numbers. In the revised version we will expand the abstract to report key quantitative outcomes, including relative gains in emotion classification accuracy or expressiveness metrics, WER, and MOS scores versus fixed-CFG baselines, along with a concise description of the LLM/NLI mismatch computation and the linear or threshold-based mapping to CFG strength. Where space permits we will note statistical significance.
revision: yes

Referee: [Experimental evaluation] (Inferred from abstract and method description.) No validation or calibration results are reported for the LLM/NLI mismatch detectors against human perceptual mismatch judgments, nor is there evidence that the chosen mapping from mismatch score to CFG strength was tuned on held-out data to avoid compounding artifacts in sequential AR generation. This is load-bearing for the claim that expressiveness increases without raising WER or lowering MOS.

Authors: This observation is correct for the current draft. While the method section describes the mismatch detectors, we did not include explicit human calibration. We will add a new subsection or appendix reporting Pearson/Spearman correlation between automated mismatch scores and human perceptual mismatch ratings collected on a held-out validation set. We will also document that the mismatch-to-CFG mapping was designed on development data only and will verify that no test-set leakage occurred. If further held-out tuning experiments are required we will perform and report them.
revision: yes
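
The calibration check promised in the second response could be computed along the following lines. This is a dependency-free sketch, not the authors' evaluation code: the function names are assumptions, and the score and rating lists are hypothetical placeholders rather than data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # HYPOTHETICAL data: automated mismatch scores vs. human mismatch ratings.
    detector_scores = [0.05, 0.20, 0.45, 0.70, 0.95]
    human_ratings = [1.0, 2.0, 2.5, 4.0, 5.0]
    print("pearson :", pearson(detector_scores, human_ratings))
    print("spearman:", spearman(detector_scores, human_ratings))
```

Spearman is the more natural choice if the detector only needs to rank utterances by mismatch, since a level-based mapping depends on ordering rather than on a linear relationship.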

Circularity Check

0 steps flagged

No significant circularity; adaptive scheme relies on external mismatch detection

The paper's core proposal is an adaptive CFG rule that scales guidance strength according to a mismatch scalar produced by separate LLM or NLI models. This detection step sits outside the TTS generation equations and is not defined in terms of the CFG output or any fitted parameter within the model itself. The claimed improvement in expressiveness is presented as an empirical outcome of applying the rule, not as a quantity that reduces by construction to the inputs or to a self-citation chain. No self-definitional loops, fitted-input predictions, or ansatz smuggling via prior work appear in the described derivation. The approach therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that external mismatch detection is accurate enough to guide CFG adaptation usefully. No free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Large language models or natural language inference models can reliably detect and quantify the mismatch between emotion style prompts and text semantics.
This detection step is required to decide the level of CFG adaptation.

pith-pipeline@v0.9.0 · 5714 in / 1198 out tokens · 53013 ms · 2026-05-18T07:39:32.874441+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We propose adjusting the CFG scale based on the extent of mismatch to improve the robustness and naturalness of the synthesized speech.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: SMG-CFG Drop_Prompt_Filter ... assign CFG scales of 3.0, 2.5, and 2.0 to the [low, medium, high] levels

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.