An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

Aghilas Sini; Salima Mdhaffar; Tianhui Su; Tien-Ping Tan; Yannick Estève

arxiv: 2604.12438 · v1 · submitted 2026-04-14 · 📡 eess.AS

Pith reviewed 2026-05-10 14:28 UTC · model grok-4.3

classification 📡 eess.AS

keywords streaming speech synthesis · ultra-low latency TTS · block-wise generation · neural audio codec · residual vector quantization · non-autoregressive · depth-wise decoding · FastSpeech 2

The pith

A block-wise non-autoregressive system directly models 32-layer Mimi codec codes to reach 49 ms speech synthesis latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an end-to-end streaming speech synthesis architecture that generates audio in blocks while decoding the layers of a neural audio codec progressively. Conventional pipelines rely on heavy vocoders and regression models that create latency and smoothing artifacts. By working directly in the discrete latent space, the design removes the need for temporal autoregression and targets interactive applications where delay must remain imperceptible. Experiments on English and Malay data show faster inference and better voicing accuracy than cascaded baselines.

Core claim

The architecture integrates a modified FastSpeech 2 backbone with progressive depth-wise sequential decoding to dynamically condition 32 layers of residual vector quantization codes in block-wise fashion, eliminating temporal autoregressive overhead while resolving phonetic alignment degradation and spectral over-smoothing.

What carries the argument

Block-wise generation combined with progressive depth-wise sequential decoding of 32 residual vector quantization layers from the Mimi codec, conditioned on text via a modified FastSpeech 2 backbone.
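The block-wise half of this mechanism can be pictured with a minimal sketch. It is not the authors' implementation: `stream_blocks`, `toy_predict_frame`, the block size, and the toy codebook size are all hypothetical, and the predictor is a deterministic stand-in for the real model. The only figure taken from the text is the depth of 32 RVQ layers.

```python
from typing import Callable, Iterator, List

NUM_LAYERS = 32            # RVQ depth stated in the text
TOY_CODEBOOK_SIZE = 2048   # assumption for illustration only

Frame = List[int]          # one code index per RVQ layer


def stream_blocks(
    num_frames: int,
    block_size: int,
    predict_frame: Callable[[int], Frame],
) -> Iterator[List[Frame]]:
    """Yield codec frames block by block instead of waiting for the whole utterance."""
    for start in range(0, num_frames, block_size):
        end = min(start + block_size, num_frames)
        # Frames in a block do not depend on earlier output frames
        # (no temporal autoregression), so a real model could batch them.
        yield [predict_frame(t) for t in range(start, end)]


def toy_predict_frame(t: int) -> Frame:
    """Hypothetical stand-in for the model: deterministic fake codes."""
    return [(t * 31 + layer) % TOY_CODEBOOK_SIZE for layer in range(NUM_LAYERS)]


blocks = list(stream_blocks(num_frames=10, block_size=4, predict_frame=toy_predict_frame))
block_lengths = [len(b) for b in blocks]
```

Because the first block can be handed to the codec decoder as soon as it is ready, time-to-first-byte depends on the cost of one block rather than on utterance length.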

If this is right

The system achieves an average time-to-first-byte latency of 48.99 milliseconds, below the human perception threshold.
It delivers a 10.6-fold acceleration over conventional cascaded pipelines.
Quantitative gains appear in fundamental voicing accuracy and high-frequency spectral fidelity.
The approach supports language-independent use on English and Malay datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The block-wise design could support instantaneous voice interfaces in live translation or virtual assistants.
Adjusting block size might allow further tuning for mobile or low-power hardware.
The discrete-code modeling strategy may extend to other audio generation tasks such as music synthesis.

Load-bearing premise

Directly modeling the 32-layer residual vector quantization codes via block-wise generation and progressive depth-wise sequential decoding will resolve phonetic alignment degradation and avoid spectral over-smoothing without any temporal autoregressive component.

What would settle it

A deployed real-time test measuring actual time-to-first-byte latency alongside a blind listening evaluation of phonetic accuracy and naturalness compared to cascaded baselines.
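A latency test of this kind could start from something like the sketch below, which times the arrival of the first chunk from any streaming source. `measure_ttfb_ms` and `fake_synthesizer` are hypothetical names, and the sleep and chunk size are placeholders rather than values from the paper.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb_ms(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (milliseconds until the first chunk arrives, total bytes received)."""
    start = time.perf_counter()
    ttfb_ms = float("nan")  # stays NaN if the stream is empty
    total = 0
    for i, chunk in enumerate(chunks):
        if i == 0:
            ttfb_ms = (time.perf_counter() - start) * 1000.0
        total += len(chunk)
    return ttfb_ms, total


def fake_synthesizer(n_chunks: int, delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine."""
    for _ in range(n_chunks):
        time.sleep(delay_s)    # pretend to compute one block
        yield b"\x00" * 3840   # placeholder chunk size, not from the paper


ttfb_ms, total_bytes = measure_ttfb_ms(fake_synthesizer(n_chunks=3, delay_s=0.01))
```

In a real evaluation the generator would wrap the deployed system, the measurement would be repeated over many utterances, and the distribution (not only the mean) would be reported.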

Original abstract

Real-time speech synthesis requires balancing inference latency and acoustic fidelity for interactive applications. Conventional continuous text-to-speech pipelines require computationally intensive neural vocoders to reconstruct phase information, creating a significant streaming bottleneck. Furthermore, regression-based acoustic modeling frequently induces spectral over-smoothing artifacts. To address these limitations, this paper proposes a novel end-to-end non-autoregressive architecture optimized for ultra-low latency block-wise generation, directly modeling the highly compressed discrete latent space of the Mimi neural audio codec. Integrating a modified FastSpeech 2 backbone with a progressive depth-wise sequential decoding strategy, the architecture dynamically conditions 32 layers of residual vector quantization codes. This mechanism resolves phonetic alignment degradation and manages the complexity of high-fidelity discrete representations without temporal autoregressive overhead. Experimental evaluations on English and Malay datasets validate its language-independent deployment capability. Compared to conventional continuous regression models, the proposed architecture demonstrates quantitative improvements in fundamental voicing accuracy and mitigates high-frequency spectral degradation. It achieves ultra-low latency inference, translating to a 10.6-fold absolute acceleration over conventional cascaded pipelines. Crucially, the system achieves an average time-to-first-byte latency of 48.99 milliseconds, falling significantly below the human perception threshold for real-time interactive streaming. These results firmly establish the proposed architecture as a highly optimized solution for deploying real-time streaming speech interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims a 49 ms time-to-first-byte (TTFB) for streaming TTS via block-wise generation plus depth-wise decoding on Mimi's 32 RVQ layers, but the abstract supplies no baselines, ablations, or timing breakdowns to support it.

The main takeaway is a claimed 48.99 ms average time-to-first-byte for an end-to-end streaming TTS system. It uses a modified FastSpeech 2 backbone to generate blocks of discrete codes from the Mimi codec and then decodes those codes progressively across the 32 residual vector quantization layers. This is positioned as a way to skip continuous regression and heavy vocoders while keeping latency low enough for interactive use, with reported gains in voicing accuracy and reduced high-frequency artifacts on English and Malay data, plus a 10.6x speedup over cascaded baselines.

Referee Report

2 major / 1 minor

Summary. The paper proposes a novel end-to-end non-autoregressive streaming speech synthesis architecture that performs block-wise generation while directly modeling the 32-layer residual vector quantization (RVQ) codes of the Mimi neural audio codec via a modified FastSpeech 2 backbone and progressive depth-wise sequential decoding. It claims to eliminate neural vocoders and temporal autoregression, resolve phonetic alignment degradation and spectral over-smoothing, and deliver an average time-to-first-byte (TTFB) latency of 48.99 ms (below the human perception threshold) together with a 10.6-fold acceleration over conventional cascaded pipelines, with supporting quantitative gains in voicing accuracy on English and Malay data.

Significance. If the latency and quality claims are substantiated by rigorous experiments, the work would be significant for real-time interactive applications because it replaces continuous regression and vocoder stages with direct discrete-code modeling and block-wise non-autoregressive generation. The language-independent deployment and avoidance of temporal autoregression are attractive, but the absence of any timing breakdowns, ablations, or statistical validation in the reported results limits the immediate assessed impact.

major comments (2)

[Abstract / Experimental evaluations] The central claim of 48.99 ms average TTFB latency rests on the assumption that block-wise generation plus progressive depth-wise sequential decoding of all 32 Mimi RVQ layers remains fast enough to meet the threshold. Because the decoding is explicitly sequential (layer 1 conditions layer 2, etc.), any non-negligible per-layer cost compounds inside each block; the manuscript supplies no per-component timing breakdown, block-size ablation, or comparison against a parallel depth decoder, so it is impossible to confirm that the sequential dependency does not push TTFB above the claimed figure.
[Abstract] The reported quantitative improvements in fundamental voicing accuracy, mitigation of high-frequency spectral degradation, and 10.6-fold acceleration are stated without any baseline models, dataset sizes, error bars, ablation results, or statistical tests. This absence makes it impossible to determine whether the data actually support the central claims of superiority over conventional continuous regression pipelines.
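The compounding concern in the first comment can be made concrete with a back-of-envelope budget. The sketch assumes a simple additive model, fixed cost plus 32 sequential depth steps, which is not taken from the paper; `ASSUMED_FIXED_MS` is a hypothetical placeholder for everything outside the depth loop (backbone, codec decoder, I/O). Only the 48.99 ms figure and the 32 layers come from the abstract.

```python
def per_layer_budget_ms(ttfb_ms: float, fixed_ms: float, num_layers: int) -> float:
    """Time each sequential depth step may take if everything else costs fixed_ms."""
    return (ttfb_ms - fixed_ms) / num_layers


def ttfb_estimate_ms(fixed_ms: float, layer_ms: float, num_layers: int) -> float:
    """Additive model: fixed cost plus num_layers sequential depth steps."""
    return fixed_ms + num_layers * layer_ms


REPORTED_TTFB_MS = 48.99   # from the abstract
NUM_LAYERS = 32            # from the abstract
ASSUMED_FIXED_MS = 20.0    # hypothetical, not reported in the text

# Upper bound if nothing else cost any time: 48.99 / 32 = 1.5309375 ms per layer.
upper_bound = per_layer_budget_ms(REPORTED_TTFB_MS, 0.0, NUM_LAYERS)

# With the assumed 20 ms fixed cost: (48.99 - 20) / 32 = 0.9059375 ms per layer.
budget = per_layer_budget_ms(REPORTED_TTFB_MS, ASSUMED_FIXED_MS, NUM_LAYERS)
```

Under this model every depth step has at most about 1.5 ms, and less once any fixed cost is counted, which is why a per-component timing breakdown matters for judging the claim.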

minor comments (1)

[Abstract] The abstract refers to 'English and Malay datasets' without naming the corpora or reporting their sizes, which hinders reproducibility and assessment of language-independence claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our latency measurements and experimental rigor. We address each major comment below and will incorporate revisions to provide the requested details and validations.

Referee: [Abstract / Experimental evaluations] The central claim of 48.99 ms average TTFB latency rests on the assumption that block-wise generation plus progressive depth-wise sequential decoding of all 32 Mimi RVQ layers remains fast enough to meet the threshold. Because the decoding is explicitly sequential (layer 1 conditions layer 2, etc.), any non-negligible per-layer cost compounds inside each block; the manuscript supplies no per-component timing breakdown, block-size ablation, or comparison against a parallel depth decoder, so it is impossible to confirm that the sequential dependency does not push TTFB above the claimed figure.

Authors: We acknowledge that the manuscript currently lacks an explicit per-component timing breakdown, block-size ablation study, or direct comparison to a parallel depth decoder. The reported 48.99 ms TTFB is an end-to-end measurement on the target hardware that already includes the full sequential depth-wise process across all 32 layers. To strengthen the claim, we will add a dedicated timing analysis subsection with (i) per-stage breakdowns (block generation vs. each depth-wise layer decode), (ii) ablation on block sizes, and (iii) a comparison of sequential vs. parallel depth decoding where feasible. These additions will allow readers to verify that the per-layer overhead remains negligible and the overall latency stays below the human perception threshold.
revision: yes
Referee: [Abstract] The reported quantitative improvements in fundamental voicing accuracy, mitigation of high-frequency spectral degradation, and 10.6-fold acceleration are stated without any baseline models, dataset sizes, error bars, ablation results, or statistical tests. This absence makes it impossible to determine whether the data actually support the central claims of superiority over conventional continuous regression pipelines.

Authors: The current manuscript states comparisons against conventional continuous regression pipelines and reports gains in voicing accuracy on English and Malay data, but we agree that the presentation is insufficiently detailed. In the revised version we will expand the experimental section to explicitly name the baseline models, report exact dataset sizes and splits, include error bars from multiple random seeds, provide ablation results on the depth-wise decoding component, and add statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the voicing accuracy and latency improvements. These changes will make the quantitative claims fully verifiable.
revision: yes
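A paired significance test of the kind promised here can be sketched with the standard library alone. The rebuttal names paired t-tests or Wilcoxon tests; the sketch swaps in a sign-flip permutation test because it needs no distribution tables. The function name is hypothetical and the scores are synthetic, invented purely for illustration; they are not results from the paper.

```python
import random
from typing import Sequence


def paired_permutation_pvalue(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each difference is equally likely to be +d or -d.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# SYNTHETIC per-utterance voicing accuracies (not from the paper).
proposed = [0.91, 0.93, 0.90, 0.94, 0.92, 0.95, 0.93, 0.92]
baseline = [0.88, 0.90, 0.89, 0.90, 0.89, 0.91, 0.90, 0.90]
p_value = paired_permutation_pvalue(proposed, baseline)
```

Pairing by utterance matters: both systems should be scored on the same test sentences so that sentence difficulty cancels out in the differences.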

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by direct experiments

The paper proposes a block-wise non-autoregressive TTS architecture that directly models Mimi codec RVQ codes with progressive depth-wise decoding. All performance claims (48.99 ms TTFB, 10.6× speedup, voicing accuracy gains) are presented as outcomes of experimental evaluations on English and Malay datasets rather than as predictions derived from equations or first-principles arguments. No equations, fitted-parameter predictions, or self-citation chains appear in the provided text that would reduce the reported metrics to the architecture definition by construction. The central latency and quality results therefore remain independent experimental observations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the established Mimi codec and FastSpeech 2 backbone plus the new progressive depth-wise decoding strategy; no new physical entities or free parameters are introduced in the abstract.

axioms (2)

domain assumption: Non-autoregressive modeling of discrete codec codes can achieve lower latency than autoregressive or continuous regression pipelines while preserving fidelity.
Invoked in the motivation and architecture description to justify the design.
ad hoc to paper: Progressive depth-wise decoding of 32 RVQ layers resolves phonetic alignment issues without temporal autoregression.
Core mechanism claimed to manage complexity of high-fidelity discrete representations.

pith-pipeline@v0.9.0 · 5560 in / 1460 out tokens · 38469 ms · 2026-05-10T14:28:52.701414+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Introduction Speech synthesis, or text-to-speech (TTS) conversion, is the artificial production of human speech from text or phonological representations. The increasing demand for real-time interactive applications has prioritized the development of streaming synthesis systems capable of immediate responsiveness (Jampala et al., 2024; Kaur & Singh, 2023)...

work page 2024
[2]

have substantially accelerated initial feature generation, the inherent reliance on continuous Mel-spectrograms presents fundamental structural limitations. These continuous representations discard critical phase information, unconditionally necessitating computationally heavy phase-estimation networks to synthesize the final audio (Ueno & Kawahara, 2022)...

work page 2022
[3]

Li et al., 2019), established a high standard for prosodic naturalness by utilizing sequence-to-sequence attention mechanisms

and Transformer TTS (N. Li et al., 2019), established a high standard for prosodic naturalness by utilizing sequence-to-sequence attention mechanisms. However, these models suffered from inherent inference latency due to their frame-by-frame generation process, which exhibits an 𝑂(𝑁) computational complexity scaling linearly with the output sequence lengt...

work page 2019
[4]

From a systems engineering perspective, this traditional approach presents critical limitations for practical deployment

or WaveGlow (Prenger et al., 2019)) for phase estimation and waveform reconstruction. From a systems engineering perspective, this traditional approach presents critical limitations for practical deployment. Specifically, the absolute dependency on heavy neural vocoders introduces substantial computational overhead. State-of-the-art vocoders rely on dense...

work page 2019
[5]

Additionally, the acoustic modeling of continuous features via standard regression objectives fundamentally suffers from spectral over-smoothing artifacts

frequently rely on flow-matching algorithms where inference speed remains tightly bound to the numerical integration cost of ordinary differential equation solvers, complicating strict ultra-low latency deployments. Additionally, the acoustic modeling of continuous features via standard regression objectives fundamentally suffers from spectral over-smooth...

work page 2025
[6]

time-to-first-byte

and NaturalSpeech 2 (K. Shen et al., 2024), their iterative sampling procedures incur prohibitive computational costs. These extensive execution delays fundamentally violate the strict algorithmic latency requirements demanded by real-time streaming speech synthesis. 2.2 Streaming TTS Architectures Real-time conversational agents require streaming archite...

work page 2024
[7]

Subsequent structural innovations sought to bypass autoregression entirely

enforced monotonic alignments to favor continuous output, albeit still suffering from temporal generation delays (Zhang et al., 2024). Subsequent structural innovations sought to bypass autoregression entirely. For instance, SyncSpeech (Sheng et al.,

work page 2024
[8]

As generative paradigms advanced, architectures such as StreamVITS (Bai et al., 2025; Kim et al.,

utilized dual-stream transformer mechanisms to synchronize acoustic feature generation without excessive buffering. As generative paradigms advanced, architectures such as StreamVITS (Bai et al., 2025; Kim et al.,

work page 2025
[9]

Nevertheless, the cascaded vocoder remains a major impediment to pure streaming efficiency

implemented causal flow-based generation to adapt continuous latent variables explicitly for real-time output. Nevertheless, the cascaded vocoder remains a major impediment to pure streaming efficiency. Continuous feature vocoders (like CNN-based HiFi-GAN) require overlapping acoustic frames (receptive fields) to ensure phase continuity at the chunk bound...

work page 2019
[10]

utilizes block-wise zero-shot synthesis. However, temporal autoregressive token generation inevitably introduces processing overhead, and block-wise mechanisms frequently remain reliant on intermediate continuous alignment features. Consequently, there is a compelling structural need for unified, end-to-end architectures that can map linguistic streams ...

work page 2017
[11]

These codecs utilize Residual Vector Quantization (RVQ), which cascades multiple quantizer layers to encode audio hierarchically

have significantly improved the mitigation of quantization artifacts. These codecs utilize Residual Vector Quantization (RVQ), which cascades multiple quantizer layers to encode audio hierarchically. The initial layers capture primary semantic and prosodic structures, while deeper layers progressively encode fine-grained acoustic textures and high-frequen...

work page 2025
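Residual vector quantization as described in this excerpt can be shown in a few lines. The sketch below is generic RVQ on a toy 2-D vector, not the Mimi codec: the three codebooks and their values are hypothetical, chosen so that the first layer is coarse and later layers are finer.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x layer by layer; each layer encodes what the previous ones missed."""
    residual = x
    codes = []
    for cb in codebooks:
        k = nearest(cb, residual)
        codes.append(k)
        residual = tuple(r - c for r, c in zip(residual, cb[k]))
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruction is the sum of the selected entries across layers."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return tuple(out)


# Toy codebooks with hypothetical values: coarse first, finer later.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],
    [(0.0, 0.0), (0.0625, 0.0), (0.0, -0.0625)],
]
x = (1.3, 0.9)
codes = rvq_encode(x, codebooks)      # [1, 1, 2]
x_hat = rvq_decode(codes, codebooks)  # (1.25, 0.9375)
```

Decoding with only the first one or two codes gives a coarser reconstruction, which mirrors the excerpt's point that early layers carry the primary structure and deeper layers add detail.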
[12]

These models demonstrate unprecedented zero-shot synthesis capabilities and excellent mitigation of acoustic over-smoothing

autoregressively predict codec tokens utilizing large transformer architectures. These models demonstrate unprecedented zero-shot synthesis capabilities and excellent mitigation of acoustic over-smoothing. However, because these autoregressive architectures rely on strict temporal sequential generation, frequently flattening the extended time axis and t...

work page 2024
[13]

Wang et al., 2024), rely on iterative masked token modeling

or MaskGCT (Y. Wang et al., 2024), rely on iterative masked token modeling. While these architectures achieve parallel generation, their iterative refinement steps still consume significant execution time. Currently, non-iterative and fully parallel streaming architectures capable of modeling deep multi-layered discrete targets without compromising infer...

work page 2024
[14]

for the 𝑖-th quantizer is recursively defined as: ℎ*,

Rather than predicting all quantization layers naively in parallel, an approach that entirely disregards critical inter-layer acoustic dependencies, the proposed decoder predicts the discrete tokens sequentially across the quantization depth for each individual temporal frame. Let ℎ* denote the base acoustic hidden state generated by the non-autoregres...

work page 2048
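The recursion itself is cut off in this excerpt, so the following is only a guess at its general shape, not the paper's definition. It assumes that the state for each layer is the previous state plus an embedding of the token just chosen; `depthwise_decode`, `heads`, and `embeds` are hypothetical names, and the weights are toy values.

```python
from typing import List, Sequence

Vec = List[float]


def argmax(xs: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(xs)), key=lambda k: xs[k])


def depthwise_decode(
    h_base: Vec,
    heads: Sequence[Sequence[Vec]],
    embeds: Sequence[Sequence[Vec]],
) -> List[int]:
    """Predict one token per RVQ layer for a single frame, top layer first.

    heads[i][k]  : weight vector scoring code k at layer i   (hypothetical)
    embeds[i][k] : embedding fed back after choosing code k  (hypothetical)
    """
    h = list(h_base)
    codes = []
    for head, emb in zip(heads, embeds):
        logits = [sum(w * x for w, x in zip(row, h)) for row in head]
        k = argmax(logits)
        codes.append(k)
        # ASSUMED update rule: add the chosen token's embedding to the state,
        # so the next layer is conditioned on every earlier choice.
        h = [x + e for x, e in zip(h, emb[k])]
    return codes


identity_head = [[1.0, 0.0], [0.0, 1.0]]
heads = [identity_head, identity_head]
embeds = [[[-1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
no_feedback = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]

codes_sequential = depthwise_decode([1.0, 0.0], heads, embeds)       # [0, 1]
codes_parallel = depthwise_decode([1.0, 0.0], heads, no_feedback)    # [0, 0]
```

With the feedback zeroed out, every layer sees the same base state, which corresponds to the naive parallel prediction the excerpt argues against; with feedback, the second layer's choice changes.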
[15]

For the English benchmark, we utilize the standard LJSpeech corpus encompassing approximately 24 hours of high-fidelity audio (Ito & Johnson, 2017)

Experiments and Results 4.1 Experimental Setup 4.1.1 Dataset and Preprocessing The proposed streaming architecture is rigorously evaluated utilizing two distinct single-speaker corpora to validate its language independent structural robustness. For the English benchmark, we utilize the standard LJSpeech corpus encompassing approximately 24 hours of high-f...

work page 2017
[16]

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed

Conclusion and Future Work 5.1 Conclusion This study introduces a highly efficient and end-to-end streaming architecture explicitly designed to overcome the computational bottlenecks and spectral degradation inherent in continuous acoustic modeling. By mapping linguistic features directly into a highly compressed discrete hierarchical latent space, the ar...

work page doi:10.21437/interspeech.2022-489 2025
