ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning
Pith reviewed 2026-05-10 10:04 UTC · model grok-4.3
The pith
ClariCodec reformulates neural codec quantization as a stochastic policy and fine-tunes the encoder with word-error-rate rewards to cut WER at 200 bps while holding perceptual quality fixed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ClariCodec reformulates the quantization step inside a neural speech codec as a stochastic policy, allowing reinforcement learning to optimize the encoder directly for word error rate while the acoustic reconstruction pipeline stays frozen. On the LibriSpeech test-clean set the model achieves 3.68 percent WER at 200 bps without RL and 3.20 percent WER after RL fine-tuning; on test-other the corresponding figures are 8.93 percent after RL, a 13 percent relative reduction, all while perceptual quality is preserved.
What carries the argument
The stochastic-policy reformulation of quantization, which turns the discrete code selection into an action that can receive WER-based rewards during RL fine-tuning of the encoder alone.
If this is right
- 200-bps speech transmission can reach word error rates previously seen only at several times higher bitrates.
- Intelligibility and perceptual quality can be traded off by changing only the reward signal rather than retraining the whole model.
- The frozen reconstruction pipeline continues to produce usable audio even after the encoder is optimized for transcription accuracy.
- The method remains effective on both clean and noisy test partitions of LibriSpeech.
Where Pith is reading between the lines
- The same policy-reformulation trick could be applied to other low-rate modalities such as music or environmental sound where a task-specific metric replaces WER.
- Because only the encoder is updated, the approach may allow incremental adaptation of deployed codecs without redistributing the full decoder.
- Real-world channel impairments such as packet loss or fading could be folded directly into the RL reward to close the gap between lab WER and field performance.
Load-bearing premise
Fine-tuning only the encoder with WER-driven RL rewards while keeping the acoustic reconstruction pipeline frozen will improve intelligibility without introducing new artifacts or degrading the codec's core compression behavior.
What would settle it
Apply the same RL fine-tuning procedure and measure whether WER on held-out test sets fails to drop below the non-RL baseline or whether perceptual quality metrics fall below the non-RL level.
read the original abstract
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 200 bit per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 3.68% WER on the LibriSpeech test-clean set at 200 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.20% on test-clean and 8.93% on test-other, corresponding to a 13% relative reduction while preserving perceptual quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ClariCodec, a neural speech codec operating at 200 bps for bandwidth-constrained channels. It reformulates quantization as a stochastic policy to enable reinforcement learning (RL) fine-tuning of the encoder using WER-driven rewards, while keeping the acoustic reconstruction pipeline frozen. The abstract reports that the base model achieves 3.68% WER on LibriSpeech test-clean (competitive with higher-bitrate codecs), with RL yielding 3.20% on test-clean and 8.93% on test-other (13% relative reduction) while preserving perceptual quality.
Significance. If the reported WER reductions hold under rigorous controls, the work could advance ultra-low-bitrate speech coding by showing that RL can prioritize intelligibility over acoustic fidelity in codec design, which is relevant for satellite and underwater applications. Credit is due for grounding results in the public LibriSpeech benchmark with concrete WER figures on both test-clean and test-other partitions. However, significance cannot be assessed without the full experimental details.
major comments (1)
- [Abstract] Abstract: the central claim that RL fine-tuning of the encoder produces a 13% relative WER reduction (from 3.68% to 3.20%) while preserving perceptual quality is load-bearing for the paper's contribution, yet the abstract provides no information on the RL reward formulation, the ASR model used for WER computation, baseline comparisons, statistical significance, or ablations confirming that freezing the reconstruction pipeline does not introduce new artifacts or alter compression behavior.
Simulated Author's Rebuttal
We thank the referee for their detailed feedback on the abstract and for recognizing the potential significance of the work for ultra-low-bitrate applications. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that RL fine-tuning of the encoder produces a 13% relative WER reduction (from 3.68% to 3.20%) while preserving perceptual quality is load-bearing for the paper's contribution, yet the abstract provides no information on the RL reward formulation, the ASR model used for WER computation, baseline comparisons, statistical significance, or ablations confirming that freezing the reconstruction pipeline does not introduce new artifacts or alter compression behavior.
Authors: We agree that the abstract, constrained by typical length limits, does not elaborate on these methodological aspects. The manuscript body provides the RL formulation (quantization as a stochastic policy optimized with WER-based rewards), the ASR system employed for reward computation, comparisons against higher-bitrate codecs, and evaluations of perceptual quality. However, with only the abstract available in the current review materials, we cannot reference or quote the precise sections, model specifications, or ablation results. We will revise the abstract to incorporate a concise mention of the WER-driven RL fine-tuning and the use of perceptual quality metrics to support the claims, while noting that full details remain in the body. revision: partial
- Exact formulation and hyperparameters of the RL reward function
- Specific ASR model architecture, training data, and WER computation details
- Full baseline codec comparisons including exact bitrates and WER figures
- Statistical significance testing (e.g., p-values or confidence intervals) for the 13% relative WER reduction
- Ablation results confirming no new artifacts from freezing the acoustic reconstruction pipeline
Circularity Check
No significant circularity; empirical results only
full rationale
The abstract presents an empirical pipeline: a neural codec at 200 bps is trained, quantisation is reformulated as a stochastic policy, the encoder is fine-tuned via RL with WER rewards while the reconstruction pipeline stays frozen, and WER numbers (3.68% to 3.20% on test-clean) are reported on LibriSpeech. No equations, derivation steps, or first-principles claims appear. Consequently no self-definitional reduction, fitted-input-called-prediction, or self-citation load-bearing step can be identified or quoted. The reported improvements are standard experimental outcomes rather than quantities forced by construction from the inputs.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.