arxiv: 2604.14654 · v2 · submitted 2026-04-16 · 💻 cs.SD · eess.AS

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang

Pith reviewed 2026-05-10 10:04 UTC · model grok-4.3

classification 💻 cs.SD eess.AS

keywords neural speech codec · 200 bps · reinforcement learning · word error rate · intelligibility · low-bitrate communication · LibriSpeech · quantization policy

The pith

ClariCodec reformulates neural codec quantization as a stochastic policy and fine-tunes the encoder with word-error-rate rewards to cut WER at 200 bps while holding perceptual quality fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClariCodec, a neural speech codec designed for 200-bit-per-second transmission where intelligibility matters more than fine acoustic detail. It treats the quantization step as a learnable stochastic policy so that reinforcement learning can directly reward lower word error rates on the encoder, leaving the rest of the reconstruction pipeline untouched. Even before RL the model already reaches 3.68 percent WER on LibriSpeech test-clean, competitive with codecs that use several times more bits. After RL fine-tuning the WER drops to 3.20 percent on test-clean and 8.93 percent on test-other, a 13 percent relative improvement, with no reported loss in perceptual quality. The approach therefore shows that intelligibility can be optimized at extreme compression without retraining the entire codec.
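To give a sense of how tight a 200 bps budget is, the sketch below computes the bitrate of a frame-based codec and the payload of a short utterance. The frame rates and codebook sizes are hypothetical illustrations chosen only because they multiply out to 200 bps; they are not taken from the paper, which this page does not quote on those details.

```python
import math


def bitrate_bps(frame_rate_hz: float, codebook_sizes: list[int]) -> float:
    """Bits per second when every frame emits one index per codebook."""
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame


def payload_bytes(bitrate: float, seconds: float) -> float:
    """Bytes needed to carry `seconds` of speech at `bitrate` bits per second."""
    return bitrate * seconds / 8


# Two hypothetical configurations that both land on 200 bps (assumptions).
config_a = bitrate_bps(12.5, [65536])  # 12.5 frames/s, one 16-bit index per frame
config_b = bitrate_bps(25.0, [256])    # 25 frames/s, one 8-bit index per frame

# Ten seconds of speech at 200 bps fits in 250 bytes.
ten_second_payload = payload_bytes(200, 10)
```

Either way the codec has only a handful of bits per frame to spend, which is why the choice of what those bits encode dominates the design.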

Core claim

ClariCodec reformulates the quantization step inside a neural speech codec as a stochastic policy, allowing reinforcement learning to optimize the encoder directly for word error rate while the acoustic reconstruction pipeline stays frozen. On the LibriSpeech test-clean set the model achieves 3.68 percent WER at 200 bps without RL and 3.20 percent WER after RL fine-tuning, the pair of figures consistent with the reported 13 percent relative reduction. On test-other the abstract gives only the post-RL figure, 8.93 percent. Perceptual quality is reported as preserved.
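The 13 percent figure can be checked against the two test-clean numbers quoted above. This is plain arithmetic on the abstract's figures; the function name is ours.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# test-clean WER in percent, before and after RL fine-tuning (from the abstract)
clean_reduction = relative_reduction(3.68, 3.20)
print(f"{clean_reduction:.1%}")  # prints 13.0%
```

The same check cannot be run for test-other, because the abstract does not state the pre-RL WER on that partition.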

What carries the argument

The stochastic-policy reformulation of quantization, which turns the discrete code selection into an action that can receive WER-based rewards during RL fine-tuning of the encoder alone.
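A minimal sketch of what "quantization as a stochastic policy" can look like, assuming a softmax over negative squared distances to the codebook and a plain REINFORCE update with a scalar baseline. The abstract does not specify the policy parameterization, the RL algorithm, or the reward shaping, so every name and choice here is an illustrative assumption rather than the authors' method.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def code_logits(latent, codebook, temperature=1.0):
    """Assumed policy logits: negative squared distance to each code vector."""
    return [-sum((z - c) ** 2 for z, c in zip(latent, code)) / temperature
            for code in codebook]


def sample_code(probs, rng):
    """Draw one code index (the action) from the categorical policy."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_logit_grad(probs, action, reward, baseline):
    """Gradient of (reward - baseline) * log pi(action) w.r.t. the logits."""
    advantage = reward - baseline
    return [advantage * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]


rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy 3-entry codebook
latent = [0.9, 0.1]                              # toy encoder output
probs = softmax(code_logits(latent, codebook))
action = sample_code(probs, rng)
reward = -0.25  # hypothetical reward, e.g. negative utterance-level WER
grad = reinforce_logit_grad(probs, action, reward, baseline=-0.30)
```

In a real system the logit gradient would be backpropagated into the encoder by an autograd framework while the decoder weights stay fixed; deterministic nearest-neighbour quantization corresponds to the zero-temperature limit of this policy.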

If this is right

200-bps speech transmission can reach word error rates previously seen only at several times higher bitrates.
Intelligibility and perceptual quality can be traded off by changing only the reward signal rather than retraining the whole model.
The frozen reconstruction pipeline continues to produce usable audio even after the encoder is optimized for transcription accuracy.
The method remains effective on both the test-clean and the harder test-other partitions of LibriSpeech.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same policy-reformulation trick could be applied to other low-rate modalities such as music or environmental sound where a task-specific metric replaces WER.
Because only the encoder is updated, the approach may allow incremental adaptation of deployed codecs without redistributing the full decoder.
Real-world channel impairments such as packet loss or fading could be folded directly into the RL reward to close the gap between lab WER and field performance.

Load-bearing premise

Fine-tuning only the encoder with WER-driven RL rewards while keeping the acoustic reconstruction pipeline frozen will improve intelligibility without introducing new artifacts or degrading the codec's core compression behavior.

What would settle it

Apply the same RL fine-tuning procedure and measure whether WER on held-out test sets fails to drop below the non-RL baseline or whether perceptual quality metrics fall below the non-RL level.
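The check described above can be sketched as follows, assuming WER is computed as word-level edit distance over the total number of reference words and that perceptual quality is summarized by a single higher-is-better score. The paper's actual ASR model, text normalization, and quality metric are not given on this page, so the transcripts and the `tol` parameter below are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """Corpus WER: total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words


def rl_claim_survives(wer_base, wer_rl, quality_base, quality_rl, tol=0.0):
    """True when WER improved and quality fell by no more than `tol`."""
    return wer_rl < wer_base and quality_rl >= quality_base - tol


# Toy transcripts (hypothetical), standing in for ASR output on decoded audio.
base_pairs = [("the cat sat on the mat", "the bat sat on mat")]
rl_pairs = [("the cat sat on the mat", "the cat sat on mat")]
wer_base = corpus_wer(base_pairs)  # one substitution + one deletion over 6 words
wer_rl = corpus_wer(rl_pairs)      # one deletion over 6 words
```

The claim is falsified if `rl_claim_survives` returns False on held-out data with the real WER and quality measurements plugged in.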

Original abstract

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 200 bit per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 3.68% WER on the LibriSpeech test-clean set at 200 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.20% on test-clean and 8.93% on test-other, corresponding to a 13% relative reduction while preserving perceptual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClariCodec claims a 13% relative WER drop at 200 bps by RL-tuning the encoder for intelligibility while freezing reconstruction, but the abstract alone gives no way to verify that the RL step is what drives it.

The paper starts from the practical point that standard neural codecs at extreme compression waste bits on perceptual details, which hurts word recognition in channels like satellite or underwater links. The authors address this by treating quantization as a stochastic policy and applying reinforcement learning with WER as the direct reward, fine-tuning only the encoder. Without RL they already report 3.68% WER on LibriSpeech test-clean, which they position as competitive with higher-rate codecs, and the RL step brings it to 3.20% on test-clean and 8.93% on test-other while claiming perceptual quality holds up.

The concrete numbers and the decision to optimize for intelligibility rather than reconstruction are the parts that stand out as useful. The reformulation of quantization into a policy is a clear, if narrow, technical move that lets them run the RL loop. The abstract is straightforward about the pipeline and the target application, so the high-level idea is easy to grasp.

The soft spots are all tied to the missing details. We have no information on the baseline codecs, the ASR model used to generate the WER reward, statistical tests on the improvement, or any measurement that actually confirms perceptual quality is preserved rather than just asserted. It is possible the base encoder already delivers most of the gain and the RL adds little or introduces unmeasured side effects. Without those controls the central claim stays untestable.

This is the sort of work that matters to people building speech systems for severely constrained links. A reader working on low-bitrate codecs or RL in audio might pick up the policy-reformulation trick and try it, but only the full methods section would make it actionable. It deserves a serious referee because the problem is real and the approach is direct enough that reviewers can quickly check the experimental gaps and decide if the numbers hold.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces ClariCodec, a neural speech codec operating at 200 bps for bandwidth-constrained channels. It reformulates quantization as a stochastic policy to enable reinforcement learning (RL) fine-tuning of the encoder using WER-driven rewards, while keeping the acoustic reconstruction pipeline frozen. The abstract reports that the base model achieves 3.68% WER on LibriSpeech test-clean (competitive with higher-bitrate codecs), with RL yielding 3.20% on test-clean and 8.93% on test-other (13% relative reduction) while preserving perceptual quality.

Significance. If the reported WER reductions hold under rigorous controls, the work could advance ultra-low-bitrate speech coding by showing that RL can prioritize intelligibility over acoustic fidelity in codec design, which is relevant for satellite and underwater applications. Credit is due for grounding results in the public LibriSpeech benchmark with concrete WER figures on both test-clean and test-other partitions. However, significance cannot be assessed without the full experimental details.

major comments (1)

[Abstract] The central claim that RL fine-tuning of the encoder produces a 13% relative WER reduction (from 3.68% to 3.20%) while preserving perceptual quality is load-bearing for the paper's contribution, yet the abstract provides no information on the RL reward formulation, the ASR model used for WER computation, baseline comparisons, statistical significance, or ablations confirming that freezing the reconstruction pipeline does not introduce new artifacts or alter compression behavior.

Simulated Author's Rebuttal

1 response · 5 unresolved

We thank the referee for their detailed feedback on the abstract and for recognizing the potential significance of the work for ultra-low-bitrate applications. We address the major comment below.

Referee: [Abstract] The central claim that RL fine-tuning of the encoder produces a 13% relative WER reduction (from 3.68% to 3.20%) while preserving perceptual quality is load-bearing for the paper's contribution, yet the abstract provides no information on the RL reward formulation, the ASR model used for WER computation, baseline comparisons, statistical significance, or ablations confirming that freezing the reconstruction pipeline does not introduce new artifacts or alter compression behavior.

Authors: We agree that the abstract, constrained by typical length limits, does not elaborate on these methodological aspects. The manuscript body provides the RL formulation (quantization as a stochastic policy optimized with WER-based rewards), the ASR system employed for reward computation, comparisons against higher-bitrate codecs, and evaluations of perceptual quality. However, with only the abstract available in the current review materials, we cannot reference or quote the precise sections, model specifications, or ablation results. We will revise the abstract to incorporate a concise mention of the WER-driven RL fine-tuning and the use of perceptual quality metrics to support the claims, while noting that full details remain in the body. revision: partial

standing simulated objections not resolved

Exact formulation and hyperparameters of the RL reward function
Specific ASR model architecture, training data, and WER computation details
Full baseline codec comparisons including exact bitrates and WER figures
Statistical significance testing (e.g., p-values or confidence intervals) for the 13% relative WER reduction
Ablation results confirming no new artifacts from freezing the acoustic reconstruction pipeline
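One common way to address the statistical-significance objection listed above is a paired bootstrap over utterances. The sketch below assumes per-utterance word-error counts are available for both systems on the same test set; the function name, the resample count, and the toy counts are our assumptions, and nothing here reproduces an analysis from the paper.

```python
import random


def paired_bootstrap_wer_diff(errors_a, errors_b, ref_lengths,
                              n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for WER(a) - WER(b), resampling utterances with replacement."""
    if not (len(errors_a) == len(errors_b) == len(ref_lengths)):
        raise ValueError("inputs must be aligned per utterance")
    rng = random.Random(seed)
    n = len(ref_lengths)
    diffs = []
    for _ in range(n_resamples):
        # Use the same resampled utterances for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy per-utterance error counts (hypothetical): a = without RL, b = with RL.
errors_no_rl = [2, 3, 1, 4]
errors_rl = [1, 1, 0, 2]
ref_lengths = [10, 12, 8, 15]
ci_low, ci_high = paired_bootstrap_wer_diff(errors_no_rl, errors_rl, ref_lengths)
```

If the resulting interval excludes zero, the WER difference is unlikely to be a resampling artifact; on the real test sets this would need the per-utterance counts that only the full manuscript or released outputs could supply.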

Circularity Check

0 steps flagged

No significant circularity; empirical results only

The abstract presents an empirical pipeline: a neural codec at 200 bps is trained, quantisation is reformulated as a stochastic policy, the encoder is fine-tuned via RL with WER rewards while the reconstruction pipeline stays frozen, and WER numbers (3.68% to 3.20% on test-clean) are reported on LibriSpeech. No equations, derivation steps, or first-principles claims appear. Consequently no self-definitional reduction, fitted-input-called-prediction, or self-citation load-bearing step can be identified or quoted. The reported improvements are standard experimental outcomes rather than quantities forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the method is described as building on standard neural codec training plus RL.

pith-pipeline@v0.9.0 · 5465 in / 1191 out tokens · 59382 ms · 2026-05-10T10:04:14.851969+00:00 · methodology