NeuroRVQ: Multi-Scale Biosignal Tokenization for Generative Foundation Models

Alexandros Koliousis; Dario Farina; Dimitrios A. Adamos; Dimitrios Chalatsis; Konstantinos Barmpas; Na Lee; Nikolaos Laskaris; Stefanos Zafeiriou; William Raftery; Yannis Panagakis

arXiv: 2510.13068 · v4 · pith:CTB2TRVXnew · submitted 2025-10-15 · cs.LG · cs.AI · cs.HC

Konstantinos Barmpas, Na Lee, Dimitrios Chalatsis, William Raftery, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Alexandros Koliousis, Dario Farina, Stefanos Zafeiriou

Pith reviewed 2026-05-21 20:57 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.HC

keywords biosignal tokenization · residual vector quantization · foundation models · masked token prediction · EEG · ECG · EMG · multi-scale convolution

The pith

NeuroRVQ tokenization preserves high-frequency biosignal details across scales, letting simple masked-prediction models match or beat existing modality-specific foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NeuroRVQ as a family of modality-adaptive tokenizers that decompose biosignals such as EEG, ECG, and EMG into frequency-specific parts using multi-scale temporal convolutions. Each part feeds into hierarchical residual vector quantization codebooks while a phase-aware loss respects the circular nature of Fourier phase to keep reconstruction faithful. The authors then train straightforward masked-token foundation models on these tokens and report competitive or better results on downstream tasks than prior specialized models. This outcome points to tokenizer fidelity as the main driver of effective biosignal modeling rather than model scale or architecture alone. A sympathetic reader would care because better tokens could make general-purpose models viable across many physiological signals without heavy per-modality redesign.

Core claim

NeuroRVQ decomposes biosignals into frequency-specific representations via multi-scale temporal convolutions, each encoded into hierarchical RVQ codebooks to preserve high-frequency detail, combined with a novel phase-aware training loss that respects the circular topology of Fourier phase. By tuning temporal resolution, kernel sizes, and RVQ depth per modality, the tokenizer adapts to spectro-temporal characteristics. Training simple masked-token foundation models (NeuroRVQ-FM) on the resulting tokens yields competitive or superior downstream performance compared to existing modality-specific foundation models, showing that high-fidelity tokenization is a critical factor for effective biosignal modeling.

What carries the argument

NeuroRVQ, a modality-adaptive tokenizer that applies multi-scale temporal convolutions followed by hierarchical residual vector quantization and a phase-aware loss to produce high-fidelity discrete tokens from biosignals.
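
To make the quantization step concrete, the following is a minimal pure-Python sketch of residual vector quantization. It is not the paper's implementation: the codebooks are random rather than learned, the sizes (DIM, SIZE, DEPTH) are arbitrary, all function names are hypothetical, and the zero codeword placed in every codebook is an assumption added so that the reconstruction error cannot increase as stages are added.

```python
import random

def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy setup: 3 stages, 16 codewords per stage, 4-dimensional latents.
rng = random.Random(0)
DIM, SIZE, DEPTH = 4, 16, 3
codebooks = []
for stage in range(DEPTH):
    scale = 0.5 ** stage  # later stages model smaller residuals
    book = [[rng.gauss(0, scale) for _ in range(DIM)] for _ in range(SIZE)]
    book[0] = [0.0] * DIM  # assumption: a zero codeword, so a stage can "do nothing"
    codebooks.append(book)

latent = [rng.gauss(0, 1) for _ in range(DIM)]
tokens = rvq_encode(latent, codebooks)

# Squared reconstruction error when only the first d stages are decoded.
errors = [
    sq_error(latent, rvq_decode(tokens[:d], codebooks[:d]))
    for d in range(1, DEPTH + 1)
]
```

Each entry of `errors` uses one more stage than the previous one, which illustrates why RVQ depth is a natural knob for trading token count against reconstruction fidelity.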

If this is right

High-fidelity reconstruction directly improves accuracy on downstream biosignal classification and generation tasks.
The same masked-token prediction objective works across EEG, ECG, and EMG when paired with an appropriate NeuroRVQ tokenizer.
Model architecture can remain simple while still achieving strong results if the input tokens retain fine temporal and spectral structure.
Parameter adaptation per modality allows the tokenizer to match the distinct frequency content of each biosignal type.
Token quality matters more than model complexity for building effective biosignal foundation models.
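
The masked-token objective referred to in these points can be sketched in a few lines. This is a generic illustration under stated assumptions, not the NeuroRVQ-FM training code: the sentinel `MASK_ID`, the mask ratio of 0.5, the single token per position, and both function names are hypothetical.

```python
import random

MASK_ID = -1  # hypothetical sentinel; real token ids are assumed to be non-negative

def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of positions.

    Returns (inputs, targets): inputs has MASK_ID at the hidden positions,
    targets maps each hidden position to the original token id.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    inputs = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        inputs[pos] = MASK_ID
    return inputs, targets

def masked_accuracy(predictions, targets):
    """Fraction of hidden positions where the predicted id equals the original id."""
    hits = sum(1 for pos, tok in targets.items() if predictions[pos] == tok)
    return hits / len(targets)

rng = random.Random(0)
tokens = [rng.randrange(16) for _ in range(20)]  # stand-in for tokenizer output
inputs, targets = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
```

A model would be trained to predict `targets` from `inputs`; because the objective only sees token ids, the same loop applies to any modality once a tokenizer has produced the ids.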

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-scale RVQ approach could be tested on non-biosignal time series such as audio or sensor streams to check whether the fidelity benefit generalizes.
Replacing per-modality tuning with a single set of parameters shared across signals might reduce engineering effort if reconstruction quality remains high.
Adding NeuroRVQ tokens to larger transformer backbones could reveal whether the performance edge scales with model size.
The phase-aware loss component might improve other signal tasks that rely on accurate phase reconstruction, such as audio synthesis.

Load-bearing premise

That the specific choices of temporal resolution, kernel sizes, and RVQ depth can be tuned per modality to preserve high-frequency dynamics without introducing overfitting that would invalidate cross-modality comparisons.

What would settle it

Train the same NeuroRVQ-FM architecture on one modality using tokenizer parameters tuned for a different modality and measure whether downstream task performance falls below that of modality-specific baselines.

Original abstract

Biosignals such as electroencephalography (EEG), electrocardiography (ECG), and electromyography (EMG) encode physiological activity across multiple temporal and spectral scales, yielding representations that are rich but challenging for machine learning. Foundation models trained to predict masked signal tokens have shown promise in learning generalizable biosignal representations, yet their performance depends on the tokenizer's ability to preserve high-frequency dynamics and reconstruct signals with high fidelity. We introduce NeuroRVQ, a modality-adaptive biosignal tokenizer family designed for high-fidelity signal reconstruction. To capture the full frequency spectrum, NeuroRVQ decomposes biosignals into frequency-specific representations via multi-scale temporal convolutions, each encoded into hierarchical RVQ codebooks to preserve high-frequency detail, combined with a novel phase-aware training loss that respects the circular topology of Fourier phase. By tuning the temporal resolution, number and size of temporal kernels and RVQ depth, this design adapts to the spectro-temporal characteristics of each biosignal modality. To validate that tokenizer quality drives downstream performance, we train a simple masked-token foundation model for each modality (NeuroRVQ-FM) using the corresponding NeuroRVQ tokenizer. The NeuroRVQ-FM family achieves competitive or superior downstream performance compared to existing modality-specific foundation models, demonstrating that high-fidelity tokenization is a critical factor for effective biosignal modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NeuroRVQ gives a practical multi-scale RVQ tokenizer for biosignals with a phase-aware loss, but the downstream gains are not cleanly isolated from other model differences.

The main point is a new tokenizer family that decomposes biosignals with multi-scale temporal convolutions, feeds them into hierarchical residual vector quantizers, and uses a circular phase loss to keep high-frequency structure intact. They tune the scales, kernels, and codebook depths separately for EEG, ECG, and EMG, then train simple masked foundation models on top and report competitive or better results on downstream tasks than existing modality-specific models.

Referee Report

2 major / 2 minor

Summary. The paper introduces NeuroRVQ, a modality-adaptive family of biosignal tokenizers (for EEG, ECG, EMG and similar) that decompose signals via multi-scale temporal convolutions, encode them with hierarchical residual vector quantization (RVQ) codebooks to retain high-frequency detail, and employ a phase-aware loss respecting Fourier phase topology. Temporal resolution, kernel sizes, and RVQ depth are tuned per modality. The authors then train simple masked-token foundation models (NeuroRVQ-FM) on the resulting tokens and report competitive or superior downstream task performance relative to existing modality-specific foundation models, arguing that high-fidelity tokenization is the critical driver of effective biosignal modeling.

Significance. If the performance attribution to tokenizer fidelity can be isolated, the work would usefully highlight tokenizer design as a lever for biosignal foundation models and provide a concrete multi-scale RVQ construction with phase-aware training. The modality-adaptive approach and emphasis on reconstruction fidelity are constructive contributions to a growing area.

major comments (2)

[§4 (Experiments) and Abstract] The central claim that 'high-fidelity tokenization is a critical factor' is not yet load-bearing because the reported gains compare NeuroRVQ-FM against existing modality-specific models whose backbones, pretraining scale, masking strategies, and optimization details are not controlled. No ablation is described that holds the foundation-model architecture fixed and swaps only the tokenizer (NeuroRVQ versus single-scale VQ or spectrogram baselines). Without such isolation, downstream differences cannot be attributed to the proposed tokenization rather than confounding design choices.
[§3 (Method)] The per-modality tuning of temporal resolution, kernel sizes, and RVQ depth is presented as necessary to preserve high-frequency dynamics. This tuning, however, risks that observed gains arise from modality-specific hyperparameter optimization rather than the general multi-scale RVQ principle, weakening the cross-modality generalization argument. The manuscript should report how these hyperparameters were selected and whether they transfer across modalities.

minor comments (2)

[Abstract] The phrase 'competitive or superior' would be strengthened by explicit reference to the quantitative tables or figures that support it.
[§3] Notation: Define the precise form of the phase-aware loss (e.g., its mathematical expression) at first use to avoid ambiguity with standard reconstruction losses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects regarding the attribution of performance gains and the generalizability of our approach. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [§4 (Experiments) and Abstract] The central claim that 'high-fidelity tokenization is a critical factor' is not yet load-bearing because the reported gains compare NeuroRVQ-FM against existing modality-specific models whose backbones, pretraining scale, masking strategies, and optimization details are not controlled. No ablation is described that holds the foundation-model architecture fixed and swaps only the tokenizer (NeuroRVQ versus single-scale VQ or spectrogram baselines). Without such isolation, downstream differences cannot be attributed to the proposed tokenization rather than confounding design choices.

Authors: We agree that isolating the effect of the tokenizer by holding the foundation model architecture fixed would provide more direct evidence for our claim. In the current work, NeuroRVQ-FM employs a straightforward masked token prediction setup, and its competitive performance against more elaborate existing models suggests the importance of high-fidelity tokenization. However, to rigorously address this, we will include a new ablation study in the revised manuscript. This study will fix the foundation model backbone, pretraining scale, and optimization details, and compare NeuroRVQ against single-scale VQ and spectrogram tokenization baselines on the same downstream tasks. We believe this addition will make the central claim more load-bearing.
revision: yes

Referee: [§3 (Method)] The per-modality tuning of temporal resolution, kernel sizes, and RVQ depth is presented as necessary to preserve high-frequency dynamics. This tuning, however, risks that observed gains arise from modality-specific hyperparameter optimization rather than the general multi-scale RVQ principle, weakening the cross-modality generalization argument. The manuscript should report how these hyperparameters were selected and whether they transfer across modalities.

Authors: The hyperparameters for each modality were chosen based on the characteristic frequency content and temporal dynamics of the biosignals, guided by prior literature and validated through reconstruction quality metrics on held-out data. For instance, modalities with richer high-frequency components like EMG use smaller kernel sizes and higher temporal resolutions. To address concerns about transferability, we will add experiments in the revision demonstrating cross-modality application of these hyperparameters and report the selection criteria explicitly in Section 3. This will clarify that the multi-scale RVQ principle is general while allowing modality-specific adaptations for optimal performance.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical design and validation chain is self-contained

The paper introduces NeuroRVQ via multi-scale convolutions, hierarchical RVQ, and phase-aware loss, then trains per-modality masked-token FMs and reports competitive downstream results. No equations, derivations, or self-citations are present that reduce the performance claims to fitted parameters or prior author results by construction. The central demonstration relies on external empirical comparison to existing modality-specific models rather than internal reduction. This matches the default expectation of no significant circularity (score 0-2) for papers whose claims rest on experimental outcomes against independent benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axioms · 0 invented entities

The design rests on several tunable hyperparameters chosen per modality and the domain assumption that Fourier phase requires special circular treatment; no new physical entities are postulated.

free parameters (3)

temporal resolution: tuned per biosignal modality to capture spectro-temporal characteristics
number and size of temporal kernels: chosen to decompose signals into frequency-specific representations
RVQ depth: adjusted to preserve high-frequency detail in hierarchical codebooks
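
Why kernel size acts as a frequency selector can be seen in a small sketch. Everything here is an illustrative assumption rather than the paper's configuration: box (moving-average) kernels stand in for learned temporal filters, and the 256 Hz sampling rate, the 4 Hz and 60 Hz components, and the kernel sizes (3, 7, 15) are made up for the demonstration.

```python
import math

def conv1d_same(signal, kernel):
    """Same-length 1D cross-correlation with zero padding (odd kernel length).

    For the symmetric kernels used here this equals convolution.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = t + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def box_kernel(size):
    """Moving-average kernel; a stand-in for a learned temporal filter."""
    return [1.0 / size] * size

def multi_scale_branches(signal, kernel_sizes):
    """One filtered copy of the signal per kernel size."""
    return {k: conv1d_same(signal, box_kernel(k)) for k in kernel_sizes}

def energy(x):
    return sum(v * v for v in x) / len(x)

# Hypothetical one-second test signal: slow 4 Hz wave plus a weaker 60 Hz wave.
FS = 256
times = [n / FS for n in range(FS)]
slow = [math.sin(2 * math.pi * 4 * t) for t in times]
fast = [0.5 * math.sin(2 * math.pi * 60 * t) for t in times]
signal = [s + f for s, f in zip(slow, fast)]

KERNEL_SIZES = (3, 7, 15)
branches = multi_scale_branches(signal, KERNEL_SIZES)

# Filtering is linear, so filtering each component alone shows what a branch keeps.
fast_energy = {k: energy(conv1d_same(fast, box_kernel(k))) for k in KERNEL_SIZES}
slow_energy = {k: energy(conv1d_same(slow, box_kernel(k))) for k in KERNEL_SIZES}
```

In this toy setting the smallest kernel retains the most 60 Hz energy and the largest kernel retains mostly the 4 Hz component, which is the intuition behind choosing different kernel sizes for modalities with different frequency content.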

axioms (1)

domain assumption: Fourier phase has circular topology that must be respected by the training loss (invoked to justify the novel phase-aware loss term)
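
The axiom can be illustrated with a short sketch comparing a loss that treats phase as an ordinary real number with one that respects its circular topology. This is a generic `1 - cos` phase loss chosen for illustration, not the paper's exact loss, whose precise form (including any penalization term and its weight) is not given on this page; both function names are hypothetical.

```python
import math

def naive_phase_loss(pred, target):
    """Mean squared difference, treating phase as an ordinary real number."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def circular_phase_loss(pred, target):
    """Mean of 1 - cos(pred - target).

    Zero exactly when the phases agree modulo 2*pi, maximal (2) when they
    are half a turn apart.
    """
    return sum(1.0 - math.cos(p - t) for p, t in zip(pred, target)) / len(pred)

# Two phases that sit on opposite sides of the +pi / -pi wrap-around point.
# On the circle they are only 0.02 rad apart.
target = [math.pi - 0.01]
pred = [-math.pi + 0.01]

naive = naive_phase_loss(pred, target)        # large: sees almost a full turn of error
circular = circular_phase_loss(pred, target)  # tiny: sees the 0.02 rad gap

# Shifting a phase by a full turn leaves the circular loss unchanged.
shifted = circular_phase_loss([0.3], [0.3 + 2 * math.pi])
```

The naive loss would push the prediction almost a full turn in the wrong direction, whereas the circular loss treats the two phases as nearly identical, which is the behaviour a phase-aware reconstruction objective needs.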

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: multi-scale temporal convolutions with varying kernel sizes... hierarchical residual vector quantization (RVQ) codebooks... unit-circle-aware phase loss L_unit-loss = 1 - cosine similarity + λ_circle penalization

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: NEURORVQ achieves lower reconstruction error... competitive or superior downstream performance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis
cs.AI · 2026-04 · unverdicted · novelty 6.0

LLM-based refinement of edges in transformer-constructed EEG graphs improves seizure detection accuracy and produces cleaner, more interpretable structures on the TUSZ dataset.