Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Shenghao Ding

arXiv: 2606.02631 · v1 · pith:JS4AAYGMnew · submitted 2026-05-30 · eess.AS · cs.AI · cs.CV · cs.LG · cs.SD

Pith reviewed 2026-06-28 18:03 UTC · model grok-4.3

classification: eess.AS, cs.AI, cs.CV, cs.LG, cs.SD

keywords: wavelet tokenizer, shared token schema, Haar DWT, natural signals, sparse tokens, continuous tokens, audio image video

The pith

A single wavelet token schema can represent audio, images, and video with competitive reconstruction quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether audio, images, and video can use one common wavelet token schema instead of separate modality-specific latent grids. It builds a preliminary continuous-token model around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017, the dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep shows visual gains are not explained solely by latent capacity, additive metadata is not universally helpful, and fixed-rate energy selection improves PSNR by over 15 dB versus uniform selection. Masked sparse training reaches 34.45 dB video PSNR at 50 percent token density, supporting a unified schema and sparse interface without yet defining a universal discrete vocabulary.
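
For readers unfamiliar with the frontend, here is a minimal sketch of a one-level orthonormal Haar DWT/IDWT on a 1-D signal in plain Python. It is illustrative only: the function names `haar_dwt_1d` and `haar_idwt_1d` are hypothetical, and the paper's actual implementation, normalization, and treatment of images and video may differ.

```python
import math

def haar_dwt_1d(signal):
    """One-level orthonormal Haar DWT of an even-length sequence.

    Returns (approx, detail), each half the length of the input.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: interleaves the reconstructed sample pairs."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)  # first sample of the pair
        out.append((a - d) * s)  # second sample of the pair
    return out
```

Because this transform is orthonormal, it preserves signal energy and inverts exactly up to floating-point error, so any reconstruction loss in such a pipeline would come from what happens to the coefficients between the two calls.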

Core claim

A continuous-token model built on a shared one-level Haar DWT/IDWT frontend and token-wise trunk reaches 39.92 dB PSNR on audio, 29.37 dB on images, and 23.93 dB on video, with masked sparse training maintaining 34.45 dB video PSNR at 50 percent of dense tokens.
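
Since every headline number here is a PSNR, a short sketch of the standard definition may help in reading them. The `peak` value (signal range) is an assumption; the abstract does not state which convention the paper uses for each modality, and the helper names are hypothetical.

```python
import math

def psnr(reference, reconstruction, peak=1.0):
    """PSNR in dB between two equal-length sequences of samples."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_from_db_gap(gap_db):
    """Factor by which MSE shrinks for a PSNR gain of gap_db at the same peak."""
    return 10.0 ** (gap_db / 10.0)
```

PSNR is logarithmic in mean squared error, so every 10 dB of gain corresponds to a tenfold reduction in MSE. PSNR values are only comparable across modalities when the peak convention is the same.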

What carries the argument

The shared coefficient-token layout with one-level Haar DWT frontend, optional metadata embeddings, modality value adapters, and shared trunk that processes tokens from any of the three modalities interchangeably.
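
One way to picture a shared coefficient-token layout is a single record type that both a 1-D and a 2-D one-level Haar frontend emit. The sketch below is a guess at the general shape, not the paper's schema: the `WaveletToken` class, its field names, and the subband labels are all hypothetical, and video, adapters, and the trunk are omitted.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class WaveletToken:
    # All field names are hypothetical; the paper's layout may differ.
    modality: str    # "audio" or "image" in this sketch
    subband: str     # "A"/"D" for 1-D, "LL"/"LH"/"HL"/"HH" for 2-D
    position: tuple  # coefficient index within its subband
    value: float     # continuous Haar coefficient

def tokenize_audio(samples):
    """One-level Haar on sample pairs -> approximation/detail tokens."""
    s = 1.0 / math.sqrt(2.0)
    tokens = []
    for k in range(0, len(samples) - 1, 2):
        x0, x1 = samples[k], samples[k + 1]
        tokens.append(WaveletToken("audio", "A", (k // 2,), (x0 + x1) * s))
        tokens.append(WaveletToken("audio", "D", (k // 2,), (x0 - x1) * s))
    return tokens

def tokenize_image(pixels):
    """One-level 2-D Haar on 2x2 blocks of a single-channel image."""
    tokens = []
    for r in range(0, len(pixels) - 1, 2):
        for c in range(0, len(pixels[r]) - 1, 2):
            p00, p01 = pixels[r][c], pixels[r][c + 1]
            p10, p11 = pixels[r + 1][c], pixels[r + 1][c + 1]
            pos = (r // 2, c // 2)
            bands = {
                "LL": (p00 + p01 + p10 + p11) / 2.0,
                "LH": (p00 - p01 + p10 - p11) / 2.0,  # band naming conventions vary
                "HL": (p00 + p01 - p10 - p11) / 2.0,
                "HH": (p00 - p01 - p10 + p11) / 2.0,
            }
            for name, value in bands.items():
                tokens.append(WaveletToken("image", name, pos, value))
    return tokens
```

Both functions return lists of the same record type, which is the property a token-wise trunk would rely on; the `modality`, `subband`, and `position` fields stand in for what the paper calls optional structural metadata.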

If this is right

Under compressed keep ratios, energy_global selection improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video.
Visual gains persist after controlling for latent capacity in matched-rate sweeps.
Masked sparse training preserves 34.45 dB video PSNR while using only 50 percent of dense tokens.
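
The gap between energy-based and uniform selection can be illustrated with a toy fixed-rate selector. This is a sketch under stated assumptions: the abstract does not give the exact rule behind energy_global, so the version below simply keeps the globally largest squared coefficients, and the uniform variant keeps evenly spaced indices. All function names are hypothetical.

```python
def keep_count(n, keep_ratio):
    """Number of coefficients retained at a given keep ratio (at least one)."""
    return max(1, min(n, int(round(keep_ratio * n))))

def select_energy_global(coeffs, keep_ratio):
    """Assumed rule: keep the k coefficients with the largest squared value."""
    k = keep_count(len(coeffs), keep_ratio)
    ranked = sorted(range(len(coeffs)), key=lambda i: coeffs[i] ** 2, reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def select_uniform(coeffs, keep_ratio):
    """Assumed rule: keep k evenly spaced coefficients regardless of value."""
    n = len(coeffs)
    k = keep_count(n, keep_ratio)
    keep = {(j * n) // k for j in range(k)}
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def squared_error(original, kept):
    """Total squared error introduced by zeroing the dropped coefficients."""
    return sum((o - q) ** 2 for o, q in zip(original, kept))
```

For an orthonormal transform such as Haar, squared error in the coefficient domain equals squared error in the signal domain, so keeping the largest-magnitude coefficients minimizes reconstruction error for a fixed count. That is one reason an energy-based selector makes a strong non-parametric baseline.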

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The schema could support cross-modal pretraining where tokens from one signal type initialize processing for another.
Extending the frontend to multiple wavelet levels might increase compression efficiency without losing the shared-trunk property.
Replacing the continuous scalars with a learned discrete codebook on top of the same layout would test whether a universal vocabulary emerges naturally.
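
To make the multi-level extension concrete, the sketch below applies the Haar step repeatedly to the approximation band of a 1-D signal. It illustrates the editorial suggestion only; the paper uses a one-level frontend, and the names `haar_step` and `haar_multilevel` are hypothetical.

```python
import math

def haar_step(x):
    """Single orthonormal Haar analysis step on an even-length sequence."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail

def haar_multilevel(signal, levels):
    """Recursively decompose the approximation band `levels` times.

    Returns (final_approx, details) with details ordered fine to coarse.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2 != 0:
            raise ValueError("length must stay even at every level")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details
```

Each extra level halves the approximation band, so smooth signals concentrate more of their energy into fewer coefficients. Whether that would translate into better rate-distortion behavior in this model is untested.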

Load-bearing premise

The specific datasets and continuous-token PSNR metric are sufficient to indicate that a shared wavelet schema will be useful for natural signals in general.

What would settle it

An experiment on the same datasets showing that modality-specific wavelet frontends or trunks achieve substantially higher PSNR than the shared version at matched rates would falsify the unification benefit.

Figures

Figures reproduced from arXiv: 2606.02631 by Shenghao Ding.

**Figure 1.** Rate-distortion sweep. The shared schema remains favorable on image and video under matched continuous ...

**Figure 2.** Non-parametric fixed-rate token selection. Top: reconstruction PSNR across keep ratios and selectors.

**Figure 3.** Masked sparse shared training. The curves compare metadata modes, latent dimensions, modalities, and ...

Original abstract

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
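
The sparse token interface can be pictured as a keep mask applied to the dense token sequence. The sketch below uses uniformly random masking at a fixed density, which is an assumption: the abstract does not say how masks are sampled during masked sparse training, what fills the dropped positions, or how the loss treats them. The function names are hypothetical.

```python
import random

def sample_keep_mask(num_tokens, density, rng):
    """Boolean mask that keeps round(density * num_tokens) tokens at random."""
    k = max(0, min(num_tokens, int(round(density * num_tokens))))
    kept = set(rng.sample(range(num_tokens), k))
    return [i in kept for i in range(num_tokens)]

def apply_mask(tokens, mask, fill=0.0):
    """Replace dropped tokens with a fill value (zero is an assumed choice)."""
    return [t if m else fill for t, m in zip(tokens, mask)]
```

Passing an explicit `random.Random` instance keeps the mask reproducible, which matters when comparing densities across runs.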

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives early PSNR numbers for a shared wavelet frontend and trunk across audio, images, and video, but does not test whether the shared trunk outperforms separate per-modality versions.

The main takeaway is that a single one-level Haar wavelet setup plus shared coefficient layout and trunk, with only lightweight adapters, can be run on Speech Commands, EuroSAT, and DAVIS 2017 and produce 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR in the dense case. The energy-based selection baseline improves over uniform by roughly 16 dB across modalities, and 50% sparse training still reaches 34.45 dB on video. Those are the concrete empirical points.

The work does a reasonable job of including a matched-rate sweep to check that visual results are not just latent capacity and of reporting the non-parametric energy baseline as a reference. The decision to stay with continuous tokens rather than claiming a universal discrete vocabulary also keeps the scope honest.

The clear gap is the missing comparison to modality-specific models that use the same wavelet frontend and trunk architecture but with separate weights. Every result uses the shared trunk, so the numbers show a unified architecture can be applied, yet they do not quantify whether sharing helps, hurts, or is neutral versus independent tokenizers. That leaves the advantage of the unified schema unmeasured. The abstract also gives no error bars or full training protocol, which is typical for preliminary work but limits how far the numbers can be read.

This is aimed at researchers testing signal-processing alternatives to learned tokenizers in multimodal settings. A reader already working on wavelet or sparse representations might pick up the energy and sparsity results for follow-up.

It is worth sending to peer review. The setup is simple enough that referees can ask for the missing baselines and protocol details, and the current numbers provide a usable starting point.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a preliminary continuous-token model for a shared wavelet token schema across audio, images, and video, using a one-level Haar DWT/IDWT frontend, shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017, it reports PSNR of 39.92 dB (audio), 29.37 dB (image), and 23.93 dB (video) for a dense shared model; a matched-rate sweep under continuous latent scalar budgets; fixed-rate energy selection as a non-parametric baseline (improving average PSNR by ~16 dB across modalities at compressed keep ratios); and masked sparse training reaching 34.45 dB video PSNR at 50% token density. The results are presented as supporting a unified wavelet token schema and sparse token interface, while explicitly stopping short of a universal discrete vocabulary.

Significance. If the advantage of cross-modality weight sharing can be isolated and quantified, the approach could provide a practical unified frontend for tokenizing natural signals, enabling shared sparse interfaces and potentially simplifying multimodal architectures. The energy baseline and matched-rate controls are useful empirical anchors, and the explicit acknowledgment of the gap to discrete vocabularies is appropriately cautious. At present the work demonstrates architectural feasibility more than comparative superiority of sharing.

major comments (2)

[Abstract / Experiments] The central claim that the results support a 'unified wavelet token schema' (shared trunk) is undercut by the absence of modality-specific non-shared baselines. All reported models employ the shared trunk plus adapters; no equivalent-capacity per-modality models (identical wavelet frontend and trunk architecture but with separate, non-shared weights) are provided. The matched-rate sweep controls latent dimension but does not isolate the effect of weight sharing, leaving the benefit of unification unquantified (see abstract and experimental results).
[Abstract] The reported PSNR values and sweep conclusions lack error bars, standard deviations, or details on the number of runs, making it difficult to assess whether the visual gains independent of latent capacity are statistically reliable or sensitive to initialization (abstract).

minor comments (2)

[Abstract] Implementation details for the energy_global baseline (e.g., exact coefficient selection rule, handling of structural metadata) are referenced but not fully specified, hindering reproducibility.
[Abstract] The manuscript would benefit from explicit discussion of why the chosen datasets (Speech Commands, EuroSAT RGB, DAVIS 2017) and the continuous-token PSNR metric are expected to generalize to other natural signals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review of our preliminary work. We address each major comment below and describe the corresponding revisions.

Referee: [Abstract / Experiments] The central claim that the results support a 'unified wavelet token schema' (shared trunk) is undercut by the absence of modality-specific non-shared baselines. All reported models employ the shared trunk plus adapters; no equivalent-capacity per-modality models (identical wavelet frontend and trunk architecture but with separate, non-shared weights) are provided. The matched-rate sweep controls latent dimension but does not isolate the effect of weight sharing, leaving the benefit of unification unquantified (see abstract and experimental results).

Authors: We agree that the current experiments do not isolate the effect of weight sharing. No modality-specific non-shared baselines with matched capacity are reported, and the matched-rate sweep addresses latent dimension rather than parameter sharing. In the revised manuscript we will update the abstract and the discussion of results to state that the work demonstrates architectural feasibility of the shared schema rather than a quantified advantage of unification. This limitation will also be noted as future work. revision: yes
Referee: [Abstract] The reported PSNR values and sweep conclusions lack error bars, standard deviations, or details on the number of runs, making it difficult to assess whether the visual gains independent of latent capacity are statistically reliable or sensitive to initialization (abstract).

Authors: We acknowledge that the reported PSNR values are from single training runs and lack error bars or run counts. In the revision we will update the abstract to indicate that the values are single-run results and add a corresponding note in the experimental section describing the preliminary nature of the study. Multiple runs with standard deviations will be included in future extensions of the work. revision: yes
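
The multi-run reporting promised in this response could be as simple as the sketch below, which summarizes PSNR over repeated training runs as a mean and sample standard deviation. The helper names are hypothetical and no values from the paper are used; callers would supply their own per-run results.

```python
import statistics

def summarize_runs(psnr_values):
    """Mean and sample standard deviation of PSNR across runs."""
    if len(psnr_values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(psnr_values), statistics.stdev(psnr_values)

def format_summary(psnr_values):
    """Render a summary such as 'mean ± std dB (n=runs)'."""
    mean, std = summarize_runs(psnr_values)
    return f"{mean:.2f} ± {std:.2f} dB (n={len(psnr_values)})"
```

With only a handful of seeds the standard deviation is itself a rough estimate, so reporting the run count alongside it, as `format_summary` does, helps readers judge how much weight the spread can bear.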

Circularity Check

0 steps flagged

No circularity; empirical results are self-contained

The paper defines a wavelet-based tokenizer architecture (one-level Haar DWT/IDWT frontend, shared coefficient layout, modality adapters, shared trunk) and reports PSNR on held-out data from Speech Commands, EuroSAT RGB, and DAVIS 2017. No derivation, equation, or claim reduces a reported result to a fitted parameter by construction, nor invokes self-citation for a uniqueness theorem or ansatz. Energy selection and masked sparse training are presented as non-parametric baselines and empirical variants, respectively. The central claims rest on direct comparisons against external benchmarks rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the existence of lightweight modality value adapters whose parameters are presumably learned.

free parameters (1)

modality value adapters
Lightweight adapters per modality are introduced to handle differences; these are learned parameters whose count and fitting procedure are not detailed.

pith-pipeline@v0.9.1-grok · 5770 in / 1202 out tokens · 24716 ms · 2026-06-28T18:03:11.819345+00:00 · methodology

