pith. machine review for the scientific record.

arxiv: 2604.13432 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.AI

Recognition: unknown

MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

Authors on Pith no claims yet

Pith reviewed 2026-05-10 13:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords mame · mare · token · accuracy · synthesis · cominder · effectiveness · github

The pith

Matrix-based token merging doubles ViT-B throughput with a 2% accuracy drop while its inverse speeds up image synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops MaMe as a token merging technique for Vision Transformers that performs all operations through matrix calculations rather than sorting or scattered writes. This design keeps the method training-free and fully differentiable so it applies directly to pre-trained models. Experiments demonstrate that MaMe can double throughput on ViT-B with only a 2% accuracy drop and even raise accuracy after fine-tuning just the final layer. When combined with the inverse MaRe operation for token restoration, the approach also improves both speed and quality in generative models such as Stable Diffusion. A reader would care because the method offers a hardware-efficient path to reduce the quadratic cost of self-attention without extensive retraining or task-specific changes.

Core claim

MaMe is a training-free, differentiable token merging method based entirely on matrix operations. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. The same matrix approach yields 1.3x acceleration on SigLIP2 and 48.5% faster inference on VideoMAE-L with minimal accuracy loss.

What carries the argument

MaMe: a matrix-based token merging operation that computes merging decisions and applies the reduction using only matrix multiplications and aggregations, lowering the token count while remaining GPU-efficient and differentiable.
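
To make the mechanism concrete, here is a minimal sketch of what a matrix-only merge and its restoration could look like in PyTorch. The strided anchor subset, the hard one-hot merge decision, and the transpose-style unmerge are illustrative assumptions, not the paper's exact MaMe/MaRe formulation.

    import torch
    import torch.nn.functional as F

    def mame_like_merge(x: torch.Tensor, k: int):
        """Reduce N tokens to k with dense matrix products (a sketch, not the paper's MaMe).

        x: (B, N, C) token features. Each token is assigned to its most similar
        "anchor" (a strided subset of the tokens -- an assumption), and every
        anchor's group is collapsed by a weighted average. Both the decision and
        the reduction are matmuls: no sorting, no scattered writes.
        """
        B, N, C = x.shape
        idx = torch.linspace(0, N - 1, k, device=x.device).long()
        anchors = x[:, idx]                                                   # (B, k, C)
        sim = F.normalize(x, dim=-1) @ F.normalize(anchors, dim=-1).transpose(1, 2)  # (B, N, k)
        assign = F.one_hot(sim.argmax(dim=-1), num_classes=k).float()         # hard decision; softmax(sim / T) gives a soft one
        counts = assign.sum(dim=1).clamp(min=1.0).unsqueeze(-1)               # (B, k, 1) group sizes
        merged = assign.transpose(1, 2) @ x / counts                          # per-group weighted average via matmul
        return merged, assign

    def mare_like_restore(merged: torch.Tensor, assign: torch.Tensor):
        """Broadcast each merged token back to the slots it absorbed (a MaRe-style inverse, sketched).

        merged: (B, k, C); assign: (B, N, k). One more matmul copies every group's
        representative to all of its original positions, so a decoder can keep
        operating on the full-length sequence.
        """
        return assign @ merged                                                # (B, N, C)

In this toy form the restored sequence carries one vector per group rather than per token, which is exactly where the load-bearing premise below does its work: the merge has to decide which distinctions are safe to collapse.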

Load-bearing premise

Matrix-based merging decisions preserve task-critical information across diverse inputs without needing task-specific tuning or introducing hidden biases.

What would settle it

Applying MaMe to ViT-B on ImageNet without fine-tuning and observing an accuracy drop larger than 4% or a speedup below 1.5x would indicate the matrix merging does not preserve information as claimed.
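
A rough harness for that settling test might look like the following, in PyTorch with timm. The model tag, batch size, and the apply_mame hook are assumptions rather than the released interface (the repository at github.com/cominder/mame presumably exposes the real entry point), and the ImageNet-1k accuracy pass is left as a placeholder.

    import time
    import torch
    import timm

    def throughput(model, batch=64, iters=30, device="cuda"):
        """Images per second on random inputs -- a crude stand-in for the paper's throughput metric."""
        model.eval().to(device)
        x = torch.randn(batch, 3, 224, 224, device=device)
        with torch.inference_mode():
            for _ in range(5):                      # warm-up
                model(x)
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(iters):
                model(x)
            torch.cuda.synchronize()
        return batch * iters / (time.time() - start)

    # The paper's figures reference an AugReg ViT-B/16; pick the matching timm tag if reproducing exactly.
    baseline = timm.create_model("vit_base_patch16_224", pretrained=True)
    merged = timm.create_model("vit_base_patch16_224", pretrained=True)
    # apply_mame(merged, threshold=0.8)             # hypothetical hook -- use the released code's actual API

    print(f"speedup: {throughput(merged) / throughput(baseline):.2f}x")
    # Evaluate ImageNet-1k top-1 for both models with a standard validation loader;
    # a drop of more than 4 points or a speedup below 1.5x would undercut the claim.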

Figures

Figures reproduced from arXiv: 2604.13432 by Ning Li, Simin Huo.

Figure 1. The visualization illustrates the progression of token count reduction in the first 8 blocks of the AugReg ViT-B/16 with MaMe. Each color square represents a distinct type of token.
Figure 2. Qualitative results. Images from left to right generated by Stable Diffusion v2.1, ToMeSD, and MaMe+MaRe. MaMe+MaRe retains high-frequency details (left) and produces images with higher aesthetic quality (right). More results are in the supplementary material.

Figure 3. Qualitative results. Images generated by the model with MaMe. The clarity of the image changes with the similarity threshold: a higher similarity threshold yields a more detailed, clearer painting. More results are in the supplementary material.

Figure 4. Ablation experiments using AugReg ViT-B/16. Our default settings are marked in gray. We report Top-1 accuracy (acc) and model inference throughput.

Figure 5. The visualization of the final token count in the 8th block of the AugReg ViT-B/16. Each color square represents a distinct type of token.

Figure 6. The accuracy and throughput change with the similarity threshold under both alternating and random partitions.
read the original abstract

Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe's and MaRe's effectiveness in accelerating vision models. The code is available at https://github.com/cominder/mame}{https://github.com/cominder/mame.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims to introduce MaMe, a training-free and differentiable token merging method for Vision Transformers that relies solely on matrix operations for improved GPU efficiency compared to methods like ToMe. MaRe is presented as the inverse operation to enable token restoration in synthesis tasks. Key empirical results include doubling the throughput of ViT-B with a 2% accuracy drop, achieving a 1% accuracy boost at 1.1x speed after last-layer fine-tuning, 1.3x acceleration on SigLIP2 with minimal degradation, 48.5% speedup on VideoMAE-L with 0.84% loss on Kinetics-400, and 31% reduced latency in Stable Diffusion v2.1 with quality gains.

Significance. If the results hold, this work offers a significant practical contribution to efficient visual perception and synthesis by providing a simple, matrix-based approach that is training-free and generalizes across tasks without per-task retuning. The reported cases of simultaneous accuracy and speed improvements are notable. The open-source code strengthens the reproducibility of the findings.

minor comments (1)
  1. The abstract contains a duplicated URL in the code availability statement: 'https://github.com/cominder/mame}{https://github.com/cominder/mame'.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the practical significance of the matrix-only token merging approach, and the recommendation for minor revision. We appreciate the emphasis on reproducibility via open-source code and the noted cases of simultaneous accuracy and speed gains.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines MaMe explicitly as a sequence of matrix operations (similarity matrix construction, differentiable selection, and weighted averaging) that are training-free and end-to-end differentiable by construction of the primitives themselves. All headline performance numbers (2x throughput with 2% drop on ViT-B, 1% gain after last-layer fine-tuning, 31% latency reduction in Stable Diffusion, etc.) are presented as direct empirical measurements on standard public benchmarks rather than as outputs of any fitted parameter or self-referential prediction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core method; the derivation from matrix formulation to reported speed/accuracy trade-offs therefore remains independent of the evaluation results and externally falsifiable via the released code.
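
The 'differentiable by construction' point is easy to check mechanically: swap the hard assignment for a temperature softmax and confirm that gradients reach the input tokens. The temperature, anchor choice, and tensor sizes below are arbitrary illustrations, not the paper's settings.

    import torch
    import torch.nn.functional as F

    x = torch.randn(2, 197, 768, requires_grad=True)       # (B, N, C) ViT-B-sized token stack
    anchors = x[:, ::4]                                     # strided anchors (an assumption)
    sim = F.normalize(x, dim=-1) @ F.normalize(anchors, dim=-1).transpose(1, 2)

    assign = F.softmax(sim / 0.1, dim=-1)                   # soft, differentiable selection
    counts = assign.sum(dim=1).unsqueeze(-1)
    merged = assign.transpose(1, 2) @ x / counts            # weighted averaging via matmul

    merged.sum().backward()
    print(x.grad.abs().mean())                              # non-zero: gradients flow end to end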

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the method is described as training-free and based on standard matrix operations.

pith-pipeline@v0.9.0 · 5581 in / 1184 out tokens · 25800 ms · 2026-05-10T13:02:36.961056+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TTF: Temporal Token Fusion for Efficient Video-Language Model

    cs.CV 2026-05 unverdicted novelty 5.0

    TTF fuses temporally redundant visual tokens via local similarity search in a plug-and-play way, cutting ~67% tokens on Qwen3-VL-8B while retaining 99.5% accuracy with minimal overhead.

Reference graph

Works this paper leans on

44 extracted references · 14 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, International Conference on Learning Representations (ICLR) (2021)

  2. [2]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (NeurIPS) 30 (2017)

  3. [3]

    Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, C.-J. Hsieh, DynamicViT: Efficient vision transformers with dynamic token sparsification, Advances in Neural Information Processing Systems 34 (2021) 13937–13949

  4. [4]

    Y. Liang, C. Ge, Z. Tong, Y. Song, J. Wang, P. Xie, Not all patches are what you need: Expediting vision transformers via token reorganizations (2022). arXiv:2202.07800. URL https://arxiv.org/abs/2202.07800

  5. [5]

    D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, J. Hoffman, Token merging: Your ViT but faster, International Conference on Learning Representations (2022)

  6. [6]

    D. Marin, J.-H. R. Chang, A. Ranjan, A. Prabhu, M. Rastegari, O. Tuzel, Token pooling in vision transformers for image classification, in: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 12–21. doi:10.1109/WACV56688.2023.00010

  7. [7]

    M. Chen, W. Shao, P. Xu, M. Lin, K. Zhang, F. Chao, R. Ji, Y. Qiao, P. Luo, DiffRate: Differentiable compression rate for efficient vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17164–17174

  8. [8]

    F. Zeng, D. Yu, Z. Kong, H. Tang, Token transforming: A unified and training-free token compression framework for vision transformer acceleration (2025). arXiv:2506.05709. URL https://arxiv.org/abs/2506.05709

  9. [9]

    Y. Shang, M. Cai, B. Xu, Y. J. Lee, Y. Yan, LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models, in: International Conference on Computer Vision (ICCV), 2025

  10. [10]

    S. Yang, Y. Chen, Z. Tian, C. Wang, J. Li, B. Yu, J. Jia, VisionZip: Longer is better but not necessary in vision language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  11. [11]

    L. Meng, H. Li, B.-C. Chen, S. Lan, Z. Wu, Y.-G. Jiang, S.-N. Lim, AdaViT: Adaptive vision transformers for efficient image recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12309–12318

  12. [12]

    L. Chen, H. Zhao, T. Liu, S. Bai, J. Lin, C. Zhou, B. Chang, An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, in: European Conference on Computer Vision (ECCV), 2024

  13. [13]

    Z. Lin, M. Lin, L. Lin, R. Ji, Boosting multimodal large language models with visual tokens withdrawal for rapid inference, in: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025

  14. [14]

    H. Song, Y.-J. Kim, S.-W. Oh, S.-J. Chun, ToFu: Token fusion for fast and accurate vision transformers, arXiv preprint arXiv:2403.14950 (2024)

  15. [15]

    Z. Fu, Z. Huang, Y. Liu, S. Han, Y. Sun, Y. Zhu, J. Yan, PuMer: Pruning and merging for efficient vision transformers, arXiv preprint arXiv:2405.02835 (2024)

  16. [16]

    B. Li, W. Zhao, Z. Zhang, LTPM: A learnable token pruning and merging method for vision transformer, Journal of Machine Learning Research (2024)

  17. [17]

    W. Zeng, S. Jin, L. Xu, W. Liu, C. Qian, W. Ouyang, P. Luo, X. Wang, TCFormer: Visual recognition via token clustering transformer, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (12) (2024) 9521–9535. doi:10.1109/TPAMI.2024.3425768

  18. [18]

    Y. Xie, J. Zhang, Y. Xia, A. van den Hengel, Q. Wu, ClusTR: Exploring efficient self-attention via clustering for vision transformers (2022). arXiv:2208.13138. URL https://arxiv.org/abs/2208.13138

  19. [19]

    P. Jin, R. Takanobu, W. Zhang, X. Cao, L. Yuan, Chat-UniVi: Unified visual representation empowers large language models with image and video understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  20. [20]

    M. Fayyaz, S. A. Koohpayegani, F. R. Jafari, S. Sengupta, H. R. V. Joze, E. Sommerlade, H. Pirsiavash, J. Gall, Adaptive token sampling for efficient vision transformers, in: European Conference on Computer Vision, Springer, 2022, pp. 396–414

  21. [21]

    Y. Wang, Y. Chen, L. Wang, J. Chen, Token morphing for vision transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16584–16593

  22. [22]

    D.-H. Kim, H.-J. Kim, T.-H. Kim, Gumbel-Gate: A Gumbel-based gating network for vision transformers, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 1454–1462

  23. [23]

    Y. Li, C. Wang, J. Jia, LLaMA-VID: An image is worth 2 tokens in large language models, in: European Conference on Computer Vision (ECCV), 2024

  24. [24]

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning, PMLR, 2021, pp. 10347–10357

  25. [25]

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009

  26. [26]

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022

  27. [27]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision (2021). arXiv:2103.00020. URL https://arxiv.org/abs/2103.00020

  28. [28]

    X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer, Sigmoid loss for language image pre-training (2023). arXiv:2303.15343

  29. [29]

    M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. Hénaff, J. Harmsen, A. Steiner, X. Zhai, SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features (2025). arXiv:2502.14786. URL https://arxiv.org/abs/2502.14786

  30. [30]

    H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning (2023)

  31. [31]

    H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al., VLMEvalKit: An open-source toolkit for evaluating large multi-modality models, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 11198–11201

  32. [32]

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Springer, 2014, pp. 740–755

  33. [33]

    Z. Tong, Y. Song, J. Wang, L. Wang, VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training, in: Advances in Neural Information Processing Systems, 2022

  34. [34]

    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman, The Kinetics human action video dataset, CoRR abs/1705.06950 (2017). arXiv:1705.06950. URL http://arxiv.org/abs/1705.06950

  35. [35]

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695

  36. [36]

    D. Bolya, J. Hoffman, Token merging for fast stable diffusion, arXiv preprint arXiv:2303.17604 (2023)

  37. [37]

    Y. Guo, H. Liu, H. Wen, GEMRec: Towards generative model recommendation, in: Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM '24, ACM, 2024, pp. 1054–1057. doi:10.1145/3616855.3635700

  38. [38]

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, Advances in Neural Information Processing Systems 30 (2017)

  39. [39]

    T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, arXiv preprint arXiv:1606.03498 (2016)

  40. [40]

    R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, in: CVPR, 2018

  41. [41]

    R. C. Gonzalez, R. E. Woods, Digital image processing, Pearson Education India, 2008

  42. [42]

    Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE Transactions on Image Processing 13 (4) (2004) 600–612

  43. [43]

    Y. Dong, J.-B. Cordonnier, A. Loukas, Attention is not all you need: Pure attention loses rank doubly exponentially with depth, in: International Conference on Machine Learning, PMLR, 2021, pp. 2793–2803

  44. [44]

    T. Dao, FlashAttention-2: Faster attention with better parallelism and work partitioning, in: International Conference on Learning Representations (ICLR), 2024