GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

Junyi Wu; Leyang Chen; Yulun Zhang; Zhiteng Li

arxiv: 2605.15852 · v1 · pith:EUVLVGO5new · submitted 2026-05-15 · 💻 cs.CV

GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

Leyang Chen , Junyi Wu , Zhiteng Li , Yulun Zhang This is my paper

Pith reviewed 2026-05-20 19:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D reconstructionstreaming videoKV cache evictiontoken managementgeometry-aware pruningefficient inferencemonocular videoonline cache management

0 comments

The pith

GHOST uses a model's own 3D geometry outputs to evict redundant KV-cache tokens online during streaming reconstruction from video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GHOST as a training-free way to manage the growing key-value cache in long monocular video sequences for 3D reconstruction. Instead of cutting the cache to fixed anchor frames or using attention scores that ignore scene structure, it scores tokens by how important they are to the current 3D geometry prediction. Three components work together: a dual-level importance score, protection for special tokens, and per-layer budget allocation guided by cosine similarity between layers. Experiments show this keeps reconstruction quality high on standard benchmarks while shrinking the cache by roughly half and speeding inference by 1.75 times over prior methods.

Core claim

GHOST is a geometry-hierarchical online streaming token eviction method that exploits the model's own 3D geometry outputs to decide which tokens to retain in the KV cache. It combines a hierarchical dual-level importance scoring scheme, a privilege mechanism that shields special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation strategy. This framework runs without additional training and directly addresses the linear growth of cache memory in long-sequence 3D reconstruction tasks.

What carries the argument

hierarchical dual-level importance scoring with privilege protection and cosine-similarity layer-wise budget allocation that uses the model's 3D geometry predictions as the eviction signal

If this is right

KV cache memory usage drops by nearly half while reconstruction quality stays comparable to full-cache baselines.
Inference runs 1.75 times faster than existing state-of-the-art streaming methods on the tested benchmarks.
The approach works without any extra training or fine-tuning steps.
The three components reinforce one another so that geometrically valuable tokens survive eviction even in extended sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometry-driven eviction idea could be tested on other long-context 3D tasks such as novel-view synthesis from video.
If the geometry outputs degrade on out-of-distribution scenes the eviction decisions would likely become unreliable, suggesting a need for fallback heuristics.
Layer-wise budget allocation might generalize to other transformer-based 3D models that output intermediate geometric features.
Real-time deployment on memory-constrained hardware becomes more feasible once the cache size is decoupled from sequence length.

Load-bearing premise

The model's 3D geometry outputs are sufficiently reliable and informative to serve as the basis for online token eviction decisions without causing quality degradation across diverse scenes and benchmarks.

What would settle it

Running GHOST on the same long video sequences and benchmarks as the full-cache baseline and measuring a clear drop in reconstruction metrics such as PSNR, SSIM, or surface accuracy would show the eviction strategy harms quality.

Figures

Figures reproduced from arXiv: 2605.15852 by Junyi Wu, Leyang Chen, Yulun Zhang, Zhiteng Li.

**Figure 1.** Figure 1: Radar comparison across 7-Scenes, NRGBD and Bonn (averaged over all input lengths; outer = better). GHOST consistently dominates all baselines on every axis. Transformer models [17] have achieved remarkable results in 3D reconstruction from monocular images [21, 11, 18], learning to predict dense depth, point maps, and camera poses in a single forward pass. VGGT [18] extends this to multi-view sequences b… view at source ↗

**Figure 2.** Figure 2: Correlation between Key-sim score and two frame attributes: Left: Negligible linear correlation between Key-sim score and camera pose change (ρ = −0.07); right: Moderate positive linear correlation between Key-sim score and depth gradient variance (ρ = +0.31). Dashed lines denote linear fitting trends. As shown in [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Top row: Raw RGB inputs from Long3D Lecture Hall, 7-Scenes Heads, 7-Scenes Chess, [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: GHOST inference pipeline. Offline: Cosine-similarity profiling allocates per-layer budgets. Online: An eviction mode that prunes KV cache to layer-wise budget computed offline with Geometry-Hierarchical Importance scoring and Special token boost . GHOST assigns per-patch importance ϕ(t, p) = wf sframe(t) + wkstoken(t, p), where sframe combines camera motion scam, depth variance sgeo, and recency stemp, whi… view at source ↗

**Figure 5.** Figure 5: Layer-wise budget allocation guided by cosine similarity. Larger input–output colour discrepancy and larger arrows indicate lower ρ¯ℓ; the cylinder shows how Btotal is distributed, with such layers receiving larger Bℓ. Camera tokens ct and register tokens {r i t} encode global scene geometry state and structural priors. Evicting these tokens can corrupt pose estimation and scene globalisation, yet standar… view at source ↗

**Figure 6.** Figure 6: Qualitative reconstruction comparison on 7-Scenes (Chess, Fire, Heads, Office, Kitchen [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Per-layer cosine similarity ρ¯ℓ (blue) and GHOST budget Bℓ (orange, τ=0.5). Lower similarity layers receive larger budgets [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Accuracy Mean (↓) versus sequence length on the Long3D benchmark. The shaded region highlights the gap between GHOST and InfiniteVGGT. GHOST’s advantage over key-similarity eviction (InfiniteVGGT) grows with sequence length, confirming that geometry-aware eviction scales more gracefully to very long sequences. Limitations. GHOST is not directly applicable to architectures that lack any causal structure (e.… view at source ↗

read the original abstract

Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GHOST gives a practical training-free way to evict KV tokens using the model's own 3D geometry outputs, but the early-frame reliability issue is a real soft spot that needs checking.

read the letter

Hi, the main takeaway is that this paper tackles the linear growth of KV cache in streaming monocular 3D reconstruction by scoring tokens for eviction with geometry information the model already computes. It avoids both crude truncation and pure attention heuristics. The three pieces are a hierarchical dual-level scorer, a privilege guard for special tokens, and cosine-similarity allocation of per-layer budgets. All of it runs without retraining, which keeps it simple to apply. The reported gains are roughly halving the cache size and 1.75 times faster inference while holding reconstruction quality on the benchmarks they tried. Public code is a plus for anyone who wants to reproduce or extend it. The approach feels like a sensible engineering step for memory-constrained settings such as robotics or AR pipelines that ingest long videos. On the downside, the stress-test concern about early frames lands. With minimal parallax at the start, the geometry signal is likely weak or biased, and any tokens evicted then stay gone. The privilege set is small and fixed, so it may not catch systematic early mis-ranking across many tokens. The abstract claims strong results but gives no experimental details, baselines, or variance numbers, so it is hard to judge whether the quality preservation holds when geometry is uncertain. This work is aimed at practitioners who already run 3D reconstruction models and need to stretch them to longer sequences without blowing up memory. A reader focused on efficient inference or KV cache tricks would get concrete ideas from the scoring and allocation rules. It deserves a serious referee because the problem is real and the proposed combination is specific enough to evaluate. I would send it out for review rather than desk reject.

Referee Report

2 major / 0 minor

Summary. The paper presents GHOST, a training-free KV cache management framework for streaming 3D reconstruction from long monocular video sequences. It uses the model's 3D geometry outputs to perform online token eviction via a hierarchical dual-level importance scoring scheme, a privilege mechanism for special tokens, and cosine-similarity-guided layer-wise budget allocation. The authors claim that this approach maintains excellent reconstruction quality while reducing the KV cache by nearly half and achieving 1.75x faster inference compared to state-of-the-art methods.

Significance. If the results hold, GHOST represents a meaningful advance in efficient long-sequence 3D reconstruction by incorporating geometric structure into cache eviction decisions rather than relying on generic attention heuristics. The training-free design and availability of code are positive aspects that facilitate reproducibility and adoption.

major comments (2)

The abstract reports positive benchmark results but provides no details on experimental setup, baselines, error bars, or potential post-hoc choices in eviction rules. This makes it impossible to verify if the data supports the claim of preserved quality with halved cache.
The approach assumes that the model's 3D geometry outputs are reliable from the first frame for making eviction decisions. However, in streaming monocular video with minimal initial parallax, these outputs are likely low-confidence or biased, risking permanent eviction of geometrically critical tokens. The privilege mechanism protects only a small fixed set and does not address this systematic early mis-ranking issue.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the positive assessment of GHOST's significance and reproducibility. We respond point-by-point to the major comments below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: The abstract reports positive benchmark results but provides no details on experimental setup, baselines, error bars, or potential post-hoc choices in eviction rules. This makes it impossible to verify if the data supports the claim of preserved quality with halved cache.

Authors: We agree that the abstract's brevity omits these specifics, which are instead provided in the main text. Section 4 details the experimental setup (datasets, hardware, streaming protocol), baselines (including prior KV-cache eviction and streaming 3D methods), and evaluation metrics. Section 5 reports results with error bars from multiple runs and confirms that eviction rules are fixed and deterministic with no post-hoc tuning. To improve immediate verifiability, we will revise the abstract to briefly note the benchmarks used and that quality preservation is shown with standard deviations. revision: yes
Referee: The approach assumes that the model's 3D geometry outputs are reliable from the first frame for making eviction decisions. However, in streaming monocular video with minimal initial parallax, these outputs are likely low-confidence or biased, risking permanent eviction of geometrically critical tokens. The privilege mechanism protects only a small fixed set and does not address this systematic early mis-ranking issue.

Authors: This is a legitimate concern for the bootstrap phase. GHOST's hierarchical dual-level scoring combines immediate geometry cues with longer-term consistency, while the privilege mechanism protects both a fixed set of special tokens and dynamically high-importance ones. Because eviction decisions are made online and continuously, early low-confidence rankings can be revisited as parallax accumulates. Our experiments (Section 5 and supplementary ablations) show robust final reconstruction quality under streaming conditions. To strengthen the manuscript, we will add a short discussion subsection analyzing early-frame behavior and any observed sensitivity to initial parallax. revision: partial

Circularity Check

0 steps flagged

No circularity: heuristic training-free method with no self-referential derivation

full rationale

The paper presents GHOST as a training-free KV cache management framework that exploits the model's own 3D geometry outputs for online token eviction via hierarchical scoring, privilege mechanism, and cosine-similarity allocation. No mathematical derivation chain, parameter fitting, or equations are described that reduce predictions or results to inputs by construction. The approach is explicitly heuristic and relies on external model outputs plus empirical benchmark validation rather than any closed self-referential loop, making the central claims self-contained against external testing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified reliability of geometry outputs for eviction and the effectiveness of the proposed heuristics across scenes; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Model's 3D geometry outputs can be used directly to identify redundant tokens without introducing reconstruction errors.
Eviction decisions depend on these outputs being accurate guides for importance.

pith-pipeline@v0.9.0 · 5726 in / 1126 out tokens · 35641 ms · 2026-05-20T19:49:19.101842+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GHOST scores each cached token by a hierarchical dual-level importance signal derived entirely from the model’s own outputs: a frame-level component integrating camera pose change, depth gradient variance, and temporal recency... a token-level component integrating visual saliency, depth confidence, and 3D point confidence
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

cosine-similarity-guided layer-wise budget allocation... πℓ ∝ exp(aℓ/τ) where aℓ = 1−ρ̄ℓ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 3 internal anchors

[1]

Neural rgb-d surface reconstruction

Dejan Azinovi´c, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. InICCV, 2022

work page 2022
[2]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling.arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Diffrate: Differentiable compression rate for efficient vision transformers

Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, and Ping Luo. Diffrate: Differentiable compression rate for efficient vision transformers. InICCV, 2023

work page 2023
[5]

Ttt3r: 3d reconstruc- tion as test-time training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruc- tion as test-time training. InICLR, 2026

work page 2026
[6]

Evit: Privacy-preserving image retrieval via encrypted vision transformer in cloud computing.TCSVT, 2024

Qihua Feng, Peiya Li, Zhixun Lu, Chaozhuo Li, Zefan Wang, Zhiquan Liu, Chunhui Duan, Feiran Huang, Jian Weng, and Philip S Yu. Evit: Privacy-preserving image retrieval via encrypted vision transformer in cloud computing.TCSVT, 2024

work page 2024
[7]

Exploiting uncertainty in regression forests for accurate camera relocalization

Abner Guzman-Rivera, Pushmeet Kohli, Ben Glocker, Jamie Shotton, Toby Sharp, Andrew Fitzgibbon, and Shahram Izadi. Exploiting uncertainty in regression forests for accurate camera relocalization. InCVPR, 2014

work page 2014
[8]

Gradpruner: Gradient-guided layer pruning enabling efficient fine-tuning and inference for llms.arXiv preprint arXiv:2601.19503, 2026

Wei Huang, Anda Cheng, and Yinggui Wang. Gradpruner: Gradient-guided layer pruning enabling efficient fine-tuning and inference for llms.arXiv preprint arXiv:2601.19503, 2026

work page arXiv 2026
[9]

3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023

work page 2023
[10]

A fast post-training pruning framework for transformers

Woosuk Kwon, Sehoon Kim, Michael W Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. InNeurIPS, 2022

work page 2022
[11]

MASt3R: Grounding image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R: Grounding image matching in 3D with MASt3R. InECCV, 2024

work page 2024
[12]

Analyzing the mechanism of attention collapse in VGGT from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025

Huan Li, Longjun Luo, Yuling Shi, and Xiaodong Gu. Analyzing the mechanism of attention collapse in vggt from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025

work page arXiv 2025
[13]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. InNeurIPS, 2024

work page 2024
[14]

Evict3R: Training-free token eviction for memory-bounded streaming visual geometry transformers.arXiv preprint arXiv:2509.17650, 2025

Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3r: Training-free token eviction for memory-bounded streaming visual geometry trans- formers.arXiv preprint arXiv:2509.17650, 2025

work page arXiv 2025
[15]

Palazzolo, J

E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. InIROS, 2019

work page 2019
[16]

Dynamicvit: Efficient vision transformers with dynamic token sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. InNeurIPS, 2021

work page 2021
[17]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS, 2017

work page 2017
[18]

VGGT: Visual geometry grounded deep structured feature transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded deep structured feature transformer. InCVPR, 2025

work page 2025
[19]

Efficient video transformers with spatial-temporal token selection

Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, and Yu-Gang Jiang. Efficient video transformers with spatial-temporal token selection. InECCV, 2022. 10

work page 2022
[20]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InCVPR, 2025

work page 2025
[21]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InCVPR, 2024

work page 2024
[22]

Quantcache: Adaptive importance-guided quantization with hierarchical latent and layer caching for video generation

Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, and Xiaokang Yang. Quantcache: Adaptive importance-guided quantization with hierarchical latent and layer caching for video generation. InICCV, 2025

work page 2025
[23]

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Flashedit: Decoupling speed, structure, and semantics for precise image editing.arXiv preprint arXiv:2509.22244, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Balancegs: Algorithm-system co-design for efficient 3d gaussian splatting training on gpu

Junyi Wu, Jiaming Xu, Jinhao Li, Yongkang Zhou, Jiayi Pan, Xingyang Li, and Guohao Dai. Balancegs: Algorithm-system co-design for efficient 3d gaussian splatting training on gpu. In Asia and South Pacific Design Automation Conference, 2026

work page 2026
[25]

Point3r: Streaming 3d reconstruction with explicit spatial pointer memory

Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. InNeurIPS, 2025

work page 2025
[26]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InICLR, 2024

work page 2024
[27]

Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms

Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, and Guohao Dai. Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms. InACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2026

work page 2026
[28]

Streamingvlm: Real-time understanding for infinite video streams.ICLR, 2026

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams.ICLR, 2026

work page 2026
[29]

Evo-vit: Slow-fast token evolution for dynamic vision trans- former

Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-vit: Slow-fast token evolution for dynamic vision trans- former. InAAAI, 2022

work page 2022
[30]

Global vision transformer pruning with hessian-aware saliency

Huanrui Yang, Hongxu Yin, Maying Shen, Pavlo Molchanov, Hai Li, and Jan Kautz. Global vision transformer pruning with hessian-aware saliency. InCVPR, 2023

work page 2023
[31]

Kvsharer: Efficient inference via layer-wise dissimilar kv cache sharing.arXiv preprint arXiv:2410.18517, 2024

Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, and Zhi Chen. Kvsharer: Efficient inference via layer-wise dissimilar kv cache sharing.arXiv preprint arXiv:2410.18517, 2024

work page arXiv 2024
[32]

A-vit: Adaptive tokens for efficient vision transformer

Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-vit: Adaptive tokens for efficient vision transformer. InICCV, 2022

work page 2022
[33]

InfiniteVGGT: Visual geometry grounded transformer for endless streams

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. Infinitevggt: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

work page arXiv 2026
[34]

Big bird: Transformers for longer sequences

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. InNeurIPS, 2020

work page 2020
[35]

Magnitude pruning of large pretrained trans- former models with a mixture gaussian prior.Journal of Data Science: JDS, 2024

Mingxuan Zhang, Yan Sun, and Faming Liang. Magnitude pruning of large pretrained trans- former models with a mixture gaussian prior.Journal of Data Science: JDS, 2024

work page 2024
[36]

H2O: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InNeurIPS, 2023

work page 2023
[37]

Streaming 4d visual geometry transformer

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. InICLR, 2026. 11 A GHOST Eviction Algorithm A.1 Online Incremental Computation Computing full importance from scratch at every eviction step would require O(T 2) operations. GHOST maintains animportance cachethat stores per-frame raw scores: • O...

work page 2026

[1] [1]

Neural rgb-d surface reconstruction

Dejan Azinovi´c, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. InICCV, 2022

work page 2022

[2] [2]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling.arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Diffrate: Differentiable compression rate for efficient vision transformers

Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, and Ping Luo. Diffrate: Differentiable compression rate for efficient vision transformers. InICCV, 2023

work page 2023

[5] [5]

Ttt3r: 3d reconstruc- tion as test-time training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruc- tion as test-time training. InICLR, 2026

work page 2026

[6] [6]

Evit: Privacy-preserving image retrieval via encrypted vision transformer in cloud computing.TCSVT, 2024

Qihua Feng, Peiya Li, Zhixun Lu, Chaozhuo Li, Zefan Wang, Zhiquan Liu, Chunhui Duan, Feiran Huang, Jian Weng, and Philip S Yu. Evit: Privacy-preserving image retrieval via encrypted vision transformer in cloud computing.TCSVT, 2024

work page 2024

[7] [7]

Exploiting uncertainty in regression forests for accurate camera relocalization

Abner Guzman-Rivera, Pushmeet Kohli, Ben Glocker, Jamie Shotton, Toby Sharp, Andrew Fitzgibbon, and Shahram Izadi. Exploiting uncertainty in regression forests for accurate camera relocalization. InCVPR, 2014

work page 2014

[8] [8]

Gradpruner: Gradient-guided layer pruning enabling efficient fine-tuning and inference for llms.arXiv preprint arXiv:2601.19503, 2026

Wei Huang, Anda Cheng, and Yinggui Wang. Gradpruner: Gradient-guided layer pruning enabling efficient fine-tuning and inference for llms.arXiv preprint arXiv:2601.19503, 2026

work page arXiv 2026

[9] [9]

3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023

work page 2023

[10] [10]

A fast post-training pruning framework for transformers

Woosuk Kwon, Sehoon Kim, Michael W Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. InNeurIPS, 2022

work page 2022

[11] [11]

MASt3R: Grounding image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R: Grounding image matching in 3D with MASt3R. InECCV, 2024

work page 2024

[12] [12]

Analyzing the mechanism of attention collapse in VGGT from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025

Huan Li, Longjun Luo, Yuling Shi, and Xiaodong Gu. Analyzing the mechanism of attention collapse in vggt from a dynamics perspective.arXiv preprint arXiv:2512.21691, 2025

work page arXiv 2025

[13] [13]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. InNeurIPS, 2024

work page 2024

[14] [14]

Evict3R: Training-free token eviction for memory-bounded streaming visual geometry transformers.arXiv preprint arXiv:2509.17650, 2025

Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, and Mahdi Javanmardi. Evict3r: Training-free token eviction for memory-bounded streaming visual geometry trans- formers.arXiv preprint arXiv:2509.17650, 2025

work page arXiv 2025

[15] [15]

Palazzolo, J

E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals. InIROS, 2019

work page 2019

[16] [16]

Dynamicvit: Efficient vision transformers with dynamic token sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. InNeurIPS, 2021

work page 2021

[17] [17]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS, 2017

work page 2017

[18] [18]

VGGT: Visual geometry grounded deep structured feature transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded deep structured feature transformer. InCVPR, 2025

work page 2025

[19] [19]

Efficient video transformers with spatial-temporal token selection

Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, and Yu-Gang Jiang. Efficient video transformers with spatial-temporal token selection. InECCV, 2022. 10

work page 2022

[20] [20]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InCVPR, 2025

work page 2025

[21] [21]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InCVPR, 2024

work page 2024

[22] [22]

Quantcache: Adaptive importance-guided quantization with hierarchical latent and layer caching for video generation

Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, and Xiaokang Yang. Quantcache: Adaptive importance-guided quantization with hierarchical latent and layer caching for video generation. InICCV, 2025

work page 2025

[23] [23]

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Flashedit: Decoupling speed, structure, and semantics for precise image editing.arXiv preprint arXiv:2509.22244, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Balancegs: Algorithm-system co-design for efficient 3d gaussian splatting training on gpu

Junyi Wu, Jiaming Xu, Jinhao Li, Yongkang Zhou, Jiayi Pan, Xingyang Li, and Guohao Dai. Balancegs: Algorithm-system co-design for efficient 3d gaussian splatting training on gpu. In Asia and South Pacific Design Automation Conference, 2026

work page 2026

[25] [25]

Point3r: Streaming 3d reconstruction with explicit spatial pointer memory

Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. InNeurIPS, 2025

work page 2025

[26] [26]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InICLR, 2024

work page 2024

[27] [27]

Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms

Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, and Guohao Dai. Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms. InACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2026

work page 2026

[28] [28]

Streamingvlm: Real-time understanding for infinite video streams.ICLR, 2026

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams.ICLR, 2026

work page 2026

[29] [29]

Evo-vit: Slow-fast token evolution for dynamic vision trans- former

Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-vit: Slow-fast token evolution for dynamic vision trans- former. InAAAI, 2022

work page 2022

[30] [30]

Global vision transformer pruning with hessian-aware saliency

Huanrui Yang, Hongxu Yin, Maying Shen, Pavlo Molchanov, Hai Li, and Jan Kautz. Global vision transformer pruning with hessian-aware saliency. InCVPR, 2023

work page 2023

[31] [31]

Kvsharer: Efficient inference via layer-wise dissimilar kv cache sharing.arXiv preprint arXiv:2410.18517, 2024

Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, and Zhi Chen. Kvsharer: Efficient inference via layer-wise dissimilar kv cache sharing.arXiv preprint arXiv:2410.18517, 2024

work page arXiv 2024

[32] [32]

A-vit: Adaptive tokens for efficient vision transformer

Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-vit: Adaptive tokens for efficient vision transformer. InICCV, 2022

work page 2022

[33] [33]

InfiniteVGGT: Visual geometry grounded transformer for endless streams

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. Infinitevggt: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

work page arXiv 2026

[34] [34]

Big bird: Transformers for longer sequences

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. InNeurIPS, 2020

work page 2020

[35] [35]

Magnitude pruning of large pretrained trans- former models with a mixture gaussian prior.Journal of Data Science: JDS, 2024

Mingxuan Zhang, Yan Sun, and Faming Liang. Magnitude pruning of large pretrained trans- former models with a mixture gaussian prior.Journal of Data Science: JDS, 2024

work page 2024

[36] [36]

H2O: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InNeurIPS, 2023

work page 2023

[37] [37]

Streaming 4d visual geometry transformer

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. InICLR, 2026. 11 A GHOST Eviction Algorithm A.1 Online Incremental Computation Computing full importance from scratch at every eviction step would require O(T 2) operations. GHOST maintains animportance cachethat stores per-frame raw scores: • O...

work page 2026