GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction
Pith reviewed 2026-05-20 19:49 UTC · model grok-4.3
The pith
GHOST uses a model's own 3D geometry outputs to evict redundant KV-cache tokens online during streaming reconstruction from video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GHOST is a geometry-hierarchical online streaming token eviction method that exploits the model's own 3D geometry outputs to decide which tokens to retain in the KV cache. It combines a hierarchical dual-level importance scoring scheme, a privilege mechanism that shields special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation strategy. This framework runs without additional training and directly addresses the linear growth of cache memory in long-sequence 3D reconstruction tasks.
What carries the argument
A hierarchical dual-level importance score that uses the model's own 3D geometry predictions as the eviction signal, combined with privilege protection for special tokens and cosine-similarity-guided layer-wise budget allocation.
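To make the scoring and privilege steps concrete, here is a minimal sketch, not the authors' implementation. It assumes the frame-level and token-level scores have already been computed from the model's geometry outputs. The mixing weight `alpha`, the function names `token_scores` and `keep_indices`, and the simple top-k rule are all hypothetical choices made for illustration.

```python
import numpy as np

def token_scores(frame_score, token_score, frame_ids, alpha=0.5):
    """Blend a per-frame score with a per-token score.

    alpha is a hypothetical mixing weight; the paper's actual
    combination rule is not given on this page.
    """
    frame_score = np.asarray(frame_score, dtype=float)
    token_score = np.asarray(token_score, dtype=float)
    # Each token inherits the score of the frame it came from.
    return alpha * frame_score[frame_ids] + (1.0 - alpha) * token_score

def keep_indices(scores, privileged, budget):
    """Indices of cached tokens to retain.

    Privileged tokens are always kept; the remaining budget goes to the
    highest-scoring non-privileged tokens.
    """
    privileged = np.asarray(privileged, dtype=bool)
    privileged_idx = np.flatnonzero(privileged)
    candidate_idx = np.flatnonzero(~privileged)
    n_free = max(budget - privileged_idx.size, 0)
    # Stable descending sort so ties are broken by cache position.
    order = np.argsort(-scores[candidate_idx], kind="stable")
    kept = candidate_idx[order[:n_free]]
    return np.sort(np.concatenate([privileged_idx, kept]))

# Toy cache: two frames with three tokens each; token 0 of each frame is special.
frame_ids = np.array([0, 0, 0, 1, 1, 1])
scores = token_scores([0.9, 0.1], [0.2, 0.8, 0.5, 0.9, 0.1, 0.4], frame_ids)
privileged = np.array([True, False, False, True, False, False])
keep = keep_indices(scores, privileged, budget=4)
```

In this toy case the two special tokens survive regardless of score, and the two free slots go to the tokens from the higher-scoring frame. A real system would apply the returned indices to the key and value tensors of each layer.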
If this is right
- KV cache memory usage drops by nearly half while reconstruction quality stays comparable to full-cache baselines.
- Inference runs 1.75 times faster than existing state-of-the-art streaming methods on the tested benchmarks.
- The approach works without any extra training or fine-tuning steps.
- The three components reinforce one another so that geometrically valuable tokens survive eviction even in extended sequences.
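The memory claim can be illustrated with back-of-the-envelope arithmetic. The sketch below uses the standard size estimate for a transformer KV cache (keys plus values, per layer, per head); every model dimension in the example call is a made-up placeholder rather than a figure from the paper, and `keep_ratio` is a hypothetical stand-in for the fraction of tokens that survive eviction.

```python
def kv_cache_bytes(num_frames, tokens_per_frame, num_layers, num_heads,
                   head_dim, bytes_per_value=2, keep_ratio=1.0):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    keep_ratio models the fraction of tokens retained after eviction.
    """
    tokens = int(num_frames * tokens_per_frame * keep_ratio)
    return 2 * num_layers * num_heads * head_dim * tokens * bytes_per_value

# Placeholder dimensions, chosen only to show the scaling behaviour.
dims = dict(tokens_per_frame=1000, num_layers=24, num_heads=16, head_dim=64)
full_100 = kv_cache_bytes(100, **dims)
full_200 = kv_cache_bytes(200, **dims)
half_100 = kv_cache_bytes(100, keep_ratio=0.5, **dims)
```

Doubling the number of frames doubles the full-cache size, which is the linear growth the paper targets; retaining half the tokens halves it.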
Where Pith is reading between the lines
- The same geometry-driven eviction idea could be tested on other long-context 3D tasks such as novel-view synthesis from video.
- If the geometry outputs degrade on out-of-distribution scenes, the eviction decisions would likely become unreliable, which suggests a need for fallback heuristics.
- Layer-wise budget allocation might generalize to other transformer-based 3D models that output intermediate geometric features.
- Real-time deployment on memory-constrained hardware becomes more feasible once the cache size is decoupled from sequence length.
Load-bearing premise
The model's 3D geometry outputs are sufficiently reliable and informative to serve as the basis for online token eviction decisions without causing quality degradation across diverse scenes and benchmarks.
What would settle it
Running GHOST on the same long video sequences and benchmarks as the full-cache baseline and measuring a clear drop in reconstruction metrics such as PSNR, SSIM, or surface accuracy would show the eviction strategy harms quality.
Original abstract
Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GHOST, a training-free KV cache management framework for streaming 3D reconstruction from long monocular video sequences. It uses the model's 3D geometry outputs to perform online token eviction via a hierarchical dual-level importance scoring scheme, a privilege mechanism for special tokens, and cosine-similarity-guided layer-wise budget allocation. The authors claim that this approach maintains excellent reconstruction quality while reducing the KV cache by nearly half and achieving 1.75x faster inference compared to state-of-the-art methods.
Significance. If the results hold, GHOST represents a meaningful advance in efficient long-sequence 3D reconstruction by incorporating geometric structure into cache eviction decisions rather than relying on generic attention heuristics. The training-free design and availability of code are positive aspects that facilitate reproducibility and adoption.
major comments (2)
- The abstract reports positive benchmark results but provides no details on experimental setup, baselines, error bars, or potential post-hoc choices in eviction rules. This makes it impossible to verify if the data supports the claim of preserved quality with halved cache.
- The approach assumes that the model's 3D geometry outputs are reliable from the first frame for making eviction decisions. However, in streaming monocular video with minimal initial parallax, these outputs are likely low-confidence or biased, risking permanent eviction of geometrically critical tokens. The privilege mechanism protects only a small fixed set and does not address this systematic early mis-ranking issue.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the positive assessment of GHOST's significance and reproducibility. We respond point-by-point to the major comments below, indicating planned revisions where appropriate.
- Referee: The abstract reports positive benchmark results but provides no details on experimental setup, baselines, error bars, or potential post-hoc choices in eviction rules. This makes it impossible to verify if the data supports the claim of preserved quality with halved cache.
Authors: We agree that the abstract's brevity omits these specifics, which are instead provided in the main text. Section 4 details the experimental setup (datasets, hardware, streaming protocol), baselines (including prior KV-cache eviction and streaming 3D methods), and evaluation metrics. Section 5 reports results with error bars from multiple runs and confirms that eviction rules are fixed and deterministic with no post-hoc tuning. To improve immediate verifiability, we will revise the abstract to briefly note the benchmarks used and that quality preservation is shown with standard deviations.
Revision: yes
- Referee: The approach assumes that the model's 3D geometry outputs are reliable from the first frame for making eviction decisions. However, in streaming monocular video with minimal initial parallax, these outputs are likely low-confidence or biased, risking permanent eviction of geometrically critical tokens. The privilege mechanism protects only a small fixed set and does not address this systematic early mis-ranking issue.
Authors: This is a legitimate concern for the bootstrap phase. GHOST's hierarchical dual-level scoring combines immediate geometry cues with longer-term consistency, while the privilege mechanism protects both a fixed set of special tokens and dynamically high-importance ones. Because eviction decisions are made online and continuously, early low-confidence rankings can be revisited as parallax accumulates. Our experiments (Section 5 and supplementary ablations) show robust final reconstruction quality under streaming conditions. To strengthen the manuscript, we will add a short discussion subsection analyzing early-frame behavior and any observed sensitivity to initial parallax.
Revision: partial
Circularity Check
No circularity: heuristic training-free method with no self-referential derivation
The paper presents GHOST as a training-free KV cache management framework that exploits the model's own 3D geometry outputs for online token eviction via hierarchical scoring, privilege mechanism, and cosine-similarity allocation. No mathematical derivation chain, parameter fitting, or equations are described that reduce predictions or results to inputs by construction. The approach is explicitly heuristic and relies on external model outputs plus empirical benchmark validation rather than any closed self-referential loop, making the central claims self-contained against external testing.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The model's 3D geometry outputs can be used directly to identify redundant tokens without introducing reconstruction errors.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
"GHOST scores each cached token by a hierarchical dual-level importance signal derived entirely from the model’s own outputs: a frame-level component integrating camera pose change, depth gradient variance, and temporal recency... a token-level component integrating visual saliency, depth confidence, and 3D point confidence"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
"cosine-similarity-guided layer-wise budget allocation... $\pi_\ell \propto \exp(a_\ell/\tau)$ where $a_\ell = 1 - \bar{\rho}_\ell$"
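The allocation rule quoted in the second passage is a temperature-scaled softmax over layers, and a short sketch may help show how it behaves. This is an illustration under assumptions, not the paper's code: the mean cosine similarity per layer is taken as given (the page does not say which vectors are compared), the default temperature `tau=0.5` is arbitrary, and the rounding of fractional budgets by largest remainder is a hypothetical choice.

```python
import numpy as np

def layer_budgets(mean_cos_sim, total_budget, tau=0.5):
    """Split a total token budget across layers.

    Implements pi_l proportional to exp(a_l / tau) with a_l = 1 - rho_l,
    where rho_l is the mean cosine similarity measured at layer l.
    """
    rho = np.asarray(mean_cos_sim, dtype=float)
    a = 1.0 - rho
    logits = a / tau
    logits = logits - logits.max()  # shift for numerical stability
    pi = np.exp(logits)
    pi = pi / pi.sum()

    # Convert shares to integer budgets (assumed rounding scheme):
    # floor first, then give leftover tokens to the largest remainders.
    exact = pi * total_budget
    budgets = np.floor(exact).astype(int)
    leftover = int(total_budget - budgets.sum())
    for i in np.argsort(-(exact - budgets), kind="stable")[:leftover]:
        budgets[i] += 1
    return pi, budgets

# Three layers with decreasing redundancy (high similarity = more redundant).
pi, budgets = layer_budgets([0.9, 0.5, 0.1], total_budget=100)
```

Layers whose measured similarity is high receive a smaller share, and layers with low similarity receive a larger one. Lowering `tau` sharpens the split toward the least redundant layers, while raising it moves the allocation toward uniform.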
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.