pith. machine review for the scientific record. sign in

arxiv: 2604.07350 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.GR· cs.LG

Recognition: 2 theorem links

· Lean Theorem

Fast Spatial Memory with Elastic Test-Time Training

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:22 UTC · model grok-4.3

classification 💻 cs.CV cs.GRcs.LG
keywords elastic test-time trainingfast spatial memory3D reconstruction4D reconstructiontest-time adaptationcatastrophic forgettingelastic weight consolidationspatiotemporal representations
0
0 comments X

The pith

Elastic Test-Time Training stabilizes LaCT fast-weight updates using a Fisher-weighted prior and EMA anchor to support multi-chunk 3D/4D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large Chunk Test-Time Training performs well on long-context 3D reconstruction but its fully plastic updates cause catastrophic forgetting and overfitting, so it is restricted to single large chunks. The paper proposes Elastic Test-Time Training, inspired by elastic weight consolidation, that adds stability by applying a Fisher-weighted elastic prior around an anchor state which evolves as an exponential moving average of past fast weights. This stabilized architecture powers Fast Spatial Memory, a model pre-trained on large-scale 3D/4D data to learn spatiotemporal representations and render novel view-time combinations. A sympathetic reader would care because the method allows high-quality reconstruction from long sequences using smaller chunks, reduces memory demands, mitigates camera-interpolation shortcuts, and advances toward single-pass handling of arbitrarily long inputs.

Core claim

We propose Elastic Test-Time Training that stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this architecture, we introduce Fast Spatial Memory (FSM), an efficient model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. Pre-trained on large-scale curated 3D/4D data, FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks while mitigating the camera-interpolation shortcut,

What carries the argument

Elastic Test-Time Training mechanism that applies a Fisher-weighted elastic prior around an exponentially moving average anchor state to regularize LaCT fast-weight updates.

Load-bearing premise

The Fisher-weighted elastic prior combined with the EMA-updated anchor will reliably prevent catastrophic forgetting and overfitting during multi-chunk test-time adaptation without introducing new instabilities or reducing the benefits of fast-weight updates.

What would settle it

Measuring 3D/4D reconstruction quality and forgetting rates when FSM processes a long sequence split into many small chunks versus a single large chunk; if quality drops or forgetting increases with multiple chunks, the stabilization claim fails.

Figures

Figures reproduced from arXiv: 2604.07350 by Chuang Gan, Haoyu Zhen, Joyce Chai, Xueyang Yu, Yuncong Yang, Ziqiao Ma.

Figure 1
Figure 1. Figure 1: Fast Spatial Memory (FSM) is an efficient, scalable 4D reconstruction model that learns spatiotemporal representations from long sequences to render novel views at novel times. The model is powered by Large Chunk Elastic Test-Time Training (LaCET) blocks and is compatible with a range of rendering decoders, including LRM-style and LVSM-style decoders. Abstract Large Chunk Test-Time Training (LaCT) has show… view at source ↗
Figure 2
Figure 2. Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: FSM-LVSM and FSM-LRM architectural designs. (a) LVSM-style rendering predicts target image patches directly from query tokens and does not build an explicit scene representation. (b) LRM-style rendering first predicts an explicit 4D scene representation with Gaussian primitives and then renders target views from that representation. This design ensures that the target-view tokens do not interact with one a… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative illustration of the ablation studies, obtained after the same training steps (16K) with the same training and inference [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Test-time scaling curves. Shown are PSNR/SSIM/LPIPS of LaCT (1/4 chunks) and LaCET (4 chunks; streaming-ema), trained with 32 images (vertical line) and evaluated with varying numbers of input images. Each point uses a 136-frame Stereo4D clip. For sparse views, input and target frames are randomly sampled across the long full span. For continuous views, we select a contiguous sub-sequence (e.g., 40 frames … view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison on Steoro4D test set. Note that for MoVieS we use a higher default resolution (504 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison on DL3DV benchmark. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Additional comparison on Steoro4D test set. Note that for MoVieS we use a higher default resolution (504 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative examples on Steoro4D test set. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative failure example. 26.84 PSNR 24.51 24.17 23.69 23.74 23.45 Ground Truth 28.10 PSNR 22.56 22.27 22.59 23.46 24.30 FSM 4D-LVSM Ground Truth FSM 4D-LVSM [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative results on NVIDIA benchmark. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative results on DL3DV-140 benchmark. [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
read the original abstract

Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training inspired by elastic weight consolidation, that stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this updated architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-trained FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks and mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalization to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Elastic Test-Time Training (ETT), inspired by elastic weight consolidation, to stabilize Large Chunk Test-Time Training (LaCT) fast-weight updates for long-context 3D/4D reconstruction. It introduces a Fisher-weighted elastic prior around an anchor state that evolves via exponential moving average (EMA) of past fast weights to balance stability and plasticity. This enables the Fast Spatial Memory (FSM) model, pre-trained on large-scale 3D/4D data, to support multi-chunk test-time adaptation over long sequences with smaller chunks, high-quality novel view-time rendering, and mitigation of the camera-interpolation shortcut, while reducing activation-memory bottlenecks.

Significance. If the empirical results hold, the approach could meaningfully advance test-time adaptation methods for spatiotemporal vision models by enabling scalable handling of arbitrarily long sequences without single-chunk memory limits or severe forgetting/overfitting. The explicit use of EWC-style regularization with an evolving anchor is a clear strength, and the pre-training plus multi-chunk experiments provide a concrete path toward practical 4D reconstruction systems.

major comments (3)
  1. [§3.2] §3.2 (Elastic Test-Time Training): The central stabilization claim relies on the Fisher-weighted prior accurately ranking parameter importance for the test-time objective, yet the manuscript does not specify whether the Fisher matrix is computed once on pre-training data, recomputed on each chunk, or updated online. This leaves open the distributional mismatch risk highlighted in the stress-test note, which directly affects whether the prior curbs forgetting without damping plasticity.
  2. [§4.1] §4.1 (FSM architecture and anchor update): The EMA anchor is presented as balancing stability/plasticity, but no ablation isolates its contribution versus the Fisher prior alone, nor quantifies how the anchor update rate interacts with chunk size to prevent the overfitting observed in plain LaCT. This is load-bearing for the multi-chunk claim.
  3. [Table 2] Table 2 (quantitative comparisons): The reported gains in PSNR/SSIM for smaller chunks are central to the 'high-quality reconstruction with smaller chunks' claim, but the table lacks variance across runs or statistical significance tests, making it difficult to confirm the improvements exceed the camera-interpolation shortcut baseline.
minor comments (3)
  1. [Eq. (7)] Notation for the elastic prior loss (Eq. 7) uses inconsistent symbols for the anchor state across the text and algorithm box; standardize to a single symbol.
  2. [§5] The abstract and §1 claim 'extensive experiments' but the experimental section would benefit from an explicit list of datasets and chunk sizes used in the multi-chunk setting.
  3. [Figure 3] Figure 3 caption does not state the number of chunks or sequence length for the visualized 4D reconstruction, reducing interpretability of the qualitative results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate clarifications, additional analyses, and statistical reporting as appropriate.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Elastic Test-Time Training): The central stabilization claim relies on the Fisher-weighted prior accurately ranking parameter importance for the test-time objective, yet the manuscript does not specify whether the Fisher matrix is computed once on pre-training data, recomputed on each chunk, or updated online. This leaves open the distributional mismatch risk highlighted in the stress-test note, which directly affects whether the prior curbs forgetting without damping plasticity.

    Authors: We have revised Section 3.2 to explicitly state that the Fisher matrix is computed once on the pre-training data, consistent with standard EWC practice, to obtain a fixed importance ranking without incurring per-chunk overhead at test time. We acknowledge the potential for distributional mismatch between pre-training and test chunks and have expanded the discussion to explain why the resulting elastic prior still supports effective stabilization in our setting, as demonstrated by the multi-chunk results. A brief reference to the stress-test observations has also been added for context. revision: yes

  2. Referee: [§4.1] §4.1 (FSM architecture and anchor update): The EMA anchor is presented as balancing stability/plasticity, but no ablation isolates its contribution versus the Fisher prior alone, nor quantifies how the anchor update rate interacts with chunk size to prevent the overfitting observed in plain LaCT. This is load-bearing for the multi-chunk claim.

    Authors: We agree that isolating the EMA anchor's role strengthens the multi-chunk claims. The revised manuscript includes a new ablation in Section 4.1 comparing the full ETT model against a Fisher-prior-only variant and the plain LaCT baseline. We have also added quantitative analysis and a supplementary figure examining the interaction between the EMA update rate and chunk size, showing that suitable rates reduce the overfitting seen in LaCT while preserving adaptation performance. revision: yes

  3. Referee: [Table 2] Table 2 (quantitative comparisons): The reported gains in PSNR/SSIM for smaller chunks are central to the 'high-quality reconstruction with smaller chunks' claim, but the table lacks variance across runs or statistical significance tests, making it difficult to confirm the improvements exceed the camera-interpolation shortcut baseline.

    Authors: We have updated Table 2 to report means accompanied by standard deviations computed over multiple runs with different random seeds. We have also added the results of paired statistical significance tests (t-tests) against the baselines, including the camera-interpolation shortcut, confirming that the reported gains are statistically significant. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal extends external EWC without self-referential reduction

full rationale

The paper's core contribution is the proposal of Elastic Test-Time Training (inspired by external elastic weight consolidation) and Fast Spatial Memory for LaCT stabilization via Fisher-weighted prior and EMA anchor. No derivation chain is presented that reduces a claimed prediction or result to its own inputs by construction. The abstract and description frame the approach as an architectural extension applying known regularization ideas to test-time adaptation, without fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that collapse the argument. The method's claims rest on empirical validation rather than tautological re-expression of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The abstract introduces Elastic Test-Time Training and FSM but does not list explicit free parameters or axioms. The approach inherits assumptions from elastic weight consolidation and relies on pre-training capturing useful spatiotemporal structure.

axioms (1)
  • domain assumption Elastic weight consolidation using Fisher information provides effective regularization to prevent catastrophic forgetting in neural network updates.
    The paper directly builds on this prior technique to stabilize test-time training.
invented entities (1)
  • Fast Spatial Memory (FSM) no independent evidence
    purpose: Scalable model for learning spatiotemporal representations and rendering novel view-time combinations from long sequences.
    New model name and architecture introduced on top of the elastic training method.

pith-pipeline@v0.9.0 · 5544 in / 1401 out tokens · 91915 ms · 2026-05-10T18:22:36.510212+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

80 extracted references · 13 canonical work pages · 5 internal anchors

  1. [1]

    Memoryawaresynapses: Learning what (not) to forget

    RahafAljundi,FrancescaBabiloni,MohamedElhoseiny,Mar- cusRohrbach,andTinneTuytelaars. Memoryawaresynapses: Learning what (not) to forget. InEuropean conference on computer vision (ECCV), pages 139–154, 2018. 4

  2. [2]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InInternational Conference on Computer Vision, 2025. 6

  3. [3]

    Atlas: Learning to optimally memorize the context at test time, 2025

    Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab 10 Mirrokni. Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025. 10

  4. [4]

    It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization

    Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni.It’sallconnected: Ajourneythroughtest-timemem- orization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025. 10

  5. [5]

    Titans: Learning to memorize at test time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. InConference on Neural Information Processing Systems, 2025. 10

  6. [6]

    Birth of a transformer: A memory viewpoint

    Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. InConference on Neural Information Processing Systems, pages 1560–1588, 2023. 10

  7. [7]

    Hardware-constrained hy- bridcodingofvideoimagery.IEEETransactionsonAerospace and Electronic Systems, (1):71–84, 1983

    Luen C Chan and Peter Whiteman. Hardware-constrained hy- bridcodingofvideoimagery.IEEETransactionsonAerospace and Electronic Systems, (1):71–84, 1983. 7

  8. [8]

    Ttt3r: 3d reconstruction as test-time training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. In International Conference on Learning Representations, 2026. 2, 10

  9. [9]

    Wildrayzer: Self-supervisedlargeviewsynthesisindynamicenvironments

    XuweiyiChen,WentaoZhou,andZezhouCheng. Wildrayzer: Self-supervisedlargeviewsynthesisindynamicenvironments. InConference on Computer Vision and Pattern Recognition,

  10. [10]

    One-minute video generation with test-time training

    Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. InConference on Computer Vision and Pattern Recognition, pages 17702–17711, 2025. 10

  11. [11]

    Learning without training: The implicit dynamics of in-context learning.arXiv preprint arXiv:2507.16003, 2025

    Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, and Javier Gonzalvo. Learning without training: The implicit dynamics of in-context learning.arXiv preprint arXiv:2507.16003, 2025. 10

  12. [12]

    St4rtrack: Simultaneous 4d reconstruction and tracking in the world

    Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. InInternational Conference on Com- puter Vision, pages 8503–8513, 2025. 10

  13. [13]

    Query-key normalization for trans- formers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for trans- formers. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020. 14

  14. [14]

    Lrm: Large reconstruction model for single image to 3d

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InInternationalConferenceonLearningRepresentations,

  15. [15]

    Real3d: Scalinguplargereconstructionmodelswithreal-worldimages

    HanwenJiang,QixingHuang,andGeorgiosPavlakos. Real3d: Scalinguplargereconstructionmodelswithreal-worldimages. InInternationalConferenceonComputerVision,pages5821– 5833, 2025. 10

  16. [16]

    Rayzer: A self-supervised large view synthesis model

    HanwenJiang,HaoTan,PengWang,HaianJin,YueZhao,Sai Bi,KaiZhang,FujunLuan,KalyanSunkavalli,QixingHuang, et al. Rayzer: A self-supervised large view synthesis model. InInternational Conference on Computer Vision, 2025. 9, 10

  17. [17]

    LVSM: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. LVSM: A large view synthesis model with minimal 3d inductive bias. InInternational Conference on Learning Representations, 2025. 1, 4, 9, 10

  18. [18]

    Stereo4d: Learning how things move in 3d from internet stereo videos

    Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. InConference on Computer Vision and Pattern Recognition, pages 10497– 10509, 2025. 6, 8, 9, 14, 15

  19. [19]

    Muon: An optimizer for hidden layers in neural networks.https: //kellerjordan.github.io/posts/muon , 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks.https: //kellerjordan.github.io/posts/muon , 2024. 3

  20. [20]

    Dy- namicstereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dy- namicstereo: Consistent dynamic depth from stereo videos. InConference on Computer Vision and Pattern Recognition, pages 13229–13239, 2023. 6

  21. [21]

    Lattice: Learn- ing to efficiently compress the memory.arXiv preprint arXiv:2504.05646, 2025

    Mahdi Karami and Vahab Mirrokni. Lattice: Learn- ing to efficiently compress the memory.arXiv preprint arXiv:2504.05646, 2025. 10

  22. [22]

    Robot see robot do: Imitating articulated object manipulation with monocular 4d reconstruction

    Justin Kerr, Chung Min Kim, Mingxuan Wu, Brent Yi, Qian- qian Wang, Ken Goldberg, and Angjoo Kanazawa. Robot see robot do: Imitating articulated object manipulation with monocular 4d reconstruction. InConference on Robot Learn- ing, 2024. 1

  23. [23]

    Scalingviewsynthesistransformers.arXivpreprint arXiv:2602.21341, 2026

    Evan Kim, Hyunwoo Ryu, Thomas W Mitchel, and Vincent Sitzmann. Scalingviewsynthesistransformers.arXivpreprint arXiv:2602.21341, 2026. 1, 4, 10

  24. [24]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3521–3526, 2017. 2, 3, 4

  25. [25]

    Dynamic evaluation of neural sequence models

    Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. In International Conference on Machine Learning, pages 2766– 2775, 2018. 5, 14

  26. [26]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam W Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. InConference on Computer Vision and Pattern Recognition, pages 6165–6177,

  27. [27]

    Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

    Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. In International Conference on Learning Representations, 2024. 10

  28. [28]

    Feed-forward bullet-timereconstructionofdynamicscenesfrommonocular videos

    HanxueLiang,JiaweiRen,AshkanMirzaei,AntonioTorralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, and Jiahui Huang. Feed-forward bullet-timereconstructionofdynamicscenesfrommonocular videos. InConference on Neural Information Processing Systems, 2025. 10

  29. [29]

    Movies: 11 Motion-aware 4d dynamic view synthesis in one second

    Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Tao Hu, Honglei Yan, Katerina Fragkiadaki, and Yadong Mu. Movies: 11 Motion-aware 4d dynamic view synthesis in one second. In Conference on Computer Vision and Pattern Recognition,

  30. [30]

    Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, KunWan,LantaoYu,QianyuGuo,ZixunYu,YawenLu,etal. Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision. InConference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 6, 9, 15

  31. [31]

    Longhorn: State space models are amortized online learners

    Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and qiang liu. Longhorn: State space models are amortized online learners. InInternational Conference on Learning Representations, 2025. 10

  32. [32]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025. 3

  33. [33]

    Test-Time Training with KV Binding Is Secretly Linear Attention

    JunchenLiu,SvenElflein,OrLitany,ZanGojcic,andRuilong Li. Test-time training with kv binding is secretly linear attention.arXiv preprint arXiv:2602.21204, 2026. 10

  34. [34]

    4d-lrm: Large space-time reconstruction model from and to any view at any time

    Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, et al. 4d-lrm: Large space-time reconstruction model from and to any view at any time. InConference on Neural Information Processing Systems, 2025. 2, 4, 5, 10, 15

  35. [35]

    Spring: A high-resolution high- detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nali- vayko, and Andrés Bruhn. Spring: A high-resolution high- detail dataset and benchmark for scene flow, optical flow and stereo. InConference on Computer Vision and Pattern Recognition, pages 4981–4991, 2023. 6

  36. [36]

    True self-supervised novel view synthesis is transferable

    Thomas Mitchel, Hyunwoo Ryu, and Vincent Sitzmann. True self-supervised novel view synthesis is transferable. InInter- national Conference on Learning Representations, 2026. 8, 10

  37. [37]

    Julius Plücker. Xvii. on a new geometry of space.Philo- sophical Transactions of the Royal Society of London, (155): 725–791, 1865. 4

  38. [38]

    Hopfield networks is all you need

    Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Thomas Adler, David Kreil, Michael K Kopp, Günter Klam- bauer, Johannes Brandstetter, and Sepp Hochreiter. Hopfield networks is all you need. InInternational Conference on Learning Representations, 2021. 10

  39. [39]

    L4gm: Large4dgaussianreconstructionmodel

    JiaweiRen,ChengXie,AshkanMirzaei,KarstenKreis,Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling,etal. L4gm: Large4dgaussianreconstructionmodel. In Conference on Neural Information Processing Systems, pages 56828–56858, 2024. 1, 8, 10

  40. [40]

    Weight normalization: A simplereparameterizationtoacceleratetrainingofdeepneural networks

    Tim Salimans and Durk P Kingma. Weight normalization: A simplereparameterizationtoacceleratetrainingofdeepneural networks. InConference on Neural Information Processing Systems, 2016. 3

  41. [41]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. InInter- national conference on machine learning, pages 9355–9366,

  42. [42]

    Learning to control fast-weight memo- ries: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992

    Jürgen Schmidhuber. Learning to control fast-weight memo- ries: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992. 10

  43. [43]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020. 4

  44. [44]

    Learning to (learn at test time): Rnns with expressive hidden states

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. InInternational Conference on Machine Learning, pages 57503–57522, 2025. 2, 10

  45. [45]

    End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675, 2025

    ArnuvTandon,KaranDalal,XinhaoLi,DanielKoceja,Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675,

  46. [46]

    Mv- dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds

    Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv- dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. InConference on Computer Vision and Pattern Recognition, 2024. 10

  47. [47]

    Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

    Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. InConference on Neural Information Processing Systems, 2017. 4

  48. [48]

    Transformers learn in-context by gradient descent

    Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmogi- nov, and Max Vladymyrov. Transformers learn in-context by gradient descent. InInternational Conference on Machine Learning, pages 35151–35174, 2023. 5, 10

  49. [49]

    tttlrm: Test-time training for long context and autoregressive 3d reconstruction

    Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, and Yiwei Hu. tttlrm: Test-time training for long context and autoregressive 3d reconstruction. InConference on Computer Vision and Pattern Recognition, 2026. 2, 4, 5, 9, 10, 15

  50. [50]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InConference on Computer Vision and Pattern Recognition, pages 5294–5306,

  51. [51]

    Test-time regression: a unifying framework for designing sequence models with associative memory.arXiv:2501.12352, 2025

    Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: aunifyingframeworkfordesigningsequencemod- elswithassociativememory.arXivpreprintarXiv:2501.12352,

  52. [52]

    Pf- lrm: Pose-free large reconstruction model for joint pose and shape prediction

    PengWang,HaoTan,SaiBi,YinghaoXu,FujunLuan,Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf- lrm: Pose-free large reconstruction model for joint pose and shape prediction. InInternational Conference on Learning Representations, 2024. 10

  53. [53]

    Shape of mo- tion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of mo- tion: 4d reconstruction from a single video. InInternational Conference on Computer Vision, pages 9660–9672, 2025. 8

  54. [54]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InConference on Computer Vision and Pattern Recognition, pages 10510–10522, 2025. 10

  55. [55]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InConference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 10 12

  56. [56]

    Imagequalityassessment: fromerrorvisibilityto structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Imagequalityassessment: fromerrorvisibilityto structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 7

  57. [57]

    Lrm- zero: Training large reconstruction models with synthesized data.InConferenceonNeuralInformationProcessingSystems,

    Desai Xie, Sai Bi, Zhixin Shu, Kai Zhang, Zexiang Xu, Yi Zhou,SorenPirk,ArieKaufman,XinSun,andHaoTan. Lrm- zero: Training large reconstruction models with synthesized data.InConferenceonNeuralInformationProcessingSystems,

  58. [58]

    SV4d: Dynamic 3d content generation withmulti-frameandmulti-viewconsistency

    Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. SV4d: Dynamic 3d content generation withmulti-frameandmulti-viewconsistency. InInternational Conference on Learning Representations, 2025. 1

  59. [59]

    Depth- splat: Connectinggaussiansplattinganddepth

    HaofeiXu,SongyouPeng, FangjinhuaWang,HermannBlum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depth- splat: Connectinggaussiansplattinganddepth. InConference on Computer Vision and Pattern Recognition, pages 16453– 16463, 2025. 9

  60. [60]

    4dgt: Learning a 4d gaussian transformerusingreal-worldmonocularvideos

    Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, and Zhaoyang Lv. 4dgt: Learning a 4d gaussian transformerusingreal-worldmonocularvideos. InConference on Neural Information Processing Systems, 2025. 8, 10

  61. [61]

    InInternational Conference on Learning Representations, 2025

    Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, DanfeiXu,etal.Storm: Spatio-temporalreconstructionmodel for large-scale outdoor scenes. InInternational Conference on Learning Representations, 2025. 10

  62. [62]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InConference on Computer Vision and Pattern Recognition, 2025. 10

  63. [63]

    Parallelizinglineartransformerswiththedeltarule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and YoonKim. Parallelizinglineartransformerswiththedeltarule over sequence length. InConference on Neural Information Processing Systems, pages 115491–115522, 2024. 10

  64. [64]

    Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting

    ZeyuYang,HongyeYang,ZijiePan,andLiZhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. InInternational Conference on Learning Representations, 2024. 5, 15

  65. [65]

    Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera

    Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Conference on Computer Vision and Pattern Recognition, pages 5336–5345, 2020. 8, 9

  66. [66]

    Revealing and mitigating the local pattern shortcuts of mamba

    Wangjie You, Zecheng Tang, Juntao Li, Lili Yao, and Min Zhang. Revealing and mitigating the local pattern shortcuts of mamba. InFindings of the Association for Computational Linguistics: ACL 2025, pages 12156–12178, 2025. 7

  67. [67]

    Contin- ual learning through synaptic intelligence

    Friedemann Zenke, Ben Poole, and Surya Ganguli. Contin- ual learning through synaptic intelligence. InInternational conference on machine learning, pages 3987–3995, 2017. 4

  68. [68]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    JunyiZhang,CharlesHerrmann,JunhwaHur,VarunJampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. InInternational Conference on Learning Representations, 2025. 10

  69. [69]

    LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-HsuanYang,ForresterCole,TrevorDarrell,andDeqing Sun. Loger: Long-context geometric reconstruction with hybrid memory.arXiv preprint arXiv:2603.03269, 2026. 2, 10

  70. [70]

    Arf: Artistic radiance fields

    Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely. Arf: Artistic radiance fields. InEuropean Conference on Computer Vision, pages 717–733,

  71. [71]

    Gs-lrm: Large reconstruction model for 3d gaussian splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. InEuropean Conference on Computer Vision, pages 1–19, 2024. 1, 4, 9, 10, 15

  72. [72]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 6, 7

  73. [73]

    Test-time training done right

    Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. InInternational Conference on Learning Representations, 2026. 2, 9, 10

  74. [74]

    Learning 4d embodied world models

    Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Learning 4d embodied world models. InInternational Conference on Computer Vision, pages 5337–5347, 2025. 1

  75. [75]

    Pointodyssey: A large-scale syn- thetic dataset for long-term point tracking

    YangZheng,AdamWHarley,BokuiShen,GordonWetzstein, and Leonidas J Guibas. Pointodyssey: A large-scale syn- thetic dataset for long-term point tracking. InInternational Conference on Computer Vision, pages 19855–19865, 2023. 6

  76. [76]

    Page-4d: Disentangled pose and geometry estimation for 4d perception

    KaichenZhou,YuhanWang,GraceChen,GaspardBeaudouin, Fangneng Zhan, Paul Pu Liang, and Mengyu Wang. Page-4d: Disentangled pose and geometry estimation for 4d perception. InInternational Conference on Learning Representations,

  77. [77]

    Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics, 37 (4):1–12, 2018

    TinghuiZhou,RichardTucker,JohnFlynn,GrahamFyffe,and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics, 37 (4):1–12, 2018. 6

  78. [78]

    Streaming 4d visual geometry transformer.arXiv preprint arXiv:2507.11539, 2025

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025. 10

  79. [79]

    Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

    Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, and Li Fuxin. Long-lrm++: Preserving fine details in feed-forward wide- coverage reconstruction.arXiv preprint arXiv:2512.10267,

  80. [80]

    Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

    ChenZiwen,HaoTan,KaiZhang,SaiBi,FujunLuan,Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. InInternationalConferenceonComputerVision,pages4349– 4359, 2025. 2, 9, 10 13 A. Implementation and Training Details A.1. Data Pre-processing For each training sample, we load a video clip to...