pith. sign in

arxiv: 2606.23581 · v1 · pith:LRO62IW7new · submitted 2026-06-22 · 💻 cs.DC · cs.AI· cs.CV

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

Pith reviewed 2026-06-26 07:11 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.CV
keywords KV cache reuseposition-invariant cachelow-rank conditioning patchcross-chunk conditioningtraining-free reusemultimodal attentionRoPE re-rotationsliding window cache
0
0 comments X

The pith

Storing a low-rank patch with each KV chunk restores the lost cross-chunk conditioning for training-free position-invariant reuse in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that naive reuse of a cached chunk at a new position loses exactly the conditioning signal the chunk absorbed from its neighbors. This loss appears as a diffuse low-rank residue strongest in deep layers and invisible to single-hop retrieval yet essential for multi-hop binding. The fix is a small training-free patch computed once per chunk and combined with exact RoPE re-rotation. The combination turns reordering, sliding-window survival, and recall into cheap operations that avoid re-encoding while recovering full accuracy on cross-chunk benchmarks. The approach works uniformly across MHA, GQA, and MLA families and across six model backbones in a production kernel.

Core claim

The central claim is that the information lost when a cached chunk is read at a new position is precisely the cross-chunk conditioning absorbed from its neighbors, which manifests as a diffuse low-rank residue concentrated in deeper layers. This residue is recovered by storing a small rank-m conditioning patch with each position-free chunk; reuse then reduces to one operator consisting of exact RoPE re-rotation plus application of the patch. The resulting mechanism supports three window operations at low cost: arbitrary reordering of a cached set, relocation of surviving chunks in a sliding window, and rehydration of an evicted chunk, all without re-encoding. A rank-m patch restores full tas

What carries the argument

the low-rank conditioning patch stored with each position-free chunk, which restores the diffuse cross-chunk residue after exact RoPE re-rotation

If this is right

  • Cached chunks can be placed in any order using the same patch and rotation operator.
  • Surviving chunks in a sliding window relocate via rotation alone with zero re-encoding.
  • An evicted chunk is rehydrated for recall by its stored patch without re-encoding.
  • Full task accuracy on MM-NIAH and two-page doc-QA is recovered at a fraction of the KV footprint.
  • The patch reconstructs re-prefill KV values to within bf16 rounding in a production kernel across six backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-rank residue pattern may appear in single-modal long-context settings, allowing the patch technique to transfer without multimodal-specific changes.
  • Savings would be largest in video and streaming workloads where the conditioning signal is described as strongest.
  • The patch could be combined with existing KV compression methods such as quantization to compound memory reductions.
  • Testing whether the required rank m grows with context length would clarify the scaling behavior of the residue.
  • The method leaves open whether the patch itself can be further compressed or shared across similar chunks.

Load-bearing premise

The cross-chunk conditioning loss is a diffuse low-rank residue concentrated in deep layers that can be fully restored by a small training-free patch stored with each chunk, without requiring model changes or re-encoding.

What would settle it

Apply the rank-m patch versus full re-prefill on MM-NIAH across the two attention families and check whether multi-hop accuracy matches within bf16 rounding error; failure to match falsifies the restoration claim.

Figures

Figures reproduced from arXiv: 2606.23581 by Bole Ma, Gerhard Wellein, Harald Koestler, Jan Eitzinger.

Figure 1
Figure 1. Figure 1: Three reuse patterns beyond the window, and what each costs. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Position-invariant storage across attention families. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Structure of the conditioning deficit (Kimi-VL, MLA content channel, 27 layers). [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Why cheap estimation of the deficit is still open. (a) no cheap selector locates [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: One intrinsic rank (about 32) governs the correction across GQA, MoE, and MLA [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prior position-independent caches select on the wrong target. (a) [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Feature-axis patch vs. sparse-token PIC baselines (MuirBench ( [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The deficit is shallow-free and deep. Each plane is one layer’s feature-by-token [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Low-rank ∆ and monotone rank-m recovery across softmax-attention VLMs (pure￾MHA and KV-sharing variants); the gap tracks cross-chunk-binding capability, not size (LLaVA-1.5 and DeepSeek-VL are both 7B/576-tok, only DeepSeek shows a gap). model family (nL) patch η@0 re-prefill to match token η@0.5 Kimi-VL-A3B MLA (27) 0.91 0.89 0.32 Qwen2.5-VL-7B GQA (28) 0.96 0.86 0.60 Qwen3-VL-8B deepstack-GQA (36) 0.88 0… view at source ↗
Figure 10
Figure 10. Figure 10: Modality scope: vision and video show the gap and recover; audio (Qwen2.5- [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The cost is on memory. (a) decode sits far on the bandwidth-bound side of [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Recompute-free temporal reuse in the live SGLang kernel. (a) the operator’s error [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
read the original abstract

Multimodal agents repeatedly re-examine the same video frames, UI screenshots, and rendered artifacts as their context window slides and reasoning iterates, yet every look-back re-encodes from scratch, because prefix caches serve reuse only at a fixed leading position. We show this recompute is avoidable, and identify exactly what naive KV reuse loses: the cross-chunk conditioning a chunk absorbs from its neighbours. This loss is asymmetric. The direct readout of a cached chunk is recovered exactly and for free by the standard state-merge. What remains is a diffuse, low-rank residue concentrated in deep layers, invisible to single-hop retrieval but precisely what multi-hop reasoning binds on. Blind reuse therefore leaves single-hop recall intact while halving multi-hop accuracy; this is the failure mode prior position-independent caches, designed for single-context or single-image reuse, do not address. We repair it with a small, training-free low-rank conditioning patch stored alongside each position-free chunk. Reuse reduces to one operator across MLA, GQA, and MHA: exact RoPE re-rotation to any target position, plus the patch that restores cross-chunk binding. This makes three window operations cheap: reorder (one patch serves every ordering of a cached set), sliding-window survival (surviving chunks relocate via rotation only, zero re-encode), and recall (an evicted chunk is rehydrated by its patch, never re-encoded). A rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks, MM-NIAH across two attention families and two-page doc-QA, at a fraction of the KV footprint, and reconstructs re-prefill KV to within bf16 rounding in a production SGLang kernel across six backbones. The conditioning signal is strongest in redundant vision and video streams, making our solution most impactful where multimodal agents spend their recompute budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that naive KV-cache reuse in multimodal models loses an asymmetric, diffuse low-rank cross-chunk conditioning residue concentrated in deep layers; this residue can be restored by a small training-free rank-m patch stored with each position-free chunk. Reuse then reduces to RoPE re-rotation plus the patch, enabling cheap reorder, sliding-window survival, and recall. The authors report that a rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks and MM-NIAH (across two attention families and two-page doc-QA), reconstructs re-prefill KV to within bf16 rounding in a production SGLang kernel across six backbones, and is most impactful on redundant vision/video streams.

Significance. If the central empirical claims hold, the work would provide a practical, training-free mechanism for KV-cache reuse in multimodal agents that repeatedly re-examine the same visual content, potentially reducing recompute at a fraction of the KV footprint while preserving multi-hop reasoning accuracy.

major comments (2)
  1. [Abstract] Abstract: the claim that the lost cross-chunk conditioning is a 'diffuse, low-rank residue' restorable by a single fixed patch per chunk is load-bearing for the reuse guarantee. Because the residue is defined as conditioning absorbed from neighbors, a patch computed once and stored with the chunk must be independent of any particular neighbor set; the manuscript must demonstrate (via explicit construction or ablation) that this residue does not vary materially with neighbor identity or content, otherwise the patch cannot restore binding for arbitrary reorderings without re-encoding.
  2. [Abstract] Abstract: the reported recovery of 'full task accuracy' on MM-NIAH and cross-chunk-binding tasks is presented without reference to the precise experimental protocol, baseline definitions, or controls for position-dependent effects; this makes it impossible to assess whether the patch truly isolates the claimed low-rank residue or whether the result depends on the specific neighbor sets used during patch computation.
minor comments (2)
  1. The abstract states results 'across six backbones' and 'two attention families' but does not name them or provide the corresponding tables/figures; adding an explicit list or reference would improve clarity.
  2. Notation for the patch (rank m, storage format) is introduced without a dedicated equation or pseudocode block in the provided abstract; a short formal definition would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major point below with clarifications drawn from the manuscript and indicate revisions to strengthen the presentation of independence and experimental details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the lost cross-chunk conditioning is a 'diffuse, low-rank residue' restorable by a single fixed patch per chunk is load-bearing for the reuse guarantee. Because the residue is defined as conditioning absorbed from neighbors, a patch computed once and stored with the chunk must be independent of any particular neighbor set; the manuscript must demonstrate (via explicit construction or ablation) that this residue does not vary materially with neighbor identity or content, otherwise the patch cannot restore binding for arbitrary reorderings without re-encoding.

    Authors: The manuscript demonstrates independence via the reorder experiments, where one fixed patch per chunk supports arbitrary reorderings of a cached set while recovering full accuracy. This outcome would be impossible if the residue varied materially with specific neighbor identities or content. The patch is constructed as the low-rank component of the absorbed conditioning in a position-invariant manner (detailed in the methods), and the SGLang kernel reconstruction to bf16 precision across backbones further supports generality. To make the independence explicit as requested, we will add an ablation varying neighbor sets at patch computation time and report the resulting variance. revision: yes

  2. Referee: [Abstract] Abstract: the reported recovery of 'full task accuracy' on MM-NIAH and cross-chunk-binding tasks is presented without reference to the precise experimental protocol, baseline definitions, or controls for position-dependent effects; this makes it impossible to assess whether the patch truly isolates the claimed low-rank residue or whether the result depends on the specific neighbor sets used during patch computation.

    Authors: The abstract is a summary; the full protocols (patch computation on chunks with a fixed neighbor set during precomputation, baselines of naive reuse without patch versus full re-prefill, and position controls via exact RoPE re-rotation) appear in Section 4 and Appendix B, along with results across two attention families. These controls isolate the residue effect. We will revise the abstract to reference the experimental section and briefly note the protocol and position controls. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core argument identifies an empirical loss (asymmetric cross-chunk conditioning residue in deep layers) from naive KV reuse and repairs it via a training-free low-rank patch. No equations or claims in the provided text reduce the patch effect or accuracy recovery to a quantity defined by the result itself, a fitted input renamed as prediction, or a self-citation chain. The derivation remains self-contained, with the central claim resting on direct experimental validation across benchmarks and kernels rather than internal redefinition.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 1 invented entities

Abstract-only review; the low-rank patch is the primary new element introduced to capture the residue. No other free parameters or axioms are identifiable from the provided text.

free parameters (1)
  • patch rank m
    Dimensionality of the conditioning patch selected to recover accuracy on the mentioned benchmarks.
invented entities (1)
  • low-rank conditioning patch no independent evidence
    purpose: Restores the diffuse low-rank cross-chunk conditioning residue for position-free KV reuse
    Introduced in the abstract as the key addition enabling accurate multi-hop reuse without re-encoding.

pith-pipeline@v0.9.1-grok · 5885 in / 1338 out tokens · 34808 ms · 2026-06-26T07:11:02.033740+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 9 canonical work pages

  1. [1]

    and Ermon, Stefano and Rudra, Atri and R

    Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R. Flash. Advances in Neural Information Processing Systems (NeurIPS) , year=

  2. [2]

    Dao, Tri , booktitle=. Flash

  3. [3]

    Efficient memory management for large language model serving with pagedattention

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , title =. Proceedings of the 29th Symposium on Operating Systems Principles , pages =. 2023 , isbn =. doi:10.1145/3600006.3613165 , abstract =

  4. [4]

    and Barrett, Clark and Sheng, Ying , title =

    Zheng, Lianmin and Yin, Liangsheng and Xie, Zhiqiang and Sun, Chuyue and Huang, Jeff and Yu, Cody Hao and Cao, Shiyi and Kozyrakis, Christos and Stoica, Ion and Gonzalez, Joseph E. and Barrett, Clark and Sheng, Ying , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  5. [5]

    Kong et al

    Patel, Pratyush and Choukse, Esha and Zhang, Chaojie and Shah, Aashaka and Goiri, \'. Splitwise: Efficient Generative. Proceedings of the 51st Annual International Symposium on Computer Architecture , pages =. 2025 , isbn =. doi:10.1109/ISCA59077.2024.00019 , abstract =

  6. [6]

    Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation , articleno =

    Zhong, Yinmin and Liu, Shengyu and Chen, Junda and Hu, Jianbo and Zhu, Yibo and Liu, Xuanzhe and Jin, Xin and Zhang, Hao , title =. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation , articleno =. 2024 , isbn =

  7. [7]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

    Shah, Jay and Bikshandi, Ganesh and Zhang, Ying and Thakkar, Vijay and Ramani, Pradeep and Dao, Tri , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  8. [8]

    2025 , url=

    Shang Yang and Junxian Guo and Haotian Tang and Qinghao Hu and Guangxuan Xiao and Jiaming Tang and Yujun Lin and Zhijian Liu and Yao Lu and Song Han , booktitle=. 2025 , url=

  9. [9]

    Proceedings of the Twentieth European Conference on Computer Systems , pages =

    Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen , title =. Proceedings of the Twentieth European Conference on Computer Systems , pages =. 2025 , isbn =. doi:10.1145/3689031.3696098 , abstract =

  10. [10]

    2510.10129 , archivePrefix=

    Bin Yang and Qiuyu Leng and Jun Zeng and Zhenhua Wu , year=. 2510.10129 , archivePrefix=

  11. [11]

    Yang, Zebin and Xie, Tong and Lu, Baotong and Liu, Shaoshan and Yu, Bo and Li, Meng , journal=

  12. [12]

    Jingbo Yang and Bairu Hou and Wei Wei and Yujia Bao and Shiyu Chang , journal=

  13. [13]

    Proceedings of the 42nd International Conference on Machine Learning , articleno =

    Hu, Junhao and Huang, Wenrui and Wang, Weidong and Wang, Haoyi and Hu, Tiancheng and Zhang, Qin and Feng, Hao and Chen, Xusheng and Shan, Yizhou and Xie, Tao , title =. Proceedings of the 42nd International Conference on Machine Learning , articleno =. 2025 , publisher =

  14. [14]

    2025 , url=

    Junhao Hu and Wenrui Huang and Weidong Wang and Haoyi Wang and tiancheng hu and zhang qin and Hao Feng and Xusheng Chen and Yizhou Shan and Tao Xie , booktitle=. 2025 , url=

  15. [15]

    Qin, Shengling and Yu, Hao and Wu, Chenxin and Li, Zheng and Cao, Yizhong and Zhuge, Zhengyang and Zhou, Yuxin and Yao, Wentao and Zhang, Yi and Wang, Zhengheng and Bai, Shuai and Zhang, Jianwei and Lin, Junyang , journal=

  16. [16]

    2403.05527 , archivePrefix=

    Hao Kang and Qingru Zhang and Souvik Kundu and Geonhwa Jeong and Zaoxing Liu and Tushar Krishna and Tuo Zhao , year=. 2403.05527 , archivePrefix=

  17. [17]

    Abdelfattah , year=

    Chi-Chih Chang and Wei-Cheng Lin and Chien-Yu Lin and Yash Akhauri and Hung-Yueh Chiang and Xilai Dai and Huiqiang Jiang and Yucheng Li and Kai-Chiang Wu and Luis Ceze and Mohamed S. Abdelfattah , year=. x

  18. [18]

    2026 , eprint=

    Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching , author=. 2026 , eprint=

  19. [19]

    GetMobile: Mobile Comp

    Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Xiao, Guangxuan and Han, Song , title =. GetMobile: Mobile Comp. and Comm. , month = jan, pages =. 2025 , issue_date =. doi:10.1145/3714983.3714987 , abstract =

  20. [20]

    2023 , url=

    Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh , booktitle=. 2023 , url=

  21. [21]

    Proceedings of the 40th International Conference on Machine Learning , articleno =

    Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =

  22. [22]

    2024 , url=

    Zirui Liu and Jiayi Yuan and Hongye Jin and Shaochen Zhong and Zhaozhuo Xu and Vladimir Braverman and Beidi Chen and Xia Hu , booktitle=. 2024 , url=

  23. [23]

    and Shao, Yakun Sophia and Keutzer, Kurt and Gholami, Amir , title =

    Hooper, Coleman and Kim, Sehoon and Mohammadzadeh, Hiva and Mahoney, Michael W. and Shao, Yakun Sophia and Keutzer, Kurt and Gholami, Amir , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  24. [24]

    Lin*, Yujun and Tang*, Haotian and Yang*, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song , journal=

  25. [25]

    2024 , eprint=

    Efficient Streaming Language Models with Attention Sinks , author=. 2024 , eprint=

  26. [26]

    CoRR , volume=

    Zhongwei Wan and Ziang Wu and Che Liu and Jinfa Huang and Zhihong Zhu and Peng Jin and Longyue Wang and Li Yuan , title=. CoRR , volume=. 2024 , cdate=

  27. [27]

    Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

    Zhang, Zhenyu and Sheng, Ying and Zhou, Tianyi and Chen, Tianlong and Zheng, Lianmin and Cai, Ruisi and Song, Zhao and Tian, Yuandong and R\'. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

  28. [28]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Tang, Jiaming and Zhao, Yilong and Zhu, Kan and Xiao, Guangxuan and Kasikci, Baris and Han, Song , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  29. [29]

    2410.10819 , archivePrefix=

    Guangxuan Xiao and Jiaming Tang and Jingwei Zuo and Junxian Guo and Shang Yang and Haotian Tang and Yao Fu and Song Han , year=. 2410.10819 , archivePrefix=

  30. [30]

    Proceedings of the 42nd International Conference on Machine Learning , articleno =

    Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi , title =. Proceedings of the 42nd International Conference on Machine Learning , articleno =. 2025 , publisher =

  31. [31]

    2024 , eprint=

    MagicPIG: LSH Sampling for Efficient LLM Generation , author=. 2024 , eprint=

  32. [32]

    RoFormer: Enhanced transformer with Rotary Position Embedding.Neurocomputing, 568:127063, February 2024

    Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng , title =. 2024 , issue_date =. doi:10.1016/j.neucom.2023.127063 , journal =

  33. [33]

    2023 , url=

    Joshua Ainslie and James Lee-Thorp and Michiel de Jong and Yury Zemlyanskiy and Federico Lebron and Sumit Sanghai , booktitle=. 2023 , url=

  34. [34]

    Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

    Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

  35. [35]

    Improved Baselines with Visual Instruction Tuning , year=

    Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae , booktitle=. Improved Baselines with Visual Instruction Tuning , year=

  36. [36]

    2407.07895 , archivePrefix=

    Feng Li and Renrui Zhang and Hao Zhang and Yuanhan Zhang and Bo Li and Wei Li and Zejun Ma and Chunyuan Li , year=. 2407.07895 , archivePrefix=

  37. [37]

    2025 , url=

    Bo Li and Yuanhan Zhang and Dong Guo and Renrui Zhang and Feng Li and Hao Zhang and Kaichen Zhang and Peiyuan Zhang and Yanwei Li and Ziwei Liu and Chunyuan Li , journal=. 2025 , url=

  38. [38]

    2312.07533 , archivePrefix=

    Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han , year=. 2312.07533 , archivePrefix=

  39. [39]

    2510.09608 , archivePrefix=

    Ruyi Xu and Guangxuan Xiao and Yukang Chen and Liuning He and Kelly Peng and Yao Lu and Song Han , year=. 2510.09608 , archivePrefix=

  40. [40]

    Fei Wang and Xingyu Fu and James Y. Huang and Zekun Li and Qin Liu and Xiaogeng Liu and Mingyu Derek Ma and Nan Xu and Wenxuan Zhou and Kai Zhang and Tianyi Lorena Yan and Wenjie Jacky Mo and Hsiang-Hui Liu and Pan Lu and Chunyuan Li and Chaowei Xiao and Kai-Wei Chang and Dan Roth and Sheng Zhang and Hoifung Poon and Muhao Chen , year=. 2406.09411 , archi...

  41. [41]

    2024 , url=

    Song Dingjie and Shunian Chen and Guiming Hardy Chen and Fei Yu and Xiang Wan and Benyou Wang , booktitle=. 2024 , url=

  42. [42]

    The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    Needle In A Multimodal Haystack , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  43. [43]

    2024 , url=

    Yubo Ma and Yuhang Zang and Liangyu Chen and Meiqi Chen and Yizhu Jiao and Xinze Li and Xinyuan Lu and Ziyu Liu and Yan Ma and Xiaoyi Dong and Pan Zhang and Liangming Pan and Yu-Gang Jiang and Jiaqi Wang and Yixin Cao and Aixin Sun , booktitle=. 2024 , url=

  44. [44]

    2405.21075 , archivePrefix=

    Chaoyou Fu and Yuhan Dai and Yongdong Luo and Lei Li and Shuhuai Ren and Renrui Zhang and Zihan Wang and Chenyu Zhou and Yunhang Shen and Mengdan Zhang and Peixian Chen and Yanwei Li and Shaohui Lin and Sirui Zhao and Ke Li and Tong Xu and Xiawu Zheng and Enhong Chen and Caifeng Shan and Ran He and Xing Sun , year=. 2405.21075 , archivePrefix=

  45. [45]

    Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

    Mangalam, Karttikeya and Akshkulakov, Raiymbek and Malik, Jitendra , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

  46. [46]

    2025 , eprint=

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces , author=. 2025 , eprint=

  47. [47]

    CoRR , volume =

    Harsh Trivedi and Niranjan Balasubramanian and Tushar Khot and Ashish Sabharwal , title =. CoRR , volume =. 2021 , url =. 2108.00573 , timestamp =

  48. [48]

    2411.02820 , archivePrefix=

    Yuhan Liu and Yuyang Huang and Jiayi Yao and Shaoting Feng and Zhuohan Gu and Kuntai Du and Hanchen Li and Yihua Cheng and Junchen Jiang and Shan Lu and Madan Musuvathi and Esha Choukse , year=. 2411.02820 , archivePrefix=

  49. [49]

    2511.05534 , archivePrefix=

    Kunxi Li and Yufan Xiong and Zhonghua Jiang and Yiyun Zhou and Zhaode Wang and Chengfei Lv and Shengyu Zhang , year=. 2511.05534 , archivePrefix=

  50. [50]

    2502.02175 , archivePrefix=

    Siyu Xu and Yunke Wang and Chenghao Xia and Dihao Zhu and Tao Huang and Chang Xu , year=. 2502.02175 , archivePrefix=

  51. [51]

    2605.28544 , archivePrefix=

    Chen Shi and Jinrui Xu and Shaoshuai Shi and Kehua Sheng and Bo Zhang and Li Jiang , year=. 2605.28544 , archivePrefix=

  52. [52]

    2601.14724 , archivePrefix=

    Haowei Zhang and Shudong Yang and Jinlan Fu and See-Kiong Ng and Xipeng Qiu , year=. 2601.14724 , archivePrefix=

  53. [53]

    2508.15717 , archivePrefix=

    Yanlai Yang and Zhuokai Zhao and Satya Narayan Shukla and Aashu Singh and Shlok Kumar Mishra and Lizhu Zhang and Mengye Ren , year=. 2508.15717 , archivePrefix=

  54. [54]

    2510.00413 , archivePrefix=

    Zikang Liu and Junyi Li and Wayne Xin Zhao and Dawei Gao and Yaliang Li and Ji-rong Wen , year=. 2510.00413 , archivePrefix=

  55. [55]

    2025 , eprint=

    Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding , author=. 2025 , eprint=

  56. [56]

    Z oom E ye: Enhancing Multimodal LLM s with Human-Like Zooming Capabilities through Tree-Based Image Exploration

    Shen, Haozhan and Zhao, Kangjia and Zhao, Tiancheng and Xu, Ruochen and Zhang, Zilun and Zhu, Mingwei and Yin, Jianwei. Z oom E ye: Enhancing Multimodal LLM s with Human-Like Zooming Capabilities through Tree-Based Image Exploration. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.335

  57. [57]

    Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXX , pages =

    Wang, Xiaohan and Zhang, Yuhui and Zohar, Orr and Yeung-Levy, Serena , title =. Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXX , pages =. 2024 , isbn =. doi:10.1007/978-3-031-72989-8_4 , abstract =

  58. [58]

    2024 , publisher =

    He, Hongliang and Yao, Wenlin and Ma, Kaixin and Yu, Wenhao and Dai, Yong and Zhang, Hongming and Lan, Zhenzhong and Yu, Dong. W eb V oyager: Building an End-to-End Web Agent with Large Multimodal Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.371

  59. [59]

    Modular Visual Question Answering via Code Generation

    Subramanian, Sanjay and Narasimhan, Medhini and Khangaonkar, Kushal and Yang, Kevin and Nagrani, Arsha and Schmid, Cordelia and Zeng, Andy and Darrell, Trevor and Klein, Dan. Modular Visual Question Answering via Code Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10....

  60. [60]

    Haoxuan You and Rui Sun and Zhecan Wang and Long Chen and Gengyu Wang and Hammad Ayyubi and Kai-Wei Chang and Shih-Fu Chang , booktitle=. Ideal. 2023 , url=

  61. [61]

    2312.14135 , archivePrefix=

    Penghao Wu and Saining Xie , year=. 2312.14135 , archivePrefix=

  62. [62]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Zheng, Boyuan and Gou, Boyu and Kil, Jihyung and Sun, Huan and Su, Yu , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  63. [63]

    2411.04952 , archivePrefix=

    Jaemin Cho and Debanjan Mahata and Ozan Irsoy and Yujie He and Mohit Bansal , year=. 2411.04952 , archivePrefix=

  64. [64]

    2509.24786 , archivePrefix=

    Shenghao Fu and Qize Yang and Yuan-Ming Li and Xihan Wei and Xiaohua Xie and Wei-Shi Zheng , year=. 2509.24786 , archivePrefix=

  65. [65]

    Zhang, Shuoshuo and Li, Zijian and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Zhang, Jun and Yang, Yujiu and Wang, Rui , journal=

  66. [66]

    2604.13731 , archivePrefix=

    Yuanlei Zheng and Pei Fu and Hang Li and Ziyang Wang and Yuyi Zhang and Wenyu Ruan and Xiaojin Zhang and Zhongyu Wei and Zhenbo Luo and Jian Luan and Wei Chen and Xiang Bai , year=. 2604.13731 , archivePrefix=

  67. [67]

    arXiv preprint arXiv:2505.18079 , year=

    Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding , author=. arXiv preprint arXiv:2505.18079 , year=

  68. [68]

    2502.13923 , archivePrefix=

    Shuai Bai and Keqin Chen and Xuejing Liu and Jialin Wang and Wenbin Ge and others , year=. 2502.13923 , archivePrefix=

  69. [69]

    2511.21631 , archivePrefix=

    Shuai Bai and Yuxuan Cai and Ruizhe Chen and Keqin Chen and Xionghui Chen and others , year=. 2511.21631 , archivePrefix=

  70. [70]

    2509.17765 , archivePrefix=

    Jin Xu and Zhifang Guo and Hangrui Hu and Yunfei Chu and Xiong Wang and others , year=. 2509.17765 , archivePrefix=

  71. [71]

    2403.05525 , archivePrefix=

    Haoyu Lu and Wen Liu and Bo Zhang and Bingxuan Wang and Kai Dong and Bo Liu and Jingxiang Sun and Tongzheng Ren and Zhuoshu Li and Hao Yang and Yaofeng Sun and Chengqi Deng and Hanwei Xu and Zhenda Xie and Chong Ruan , year=. 2403.05525 , archivePrefix=

  72. [72]

    2412.10302 , archivePrefix=

    Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Z...

  73. [73]

    2504.10479 , archivePrefix=

    Jinguo Zhu and Weiyun Wang and Zhe Chen and Zhaoyang Liu and Shenglong Ye and others , year=. 2504.10479 , archivePrefix=

  74. [74]

    2024 , volume=

    Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng , booktitle=. 2024 , volume=

  75. [75]

    2404.14219 , archivePrefix=

    Marah Abdin and Jyoti Aneja and Hany Awadalla and Ahmed Awadallah and Ammar Ahmad Awan and others , year=. 2404.14219 , archivePrefix=

  76. [76]

    Andr. Smol. Second Conference on Language Modeling , year=

  77. [77]

    Attention is All you Need , url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =