pith. machine review for the scientific record.

arxiv: 2605.07019 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: no theorem link

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: vision language models · visual text compression · selective expansion · rendered text · multimodal QA · document understanding · image-based processing

The pith

LensVLM lets VLMs read heavily compressed text images by expanding only the relevant sections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train vision language models to scan low-resolution rendered text, spot the parts that matter for a given question, and expand just those parts to full detail using learned tools. This keeps accuracy near the level of reading the entire uncompressed text even at 4.3 times compression and beats retrieval or uniform compression baselines up to 10.1 times compression on text QA tasks. The same recipe works for documents and code where layout or structure carries extra information. A sympathetic reader would care because it removes the usual trade-off between handling long text and staying accurate when the vision encoder cannot resolve tiny characters.

Core claim

Post-training a base VLM equips it with tools to first view compressed image renders of text and then selectively restore only the task-relevant regions to their original resolution. On this basis the model reaches accuracy comparable to the full-text upper bound at 4.3 times effective compression and outperforms text-compression, visual-compression, and retrieval baselines up to 10.1 times compression across seven QA benchmarks. The same approach extends to multimodal document and code tasks, with larger gains appearing at higher compression ratios. Analysis confirms that training makes the method robust to rendering choices and that the model shifts reliance toward the expanded content rather than direct visual reading as compression grows.

What carries the argument

Learned tools that scan compressed rendered images and selectively expand only the relevant regions to full resolution.
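
To make the machinery concrete, the sketch below shows one way the scan-then-expand loop could look. Everything in it is an assumption for illustration: the tool interface, per-page expansion granularity, stopping rule, and call budget are stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of LensVLM-style inference, not the paper's implementation.
# The model sees every page compressed, then may request a few pages back at
# full resolution (the "learned tools") before committing to an answer.
from typing import Callable, List, Optional, Tuple

def answer_with_selective_expansion(
    question: str,
    compressed_pages: List[bytes],                  # low-res renders of the full text
    model_step: Callable[[List[object]], Tuple[Optional[int], str]],
    expand: Callable[[int], bytes],                 # assumed tool: page i at full resolution
    max_tool_calls: int = 4,                        # assumed budget, not from the paper
) -> str:
    context: List[object] = [question, *compressed_pages]
    for _ in range(max_tool_calls):
        page_to_expand, text = model_step(context)  # model emits a tool call or an answer
        if page_to_expand is None:                  # no tool call: text is the final answer
            return text
        context.append(expand(page_to_expand))      # tool response joins the context
    return model_step(context)[1]                   # budget exhausted: force an answer
```

The claimed payoff is that post-training teaches the model to locate answer-bearing pages from the coarse view alone, so the expanded tool responses, not the blurry render, end up carrying the answer.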

If this is right

  • As the compression ratio rises, the model depends more on the selectively expanded content than on direct reading of the low-resolution image.
  • Gains over retrieval and uniform-compression baselines increase with higher compression levels.
  • The same selective-expansion training works for native documents and code where layout or visual structure supplies task-relevant cues.
  • Text expansion is more effective than high-resolution image expansion when the input is rendered text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This mechanism could let models process much longer documents without rendering every page at full resolution from the start.
  • The same scan-then-expand pattern might transfer to other compressed modalities such as audio waveforms or video frames.
  • Testing whether the tools remain effective when the underlying vision encoder or base language model changes would show how general the approach is.

Load-bearing premise

The trained model can reliably locate and expand the precise regions that contain the information needed for the current task without missing critical details or introducing new errors.

What would settle it

A test case in which a question's answer lies in a small text region that the model does not expand, causing the output to be incorrect while the full uncompressed text would have produced the right answer.
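
A hedged sketch of how such a counterexample could be flagged automatically. Substring matching stands in for a proper answer-equivalence judge, and the three model outputs are assumed to come from the same evaluation harness; none of these names come from the paper.

```python
def is_selective_expansion_failure(
    gold_answer: str,
    expanded_texts: list[str],   # text of the regions the model chose to expand
    compressed_answer: str,      # model output under selective expansion
    fulltext_answer: str,        # model output given the full uncompressed text
) -> bool:
    """True when the model skipped the answer-bearing region, answered wrongly,
    and the full-text upper bound would have answered correctly."""
    def contains(haystack: str) -> bool:
        return gold_answer.lower() in haystack.lower()  # crude proxy for an LLM judge
    missed_region = not any(contains(t) for t in expanded_texts)
    return missed_region and not contains(compressed_answer) and contains(fulltext_answer)
```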

Original abstract

Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder's effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LensVLM, an inference framework and post-training recipe for VLMs that processes compressed rendered text images by scanning them and selectively expanding only task-relevant regions to full resolution using learned tools. Building on Qwen3.5-9B-Base, it claims to match full-text accuracy at 4.3x effective compression while outperforming retrieval-based, text-compression, and visual-compression baselines up to 10.1x compression across seven text QA benchmarks. The approach generalizes to multimodal document and code understanding tasks, with accuracy gains over baselines increasing at higher compression levels. Analysis indicates that training confers robustness to rendering choices and that models increasingly rely on expansions rather than direct visual reading as compression rises, with practical guidance favoring text expansion for rendered text and high-resolution image expansion for layout-rich documents.

Significance. If the central claims hold, LensVLM offers a practical mechanism for high-ratio visual compression of text without sacrificing accuracy, supported by multi-benchmark empirical results and an analysis of tool reliance. The post-training recipe and tool-choice guidance are concrete contributions that could inform efficient VLM deployment for long-context tasks. The work is empirically grounded rather than axiomatic, with no free parameters or circular derivations noted.

major comments (2)
  1. [Analysis] Analysis section (referenced in abstract as validating increasing reliance on expansions): the claim that the model reliably identifies and expands all task-critical regions without omissions at high compression lacks supporting quantitative evidence such as recall metrics for critical patches, ablation studies on tool-selection errors, or OOD failure-case breakdowns. This is load-bearing for the headline result of matching full-text accuracy at 4.3x and outperforming baselines at 10.1x, as even occasional skips of answer-bearing content would drop performance below the upper bound.
  2. [Experiments] Experimental evaluation (abstract and results sections): while concrete numbers are reported, the manuscript provides insufficient detail on baseline implementations, statistical significance testing, run-to-run variance, or error analysis to allow verification that the data supports the generalization and compression claims. This weakens assessment of whether the learned tools avoid new failure modes.
minor comments (2)
  1. [Introduction] Clarify the precise definition and calculation of 'effective compression' (mentioned as 4.3x and 10.1x) early in the paper, including how it accounts for both visual tokens and any expanded content.
  2. [Abstract] The abstract states generalization to multimodal document and code tasks but does not specify the exact benchmarks or compression levels used; add a dedicated table or subsection for these results.
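
On the first minor point: the abstract never defines 'effective compression.' One plausible reading, offered purely as an assumption, divides the original token count by everything the model actually consumes, namely the compressed visual tokens plus whatever the expansion tools put back into the context.

```python
def effective_compression_ratio(text_tokens: int, visual_tokens: int,
                                expanded_tokens: int) -> float:
    """Hypothetical definition; the paper's own formula may differ."""
    return text_tokens / (visual_tokens + expanded_tokens)

# Illustrative numbers only, not the paper's: a 32K-token document rendered
# into 4K visual tokens, with ~3.4K tokens restored by expansion, gives ~4.3x.
print(effective_compression_ratio(32_000, 4_000, 3_400))  # ≈ 4.32
```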

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. We address the major concerns point-by-point below and will revise the manuscript to incorporate additional quantitative evidence and experimental details.

Point-by-point responses
  1. Referee: [Analysis] Analysis section (referenced in abstract as validating increasing reliance on expansions): the claim that the model reliably identifies and expands all task-critical regions without omissions at high compression lacks supporting quantitative evidence such as recall metrics for critical patches, ablation studies on tool-selection errors, or OOD failure-case breakdowns. This is load-bearing for the headline result of matching full-text accuracy at 4.3x and outperforming baselines at 10.1x, as even occasional skips of answer-bearing content would drop performance below the upper bound.

    Authors: We appreciate this observation. Our analysis shows increasing reliance on expansions with higher compression via tool-usage statistics, but we agree that direct evidence of complete coverage of task-critical regions is needed to fully support the headline claims. In the revision we will add recall metrics for critical patches (via keyword/semantic matching against ground-truth answer spans), ablations on tool-selection errors and their accuracy impact, and a breakdown of failure cases including OOD examples drawn from the benchmarks. These will quantify reliability and address potential omissions. revision: yes

  2. Referee: [Experiments] Experimental evaluation (abstract and results sections): while concrete numbers are reported, the manuscript provides insufficient detail on baseline implementations, statistical significance testing, run-to-run variance, or error analysis to allow verification that the data supports the generalization and compression claims. This weakens assessment of whether the learned tools avoid new failure modes.

    Authors: We agree that greater experimental transparency is required. The revised manuscript will include: full implementation details and hyperparameters for all baselines (retrieval, text-compression, and visual-compression); results averaged over multiple random seeds with standard deviations; statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against baselines; and a categorized error analysis of failure modes for LensVLM versus baselines. This will allow verification of the claims and confirm that selective expansion does not introduce new failure modes. revision: yes
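
Two of the promised additions are simple enough to sketch. First, the critical-patch recall from response 1: the share of examples where every gold answer span surfaces in at least one expanded region. Keyword matching stands in for the semantic matcher the authors mention, and the data layout is assumed.

```python
def critical_patch_recall(examples: list[tuple[list[str], list[str]]]) -> float:
    """examples: (gold answer spans, texts of the regions the model expanded)."""
    hits = 0
    for gold_spans, expanded_texts in examples:
        pool = " ".join(t.lower() for t in expanded_texts)
        if all(span.lower() in pool for span in gold_spans):
            hits += 1
    return hits / len(examples)
```

Second, the paired significance test from response 2, sketched with SciPy's Wilcoxon signed-rank test over matched per-example scores; the choice of test and threshold follow the rebuttal's suggestion rather than any reported result.

```python
from scipy.stats import wilcoxon

def compare_to_baseline(lensvlm_scores: list[float],
                        baseline_scores: list[float],
                        alpha: float = 0.05) -> dict:
    """Both lists must score the same examples in the same order."""
    statistic, p_value = wilcoxon(lensvlm_scores, baseline_scores)
    return {"statistic": statistic, "p_value": p_value, "significant": p_value < alpha}
```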

Circularity Check

0 steps flagged

Empirical framework with no derivational circularity

Full rationale

The paper describes an inference framework and post-training recipe for selective context expansion in compressed visual text representations, building on an existing base VLM (Qwen3.5-9B-Base). All claims rest on benchmark accuracy comparisons and observational analysis of tool reliance, with no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations. The central results are empirical performance numbers at varying compression levels; nothing reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of learned selective expansion tools and post-training robustness, which are introduced by the paper but not detailed in the provided abstract; no explicit free parameters or standard axioms are stated.

invented entities (1)
  • learned tools for selective expansion · no independent evidence
    purpose: to decide which compressed image regions to expand to full resolution during inference
    Introduced as part of the LensVLM framework; no independent evidence or falsifiable prediction outside the training process is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5574 in / 1348 out tokens · 52093 ms · 2026-05-11T01:04:52.649669+00:00 · methodology

