{"total":13,"items":[{"citing_arxiv_id":"2606.29503","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Verbose Context Problem in Medical Records","primary_cat":"cs.CL","submitted_at":"2026-06-28T17:03:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents PopMedQA benchmark and shows domain-independent LLM methods fail on token-inefficient longitudinal medical records, leaving room for domain-specific approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28344","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-06-01T23:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28338","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Memory Shot for Long-Term Dialogue","primary_cat":"cs.IR","submitted_at":"2026-05-30T01:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemShot renders local dialogue spans as structured visual memory units to improve long-term dialogue modeling in LLMs, achieving competitive benchmark performance with 70x faster memory 
construction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30265","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LoMo: Local Modality Substitution for Deeper Vision-Language Fusion","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:27:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07250","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment","primary_cat":"cs.CV","submitted_at":"2026-05-08T05:19:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07019","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LensVLM: Selective Context Expansion for Compressed Visual Representation of Text","primary_cat":"cs.CV","submitted_at":"2026-05-07T23:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on 
text QA benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[Tool Response] Text content of Image 22: [Langdon House] The Langdon House is a historic house on the eastern side of Cincinnati, Ohio, United States. [...] [List of Ohio area codes] Code Created Region 216 1947 Cleveland (October 1947) 234 2000 Akron, Canton, Youngstown, and Warren, overlay with 330 [...] 419 1947 Northwest and north central Ohio including... [Assistant -- Turn 3] <think>Image 22 contains the beginning of a list of Ohio area codes but is cut off before reaching Cincinnati. The list continues on Image 23, which should contain the Cincinnati entry. I will read Image 23 to find the specific code.</think> <tool_call>{\"name\": \"read_text\", \"arguments\": {\"image\": 23}}</tool_call> [Tool Response] Text content of Image 23: Toledo, Sandusky, and Ashland, overlay with 567 [."},{"citing_arxiv_id":"2605.06708","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Text Compression as Measure Transport","primary_cat":"cs.CV","submitted_at":"2026-05-06T19:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using 10% fewer tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3% average tokens relative to a pure-LLM. 1 Introduction Visual text compression (VTC) has recently emerged as a practical route to long-context processing. By rendering text into images and re-encoding it through a vision-language model (VLM), VTC replaces long subword-token sequences with substantially fewer visual tokens, often at compression ratios of 3-20× [9, 18, 23, 46-48]. 
DeepSeek-OCR [9] and Glyph [46] show that VLMs can preserve strong OCR fidelity and competitive long-context modeling even under heavy compression. These results establish the efficiency promise of VTC. They do not resolve the more fundamental question that determines whether VTC is actually usable: once text is rendered and re-encoded, what"},{"citing_arxiv_id":"2604.14029","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch","primary_cat":"cs.CV","submitted_at":"2026-04-15T16:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"example, MemAgent [58] proposes maintaining a fixed-length memory that is proactively and selectively updated by the model. AgentFold [57] introduces a folding mechanism to intelligently fold segments of context during task execution. More recently, cross-modal compression [10,35] has emerged as a potent alternative. DeepSeek-OCR [47] proposes that images can serve as an effective medium for compression. Similarly, Glyph [4] advances a related idea by transforming the long-text modeling problem into a multiple image-text modeling paradigm. Furthermore, AgentOCR [10] proposes a self-compression mechanism, whereby the model autonomously determines the compression ratio. 
However, in the domain of multimodal agentic search, limited attention has been devoted to memory management."},{"citing_arxiv_id":"2604.03660","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables","primary_cat":"cs.AI","submitted_at":"2026-04-04T09:26:09+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"1 Experimental Setup Baselines for Benchmarking. To assess current capabilities in hierarchical table reasoning, we evaluate 10 representative MLLMs with active parameter scales ranging from 4B to 9B. The selection includes general-purpose models (e.g., MiniCPM-V-2.6 [53], InternVL3-8B [63], Qwen3-VL-8B-Instruct [2]) and table-specialized models (e.g., Glyph [12]). All models are evaluated in a zero-shot setting to measure their inherent structural perception. These standard VLMs perform inference by directly generating the final answer without extracting bounding boxes. Implementation Details for Our Framework. For our two-stage framework, we utilize Qwen3-VL-8B-Instruct as the base model. 
The Supervised"},{"citing_arxiv_id":"2602.18600","ref_index":14,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?","primary_cat":"cs.LG","submitted_at":"2026-02-20T20:22:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"ence on computer vision and pattern recognition, pages 21819-21830, 2024. [12] Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning. arXiv preprint arXiv:2509.06948, 2025. [13] Yan Chen. Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion. PLoS One, 20(5):e0321257, 2025. [14] Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800, 2025. 
[15] Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al."},{"citing_arxiv_id":"2602.04802","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?","primary_cat":"cs.CV","submitted_at":"2026-02-04T17:48:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21468","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning","primary_cat":"cs.AI","submitted_at":"2026-01-29T09:47:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03643","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Optical Context Compression Is Just (Bad) Autoencoding","primary_cat":"cs.CV","submitted_at":"2025-12-03T10:27:27+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vision-based optical context compression performs no better than direct autoencoding baselines like mean pooling or hierarchical encoders across compression 
ratios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}