{"total":16,"items":[{"citing_arxiv_id":"2607.00115","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking","primary_cat":"cs.CV","submitted_at":"2026-06-30T19:51:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PixelEyes decouples reasoning and perception via mask-guided search and semantic BFS, introduces PixelEyes-6K dataset and Pinpoint-Bench benchmark, and open-sources code and models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00562","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepLatent: Think with Images via Parallel Latent Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-30T06:33:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28741","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Prophetic Decoding to Unlock Visual Search in LVLMs","primary_cat":"cs.CV","submitted_at":"2026-05-27T17:01:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step 
reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26520","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward","primary_cat":"cs.CV","submitted_at":"2026-05-26T04:07:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InterSketch improves long-horizon visual-textual chain-of-thought in VLMs by dynamically generating and interleaving self-correcting visual sketches with text, using a synthesized dataset plus reflection in cold-start followed by stepwise-reward RL, and reports outperforming Gemini-3-Pro on benchmar","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00096","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents","primary_cat":"cs.CV","submitted_at":"2026-05-25T13:06:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Visual CoT agents exhibit tool-use collapse where tool usage declines but task accuracy rises, and adding entropy regularization for rollout diversity produces the strongest performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22642","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-21T15:47:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spreadsheet-RL 
applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22177","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles","primary_cat":"cs.LG","submitted_at":"2026-05-21T08:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18641","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Leveraging Latent Visual Reasoning in Silence","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:46:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15792","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal 
Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:48:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generation-to-Understanding synergy lets multimodal models create self-generated visual edits as intermediate steps, improving performance on twelve benchmarks while revealing limits in task-aligned self-reflection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11856","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-12T09:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? In ICLR. OpenReview.net, 2025. [6] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In NeurIPS, 2024. [7] Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use. CoRR, abs/2505.19255, 2025. 
[8] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu."},{"citing_arxiv_id":"2605.07019","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LensVLM: Selective Context Expansion for Compressed Visual Representation of Text","primary_cat":"cs.CV","submitted_at":"2026-05-07T23:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19945","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Reasoning through Tool-supervised Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-04-21T19:48:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02794","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CharTool: Tool-Integrated Visual Reasoning for Chart Understanding","primary_cat":"cs.AI","submitted_at":"2026-04-03T07:02:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on 
benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"distance formula: $$ y = \\sqrt{x^2 + y^2 + z^2} $$ Where: - $x = 11$ cm (one side of the square base) - $y = 11$ cm (the other side of the square base) - $z = 15$ cm (the height/length of the prism) I will calculate this using `python_interpreter`. </think> <code> import numpy as np # Define the dimensions dimensions = [11, 11, 15] # Calculate the space diagonal space_diagonal = np.sqrt(np.sum(np.array(dimensions)**2)) print(f\"Length of diagonal DF (to two decimal places): {space_diagonal:.2f}\") </code> <sandbox_output> Length of diagonal DF (to two decimal places): 21.61 </sandbox_output> <answer> \\boxed{21.61} </answer> Figure 11. A CHARTOOL example using Code Computation Tool on MathVerse (Zhang et al., 2024c) benchmark. 24 Submission and Formatting Instructions for ICML 2026 Question Which sampling method in the KITTIFog subplot achieves the highest mAP at 2600 labeled instances? * Your final answer must be grounded to some text that is explicitly written and relevant to the question in the chart. * If you need to answer multiple terms, separate them with commas. * Unless specified in the question (such as answering with a letter), you are required to answer the full names of subplots and/or labels by default. CoT <think> The question asks which sampling method in the \"KITTIFog\" subplot achieves the highest mAP at 2600 labeled instances. I need to focus on the top-right subplot labeled \"KITTIFog\". I will zoom in on this specific subplot to clearly read the data points and identify the highest mAP value at 2600 labeled instances. 
</think> <code> from PIL import Image img = Image.open(\"chart.png\") cropped = img.crop((738, 45, 1024, 390)) cropped.save(\"output.png\") </code> <sandbox_output> <image> </sandbox_output> <think> After zooming in on the \"KITTIFog\" subplot, I can see the mAP values for different sampling methods at 2600 labeled instances. I will use the `python_interpreter` tool to estimate the mAP value for each method at 2600 labeled instances ba"},{"citing_arxiv_id":"2604.03307","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators","primary_cat":"cs.CV","submitted_at":"2026-03-31T03:57:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025. 8 [28] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824-24837, 2022. 3 [29] Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. Vtool-r1: Vlms learn to think with images via reinforcement learning on multimodal tool use. arXiv preprint arXiv:2505.19255, 2025. 2, 3 [30] Penghao Wu and Saining Xie. 
V*: Guided visual search as a core mechanism in multimodal llms."},{"citing_arxiv_id":"2512.08980","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Multi-Image Vision Agents via End2End Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-12-05T10:02:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing tool-use effects on attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":253,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the (I) integration of process-based supervision with final outcome rewards. Rather than relying on a single reward at a trajectory's conclusion, this paradigm uses auxiliary models or programmatic rules to evaluate the quality of intermediate steps, providing a denser and more immediate learning signal that guides the agent's multi-turn strategy. 
For example, EPO [253], ThinkRM [254], SPO [255], and AgentPRM [256] introduce external reward models to provide step-wise signals for agents; in contrast, RLVMR [257] designs manually defined, programmatic rules to guide the intermediate supervision. A second, complementary strategy is to (II) extend preference optimization from single turns to multi-step segments. Techniques"}],"limit":50,"offset":0}