{"paper":{"title":"Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual benchmarks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Alex Su, Fangzhen Lin, Haozhe Wang, Weiming Ren, Wenhu Chen","submitted_at":"2025-05-21T19:35:08Z","abstract_excerpt":"Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidences, there"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The curiosity-driven reward scheme will successfully balance exploration of pixel-space operations with textual reasoning without the model reverting to familiar text-only strategies or exploiting the reward in unintended ways.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Pixel Reasoner equips VLMs with pixel-space operations and uses curiosity-driven RL to improve visual reasoning, achieving top open-source results on V*, TallyQA-Complex, and InfographicsVQA.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"55c13e3003722b7b4c8725ea84119f64340acd1970b77bc19818f4151d3c9118"},"source":{"id":"2505.15966","kind":"arxiv","version":3},"verdict":{"id":"6923ceef-ec5c-43fc-9b39-43d1d12980bb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T02:17:25.171924Z","strongest_claim":"Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date.","one_line_summary":"Pixel Reasoner equips VLMs with pixel-space operations and uses curiosity-driven RL to improve visual reasoning, achieving top open-source results on V*, TallyQA-Complex, and InfographicsVQA.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The curiosity-driven reward scheme will successfully balance exploration of pixel-space operations with textual reasoning without the model reverting to familiar text-only strategies or exploiting the reward in unintended ways.","pith_extraction_headline":"Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual benchmarks."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":3,"snapshot_sha256":"a8250e4eaf1438dda974834727597102c2376cf88439c959f873d92310fb7c01"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}