{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:V6TLOTG4PKFMUULQ7FMC7RG2MH","short_pith_number":"pith:V6TLOTG4","schema_version":"1.0","canonical_sha256":"afa6b74cdc7a8aca5170f9582fc4da61e4444390fc2620520edd8bf07af4a5fb","source":{"kind":"arxiv","id":"2505.15966","version":3},"attestation_state":"computed","paper":{"title":"Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual benchmarks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Alex Su, Fangzhen Lin, Haozhe Wang, Weiming Ren, Wenhu Chen","submitted_at":"2025-05-21T19:35:08Z","abstract_excerpt":"Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. 
These operations enable VLMs to directly inspect, interrogate, and infer from visual evidences, there"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2505.15966","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-05-21T19:35:08Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"8a94c8ff782df7bc890b2729354604b210b5e0714caef36845c6c12ce7030c23","abstract_canon_sha256":"13fa2021d633cbcd726c6218ae04a8537475157c5fc39ae87a2616339b10f070"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T02:39:16.984759Z","signature_b64":"r7MIEDuogbPObKmblcg+meKf49+KuIea95hqCuaFyqgIITJsK/C+bgKje85nkC1Q6WOA5BtaWhg5s+CicBPnAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"afa6b74cdc7a8aca5170f9582fc4da61e4444390fc2620520edd8bf07af4a5fb","last_reissued_at":"2026-05-18T02:39:16.984324Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T02:39:16.984324Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual benchmarks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Alex Su, Fangzhen Lin, Haozhe Wang, Weiming Ren, Wenhu 
Chen","submitted_at":"2025-05-21T19:35:08Z","abstract_excerpt":"Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidences, there"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The curiosity-driven reward scheme will successfully balance exploration of pixel-space operations with textual reasoning without the model reverting to familiar text-only strategies or exploiting the reward in unintended ways.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Pixel Reasoner equips VLMs with pixel-space operations and uses curiosity-driven RL to improve visual reasoning, achieving top open-source results on V*, TallyQA-Complex, and InfographicsVQA.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual 
benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"55c13e3003722b7b4c8725ea84119f64340acd1970b77bc19818f4151d3c9118"},"source":{"id":"2505.15966","kind":"arxiv","version":3},"verdict":{"id":"6923ceef-ec5c-43fc-9b39-43d1d12980bb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T02:17:25.171924Z","strongest_claim":"Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date.","one_line_summary":"Pixel Reasoner equips VLMs with pixel-space operations and uses curiosity-driven RL to improve visual reasoning, achieving top open-source results on V*, TallyQA-Complex, and InfographicsVQA.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The curiosity-driven reward scheme will successfully balance exploration of pixel-space operations with textual reasoning without the model reverting to familiar text-only strategies or exploiting the reward in unintended ways.","pith_extraction_headline":"Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual 
benchmarks."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":3,"snapshot_sha256":"a8250e4eaf1438dda974834727597102c2376cf88439c959f873d92310fb7c01"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2505.15966","created_at":"2026-05-18T02:39:16.984383+00:00"},{"alias_kind":"arxiv_version","alias_value":"2505.15966v3","created_at":"2026-05-18T02:39:16.984383+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2505.15966","created_at":"2026-05-18T02:39:16.984383+00:00"},{"alias_kind":"pith_short_12","alias_value":"V6TLOTG4PKFM","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"V6TLOTG4PKFMUULQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"V6TLOTG4","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":49,"internal_anchor_count":49,"sample":[{"citing_arxiv_id":"2605.23216","citing_title":"CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2511.20785","citing_title":"LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21652","citing_title":"Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2602.23622","citing_title":"DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing 
Model","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09860","citing_title":"When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16079","citing_title":"VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15792","citing_title":"Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18740","citing_title":"Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18603","citing_title":"Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19852","citing_title":"Are Tools Always Beneficial? 
Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20165","citing_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13169","citing_title":"PanoWorld: Towards Spatial Supersensing in 360$^\\circ$ Panorama World","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2507.06448","citing_title":"Perception-Aware Policy Optimization for Multimodal Reasoning","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02547","citing_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2509.07969","citing_title":"Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2505.15436","citing_title":"Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2512.08980","citing_title":"Training Multi-Image Vision Agents via End2End Reinforcement Learning","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2512.12623","citing_title":"Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13671","citing_title":"AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2512.16918","citing_title":"AdaTooler-V: Adaptive Tool-Use for Images and Videos","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15356","citing_title":"Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic 
Probing","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2511.05271","citing_title":"DeepEyesV2: Toward Agentic Multimodal Model","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14054","citing_title":"Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15198","citing_title":"ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both","ref_index":16,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH","json":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH.json","graph_json":"https://pith.science/api/pith-number/V6TLOTG4PKFMUULQ7FMC7RG2MH/graph.json","events_json":"https://pith.science/api/pith-number/V6TLOTG4PKFMUULQ7FMC7RG2MH/events.json","paper":"https://pith.science/paper/V6TLOTG4"},"agent_actions":{"view_html":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH","download_json":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH.json","view_paper":"https://pith.science/paper/V6TLOTG4","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2505.15966&json=true","fetch_graph":"https://pith.science/api/pith-number/V6TLOTG4PKFMUULQ7FMC7RG2MH/graph.json","fetch_events":"https://pith.science/api/pith-number/V6TLOTG4PKFMUULQ7FMC7RG2MH/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH/action/timestamp_anchor","attest_storage":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH/action/storage_attestation","attest_author":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH/action/author_attestation","sign_citation":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH/action/citation_signature","submit_replication":"https://pith.science/pith/V6TLOTG4PKFMUULQ7FMC7RG2MH/action/replication_record"}},"created_at":"2026-05-18T02:39:16.984383+00:00","updated_at":"2026-
05-18T02:39:16.984383+00:00"}