{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:WERPE3RT226B7XDJYI4L7CGJGT","short_pith_number":"pith:WERPE3RT","schema_version":"1.0","canonical_sha256":"b122f26e33d6bc1fdc69c238bf88c934c50526300d8cfec83d124e95ce85034a","source":{"kind":"arxiv","id":"2508.11630","version":1},"attestation_state":"computed","paper":{"title":"Thyme: Think Beyond Images","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Thyme lets multimodal models autonomously generate and run code to manipulate images and perform calculations during reasoning.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Wen, Changyi Liu, Chaoyou Fu, Fan Yang, Guorui Zhou, Haojie Ding, Haonan Fan, Jiankang Chen, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Liang Wang, Shukang Yin, Tianke Zhang, Tingting Gao, Wei Chen, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Zhang Zhang","submitted_at":"2025-08-15T17:59:49Z","abstract_excerpt":"Following OpenAI's introduction of the ``thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2508.11630","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-08-15T17:59:49Z","cross_cats_sorted":[],"title_canon_sha256":"f2f6b8736d88a031f9b2a8d0a45dda295e0567afec694387533599d48bff4345","abstract_canon_sha256":"da3bb5704052413b3507b5af1b9bb79a11d60aed77148309406f0ae202179661"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:19.706362Z","signature_b64":"Ww4K6TVa4dAOjgM2J1YTyAjIxXWNhcWD+eAmUQcc7T4qu1GMofnHkNzLomcDWoGqxHaKsITG2QTj7qSIkgO5Dg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b122f26e33d6bc1fdc69c238bf88c934c50526300d8cfec83d124e95ce85034a","last_reissued_at":"2026-05-17T23:39:19.705585Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:19.705585Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Thyme: Think Beyond Images","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Thyme lets multimodal models autonomously generate and run code to manipulate images and perform calculations during reasoning.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Wen, Changyi Liu, Chaoyou Fu, Fan Yang, Guorui Zhou, Haojie Ding, Haonan Fan, Jiankang Chen, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Liang Wang, Shukang Yin, Tianke Zhang, Tingting Gao, Wei Chen, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Zhang Zhang","submitted_at":"2025-08-15T17:59:49Z","abstract_excerpt":"Following OpenAI's introduction of the ``thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the RL phase with GRPO-ATS will produce reliable autonomous decisions on when and how to apply code-based image manipulations without introducing execution errors or overfitting to the manually collected high-resolution QA pairs.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Thyme trains MLLMs to autonomously generate executable code for image processing and math computations, yielding gains on high-resolution perception and complex reasoning benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Thyme lets multimodal models autonomously generate and run code to manipulate images and perform calculations during reasoning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a86b02f87756171900cbad8265d5db6c92c91b9bd57a339f8a368a9c0312e418"},"source":{"id":"2508.11630","kind":"arxiv","version":1},"verdict":{"id":"dac0483f-13bc-4c54-b7fd-eb255261162c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T00:28:32.861606Z","strongest_claim":"Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.","one_line_summary":"Thyme trains MLLMs to autonomously generate executable code for image processing and math computations, yielding gains on high-resolution perception and complex reasoning benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the RL phase with GRPO-ATS will produce reliable autonomous decisions on when and how to apply code-based image manipulations without introducing execution errors or overfitting to the manually collected high-resolution QA pairs.","pith_extraction_headline":"Thyme lets multimodal models autonomously generate and run code to manipulate images and perform calculations during reasoning."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b6ce30ad5966026126835fb3a7680046ba90f4cc73d2c183ad1393529e2ba4d8"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2508.11630","created_at":"2026-05-17T23:39:19.705715+00:00"},{"alias_kind":"arxiv_version","alias_value":"2508.11630v1","created_at":"2026-05-17T23:39:19.705715+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2508.11630","created_at":"2026-05-17T23:39:19.705715+00:00"},{"alias_kind":"pith_short_12","alias_value":"WERPE3RT226B","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"WERPE3RT226B7XDJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"WERPE3RT","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2511.23230","citing_title":"Action-guided generation of 3D functionality segmentation data","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2512.16918","citing_title":"AdaTooler-V: Adaptive Tool-Use for Images and Videos","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2511.05271","citing_title":"DeepEyesV2: Toward Agentic Multimodal Model","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2602.18600","citing_title":"MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?","ref_index":102,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25855","citing_title":"SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12163","citing_title":"Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12882","citing_title":"CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03307","citing_title":"V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02794","citing_title":"CharTool: Tool-Integrated Visual Reasoning for Chart Understanding","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12163","citing_title":"Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08146","citing_title":"VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10172","citing_title":"V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25855","citing_title":"SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24583","citing_title":"Improving Vision-language Models with Perception-centric Process Reward Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20328","citing_title":"Hybrid Latent Reasoning with Decoupled Policy Optimization","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12896","citing_title":"Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11025","citing_title":"Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06777","citing_title":"Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08545","citing_title":"Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06912","citing_title":"Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09712","citing_title":"LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17969","citing_title":"E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18320","citing_title":"EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18292","citing_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","ref_index":132,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21409","citing_title":"S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images","ref_index":22,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT","json":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT.json","graph_json":"https://pith.science/api/pith-number/WERPE3RT226B7XDJYI4L7CGJGT/graph.json","events_json":"https://pith.science/api/pith-number/WERPE3RT226B7XDJYI4L7CGJGT/events.json","paper":"https://pith.science/paper/WERPE3RT"},"agent_actions":{"view_html":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT","download_json":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT.json","view_paper":"https://pith.science/paper/WERPE3RT","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2508.11630&json=true","fetch_graph":"https://pith.science/api/pith-number/WERPE3RT226B7XDJYI4L7CGJGT/graph.json","fetch_events":"https://pith.science/api/pith-number/WERPE3RT226B7XDJYI4L7CGJGT/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT/action/timestamp_anchor","attest_storage":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT/action/storage_attestation","attest_author":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT/action/author_attestation","sign_citation":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT/action/citation_signature","submit_replication":"https://pith.science/pith/WERPE3RT226B7XDJYI4L7CGJGT/action/replication_record"}},"created_at":"2026-05-17T23:39:19.705715+00:00","updated_at":"2026-05-17T23:39:19.705715+00:00"}