Thyme: Think Beyond Images
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 00:28 UTC · model grok-4.3
The pith
Thyme lets multimodal models autonomously generate and run code to manipulate images and perform calculations during reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Thyme introduces a paradigm in which multimodal models autonomously generate and execute diverse image-processing and computational operations via executable code. This is achieved through a two-stage training strategy: supervised fine-tuning on 500,000 samples to teach code generation, followed by reinforcement learning with GRPO-ATS to refine autonomous decisions about when and how to apply operations, yielding performance gains on high-resolution perception and complex reasoning tasks.
What carries the argument
Autonomous executable code generation for on-the-fly image manipulations such as cropping and rotation, plus mathematical computations, guided by GRPO-ATS in the reinforcement learning phase.
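The mechanism above can be pictured as a small sandboxed executor for model-emitted snippets. The sketch below is an illustrative reconstruction, not the paper's implementation: the `run_model_code` helper, its result schema, and the example snippets are all hypothetical, and a real system would add process isolation, timeouts, and restricted builtins before running untrusted code.

```python
import contextlib
import io


def run_model_code(code: str, variables: dict) -> dict:
    """Execute a model-generated snippet over provided variables.

    The snippet is expected to bind its answer to `result`. This is a
    minimal sketch; production systems must sandbox execution properly.
    """
    namespace = dict(variables)
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)  # untrusted in practice: isolate this
        return {"ok": True, "result": namespace.get("result"),
                "stdout": stdout.getvalue()}
    except Exception as exc:
        return {"ok": False, "error": repr(exc)}


# A computation snippet a model might emit mid-reasoning.
out = run_model_code("result = sum(i * i for i in range(1, 5))", {})
# out["result"] == 30

# An image-style snippet, with the image stood in by a nested list
# (a real system would expose e.g. a PIL image and crop coordinates).
grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
out2 = run_model_code("result = [row[1:3] for row in image[0:2]]",
                      {"image": grid})
# out2["result"] == [[2, 3], [6, 7]]
```

Failed executions surface as `ok: False` rather than exceptions, which lets the reasoning loop observe the error and retry, matching the autonomy claim that the model decides when and how to invoke operations.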
If this is right
- Consistent performance improvements across nearly 20 benchmarks.
- Enhanced capabilities in high-resolution perception tasks.
- Improved handling of complex reasoning problems.
- Richer set of image manipulations without fixed tool limitations.
- Maintained autonomy in decision-making for when and how to apply operations.
Where Pith is reading between the lines
- The code-generation approach might extend to video or audio by allowing models to write processing scripts for those modalities.
- Similar autonomous code use could reduce reliance on hand-crafted tool interfaces in other AI reasoning systems.
- Trained models might invent novel image transformations that were not present in the original training examples.
- Scaling the reinforcement learning phase with more diverse high-resolution data could produce even more adaptive decision patterns.
Load-bearing premise
The reinforcement learning phase will teach the model to make reliable decisions about when and how to use code without causing execution errors or overfitting to the collected data.
What would settle it
Frequent code-execution errors, or an absence of gains on high-resolution perception and complex reasoning benchmarks after the full two-stage training, would show that the approach does not deliver the claimed benefits.
Original abstract
Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Thyme, a paradigm enabling MLLMs to autonomously generate and execute code for diverse image manipulations (cropping, rotation, contrast enhancement) and mathematical computations. It employs a two-stage pipeline: SFT on a 500K-sample dataset to teach code generation, followed by RL with the proposed GRPO-ATS algorithm (Group Relative Policy Optimization with Adaptive Temperature Sampling) on manually collected high-resolution QA pairs to refine autonomous decision-making. The central claim is that this yields significant and consistent performance gains across nearly 20 benchmarks, especially in high-resolution perception and complex reasoning tasks.
Significance. If the empirical results hold under scrutiny, the work would be significant for advancing open-source MLLMs toward richer 'thinking with images' capabilities via executable code, narrowing the gap with proprietary systems like o3. The GRPO-ATS mechanism for applying distinct temperatures to text versus code generation offers a concrete algorithmic contribution to balancing exploration and execution precision in RL for tool use. The emphasis on high-resolution QA pairs to increase training difficulty provides a practical template for scaling autonomous visual reasoning.
major comments (3)
- [Abstract] The claim of 'significant and consistent performance gains' across nearly 20 benchmarks is presented without reference to specific baselines, statistical tests, run-to-run variance, or data-exclusion criteria, which directly undermines verification of the central empirical result.
- [RL phase] No quantitative evidence is supplied on post-RL code execution success rates, error-recovery frequency, or performance on held-out distributions, leaving the autonomy claim vulnerable to the possibility that gains arise from overfitting to the manually curated high-resolution QA pairs rather than from robust GRPO-ATS-driven decisions.
- [Ablation studies] The reported ablations do not isolate the incremental contribution of the GRPO-ATS RL stage (versus SFT alone) or test sensitivity to the temperature split, making it impossible to confirm that the proposed algorithm is load-bearing for the observed improvements.
minor comments (1)
- [Method] The description of GRPO-ATS would benefit from an explicit equation or pseudocode block defining the adaptive temperature sampling rule for text versus code tokens.
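In the spirit of the comment above, a minimal sketch of what such an adaptive temperature rule might look like is given below. This is a hypothetical reconstruction, not the paper's algorithm: the function names, the fence-based code detection, and the specific temperature values (1.0 for text, 0.2 for code) are all illustrative assumptions.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Softmax-sample one token index from raw logits at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r, acc = rng.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1


def pick_temperature(in_code_block: bool,
                     text_temp: float = 1.0,
                     code_temp: float = 0.2) -> float:
    # Hypothetical GRPO-ATS-style rule: higher temperature for free-form
    # reasoning text to encourage exploration, near-greedy sampling inside
    # code spans so generated programs stay executable. Values illustrative.
    return code_temp if in_code_block else text_temp
```

A decoding loop using this rule would toggle `in_code_block` whenever a code-fence delimiter token is emitted, so each token is sampled at the temperature appropriate to its span.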
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and indicate the corresponding revisions.
Point-by-point responses
-
Referee: [Abstract] The claim of 'significant and consistent performance gains' across nearly 20 benchmarks is presented without reference to specific baselines, statistical tests, run-to-run variance, or data-exclusion criteria, which directly undermines verification of the central empirical result.
Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately assess the claims. In the revised manuscript, we will update the abstract to explicitly name key baselines (the base MLLM and SFT-only model), report the magnitude of gains on representative high-resolution and reasoning benchmarks, and reference the evaluation protocol detailed in the experiments section. We will also clarify that results reflect standard single-run reporting unless otherwise noted and specify any data filtering criteria applied during evaluation. revision: yes
-
Referee: [RL phase] No quantitative evidence is supplied on post-RL code execution success rates, error-recovery frequency, or performance on held-out distributions, leaving the autonomy claim vulnerable to the possibility that gains arise from overfitting to the manually curated high-resolution QA pairs rather than from robust GRPO-ATS-driven decisions.
Authors: This observation is correct and points to a genuine gap in the current presentation. To better substantiate the autonomy claim, we will add a new subsection (or expanded table) reporting post-RL code execution success rates, frequency of successful error recovery during inference, and performance on a held-out subset of the high-resolution QA pairs. These metrics will help distinguish the contribution of GRPO-ATS from potential overfitting. revision: yes
-
Referee: [Ablation studies] The reported ablations do not isolate the incremental contribution of the GRPO-ATS RL stage (versus SFT alone) or test sensitivity to the temperature split, making it impossible to confirm that the proposed algorithm is load-bearing for the observed improvements.
Authors: We acknowledge that the existing ablations could more directly isolate the RL stage and the temperature-splitting component. We will expand the ablation section to include (1) a head-to-head comparison of the SFT-only checkpoint versus the full SFT+GRPO-ATS model on the full benchmark suite and (2) an explicit sensitivity study varying the text/code temperature split while keeping other factors fixed. These additions will clarify the load-bearing role of the proposed algorithm. revision: yes
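The diagnostics proposed in these responses could be aggregated with a small helper like the one below. This is a sketch under assumptions: the `CodeStep` schema, its field names, and the recovery-rate definition (successful retries divided by failures) are hypothetical choices, not metrics defined by the paper.

```python
from dataclasses import dataclass


@dataclass
class CodeStep:
    """One sandboxed code execution within a reasoning trajectory
    (hypothetical schema)."""
    ok: bool                           # did execution succeed?
    retried_after_error: bool = False  # was this a retry after a failed step?


def execution_metrics(trajectories):
    """Post-RL code-execution success rate and error-recovery frequency."""
    steps = [step for traj in trajectories for step in traj]
    if not steps:
        return {"success_rate": 0.0, "recovery_rate": 0.0}
    failures = sum(1 for s in steps if not s.ok)
    recoveries = sum(1 for s in steps if s.retried_after_error and s.ok)
    return {
        "success_rate": sum(s.ok for s in steps) / len(steps),
        # fraction of failures the model recovered from with a later success
        "recovery_rate": recoveries / failures if failures else 1.0,
    }


# Two trajectories: one clean run, one failure followed by a successful retry.
metrics = execution_metrics([
    [CodeStep(ok=True)],
    [CodeStep(ok=False), CodeStep(ok=True, retried_after_error=True)],
])
```

Reporting these two numbers on a held-out split, alongside benchmark accuracy, would directly address the overfitting concern raised in the second major comment.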
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes an empirical two-stage training pipeline (SFT on 500K samples followed by RL with GRPO-ATS) and reports benchmark gains without any mathematical derivations, first-principles predictions, or equations that reduce to fitted inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; claims rest on external benchmark results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement learning with adaptive temperature sampling improves decision-making for code generation in multimodal models.
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by a RL phase to refine decision-making... GRPO-ATS... applies distinct temperatures to text and code generation
-
IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
Hybrid Latent Reasoning with Decoupled Policy Optimization
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
-
E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes
E3VS-Bench supplies 99 3D Gaussian Splatting scenes and 2,014 episodes to test whether embodied agents can use unrestricted 5-DoF viewpoint control to answer questions that depend on fine-grained visual details visibl...
-
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
-
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.
-
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietar...
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
-
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolutio...
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.