pith. machine review for the scientific record. sign in

arxiv: 2601.14750 · v3 · submitted 2026-01-21 · 💻 cs.CL · cs.CV

Recognition: unknown

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Authors on Pith no claims yet
classification 💻 cs.CL cs.CV
keywords reasoninglatenttextualvisionchainchain-of-thoughtexplicitimages
0
0 comments X
read the original abstract

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  2. Visual Text Compression as Measure Transport

    cs.CV 2026-05 unverdicted novelty 7.0

    Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...

  3. Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter r...

  4. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  5. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...