EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Beichen Zhang; Jiazi Bu; Kai Chen; Long Xing; Xilin Wei; Xuanlang Dai; Yuhang Zang; Yuhong Liu; Yujie Zhou

arxiv: 2603.12252 · v4 · pith:ZC2JPTMDnew · submitted 2026-03-12 · 💻 cs.CV · cs.CL

classification 💻 cs.CV cs.CL

keywords: guidance, reasoning, complex, mllms, chain-of-thought, endocot, process, tasks

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Even when the MLLM encoding is correct, invariant guidance prevents the Diffusion Transformer (DiT) from progressively decomposing complex instructions into actionable denoising steps. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework built on two components. First, an iterative thought guidance module activates the MLLM's reasoning potential by iteratively refining latent thought states, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module keeps the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. In extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/.
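
The two ideas in the abstract, guidance that changes across denoising steps and a loss that ties the final thought state to the answer, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the list-of-floats "latent" representation, the linear update rules, and the mean-squared-error loss are all assumptions chosen only to make the control flow concrete.

```python
# Toy illustration only. Every name and update rule below is hypothetical
# and is NOT taken from the EndoCoT code base.
from typing import Callable, List, Tuple

Vector = List[float]


def denoise_with_evolving_guidance(
    x: Vector,
    thought: Vector,
    refine: Callable[[Vector], Vector],
    num_steps: int,
    step_size: float = 0.5,
) -> Tuple[Vector, Vector, List[Vector]]:
    """Refine the thought state before every denoising step.

    With invariant guidance, `thought` would be computed once and reused.
    Here it is updated at each step, so the guidance evolves with the sample.
    """
    trajectory: List[Vector] = []
    for _ in range(num_steps):
        # Assumed stand-in for one latent reasoning step of the MLLM.
        thought = refine(thought)
        # Assumed stand-in for one DiT denoising step: pull x toward the guidance.
        x = [xi + step_size * (gi - xi) for xi, gi in zip(x, thought)]
        trajectory.append(list(thought))
    return x, thought, trajectory


def terminal_grounding_loss(final_thought: Vector, answer: Vector) -> float:
    """Mean squared error between the final thought state and an answer embedding."""
    return sum((f - a) ** 2 for f, a in zip(final_thought, answer)) / len(answer)


# Hypothetical answer embedding and a toy reasoner that moves halfway toward it.
answer = [1.0, -1.0]


def toy_refine(h: Vector) -> Vector:
    return [hi + 0.5 * (ai - hi) for hi, ai in zip(h, answer)]


sample, final_thought, trajectory = denoise_with_evolving_guidance(
    x=[0.0, 0.0], thought=[0.0, 0.0], refine=toy_refine, num_steps=4
)
loss = terminal_grounding_loss(final_thought, answer)
```

In this toy setting the grounding loss shrinks as more refinement steps are taken, which mirrors the abstract's claim that deeper latent reasoning yields better guidance; the real method operates on MLLM hidden states and a DiT, not on short float lists.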

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
cs.CV 2026-06 unverdicted novelty 7.0

RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other ...

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.