Descriptive caption enhancement with visual specialists for multimodal perception. arXiv preprint arXiv:2412.14233

Sun, Y · arXiv 2412.14233

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

cs.CL · 2026-05-19 · conditional · novelty 6.0

Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

cs.CL · 2026-05-19 · conditional · none · ref 22
