pith. machine review for the scientific record.

arxiv: 2602.11731 · v2 · submitted 2026-02-12 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Thinking with Drafting: Optical Decompression via Logical Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords visual reasoning · multimodal LLMs · domain-specific language · optical decompression · logical reconstruction · self-verification · visual algebra benchmark

The pith

Thinking with Drafting reconceptualizes visual reasoning as optical decompression by drafting mental models into executable code for self-verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that multimodal models suffer from a precision gap because they transcribe visual symbols without recovering their logical topology. It proposes Thinking with Drafting (TwD), which inserts a minimalist Domain-Specific Language as an intermediate step so the model must first draft its reasoning as executable code before producing an answer. The code draft then generates deterministic visual proofs that the model can check against its own output. Experiments on VisAlg, a visual algebra benchmark the paper introduces, show that TwD functions as a stronger cognitive scaffold than direct generation. The approach closes the loop by treating visual output as a logical verifier rather than an open-ended creation.
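To make that loop concrete, here is a minimal, hypothetical sketch of the pattern described above. Nothing in it comes from the paper, whose DSL and rendering code are not published; the `Eq` dataclass, `render`, and `self_verify` are invented stand-ins for the DSL draft, the deterministic rendering, and the self-check.

```python
# Hedged sketch of the TwD-style loop: draft the mental model as executable
# structure, render a deterministic "visual proof", compare it to the answer.
# All names here are invented stand-ins, not the paper's actual DSL.
from dataclasses import dataclass


@dataclass(frozen=True)
class Eq:
    """Drafted mental model of a linear equation a*x + b = c."""
    a: float
    b: float
    c: float

    def solve(self) -> float:
        # The executable reconstruction: the image's logical content,
        # not its pixels.
        return (self.c - self.b) / self.a


def render(eq: Eq) -> str:
    """Deterministic 'visual proof'. The paper renders actual images;
    a canonical string plays the same verifier role in this sketch."""
    return f"{eq.a}x + {eq.b} = {eq.c}  =>  x = {eq.solve()}"


def self_verify(eq: Eq, proposed: float, tol: float = 1e-9) -> bool:
    """Closed loop: the draft's deterministic output either matches the
    proposed answer or exposes a hallucination."""
    return abs(eq.solve() - proposed) < tol


# Suppose the model read "2x + 3 = 11" from an image and drafted:
draft = Eq(a=2.0, b=3.0, c=11.0)
print(render(draft))            # 2.0x + 3.0 = 11.0  =>  x = 4.0
print(self_verify(draft, 4.0))  # True
print(self_verify(draft, 5.0))  # False: the draft catches the slip
```

The design point survives the toy scale: verification is a comparison against a deterministic rendering, not a second generation that could hallucinate in the same way as the first.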

Core claim

Reasoning over visual inputs should be treated as optical decompression: the reconstruction of latent logical structures from compressed visual tokens via a minimalist DSL. Thinking with Drafting forces the model to externalize its mental model as executable code, which produces deterministic visual proofs usable for self-verification, and on the VisAlg benchmark this process outperforms standard approaches that hallucinate answers directly.

What carries the argument

Thinking with Drafting (TwD), with a minimalist DSL as its grounding intermediate representation, forces the model to draft its mental model into executable code that can be verified deterministically.

If this is right

  • Visual generation becomes a logical verification step rather than an independent creative process.
  • Models gain an internal self-correction loop that reduces hallucinations on topology-sensitive tasks.
  • The same drafting mechanism can serve as a general scaffold for any visual input whose structure can be expressed in a compact DSL, as the sketch after this list illustrates.
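One way to picture that generality, as a hedged sketch rather than anything the paper specifies: the `Draftable` protocol and the `RightTriangle` example below are invented names, showing only that the draft-execute-verify loop is indifferent to the visual domain once the structure fits a DSL.

```python
# Hypothetical generalization of the drafting scaffold beyond algebra.
# Nothing here comes from the paper; Draftable and RightTriangle are
# illustrative assumptions.
from typing import Protocol


class Draftable(Protocol):
    """Anything a model can draft: it must execute deterministically
    and render a checkable proof."""
    def execute(self) -> float: ...
    def render(self) -> str: ...


def scaffold_check(draft: Draftable, proposed: float) -> bool:
    """Domain-agnostic version of the self-verification loop."""
    return draft.execute() == proposed


class RightTriangle:
    """A second domain: a diagram-level geometric fact."""
    def __init__(self, leg_a: float, leg_b: float) -> None:
        self.leg_a, self.leg_b = leg_a, leg_b

    def execute(self) -> float:
        return self.leg_a ** 2 + self.leg_b ** 2  # squared hypotenuse

    def render(self) -> str:
        return f"triangle({self.leg_a}, {self.leg_b}) -> c^2 = {self.execute()}"


print(scaffold_check(RightTriangle(3, 4), 25))  # True
```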

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the DSL to include geometric primitives could test whether the same decompression pattern applies to diagram-based geometry problems.
  • If the code-drafting step is removed, performance should drop to baseline levels, isolating the contribution of the explicit reconstruction stage; a toy ablation harness is sketched after this list.
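A minimal sketch of what that ablation could look like. Every name here is an invented stand-in: `VISALG_ITEMS` fakes two benchmark items as plain text, `solve_with_draft` plays the drafting mode by actually parsing into executable structure, and `solve_direct` plays direct generation with a deliberately naive guess, so the gap the ablation would measure is visible.

```python
# Toy ablation harness: drafting mode vs. direct mode on faked items.
# All names are hypothetical; the paper's benchmark and models are unreleased.
from typing import Callable

# Each hypothetical item: (rendered equation text, ground-truth x).
VISALG_ITEMS = [("2x + 3 = 11", 4.0), ("5x - 5 = 20", 5.0)]


def accuracy(solver: Callable[[str], float]) -> float:
    hits = sum(abs(solver(q) - y) < 1e-9 for q, y in VISALG_ITEMS)
    return hits / len(VISALG_ITEMS)


def solve_with_draft(question: str) -> float:
    """Drafting mode: parse into executable structure, then execute.
    Handles only the toy a*x (+|-) b = c form used above."""
    lhs, rhs = question.split("=")
    ax, b = lhs.split("+") if "+" in lhs else lhs.split("-")
    a = float(ax.strip().rstrip("x"))
    sign = 1.0 if "+" in lhs else -1.0
    return (float(rhs) - sign * float(b)) / a


def solve_direct(question: str) -> float:
    """Direct mode stand-in: answers without reconstruction (here a
    deliberately naive guess, to show where the gap would appear)."""
    return 0.0


print("draft :", accuracy(solve_with_draft))  # 1.0 on these toy items
print("direct:", accuracy(solve_direct))      # 0.0
```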

Load-bearing premise

A minimalist DSL can faithfully capture logical topology from visual tokens without introducing loss or hallucination.

What would settle it

The claim would be refuted if TwD produced code drafts that either fail to run or generate visual proofs inconsistent with the correct algebraic solution on a majority of VisAlg items.
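Sketched below is what that settling test could look like mechanically, under loud assumptions: the paper's drafts are not available, so each string here is an invented stand-in for model-emitted code, chosen to exhibit the two failure modes named above (a draft that fails to run, and a proof that disagrees with the ground truth).

```python
# Hypothetical refutation audit: execute each drafted snippet and count
# drafts that crash or whose result contradicts the known solution.
items = [
    # (model-emitted draft, ground-truth solution) -- all invented
    ("answer = (11 - 3) / 2", 4.0),  # runs and agrees
    ("answer = (11 - 3) * 2", 4.0),  # runs but the proof disagrees
    ("answer = (11 - 3 / 2", 4.0),   # syntax error: fails to run
]

failures = 0
for draft, truth in items:
    scope: dict = {}
    try:
        exec(draft, {}, scope)  # execute the drafted code
        ok = abs(scope["answer"] - truth) < 1e-9
    except Exception:
        ok = False              # the draft failed to run
    failures += not ok

print(f"{failures}/{len(items)} drafts failed or rendered inconsistent proofs")
```

On these stand-ins the count is 2/3; the paper's claim predicts that real TwD drafts would land well below a majority.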

read the original abstract

Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression-the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serve as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that multimodal LLMs suffer from a precision paradox in visual reasoning because perception transcribes symbols without logical topology and generation lacks mathematical exactness. It proposes reconceptualizing reasoning as optical decompression via Thinking with Drafting (TwD), which uses a minimalist DSL as an intermediate representation to force the model to draft executable code for deterministic self-verification. The approach is guided by the axiom that Parsing is Reasoning and is validated on a new VisAlg visual algebra benchmark, where TwD is presented as a superior cognitive scaffold that turns visual generation into logical verification.

Significance. If the empirical claims hold and the DSL can indeed perform lossless logical reconstruction, the work would provide a concrete mechanism for grounding visual reasoning in executable structures, potentially improving reliability on tasks requiring exact topology such as algebraic diagrams. The closed-loop verification idea is a clear strength if supported by reproducible code or falsifiable predictions, but the current manuscript supplies no such evidence.

major comments (2)
  1. [Abstract] The assertion that 'Experiments demonstrate that TwD serve as a superior cognitive scaffold' is unsupported by any quantitative results, baseline comparisons, error analysis, or experimental controls, yet it is load-bearing for the central claim that TwD outperforms standard approaches.
  2. [Abstract] The manuscript presents the 'Parsing is Reasoning' axiom and the minimalist DSL without a grammar definition, coverage analysis over diagram classes, or failure-case enumeration; if the DSL cannot express certain algebraic relations in VisAlg without omission or hallucination, the lossless reconstruction and deterministic-proof claims collapse.
minor comments (1)
  1. [Abstract] Subject-verb agreement error: 'TwD serve' should read 'TwD serves'.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the introduced axiom 'Parsing is Reasoning' and the new entities TwD and VisAlg, with no free parameters or external benchmarks detailed.

axioms (1)
  • Parsing is Reasoning · ad hoc to this paper
    Stated as the foundational principle guiding the approach.
invented entities (3)
  • Thinking with Drafting (TwD) · no independent evidence
    purpose: method that uses the DSL to draft and verify logical structures from visual inputs
    Newly proposed technique for visual reasoning.
  • VisAlg · no independent evidence
    purpose: visual algebra benchmark for testing the method
    New benchmark introduced for validation.
  • minimalist Domain-Specific Language (DSL) · no independent evidence
    purpose: grounding intermediate representation for executable code drafts
    Core component of the TwD method.

pith-pipeline@v0.9.0 · 5516 in / 1235 out tokens · 240049 ms · 2026-05-16T03:18:50.085292+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 13 internal anchors
