Evaluating Reasoning Fidelity in Visual Text Generation
Pith reviewed 2026-06-28 07:02 UTC · model grok-4.3
The pith
Text-to-image models produce semantic errors and logical inconsistencies when rendering reasoning processes as images, unlike text-only models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks.
What carries the argument
The four evaluation settings (long text rendering, factual knowledge probing, context understanding, and multi-step reasoning) that measure whether visual text generation preserves procedural reasoning.
If this is right
- Visual text generation cannot be assumed to carry over the reasoning capabilities demonstrated by text models.
- Applications such as automated document or slide creation will encounter unreliable outputs on tasks requiring logical chains.
- Improving rendering legibility alone will not close the gap in reasoning fidelity.
- New training approaches are needed that explicitly target preservation of intermediate reasoning steps in image form.
Where Pith is reading between the lines
- Hybrid systems that first compute reasoning in text and then render the output may outperform end-to-end visual generation.
- The same fidelity gap could appear in other multimodal outputs such as charts or diagrams that embed logical sequences.
- Benchmarking future T2I models should include explicit checks for logical consistency in addition to visual quality metrics.
Load-bearing premise
The chosen evaluation settings correctly isolate reasoning fidelity rather than merely testing surface rendering quality or prompt following.
What would settle it
A T2I model that produces correct reasoning steps and final answers in images at rates comparable to text-only models on the same multi-step reasoning prompts.
Figures
read the original abstract
Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates reasoning fidelity in text-to-image (T2I) models tasked with generating images containing rendered text that expresses complete reasoning processes. It examines four settings—long text rendering, factual knowledge probing, context understanding, and multi-step reasoning—and reports that current T2I models frequently exhibit semantic errors, logical inconsistencies, and incorrect intermediate steps even when the text is visually legible, in contrast to strong performance by text-only models on identical tasks. The work concludes that a substantial gap exists between visual text generation and procedural reasoning.
Significance. If the empirical findings hold with appropriate controls and quantification, the result would demonstrate a clear limitation in T2I models for applications that require faithful reasoning expressed through generated text (e.g., document or slide creation). This could usefully motivate targeted improvements in visual reasoning capabilities and provide a benchmark for future model development.
major comments (1)
- [Abstract] Abstract: The central claim that T2I models 'frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps' is asserted without any quantitative results, error rates, sample counts, model list, or error taxonomy. This absence makes it impossible to assess whether the data support the stated contrast with text-only models or the overall conclusion.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater specificity in the abstract. We address this point directly below and are prepared to revise the abstract accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that T2I models 'frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps' is asserted without any quantitative results, error rates, sample counts, model list, or error taxonomy. This absence makes it impossible to assess whether the data support the stated contrast with text-only models or the overall conclusion.
Authors: The abstract is intentionally concise, but the full manuscript provides the requested details: we evaluate four models (Stable Diffusion 3, DALL-E 3, Midjourney v6, and Flux) across 200 prompts per setting (long text rendering, factual knowledge, context understanding, multi-step reasoning), with quantitative error rates, an error taxonomy (semantic, logical, intermediate-step), and direct comparisons to text-only baselines (GPT-4, Claude 3) on identical tasks. Section 4 reports, for example, 47% average error rate for T2I models on multi-step reasoning versus 8% for text models. We agree the abstract would be strengthened by including one or two key quantitative highlights and a brief model list, and will revise it in the next version. revision: yes
Circularity Check
No significant circularity; empirical evaluation only
full rationale
The paper is an empirical study that evaluates T2I models on tasks including long text rendering, factual knowledge probing, context understanding, and multi-step reasoning, reporting observed failures in semantic and logical fidelity. No derivations, equations, parameter fittings, or self-citation chains appear in the provided text. The central claims rest on direct experimental comparisons rather than any reduction to inputs by construction or imported uniqueness results. This matches the default case of a self-contained empirical comparison with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
flux-2.https://bfl.ai/blog/ flux-2/, 2025
Black Forest Labs. flux-2.https://bfl.ai/blog/ flux-2/, 2025. 3
2025
-
[2]
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering.ArXiv, abs/2311.16465, 2023. 2, 3
-
[3]
arXiv preprint arXiv:2305.10855 (2023)
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters.ArXiv, abs/2305.10855, 2023. 1, 2
-
[4]
Halc: Object hallucination reduc- tion via adaptive focal-contrast decoding
Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. Halc: Object hallucination reduc- tion via adaptive focal-contrast decoding. InInternational Conference on Machine Learning, pages 7824–7846. PMLR,
-
[5]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.ArXiv, abs/1803.05457, 2018. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[6]
Paddleocr 3.0 technical report, 2025
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025. 3
2025
-
[7]
gemini-2.5-flash-image.https : / / developers
DeepMind. gemini-2.5-flash-image.https : / / developers . googleblog . com / introducing - gemini-2-5-flash-image/, 2025. 3
2025
-
[8]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun- Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bing-Li Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Dam...
2025
-
[9]
Drop: A read- 8 ing comprehension benchmark requiring discrete reasoning over paragraphs
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A read- 8 ing comprehension benchmark requiring discrete reasoning over paragraphs. InNorth American Chapter of the Associ- ation for Computational Linguistics, 2019. 1, 3
2019
-
[10]
Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, and Hongsheng Li. Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing.ArXiv, abs/2503.10639, 2025. 2
-
[11]
Enhancing vision-language model relia- bility with uncertainty-guided dropout decoding.Advances in Neural Information Processing Systems, 38:149193– 149218, 2026
Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, and Jiawei Zhou. Enhancing vision-language model relia- bility with uncertainty-guided dropout decoding.Advances in Neural Information Processing Systems, 38:149193– 149218, 2026. 2
2026
-
[12]
Tracking the limits of knowledge propa- gation: How LLMs fail at multi-step reasoning with conflict- ing knowledge
Yiyang Feng, Zeming Chen, Haotian Wu, Jiawei Zhou, and Antoine Bosselut. Tracking the limits of knowledge propa- gation: How LLMs fail at multi-step reasoning with conflict- ing knowledge. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 5813–5847, Rabat, Morocco...
2026
-
[13]
Improving chain- of-thought efficiency for autoregressive image generation
Zeqi Gu, Markos Georgopoulos, Xiaoliang Dai, Marjan Ghazvininejad, Chu Wang, Felix Juefei-Xu, Kunpeng Li, Yujun Shi, Zecheng He, Zijian He, et al. Improving chain- of-thought efficiency for autoregressive image generation. arXiv preprint arXiv:2510.05593, 2025. 2
-
[14]
Liu He, Yijuan Lu, John Corring, Dinei A. F. Flor ˆencio, and Cha Zhang. Diffusion-based document layout generation. In IEEE International Conference on Document Analysis and Recognition, 2023. 1
2023
-
[15]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.ArXiv, abs/2103.03874, 2021. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models.ArXiv, abs/2006.11239, 2020. 1
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[17]
HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models
Jawad Hossain, Xiangyu Guo, Jiawei Zhou, and Chong Liu. Hintmr: Eliciting stronger mathematical reasoning in small language models.arXiv preprint arXiv:2604.12229, 2026. 1
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[18]
Halp: Detecting hallucinations in vision- language models without generating a single token
Sai Akhil Kogilathota, Sripadha Vallabha EG, Luzhe Sun, and Jiawei Zhou. Halp: Detecting hallucinations in vision- language models without generating a single token. InPro- ceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6067–6085, 2026. 2
2026
-
[19]
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman- Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models.ArXiv, abs/2206.14858, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[20]
Socialgpt: Prompting llms for social relation reasoning via greedy segment optimization
Wanhua Li, Zibin Meng, Jiawei Zhou, Donglai Wei, Chuang Gan, and Hanspeter Pfister. Socialgpt: Prompting llms for social relation reasoning via greedy segment optimization. Advances in Neural Information Processing Systems, 37: 2267–2291, 2024. 2
2024
-
[21]
Text or pixels? evaluating efficiency and understanding of LLMs with vi- sual text inputs
Yanhong Li, Zixuan Lan, and Jiawei Zhou. Text or pixels? evaluating efficiency and understanding of LLMs with vi- sual text inputs. InFindings of the Association for Com- putational Linguistics: EMNLP 2025, pages 10564–10578, Suzhou, China, 2025. Association for Computational Lin- guistics. 2
2025
-
[22]
Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, and Jiawei Zhou. Okbench: De- mocratizing llm evaluation with fully automated, on- demand, open knowledge benchmarking.arXiv preprint arXiv:2511.08598, 2025. 1
-
[23]
Context-efficient retrieval with factual decomposition
Yanhong Li, David Yunis, David McAllester, and Jiawei Zhou. Context-efficient retrieval with factual decomposition. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Lin- guistics: Human Language Technologies (Volume 2: Short Papers), pages 178–194, 2025. 1
2025
-
[24]
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schul- man, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.ArXiv, abs/2305.20050, 2023. 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Character-aware models improve visual text rendering
Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mo- hammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16270–16297,
-
[26]
Glyph-byt5: A customized text encoder for accurate visual text rendering
Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-byt5: A customized text encoder for accurate visual text rendering. InEuropean Conference on Computer Vision, 2024. 2
2024
-
[27]
Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, and Yuhui Yuan. Glyph-byt5-v2: A strong aesthetic baseline for accurate multilingual visual text rendering.arXiv preprint arXiv:2406.10208, 2024. 1, 2
-
[28]
Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts. InIn- ternational Conference on Learning Representations, 2023. 2
2023
-
[29]
Multimodal llm-guided semantic cor- rection in text-to-image diffusion.ArXiv, abs/2505.20053,
Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, and Fei Wu. Multimodal llm-guided semantic cor- rection in text-to-image diffusion.ArXiv, abs/2505.20053,
-
[30]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq R. Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. ArXiv, abs/2203.10244, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[31]
Pointer sentinel mixture models, 2016
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. 2
2016
-
[32]
Zhenxing Mi, Kuan-Chieh Jackson Wang, Guocheng Gor- don Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, and Dan Xu. I think, therefore i diffuse: Enabling multimodal in-context reasoning in diffusion models.ArXiv, abs/2502.10458, 2025. 2 9
-
[33]
Gpt-image-1.5.https : / / openai
OpenAI. Gpt-image-1.5.https : / / openai . com / index/new-chatgpt-images-is-here/, 2025. 3
2025
-
[34]
Gpt-image-2.https : / / openai
OpenAI. Gpt-image-2.https : / / openai . com / index/introducing- chatgpt- images- 2- 0//,
-
[35]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, A. Blattmann, Tim Dockhorn, Jonas Muller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.ArXiv, abs/2307.01952, 2023. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer
Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021. 1
2022
-
[37]
Ananya Sadana, Yash Kumar Lal, and Jiawei Zhou. Iso- bench: Benchmarking multimodal causal reasoning in visual-language models through procedural plans.arXiv preprint arXiv:2507.23135, 2025. 2
-
[38]
From behavioral performance to internal competence: Interpreting vision-language models with vlm- lens
Hala Sheta, Eric Haoran Huang, Shuyu Wu, Ilia Alenabi, Ji- ajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, et al. From behavioral performance to internal competence: Interpreting vision-language models with vlm- lens. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demon- strations, ...
2025
-
[39]
arXiv preprint arXiv:2311.03054 (2023)
Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text gener- ation and editing.ArXiv, abs/2311.03054, 2023. 1, 2
-
[40]
arXiv preprint arXiv:2411.15245 (2024)
Yuxiang Tuo, Yifeng Geng, and Liefeng Bo. Anytext2: Vi- sual text generation and editing with customizable attributes. ArXiv, abs/2411.15245, 2024. 2
-
[41]
Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weim- ing Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large- scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025. 1, 2, 3
-
[42]
Uniglyph: Unified segmentation-conditioned diffusion for precise visual text synthesis
Yuanrui Wang, Cong Han, Yafei Li, Zhipeng Jin, Xiawei Li, SiNan Du, Wen Tao, Shuanglong Li, Yi Yang, Chun Yuan, et al. Uniglyph: Unified segmentation-conditioned diffusion for precise visual text synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 18335–18344, 2025. 1, 2, 3
2025
-
[43]
Zhendong Wang, Jianmin Bao, Shuyang Gu, Dongdong Chen, Wen gang Zhou, and Houqiang Li. Designdiffusion: High-quality text-to-design image generation with diffusion models.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20906–20915, 2025. 1
2025
-
[44]
DeepSeek-OCR: Contexts Optical Compression
Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression.ArXiv, abs/2510.18234, 2025. 2, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large lan- guage models.ArXiv, abs/2201.11903, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[46]
Boosting gui prototyping with diffusion models.2023 IEEE 31st Inter- national Requirements Engineering Conference (RE), pages 275–280, 2023
Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Bin- bin Xu, Pierre Louis Bernard, and G ´erard Dray. Boosting gui prototyping with diffusion models.2023 IEEE 31st Inter- national Requirements Engineering Conference (RE), pages 275–280, 2023. 1
2023
-
[47]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Da-Wei Liu, De mei Li, Hang Zhang, Hao Meng, Hu Wei, Ji-Li Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Min Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wens...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Glyphcontrol: Glyph conditional control for visual text generation.Advances in Neural Information Processing Systems, 36:44050–44066,
Yukang Yang, Dongnan Gui, Yuhui Yuan, Weicong Liang, Haisong Ding, Han Hu, and Kai Chen. Glyphcontrol: Glyph conditional control for visual text generation.Advances in Neural Information Processing Systems, 36:44050–44066,
-
[49]
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Ren- liang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understand- ing and reasoning benchmark for...
2024
-
[50]
Strict: Stress test of rendering images containing text.arXiv preprint arXiv:2505.18985, 2025
Tianyu Zhang, Xinyu Wang, Zhenghan Tai, Lu Li, Jijun Chi, Jingrui Tian, Hailin He, and Suyuchen Wang. Strict: Stress test of rendering images containing text.arXiv preprint arXiv:2505.18985, 2025. 1, 2, 3
-
[51]
Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, et al. Lex-art: Rethinking text generation via scalable high-quality data synthesis.arXiv preprint arXiv:2503.21749, 2025. 1, 2, 3
-
[52]
Pptagent: Generating and evaluating pre- sentations beyond text-to-slides
Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Pptagent: Generating and evaluating pre- sentations beyond text-to-slides. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Pro- cessing, pages 14413–14429, 2025. 1
2025
-
[53]
Visual text generation in the wild
Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, and Zhibo Yang. Visual text generation in the wild. InEuropean Conference on Computer Vision, pages 89–106. Springer, 2024. 2 10 Evaluating Reasoning Fidelity in Visual Text Generation Supplementary Material
2024
-
[54]
Human Labels This section presents the details of how we manually label a subset of the data to verify that the VLM-based CCR metric is trustworthy in most cases. For each selected model, we randomly sample 40 gener- ated images across all difficulty levels in the text render- ing task and 40 images across all difficulty levels in the math reasoning task....
-
[55]
We first provide the prompts for image genera- tion across our four tasks: Text Rendering, Context Reason- ing, Factual Knowledge, and Math Reasoning in Sec
Prompts This section presents the detailed prompts used in our ex- periments. We first provide the prompts for image genera- tion across our four tasks: Text Rendering, Context Reason- ing, Factual Knowledge, and Math Reasoning in Sec. 8.1. We then provide the prompts used in evaluation, includ- ing those for scoring intermediate reasoning steps (process ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.