
arxiv: 2605.19382 · v1 · pith:RZHCYJR2new · submitted 2026-05-19 · 💻 cs.AI

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

Pith reviewed 2026-05-20 05:48 UTC · model grok-4.3

classification 💻 cs.AI
keywords programmatic video generation · spatial-temporal reasoning · LLM evaluation · code generation benchmark · visualization · animation · execution gap

The pith

LLMs that generate executable code for animated visualizations often produce spatially incoherent output: on average, the spatial pass rate falls approximately 41% below the execution success rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRISM, a benchmark of 10,372 human-calibrated instruction-code pairs for testing language models on generating code that creates accurate animated visualizations. It evaluates seven mainstream LLMs and identifies a consistent gap where code runs successfully but fails to maintain correct spatial layouts across animation sequences. This separation matters because programmatic approaches promise geometric precision that pixel-based methods lack, yet current models do not reliably deliver it for real-world visualization tasks. The work uses a set of four metrics to isolate issues of executability from problems in spatial and temporal reasoning.

Core claim

The paper establishes an Execution-Spatial Gap: across seven LLMs evaluated on thousands of tasks in 437 subject categories, the rate at which generated animations show correct spatial layouts over the full sequence falls approximately 41% below the rate at which the generated code runs successfully, on average.

What carries the argument

The PRISM benchmark of 10,372 human-calibrated instruction-code pairs together with its funnel-style evaluation framework that applies four metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness, Prompt-Aware Dynamic Visual Complexity, and Temporal Density.
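
As a rough illustration of what "funnel-style" means here, the sketch below gates the layout check on successful execution and counts how many generations survive each stage. The `execute` and `spatial_check` callables and the toy inputs are hypothetical stand-ins, not PRISM's actual harness, and the two diagnostic metrics (PADVC, TD) are left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class FunnelResult:
    total: int = 0        # all generations entering the funnel
    executed: int = 0     # passed the executability stage
    spatial_ok: int = 0   # executed AND passed the layout stage


def run_funnel(
    generations: Iterable[str],
    execute: Callable[[str], Optional[object]],
    spatial_check: Callable[[object], bool],
) -> FunnelResult:
    """Count how many generations survive each stage.

    `execute` returns a rendered artifact, or None on failure.
    `spatial_check` is only called on artifacts that rendered.
    """
    result = FunnelResult()
    for code in generations:
        result.total += 1
        artifact = execute(code)
        if artifact is None:
            continue  # stage-1 failure: never reaches the spatial stage
        result.executed += 1
        if spatial_check(artifact):
            result.spatial_ok += 1
    return result


# Toy stand-ins (hypothetical): "crash" fails to run,
# "overlap" runs but produces a bad layout.
def fake_execute(code: str) -> Optional[object]:
    return None if "crash" in code else code


def fake_spatial_check(artifact: object) -> bool:
    return "overlap" not in str(artifact)


counts = run_funnel(["ok", "overlap", "crash", "ok"], fake_execute, fake_spatial_check)
print(counts)
```

Keeping the stage counts separate, rather than collapsing them into one score, is what lets an execution failure be told apart from a layout failure.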

If this is right

  • Evaluation of programmatic video generation must extend beyond code executability to include checks for spatial coherence across animation frames.
  • Mainstream LLMs exhibit substantial limitations in spatial-temporal reasoning when translating instructions into code for visualizations.
  • The benchmark spans English and Chinese instructions and 437 categories, indicating the gap is not limited to narrow domains.
  • Future model development should target improvements in geometric and temporal understanding rather than relying solely on execution feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could be trained with additional signals that enforce geometric constraints during code generation to reduce the observed gap.
  • The benchmark structure might transfer to other code-based simulation tasks such as scientific plotting or interactive diagrams.
  • The results point to a need for verification or planning stages that check spatial properties before final code output.

Load-bearing premise

The human-calibrated instruction-code pairs and the four metrics accurately capture spatial-temporal reasoning ability without bias from the calibration or metric design.

What would settle it

A model achieving high spatial pass rates close to its execution success rates on the PRISM tasks, or a demonstration that the spatial metric fails to match independent human judgments of visual coherence, would undermine the reported gap.
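
One way to run the second check is to compare the metric's pass/fail verdicts with independent human verdicts on the same outputs, using a chance-corrected agreement statistic. The sketch below uses Cohen's kappa for two binary raters; the choice of statistic and the toy labels are assumptions for illustration, not something the paper reports.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of 1-labels from rater a
    p_b = sum(b) / n  # share of 1-labels from rater b
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on eight rendered animations (1 = layout judged correct).
metric_verdicts = [1, 1, 0, 0, 1, 0, 1, 0]
human_verdicts = [1, 1, 0, 0, 1, 0, 0, 1]

kappa = cohens_kappa(metric_verdicts, human_verdicts)
print(f"kappa = {kappa:.2f}")  # 0.50 for these toy labels
```

A kappa near zero would mean the spatial metric agrees with human judgment no better than chance, which is the failure case described above.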

Figures

Figures reproduced from arXiv: 2605.19382 by Chen Qian, Huatao Li, Jie Zhang, Jingru Fan, Lin Wu, Qiran Zhang, Ruijie Shi, Runde Yang, Shu Yao, Tianle Zhou, Yihan Li, Yuheng Wang.

Figure 1
Figure 1: Qualitative contrast between pixel-level and programmatic video generation. Recent advances in large language models (LLMs) and generative AI have broadened automated content creation from text and images to videos [25, 61, 59]. Automated video generation has since evolved along two major routes [61]. Pixel-level methods, typically based on diffusion models, achieve impressive visual fidelity by modeli…
Figure 2
Figure 2: Data overview of PRISM and aggregate model capability on the benchmark. The left panel illustrates multi-level subject coverage with representative examples, while the right panel presents direction-aligned scores summarizing model capability. To address these issues, we introduce PRISM (Programmatic Reasoning In Spatial Modalities) (See …
Figure 3
Figure 3: Benchmark statistics. Left: The 12 most frequently occurring Manim APIs and operators in the dataset. Top-right: Distribution of character counts per prompt. Middle-right: Scale of the reference code. Bottom-right: Composition of program structure. Inputs feature structured educational text averaging 1,169 characters (English) and 432 characters (Chinese). Over 87% of samples include headings, lists, or La…
Figure 4
Figure 4: Funnel-style evaluation framework. From Code-Level Reliability to Spatial Reasoning, and to PADVC/TD diagnostic dimensions. We construct a fine-grained evaluation suite spanning both code and visual dimensions, organized around four complementary metrics. Code-Level Reliability measures execution robustness. Spatial Reasoning assesses layout planning on a constrained two-dimensional canvas. Prompt-Aware …
Figure 5
Figure 5: PADVC vs. generation quality. Both under- and over-estimated dynamic visual complexity lead to failures. Both $E_{\text{text}}$ and $E_{\text{geo}}$ are computed from frame-level image analysis. For each frame $t$, OCR detects text regions and produces a binary mask $\widehat{M}_{\text{text}}(t)$. The frame-level text-boundary energy is $E_{\text{text}}(t) = \sum$ …
Figure 6
Figure 6: Execution-Spatial Gap. All models lie above the diagonal, indicating a clear disconnect between execution success and spatial pass. The gap size varies revealingly across models. Gemini 3.1 Pro Preview pairs the strongest Spatial scores with one of the smallest gaps, suggesting that its code generation and spatial planning are well aligned. Qwen3.5-397B-A17B has the lowest Exec. yet a small gap, pointing t…
Figure 7
Figure 7: TextExpand cases. The left Gemini sample remains concise, whereas the right GPT-5.4 sample expands visible text aggressively and fails under a substantially heavier spatial burden. Joint diagnosis with PADVC and total energy. We further compare a collapsed total-energy score against the separated view of PADVC and TextExpand. Total energy does identify high-risk outputs, with the top 10% reaching an 80.4% …
Figure 8
Figure 8: Thinking ablation across models and languages. Deltas are computed as thinking minus base. ∆Spatial (x) vs. ∆PADVCc (y). Bubble size indicates latency increase; color intensity encodes token increase. The green quadrant marks simultaneous improvement. Effect of thinking …
Figure 9
Figure 9: Thinking-ablation cases. Left denotes the base and right denotes the thinking mode. … paradigm target static, single-frame outputs: Design2Code [42] evaluates HTML/CSS generation, Plot2Code [51] and ChartMimic [57] focus on chart reproduction, and TikZ benchmarks [50] assess scientific figure generation. These works are limited to instantaneous layouts and overlook multi-step animation sequences, where spati…
Abstract

Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PRISM, a benchmark of 10,372 human-calibrated instruction-code pairs (20x larger than prior work) for programmatic video generation across English/Chinese and 437 categories. It defines a funnel-style evaluation using four metrics (Code-Level Reliability for executability, Spatial Reasoning for layout correctness over animation sequences, plus PADVC and TD for dynamic/temporal aspects) and evaluates seven mainstream LLMs, reporting an average ~41% Execution-Spatial Gap between execution success rate and spatial pass rate to argue that runnable code does not guarantee spatially coherent visual output.

Significance. If the gap is shown to reflect spatial deficits conditional on executable code, the work is significant for establishing a large-scale, human-grounded benchmark that pushes evaluation of code-based video generation beyond executability alone. The scale, real-world scenario grounding, and multi-metric framework are clear strengths that could support reproducible progress in spatial-temporal reasoning for LLMs.

major comments (1)
  1. [Evaluation framework] Evaluation framework section: the Spatial Reasoning metric and the reported 41% Execution-Spatial Gap must explicitly define the aggregation procedure. Is the spatial pass rate computed only over the subset of generations that pass Code-Level Reliability (i.e., conditional on successful execution and subsequent rendering/layout checks), or is it an unconditional percentage over all samples? The abstract's claim that 'runnable code does not necessarily yield spatially coherent visual output' requires the former; if the latter, the gap largely reproduces the execution failure rate rather than revealing additional spatial deficits.
minor comments (2)
  1. [Results] Results section: report per-model execution success rates, spatial pass rates, and the exact gap values with standard deviations or confidence intervals rather than only the average 41% figure, to allow readers to assess variability across the seven LLMs.
  2. [Benchmark construction] Benchmark construction: clarify the exact procedure and inter-annotator agreement for the human calibration of the 10,372 instruction-code pairs, including how spatial-temporal correctness was verified during dataset creation.
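
The interval requested in the first minor comment could be obtained with a percentile bootstrap over per-sample outcomes. The sketch below is one such approach under stated assumptions: it uses the unconditional gap (share of all samples that execute but fail the layout check), and the outcome list is invented toy data, not results from the paper.

```python
import random
from typing import List, Tuple

Outcome = Tuple[bool, bool]  # (executed, spatial_ok) for one generation


def gap(outcomes: List[Outcome]) -> float:
    """Execution rate minus spatial pass rate, both over all samples."""
    n = len(outcomes)
    executed = sum(e for e, _ in outcomes)
    spatial = sum(e and s for e, s in outcomes)
    return (executed - spatial) / n


def bootstrap_gap_ci(
    outcomes: List[Outcome],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the gap."""
    rng = random.Random(seed)
    n = len(outcomes)
    gaps = []
    for _ in range(n_resamples):
        # Resample the n outcomes with replacement and recompute the gap.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        gaps.append(gap(resample))
    gaps.sort()
    low = gaps[int((alpha / 2) * n_resamples)]
    high = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical model: 100 samples, 80 execute, 40 of those pass the layout check.
toy = [(True, True)] * 40 + [(True, False)] * 40 + [(False, False)] * 20
low, high = bootstrap_gap_ci(toy)
print(f"gap = {gap(toy):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Running this once per model would give the per-model gaps and intervals the referee asks for, in place of a single averaged figure.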

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address the major comment on the evaluation framework below and will incorporate clarifications in the revised manuscript.

point-by-point responses
  1. Referee: Evaluation framework section: the Spatial Reasoning metric and the reported 41% Execution-Spatial Gap must explicitly define the aggregation procedure. Is the spatial pass rate computed only over the subset of generations that pass Code-Level Reliability (i.e., conditional on successful execution and subsequent rendering/layout checks), or is it an unconditional percentage over all samples? The abstract's claim that 'runnable code does not necessarily yield spatially coherent visual output' requires the former; if the latter, the gap largely reproduces the execution failure rate rather than revealing additional spatial deficits.

    Authors: We agree that the aggregation procedure requires explicit definition to avoid ambiguity. The Spatial Reasoning metric is computed conditionally: the spatial pass rate is calculated exclusively over the subset of generations that first pass Code-Level Reliability (i.e., successful execution and rendering). This conditional evaluation isolates spatial-temporal deficits beyond mere executability and directly supports the abstract claim that runnable code does not guarantee spatially coherent output. The reported ~41% Execution-Spatial Gap is the average difference between execution success rate and this conditional spatial pass rate across models. We will revise the Evaluation framework section to state this conditional procedure explicitly, include the precise aggregation formula, and clarify how the gap is derived.

    revision: yes
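
To make the distinction at issue concrete, the sketch below computes the gap under both readings from the same stage counts. The function names and the counts are hypothetical; the conditional version follows the rebuttal's wording (execution rate over all samples minus spatial pass rate over executed samples) and is not a formula taken from the paper.

```python
def gap_unconditional(n_total: int, n_executed: int, n_spatial_ok: int) -> float:
    """Both rates use all samples as the denominator."""
    exec_rate = n_executed / n_total
    spatial_rate = n_spatial_ok / n_total
    return exec_rate - spatial_rate


def gap_conditional(n_total: int, n_executed: int, n_spatial_ok: int) -> float:
    """Spatial pass rate is computed only over samples that executed."""
    exec_rate = n_executed / n_total
    spatial_rate = n_spatial_ok / n_executed if n_executed else 0.0
    return exec_rate - spatial_rate


# Hypothetical counts: 100 generations, 80 execute, 40 of those pass the layout check.
print(f"unconditional gap: {gap_unconditional(100, 80, 40):.2f}")  # 0.40
print(f"conditional gap:   {gap_conditional(100, 80, 40):.2f}")    # 0.30
```

With these counts the two definitions differ by ten points. The unconditional gap equals the share of all samples that run but fail the layout check, whereas the conditional version subtracts rates with different denominators, so a reported figure is only interpretable once the definition is stated.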

Circularity Check

0 steps flagged

No significant circularity; empirical gap is measured outcome of independent benchmark evaluation

full rationale

The paper introduces a new benchmark of instruction-code pairs and applies four explicitly defined metrics (Code-Level Reliability, Spatial Reasoning, PADVC, TD) to LLM-generated outputs. The Execution-Spatial Gap is reported as the observed numerical difference between execution success rate and spatial pass rate across seven LLMs. No equations, fitted parameters, or self-citations appear in the derivation; the gap is not forced by redefining one metric in terms of the other or by renaming an input. The central claim rests on external LLM evaluations against the benchmark rather than reducing to the benchmark construction itself. This is a standard self-contained benchmark paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract was available, so the ledger is minimal; the main assumptions concern the validity of the human calibration and of the metric definitions for spatial reasoning.

axioms (1)
  • domain assumption Human calibration produces reliable ground-truth instruction-code pairs that reflect real-world visualization needs.
    Stated in abstract as 'human-calibrated' without further detail on process or inter-annotator agreement.

pith-pipeline@v0.9.0 · 5747 in / 1193 out tokens · 36131 ms · 2026-05-20T05:48:16.321062+00:00 · methodology

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 11 internal anchors

  1. [1]

    Projudge: A multi-modal multi-discipline benchmark and instruction-tuning dataset for mllm-based process judges

    Jiaxin Ai, Pengfei Zhou, Zhaopan Xu, Ming Li, Fanrui Zhang, Zizhen Li, Jianwen Sun, Yukang Feng, Baojin Huang, Zhongyuan Wang, and Kaipeng Zhang. Projudge: A multi-modal multi-discipline benchmark and instruction-tuning dataset for mllm-based process judges. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  2. [2]

    Kimi K2: Open Agentic Intelligence

    Moonshot AI. Kimi k2: Open agentic intelligence. In arXiv preprint arXiv:2507.20534, 2025

  3. [3]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Zhipu AI. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. In arXiv preprint arXiv:2508.06471, 2025

  4. [4]

    Claude’s extended thinking

    Anthropic. Claude’s extended thinking. 2025

  5. [5]

    Claude opus 4 & claude sonnet 4 system card

    Anthropic. Claude opus 4 & claude sonnet 4 system card. 2025

  6. [6]

    Dash: Detection and assessment of systematic hallucinations of vlms

    Maximilian Augustin, Yannic Neuhaus, and Matthias Hein. Dash: Detection and assessment of systematic hallucinations of vlms. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  7. [7]

    Tikzero: Zero-shot text-guided graphics program synthesis

    Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, and Simone Paolo Ponzetto. Tikzero: Zero-shot text-guided graphics program synthesis. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  8. [8]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  9. [9]

    Visualedu: A benchmark for assessing coding and visual comprehension through educational problem-solving video generation

    Hao Chen, Tianyu Shi, Pengran Huang, Zeyuan Li, Jiahui Pan, Qianglong Chen, and Lewei He. Visualedu: A benchmark for assessing coding and visual comprehension through educational problem-solving video generation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  10. [10]

    Code2video: A code-centric paradigm for educational video generation

    Yanzhe Chen, Kevin Qinghong Lin, and Mike Zheng Shou. Code2video: A code-centric paradigm for educational video generation. In arXiv preprint arXiv:2510.01174, 2025

  11. [11]

    Wan-move: Motion-controllable video generation via latent trajectory guidance

    Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, and Yujiu Yang. Wan-move: Motion-controllable video generation via latent trajectory guidance. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Google DeepMind. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. In arXiv preprint arXiv:2507.06261, 2025

  13. [13]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models. In arXiv preprint arXiv:2512.02556, 2025

  14. [14]

    Codescore: Evaluating code generation by learning code execution

    Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, and Zhi Jin. Codescore: Evaluating code generation by learning code execution. In ACM Transactions on Software Engineering and Methodology (TOSEM), 2025

  15. [15]

    A Survey on Code Generation with LLM-based Agents

    Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, and Ge Li. A survey on code generation with llm-based agents. In arXiv preprint arXiv:2508.00083, 2025

  16. [16]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In I...

  17. [17]

    Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward

    Yandong Guan, Xilin Wang, Ximing Xing, Jing Zhang, Dong Xu, and Qian Yu. Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  18. [18]

    Flaw or artifact? rethinking prompt sensitivity in evaluating llms

    Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, and Yao Qin. Flaw or artifact? rethinking prompt sensitivity in evaluating llms. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  19. [19]

    Scipostgen: Bridging the gap between scientific papers and poster layouts

    Shun Inadumi, Shohei Tanaka, Tosho Hirasawa, Atsushi Hashimoto, Koichiro Yoshino, and Yoshitaka Ushiku. Scipostgen: Bridging the gap between scientific papers and poster layouts. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Findings), 2026

  20. [20]

    St-vlm: Kinematic instruction tuning for spatio-temporal reasoning in vision-language models

    Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, and Hyunwoo J. Kim. St-vlm: Kinematic instruction tuning for spatio-temporal reasoning in vision-language models. In arXiv preprint arXiv:2503.19355, 2025

  21. [21]

    c2: Scalable auto-feedback for llm-based chart generation

    Woosung Koh, Jang Han Yoon, MinHyung Lee, Youngjin Song, Jaegwan Cho, Jaehyun Kang, Taehyeon Kim, Se-Young Yun, Youngjae Yu, and Bongshin Lee. c2: Scalable auto-feedback for llm-based chart generation. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025

  22. [22]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  23. [23]

    Theoremexplainagent: Towards video-based multimodal explanations for llm theorem understanding

    Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, and Wenhu Chen. Theoremexplainagent: Towards video-based multimodal explanations for llm theorem understanding. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  24. [24]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  25. [25]

    A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges

    Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Workshops), 2025

  26. [26]

    Can multimodal large language models understand spatial relations?

    Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, and Tong Ruan. Can multimodal large language models understand spatial relations? In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  27. [27]

    On robustness and reliability of benchmark-based evaluation of llms

    Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, and Kevin Roitero. On robustness and reliability of benchmark-based evaluation of llms. In arXiv preprint arXiv:2509.04013, 2025

  28. [28]

    Geogrambench: Benchmarking the geometric program reasoning in modern llms

    Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, and Yong Wu. Geogrambench: Benchmarking the geometric program reasoning in modern llms. In International Conference on Learning Representations (ICLR), 2026

  29. [29]

    Rethinking verification for llm code generation: From generation to testing

    Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, and Kai Chen. Rethinking verification for llm code generation: From generation to testing. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  30. [30]

    ivispar – an interactive visual-spatial reasoning benchmark for vlms

    Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, and Elia Bruni. ivispar – an interactive visual-spatial reasoning benchmark for vlms. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  31. [31]

    CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

    Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, and Anjan Dutta. Countloop: Training-free high-instance image generation via iterative agent guidance. In arXiv preprint arXiv:2508.16644, 2025

  32. [32]

    Spare: Enhancing spatial reasoning in vision-language models with synthetic data

    Michael Ogezi and Freda Shi. Spare: Enhancing spatial reasoning in vision-language models with synthetic data. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  33. [33]

    Manibench: A benchmark for testing visual-logic drift and syntactic hallucinations in manim code generation

    Nabin Oli. Manibench: A benchmark for testing visual-logic drift and syntactic hallucinations in manim code generation. In arXiv preprint arXiv:2603.13251, 2026

  34. [34]

    Gpt-4.5 system card

    OpenAI. Gpt-4.5 system card. 2025

  35. [35]

    Renjie Pi, Felix Bai, Qibin Chen, Simon Wang, Jiulong Shan, Kieran Liu, and Meng Cao. Mr. judge: Multimodal reasoner as a judge. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  36. [36]

    Capture: Evaluating spatial reasoning in vision language models via occluded object counting

    Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, and Mohit Bansal. Capture: Evaluating spatial reasoning in vision language models via occluded object counting. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  37. [37]

    Forest: Frame of reference evaluation in spatial reasoning tasks

    Tanawan Premsri and Parisa Kordjamshidi. Forest: Frame of reference evaluation in spatial reasoning tasks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  38. [38]

    Can large language models understand symbolic graphics programs?

    Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, and Bernhard Schölkopf. Can large language models understand symbolic graphics programs? In International Conference on Learning Representations (ICLR), 2025

  39. [39]

    Benchmarking spatiotemporal reasoning in llms and reasoning models: Capabilities and challenges

    Pengrui Quan, Brian Wang, Kang Yang, Liying Han, and Mani Srivastava. Benchmarking spatiotemporal reasoning in llms and reasoning models: Capabilities and challenges. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  40. [40]

    Text2vis: A challenging and diverse benchmark for generating multimodal visualizations from text

    Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, and Enamul Hoque. Text2vis: A challenging and diverse benchmark for generating multimodal visualizations from text. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  41. [41]

    Brittlebench: Quantifying LLM robustness via prompt sensitivity

    Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Oktar, Samuel J. Bell, Anaelia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, and Adina Williams. Brittlebench: Quantifying llm robustness via prompt sensitivity. In arXiv preprint arXiv:2603.13285, 2026

  42. [42]

    Design2code: Benchmarking multimodal code generation for automated front-end engineering

    Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2code: Benchmarking multimodal code generation for automated front-end engineering. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025

  43. [43]

    Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, and Jordan J. Bird. Training and agentic inference strategies for llm-based manim animation generation. In arXiv preprint arXiv:2604.18364, 2026

  44. [44]

    Manim - mathematical animation framework (v0.19.0)

    The Manim Community Developers. Manim - mathematical animation framework (v0.19.0). In 10.5281/zenodo.14699705, 2025

  45. [45]

    Ode: Open-set evaluation of hallucinations in multimodal large language models

    Yahan Tu, Rui Hu, and Jitao Sang. Ode: Open-set evaluation of hallucinations in multimodal large language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  46. [46]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  47. [47]

    Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, and Muhao Chen. Muirbench: A comprehensive benchmark for robust multi-image understanding. In Inte...

  48. [48]

    Skelayout: Spatial knowledge enhanced layout generation with llms

    Junsheng Wang, Nieqing Cao, Yan Ding, Mengying Xie, Fuqiang Gu, and Chao Chen. Skelayout: Spatial knowledge enhanced layout generation with llms. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  49. [49]

    Spatial457: A diagnostic benchmark for 6d spatial reasoning of large multimodal models

    Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A diagnostic benchmark for 6d spatial reasoning of large multimodal models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  50. [50]

    From words to structured visuals: A benchmark and framework for text-to-diagram generation and editing

    Jingxuan Wei, Cheng Tan, Qi Chen, Gaowei Wu, Siyuan Li, Zhangyang Gao, Linzhuang Sun, Bihui Yu, and Ruifeng Guo. From words to structured visuals: A benchmark and framework for text-to-diagram generation and editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  51. [51]

    Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots

    Chengyue Wu, Zhixuan Liang, Yixiao Ge, Qiushan Guo, Zeyu Lu, Jiahao Wang, Ying Shan, and Ping Luo. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (Findings), 2024

  52. [52]

    PanoWan: Lifting diffusion video generation models to 360° with latitude/longitude-aware mechanisms

    Yifei Xia, Shuchen Weng, Siqi Yang, Jingqi Liu, Chengxuan Zhu, Minggui Teng, Zijian Jia, Han Jiang, and Boxin Shi. PanoWan: Lifting diffusion video generation models to 360° with latitude/longitude-aware mechanisms. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  53. [53]

    Core: Benchmarking llms code reasoning capabilities through static analysis tasks

    Danning Xie, Mingwei Zheng, Xuwei Liu, Jiannan Wang, Chengpeng Wang, Lin Tan, and Xiangyu Zhang. Core: Benchmarking llms code reasoning capabilities through static analysis tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  54. [54]

    Empowering llms to understand and generate complex vector graphics

    Ximing Xing, Juncheng Hu, Guotao Liang, Jing Zhang, Dong Xu, and Qian Yu. Empowering llms to understand and generate complex vector graphics. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  55. [55]

    Defining and evaluating visual language models’ basic spatial abilities: A perspective from psychometrics

    Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, and Yong Li. Defining and evaluating visual language models’ basic spatial abilities: A perspective from psychometrics. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  56. [56]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report. In arXiv preprint arXiv:2505.09388, 2025

  57. [57]

    Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, and Yujiu Yang. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation. In International Conference on Learning Representations (ICLR), 2025

  58. [58]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  59. [60]

    Omnisvg: A unified scalable vector graphics generation model

    Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. Omnisvg: A unified scalable vector graphics generation model. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  60. [61]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations (ICLR), 2025

  61. [62]

    Mitigating spatial hallucination in large language models for path planning via prompt engineering

    Hongjie Zhang, Hourui Deng, Jie Ou, and Chaosheng Feng. Mitigating spatial hallucination in large language models for path planning via prompt engineering. In Scientific Reports, 2025

  62. [63]

    Sphere: Unveiling spatial blind spots in vision-language models through hierarchical evaluation

    Wenyu Zhang, Wei En Ng, Lixin Ma, Yuwen Wang, Junqi Zhao, Allison Koenecke, Boyang Li, and Lu Wang. Sphere: Unveiling spatial blind spots in vision-language models through hierarchical evaluation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  63. [64]

    Chartcoder: Advancing multimodal large language model for chart-to-code generation

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Chartcoder: Advancing multimodal large language model for chart-to-code generation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  64. [65]

    Knowledge-enhanced large language models for automatic lesson plan generation

    Ying Zheng, Shuyan Huang, Xiaoli Zeng, Yaying Huang, Zitao Liu, and Weiqi Luo. Knowledge-enhanced large language models for automatic lesson plan generation. In Humanities and Social Sciences Communications, 2025

  65. [66]

    Autofigure: Generating and refining publication-ready scientific illustrations

    Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, and Yue Zhang. Autofigure: Generating and refining publication-ready scientific illustrations. In International Conference on Learning Representations (ICLR), 2026