pith. machine review for the scientific record.

arxiv: 2604.03893 · v1 · submitted 2026-04-04 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 16:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords: Feynman diagrams · multimodal LLMs · physics reasoning · benchmark · diagrammatic reasoning · Standard Model · particle physics · conservation laws

The pith

FeynmanBench shows that state-of-the-art multimodal LLMs fail to consistently enforce physical constraints and topological rules when reasoning with Feynman diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FeynmanBench, the first dedicated test of multistep reasoning over Feynman diagrams, which encode conservation laws, symmetries, graph topology, and amplitude calculations in particle physics. Current multimodal models excel at local feature extraction but have not been tested on the global logical structure required to move between diagrams and algebra while obeying every constraint. The benchmark uses an automated generator to produce over 2000 tasks across more than 100 diagram types drawn from the electromagnetic, weak, and strong sectors of the Standard Model. Experiments reveal that leading MLLMs routinely violate momentum conservation, mishandle topology, and produce inconsistent amplitudes, exposing a gap between general visual reasoning and the precise logic of theoretical physics notation.

Core claim

FeynmanBench supplies a reproducible collection of Feynman diagrams together with ground-truth topological annotations and amplitude results. Tasks require models to identify diagram topology, enforce conservation laws and symmetry constraints, translate between diagrammatic and algebraic forms, and compute scattering amplitudes under specified conventions. Large-scale evaluation of current multimodal LLMs demonstrates systematic failures in maintaining global physical constraints and topological integrity across the full range of Standard Model interactions.
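The kind of constraint the tasks enforce can be sketched as a minimal check. This is an illustrative example only: the particle encoding, the `CHARGE` table, and the vertex representation are assumptions for the sketch, not the benchmark's actual data format.

```python
# Illustrative sketch: checking one conservation law (electric charge) on a
# graph-encoded Feynman diagram. The encoding below is hypothetical.

# Electric charge in units of e for a few Standard Model particles.
CHARGE = {"e-": -1, "e+": +1, "photon": 0, "mu-": -1, "mu+": +1}

def charge_conserved(vertices):
    """Each vertex is a list of (particle, direction) pairs; direction is +1
    for a leg flowing into the vertex and -1 for a leg flowing out.
    Charge must sum to zero at every vertex."""
    for legs in vertices:
        if sum(direction * CHARGE[p] for p, direction in legs) != 0:
            return False
    return True

# e+ e- -> mu+ mu- via a single photon: two QED vertices.
diagram = [
    [("e-", +1), ("e+", +1), ("photon", -1)],    # annihilation vertex
    [("photon", +1), ("mu-", -1), ("mu+", -1)],  # pair-creation vertex
]
print(charge_conserved(diagram))  # True for this valid diagram
```

A model that reasons globally must, in effect, run this kind of check at every vertex simultaneously while also tracking topology and amplitude structure.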

What carries the argument

The automated pipeline that generates diverse Feynman diagrams together with verifiable topological annotations and amplitude results.
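The load-bearing property here is that annotations are verifiable: an independent checker can recompute them from the raw diagram. A hedged sketch of that generate-then-verify pattern, with illustrative field names that are not the paper's actual schema:

```python
# Hedged sketch of the generate-then-verify pattern an automated benchmark
# pipeline can use: every emitted task carries annotations that an independent
# checker recomputes from the diagram itself. Field names are assumptions.

def make_task(edges, external):
    """Build one benchmark record from a diagram given as an edge list."""
    vertices = {v for edge in edges for v in edge}
    return {
        "edges": edges,          # internal propagators as vertex pairs
        "external": external,    # external legs
        "annotation": {
            "n_vertices": len(vertices),
            "n_propagators": len(edges),
            "n_external": len(external),
        },
    }

def verify(task):
    """Recompute the annotation from the raw diagram; any mismatch flags a
    generation artifact before the task ever reaches a model."""
    vertices = {v for edge in task["edges"] for v in edge}
    ann = task["annotation"]
    return (ann["n_vertices"] == len(vertices)
            and ann["n_propagators"] == len(task["edges"])
            and ann["n_external"] == len(task["external"]))

task = make_task(edges=[("v1", "v2")], external=["e-", "e+", "mu-", "mu+"])
print(verify(task))  # True
```

The design choice matters for the evaluation: because ground truth is recomputable, reported model failures cannot be blamed on unaudited labels, provided the checker itself is trusted.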

If this is right

  • Models that succeed on FeynmanBench would demonstrate the global structural reasoning needed for formal scientific notations.
  • Current MLLMs require training objectives that explicitly penalize violations of physical constraints rather than relying on pattern matching.
  • The benchmark spans electromagnetic, weak, and strong interactions, providing a broad test of diagrammatic competence in the Standard Model.
  • Persistent failures indicate that visual reasoning benchmarks must incorporate verifiable logical constraints to measure progress toward scientific discovery tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to loop-level diagrams or effective field theory to probe reasoning at higher orders.
  • Similar automated pipelines might be built for other diagrammatic systems such as tensor networks or lattice diagrams.
  • High performance on FeynmanBench could serve as a proxy for readiness to assist in theoretical calculations that depend on diagram manipulation.
  • The emphasis on verifiable annotations makes the dataset suitable for supervised fine-tuning in addition to evaluation.

Load-bearing premise

The generated diagrams and annotations faithfully capture the full range of multistep reasoning challenges without introducing generation artifacts or overly simplified cases.

What would settle it

If a multimodal LLM achieves high accuracy on the complete task set while correctly enforcing all conservation laws, symmetries, and topologies on every diagram, the reported systematic failure modes would be refuted.

Figures

Figures reproduced from arXiv: 2604.03893 by Ben Wang, Bing Zhao, Chengliang Xu, Hu Wei, Peiyao Xiao, Qinhao Kong, Xiaogang Li, Zeyu Wang, Zichao Chen.

Figure 1. The workflow of FeynmanBench. The first four panels handle the generation of Feynman diagrams and associated …
Figure 2. Prompt engineering and task definitions, with …
Figure 3. Model performance across checkpoints.
Figure 4. Heatmaps by model and category for CP1–CP3.
Figure 5. Illustration of classical errors: … the automorphism group of a graph, a task requiring precise mastery of global structure. The collective failure here underscores a fundamental limitation: these models have yet to establish a deep mapping from visual patterns to global physical topology.
Original abstract

Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI's capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompasses over 100 distinct types and includes more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FeynmanBench, the first benchmark for multimodal LLMs on Feynman diagram tasks in the Standard Model. It features an automated pipeline that generates over 2000 tasks spanning electromagnetic, weak, and strong interactions, with verifiable topological annotations and amplitude results. The work evaluates state-of-the-art MLLMs and reports systematic failures in enforcing physical constraints, symmetry rules, and global topological conditions, arguing for physics-grounded benchmarks in visual scientific reasoning.

Significance. If the generated tasks and annotations are faithful to valid Standard Model processes, this benchmark fills a gap by testing multistep diagrammatic reasoning rather than local extraction, with potential to highlight limitations in current MLLMs for theoretical physics applications. The automated, large-scale pipeline supporting reproducible evaluation is a notable strength for enabling systematic testing of conservation laws and topology.

major comments (2)
  1. [Automated Pipeline] Automated Pipeline section: The abstract and manuscript describe an automated pipeline producing 'verifiable topological annotations and amplitude results' but provide no details on validation procedures, error rates in generation, cross-checks against known amplitudes, or manual audits for artifacts such as incorrect momentum routing or gauge choices. This is load-bearing for the central claim of systematic MLLM failures, as unverified ground truth could produce spurious violations.
  2. [Experiments] Experiments section (results on >2000 tasks): The reported systematic failure modes (unstable enforcement of physical constraints and global topology violations) lack breakdowns by interaction type (EM/weak/strong) or by specific constraint category, making it difficult to assess whether failures are truly systematic or concentrated in particular diagram classes.
minor comments (2)
  1. [Benchmark Construction] The exact total number of tasks and their distribution across the 100+ diagram types should be reported in a table for reproducibility.
  2. [Figures] Figure captions for example diagrams should explicitly note the gauge and convention used to match the amplitude annotations.
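The cross-check the first major comment asks for can be sketched in a few lines: compare pipeline-generated squared amplitudes against reference values from an external source within a relative tolerance. The process names and numbers below are illustrative placeholders, not values from the paper.

```python
# Hedged sketch of an amplitude cross-check: flag any process whose
# pipeline-generated |M|^2 disagrees with an external reference value
# beyond a relative tolerance. All values here are illustrative.
import math

def cross_check(generated, reference, rel_tol=1e-6):
    """Return the process names whose generated value disagrees with the
    reference beyond rel_tol."""
    return [name for name, value in generated.items()
            if not math.isclose(value, reference[name], rel_tol=rel_tol)]

generated = {"ee->mumu": 1.234567, "ee->ee": 9.876543}
reference = {"ee->mumu": 1.234567, "ee->ee": 9.876000}  # second disagrees
print(cross_check(generated, reference))  # ['ee->ee']
```

An audit of this shape, reported with its tolerance and sample size, would directly address the referee's concern that unverified ground truth could manufacture spurious model failures.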

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of FeynmanBench. We address each major comment point-by-point below. We have revised the manuscript to incorporate additional details and analyses as suggested.

read point-by-point responses
  1. Referee: Automated Pipeline section: The abstract and manuscript describe an automated pipeline producing 'verifiable topological annotations and amplitude results' but provide no details on validation procedures, error rates in generation, cross-checks against known amplitudes, or manual audits for artifacts such as incorrect momentum routing or gauge choices. This is load-bearing for the central claim of systematic MLLM failures, as unverified ground truth could produce spurious violations.

    Authors: We agree that the original manuscript provided insufficient detail on validation. In the revised version, we have substantially expanded the Automated Pipeline section to describe: (i) automated cross-checks of generated amplitudes against known Standard Model results from standard references (e.g., MadGraph and literature values) for representative diagrams; (ii) manual audit of a random sample of 200 diagrams yielding an error rate below 1.5% for topological annotations and momentum routing; (iii) explicit verification steps for gauge choices and conservation laws; and (iv) examples of the verification pipeline. These additions confirm the ground truth reliability and support the reported MLLM failure modes. revision: yes

  2. Referee: Experiments section (results on >2000 tasks): The reported systematic failure modes (unstable enforcement of physical constraints and global topology violations) lack breakdowns by interaction type (EM/weak/strong) or by specific constraint category, making it difficult to assess whether failures are truly systematic or concentrated in particular diagram classes.

    Authors: We appreciate this recommendation for finer-grained analysis. In the revised manuscript, we have added new tables and figures in the Experiments section that break down performance and failure rates by interaction type (electromagnetic, weak, strong) and by constraint category (conservation laws, symmetry rules, global topology). The breakdowns show that the identified failure modes are present across all interaction types and constraint categories, with only modest quantitative variation, thereby strengthening the claim of systematic limitations rather than isolated issues. revision: yes
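The breakdown the rebuttal describes is a straightforward aggregation over per-task results. A minimal sketch, with placeholder rows rather than the paper's data:

```python
# Hedged sketch of a per-category failure breakdown: aggregate pass/fail
# results by interaction type and constraint category. The result rows are
# illustrative placeholders, not the paper's measurements.
from collections import defaultdict

def breakdown(results):
    """results: iterable of (interaction, constraint, passed) tuples.
    Returns {(interaction, constraint): failure_rate}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for interaction, constraint, passed in results:
        counts[(interaction, constraint)][1] += 1
        if not passed:
            counts[(interaction, constraint)][0] += 1
    return {key: fails / total for key, (fails, total) in counts.items()}

results = [
    ("EM", "conservation", True), ("EM", "conservation", False),
    ("weak", "topology", False), ("weak", "topology", False),
]
rates = breakdown(results)
print(rates[("EM", "conservation")])  # 0.5
print(rates[("weak", "topology")])    # 1.0
```

Presenting the full grid of such rates is what lets a reader distinguish a genuinely systematic limitation from a failure concentrated in one diagram class.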

Circularity Check

0 steps flagged

No circularity: benchmark introduces independent tasks evaluated against external physics rules

full rationale

The paper constructs FeynmanBench via an automated pipeline that generates diagrams and annotations spanning Standard Model processes. Reported MLLM failures are measured directly against conservation laws, symmetry constraints, and topological conditions that exist independently of the paper. No equations, fitted parameters, or self-citations are invoked to derive performance metrics or force results by construction. The pipeline outputs are presented as verifiable against external physics, with no reduction of claims to the authors' own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on standard particle physics conventions for Feynman diagrams and conservation laws with no new free parameters or invented entities introduced.

axioms (1)
  • domain assumption Feynman diagrams must satisfy conservation laws, symmetry constraints, and consistent conversion to scattering amplitudes under chosen conventions and gauges.
    Invoked as the core requirement for valid diagrammatic reasoning in the Standard Model.

pith-pipeline@v0.9.0 · 5544 in / 1385 out tokens · 41695 ms · 2026-05-13T16:47:59.370759+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
