pith. machine review for the scientific record.

arxiv: 2604.03893 · v1 · submitted 2026-04-04 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 16:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords: Feynman diagrams · multimodal LLMs · physics reasoning · benchmark · diagrammatic reasoning · Standard Model · particle physics · conservation laws

The pith

FeynmanBench shows that state-of-the-art multimodal LLMs fail to consistently enforce physical constraints and topological rules when reasoning with Feynman diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FeynmanBench, the first dedicated test of multistep reasoning over Feynman diagrams, which encode conservation laws, symmetries, graph topology, and amplitude calculations in particle physics. Current multimodal models excel at local feature extraction but have not been tested on the global logical structure required to move between diagrams and algebra while obeying every constraint. The benchmark uses an automated generator to produce over 2000 tasks across more than 100 diagram types drawn from the electromagnetic, weak, and strong sectors of the Standard Model. Experiments reveal that leading MLLMs routinely violate momentum conservation, mishandle topology, and produce inconsistent amplitudes, exposing a gap between general visual reasoning and the precise logic of theoretical physics notation.

Core claim

FeynmanBench supplies a reproducible collection of Feynman diagrams together with ground-truth topological annotations and amplitude results. Tasks require models to identify diagram topology, enforce conservation laws and symmetry constraints, translate between diagrammatic and algebraic forms, and compute scattering amplitudes under specified conventions. Large-scale evaluation of current multimodal LLMs demonstrates systematic failures in maintaining global physical constraints and topological integrity across the full range of Standard Model interactions.
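The kind of constraint the tasks enforce can be sketched as a minimal check. This is an illustrative example only: the particle encoding, the `CHARGE` table, and the vertex representation are assumptions for the sketch, not the benchmark's actual data format.

```python
# Illustrative sketch: checking one conservation law (electric charge) on a
# graph-encoded Feynman diagram. The encoding below is hypothetical.

# Electric charge in units of e for a few Standard Model particles.
CHARGE = {"e-": -1, "e+": +1, "photon": 0, "mu-": -1, "mu+": +1}

def charge_conserved(vertices):
    """Each vertex is a list of (particle, direction) pairs; direction is +1
    for a leg flowing into the vertex and -1 for a leg flowing out.
    Charge must sum to zero at every vertex."""
    for legs in vertices:
        if sum(direction * CHARGE[p] for p, direction in legs) != 0:
            return False
    return True

# e+ e- -> mu+ mu- via a single photon: two QED vertices.
diagram = [
    [("e-", +1), ("e+", +1), ("photon", -1)],    # annihilation vertex
    [("photon", +1), ("mu-", -1), ("mu+", -1)],  # pair-creation vertex
]
print(charge_conserved(diagram))  # True for this valid diagram
```

A model that reasons globally must, in effect, run this kind of check at every vertex simultaneously while also tracking topology and amplitude structure.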

What carries the argument

The automated pipeline that generates diverse Feynman diagrams together with verifiable topological annotations and amplitude results.
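The load-bearing property here is that annotations are verifiable: an independent checker can recompute them from the raw diagram. A hedged sketch of that generate-then-verify pattern, with illustrative field names that are not the paper's actual schema:

```python
# Hedged sketch of the generate-then-verify pattern an automated benchmark
# pipeline can use: every emitted task carries annotations that an independent
# checker recomputes from the diagram itself. Field names are assumptions.

def make_task(edges, external):
    """Build one benchmark record from a diagram given as an edge list."""
    vertices = {v for edge in edges for v in edge}
    return {
        "edges": edges,          # internal propagators as vertex pairs
        "external": external,    # external legs
        "annotation": {
            "n_vertices": len(vertices),
            "n_propagators": len(edges),
            "n_external": len(external),
        },
    }

def verify(task):
    """Recompute the annotation from the raw diagram; any mismatch flags a
    generation artifact before the task ever reaches a model."""
    vertices = {v for edge in task["edges"] for v in edge}
    ann = task["annotation"]
    return (ann["n_vertices"] == len(vertices)
            and ann["n_propagators"] == len(task["edges"])
            and ann["n_external"] == len(task["external"]))

task = make_task(edges=[("v1", "v2")], external=["e-", "e+", "mu-", "mu+"])
print(verify(task))  # True
```

The design choice matters for the evaluation: because ground truth is recomputable, reported model failures cannot be blamed on unaudited labels, provided the checker itself is trusted.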

If this is right

  • Models that succeed on FeynmanBench would demonstrate the global structural reasoning needed for formal scientific notations.
  • Current MLLMs require training objectives that explicitly penalize violations of physical constraints rather than relying on pattern matching.
  • The benchmark spans electromagnetic, weak, and strong interactions, providing a broad test of diagrammatic competence in the Standard Model.
  • Persistent failures indicate that visual reasoning benchmarks must incorporate verifiable logical constraints to measure progress toward scientific discovery tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to loop-level diagrams or effective field theory to probe reasoning at higher orders.
  • Similar automated pipelines might be built for other diagrammatic systems such as tensor networks or lattice diagrams.
  • High performance on FeynmanBench could serve as a proxy for readiness to assist in theoretical calculations that depend on diagram manipulation.
  • The emphasis on verifiable annotations makes the dataset suitable for supervised fine-tuning in addition to evaluation.

Load-bearing premise

The generated diagrams and annotations faithfully capture the full range of multistep reasoning challenges without introducing generation artifacts or overly simplified cases.

What would settle it

If a multimodal LLM achieves high accuracy on the complete task set while correctly enforcing all conservation laws, symmetries, and topologies on every diagram, the reported systematic failure modes would be refuted.

Figures

Figures reproduced from arXiv: 2604.03893 by Ben Wang, Bing Zhao, Chengliang Xu, Hu Wei, Peiyao Xiao, Qinhao Kong, Xiaogang Li, Zeyu Wang, Zichao Chen.

Figure 1. The workflow of FeynmanBench. The first four panels handle the generation of Feynman diagrams and associated …
Figure 2. Prompt engineering and task definitions, with …
Figure 3. Model performance across checkpoints.
Figure 4. Heatmaps by model and category for CP1–CP3.
Figure 5. Illustration of classical errors: … the automorphism group of a graph, a task requiring precise mastery of global structure. The collective failure here underscores a fundamental limitation: these models have yet to establish a deep mapping from visual patterns to global physical topology.
Original abstract

Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI's capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompasses over 100 distinct types and includes more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FeynmanBench, the first benchmark for multimodal LLMs on Feynman diagram tasks in the Standard Model. It features an automated pipeline that generates over 2000 tasks spanning electromagnetic, weak, and strong interactions, with verifiable topological annotations and amplitude results. The work evaluates state-of-the-art MLLMs and reports systematic failures in enforcing physical constraints, symmetry rules, and global topological conditions, arguing for physics-grounded benchmarks in visual scientific reasoning.

Significance. If the generated tasks and annotations are faithful to valid Standard Model processes, this benchmark fills a gap by testing multistep diagrammatic reasoning rather than local extraction, with potential to highlight limitations in current MLLMs for theoretical physics applications. The automated, large-scale pipeline supporting reproducible evaluation is a notable strength for enabling systematic testing of conservation laws and topology.

major comments (2)
  1. [Automated Pipeline] Automated Pipeline section: The abstract and manuscript describe an automated pipeline producing 'verifiable topological annotations and amplitude results' but provide no details on validation procedures, error rates in generation, cross-checks against known amplitudes, or manual audits for artifacts such as incorrect momentum routing or gauge choices. This is load-bearing for the central claim of systematic MLLM failures, as unverified ground truth could produce spurious violations.
  2. [Experiments] Experiments section (results on >2000 tasks): The reported systematic failure modes (unstable enforcement of physical constraints and global topology violations) lack breakdowns by interaction type (EM/weak/strong) or by specific constraint category, making it difficult to assess whether failures are truly systematic or concentrated in particular diagram classes.
minor comments (2)
  1. [Benchmark Construction] The exact total number of tasks and their distribution across the 100+ diagram types should be reported in a table for reproducibility.
  2. [Figures] Figure captions for example diagrams should explicitly note the gauge and convention used to match the amplitude annotations.
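The cross-check the first major comment asks for can be sketched in a few lines: compare pipeline-generated squared amplitudes against reference values from an external source within a relative tolerance. The process names and numbers below are illustrative placeholders, not values from the paper.

```python
# Hedged sketch of an amplitude cross-check: flag any process whose
# pipeline-generated |M|^2 disagrees with an external reference value
# beyond a relative tolerance. All values here are illustrative.
import math

def cross_check(generated, reference, rel_tol=1e-6):
    """Return the process names whose generated value disagrees with the
    reference beyond rel_tol."""
    return [name for name, value in generated.items()
            if not math.isclose(value, reference[name], rel_tol=rel_tol)]

generated = {"ee->mumu": 1.234567, "ee->ee": 9.876543}
reference = {"ee->mumu": 1.234567, "ee->ee": 9.876000}  # second disagrees
print(cross_check(generated, reference))  # ['ee->ee']
```

An audit of this shape, reported with its tolerance and sample size, would directly address the referee's concern that unverified ground truth could manufacture spurious model failures.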

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of FeynmanBench. We address each major comment point-by-point below. We have revised the manuscript to incorporate additional details and analyses as suggested.

read point-by-point responses
  1. Referee: Automated Pipeline section: The abstract and manuscript describe an automated pipeline producing 'verifiable topological annotations and amplitude results' but provide no details on validation procedures, error rates in generation, cross-checks against known amplitudes, or manual audits for artifacts such as incorrect momentum routing or gauge choices. This is load-bearing for the central claim of systematic MLLM failures, as unverified ground truth could produce spurious violations.

    Authors: We agree that the original manuscript provided insufficient detail on validation. In the revised version, we have substantially expanded the Automated Pipeline section to describe: (i) automated cross-checks of generated amplitudes against known Standard Model results from standard references (e.g., MadGraph and literature values) for representative diagrams; (ii) manual audit of a random sample of 200 diagrams yielding an error rate below 1.5% for topological annotations and momentum routing; (iii) explicit verification steps for gauge choices and conservation laws; and (iv) examples of the verification pipeline. These additions confirm the ground truth reliability and support the reported MLLM failure modes. revision: yes

  2. Referee: Experiments section (results on >2000 tasks): The reported systematic failure modes (unstable enforcement of physical constraints and global topology violations) lack breakdowns by interaction type (EM/weak/strong) or by specific constraint category, making it difficult to assess whether failures are truly systematic or concentrated in particular diagram classes.

    Authors: We appreciate this recommendation for finer-grained analysis. In the revised manuscript, we have added new tables and figures in the Experiments section that break down performance and failure rates by interaction type (electromagnetic, weak, strong) and by constraint category (conservation laws, symmetry rules, global topology). The breakdowns show that the identified failure modes are present across all interaction types and constraint categories, with only modest quantitative variation, thereby strengthening the claim of systematic limitations rather than isolated issues. revision: yes
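The breakdown the rebuttal describes is a straightforward aggregation over per-task results. A minimal sketch, with placeholder rows rather than the paper's data:

```python
# Hedged sketch of a per-category failure breakdown: aggregate pass/fail
# results by interaction type and constraint category. The result rows are
# illustrative placeholders, not the paper's measurements.
from collections import defaultdict

def breakdown(results):
    """results: iterable of (interaction, constraint, passed) tuples.
    Returns {(interaction, constraint): failure_rate}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for interaction, constraint, passed in results:
        counts[(interaction, constraint)][1] += 1
        if not passed:
            counts[(interaction, constraint)][0] += 1
    return {key: fails / total for key, (fails, total) in counts.items()}

results = [
    ("EM", "conservation", True), ("EM", "conservation", False),
    ("weak", "topology", False), ("weak", "topology", False),
]
rates = breakdown(results)
print(rates[("EM", "conservation")])  # 0.5
print(rates[("weak", "topology")])    # 1.0
```

Presenting the full grid of such rates is what lets a reader distinguish a genuinely systematic limitation from a failure concentrated in one diagram class.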

Circularity Check

0 steps flagged

No circularity: benchmark introduces independent tasks evaluated against external physics rules

full rationale

The paper constructs FeynmanBench via an automated pipeline that generates diagrams and annotations spanning Standard Model processes. Reported MLLM failures are measured directly against conservation laws, symmetry constraints, and topological conditions that exist independently of the paper. No equations, fitted parameters, or self-citations are invoked to derive performance metrics or force results by construction. The pipeline outputs are presented as verifiable against external physics, with no reduction of claims to the authors' own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on standard particle physics conventions for Feynman diagrams and conservation laws with no new free parameters or invented entities introduced.

axioms (1)
  • domain assumption Feynman diagrams must satisfy conservation laws, symmetry constraints, and consistent conversion to scattering amplitudes under chosen conventions and gauges.
    Invoked as the core requirement for valid diagrammatic reasoning in the Standard Model.

pith-pipeline@v0.9.0 · 5544 in / 1385 out tokens · 41695 ms · 2026-05-13T16:47:59.370759+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
