FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
KDD, Aug 06 2026, Jeju, Korea
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 16:47 UTC · model grok-4.3
The pith
FeynmanBench shows that state-of-the-art multimodal LLMs fail to consistently enforce physical constraints and topological rules when reasoning with Feynman diagrams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FeynmanBench supplies a reproducible collection of Feynman diagrams together with ground-truth topological annotations and amplitude results. Tasks require models to identify diagram topology, enforce conservation laws and symmetry constraints, translate between diagrammatic and algebraic forms, and compute scattering amplitudes under specified conventions. Large-scale evaluation of current multimodal LLMs demonstrates systematic failures in maintaining global physical constraints and topological integrity across the full range of Standard Model interactions.
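The conservation-law checks these tasks require can be illustrated with a minimal sketch. The particle table, charge units, and function name below are hypothetical simplifications, not the benchmark's actual annotation schema.

```python
# Minimal sketch: verifying charge conservation at a single vertex of a
# Feynman diagram. The Particle schema and the toy particle table are
# invented for illustration; a real annotation would carry far more state
# (momenta, spin, color, gauge conventions).
from dataclasses import dataclass

@dataclass(frozen=True)
class Particle:
    name: str
    charge: int  # electric charge in units of e/3, so quark charges stay integral

PARTICLES = {
    "e-": Particle("e-", -3),
    "e+": Particle("e+", +3),
    "photon": Particle("photon", 0),
    "u": Particle("u", +2),
    "ubar": Particle("ubar", -2),
}

def vertex_conserves_charge(incoming, outgoing):
    """Charge flowing into a vertex must equal charge flowing out."""
    total_in = sum(PARTICLES[p].charge for p in incoming)
    total_out = sum(PARTICLES[p].charge for p in outgoing)
    return total_in == total_out

# QED vertex e- -> e- + photon: allowed
assert vertex_conserves_charge(["e-"], ["e-", "photon"])
# e- -> u + photon would violate charge conservation
assert not vertex_conserves_charge(["e-"], ["u", "photon"])
```

A full verifier would apply a check like this at every vertex and combine it with momentum routing and topology tests, which is the global consistency the benchmark reports MLLMs failing to maintain.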
What carries the argument
The automated pipeline that generates diverse Feynman diagrams together with verifiable topological annotations and amplitude results.
If this is right
- Models that succeed on FeynmanBench would demonstrate the global structural reasoning needed for formal scientific notations.
- Current MLLMs would need training objectives that explicitly penalize violations of physical constraints, rather than relying on pattern matching.
- The benchmark spans electromagnetic, weak, and strong interactions, providing a broad test of diagrammatic competence in the Standard Model.
- Persistent failures indicate that visual reasoning benchmarks must incorporate verifiable logical constraints to measure progress toward scientific discovery tasks.
Where Pith is reading between the lines
- The benchmark could be extended to loop-level diagrams or effective field theory to probe reasoning at higher orders.
- Similar automated pipelines might be built for other diagrammatic systems such as tensor networks or lattice diagrams.
- High performance on FeynmanBench could serve as a proxy for readiness to assist in theoretical calculations that depend on diagram manipulation.
- The emphasis on verifiable annotations makes the dataset suitable for supervised fine-tuning in addition to evaluation.
Load-bearing premise
The generated diagrams and annotations faithfully capture the full range of multistep reasoning challenges without introducing generation artifacts or overly simplified cases.
What would settle it
If a multimodal LLM achieves high accuracy on the complete task set while correctly enforcing all conservation laws, symmetries, and topologies on every diagram, the reported systematic failure modes would be refuted.
Original abstract
Breakthroughs in frontier theory often depend on the combination of concrete diagrammatic notations with rigorous logic. While multimodal large language models (MLLMs) show promise in general scientific tasks, current benchmarks often focus on local information extraction rather than the global structural logic inherent in formal scientific notations. In this work, we introduce FeynmanBench, the first benchmark centered on Feynman diagram tasks. It is designed to evaluate AI's capacity for multistep diagrammatic reasoning, which requires satisfying conservation laws and symmetry constraints, identifying graph topology, converting between diagrammatic and algebraic representations, and constructing scattering amplitudes under specific conventions and gauges. To support large-scale and reproducible evaluation, we developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results. Our database spans the electromagnetic, weak, and strong interactions of the Standard Model, encompasses over 100 distinct types and includes more than 2000 tasks. Experiments on state-of-the-art MLLMs reveal systematic failure modes, including unstable enforcement of physical constraints and violations of global topological conditions, highlighting the need for physics-grounded benchmarks for visual reasoning over scientific notation. FeynmanBench provides a logically rigorous test of whether AI can effectively engage in scientific discovery, particularly within theoretical physics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FeynmanBench, the first benchmark for multimodal LLMs on Feynman diagram tasks in the Standard Model. It features an automated pipeline that generates over 2000 tasks spanning electromagnetic, weak, and strong interactions, with verifiable topological annotations and amplitude results. The work evaluates state-of-the-art MLLMs and reports systematic failures in enforcing physical constraints, symmetry rules, and global topological conditions, arguing for physics-grounded benchmarks in visual scientific reasoning.
Significance. If the generated tasks and annotations are faithful to valid Standard Model processes, this benchmark fills a gap by testing multistep diagrammatic reasoning rather than local extraction, with potential to highlight limitations in current MLLMs for theoretical physics applications. The automated, large-scale pipeline supporting reproducible evaluation is a notable strength for enabling systematic testing of conservation laws and topology.
major comments (2)
- [Automated Pipeline] Automated Pipeline section: The abstract and manuscript describe an automated pipeline producing 'verifiable topological annotations and amplitude results' but provide no details on validation procedures, error rates in generation, cross-checks against known amplitudes, or manual audits for artifacts such as incorrect momentum routing or gauge choices. This is load-bearing for the central claim of systematic MLLM failures, as unverified ground truth could produce spurious violations.
- [Experiments] Experiments section (results on >2000 tasks): The reported systematic failure modes (unstable enforcement of physical constraints and global topology violations) lack breakdowns by interaction type (EM/weak/strong) or by specific constraint category, making it difficult to assess whether failures are truly systematic or concentrated in particular diagram classes.
minor comments (2)
- [Benchmark Construction] The exact total number of tasks and their distribution across the 100+ diagram types should be reported in a table for reproducibility.
- [Figures] Figure captions for example diagrams should explicitly note the gauge and convention used to match the amplitude annotations.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the significance of FeynmanBench. We address each major comment point-by-point below. We have revised the manuscript to incorporate additional details and analyses as suggested.
read point-by-point responses
-
Referee: Automated Pipeline section: The abstract and manuscript describe an automated pipeline producing 'verifiable topological annotations and amplitude results' but provide no details on validation procedures, error rates in generation, cross-checks against known amplitudes, or manual audits for artifacts such as incorrect momentum routing or gauge choices. This is load-bearing for the central claim of systematic MLLM failures, as unverified ground truth could produce spurious violations.
Authors: We agree that the original manuscript provided insufficient detail on validation. In the revised version, we have substantially expanded the Automated Pipeline section to describe: (i) automated cross-checks of generated amplitudes against known Standard Model results from standard references (e.g., MadGraph and literature values) for representative diagrams; (ii) a manual audit of a random sample of 200 diagrams, yielding an error rate below 1.5% for topological annotations and momentum routing; (iii) explicit verification steps for gauge choices and conservation laws; and (iv) worked examples of the verification pipeline. These additions establish the reliability of the ground truth and support the reported MLLM failure modes. revision: yes
-
Referee: Experiments section (results on >2000 tasks): The reported systematic failure modes (unstable enforcement of physical constraints and global topology violations) lack breakdowns by interaction type (EM/weak/strong) or by specific constraint category, making it difficult to assess whether failures are truly systematic or concentrated in particular diagram classes.
Authors: We appreciate this recommendation for finer-grained analysis. In the revised manuscript, we have added new tables and figures in the Experiments section that break down performance and failure rates by interaction type (electromagnetic, weak, strong) and by constraint category (conservation laws, symmetry rules, global topology). The breakdowns show that the identified failure modes are present across all interaction types and constraint categories, with only modest quantitative variation, thereby strengthening the claim of systematic limitations rather than isolated issues. revision: yes
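The per-category breakdown the rebuttal describes amounts to a simple aggregation over task results. The records below are invented for illustration; real benchmark outputs would replace them.

```python
# Sketch: failure rates grouped by interaction type or constraint category,
# the kind of breakdown table the revised Experiments section would report.
# The `results` records are hypothetical examples, not benchmark data.
from collections import defaultdict

results = [
    {"interaction": "EM",     "constraint": "topology",     "passed": True},
    {"interaction": "EM",     "constraint": "conservation", "passed": False},
    {"interaction": "weak",   "constraint": "topology",     "passed": False},
    {"interaction": "strong", "constraint": "conservation", "passed": True},
]

def failure_rates(records, key):
    """Fraction of failed tasks per value of `key` (e.g. interaction type)."""
    totals, fails = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        fails[r[key]] += not r["passed"]
    return {k: fails[k] / totals[k] for k in totals}

print(failure_rates(results, "interaction"))
# {'EM': 0.5, 'weak': 1.0, 'strong': 0.0}
```

Grouping the same records by `"constraint"` instead gives the per-constraint-category view; the systematicity claim rests on no single group dominating the failures.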
Circularity Check
No circularity: benchmark introduces independent tasks evaluated against external physics rules
full rationale
The paper constructs FeynmanBench via an automated pipeline that generates diagrams and annotations spanning Standard Model processes. Reported MLLM failures are measured directly against conservation laws, symmetry constraints, and topological conditions that exist independently of the paper. No equations, fitted parameters, or self-citations are invoked to derive performance metrics or force results by construction. The pipeline outputs are presented as verifiable against external physics, with no reduction of claims to the authors' own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Feynman diagrams must satisfy conservation laws, symmetry constraints, and consistent conversion to scattering amplitudes under chosen conventions and gauges.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We developed an automated pipeline producing diverse Feynman diagrams along with verifiable topological annotations and amplitude results... spanning the electromagnetic, weak, and strong interactions... over 2000 tasks."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "CP3 examines topological connectivity and graph isomorphism... CP4 verifies momentum routing and conservation... CP5 focuses on algebraic and index structures, Dirac matrix sequences, trace contractions..."
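The topology checks attributed to CP3 (connectivity and graph isomorphism) can be sketched for toy-sized diagrams with a brute-force relabeling test. The representation (edge lists over integer node labels) is illustrative, not the benchmark's format, and the exhaustive search is only viable for small graphs.

```python
# Brute-force isomorphism check for small undirected graphs, the kind of
# global topology test a CP3-style task requires. Adequate only for toy
# sizes; real pipelines would use a proper graph-isomorphism algorithm.
from itertools import permutations

def are_isomorphic(edges_a, edges_b, n):
    """Test whether two n-node edge sets are isomorphic by trying all relabelings."""
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    if len(set_a) != len(set_b):
        return False
    for perm in permutations(range(n)):
        relabeled = {frozenset((perm[u], perm[v])) for u, v in set_a}
        if relabeled == set_b:
            return True
    return False

# Two drawings of the same path topology (0-2-3-1 vs 0-3-2-1): isomorphic
assert are_isomorphic([(0, 2), (2, 3), (3, 1)], [(0, 3), (3, 2), (2, 1)], 4)
# A 4-node path is not isomorphic to a 4-node star
assert not are_isomorphic([(0, 1), (1, 2), (2, 3)], [(0, 1), (0, 2), (0, 3)], 4)
```

The point of the check is that it is global: no amount of locally plausible vertex reading can pass it unless the whole diagram's connectivity is right, which is exactly the failure mode the review highlights.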
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Avinash Anand, Janak Kapuriya, Apoorv Singh, Jay Saraf, Naman Lal, Astha Verma, Rushali Gupta, and Rajiv Shah. 2024. MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting. arXiv:2404.08704 [cs.CL] doi:10.48550/arXiv.2404.08704
-
[3]
Anthropic. 2025. Claude Opus 4.5 System Card. Anthropic System Card. https://www.anthropic.com/claude-opus-4-5-system-card Accessed: 2026-02-08
-
[4]
Anthropic. 2025. Claude Sonnet 4.5 System Card. Anthropic System Card. https://www.anthropic.com/claude-sonnet-4-5-system-card Accessed: 2026-02-08
-
[5]
Ian Banta, Tianji Cai, Nathaniel Craig, and Zhengkang Zhang. 2024. Structures of Neural Network Effective Theories. Phys. Rev. D 109 (2024), 105007. arXiv:2305.02334 [hep-th] doi:10.1103/PhysRevD.109.105007
-
[6]
Jacob Biamonte. 2019. Lectures on quantum tensor networks.
-
[7]
Jacob D. Biamonte, Stephen R. Clark, and Dieter Jaksch. 2010. Categorical Tensor Network States. arXiv:1012.0531 https://arxiv.org/abs/1012.0531
-
[8]
Francesco Calisto, Ryan Moodie, and Simone Zoia. 2024. Learning Feynman integrals from differential equations with neural networks. JHEP 07 (2024), 124. arXiv:2312.02067 [hep-ph] doi:10.1007/JHEP07(2024)124
-
[9]
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. 2023. TheoremQA: A Theorem-driven Question Answering dataset. arXiv:2305.12524 [cs.CL] doi:10.48550/arXiv.2305.12524 Accepted to EMNLP 2023 (per arXiv record)
-
[10]
Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. 2024. LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models. arXiv:2411.08027 [cs.LG] doi:10.48550/arXiv.2411.08027
-
[11]
Alibaba Cloud. 2026. Visual Understanding (Qwen-VL) - Model Studio Documentation. https://www.alibabacloud.com/help/en/model-studio/vision
-
[12]
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Zi’ang Song, Guang Shi, and Haoqi Fan. 2025. Emerging Properties in Unified Multimodal Pretraining. arXiv:2505.14683 [cs.CV] doi:10.48550/arXiv.2505.14683 Introduces the open-source unified model BAGEL
-
[13]
Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et al. 2022. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences 119, 32 (2022), e2123433119
-
[14]
Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al. 2025. Unified Autoregressive Visual Generation and Understanding with Continuous Tokens. arXiv:2503.13436 [cs.CV] doi:10.48550/arXiv.2503.13436
-
[15]
Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, and Arman Cohan. 2025. PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving. arXiv:2503.21821 [cs.AI] doi:10.48550/arXiv.2503.21821
-
[16]
Richard P. Feynman. 1949. Space-Time Approach to Quantum Electrodynamics. Physical Review 76, 6 (1949), 769–789. doi:10.1103/PhysRev.76.769
-
[17]
Google Cloud. 2025. Gemini 3 Flash | Generative AI on Vertex AI. Google Cloud Documentation. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash Accessed: 2026-02-08
- [18]
-
[19]
Thomas Hahn. 2001. Generating Feynman Diagrams and Amplitudes with FeynArts 3. Computer Physics Communications 140, 3 (2001), 418–431. arXiv:hep-ph/0012260 doi:10.1016/S0010-4655(01)00290-9
-
[20]
Koji Hashimoto, Yuji Hirono, Jun Maeda, and Jojiro Totsuka-Yoshinaka. 2024. Neural network representation of quantum systems. Machine Learning: Science and Technology 5, 4 (2024), 045039. arXiv:2403.11420 [hep-th] doi:10.1088/2632-2153/ad81ac
-
[21]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. In Proceedings of the 62nd Annual Meeting of the Association for...
-
[22]
Yuji Hirono, Akinori Tanaka, and Kenji Fukushima. 2024. Understanding Diffusion Models by Feynman’s Path Integral. arXiv. arXiv:2403.11262 [cs.LG] doi:10.48550/arXiv.2403.11262
- [23]
- [24]
-
[25]
David Leoni and Federico Franchini. 2024. Global sampling of Feynman’s diagrams through normalizing flow. Phys. Rev. Research 6 (2024), 033041. arXiv:2402.00736 [hep-th] doi:10.1103/PhysRevResearch.6.033041
-
[26]
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. Llava-onevision: Easy visual task transfer. arXiv:2408.03326 https://arxiv.org/abs/2408.03326
-
[27]
Jindong Li, Yali Fu, Jiahong Liu, Linxiao Cao, Wei Ji, Menglin Yang, Irwin King, and Ming-Hsuan Yang. 2025. Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey. arXiv:2507.22920 doi:10.48550/arXiv.2507.22920
-
[28]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2310.03744 doi:10.1109/CVPR52733.2024.02484
-
[29]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. arXiv:2304.08485 https://arxiv.org/abs/2304.08485
-
[30]
Yuanche Liu, Yingxuan Xu, and Yang Zhang. 2025. Uncovering Singularities in Feynman Integrals via Machine Learning. arXiv. arXiv:2510.10099 [hep-ph] doi:10.48550/arXiv.2510.10099
-
[31]
Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. 2023. A Survey of Deep Learning for Mathematical Reasoning. 14605–14631 pages
-
[32]
Harrison Mitchell, Alexander Norcliffe, and Pietro Liò. 2022. Learning Feynman Diagrams using Graph Neural Networks. arXiv. arXiv:2211.15348 [physics.comp-ph] doi:10.48550/arXiv.2211.15348 NeurIPS Machine Learning and the Physical Sciences (ML4PS), 2022
-
[33]
Patrick Olivier. 2001. Diagrammatic reasoning: An artificial intelligence perspective. Artificial Intelligence Review 15, 1-2 (2001), 63–78. doi:10.1023/A:1006669526043
-
[34]
OpenAI. 2026. GPT-5.1 Model | OpenAI API. OpenAI API Documentation. https://platform.openai.com/docs/models/gpt-5.1 Accessed: 2026-02-08
-
[35]
OpenAI. 2026. GPT-5.2 Model | OpenAI API. OpenAI API Documentation. https://platform.openai.com/docs/models/gpt-5.2 Accessed: 2026-02-08
- [36]
-
[37]
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, and Xinglong Wu. 2025. TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2412.03069 [cs.CV] doi:10.48550/arXiv.2412.03069
-
[38]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machi...
-
[39]
Arif Ahmed Sekh, Debi Prosad Dogra, Samarjit Kar, Partha Pratim Roy, and Dilip K Prasad. 2020. Can we automate diagrammatic reasoning? Pattern Recognition 106 (2020), 107412
-
[40]
Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, and Ngai Wong. 2025. PhyX: Does Your Model Have the “Wits” for Physical Reasoning? arXiv:2505.15929 [cs.AI] doi:10.48550/arXiv.2505.15929
-
[41]
Vladyslav Shtabovenko, Rolf Mertig, and Frederik Orellana. 2016. New developments in FeynCalc 9.0. Computer Physics Communications 207 (2016), 432–444. arXiv:1601.01167 doi:10.1016/j.cpc.2016.06.008
-
[42]
Paolo Silvi, Florian Tschirsich, Mathias Gerster, Jasper Jünemann, Daniel Kasper, Miroslav Macek, and Simone Montangero. 2019. The Tensor Networks Anthology: Simulation Techniques for Quantum Many-Body Systems. SciPost Physics Lecture Notes 8 (2019). doi:10.21468/SciPostPhysLectNotes.8
-
[43]
Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. 2024. Multimodal Latent Language Modeling with Next-Token Diffusion. arXiv:2412.08635 [cs.CV] doi:10.48550/arXiv.2412.08635
-
[44]
David Tong. 2007. Lectures on Quantum Field Theory. https://www.damtp.cam.ac.uk/user/tong/qft.html Lecture notes, University of Cambridge (Michaelmas 2006)
-
[45]
Matt von Hippel and Matthias Wilhelm. 2025. Refining Integration-by-Parts Reduction of Feynman Integrals with Machine Learning. JHEP 05 (2025), 185. arXiv:2502.05121 [hep-th] doi:10.1007/JHEP05(2025)185
-
[46]
Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Peng Xia, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Mingyu Ding, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, and Xinzhu Ma. 2025. PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models. arXiv:2506.17667 [cs.AI] doi:10.48550/arXiv.2506.17667
-
[47]
Weixing Wang, Zifeng Ding, Jindong Gu, Rui Cao, Christoph Meinel, Gerard de Melo, and Haojin Yang. 2025. Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2505.21547 [cs.CV] doi:10.48550/arXiv.2505.21547
-
[48]
Xinlong Wang, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Zhen Li, Yuqi Wang, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Chunlei Men, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Zhongyuan Wang, and Tiejun Huang. 2026. Multimodal learning with ne...
-
[49]
Xin Wang, Yuwei Zhou, Bin Huang, Hong Chen, and Wenwu Zhu. 2025. Multimodal Generative AI: Multi-modal LLMs, Diffusions and the Unification
-
[50]
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. 2025. Harmonizing Visual Representations for Unified Multimodal Understanding and Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). arXiv:2503.21979 [cs.CV] doi:10.48550/arXiv.2503.21979
-
[51]
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, and Yao Lu. 2025. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=02haSpO453
-
[52]
Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, and Jinguo Zhu. 2025. VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models. arXiv:2504.15279 https://arxiv.org/abs/2504.15279
-
[53]
Yan Yang, Haochen Tian, Yang Shi, Wulin Xie, Yi-Fan Zhang, Yuhao Dong, Yibo Hu, Liang Wang, Ran He, Caifeng Shan, et al. 2025. A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges. TechRxiv preprint. doi:10.36227/techrxiv.176289261.16802577/v1 Posted on 11 Nov 2025
-
[54]
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. 2024. MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models ...
-
[55]
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2024. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark f...
-
[56]
Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, and Zongqing Lu. 2025. Unified Multimodal Understanding via Byte-Pair Visual Encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). arXiv:2506.23639 [cs.CV] doi:10.48550/arXiv.2506.23639
-
[57]
Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. 2025. Unified multimodal understanding and generation models: Advances, challenges, and opportunities
-
[58]
Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B. Tenenbaum, and Chuang Gan. 2024. ContPhy: Continuum Physical Concept Learning and Reasoning from Videos. arXiv:2402.06119 [cs.CV] doi:10.48550/arXiv.2402.06119
-
[59]
Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81
-
[60]
Erle Zhu, Yadi Liu, Zhe Zhang, Xujun Li, Jin Zhou, Xinjie Yu, Minlie Huang, and Hongning Wang. 2025. MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science. arXiv:2501.10768 [cs.AI] doi:10.48550/arXiv.2501.10768