FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
Unibench: Visual reasoning requires rethinking vision- language beyond scaling
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
citing papers explorer
-
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
-
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.