ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science, 9 0 (1): 0 501--524, 2023
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2024 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.