ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Turning large language models into cognitive models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
UNVERDICTED 2representative citing papers
WSTypist is a new RL-based simulation model that reproduces human-like word suggestion strategies, individual differences, and adaptation to design changes in mobile text entry.
citing papers explorer
-
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
-
Simulating Word Suggestion Usage in Mobile Typing to Guide Intelligent Text Entry Design
WSTypist is a new RL-based simulation model that reproduces human-like word suggestion strategies, individual differences, and adaptation to design changes in mobile text entry.