ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
arXiv preprint arXiv:2402.17161
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 5
roles: background 1
polarities: background 1
citing papers explorer
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
  ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
- StreetDesignAI: Broadening Designer Perspectives Through Multi-Persona Evaluation of Cycling Infrastructure
  StreetDesignAI provides structured multi-persona feedback on cycling designs and a user study shows it broadens designers' grasp of diverse cyclist perspectives and improves design decision confidence.
- Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
  APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
  FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
- Earth Science Foundation Models: From Perception to Reasoning and Discovery
  The paper delivers a unified review and roadmap of Earth science foundation models, structured by capability depth from perception to agentic reasoning and by application breadth across atmosphere, hydrosphere, lithosphere, biosphere, anthroposphere, and cryosphere, while compiling over 200 datasets