Turning large language models into cognitive models
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
citing papers explorer
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
  ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
- Simulating Word Suggestion Usage in Mobile Typing to Guide Intelligent Text Entry Design
  WSTypist is a new RL-based simulation model that reproduces human-like word suggestion strategies, individual differences, and adaptation to design changes in mobile text entry.
- Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games
  Equation-to-Behavior Prompting lets large LLMs match cognitive models like Bayesian updating in persuasion games; RL training cuts small-model belief error by 26.5% and improves diverse training outcomes by 2.5-12%.
- Teaching Values to Machines: Simulating Human-Like Behavior in LLMs
  Value-prompted LLMs align with human value structures and value-behavior relationships, and incorporating human value distributions improves population-level simulations.