ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2024 2roles
background 1polarities
background 1representative citing papers
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
citing papers explorer
-
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.