PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
JTPRO co-optimizes prompts and tool descriptions via reflection to raise overall success rate by 5-20% over baselines on multi-tool benchmarks.
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
citing papers explorer
-
PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs
PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents
JTPRO co-optimizes prompts and tool descriptions via reflection to raise overall success rate by 5-20% over baselines on multi-tool benchmarks.
-
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
- To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling