PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.
Title resolution pending
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6representative citing papers
Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
RADAR is a redundancy-aware, query-adaptive framework that uses conditional discrete graph diffusion to generate efficient communication topologies for multi-agent LLM systems, outperforming baselines on six benchmarks with higher accuracy and lower token use.
LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
citing papers explorer
-
PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs
PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.
-
Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
RADAR is a redundancy-aware, query-adaptive framework that uses conditional discrete graph diffusion to generate efficient communication topologies for multi-agent LLM systems, outperforming baselines on six benchmarks with higher accuracy and lower token use.
-
Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents
LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.
-
Towards Self-Improving Error Diagnosis in Multi-Agent Systems
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.