ReAct: Synergizing Reasoning and Acting in Language Models
13 Pith papers cite this work.
Citing papers
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark of complex scientific literature discovery tasks that demand deep comprehension and open-ended search; top AI agents achieve under 10% success on it.
-
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
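A minimal sketch of how such a gate might be computed, under stated assumptions: each rollout is a sequence of actions, prefixes are truncated to a fixed length, Levenshtein distance is normalized by the longer prefix, and the gate fires when mean pairwise diversity falls below a threshold. The names `prefix_diversity`, `should_terminate`, `prefix_len`, and `tau`, the normalization, and the direction of the gate are all hypothetical and are not taken from the paper. The last function shows the standard group-relative advantage used in GRPO, which is zero for every member of a group whose rewards are identical; this is the "zero-advantage" case the summary refers to.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Hashable, List, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def prefix_diversity(rollouts: List[Sequence[Hashable]], prefix_len: int) -> float:
    """Hypothetical: mean pairwise edit distance between the first `prefix_len`
    steps of each rollout, normalized to [0, 1] by the longer prefix."""
    prefixes = [r[:prefix_len] for r in rollouts]
    dists = []
    for a, b in combinations(prefixes, 2):
        denom = max(len(a), len(b)) or 1  # avoid division by zero for empty prefixes
        dists.append(levenshtein(a, b) / denom)
    return mean(dists) if dists else 0.0


def should_terminate(rollouts: List[Sequence[Hashable]], prefix_len: int, tau: float) -> bool:
    """Hypothetical one-parameter gate (assumed direction): stop the group early
    when its prefixes are less diverse than `tau`."""
    return prefix_diversity(rollouts, prefix_len) < tau


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage (r_i - mean) / (std + eps), as commonly written for GRPO."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [["go", "take", "put"], ["go", "take", "open"], ["go", "take", "put"]]
print(prefix_diversity(group, prefix_len=2))             # 0.0: identical first two steps
print(should_terminate(group, prefix_len=2, tau=0.1))    # True
print(grpo_advantages([1.0, 1.0, 1.0]))                  # [0.0, 0.0, 0.0]
```

Here the single tunable parameter is `tau`; `prefix_len` is fixed in this sketch for simplicity.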
-
TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering
TS-Agent is an agentic framework that uses LLMs only for evidence-based reasoning while delegating extraction to raw time series tools, matching or exceeding baselines on four benchmarks, with the largest gains on reasoning tasks.
-
Unlocking Proactivity in Task-Oriented Dialogue
Introduces a user concern simulator and asymmetric policy optimization to enable proactive behavior in task-oriented dialogues by using latent concerns as a training signal.
-
Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration
NIAgent is a multi-agent system using code-centric execution and hierarchical verification to autonomously build and adapt neuroimaging analysis workflows, showing better predictive performance than standard pipelines on ADHD-200 and ADNI data.
-
MERIT: Modular Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
MERIT achieves 81.65% F1 on MMFakeBench for multimodal misinformation detection via a four-module framework, outperforming zero-shot baselines such as GPT-4V with MMD-Agent (74.0% F1); the gains are attributed to its architectural design.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, and attains a mean normalized score of 59.8 on a 15-game suite, through multi-turn RL and scalable data generation.
-
Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG
Introduces NPAS and AV Filter using LLM attention weights to defend RAG against poisoning, reporting up to 20% accuracy gains while adaptive attacks reach 35% success.
-
InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion
InfiGFusion introduces graph-on-logits distillation with an O(n log n) Gromov-Wasserstein approximation to fuse LLMs by modeling token co-activations, reporting gains over baselines on 11 benchmarks.
-
Responsible Agentic AI Requires Explicit Provenance
Explicit provenance across the full agentic AI lifecycle is the necessary condition for making responsibility computable and actionable.
-
SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications
SciFi is a safe, lightweight agentic AI framework that automates structured scientific tasks with minimal human intervention via isolated environments and layered self-assessing agents.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
-
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.