hub Mixed citations

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, Thomas Scialom · 2023

Mixed citation behavior. Most common role is background (57%).

16 Pith papers citing it

Background 57% of classified citations

browse 16 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 4 dataset 3

citation-polarity summary

background 4 use dataset 3

representative citing papers

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

Why Does Agentic Safety Fail to Generalize Across Tasks?

cs.LG · 2026-05-07 · conditional · novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.

Towards Long-horizon Agentic Multimodal Search

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.

Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

cs.MA · 2026-04-03 · unverdicted · novelty 6.0

LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

cs.CL · 2025-11-14 · unverdicted · novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

Beyond Scaling: Agents Are Heading to the Edge

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

cs.CV · 2026-05-13 · unverdicted · novelty 5.0

ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.

Seed1.8 Model Card: Towards Generalized Real-World Agency

cs.AI · 2026-03-21 · unverdicted · novelty 5.0

Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

q-bio.QM · 2025-11-20 · unverdicted · novelty 5.0

TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.

citing papers explorer

Showing 16 of 16 citing papers.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery cs.AI · 2026-04-28 · accept · none · ref 15
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows cs.AI · 2026-05-14 · unverdicted · none · ref 24 · 2 links
π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 34 · 2 links
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 22
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 31
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking cs.AI · 2026-05-21 · unverdicted · none · ref 41
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 74
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 32
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation cs.AI · 2026-05-06 · unverdicted · none · ref 40
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.
Towards Long-horizon Agentic Multimodal Search cs.CV · 2026-04-14 · unverdicted · none · ref 5
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems cs.MA · 2026-04-03 · unverdicted · none · ref 40
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling cs.CL · 2025-11-14 · unverdicted · none · ref 28
MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
Beyond Scaling: Agents Are Heading to the Edge cs.LG · 2026-05-18 · unverdicted · none · ref 33
Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence cs.CV · 2026-05-13 · unverdicted · none · ref 17
ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
Seed1.8 Model Card: Towards Generalized Real-World Agency cs.AI · 2026-03-21 · unverdicted · none · ref 44
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots q-bio.QM · 2025-11-20 · unverdicted · none · ref 37
TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.

Gaia: a benchmark for general ai assistants

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer