Title resolution pending

· 2026 · arXiv 2601.18137

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

cs.CL · 2026-04-17 · conditional · novelty 7.0

GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

TIP: Token Importance in On-Policy Distillation

cs.LG · 2026-04-15 · unverdicted · novelty 6.0 · 3 refs

A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

cs.LG · 2026-03-05 · conditional · novelty 6.0

CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.

Harnessing AtomisticSkills for Agentic Atomistic Research

physics.chem-ph · 2026-05-18 · unverdicted · novelty 5.0

AtomisticSkills is a new harness framework with 100+ human-curated skills that lets general AI agents perform atomistic research tasks including simulations, screening, and analysis, shown on electrolyte design, CO2 capture, drug screening, and catalyst tasks.

Interactive Evaluation Requires a Design Science

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.

citing papers explorer

Showing 9 of 9 citing papers.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 95 · 2 links
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models cs.IR · 2026-04-25 · unverdicted · none · ref 41
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows cs.CL · 2026-04-17 · conditional · none · ref 19
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement cs.AI · 2026-05-11 · unverdicted · none · ref 26
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
TIP: Token Importance in On-Policy Distillation cs.LG · 2026-04-15 · unverdicted · none · ref 17 · 3 links
A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V cs.AI · 2026-04-09 · unverdicted · none · ref 44
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation cs.LG · 2026-03-05 · conditional · none · ref 27
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
Harnessing AtomisticSkills for Agentic Atomistic Research physics.chem-ph · 2026-05-18 · unverdicted · none · ref 112
AtomisticSkills is a new harness framework with 100+ human-curated skills that lets general AI agents perform atomistic research tasks including simulations, screening, and analysis, shown on electrolyte design, CO2 capture, drug screening, and catalyst tasks.
Interactive Evaluation Requires a Design Science cs.AI · 2026-05-18 · unverdicted · none · ref 69
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer