TaskBench: Benchmarking large language models for task automation

arXiv:2311 · 2025 · arXiv 2311.18760

5 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.

The Scaling Laws of Skills in LLM Agent Systems

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

cs.CL · 2026-02-24 · unverdicted · novelty 6.0

ToolMATH converts MATH solutions into controlled tool environments with gold tools and graded distractors to diagnose LLM adaptability, robustness, and long-horizon tool connectivity.

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

cs.AI · 2026-05-05 · unverdicted · novelty 5.0

A framework automates multi-agent system creation via LLM planning and two-stage agent recommendation, claiming higher recall than prior methods.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

cs.IR · 2026-05-08 · 2 refs

citing papers explorer

Showing 5 of 5 citing papers.

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery
cs.LG · 2026-05-16 · unverdicted · none · ref 20
The Scaling Laws of Skills in LLM Agent Systems
cs.CL · 2026-05-15 · unverdicted · none · ref 8
ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints
cs.CL · 2026-02-24 · unverdicted · none · ref 4
From Intent to Execution: Composing Agentic Workflows with Agent Recommendation
cs.AI · 2026-05-05 · unverdicted · none · ref 9
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR · 2026-05-08 · unreviewed · ref 131 · 2 links
