ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang · 2025 · DOI 10.18653/v1/2025.findings-naacl.65

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open at publisher browse 2 citing papers

representative citing papers

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

cs.AI · 2026-01-13 · unverdicted · novelty 6.0

MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

cs.AI · 2026-05-19

citing papers explorer

Showing 2 of 2 citing papers.

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness cs.AI · 2026-01-13 · unverdicted · none · ref 25
MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design cs.AI · 2026-05-19 · unreviewed · ref 32

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

fields

years

verdicts

representative citing papers

citing papers explorer