T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng · 2024 · DOI 10.18653/v1/2024.acl-long.515

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

cs.CL · 2026-06-17 · unverdicted · novelty 5.0

Presents the PEC-Home dataset of elliptical smart-home commands and shows that LLMs achieve lower execution accuracy on elliptical inputs than on complete commands, even with access to the dialogue history.

citing papers explorer

Showing 3 of 3 citing papers.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · none · ref 161

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

cs.CL · 2026-06-04 · unverdicted · none · ref 35

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

cs.CL · 2026-06-17 · unverdicted · none · ref 59
