TASTE automates the generation of high-coverage, difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench, on which models that saturate τ²-Bench drop sharply and the number of unique tool combinations more than doubles.
Smith, and Yanai Elazar
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
Proposes CAC prompting to benchmark language models on syntactic and discourse properties of determiners against child acquisition data, finding large models approach but do not match human performance on both.
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks