Stealtheval: A probe-rewrite-evaluate workflow for reliable benchmarks
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (4)
years: 2026 (4)
roles: background (2)
polarities: background (2)
citing papers explorer
- SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
  SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
- Decomposing and Steering Functional Metacognition in Large Language Models
  LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.
- Evaluation Awareness in Language Models Has Limited Effect on Behaviour
  Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
- StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario
  StarDrinks is a new English and Korean test set supporting speech-to-slots SLU, transcription-to-slots NLU, and ASR evaluation in a drink ordering scenario.