MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
arXiv preprint arXiv:2412.16516
2 Pith papers cite this work, both from 2026.
Representative citing papers:
R2IF uses a composite reward (binary correctness, CoT effectiveness, and parameter-level SMV) under GRPO to align LLM reasoning with function-calling decisions, improving accuracy and reasoning quality on BFCL/ACEBench.