MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
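As a rough illustration of two of the lexical metrics named above, the sketch below computes MATTR (moving-average type-token ratio) and Yule's K from a token list using their standard textbook definitions. The function names, the window size of 50, and the whitespace tokenization are assumptions for illustration, not MirrorBench's actual implementation or settings. HD-D is omitted.

```python
from collections import Counter


def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows.

    The window size of 50 is an illustrative default, not a MirrorBench setting.
    """
    if not tokens:
        return 0.0
    # Text no longer than one window: fall back to the plain type-token ratio.
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)

    counts = Counter(tokens[:window])  # token counts inside the current window
    type_total = len(counts)           # running sum of distinct types per window
    n_windows = len(tokens) - window + 1

    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]        # keep len(counts) equal to the type count
        counts[tokens[i]] += 1
        type_total += len(counts)

    return type_total / (window * n_windows)


def yules_k(tokens):
    """Yule's K = 10^4 * (sum of squared type frequencies - N) / N^2.

    Higher values mean more repetition, i.e. lower lexical diversity.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)


if __name__ == "__main__":
    # Hypothetical sample utterance, tokenized naively on whitespace.
    sample = "i want to book a flight and i want to book a hotel".split()
    print(mattr(sample, window=5), yules_k(sample))
```

MATTR is less sensitive to text length than a plain type-token ratio because every window has the same size, which matters when comparing short user-proxy turns against longer human ones.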
arXiv preprint arXiv:2412.16516
2 Pith papers cite this work. Polarity classification is still indexing.