An empirical study of smoothing techniques for language modeling
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 2 unverdicted
Representative citing papers
An informed machine learning approach using LSTM networks and expert-driven visual clustering to model normal behavior and detect misuse in system logs.
citing papers explorer
- RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
The RealMath-Eval benchmark shows that LLM judges have an evaluation gap: they perform worse on diverse, real human math reasoning than on synthetic solutions because of greater error diversity and higher surprisal.
- System Misuse Detection via Informed Behavior Clustering and Modeling
An informed machine learning approach using LSTM networks and expert-driven visual clustering to model normal behavior and detect misuse in system logs.