TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: unverdicted (2)
citing papers explorer
- When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
  Introduces the BeliefTrack benchmark, which diagnoses three contextual belief management (CBM) failures in LLMs, and shows that RL with belief-state rewards cuts failure rates by 70.9%, while representation steering cuts them by 46.1%.
- Efficient Benchmarking Is Just Feature Selection and Multiple Regression
  Shows that kernel ridge regression combined with mRMR feature selection predicts full benchmark scores from question subsets better than existing efficient benchmarking techniques.