TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Yiran Zhang, Mo Wang, Xiaoyang Li, Kaixuan Ren, Chencheng Zhu, Usman Naseem · 2025 · DOI 10.18653/v1/2025.findings-emnlp.1084

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Introduces BeliefTrack benchmark diagnosing three CBM failures in LLMs and shows RL with belief-state rewards cuts failure rates by 70.9% while representation steering cuts them by 46.1%.

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

stat.ML · 2026-05-25 · unverdicted · novelty 4.0

Kernel ridge regression combined with mRMR feature selection improves prediction of full benchmark scores from question subsets over existing efficient benchmarking techniques.

citing papers explorer

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

cs.AI · 2026-05-28 · unverdicted · none · ref 50

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

stat.ML · 2026-05-25 · unverdicted · none · ref 11
