SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6verdicts
UNVERDICTED 6roles
background 1polarities
unclear 1representative citing papers
R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.
Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.
citing papers explorer
-
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
-
Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization
R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
Convex Optimization for Alignment and Preference Learning on a Single GPU
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
-
LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control
LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.
-
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care
Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.