MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, Wanli Ouyang · 2024 · arXiv 2402.14762

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset: 2 · background: 1

citation-polarity summary

use dataset: 2 · background: 1

representative citing papers

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

EditPropBench evaluates LLM editors on propagating factual edits to dependent claims in synthetic scientific manuscripts, showing that even the strongest systems miss roughly 30% of required updates on hard cases.

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

cs.AI · 2025-09-22 · unverdicted · novelty 7.0

EngiBench shows that LLM accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

cs.CR · 2026-04-05 · unverdicted · novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

cs.DC · 2026-02-10 · unverdicted · novelty 6.0

SPEED-Bench is a new standardized benchmark for speculative decoding that supplies semantically diverse qualitative data and throughput-oriented splits across concurrency levels, integrated with vLLM and TensorRT-LLM.

TRINITY: An Evolved LLM Coordinator

cs.LG · 2025-12-04 · unverdicted · novelty 6.0

A compact 0.6B-parameter coordinator with a 10K-parameter head uses an evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.

Computational Hermeneutics: Evaluating generative AI as a cultural technology

cs.AI · 2026-03-31 · unverdicted · novelty 5.0

Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

cs.LG · 2025-07-29 · unverdicted · novelty 5.0

ReasonCache reuses similar KV cache states across reasoning steps in LRMs via collaborative filtering to boost serving throughput by up to 89.2% while preserving accuracy.

Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

cs.CL · 2025-09-08 · unverdicted · novelty 3.0

Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

citing papers explorer

Showing 11 of 11 citing papers.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows · cs.AI · 2026-05-18 · unverdicted · none · ref 1
EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts · cs.CL · 2026-05-03 · unverdicted · none · ref 1
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving · cs.AI · 2025-09-22 · unverdicted · none · ref 7
SOMA: Efficient Multi-turn LLM Serving via Small Language Model · cs.CL · 2026-05-11 · unverdicted · none · ref 6
Data-dependent Exploration for Online Reinforcement Learning from Human Feedback · cs.LG · 2026-05-06 · unverdicted · none · ref 60
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks · cs.CR · 2026-04-05 · unverdicted · none · ref 2
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding · cs.DC · 2026-02-10 · unverdicted · none · ref 4
TRINITY: An Evolved LLM Coordinator · cs.LG · 2025-12-04 · unverdicted · none · ref 2
Computational Hermeneutics: Evaluating generative AI as a cultural technology · cs.AI · 2026-03-31 · unverdicted · none · ref 5
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing · cs.LG · 2025-07-29 · unverdicted · none · ref 4
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector · cs.CL · 2025-09-08 · unverdicted · none · ref 62
