When benchmarks talk: Re-evaluating code llms with interactive feedback

Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen · 2025 · arXiv 2502.18413

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

cs.AI · 2025-09-27 · unverdicted · novelty 7.0

Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.

citing papers explorer

Showing 3 of 3 citing papers.

Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents? cs.CL · 2026-05-08 · unverdicted · none · ref 27
Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help? cs.AI · 2026-04-10 · unverdicted · none · ref 25
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia cs.AI · 2025-09-27 · unverdicted · none · ref 16
Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.

When benchmarks talk: Re-evaluating code llms with interactive feedback

fields

years

verdicts

representative citing papers

citing papers explorer