A hierarchical and evolvable benchmark for fine-grained code instruction following with multi-turn feedback

2025 · arXiv 2507.00699

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with the largest drops on logic-level and additive changes.

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Introduces the BeliefTrack benchmark, which diagnoses three contextual belief management (CBM) failures in LLMs, and shows that RL with belief-state rewards cuts failure rates by 70.9%, while representation steering cuts them by 46.1%.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

cs.SE · 2026-05-06 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

Prompt Governance? On Governing Technologies Governed by Natural Language

cs.CY · 2026-04-29 · unverdicted · novelty 4.0

Literature on system prompts for AI shows fragmented and contradictory claims that complicate policy efforts to use them as reliable governance mechanisms.
