CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.
A hierarchical and evolvable benchmark for fine-grained code instruction following with multi-turn feedback,
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 2representative citing papers
Introduces BeliefTrack benchmark diagnosing three CBM failures in LLMs and shows RL with belief-state rewards cuts failure rates by 70.9% while representation steering cuts them by 46.1%.
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Literature on system prompts for AI shows fragmented and contradictory claims that complicate policy efforts to use them as reliable governance mechanisms.
citing papers explorer
-
Prompt Governance? On Governing Technologies Governed by Natural Language
Literature on system prompts for AI shows fragmented and contradictory claims that complicate policy efforts to use them as reliable governance mechanisms.