Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

Garvin Kruthof

arXiv: 2604.28031 · v2 · submitted 2026-04-30 · cs.CL · cs.AI


Pith reviewed 2026-05-07 07:24 UTC · model grok-4.3


keywords: constraint adherence · multi-turn LLM · scientific ideation · DriftBench · knows-but-violates · iterative refinement · benchmark evaluation


The pith

Large language models accurately recall constraints yet frequently violate them in multi-turn scientific ideation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs maintain fidelity to original objectives when researchers iteratively refine scientific ideas over multiple turns. It introduces DriftBench to evaluate this across seven models, 38 briefs from 24 domains, and four interaction conditions, finding that iterative pressure increases structural complexity while decreasing adherence to constraints. Models show a clear dissociation: they can accurately restate the constraints in a probe yet still produce outputs that break them, with knows-but-violates rates ranging from 8% to 99%. Structured checkpointing lowers these rates somewhat but leaves the underlying mismatch intact, and the LLM judge used for scoring under-detects violations, so the reported adherence figures are conservative.

Core claim

In multi-turn LLM-assisted scientific ideation, models exhibit a dissociation between declarative recall and behavioral adherence to constraints: they accurately restate the original rules in a restatement probe while simultaneously violating those rules in their generated ideas. Iterative pressure reliably increases structural complexity and reduces adherence, with knows-but-violates rates varying from 8% to 99% across models. Structured checkpointing partially mitigates the dissociation without eliminating it, and complexity inflation persists even when recall remains intact.

What carries the argument

DriftBench benchmark together with the restatement probe that directly measures the knows-but-violates rate, defined as the share of cases where a model correctly recalls a constraint yet produces an output that violates it.
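The definition above can be made concrete with a minimal sketch. This is not the paper's scoring code: the `ScoredCase` type, the field names, and the toy data are hypothetical, and the choice of denominator (all scored cases, rather than only the cases with correct recall) is an assumption read off the wording "share of cases".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredCase:
    """One scored (model, brief, constraint) case. Hypothetical schema."""
    recalled: bool  # restatement probe judged the constraint correctly restated
    violated: bool  # generated idea judged to break that same constraint


def kbv_rate(cases):
    """Share of all scored cases that are both recalled and violated.

    Assumption: the denominator is every scored case; the paper may
    instead condition on correct recall.
    """
    if not cases:
        raise ValueError("no cases to score")
    kbv = sum(1 for c in cases if c.recalled and c.violated)
    return kbv / len(cases)


# Toy data, invented for illustration only.
cases = [
    ScoredCase(recalled=True, violated=True),
    ScoredCase(recalled=True, violated=False),
    ScoredCase(recalled=False, violated=True),
    ScoredCase(recalled=True, violated=True),
]
print(kbv_rate(cases))  # 0.5
```

Only the first and last toy cases count toward the numerator: the third is a violation without recall, which is ordinary forgetting rather than the dissociation the metric targets.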

If this is right

Iterative pressure increases structural complexity while reducing constraint adherence.
Structured checkpointing partially reduces knows-but-violates rates but leaves the recall-adherence dissociation intact.
LLM judges under-detect violations, rendering reported adherence scores conservative.
The dissociation and complexity inflation hold across temperature settings and both novelty-driven and rigor-driven pressure.
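The claim that the judge under-detects violations can be illustrated by comparing judge flags with human flags on the same items. The sketch below is an assumption-laden illustration, not the paper's validation procedure: it treats the human raters as the reference, and the function name and the toy labels are hypothetical.

```python
def judge_detection_rate(human_flags, judge_flags):
    """Fraction of human-flagged violations that the judge also flagged.

    Assumption: human labels are the reference. A value below 1.0 means
    the judge misses violations, so adherence scores built on the judge
    overstate adherence.
    """
    if len(human_flags) != len(judge_flags):
        raise ValueError("label lists must have the same length")
    # Keep the judge's verdict only for items the humans marked as violations.
    judged_on_violations = [j for h, j in zip(human_flags, judge_flags) if h]
    if not judged_on_violations:
        raise ValueError("no human-flagged violations to compare against")
    return sum(judged_on_violations) / len(judged_on_violations)


# Toy labels, invented for illustration only (True = violation flagged).
human = [True, True, True, False, True]
judge = [True, False, True, False, False]
print(judge_detection_rate(human, judge))  # 0.5
```

In this toy example the judge catches two of the four violations the humans flagged, so any adherence figure computed from the judge alone would be an upper bound on the human-rated figure.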

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams relying on LLMs for extended research ideation may need external monitoring tools to catch violations that the models themselves cannot self-correct.
The pattern suggests that prompt-based consistency mechanisms alone are insufficient for tasks where an initial specification must survive many refinement cycles.
Testing the same probes on non-scientific creative or engineering workflows could show whether the knows-but-violates behavior is domain-specific or general to multi-turn LLM use.

Load-bearing premise

The 38 research briefs and four interaction conditions sufficiently represent real scientific ideation workflows, and the LLM-as-judge scoring captures the full scope of constraint violations without systematic bias.

What would settle it

A follow-up experiment with practicing researchers conducting real multi-turn ideation sessions on new scientific briefs that finds knows-but-violates rates consistently near zero across models would falsify the reported dissociation between recall and adherence.

Figures

Figures reproduced from arXiv: 2604.28031 by Garvin Kruthof.

**Figure 1.** Constraint non-compliance rate (adherence ...

**Figure 2.** Mean constraint adherence by model and condition. Under pressure, Sonnet 4.6, Gemini ...

**Figure 3.** KBV rate under pressure vs. after checkpointing. Checkpointing reduces KBV for most ...

**Figure 4.** Sensitivity analyses. (a) Temperature: Gemini Flash KBV increases from 76% to 83% at ...

Original abstract

When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven models from five providers (including two open-weight), four interaction conditions, and 38 research briefs from 24 scientific domains, we find that iterative pressure reliably increases structural complexity and often reduces adherence to original constraints. A restatement probe reveals a dissociation between declarative recall and behavioral adherence, as models accurately restate constraints they simultaneously violate. The knows-but-violates (KBV) rate, measuring constraint non-compliance despite preserved recall, ranges from 8% to 99% across models. Structured checkpointing partially reduces KBV rates but does not close the dissociation, and complexity inflation persists. Human validation against blind raters confirms that the LLM judge under-detects constraint violations, making reported constraint adherence scores conservative. Sensitivity analyses confirm the findings are robust to temperature (0.7 vs. 1.0) and pressure type (novelty vs. rigor). We release all briefs, prompts, rubrics, transcripts, and scores as an open benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces DriftBench, a benchmark for constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored runs with seven models, 38 research briefs from 24 domains, and four interaction conditions, it reports that iterative pressure increases structural complexity while reducing adherence to original constraints. A restatement probe demonstrates a dissociation: models accurately recall constraints they violate, yielding knows-but-violates (KBV) rates from 8% to 99%. Structured checkpointing partially mitigates KBV but does not eliminate the dissociation or complexity inflation. Human validation confirms the LLM judge under-detects violations, making adherence scores conservative. All materials are released openly.

Significance. If the results hold, this provides important empirical evidence of a recall-adherence dissociation in LLMs during iterative scientific tasks, with implications for tool design in research assistance. Strengths include the large evaluation scale, sensitivity checks on temperature and pressure type, human validation of the LLM judge, and full open release of briefs, prompts, rubrics, transcripts, and scores, which support reproducibility and community use.

minor comments (3)

The abstract would benefit from briefly naming and describing the four interaction conditions to provide immediate context.
Include a summary table or figure of KBV rates and adherence scores broken down by model and condition to facilitate direct comparison of the reported 8-99% range.
Clarify the exact operationalization and measurement of 'structural complexity' (e.g., via clause count, dependency depth, or LLM rubric) in the methods, as it underpins one of the main findings.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate and positive summary of our work on DriftBench, including the scale of the evaluation (2,146 runs), the dissociation between recall and adherence, the open release of all materials, and the human validation confirming conservative scoring. We appreciate the recognition of robustness to temperature and pressure type, as well as the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark with direct measurements

full rationale

The paper introduces DriftBench as an empirical benchmark and reports direct measurements of constraint adherence and KBV rates across 2,146 runs on seven models. No derivation chain, equations, fitted parameters, or predictions exist that reduce the central findings to inputs by construction. The KBV rate is computed from scored model outputs and restatement probes; human validation and sensitivity checks are external to any internal reduction. No load-bearing self-citations or ansatzes are invoked for the core claims. The study is self-contained against external benchmarks and falsifiable via the released transcripts and rubrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that the benchmark design and LLM judge provide a valid (conservative) measure of real constraint adherence in scientific contexts; no free parameters or invented physical entities.

axioms (1)

domain assumption: LLM-as-judge can reliably score constraint adherence in generated ideas (with acknowledged under-detection)
Core to all quantitative results and human validation comparison.

invented entities (2)

DriftBench (no independent evidence)
purpose: Benchmark dataset and evaluation framework for multi-turn constraint adherence
Newly constructed for this study; no independent evidence outside the paper.
knows-but-violates (KBV) rate (no independent evidence)
purpose: Metric capturing dissociation between declarative recall and behavioral adherence
Defined and measured in this paper; no prior independent validation.


