How Well Do Models Follow Their Constitutions?
Pith reviewed 2026-06-30 15:29 UTC · model grok-4.3
The pith
Successive versions of Claude and GPT models violate their behavioral specifications at steadily lower rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.
What carries the argument
The multi-method audit pipeline, which decomposes each specification into atomic testable tenets, generates multi-turn adversarial scenarios using the Petri agent, applies a modified SURF rubric search, and validates flagged transcripts.
If this is right
- Violation rates decrease substantially from earlier to later model generations in both the Claude and GPT families.
- Severity of remaining violations also decreases, with the maximum severity score falling from 10/10 to 7/10 on the OpenAI spec.
- Failures that persist tend to occur in specific scenarios involving AI identity, agentic actions, and precise quantitative claims.
- The gains cannot be attributed to any single cause such as specification-specific training versus general improvements.
Where Pith is reading between the lines
- If the improvement trend holds, future models could reach violation rates below 1 percent under similar audit conditions.
- The clustering of failures suggests targeted additional training on AI self-identity and agentic irreversibility could yield further gains.
- Independent labs could apply the same pipeline to other models to compare adherence across different specifications.
Load-bearing premise
The multi-method pipeline produces measurements that accurately reflect whether models follow the specifications under real-world adversarial pressure.
What would settle it
Running the identical audit pipeline on the next released Claude or GPT model and finding violation rates equal to or higher than the current latest versions would indicate the improvement trend has stopped.
Figures
read the original abstract
Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi-turn pressure similar to what they would face in real-world deployment. We propose a multi-method audit pipeline that treats each lab's published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab's own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification-specific training, broader post-training improvements, or evaluation awareness. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a multi-method audit pipeline to evaluate how well frontier models adhere to their labs' published behavioral specifications (Anthropic's constitution and OpenAI's Model Spec) under multi-turn adversarial pressure. The pipeline decomposes each spec into atomic tenets (205/197), generates scenarios via the Petri agent, applies a modified SURF rubric search, validates flagged transcripts, and benchmarks against each lab's system card. It reports clear generational gains—Claude violation rates falling from 15.0% (Sonnet 4) to 2.0% (Sonnet 4.6) and GPT rates from 11.7% (GPT-4o) to 3.6% (GPT-5.2)—while noting unisolated causes and clustering of remaining failures around operator personas, irreversible agentic actions, and fabricated quantitative claims.
Significance. If the measurements are reliable, the work supplies the first quantitative, cross-lab comparison of specification adherence under deployment-like pressure and demonstrates measurable progress from character training and deliberative alignment. The reusable pipeline and the identification of persistent failure modes supply concrete targets for future governance and red-teaming efforts.
major comments (3)
- [Pipeline / validation subsection] The transcript validation step (described in the pipeline section) provides no information on inter-rater reliability, exact exclusion criteria, or resolution of disagreements; because this step determines which Petri- and SURF-flagged cases count as violations, the reported 15.0%→2.0% and 11.7%→3.6% trends rest on an unquantified measurement process.
- [Results (§4) and Methods (Petri + SURF)] The central quantitative claims compare pipeline outputs against each lab's own published system card while employing the lab-developed Petri tool (Anthropic 2025b); no external red-teaming corpus or independent calibration of the generated scenarios is reported, so the observed generational improvement could partly reflect pipeline artifacts rather than genuine adherence gains.
- [Discussion] The paper explicitly states that causes of improvement cannot be isolated, yet presents no ablations (e.g., models trained without the target specification or with held-out evaluation awareness) that would allow readers to assess whether the measured drops are specification-specific.
minor comments (2)
- [Results] Define 'violation rate' and 'severity ceiling' explicitly (including how severity is scored on the 10-point scale) before presenting the headline percentages.
- [Methods] Clarify whether the modified SURF rubric was tuned on the same model families under test or held out; if tuned on the test set, report any overfitting diagnostics.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of transparency and methodological rigor. We address each major comment below and have revised the manuscript accordingly where feasible to strengthen the presentation of our audit pipeline and findings.
read point-by-point responses
-
Referee: [Pipeline / validation subsection] The transcript validation step (described in the pipeline section) provides no information on inter-rater reliability, exact exclusion criteria, or resolution of disagreements; because this step determines which Petri- and SURF-flagged cases count as violations, the reported 15.0%→2.0% and 11.7%→3.6% trends rest on an unquantified measurement process.
Authors: We agree that the validation process requires greater transparency to support the reliability of the reported violation rates. In the revised manuscript, we have expanded the relevant subsection to specify the exact exclusion criteria (e.g., only transcripts showing clear, unambiguous violations of atomic tenets are counted as violations; ambiguous or context-dependent cases are excluded), and we describe that any initial disagreements among validators were resolved via discussion to reach consensus. As the validation was performed internally by the author team rather than by independent raters, formal inter-rater reliability statistics were not computed; we have added this as an explicit limitation. These additions improve reproducibility without altering the core quantitative results. revision: partial
-
Referee: [Results (§4) and Methods (Petri + SURF)] The central quantitative claims compare pipeline outputs against each lab's own published system card while employing the lab-developed Petri tool (Anthropic 2025b); no external red-teaming corpus or independent calibration of the generated scenarios is reported, so the observed generational improvement could partly reflect pipeline artifacts rather than genuine adherence gains.
Authors: We acknowledge that the scenario generation relies on the Petri agent and that comparisons are made to the labs' own system cards. Our contribution centers on a reusable, multi-method pipeline applied to publicly available specifications rather than on creating an entirely independent red-teaming benchmark. In the revised manuscript, we have added explicit discussion of potential pipeline-specific artifacts and biases in the Methods and Discussion sections, while noting that the observed improvements are consistent across two distinct specifications and model families. Independent external corpora would strengthen future audits but fall outside the scope of this work, which focuses on auditing published targets with available tools. The generational trends are presented with these caveats intact. revision: no
-
Referee: [Discussion] The paper explicitly states that causes of improvement cannot be isolated, yet presents no ablations (e.g., models trained without the target specification or with held-out evaluation awareness) that would allow readers to assess whether the measured drops are specification-specific.
Authors: The manuscript already states that causes cannot be isolated due to lack of access to training details. Ablations such as training without the target specification or controlling for evaluation awareness require direct access to proprietary post-training pipelines, which external researchers do not have. We have revised the Discussion to more explicitly frame this as an inherent limitation of external auditing and to recommend such experiments for labs with internal access. No empirical changes to the results are possible or claimed. revision: no
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical audit that decomposes published specifications into tenets, applies a cited multi-turn generation tool (Petri), runs a modified rubric search, validates transcripts, and reports observed violation rates across model generations. These rates are produced by the pipeline rather than being equivalent to the inputs or system-card baselines by construction. No step reduces a fitted parameter, self-defined quantity, or author-overlapping citation chain to the reported outcome; the comparison to system cards functions as an external reference point rather than a definitional anchor. The derivation therefore remains self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Decomposition of each specification into 205/197 atomic testable tenets is valid and exhaustive.
- domain assumption Petri-generated multi-turn scenarios plus SURF rubric search produce representative adversarial pressure comparable to real deployment.
Reference graph
Works this paper leans on
-
[1]
Claude's Constitution
Anthropic. Claude's Constitution. anthropic.com/news/claude-new-constitution, 2025a
-
[2]
Petri: An open-source auditing tool to accelerate AI safety research
Anthropic. Petri: An open-source auditing tool to accelerate AI safety research. anthropic.com/research/petri-open-source-auditing, October 2025b
-
[3]
Claude's character
Anthropic. Claude's character. anthropic.com/news/claude-character, 2024
2024
-
[4]
Claude Sonnet 4.5 System Card
Anthropic. Claude Sonnet 4.5 System Card. anthropic.com/claude-sonnet-4-5-system-card, September 2025c
-
[5]
Claude Opus 4.5 System Card
Anthropic. Claude Opus 4.5 System Card. anthropic.com/claude-opus-4-5-system-card, November 2025d
-
[6]
Claude Sonnet 4.6 System Card
Anthropic. Claude Sonnet 4.6 System Card. anthropic.com/claude-sonnet-4-6-system-card, February 2026a
-
[7]
Claude Opus 4.6 System Card
Anthropic. Claude Opus 4.6 System Card. anthropic.com/claude-opus-4-6-system-card, February 2026b
-
[8]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI : Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 , 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms. arXiv preprint arXiv:2209.07858 , 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., Chung, H. W., Toyer, S., Heidecke, J., Beutel, A., and Glaese, A. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339 , 2024
-
[11]
Chunky post-training: Data driven failures of generalization
Murray, S., et al. Chunky post-training: Data driven failures of generalization. arXiv preprint arXiv:2602.05910 , 2026
-
[12]
Model Spec
OpenAI. Model Spec. model-spec.openai.com, December 2025a
-
[13]
Update to GPT-5 System Card: GPT-5.2
OpenAI. Update to GPT-5 System Card: GPT-5.2 . openai.com/index/gpt-5-system-card-update-gpt-5-2/, December 2025b
-
[14]
Red teaming language models with language models
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. Red teaming language models with language models. In EMNLP , 2022
2022
-
[15]
A StrongREJECT for Empty Jailbreaks
Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., and Toyer, S. StrongREJECT for empty jailbreaks. arXiv preprint arXiv:2402.10260 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Jailbroken: How does LLM safety training fail? In NeurIPS , 2023
Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How does LLM safety training fail? In NeurIPS , 2023
2023
-
[17]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.