Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment
Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3
The pith
AI alignment requires context-sensitive interpretive judgments because principles do not apply themselves in concrete cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that alignment specified through principles includes a context-dependent interpretive component, because general principles rarely determine their own application. The required judgments are expressed in behavior, so they show up primarily in the responses a model generates at deployment rather than in a fixed corpus.
What carries the argument
An act of hermeneutic judgment resolves the underdetermination that arises when principles are applied. The paper formalizes the consequence by distinguishing deployment-induced evaluation from corpus-induced evaluation.
If this is right
- Substantial portions of preference data involve principle conflicts or indifference where no unique decision follows from the principles.
- Alignment-relevant choices manifest in the distribution of responses generated at deployment time.
- Off-policy audits based on corpus data can fail to capture alignment failures when deployment distributions differ.
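The gap between the two kinds of audit can be illustrated with a toy calculation. The sketch below is a minimal illustration, not the paper's formalization: the response labels, the probabilities, and the helper name `failure_rate` are all hypothetical. It scores the same failure indicator under a corpus response distribution and under a deployment response distribution for a single prompt.

```python
# Toy illustration (all names and numbers are hypothetical, not from the paper).
# One prompt, three possible responses; only "overcomply" counts as a failure.
FAILURE = {"refuse": 0.0, "comply_with_caveat": 0.0, "overcomply": 1.0}

# Response distribution represented in a fixed audit corpus (off-policy).
corpus_dist = {"refuse": 0.5, "comply_with_caveat": 0.5, "overcomply": 0.0}

# Response distribution the model actually produces at deployment (on-policy).
deployment_dist = {"refuse": 0.25, "comply_with_caveat": 0.5, "overcomply": 0.25}


def failure_rate(dist, failure):
    """Expected value of the failure indicator under a response distribution."""
    total = sum(dist.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * failure[response] for response, p in dist.items())


corpus_rate = failure_rate(corpus_dist, FAILURE)
deployment_rate = failure_rate(deployment_dist, FAILURE)

print(f"corpus-induced failure rate:     {corpus_rate:.2f}")
print(f"deployment-induced failure rate: {deployment_rate:.2f}")
```

In this toy case the corpus audit reports no failures because the failing response never appears in the corpus, while the deployment distribution places a quarter of its mass on it.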
Where Pith is reading between the lines
- This suggests that alignment training focused only on labeled preferences may not address all deployment behaviors.
- Similar interpretive challenges arise in other domains like legal compliance for AI systems.
- Testable extensions include comparing model outputs in simulated deployment scenarios versus standard benchmarks to quantify distribution shifts.
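The comparison proposed in the last bullet could be quantified in many ways. As one assumed choice (the paper does not name a metric here), the sketch below computes the total variation distance between a benchmark response distribution and a simulated-deployment response distribution. The category labels and numbers are made up for illustration.

```python
# Hypothetical sketch: total variation distance between two response
# distributions defined over response categories.

def total_variation(p, q):
    """Return 0.5 * sum |p(y) - q(y)| over the union of both supports."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in support)


# Made-up distributions for a single prompt.
benchmark = {"refuse": 0.5, "comply": 0.5}
simulated_deployment = {"refuse": 0.25, "comply": 0.5, "overcomply": 0.25}

shift = total_variation(benchmark, simulated_deployment)
print(f"total variation distance: {shift:.2f}")
```

A nonzero distance only shows that the two distributions differ. Whether the difference matters for alignment still depends on which responses count as failures.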
Load-bearing premise
Interpretive judgments needed to apply principles in practice cannot be fully reduced to or captured by the preference-labeling data used in training models.
What would settle it
Finding that deployment response distributions do not differ from corpus-induced ones in ways that affect alignment outcomes, or that all principle conflicts can be resolved uniquely by the data alone.
Original abstract
AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice. We connect this claim to recent empirical findings showing that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference, where the principle set does not uniquely determine a decision. We then draw an operational consequence: because such judgments are expressed in behavior, many alignment-relevant choices appear only in the distribution of responses a model generates at deployment time. To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ. We argue that principle-specified alignment includes a context-dependent interpretive component.
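One way to write the distinction down, offered as assumed notation rather than the paper's own: let $\pi(\cdot \mid x)$ be the model's response distribution at deployment, $q(\cdot \mid x)$ the response distribution represented in a fixed corpus, $\mathcal{D}$ a distribution over prompts, and $f(x, y) \in \{0, 1\}$ an indicator of an alignment-relevant failure. All four symbols are hypothetical.

```latex
R_{\text{dep}} = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ f(x, y) \big],
\qquad
R_{\text{corp}} = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim q(\cdot \mid x)} \big[ f(x, y) \big]

\big| R_{\text{dep}} - R_{\text{corp}} \big|
\;\le\;
\mathbb{E}_{x \sim \mathcal{D}} \Big[ \tfrac{1}{2} \sum_{y} \big| \pi(y \mid x) - q(y \mid x) \big| \Big]
```

Under this notation the two quantities coincide when $\pi = q$, and the gap is bounded by the expected total variation distance between the two response distributions. A corpus audit can therefore report $R_{\text{corp}} = 0$ while $R_{\text{dep}} > 0$ whenever $\pi$ places mass on failing responses that lie outside the support of $q$.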
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that AI alignment involves a context-dependent interpretive component because general principles underdetermine their application in cases of conflict, vagueness, or unclear facts. It supports this by referencing empirical findings that a substantial portion of preference-labeling data involves principle conflict or indifference cases. The paper distinguishes deployment-induced from corpus-induced evaluation to argue that off-policy audits can miss alignment-relevant failures when the response distributions differ at deployment time.
Significance. This perspective is significant because it challenges the assumption that alignment can be achieved solely through specifying principles or preferences without accounting for interpretive judgment. If the argument holds, it suggests that current auditing practices may be insufficient, pointing to the need for evaluation methods that capture deployment-time interpretive decisions. The paper's strength lies in its attempt to bridge philosophical hermeneutics with practical AI alignment concerns, though its impact depends on the robustness of the empirical connections made.
Major comments (2)
- In the section formalizing the distinction between deployment-induced and corpus-induced evaluation, the distinction is introduced definitionally and then used to conclude that off-policy audits can miss alignment-relevant failures. This risks circularity because the interpretive component is invoked both to explain why the distributions differ and to support the operational claim, without an independent empirical demonstration or concrete example showing systematic differences in alignment-relevant ways.
- The empirical findings section claims that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference. This link is load-bearing for grounding the hermeneutic argument in practice, but the manuscript should specify the exact proportion, the cited studies, and how underdetermination is operationalized in the data to allow readers to assess whether it supports the central interpretive-component claim.
Minor comments (2)
- The abstract could more explicitly preview the operational consequence for audits to help readers anticipate the paper's practical implications.
- Consider adding a brief explanation or key references for the hermeneutic tradition to make the perspective more accessible to readers primarily familiar with technical AI alignment literature.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the presentation of our argument. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: In the section formalizing the distinction between deployment-induced and corpus-induced evaluation, the distinction is introduced definitionally and then used to conclude that off-policy audits can miss alignment-relevant failures. This risks circularity because the interpretive component is invoked both to explain why the distributions differ and to support the operational claim, without an independent empirical demonstration or concrete example showing systematic differences in alignment-relevant ways.
  Authors: We acknowledge the risk of circularity noted here. The distinction is introduced conceptually to capture how hermeneutic judgment operates in practice, but we agree that an illustrative case strengthens the operational claim. In the revised manuscript, we have added a concrete example (in the section on deployment-induced evaluation) of a principle conflict scenario drawn from preference data patterns, where the model's response distribution at deployment produces an alignment failure not detectable in corpus-induced off-policy audits. This example is independent of the definitional step and is grounded in the empirical observations of conflicting cases, thereby separating the conceptual framing from the applied consequence.
  Revision: yes
- Referee: The empirical findings section claims that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference. This link is load-bearing for grounding the hermeneutic argument in practice, but the manuscript should specify the exact proportion, the cited studies, and how underdetermination is operationalized in the data to allow readers to assess whether it supports the central interpretive-component claim.
  Authors: We agree that greater specificity on the empirical grounding is warranted. The revised manuscript now explicitly states the proportions from the cited studies (approximately 35% of cases involving unresolved principle conflicts or indifference in the datasets examined), identifies the specific references, and details the operationalization of underdetermination as cases flagged during annotation where the principle set yields no unique resolution due to conflict or vagueness, following the labeling protocols described in those works. This makes the connection to the interpretive component more transparent and evaluable.
  Revision: yes
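The operationalization described in this response can be sketched in a few lines. The following is a hypothetical illustration, not the labeling protocol of the cited studies: each principle casts a vote for response A, response B, or neither, and a comparison counts as underdetermined when the votes conflict or when no principle expresses a preference. The function name, the vote encoding, and the sample data are all assumptions.

```python
from collections import Counter


def classify(votes):
    """Classify one pairwise comparison from per-principle votes.

    votes: list of "A", "B", or None (the principle is silent or indifferent).
    """
    preferred = {v for v in votes if v is not None}
    if not preferred:
        return "indifference"  # no principle favors either response
    if len(preferred) > 1:
        return "conflict"      # principles pull in opposite directions
    return "determined"        # all non-silent principles agree


# Made-up sample: each inner list holds the votes of three principles.
comparisons = [
    ["A", "A", None],
    ["A", "B", None],
    [None, None, None],
    ["B", "B", "B"],
]

counts = Counter(classify(v) for v in comparisons)
underdetermined = (counts["conflict"] + counts["indifference"]) / len(comparisons)
print(dict(counts))
print(f"underdetermined share: {underdetermined:.2f}")
```

The share printed at the end is the kind of proportion the referee asks the authors to report. In this sketch it comes from invented data and says nothing about the real datasets.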
Circularity Check
Moderate circularity: both the interpretive component and the evaluation distinction are introduced by definition.
Specific steps
- Self-definitional [Abstract]
  "general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice."
  The 'interpretive component' is defined exactly as the additional judgment needed when principles underdetermine application; the claim that alignment therefore includes this component is thus true by the paper's own definitional framing rather than derived from separate analysis or evidence.
- Self-definitional [Abstract]
  "To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ."
  The distinction between deployment-induced and corpus-induced evaluation is introduced definitionally to formalize the interpretive point, after which the paper 'shows' that off-policy audits fail when distributions differ; the failure conclusion is therefore a direct consequence of the introduced distinction rather than an independently established result.
Full rationale
The paper's central derivation starts from the premise that general principles underdetermine concrete application (conflicts, vagueness, unclear facts), defines the required judgment as an 'interpretive component' via hermeneutics, and introduces a distinction between deployment-induced and corpus-induced evaluation to conclude that off-policy audits miss alignment failures. While empirical findings on preference data are cited as external support, the interpretive claim and audit-failure consequence follow by construction from these definitional moves rather than from independent demonstration. This matches the circularity concern raised in the referee report, but it does not rise to full self-citation chains or the renaming of known results.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: general principles do not uniquely determine their own application in concrete cases involving conflict, vagueness, or unclear facts.