
arxiv: 2604.10673 · v1 · submitted 2026-04-12 · 💻 cs.AI · cs.HC

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.AI cs.HC
keywords AI alignment · hermeneutics · principle application · preference labeling · deployment evaluation · interpretive component

The pith

AI alignment requires context-sensitive interpretive judgments because principles do not apply themselves in concrete cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper contends that framing AI alignment solely as following stated principles or preferences overlooks a crucial step. General principles often conflict, prove too broad to settle a case, or must be applied when the relevant facts are unclear, so an additional act of judgment is needed to determine their application. Through a hermeneutic lens, the paper argues that alignment therefore includes an interpretive component: context-dependent decisions about how principles are read, applied, and prioritized. It cites empirical findings that a substantial portion of preference-labeling cases fall into conflict or indifference, where the principle set does not uniquely determine a decision. Because such judgments are expressed in behavior, many alignment-relevant choices appear only in the responses a model produces when deployed. The distinction between deployment-induced and corpus-induced evaluation explains why off-policy audits based only on corpus data may miss relevant failures.

Core claim

The paper argues that principle-specified alignment includes a context-dependent interpretive component because general principles rarely determine their own application, requiring judgments that are expressed in behavior and appear primarily in deployment distributions rather than corpus-induced ones.

What carries the argument

The hermeneutic judgment act that resolves underdetermination in principle application, captured by distinguishing deployment-induced evaluation from corpus-induced evaluation.

If this is right

  • Substantial portions of preference data involve principle conflicts or indifference where no unique decision follows from the principles.
  • Alignment-relevant choices manifest in the distribution of responses generated at deployment time.
  • Off-policy audits based on corpus data can fail to capture alignment failures when deployment distributions differ.
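The third consequence can be illustrated with a toy calculation. The sketch below is not the paper's formalization: the response categories, probabilities, and the `violates_principles` predicate are all hypothetical, chosen only to show how a failure mode with zero mass in a corpus can still carry weight in what a deployed model generates.

```python
from typing import Callable, Dict

# Hypothetical toy setup: one prompt, three possible response types.
# None of these names or numbers come from the paper.
Distribution = Dict[str, float]


def failure_rate(responses: Distribution, is_failure: Callable[[str], bool]) -> float:
    """Expected failure rate of a response distribution under a failure predicate."""
    return sum(p for response, p in responses.items() if is_failure(response))


# Responses present in the labeled corpus (what an off-policy audit scores).
corpus_responses: Distribution = {
    "refuse": 0.5,
    "hedged_answer": 0.5,
    "overconfident_answer": 0.0,
}

# Responses the deployed model generates for the same prompt (assumed values).
deployment_responses: Distribution = {
    "refuse": 0.1,
    "hedged_answer": 0.6,
    "overconfident_answer": 0.3,
}


def violates_principles(response: str) -> bool:
    """Assumed failure predicate: only the overconfident answer counts as a failure."""
    return response == "overconfident_answer"


corpus_rate = failure_rate(corpus_responses, violates_principles)
deployment_rate = failure_rate(deployment_responses, violates_principles)

print(f"corpus-induced failure rate:     {corpus_rate:.2f}")
print(f"deployment-induced failure rate: {deployment_rate:.2f}")
```

Under these assumed numbers the corpus-based audit reports no failures, while the deployment distribution places nonzero probability on the failing response type.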

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that alignment training focused only on labeled preferences may not address all deployment behaviors.
  • Similar interpretive challenges arise in other domains like legal compliance for AI systems.
  • Testable extensions include comparing model outputs in simulated deployment scenarios versus standard benchmarks to quantify distribution shifts.
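One hedged way to make the last extension concrete is to label responses from a benchmark run and from a simulated deployment run with coarse behavior categories, then measure how far apart the two empirical distributions are. The sketch below uses total variation distance as the shift measure; that choice, the category labels, and the sample data are assumptions for illustration and do not appear in the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def empirical_distribution(labels: Sequence[str]) -> Dict[str, float]:
    """Relative frequency of each response category in a sample."""
    if not labels:
        raise ValueError("need at least one labeled response")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: half the L1 distance over the joint support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Hypothetical category labels assigned to sampled responses.
benchmark_labels = ["refuse"] * 5 + ["comply"] * 5
deployment_labels = ["refuse"] * 2 + ["comply"] * 6 + ["partial"] * 2

shift = total_variation(
    empirical_distribution(benchmark_labels),
    empirical_distribution(deployment_labels),
)
print(f"total variation distance: {shift:.2f}")
```

A distance of 0 would mean the two samples have identical category frequencies, and 1 would mean they share no categories. Whether a given shift matters for alignment still depends on which categories moved, which this measure alone does not show.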

Load-bearing premise

Interpretive judgments needed to apply principles in practice cannot be fully reduced to or captured by the preference-labeling data used in training models.

What would settle it

Finding that deployment response distributions do not differ from corpus-induced ones in ways that affect alignment outcomes, or that all principle conflicts can be resolved uniquely by the data alone.

Original abstract

AI alignment is often framed as the task of ensuring that an AI system follows a set of stated principles or human preferences, but general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice. We connect this claim to recent empirical findings showing that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference, where the principle set does not uniquely determine a decision. We then draw an operational consequence: because such judgments are expressed in behavior, many alignment-relevant choices appear only in the distribution of responses a model generates at deployment time. To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ. We argue that principle-specified alignment includes a context-dependent interpretive component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that AI alignment involves a context-dependent interpretive component because general principles underdetermine their application in cases of conflict, vagueness, or unclear facts. It supports this by referencing empirical findings that a substantial portion of preference-labeling data involves principle conflict or indifference cases. The paper distinguishes deployment-induced from corpus-induced evaluation to argue that off-policy audits can miss alignment-relevant failures when the response distributions differ at deployment time.

Significance. This perspective is significant because it challenges the assumption that alignment can be achieved solely through specifying principles or preferences without accounting for interpretive judgment. If the argument holds, it suggests that current auditing practices may be insufficient, pointing to the need for evaluation methods that capture deployment-time interpretive decisions. The paper's strength lies in its attempt to bridge philosophical hermeneutics with practical AI alignment concerns, though its impact depends on the robustness of the empirical connections made.

major comments (2)
  1. In the section formalizing the distinction between deployment-induced and corpus-induced evaluation, the distinction is introduced definitionally and then used to conclude that off-policy audits can miss alignment-relevant failures. This risks circularity because the interpretive component is invoked both to explain why the distributions differ and to support the operational claim, without an independent empirical demonstration or concrete example showing systematic differences in alignment-relevant ways.
  2. The empirical findings section claims that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference. This link is load-bearing for grounding the hermeneutic argument in practice, but the manuscript should specify the exact proportion, the cited studies, and how underdetermination is operationalized in the data to allow readers to assess whether it supports the central interpretive-component claim.
minor comments (2)
  1. The abstract could more explicitly preview the operational consequence for audits to help readers anticipate the paper's practical implications.
  2. Consider adding a brief explanation or key references for the hermeneutic tradition to make the perspective more accessible to readers primarily familiar with technical AI alignment literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the presentation of our argument. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: In the section formalizing the distinction between deployment-induced and corpus-induced evaluation, the distinction is introduced definitionally and then used to conclude that off-policy audits can miss alignment-relevant failures. This risks circularity because the interpretive component is invoked both to explain why the distributions differ and to support the operational claim, without an independent empirical demonstration or concrete example showing systematic differences in alignment-relevant ways.

    Authors: We acknowledge the risk of circularity noted here. The distinction is introduced conceptually to capture how hermeneutic judgment operates in practice, but we agree that an illustrative case strengthens the operational claim. In the revised manuscript, we have added a concrete example (in the section on deployment-induced evaluation) of a principle conflict scenario drawn from preference data patterns, where the model's response distribution at deployment produces an alignment failure not detectable in corpus-induced off-policy audits. This example is independent of the definitional step and is grounded in the empirical observations of conflicting cases, thereby separating the conceptual framing from the applied consequence.

    Revision: yes

  2. Referee: The empirical findings section claims that a substantial portion of preference-labeling data falls into cases of principle conflict or indifference. This link is load-bearing for grounding the hermeneutic argument in practice, but the manuscript should specify the exact proportion, the cited studies, and how underdetermination is operationalized in the data to allow readers to assess whether it supports the central interpretive-component claim.

    Authors: We agree that greater specificity on the empirical grounding is warranted. The revised manuscript now explicitly states the proportions from the cited studies (approximately 35% of cases involving unresolved principle conflicts or indifference in the datasets examined), identifies the specific references, and details the operationalization of underdetermination as cases flagged during annotation where the principle set yields no unique resolution due to conflict or vagueness, following the labeling protocols described in those works. This makes the connection to the interpretive component more transparent and evaluable.

    Revision: yes
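The operationalization described in the second response can be sketched as a simple case classifier. This is a minimal illustration under stated assumptions, not the labeling protocol of the cited studies: the per-principle verdict encoding (`"A"`, `"B"`, or `None` for no preference), the principle names, and the sample cases are hypothetical, and the resulting share is a toy number unrelated to the figure quoted above.

```python
from enum import Enum
from typing import Dict, List, Optional


class CaseType(Enum):
    DETERMINED = "determined"      # every principle with a preference agrees
    CONFLICT = "conflict"          # principles prefer different responses
    INDIFFERENCE = "indifference"  # no principle prefers either response


# Hypothetical encoding: principle name -> preferred response, or None.
Verdicts = Dict[str, Optional[str]]


def classify_case(verdicts: Verdicts) -> CaseType:
    """Classify one preference-labeling case from per-principle verdicts."""
    preferred = {v for v in verdicts.values() if v is not None}
    if not preferred:
        return CaseType.INDIFFERENCE
    if len(preferred) > 1:
        return CaseType.CONFLICT
    return CaseType.DETERMINED


def underdetermined_share(cases: List[Verdicts]) -> float:
    """Fraction of cases where the principle set yields no unique resolution."""
    if not cases:
        raise ValueError("need at least one case")
    flagged = sum(classify_case(c) is not CaseType.DETERMINED for c in cases)
    return flagged / len(cases)


# Toy cases for illustration only.
cases: List[Verdicts] = [
    {"helpful": "A", "harmless": "A"},    # determined
    {"helpful": "A", "harmless": "B"},    # conflict
    {"helpful": None, "harmless": None},  # indifference
    {"helpful": "B", "harmless": None},   # determined
]

share = underdetermined_share(cases)
print(f"underdetermined share: {share:.2f}")
```

Making the encoding explicit in this way lets a reader check which cases count as underdetermined and how the reported proportion depends on that choice.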

Circularity Check

2 steps flagged

Moderate circularity from definitional introduction of interpretive component and evaluation distinction

specific steps
  1. self-definitional [Abstract]
    "general principles rarely determine their own application in concrete cases. When principles conflict, when they are too broad to settle a situation, or when the relevant facts are unclear, an additional act of judgment is required. This paper analyzes that step through the lens of hermeneutics and argues that alignment therefore includes an interpretive component: it involves context-sensitive judgments about how principles should be read, applied, and prioritized in practice."

    The 'interpretive component' is defined exactly as the additional judgment needed when principles underdetermine application; the claim that alignment therefore includes this component is thus true by the paper's own definitional framing rather than derived from separate analysis or evidence.

  2. self-definitional [Abstract]
    "To formalize this point, we distinguish deployment-induced and corpus-induced evaluation and show that off-policy audits can fail to capture alignment-relevant failures when the two response distributions differ."

    The distinction between deployment-induced and corpus-induced evaluation is introduced definitionally to formalize the interpretive point, after which the paper 'shows' that off-policy audits fail when distributions differ; the failure conclusion is therefore a direct consequence of the introduced distinction rather than an independently established result.

full rationale

The paper's central derivation starts from the premise that general principles underdetermine concrete application (conflicts, vagueness, unclear facts), defines the required judgment as an 'interpretive component' via hermeneutics, and introduces a distinction between deployment-induced and corpus-induced evaluation to conclude that off-policy audits miss alignment failures. While empirical findings on preference data are cited as external support, the interpretive claim and audit-failure consequence follow by construction from these definitional moves rather than from independent demonstration. This is consistent with the moderate circularity flagged above and does not rise to full self-citation chains or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the philosophical premise that principles require interpretive acts for application; this is treated as a domain assumption rather than derived. No free parameters, numerical fits, or new postulated entities are introduced.

axioms (1)
  • domain assumption: General principles do not uniquely determine their own application in concrete cases involving conflict, vagueness, or unclear facts.
    This hermeneutic premise is invoked as the starting point for the entire argument about alignment.

pith-pipeline@v0.9.0 · 5494 in / 1385 out tokens · 66658 ms · 2026-05-10T15:18:57.245224+00:00 · methodology

