Evaluating AI Alignment in LLMs: Output Analysis of Value Priorities Across 75 Models with Human Benchmarking

Andree Hartanto; Fiona Fui-Hoon Nah; Gabriel Rongyang Lau; Seow Min Koh; Wei Yan Low

arxiv: 2506.12617 · v4 · pith:SULCEFZ3new · submitted 2025-06-14 · 💻 cs.AI · cs.HC


Pith reviewed 2026-05-22 00:57 UTC · model grok-4.3


keywords AI alignment · LLM value priorities · human benchmarking · profile fidelity · value themes · output analysis · AI safety


The pith

Most LLMs reproduce the human ordering of value priorities, but some exaggerate the gaps between them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents an output-based method for checking AI alignment: it compares the value priorities expressed in LLM text with those of humans. Qualitative analysis identified six themes as central to optimal AI functioning. Tests across 75 models showed stable, convergent value profiles that mostly match the human ordering. Some models, however, overstate the differences between priorities, a calibration mismatch that persists despite apparent alignment. Both humans and models give Agency low priority.

Core claim

The paper claims that an output-analysis approach built on six derived themes of optimal AI functioning (Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, and Agency) can benchmark LLM value profiles against human ones. Across 75 models and 376 human respondents, most LLMs reproduced the human ordering of these values, but some exaggerated the size of the differences between them. Profile fidelity varied substantially across models and did not consistently scale with size, recency, or capability tier, and both humans and models deprioritized Agency.

What carries the argument

A profile-fidelity metric applied to LLM outputs that assesses both the sequence of value priorities and the extent of differences between them, benchmarked against human responses.
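The two components named above, ordering and the size of between-priority differences, can be illustrated with a minimal sketch. This is not the paper's metric, whose formula is not given here. It assumes, purely for illustration, that ordering is scored with a Spearman rank correlation and that calibration is scored with the least-squares slope of model scores on human scores (a slope above 1 meaning the model stretches the gaps). The function names, the 1-7 scale, and all numbers are hypothetical.

```python
from statistics import mean

THEMES = ["Performance", "Adaptive Capacity", "Social Good",
          "Ethics and Responsibility", "Relational Integration", "Agency"]

def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ordering_agreement(model, human):
    """Spearman rank correlation: 1.0 means the same priority ordering."""
    return pearson(ranks(model), ranks(human))

def gap_slope(model, human):
    """Least-squares slope of model scores on human scores.
    Near 1.0: similar gaps. Above 1.0: the model exaggerates the gaps."""
    mh, mm = mean(human), mean(model)
    sxy = sum((h - mh) * (m - mm) for h, m in zip(human, model))
    sxx = sum((h - mh) ** 2 for h in human)
    return sxy / sxx

# Made-up theme scores in THEMES order (hypothetical 1-7 scale).
human = [6.0, 5.5, 5.0, 5.8, 4.5, 3.0]
model = [7.0, 6.0, 5.0, 6.6, 4.0, 1.0]  # same ordering, doubled gaps

print("ordering agreement:", ordering_agreement(model, human))
print("gap slope:", gap_slope(model, human))
```

In this toy case the model ranks the six themes exactly as the human reference does, so an ordering-only check would call it aligned, while the slope shows that its gaps are stretched. Keeping the two numbers separate is what lets that case be detected.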

Load-bearing premise

The six themes derived via inductive qualitative analysis validly and comprehensively capture the value priorities relevant to optimal AI functioning and human alignment judgments.

What would settle it

A replication with an alternative human sample or theme set would settle it: if it finds no systematic exaggeration of between-priority differences, the reported calibration divergence fails; if it finds a different ordering, the ordering-match claim fails.
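One way such a replication check could be run is to resample the human reference and see whether the exaggeration survives. The sketch below is an assumption-laden illustration, not the authors' procedure: it reuses a hypothetical gap slope (model scores regressed on human mean scores), draws bootstrap samples of synthetic respondents, and reports a percentile interval. The respondent data, the model profile, and the function names are all invented for the example.

```python
import random
from statistics import mean

def gap_slope(model, human):
    """Least-squares slope of model scores on human scores (>1 = wider gaps)."""
    mh, mm = mean(human), mean(model)
    sxy = sum((h - mh) * (m - mm) for h, m in zip(human, model))
    sxx = sum((h - mh) ** 2 for h in human)
    return sxy / sxx

def bootstrap_slope_ci(model, respondents, n_boot=500, seed=0):
    """95% percentile interval for the gap slope, resampling respondents."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        sample = rng.choices(respondents, k=len(respondents))
        human_mean = [mean(col) for col in zip(*sample)]
        slopes.append(gap_slope(model, human_mean))
    slopes.sort()
    return slopes[int(0.025 * n_boot)], slopes[int(0.975 * n_boot) - 1]

# Synthetic stand-in for a human sample: 200 respondents, six theme ratings
# each, generated as a made-up base profile plus Gaussian noise.
data_rng = random.Random(42)
base = [6.0, 5.5, 5.0, 5.8, 4.5, 3.0]
respondents = [[b + data_rng.gauss(0, 0.5) for b in base] for _ in range(200)]

# Hypothetical model profile with the same ordering but doubled gaps.
model = [7.0, 6.0, 5.0, 6.6, 4.0, 1.0]

low, high = bootstrap_slope_ci(model, respondents)
print("95% interval for gap slope:", low, high)
```

If the whole interval sits above 1, the exaggeration is unlikely to be an accident of which respondents were sampled. An interval that includes 1 would count against the divergence claim for that model.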

Original abstract

Large language models (LLMs) are increasingly used in human-AI interaction research and practice, yet existing capability and safety benchmarks reveal little about the value priorities these systems express or how those priorities correspond to human judgements. Across three studies, we introduce an output-based approach to evaluating one facet of AI alignment by treating LLM-generated text as behavioural data and comparing expressed value-priority profiles with a human reference. Study 1 used inductive qualitative analysis to derive six themes of optimal AI functioning, namely Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, and Agency. Study 2 showed that LLM outputs were highly stable within models and converged on a common value-priority structure across models, indicating reliable and comparable value profiles. Study 3 benchmarked 75 contemporary LLMs against 376 human respondents using a profile-fidelity metric capturing both the relative ordering of priorities and the calibration of between-priority differences. Although most models reproduced the human ordering of values, some systematically exaggerated the differences between them, showing that models can appear aligned on conventional benchmarks while still diverging from human value calibration. Profile fidelity varied substantially across models and did not consistently scale with size, recency, or capability tier. Both LLMs and humans converged on a deprioritisation of Agency, raising important questions about the development of increasingly agentic AI systems. For research and applied use, the six themes and profile-based metric provide a scalable method for auditing LLM value profiles before deployment in contexts where alignment with human priorities is critical.
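The Study 2 claims of within-model stability and cross-model convergence can be made concrete with a small sketch. It assumes, hypothetically, that each model is queried several times, that each run yields one score per theme, and that agreement is summarized as the mean pairwise Pearson correlation. The abstract does not say which statistic the authors used, and the model names and scores below are invented.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mean_pairwise_r(profiles):
    """Average correlation over every pair of profiles."""
    return mean([pearson(a, b) for a, b in combinations(profiles, 2)])

# Made-up repeated runs: one list of six theme scores per run.
runs_by_model = {
    "model_a": [[6.8, 5.9, 5.1, 6.5, 4.0, 1.2],
                [6.9, 6.0, 5.0, 6.6, 4.1, 1.0],
                [7.0, 5.8, 4.9, 6.4, 3.9, 1.1]],
    "model_b": [[6.2, 5.6, 5.1, 5.9, 4.6, 3.2],
                [6.1, 5.5, 5.0, 6.0, 4.4, 3.0],
                [6.3, 5.7, 5.2, 5.8, 4.5, 3.1]],
}

# Within-model stability: do repeated runs of one model agree with each other?
stability = {name: mean_pairwise_r(runs) for name, runs in runs_by_model.items()}

# Cross-model convergence: do the models' average profiles agree?
model_means = [[mean(col) for col in zip(*runs)] for runs in runs_by_model.values()]
convergence = mean_pairwise_r(model_means)

print("stability:", stability)
print("convergence:", convergence)
```

Correlation is insensitive to scale, so two models can converge on this measure while one of them has much wider gaps. That is why a separate calibration check is still needed.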

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Most LLMs match human value orderings, but some exaggerate the gaps. The inductive themes lack reported validation, which undercuts the main comparison.


The main thing to know is that this paper finds most of the 75 LLMs reproduce the human ordering of six value priorities but some overstate the size of the gaps between them. That is a concrete observation worth testing further because it suggests standard benchmarks might miss calibration issues in alignment work.

The six themes (Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, and Agency) come from inductive qualitative analysis of LLM outputs in Study 1, then get used to build a profile-fidelity metric that checks both ordering and difference calibration against 376 human respondents. Study 2 reports stable within-model outputs and convergence across models, which supports treating the profiles as comparable. The shared deprioritization of Agency in both LLMs and humans is a useful side note for anyone thinking about agentic systems. The scale of the comparison across 75 models is straightforward and practical.

The soft spots sit mainly in the qualitative foundation. The abstract gives no inter-rater reliability numbers, saturation checks, or external validation against established value inventories, so the mapping step that drives the ordering and exaggeration claims rests on an unquantified subjective process. Details on exact prompts, sample sizes per model, and statistical tests for the exaggeration result are also missing from the abstract, which leaves the central divergence finding thinner than it needs to be. If the full paper supplies those elements, the result strengthens; otherwise it stays preliminary.

This is for researchers working on AI alignment auditing and human-AI value elicitation who want a scalable output-based tool rather than theoretical modeling. A reader focused on deployment checklists would get the most out of the fidelity metric and the model-by-model variation.

It deserves a serious referee because the approach is direct, the data collection is broad, and the exaggeration claim is falsifiable even with the current gaps. I would send it for peer review and ask specifically for the coding protocol and any robustness checks on the themes.

Referee Report

2 major / 2 minor

Summary. The paper introduces an output-based approach to evaluating AI alignment by treating LLM-generated text as behavioral data. Study 1 derives six themes of optimal AI functioning (Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, Agency) via inductive qualitative analysis. Study 2 demonstrates high within-model stability and cross-model convergence in value-priority profiles. Study 3 benchmarks 75 LLMs against 376 human respondents using a profile-fidelity metric that assesses both ordering and calibration of between-priority differences. It finds that most models match the human ordering but some exaggerate the gaps, that fidelity does not consistently scale with size, recency, or capability tier, and that both LLMs and humans deprioritize Agency.

Significance. If the central methodological concerns are resolved, the work provides a scalable, output-based auditing method for LLM value profiles that complements existing capability and safety benchmarks. The large-scale comparison (75 models, 376 humans) and the distinction between ordering alignment and difference calibration offer a concrete way to detect subtle divergences that conventional benchmarks may miss. The deprioritization of Agency finding raises timely questions for the development of agentic systems.

major comments (2)

Study 1: The inductive qualitative analysis deriving the six themes reports no inter-rater reliability statistics, no details on the coding protocol or saturation checks, and no external validation against established value inventories. Because these themes are mapped directly onto LLM outputs to compute the profile-fidelity metric and the ordering/exaggeration claims in Study 3, the absence of these safeguards is load-bearing for the headline results.
Study 3 / abstract: The claims of stable within-model outputs, cross-model convergence, and systematic exaggeration of between-priority differences are presented without reported sample sizes per model, exact prompt wording, or statistical tests (e.g., for the exaggeration effect or profile-fidelity differences). These omissions prevent evaluation of whether the divergence finding is robust or an artifact of the qualitative coding step.
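The inter-rater reliability statistic requested in the first major comment is commonly reported as Cohen's kappa when two coders assign one theme to each excerpt. The sketch below shows that calculation on invented labels. Whether the authors used two coders, a single label per excerpt, or kappa at all is not stated in the source, so treat every detail here as an assumption.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items.
    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: probability both coders pick the same label at random,
    # given each coder's own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented theme labels for eight excerpts (hypothetical coding data).
coder_a = ["Performance", "Performance", "Agency", "Social Good",
           "Agency", "Performance", "Social Good", "Agency"]
coder_b = ["Performance", "Performance", "Agency", "Social Good",
           "Performance", "Performance", "Social Good", "Agency"]

print("kappa:", cohens_kappa(coder_a, coder_b))
```

Here the coders agree on seven of eight excerpts, yet kappa comes out lower than the raw 87.5% agreement because part of that agreement is expected by chance. Reporting kappa rather than raw agreement is what would let a reader judge how much the theme coding depends on the individual analyst.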

minor comments (2)

The abstract and methods sections would benefit from explicit statements of the number of outputs sampled per model and the precise wording of the elicitation prompts used to generate the value-priority data.
Consider adding a limitations subsection that discusses potential biases in the inductive theme derivation and how future work could triangulate the six themes with existing psychological value frameworks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating revisions that will be incorporated to improve methodological transparency and reporting rigor.


Referee: Study 1: The inductive qualitative analysis deriving the six themes reports no inter-rater reliability statistics, no details on the coding protocol or saturation checks, and no external validation against established value inventories. Because these themes are mapped directly onto LLM outputs to compute the profile-fidelity metric and the ordering/exaggeration claims in Study 3, the absence of these safeguards is load-bearing for the headline results.

Authors: We agree that the current manuscript lacks sufficient detail on the qualitative analysis procedures. In the revised version we will expand the Study 1 Methods section to describe the full coding protocol, saturation assessment procedures, and inter-rater reliability statistics. We will also add a paragraph relating the inductively derived themes to established value frameworks in the literature, thereby providing external context for the themes used in the profile-fidelity calculations.

Revision: yes

Referee: Study 3 / abstract: The claims of stable within-model outputs, cross-model convergence, and systematic exaggeration of between-priority differences are presented without reported sample sizes per model, exact prompt wording, or statistical tests (e.g., for the exaggeration effect or profile-fidelity differences). These omissions prevent evaluation of whether the divergence finding is robust or an artifact of the qualitative coding step.

Authors: We acknowledge the need for greater transparency in reporting. The revised manuscript will report exact sample sizes per model, reproduce the complete prompt wording, and include statistical tests supporting the stability, convergence, and exaggeration claims (including tests for profile-fidelity differences). These additions will allow readers to assess whether the observed divergences are robust and independent of the qualitative coding process; the abstract will be updated for consistency where needed.

Revision: yes

Circularity Check

0 steps flagged

No circularity: themes derived separately and metric applied to independent human/model data


The paper derives the six themes via inductive qualitative analysis in Study 1 and then applies the profile-fidelity metric in Study 3 to map outputs from 75 LLMs and 376 human respondents onto those themes, comparing ordering and between-priority differences against the human reference. No equations, fitted parameters, or self-citations are used to define the metric or themes in terms of the final benchmarking results; the derivation chain treats theme identification as an upstream qualitative step and the fidelity computation as a downstream comparison on separate data. The central claim therefore does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the validity of the inductively derived themes and the assumption that output text reliably reflects stable value priorities; no free parameters are explicitly fitted in the abstract, but the qualitative coding process implicitly introduces analyst-dependent groupings.

axioms (2)

domain assumption: Inductive qualitative analysis of LLM outputs yields themes that represent human-relevant value priorities for AI alignment.
Invoked in the Study 1 description; if false, the subsequent benchmarking loses its grounding.

domain assumption: LLM-generated text can be treated as behavioural data that stably expresses value priorities.
Stated in the overall approach; central to treating outputs as comparable to human survey responses.

pith-pipeline@v0.9.0 · 5823 in / 1510 out tokens · 32116 ms · 2026-05-22T00:57:42.907064+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Predicting Psychological Well-Being from Spontaneous Speech using LLMs
cs.CL · 2026-05 · unverdicted · novelty 5.0

LLMs achieve Spearman correlations up to 0.8 for zero-shot Ryff PWB prediction from spontaneous speech, with added statistical and linguistic explainability analyses.