Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas

doi: 10 · 2024 · arXiv 2505.14633

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

cs.CY · 2026-05-20 · unverdicted · novelty 5.0

A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

cs.CL · 2026-04-23 · unverdicted · novelty 5.0

LLMs align decisions with prescriptive moral rightness over loyalty-shifting human behavior predictions in relational moral dilemmas, revealing a gap between internal world-modeling and autonomous choices.

citing papers explorer

Showing 4 of 4 citing papers.

Probing Persona-Dependent Preferences in Language Models · cs.CL · 2026-05-13 · unverdicted · none · ref 8
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions · cs.CL · 2026-05-11 · unverdicted · none · ref 16
Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security · cs.CY · 2026-05-20 · unverdicted · none · ref 2
Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions · cs.CL · 2026-04-23 · unverdicted · none · ref 1
