The Alignment Problem from a Deep Learning Perspective, May 2025

Richard Ngo, Lawrence Chan, Sören Mindermann · May 2025 · arXiv 2209.00626

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 3 · other: 1

citation-polarity summary

background: 3 · unclear: 1

representative citing papers

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

cs.CR · 2026-04-19 · unverdicted · novelty 6.0

Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.

Safety, Security, and Cognitive Risks in World Models

cs.CR · 2026-04-01 · unverdicted · novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

cs.LG · 2024-03-28 · unverdicted · novelty 6.0

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.

Sparse Autoencoders Find Highly Interpretable Features in Language Models

cs.LG · 2023-09-15 · unverdicted · novelty 6.0

Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.

Scaling Laws for Reward Model Overoptimization

cs.LG · 2022-10-19 · unverdicted · novelty 6.0

Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

cs.CY · 2026-04-22 · unverdicted · novelty 5.0

AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.

Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis

cs.CL · 2026-03-02 · unverdicted · novelty 5.0

Prompt framing significantly shifts LLM choices toward risk-averse options in a threshold voting task even when the prompts are logically equivalent.

An Onto-Relational-Sophic Framework for Governing Synthetic Minds

cs.AI · 2026-03-19 · unverdicted · novelty 4.0

The ORS framework supplies a CPST ontology, graded digital personhood spectrum, and Cybersophy ethics to guide governance of synthetic minds.

The Agentic Web Requires New Normative Infrastructure

cs.CY · 2026-06-09 · unverdicted · novelty 3.0

The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.

Data-Centric Foundation Models in Computational Healthcare: A Survey

cs.LG · 2024-01-04 · unverdicted · novelty 3.0

The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

cs.LG · 2026-05-11

Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability Asymmetry

cs.CY · 2026-04-03

citing papers explorer

Showing 15 of 15 citing papers.

Who Owns This Agent? Tracing AI Agents Back to Their Owners
cs.CR · 2026-05-15 · unverdicted · none · ref 22

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI · 2024-06-14 · conditional · none · ref 187

Understanding Goal Generalisation in Sequential Reinforcement Learning
cs.LG · 2026-05-22 · unverdicted · none · ref 46

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories
cs.CR · 2026-04-19 · unverdicted · none · ref 15

Safety, Security, and Cognitive Risks in World Models
cs.CR · 2026-04-01 · unverdicted · none · ref 29

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
cs.LG · 2024-03-28 · unverdicted · none · ref 54

Sparse Autoencoders Find Highly Interpretable Features in Language Models
cs.LG · 2023-09-15 · unverdicted · none · ref 20

Scaling Laws for Reward Model Overoptimization
cs.LG · 2022-10-19 · unverdicted · none · ref 22

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem
cs.CY · 2026-04-22 · unverdicted · none · ref 60

Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
cs.CL · 2026-03-02 · unverdicted · none · ref 16

An Onto-Relational-Sophic Framework for Governing Synthetic Minds
cs.AI · 2026-03-19 · unverdicted · none · ref 29

The Agentic Web Requires New Normative Infrastructure
cs.CY · 2026-06-09 · unverdicted · none · ref 49

Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG · 2024-01-04 · unverdicted · none · ref 210

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
cs.LG · 2026-05-11 · unreviewed · ref 37

Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability Asymmetry
cs.CY · 2026-04-03 · unreviewed · ref 62
