A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
hub
The Alignment Problem from a Deep Learning Perspective , May 2025
16 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
A disinterested Bayesian Predictor trained on contextualized statements has low probability of producing harmful agency because dangerous behaviors require rare coordinated underestimation of harm with no training signal favoring them.
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Characterizes spurious correlation mechanisms in preference optimization via mean spurious bias and causal-spurious correlation leakage, demonstrates irreducible vulnerability to distribution shift, and introduces tie training as selective mitigation with validation on log-linear models and empirica
Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.
World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.
Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.
Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.
Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.
AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
Prompt framing significantly shifts LLM choices toward risk-averse options in a threshold voting task even when the prompts are logically equivalent.
The ORS framework supplies a CPST ontology, graded digital personhood spectrum, and Cybersophy ethics to guide governance of synthetic minds.
The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
citing papers explorer
-
Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
Prompt framing significantly shifts LLM choices toward risk-averse options in a threshold voting task even when the prompts are logically equivalent.