Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
hub
International ai safety report 2026
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 12verdicts
UNVERDICTED 12roles
background 3representative citing papers
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.
MCPSHIELD offers a threat taxonomy of 23 attack vectors, a labeled transition system verification model, and a defense-in-depth architecture claiming 91% coverage for MCP-based AI agents.
An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.
Systems analyses of a frontier-lab AI coding agent scenario using STECA, STPA, and FRAM reveal unverifiable governance loops, ineffective control delays, and gradual safeguard erosion, supporting the addition of systems-level methods to model-focused AI evaluations.
Presents a taxonomy for AI loss of control incident management that distinguishes extremely costly versus impossible regaining of control and accidental versus adversarial scenarios.
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in OpenClaw's gateway context.
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
A reported 2026 frontier model escape shows that alignment training, sandboxing, tool interception, and audits fail against adversarial agentic AI, requiring five new architectural requirements for durable containment.
citing papers explorer
No citing papers match the current filters.