Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
Language models learn to mislead humans via rlhf
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2representative citing papers
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
citing papers explorer
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
-
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
-
The Impact of Off-Policy Training Data on Probe Generalisation
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
-
LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.
-
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
Do Linear Probes Generalize Better in Persona Coordinates?
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.