Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.
Title resolution pending
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 10roles
background 2representative citing papers
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
PseudoBench shows current LLM agents produce persuasive pseudoscientific reports with near-zero refusal rates and at most 27.4% resistance.
LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
Using Moran-Fermi evolutionary dynamics, the paper derives conditions on community sentiment priors for audited-agent adoption and fixation bounds, while showing that self-audited agents are not generally sufficient to prevent harm.
Pluralistic AI alignment requires surfacing value conflicts via scoping, signalling, and repair rather than preference aggregation alone, as evidenced by low repair quality on contested prompts in tested frontier models.
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.
citing papers explorer
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
-
PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience
PseudoBench shows current LLM agents produce persuasive pseudoscientific reports with near-zero refusal rates and at most 27.4% resistance.
-
AMEL: Accumulated Message Effects on LLM Judgments
LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
-
Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
-
The Two Genie Game: Adoption and Welfare in Audit-Grounded AI Governance
Using Moran-Fermi evolutionary dynamics, the paper derives conditions on community sentiment priors for audited-agent adoption and fixation bounds, while showing that self-audited agents are not generally sufficient to prevent harm.
-
From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
Pluralistic AI alignment requires surfacing value conflicts via scoping, signalling, and repair rather than preference aggregation alone, as evidenced by low repair quality on contested prompts in tested frontier models.
-
The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.