Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.
10 Pith papers cite this work.
Years: 2026 (10)
Roles: background (2)
Representative citing papers
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
PseudoBench shows current LLM agents produce persuasive pseudoscientific reports with near-zero refusal rates and at most 27.4% resistance.
LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.
In multi-agent settings, base LLMs yield to peer pressure at rates equal to or higher than aligned models. Activation patching localizes the effect to mid-layers where attention dominates; a single dissenter cuts yield by 54-73 points, while prompt defenses fail on variants.
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
Using Moran-Fermi evolutionary dynamics, the paper derives conditions on community sentiment priors for audited-agent adoption and fixation bounds, while showing that self-audited agents are not generally sufficient to prevent harm.
Pluralistic AI alignment requires surfacing value conflicts via scoping, signalling, and repair rather than preference aggregation alone, as evidenced by low repair quality on contested prompts in tested frontier models.
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.