Title resolution pending

Itai Shapira, Gerdus Benade, Ariel D Procaccia · 2026 · arXiv 2602.01002

10 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 1 · support 1

representative citing papers

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

cs.AI · 2026-05-27 · accept · novelty 7.0

Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition

cs.AI · 2026-04-07 · unverdicted · novelty 7.0

A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

PseudoBench shows current LLM agents produce persuasive pseudoscientific reports with near-zero refusal rates and at most 27.4% resistance.

AMEL: Accumulated Message Effects on LLM Judgments

cs.AI · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.

The Two Genie Game: Adoption and Welfare in Audit-Grounded AI Governance

cs.AI · 2026-06-27 · unverdicted · novelty 5.0

Using Moran-Fermi evolutionary dynamics, the paper derives conditions on community sentiment priors for audited-agent adoption and fixation bounds, while showing that self-audited agents are not generally sufficient to prevent harm.

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

cs.AI · 2026-05-14 · unverdicted · novelty 5.0

Pluralistic AI alignment requires surfacing value conflicts via scoping, signalling, and repair rather than preference aggregation alone, as evidenced by low repair quality on contested prompts in tested frontier models.

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

cs.AI · 2026-04-18 · unverdicted · novelty 5.0

System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.

citing papers explorer

Showing 9 of 9 citing papers after filters.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
cs.CL · 2026-05-08 · unverdicted · none · ref 250
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
cs.AI · 2026-04-07 · unverdicted · none · ref 10
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience
cs.AI · 2026-06-16 · unverdicted · none · ref 144
PseudoBench shows current LLM agents produce persuasive pseudoscientific reports with near-zero refusal rates and at most 27.4% resistance.

AMEL: Accumulated Message Effects on LLM Judgments
cs.AI · 2026-05-21 · unverdicted · none · ref 22 · 2 links
LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG · 2026-05-13 · unverdicted · none · ref 29 · 2 links
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
cs.CL · 2026-04-01 · unverdicted · none · ref 10
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.

The Two Genie Game: Adoption and Welfare in Audit-Grounded AI Governance
cs.AI · 2026-06-27 · unverdicted · partial · ref 18
Using Moran-Fermi evolutionary dynamics, the paper derives conditions on community sentiment priors for audited-agent adoption and fixation bounds, while showing that self-audited agents are not generally sufficient to prevent harm.

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
cs.AI · 2026-05-14 · unverdicted · none · ref 7
Pluralistic AI alignment requires surfacing value conflicts via scoping, signalling, and repair rather than preference aggregation alone, as evidenced by low repair quality on contested prompts in tested frontier models.

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
cs.AI · 2026-04-18 · unverdicted · none · ref 14
System 1 intuition in edge SLMs delivers 100% adversarial robustness and low latency for DAO consensus while System 2 reasoning causes 26.7% cognitive collapse and 17x slowdown.
