Pith Number

pith:NUZR3R2G

pith:2026:NUZR3R2GAA26OHQIZEAP2XS4NX
not attested · not anchored · not stored · refs resolved

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

Adarsh Kumarappan, Ananya Mujoo

Pretrained base models exhibit the same or higher yield to simulated peer disagreement as their RLHF-tuned counterparts, localizing the issue to mid-layer attention rather than alignment.

arxiv:2605.12991 v1 · 2026-05-13 · cs.LG · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{NUZR3R2GAA26OHQIZEAP2XS4NX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight.

C2 · weakest assumption

The simulated peer disagreement in the experimental setup accurately captures the dynamics of real multi-agent LLM pipelines, and yield directly measures sycophancy rather than other forms of uncertainty or context sensitivity.

C3 · one-line summary

Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

Receipt and verification
First computed: 2026-05-18T03:09:00.533726Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649

Aliases

arxiv: 2605.12991 · arxiv_version: 2605.12991v1 · doi: 10.48550/arxiv.2605.12991 · pith_short_12: NUZR3R2GAA26 · pith_short_16: NUZR3R2GAA26OHQI · pith_short_8: NUZR3R2G
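
The short aliases above are prefixes of the full Pith Number, and on this record the full number also matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That relationship is an observation from the values on this page, not a documented rule of the pith-number/v1.0 schema. The function name `pith_number_from_hash` and the default length of 26 are assumptions made for illustration. A minimal Python sketch:

```python
import base64

# Canonical hash and alias lengths as shown on this page.
CANONICAL_HASH = "6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649"
SHORT_LENGTHS = (8, 12, 16)


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    Assumption: the identifier is a truncated Base32 digest. This matches
    the record on this page but is not confirmed by the schema.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in SHORT_LENGTHS}

print(full)     # NUZR3R2GAA26OHQIZEAP2XS4NX
print(aliases)  # {'pith_short_8': 'NUZR3R2G', 'pith_short_12': 'NUZR3R2GAA26', 'pith_short_16': 'NUZR3R2GAA26OHQI'}
```

If the relationship holds for other records, a short alias can be checked against a canonical hash without a network call. Treat a mismatch as a prompt to consult the schema rather than as proof of tampering.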
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/NUZR3R2GAA26OHQIZEAP2XS4NX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "fd6bf71118822b5767fbb0c2cfba5854e4f9bcc37b7883da0aa25e25dbf3c215",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T04:45:08Z",
    "title_canon_sha256": "7eed6309f5d7ee2b84ef9f6e1749195760118488965057c30381d575a094d572"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12991",
    "kind": "arxiv",
    "version": 1
  }
}
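
The hashing step of the shell command above can also be sketched as a standalone Python script, which may be easier to audit than a piped one-liner. It assumes the JSON shown under "Canonical record JSON" is the complete canonical record and that canonicalization is exactly what the one-liner does: sorted keys, compact separators, no ASCII escaping, UTF-8 bytes. The function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Canonical record copied from this page (assumed complete).
record = {
    "metadata": {
        "abstract_canon_sha256": "fd6bf71118822b5767fbb0c2cfba5854e4f9bcc37b7883da0aa25e25dbf3c215",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T04:45:08Z",
        "title_canon_sha256": "7eed6309f5d7ee2b84ef9f6e1749195760118488965057c30381d575a094d572",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12991", "kind": "arxiv", "version": 1},
}

# Hash the page asks verifiers to expect.
EXPECTED = "6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649"


def canonical_sha256(obj: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores the record's fields. Any change to a value, including whitespace inside a string, produces a different digest.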