Pith Number

pith:NUZR3R2G

pith:2026:NUZR3R2GAA26OHQIZEAP2XS4NX

not attested · not anchored · not stored · refs resolved

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

Adarsh Kumarappan, Ananya Mujoo

Pretrained base models exhibit the same or higher yield to simulated peer disagreement as their RLHF-tuned counterparts, localizing the issue to mid-layer attention rather than alignment.

arxiv:2605.12991 v1 · 2026-05-13 · cs.LG · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{NUZR3R2GAA26OHQIZEAP2XS4NX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Pretrained base models exhibit the same substitution pattern as their Instruct variants and, on average, show higher yield than the Instruct models. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight.

C2 · weakest assumption

The analysis assumes that the simulated peer disagreement in the experimental setup accurately captures the dynamics of real multi-agent LLM pipelines, and that yield directly measures sycophancy rather than other forms of uncertainty or context sensitivity.

C3 · one-line summary

Pretrained base models exhibit higher yield to peer disagreement than their RLHF-tuned Instruct variants; the effect is localized to mid-layer attention and is mitigated by structured dissent rather than by prompt defenses.

References

39 extracted · 39 resolved · 26 Pith anchors

[1] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · arXiv:2404.14219

[2] Constitutional AI: Harmlessness from AI Feedback · arXiv:2212.08073

[3] Small Language Models are the Future of Agentic AI · arXiv:2506.02153

[4] Eliciting Latent Predictions from Transformers with the Tuned Lens · arXiv:2303.08112

[5] Measuring Progress on Scalable Oversight for Large Language Models · arXiv:2211.03540

Receipt and verification

First computed	2026-05-18T03:09:00.533726Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649

Aliases

arxiv: 2605.12991 · arxiv_version: 2605.12991v1 · doi: 10.48550/arxiv.2605.12991 · pith_short_12: NUZR3R2GAA26 · pith_short_16: NUZR3R2GAA26OHQI · pith_short_8: NUZR3R2G
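The short aliases are truncations of the full Pith Number, and the full number itself appears to be derived from the canonical hash: the 26 characters shown on this record equal the first 26 characters of the RFC 4648 base32 encoding of the SHA-256 digest above (26 × 5 = 130 bits of the digest). The page does not state this rule, so the sketch below treats it as an assumption inferred from this record's values; `pith_from_hash` and `short_aliases` are hypothetical helper names, not part of any published Pith API.

```python
import base64

# Values copied verbatim from this record.
CANONICAL_HASH = "6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649"
PITH_NUMBER = "NUZR3R2GAA26OHQIZEAP2XS4NX"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    # Assumed rule (inferred, not documented): base32-encode the raw
    # 32-byte digest and keep the first `length` characters.
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith: str) -> dict:
    # The pith_short_N aliases listed above are the first N characters.
    return {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}


print(pith_from_hash(CANONICAL_HASH) == PITH_NUMBER)  # True
print(short_aliases(PITH_NUMBER))
```

Both checks agree with the values printed on this page, which is consistent with (but does not prove) the assumed derivation.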

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/NUZR3R2GAA26OHQIZEAP2XS4NX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "fd6bf71118822b5767fbb0c2cfba5854e4f9bcc37b7883da0aa25e25dbf3c215",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T04:45:08Z",
    "title_canon_sha256": "7eed6309f5d7ee2b84ef9f6e1749195760118488965057c30381d575a094d572"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12991",
    "kind": "arxiv",
    "version": 1
  }
}
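For an offline check, the canonical record above can be serialized and hashed with the Python standard library alone, using the same `json.dumps` settings as the verification command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes). This is a sketch assuming the record is reproduced here exactly as the server returns it; `canonical_bytes` and `canonical_hash` are hypothetical helper names, and a difference in even one character will change the digest.

```python
import hashlib
import json

# Canonical record as displayed above (assumed to be reproduced verbatim).
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "fd6bf71118822b5767fbb0c2cfba5854e4f9bcc37b7883da0aa25e25dbf3c215",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T04:45:08Z",
        "title_canon_sha256": "7eed6309f5d7ee2b84ef9f6e1749195760118488965057c30381d575a094d572",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12991", "kind": "arxiv", "version": 1},
}

EXPECTED = "6d331dc7460035e71e08c900fd5e5c6dd30c429247b06b189df0eac05061c649"


def canonical_bytes(record: dict) -> bytes:
    # Mirrors the one-liner: sorted keys, no whitespace, non-ASCII kept as UTF-8.
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def canonical_hash(record: dict) -> str:
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_hash(CANONICAL_RECORD)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, two parties holding the same record compute the same digest regardless of the key order in their local copy.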