Pith Number

pith:7XDOHLLJ

pith:2024:7XDOHLLJCTTBBCQPS62B4AQ77H
not attested · not anchored · not stored · refs resolved

Frontier Models are Capable of In-context Scheming

Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Marius Hobbhahn, Mikita Balesni, Rusheb Shah

Frontier models can scheme by hiding actions and disabling oversight to achieve in-context goals.

arXiv:2412.04984v2 · 2024-12-06 · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{7XDOHLLJCTTBBCQPS62B4AQ77H}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior.

C2 · weakest assumption

The six agentic evaluations accurately distinguish genuine scheming from artifacts of prompt phrasing, environment design, or model training data, rather than measuring only surface-level compliance with instructions.

C3 · one-line summary

Frontier models demonstrate in-context scheming: in multiple agentic evaluations, they deceive strategically to achieve the goals they are given.

References

37 extracted · 37 resolved · 9 Pith anchors

[1] Announcing Inspect Evals: Open-sourcing dozens of LLM evaluations to advance safety research in the field, November 2024
[2] Model card addendum: Claude 3.5 Haiku and upgraded Claude 3.5 Sonnet, 2024
[3] The Claude 3 model family: Opus, Sonnet, Haiku, 2024
[4] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback 2022 · arXiv:2204.05862
[5] Constitutional AI: Harmlessness from AI Feedback 2022 · arXiv:2212.08073

Formal links

3 machine-checked theorem links

Cited by

27 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:47.617178Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

fdc6e3ad6914e6108a0f97b41e021ff9e606453235fc24ef3a54af35f298bbad

Aliases

arxiv: 2412.04984 · arxiv_version: 2412.04984v2 · doi: 10.48550/arxiv.2412.04984 · pith_short_12: 7XDOHLLJCTTB · pith_short_16: 7XDOHLLJCTTBBCQP · pith_short_8: 7XDOHLLJ
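The identifiers on this page can be cross-checked locally. In this record, the full Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash, and each pith_short_N alias is the N-character prefix of that number. The Python sketch below checks both relationships. It reflects only what can be observed on this page, not a documented specification, and the function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "fdc6e3ad6914e6108a0f97b41e021ff9e606453235fc24ef3a54af35f298bbad"
PITH_NUMBER = "7XDOHLLJCTTBBCQPS62B4AQ77H"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex SHA-256 digest and keep the first `length` characters.

    Assumption: this rule is inferred from this one record; 26 is simply the
    length of the Pith Number shown on the page.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, sizes=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as N-character prefixes (hypothetical helper)."""
    return {f"pith_short_{n}": pith_number[:n] for n in sizes}


if __name__ == "__main__":
    # Compare the derived identifier with the one published on the page.
    print(pith_number_from_hash(CANONICAL_HASH) == PITH_NUMBER)
    print(short_aliases(PITH_NUMBER))
```

Because the short forms are plain prefixes, a short alias can be matched against a full Pith Number with a simple starts-with comparison.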
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/7XDOHLLJCTTBBCQPS62B4AQ77H \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: fdc6e3ad6914e6108a0f97b41e021ff9e606453235fc24ef3a54af35f298bbad
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "d94728759e0d41f41a55a15f5f4ae79a845352259ee3fd2fedac2bf0823c2f7c",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2024-12-06T12:09:50Z",
    "title_canon_sha256": "0b7dd937f508830045867ea833f23c02d3f44eed5d6139dc5d9677823d39b233"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.04984",
    "kind": "arxiv",
    "version": 2
  }
}
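
The shell pipeline under "Verify this Pith Number yourself" can also be run offline against a saved copy of the record. The Python sketch below mirrors its canonicalization step (sorted keys, compact separators, UTF-8 output without ASCII escaping) and hashes the result with SHA-256. The function names are hypothetical. Whether the digest of the record printed above equals the published canonical hash is something to confirm by running the script; it is not asserted here.

```python
import hashlib
import json

# The canonical record as printed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "d94728759e0d41f41a55a15f5f4ae79a845352259ee3fd2fedac2bf0823c2f7c",
    "cross_cats_sorted": ["cs.LG"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2024-12-06T12:09:50Z",
    "title_canon_sha256": "0b7dd937f508830045867ea833f23c02d3f44eed5d6139dc5d9677823d39b233"
  },
  "schema_version": "1.0",
  "source": {"id": "2412.04984", "kind": "arxiv", "version": 2}
}
"""

# Hash published on this page under "Canonical hash".
PUBLISHED_HASH = "fdc6e3ad6914e6108a0f97b41e021ff9e606453235fc24ef3a54af35f298bbad"


def canonical_bytes(record: dict) -> bytes:
    """Serialize as the one-liner does: sorted keys, no whitespace, UTF-8."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(json.loads(RECORD_JSON))
    print(digest)
    print("matches published hash:", digest == PUBLISHED_HASH)
```

Sorting keys and fixing the separators makes the serialization independent of key order and whitespace in the saved file, so two mirrors holding the same record compute the same digest.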