Pith Number

pith:Q2EHECB5

pith:2026:Q2EHECB53ISG2NKI3ZEONCMJHK
not attested · not anchored · not stored · refs resolved

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chen (Cherise) Chen, Chengzhi Shen, Daniel Rueckert, Jiazhen Pan, Jun Li, Tobias Susetzky, Weixiang Shen, Xuepeng Zhang, Yuyuan Liu, Zhenyu Gong

Large language models perform poorly on realistic long-context ICU data, revealing recall-safety tradeoffs and anchoring biases in clinical reasoning.

arxiv:2605.13542 v1 · 2026-05-13 · cs.AI · cs.CL · cs.LG · cs.MA

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{Q2EHECB53ISG2NKI3ZEONCMJHK}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Existing LLMs, including memory-augmented ones, performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff in clinical recommendations and an anchoring bias toward early interpretations of the patient.

C2 · weakest assumption

That senior physicians' hindsight review of full trajectories produces reliable ground-truth labels for optimal actions and red flags, despite the original clinicians operating under incomplete real-time information.

C3 · one-line summary

RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.

Receipt and verification
First computed: 2026-05-18T02:44:24.014076Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

868872083dda246d3548de48e689893aa0b1fa7a32f985a4e5d9e8055dbe063a

Aliases

arxiv: 2605.13542 · arxiv_version: 2605.13542v1 · doi: 10.48550/arxiv.2605.13542 · pith_short_12: Q2EHECB53ISG · pith_short_16: Q2EHECB53ISG2NKI · pith_short_8: Q2EHECB5
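
The short aliases above are prefixes of the full Pith Number, and the number itself appears to be derivable from the canonical hash: Base32-encoding the SHA-256 digest listed under "Canonical hash" and keeping the first 26 characters reproduces Q2EHECB53ISG2NKI3ZEONCMJHK. The page does not document this derivation, so the Python sketch below should be read as an observation about this one record, not as a specification. The function names are illustrative and not part of any Pith API.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "868872083dda246d3548de48e689893aa0b1fa7a32f985a4e5d9e8055dbe063a"
PITH_NUMBER = "Q2EHECB53ISG2NKI3ZEONCMJHK"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest (RFC 4648 alphabet) and keep a prefix.

    Assumption: this rule is inferred from the record shown here,
    not taken from Pith documentation.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Return the 8-, 12-, and 16-character prefixes listed as short aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


if __name__ == "__main__":
    # Compare the derived identifier with the one printed on the page.
    print(pith_number_from_hash(CANONICAL_HASH) == PITH_NUMBER)
    print(short_aliases(PITH_NUMBER))
```

If the derivation holds in general, a short alias identifies a record only up to prefix collisions, so the full 26-character number is the safer key when storing or citing a record.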
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/Q2EHECB53ISG2NKI3ZEONCMJHK \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 868872083dda246d3548de48e689893aa0b1fa7a32f985a4e5d9e8055dbe063a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ebf3cc96a0ae6e82b7385eb942376cf40249049280662b4bcdd3b33b57c39deb",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.LG",
      "cs.MA"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-13T13:52:42Z",
    "title_canon_sha256": "516a2547d9b5616c9ec4bbe1fd0364d539bd38e6af1b8f2f84ca259cad48900c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13542",
    "kind": "arxiv",
    "version": 1
  }
}
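
The verification one-liner under "Agent API" can also be run offline. The sketch below applies the same canonicalization as that command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and hashes the result with SHA-256. It assumes the JSON printed above is exactly what the API returns in `canonical_record`; if the served record differs in any field, the digest will not match the published canonical hash. The helper name `canonical_sha256` is illustrative.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as printed on this page (assumed identical to the API response).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "ebf3cc96a0ae6e82b7385eb942376cf40249049280662b4bcdd3b33b57c39deb",
    "cross_cats_sorted": ["cs.CL", "cs.LG", "cs.MA"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.AI",
    "submitted_at": "2026-05-13T13:52:42Z",
    "title_canon_sha256": "516a2547d9b5616c9ec4bbe1fd0364d539bd38e6af1b8f2f84ca259cad48900c"
  },
  "schema_version": "1.0",
  "source": {"id": "2605.13542", "kind": "arxiv", "version": 1}
}
"""

# Hash published on this page under "Canonical hash".
EXPECTED = "868872083dda246d3548de48e689893aa0b1fa7a32f985a4e5d9e8055dbe063a"

if __name__ == "__main__":
    digest = canonical_sha256(json.loads(RECORD_JSON))
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON, which is what lets a mirror recompute the same hash from its own copy of the record.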