Pith Number

pith:3F2XMC6W

pith:2026:3F2XMC6W3K2NF6I3IUT6ZFR7GK

not attested · not anchored · not stored · refs resolved

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

Assanali Aukenov, Bin Zhang, Duzhen Zhang, Feilong Chen, Jiahua Dong, Kun Zhang, Leonard Song, Le Song, Loka Li, Noel Thomas, Shakhnazar Sailaukan, Xingbo Du, Yonghan Yang, Zixiao Wang

BioXArena tests whether LLM agents can write code to build predictive models across 76 multi-modal biomedical tasks.

arxiv:2605.15766 v1 · 2026-05-15 · cs.CE

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{3F2XMC6W3K2NF6I3IUT6ZFR7GK}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
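The page does not specify the merge algorithm, so the following is only a minimal sketch of what "deterministic" could mean here: every mirror applies the same events in the same total order, regardless of the order in which it received them. The event fields (`id`, `ts`, `field`, `value`), the sample events, and the last-writer-wins rule are all hypothetical and are not taken from the Pith schema.

```python
def merge_events(events):
    """Fold events into a state dict using an order that ignores arrival order."""
    # Assumption: an event id uniquely identifies its content, so duplicates
    # received from several mirrors can be collapsed safely.
    unique = {event["id"]: event for event in events}
    # Assumption: (timestamp, id) is a total order; ISO 8601 UTC strings of the
    # same format sort chronologically as plain strings.
    ordered = sorted(unique.values(), key=lambda event: (event["ts"], event["id"]))
    state = {}
    for event in ordered:
        state[event["field"]] = event["value"]  # later events win per field
    return state


# Hypothetical events, listed out of order on purpose.
events = [
    {"id": "e2", "ts": "2026-05-20T00:00:00Z", "field": "archive", "value": "stored"},
    {"id": "e1", "ts": "2026-05-19T00:00:00Z", "field": "archive", "value": "pending"},
    {"id": "e3", "ts": "2026-05-19T12:00:00Z", "field": "author_claim", "value": "open"},
]
state = merge_events(events)
print(state)
```

Sorting before folding is what makes two mirrors that hold the same event set agree on the resulting state; signature checking of each event is omitted from this sketch.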

Claims

C1 · strongest claim

BioXArena contains 76 end-to-end tasks across 9 domains... Agents are required to write executable code, train predictive models, and generate submissions for private test samples. MLEvolve with Gemini-3.1-Pro achieves the highest average score of 0.666, followed by GPT-5.4 with 0.636.

C2 · weakest assumption

The 76 tasks curated from primary biomedical sources into a unified framework with hidden labels and biology-aware metrics accurately and fairly measure real-world agent performance on heterogeneous multi-modal biomedical ML problems.

C3 · one-line summary

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

References

142 extracted · 142 resolved · 9 Pith anchors

[1] ReAct: Synergizing Reasoning and Acting in Language Models 2022 · arXiv:2210.03629

[2] AgentBench: Evaluating LLMs as Agents 2023 · arXiv:2308.03688

[3] Executable Code Actions Elicit Better LLM Agents 2024

[4] arXiv:2508.02744 [cs.AI] https://arxiv.org/abs/2508.02744 2025

[5] Biomni: A General-Purpose Biomedical AI Agent 2025 · bioRxiv

Formal links

2 machine-checked theorem links

Receipt and verification

First computed	2026-05-20T00:01:17.216131Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

d975760bd6dab4d2f91b4527ec963f3283b9daea33c59a4231c5ca4347f2d519

Aliases

arxiv: 2605.15766 · arxiv_version: 2605.15766v1 · doi: 10.48550/arxiv.2605.15766 · pith_short_12: 3F2XMC6W3K2N · pith_short_16: 3F2XMC6W3K2NF6I3 · pith_short_8: 3F2XMC6W
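The short aliases listed above are prefixes of the full 26-character identifier. For this record, the identifier also coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. Whether that is the general construction rule is an assumption; the page does not state it. The helper names below are illustrative only.

```python
import base64

CANONICAL_HASH = "d975760bd6dab4d2f91b4527ec963f3283b9daea33c59a4231c5ca4347f2d519"
PITH_ID = "3F2XMC6W3K2NF6I3IUT6ZFR7GK"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_id: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_id[:n] for n in lengths}


derived_id = base32_prefix(CANONICAL_HASH)
aliases = short_aliases(PITH_ID)
print(derived_id == PITH_ID)
print(aliases)
```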

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/3F2XMC6W3K2NF6I3IUT6ZFR7GK \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: d975760bd6dab4d2f91b4527ec963f3283b9daea33c59a4231c5ca4347f2d519
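The same check can be run offline against a saved copy of the canonical record. This sketch mirrors the serialization used in the one-liner above (sorted keys, compact separators, non-ASCII preserved, UTF-8 bytes); the function name `canonical_sha256` is illustrative, and the resulting digest should be compared with the canonical hash listed on this page rather than assumed.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same JSON serialization as the shell one-liner."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as published on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "916194be1e859f0ec0d7b3468a5f89ea76c158bddfa543434daf7836b127394e",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CE",
        "submitted_at": "2026-05-15T09:24:55Z",
        "title_canon_sha256": "f4c85286c339a68790f5f3701192ca912db286197d155efd57a93b03e0c4f71c",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15766", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because the keys are sorted before hashing, the digest does not depend on the key order of the JSON file the record was loaded from.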

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "916194be1e859f0ec0d7b3468a5f89ea76c158bddfa543434daf7836b127394e",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CE",
    "submitted_at": "2026-05-15T09:24:55Z",
    "title_canon_sha256": "f4c85286c339a68790f5f3701192ca912db286197d155efd57a93b03e0c4f71c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15766",
    "kind": "arxiv",
    "version": 1
  }
}