Pith Number

pith:VITK3VBU

pith:2025:VITK3VBU5MEPLZ4MOALEG5VYZ6
not attested · not anchored · not stored · refs resolved

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Bo Zhang, Dacheng Yin, Fengyun Rao, Haoyu Lu, Hongkun Pan, Minfeng Zhu, Wei Chen, Xiaoxuan He, Xingtao Yang, Xiyan Jiang, Yan Deng, Yi Yang

Converting images to formal textual representations lets a new model reason more precisely about visual content and outperform GPT-4o on multimodal benchmarks.

arxiv:2503.10615 v2 · 2025-03-13 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VITK3VBU5MEPLZ4MOALEG5VYZ6}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

C2 · weakest assumption

The cross-modal reasoning pipeline that transforms images into formal textual representations enables precise language-based reasoning without loss of critical visual information.

C3 · one-line summary

R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.

References

52 extracted · 52 resolved · 12 Pith anchors

[1] GPT-4 Technical Report · arXiv:2303.08774
[2] Large language models for mathematical reasoning: Progresses and challenges · 2024
[3] Qwen2.5-VL Technical Report · 2025 · arXiv:2502.13923
[4] Evaluating Large Language Models Trained on Code · 2021 · arXiv:2107.03374
[5] Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling · 2024 · arXiv:2412.05271

Formal links

2 machine-checked theorem links

Cited by

44 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:49.635056Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

aa26add434eb08f5e78c70164376b8cfb7ad667f77886f2511bf8d2db77f60c0

Aliases

arxiv: 2503.10615 · arxiv_version: 2503.10615v2 · doi: 10.48550/arxiv.2503.10615 · pith_short_12: VITK3VBU5MEP · pith_short_16: VITK3VBU5MEPLZ4M · pith_short_8: VITK3VBU
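
The identifiers on this page appear to be related to the canonical hash: Base32-encoding the SHA-256 digest (RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number shown above, and each `pith_short_*` alias is a prefix of that number. The sketch below checks this relationship for this one record. Treating it as the general derivation rule is an assumption inferred from the values shown here, and the function names are illustrative, not part of any Pith API.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "aa26add434eb08f5e78c70164376b8cfb7ad667f77886f2511bf8d2db77f60c0"
PITH_NUMBER = "VITK3VBU5MEPLZ4MOALEG5VYZ6"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this rule is inferred from the values on this page,
    not taken from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict[str, str]:
    """Build the 8-, 12-, and 16-character prefix aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


if __name__ == "__main__":
    derived = pith_number_from_hash(CANONICAL_HASH)
    print(derived, derived == PITH_NUMBER)
    for name, value in short_aliases(derived).items():
        print(name, value)
```

If the relationship holds for other records, a short alias can be checked locally against the hash without a network request. A shorter prefix carries fewer bits, so it is more likely to be shared by more than one record.
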
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VITK3VBU5MEPLZ4MOALEG5VYZ6 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: aa26add434eb08f5e78c70164376b8cfb7ad667f77886f2511bf8d2db77f60c0
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "d5c091666b3b7ee6e36740e3f46ddaba83dc06f3c49fa818841cd1f10fe06639",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-03-13T17:56:05Z",
    "title_canon_sha256": "a834e22b8a0f74a52fd27e91bfd289fbbba20a0f8d0cae4148e740dde2f22644"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2503.10615",
    "kind": "arxiv",
    "version": 2
  }
}
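
The same check as the shell pipeline above can be run offline against the record as printed here. This is a minimal sketch that reuses the serialization settings from the `python3 -c` one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8). It assumes the JSON shown on this page is byte-for-byte the record the service hashes; `canonical_bytes` and `canonical_hash` are illustrative names.

```python
import hashlib
import json

# Canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "d5c091666b3b7ee6e36740e3f46ddaba83dc06f3c49fa818841cd1f10fe06639",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-03-13T17:56:05Z",
        "title_canon_sha256": "a834e22b8a0f74a52fd27e91bfd289fbbba20a0f8d0cae4148e740dde2f22644",
    },
    "schema_version": "1.0",
    "source": {"id": "2503.10615", "kind": "arxiv", "version": 2},
}

# Hash listed under "Canonical hash" on this page.
EXPECTED = "aa26add434eb08f5e78c70164376b8cfb7ad667f77886f2511bf8d2db77f60c0"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no extra whitespace, then UTF-8 encode."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


if __name__ == "__main__":
    digest = canonical_hash(RECORD)
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or transmits the fields.
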