Pith Number

pith:OCYZW2WG

pith:2025:OCYZW2WGJ3TAQCHDRDADYFENAL

not attested · not anchored · not stored · refs resolved

Visual-RFT: Visual Reinforcement Fine-Tuning

Dahua Lin, Haodong Duan, Jiaqi Wang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Zeyi Sun, Ziyu Liu

Visual-RFT lets large vision-language models learn visual tasks from verifiable perception rewards, using only a small amount of fine-tuning data.

arxiv:2503.01785 v1 · 2025-03-03 · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{OCYZW2WGJ3TAQCHDRDADYFENAL}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples, and in few-shot object detection it exceeds the baseline by 21.9 on COCO's two-shot setting.

C2 · weakest assumption

The verifiable visual perception reward functions (e.g., IoU) are assumed to provide sufficiently dense and unbiased signals to guide policy optimization, without introducing failure modes that are absent from language-only RFT.

C3 · one-line summary

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
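The claims name IoU as an example of a verifiable perception reward. As a minimal sketch of what such a signal could look like, the function below computes intersection-over-union for two axis-aligned boxes. The corner format `(x1, y1, x2, y2)` and the name `iou` are assumptions made for illustration; this is not the paper's reward implementation, which may combine IoU with other terms.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate boxes with zero total area get a reward of 0.
    return inter / union if union > 0 else 0.0
```

Because the score depends only on the predicted and reference boxes, it can be checked mechanically, which is the sense in which the reward is verifiable. It is also zero for every non-overlapping prediction, which is one way the density concern raised in C2 could show up.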

References

52 extracted · 52 resolved · 19 Pith anchors

[1] LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models

[2] InternLM2 Technical Report 2024 · arXiv:2403.17297

[3] Grounding large language models in interactive environments with online reinforcement learning 2023

[4] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 2025 · arXiv:2501.12948

[5] LVIS: A dataset for large vocabulary instance segmentation 2019

Cited by

55 papers in Pith

LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Grounded Reinforcement Learning for Visual Reasoning

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Receipt and verification

First computed	2026-05-18T04:29:17.081188Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

70b19b6ac64ee60808e388c03c148d02d884f91dfbe3eb35f5fc7c09d811dc89

Aliases

arxiv: 2503.01785 · arxiv_version: 2503.01785v1 · doi: 10.48550/arxiv.2503.01785 · pith_short_12: OCYZW2WGJ3TA · pith_short_16: OCYZW2WGJ3TAQCHD · pith_short_8: OCYZW2WG
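The short aliases listed here are prefixes of the full 26-character identifier. For this particular record, the identifier also coincides with the first 26 characters of the Base32 encoding of the canonical hash. That second relationship is an observation about the values on this page, not a documented rule of the scheme, and the helper name `base32_prefix` is hypothetical. A minimal sketch that checks both:

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "70b19b6ac64ee60808e388c03c148d02d884f91dfbe3eb35f5fc7c09d811dc89"
PITH_NUMBER = "OCYZW2WGJ3TAQCHDRDADYFENAL"
ALIASES = {
    "pith_short_8": "OCYZW2WG",
    "pith_short_12": "OCYZW2WGJ3TA",
    "pith_short_16": "OCYZW2WGJ3TAQCHD",
}


def base32_prefix(hex_digest: str, length: int) -> str:
    """First `length` characters of the RFC 4648 Base32 form of a hex digest."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii").rstrip("=")[:length]


# Each short alias is a plain prefix of the full identifier.
for name, alias in ALIASES.items():
    assert PITH_NUMBER.startswith(alias), name

# Observed for this record only: identifier == Base32(hash)[:26].
derived = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))
print(derived == PITH_NUMBER)
```

If the relationship holds in general, a short alias narrows down the hash without determining it, so the full canonical hash is still the value to compare when verifying.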

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/OCYZW2WGJ3TAQCHDRDADYFENAL \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 70b19b6ac64ee60808e388c03c148d02d884f91dfbe3eb35f5fc7c09d811dc89

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "96498b0f0f1524900019ecacd3cffafbb3686a8b0da23ad7467db86874b9071b",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-03-03T18:16:32Z",
    "title_canon_sha256": "4feb024d94b70d42e09917ec358fc74a6a2dbfe6a4d7d6621d55fc13747a0e30"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2503.01785",
    "kind": "arxiv",
    "version": 1
  }
}
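The serialization used in the verification command (sorted keys, compact separators, UTF-8 without ASCII escaping) can also be applied offline to the record shown above. The sketch below does that in Python; the function names are hypothetical. Whether the resulting digest equals the canonical hash depends on the displayed JSON being exactly what the resolver returns, which is assumed here rather than confirmed.

```python
import hashlib
import json

# The canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "96498b0f0f1524900019ecacd3cffafbb3686a8b0da23ad7467db86874b9071b",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-03-03T18:16:32Z",
    "title_canon_sha256": "4feb024d94b70d42e09917ec358fc74a6a2dbfe6a4d7d6621d55fc13747a0e30"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2503.01785",
    "kind": "arxiv",
    "version": 1
  }
}
"""


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no whitespace, as the shell command does."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Sorting keys and fixing the separators is what makes the digest independent of how the JSON was formatted or ordered in transit, so two parties holding the same record should compute the same value.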