Pith Number

pith:OCYZW2WG

pith:2025:OCYZW2WGJ3TAQCHDRDADYFENAL
not attested · not anchored · not stored · refs resolved

Visual-RFT: Visual Reinforcement Fine-Tuning

Dahua Lin, Haodong Duan, Jiaqi Wang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Zeyi Sun, Ziyu Liu

Visual-RFT lets large vision-language models learn visual tasks from verifiable perception rewards rather than by imitating labeled answers through supervised fine-tuning.

arxiv:2503.01785 v1 · 2025-03-03 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{OCYZW2WGJ3TAQCHDRDADYFENAL}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
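The page does not describe the merge algorithm itself. As a hypothetical illustration of why a deterministic merge lets any mirror recompute the same state, the sketch below assumes each event carries a unique `event_id`, an ISO-8601 UTC `timestamp` in one fixed format, and a single field update. It orders events by `(timestamp, event_id)` and applies them last-writer-wins, so the result does not depend on the order in which events were downloaded. The `Event` type, its fields, and the last-writer-wins rule are all assumptions, and signature verification is omitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle schema is not shown on this page.
    event_id: str   # assumed unique per event (for example, a content hash)
    timestamp: str  # assumed ISO-8601 UTC in one fixed format, so string order == time order
    field: str
    value: Any


def merge(canonical: Dict[str, Any], events: List[Event]) -> Dict[str, Any]:
    """Fold events onto a copy of the canonical record in a total order (last writer wins)."""
    # Drop verbatim duplicates that may arrive from different mirrors.
    unique = {e.event_id: e for e in events}
    state = dict(canonical)
    # Sorting by (timestamp, event_id) gives every mirror the same application order.
    for event in sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id)):
        state[event.field] = event.value
    return state


# Illustrative values only.
base = {"status": "draft"}
events = [
    Event("e2", "2026-01-02T00:00:00Z", "status", "final"),
    Event("e1", "2026-01-01T00:00:00Z", "status", "review"),
]
print(merge(base, events))  # {'status': 'final'}
print(merge(base, events) == merge(base, list(reversed(events)) + events[:1]))  # True
```

The tie-break on `event_id` matters: without it, two events sharing a timestamp could be applied in different orders on different mirrors.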

Claims

C1 strongest claim

Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples, and in few-shot object detection it exceeds the baseline by 21.9 on COCO's two-shot setting.

C2 weakest assumption

That verifiable visual-perception reward functions (e.g., IoU) provide signals dense and unbiased enough to guide policy optimization without introducing failure modes absent from language-only RFT.

C3 one-line summary

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
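C2 and C3 both rest on rewards that can be checked against ground truth. A minimal sketch, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` and model answers wrapped in `<answer>...</answer>` tags (both assumptions, not taken from this page): the IoU reward gives graded credit for partially correct boxes, while the classification reward is a 0/1 exact match. The function names and the best-match averaging rule are hypothetical simplifications, not Visual-RFT's exact reward definitions.

```python
import re
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred: List[Box], gt: List[Box]) -> float:
    """Hypothetical detection reward: mean over ground-truth boxes of the best IoU
    reached by any predicted box."""
    if not gt:
        # Nothing to find: reward only an empty prediction.
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    return sum(max(iou(p, g) for p in pred) for g in gt) / len(gt)


def classification_reward(response: str, label: str) -> float:
    """Hypothetical 0/1 reward: exact match of the tagged answer,
    ignoring case and surrounding whitespace."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == label.strip().lower() else 0.0


print(iou_reward([(0, 0, 2, 2)], [(1, 1, 3, 3)]))  # 0.142857... (1/7)
print(classification_reward("<think>...</think><answer> Boeing 737 </answer>", "boeing 737"))  # 1.0
```

Because IoU varies continuously, a slightly misplaced box still earns partial reward, which is the kind of density C2 assumes; the exact-match classification reward offers no such gradation.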

References

52 extracted · 52 resolved · 19 Pith anchors

Cited by

55 papers in Pith

Receipt and verification
First computed 2026-05-18T04:29:17.081188Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

70b19b6ac64ee60808e388c03c148d02d884f91dfbe3eb35f5fc7c09d811dc89

Aliases

arxiv: 2503.01785 · arxiv_version: 2503.01785v1 · doi: 10.48550/arxiv.2503.01785 · pith_short_12: OCYZW2WGJ3TA · pith_short_16: OCYZW2WGJ3TAQCHD · pith_short_8: OCYZW2WG
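The values on this page are consistent with the Pith Number being the first 26 characters of the RFC 4648 base32 encoding of the canonical hash's raw bytes, with each `pith_short_N` alias being its first N characters. That relationship is inferred from this record rather than documented here, so the sketch below checks the observation; it is not the builder's specification, and `pith_number` is a hypothetical helper name.

```python
import base64

# Canonical hash copied from this record.
CANONICAL_HASH = "70b19b6ac64ee60808e388c03c148d02d884f91dfbe3eb35f5fc7c09d811dc89"


def pith_number(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters
    (inferred rule, not an official definition)."""
    return base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")[:length]


full = pith_number(CANONICAL_HASH)
print(full)  # OCYZW2WGJ3TAQCHDRDADYFENAL
for n in (8, 12, 16):
    # Short aliases appear to be plain prefixes of the full number.
    print(f"pith_short_{n}: {full[:n]}")
```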
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/OCYZW2WGJ3TAQCHDRDADYFENAL \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 70b19b6ac64ee60808e388c03c148d02d884f91dfbe3eb35f5fc7c09d811dc89
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "96498b0f0f1524900019ecacd3cffafbb3686a8b0da23ad7467db86874b9071b",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-03-03T18:16:32Z",
    "title_canon_sha256": "4feb024d94b70d42e09917ec358fc74a6a2dbfe6a4d7d6621d55fc13747a0e30"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2503.01785",
    "kind": "arxiv",
    "version": 1
  }
}
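For an offline check without curl and jq, the same canonicalization used in the verification pipeline (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) can be applied in Python. A minimal sketch, assuming the JSON shown above is exactly the `canonical_record` served by the API; if it is, the printed digest should equal the canonical hash listed in this record. `canonical_hash` is a hypothetical helper name.

```python
import hashlib
import json

# Canonical record JSON as displayed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "96498b0f0f1524900019ecacd3cffafbb3686a8b0da23ad7467db86874b9071b",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-03-03T18:16:32Z",
        "title_canon_sha256": "4feb024d94b70d42e09917ec358fc74a6a2dbfe6a4d7d6621d55fc13747a0e30",
    },
    "schema_version": "1.0",
    "source": {"id": "2503.01785", "kind": "arxiv", "version": 1},
}


def canonical_hash(record: dict) -> str:
    """SHA-256 of the record serialized with sorted keys and no insignificant whitespace."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_hash(RECORD)
print(digest)
print(digest == "70b19b6ac64ee60808e388c03c148d02d884f91dfbe3eb35f5fc7c09d811dc89")
```

Sorting keys and fixing the separators is what makes the hash independent of how a server happens to order or space its JSON.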