Pith Number

pith:XXMJ2NHH

pith:2025:XXMJ2NHHAXOHWMNFGFBXUC4CXI

not attested · not anchored · not stored · refs resolved

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan, Jiashi Feng, Lu Qi, Ming-Hsuan Yang, Shilin Xu, Shunping Ji, Tao Zhang, Xiangtai Li, Yueyi Sun, Yunhai Tong, Zilong Huang

Sa2VA unifies segmentation and language models for referring tasks on both images and videos using minimal instruction tuning.

arxiv:2501.04001 v3 · 2025-01-07 · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{XXMJ2NHHAXOHWMNFGFBXUC4CXI}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
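The property described above (any mirror recomputes the same state from the same bundle) can be illustrated with a minimal sketch. This is not Pith's actual merge algorithm, which is not given here. The event fields `at`, `field`, and `value`, and the helper names `event_id` and `merge_state`, are hypothetical, and signature checking is omitted. The sketch only shows one common way to make a merge independent of arrival order: sort events by a total order, drop duplicates, then fold them into the state.

```python
import hashlib
import json
from typing import Any


def event_id(event: dict[str, Any]) -> str:
    """Content hash of an event, used as a stable tie-breaker and dedup key."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge_state(canonical_record: dict[str, Any],
                events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a state in a fixed total order (timestamp, content hash).

    Hypothetical event shape: {"at": ISO-8601 UTC string, "field": str, "value": Any}.
    Later events overwrite earlier ones for the same field.
    """
    state: dict[str, Any] = {"record": canonical_record, "fields": {}}
    seen: set[str] = set()
    for ev in sorted(events, key=lambda e: (e["at"], event_id(e))):
        eid = event_id(ev)
        if eid in seen:  # the same event delivered twice has no extra effect
            continue
        seen.add(eid)
        state["fields"][ev["field"]] = ev["value"]
    return state


record = {"source": {"id": "2501.04001", "kind": "arxiv", "version": 3}}
events = [
    {"at": "2026-01-02T00:00:00Z", "field": "citations", "value": "open"},
    {"at": "2026-01-03T00:00:00Z", "field": "citations", "value": "closed"},
    {"at": "2026-01-01T00:00:00Z", "field": "author_claim", "value": "open"},
]
# Two mirrors that received the events in different orders agree on the state.
print(merge_state(record, events) == merge_state(record, list(reversed(events))))
```

Sorting by timestamp plus content hash gives every mirror the same sequence regardless of delivery order, which is what makes the fold reproducible.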

Claims

C1 · strongest claim

Sa2VA is the first comprehensive, unified model for dense grounded understanding of both images and videos that supports referring segmentation and conversation with minimal one-shot instruction tuning.

C2 · weakest assumption

That the LLM-generated instruction tokens can reliably guide SAM-2 to produce precise masks across complex video scenes without task-specific architectural changes or heavy fine-tuning.

C3 · one line summary

Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

References

125 extracted · 125 resolved · 21 Pith anchors

[1] VQA: Visual question answering 2015

[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966

[3] Qwen2.5-VL Technical Report 2025 · arXiv:2502.13923

[4] One token to seg them all: Language instructed reasoning segmentation in videos 2024

[5] Language models are few-shot learners 2020

Formal links

2 machine-checked theorem links

Cited by

27 papers in Pith

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Receipt and verification

First computed	2026-05-17T23:38:48.000065Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

bdd89d34e705dc7b31a531437a0b82ba1979ecf24ec99ee74181c8f9372b81a1

Aliases

arxiv: 2501.04001 · arxiv_version: 2501.04001v3 · doi: 10.48550/arxiv.2501.04001 · pith_short_12: XXMJ2NHHAXOH · pith_short_16: XXMJ2NHHAXOHWMNF · pith_short_8: XXMJ2NHH
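For this record, the identifiers above appear to be related to the canonical hash in a simple way: Base32-encoding the first 16 bytes of the SHA-256 digest and dropping the `=` padding yields the 26-character Pith Number, and the short aliases are its 8-, 12-, and 16-character prefixes. The sketch below checks that observation. Whether this is the general derivation rule for every Pith Number is an assumption, and the function name `pith_id_from_hash` is illustrative rather than part of any Pith tooling.

```python
import base64

# Canonical hash as listed on this record.
CANONICAL_HASH = "bdd89d34e705dc7b31a531437a0b82ba1979ecf24ec99ee74181c8f9372b81a1"


def pith_id_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a hex digest and strip '=' padding."""
    prefix = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(prefix).decode("ascii").rstrip("=")


full_id = pith_id_from_hash(CANONICAL_HASH)
print(full_id)       # XXMJ2NHHAXOHWMNFGFBXUC4CXI
print(full_id[:8])   # XXMJ2NHH          (pith_short_8)
print(full_id[:12])  # XXMJ2NHHAXOH      (pith_short_12)
print(full_id[:16])  # XXMJ2NHHAXOHWMNF  (pith_short_16)
```

Because each short alias is a prefix of the full identifier, a resolver could match a short form by prefix lookup. Shorter forms carry fewer bits and are more likely to collide.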

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/XXMJ2NHHAXOHWMNFGFBXUC4CXI \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: bdd89d34e705dc7b31a531437a0b82ba1979ecf24ec99ee74181c8f9372b81a1

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "af1b72fcb24f808b68a624b1c50a2756d0e0ee2cfca9157d605e2e71789746e5",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-07T18:58:54Z",
    "title_canon_sha256": "182fbd81701d2e83bb759bbc154509c069a63c6d97465120834f6e8e7d179ffb"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.04001",
    "kind": "arxiv",
    "version": 3
  }
}
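The same check can be run offline against the canonical record shown above, without fetching anything. The sketch below mirrors the serialization used in the curl pipeline (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) and prints the resulting SHA-256 digest. The function name `canonical_sha256` is illustrative. The digest is not asserted here: if the record as displayed is byte-for-byte the one that was hashed, it should equal the canonical hash listed on this page, but that has not been independently confirmed.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "af1b72fcb24f808b68a624b1c50a2756d0e0ee2cfca9157d605e2e71789746e5",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-07T18:58:54Z",
    "title_canon_sha256": "182fbd81701d2e83bb759bbc154509c069a63c6d97465120834f6e8e7d179ffb"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.04001",
    "kind": "arxiv",
    "version": 3
  }
}
"""

EXPECTED = "bdd89d34e705dc7b31a531437a0b82ba1979ecf24ec99ee74181c8f9372b81a1"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


digest = canonical_sha256(json.loads(RECORD_JSON))
print(digest)
print("matches canonical hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators removes the two main sources of variation between JSON serializers, so whitespace and key order in the displayed record do not affect the digest.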