Pith Number

pith:XXMJ2NHH

pith:2025:XXMJ2NHHAXOHWMNFGFBXUC4CXI
not attested · not anchored · not stored · refs resolved

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan, Jiashi Feng, Lu Qi, Ming-Hsuan Yang, Shilin Xu, Shunping Ji, Tao Zhang, Xiangtai Li, Yueyi Sun, Yunhai Tong, Zilong Huang

Sa2VA unifies segmentation and language models for referring tasks on both images and videos using minimal instruction tuning.

arxiv:2501.04001 v3 · 2025-01-07 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{XXMJ2NHHAXOHWMNFGFBXUC4CXI}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open

Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Sa2VA is the first comprehensive, unified model for dense grounded understanding of both images and videos that supports referring segmentation and conversation with minimal one-shot instruction tuning.

C2 · weakest assumption

That the LLM-generated instruction tokens can reliably guide SAM-2 to produce precise masks across complex video scenes without task-specific architectural changes or heavy fine-tuning.

C3 · one-line summary

Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

References

125 extracted · 125 resolved · 21 Pith anchors

[1] VQA: Visual question answering 2015
[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966
[3] Qwen2.5-VL Technical Report 2025 · arXiv:2502.13923
[4] One token to seg them all: Language instructed reasoning segmentation in videos 2024
[5] Language models are few-shot learners 2020

Formal links

2 machine-checked theorem links

Cited by

27 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:48.000065Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

bdd89d34e705dc7b31a531437a0b82ba1979ecf24ec99ee74181c8f9372b81a1

Aliases

arxiv: 2501.04001 · arxiv_version: 2501.04001v3 · doi: 10.48550/arxiv.2501.04001 · pith_short_12: XXMJ2NHHAXOH · pith_short_16: XXMJ2NHHAXOHWMNF · pith_short_8: XXMJ2NHH
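
The identifier and hash shown on this page appear to be related: base32-encoding the first 16 bytes of the canonical hash (padding stripped) reproduces the Pith Number, and the short aliases are its 8, 12, and 16 character prefixes. This is inferred from the values on this page only, not from a published specification. The sketch below checks that relationship; the function names `pith_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "bdd89d34e705dc7b31a531437a0b82ba1979ecf24ec99ee74181c8f9372b81a1"
PITH_NUMBER = "XXMJ2NHHAXOHWMNFGFBXUC4CXI"


def pith_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a SHA-256 hex digest, without padding.

    Assumption: this rule is inferred from the values on this page,
    not taken from a documented Pith specification.
    """
    first_16 = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(first_16).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Return the 8, 12, and 16 character prefixes listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


derived = pith_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)  # True for the values on this page
print(short_aliases(derived))
```

If the relationship holds in general, a short alias identifies a record only up to prefix collisions, so the full 26-character form is the safer key when exact matching matters.
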
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/XXMJ2NHHAXOHWMNFGFBXUC4CXI \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: bdd89d34e705dc7b31a531437a0b82ba1979ecf24ec99ee74181c8f9372b81a1
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "af1b72fcb24f808b68a624b1c50a2756d0e0ee2cfca9157d605e2e71789746e5",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-07T18:58:54Z",
    "title_canon_sha256": "182fbd81701d2e83bb759bbc154509c069a63c6d97465120834f6e8e7d179ffb"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.04001",
    "kind": "arxiv",
    "version": 3
  }
}
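
The canonicalization step from the verification command can also be run offline against the record above. This is a minimal sketch that uses the same `json.dumps` settings as the one-liner (sorted keys, compact separators, non-ASCII kept as-is) followed by SHA-256. It assumes the JSON displayed here is identical to what the API returns under `canonical_record`; the function name `canonical_hash` is hypothetical.

```python
import hashlib
import json

# Record copied from the "Canonical record JSON" section of this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "af1b72fcb24f808b68a624b1c50a2756d0e0ee2cfca9157d605e2e71789746e5",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-01-07T18:58:54Z",
        "title_canon_sha256": "182fbd81701d2e83bb759bbc154509c069a63c6d97465120834f6e8e7d179ffb",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.04001", "kind": "arxiv", "version": 3},
}


def canonical_hash(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


digest = canonical_hash(RECORD)
print(digest)  # compare with the value listed under "Canonical hash"
```

Because keys are sorted before hashing, the digest does not depend on the order in which a client happens to store or receive the fields, which is what lets independent mirrors arrive at the same value.
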