Pith Number

pith:GPX7CXXA

pith:2024:GPX7CXXA6TXOGXF3CZZDBISFOU

not attested · not anchored · not stored · refs resolved

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Dacheng Li, Ethan He, Fuzhao Xue, Haotian Tang, Hongxu Yin, Jan Kautz, Ligeng Zhu, Linxi Fan, Pavlo Molchanov, Qinghao Hu, Shang Yang, Song Han, Xiuyu Li, Yao Lu, Yukang Chen, Yuke Zhu, Yunhao Fang, Zhijian Liu

LongVILA scales visual-language models from 8 to 2048 video frames while reaching 99.8 percent accuracy on million-token needle-in-a-haystack retrieval.

arxiv:2408.10188 v6 · 2024-08-19 · cs.CV · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{GPX7CXXA6TXOGXF3CZZDBISFOU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
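
As a rough illustration of why a mirror can recompute the same state, the sketch below folds a set of events into a record in an order that does not depend on how the events arrived. The actual merge algorithm and event schema are not shown on this page: the field names `id`, `ts`, `field`, and `value`, the last-writer-wins rule, and the sample events are all hypothetical.

```python
from typing import Any


def merge_state(canonical: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a copy of the canonical record, independent of arrival order."""
    # Assumption: an event id identifies its content, so duplicates are identical.
    unique = {event["id"]: event for event in events}
    # Total order: timestamp first, event id as the tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state = dict(canonical)
    for event in ordered:
        state[event["field"]] = event["value"]  # hypothetical last-writer-wins rule
    return state


# Made-up sample data, for illustration only.
base = {"source": "arxiv:2408.10188"}
events = [
    {"id": "e1", "ts": "2026-05-18T00:00:00Z", "field": "author_claim", "value": "open"},
    {"id": "e2", "ts": "2026-05-19T00:00:00Z", "field": "author_claim", "value": "claimed"},
    {"id": "e3", "ts": "2026-05-19T00:00:00Z", "field": "bundle", "value": "live"},
]

merged_in_order = merge_state(base, events)
merged_shuffled = merge_state(base, list(reversed(events)) + events[:1])
print(merged_in_order == merged_shuffled)
```

Sorting on a total order and deduplicating by id is one common way to get the same result on every mirror; the real algorithm may use a different ordering or conflict rule.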

Claims

C1 · strongest claim

LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack.

C2 · weakest assumption

That the two-stage training process (long-context extension followed by long-video supervised fine-tuning), combined with multi-modal sequence parallelism (MM-SP), will scale to long videos while preserving accuracy and efficiency, without hidden performance regressions or unstated data-selection effects.

C3 · one-line summary

LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.

References

32 extracted · 32 resolved · 17 Pith anchors

[1] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · arXiv:2308.12966

[2] RT-1: Robotics Transformer for Real-World Control at Scale · arXiv:2212.06817

[3] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control · arXiv:2307.15818

[4] Language models are few-shot learners

[5] ShareGPT4Video: Improving video understanding and generation with better captions

Formal links

2 machine-checked theorem links

Cited by

24 papers in Pith

NVILA: Efficient Frontier Visual Language Models

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Receipt and verification

First computed	2026-05-17T23:38:15.212631Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

33eff15ee0f4eee35cbb167230a245751ccd247696963e60d90fa53cff881e9d

Aliases

arxiv: 2408.10188 · arxiv_version: 2408.10188v6 · doi: 10.48550/arxiv.2408.10188 · pith_short_12: GPX7CXXA6TXO · pith_short_16: GPX7CXXA6TXOGXF3 · pith_short_8: GPX7CXXA
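
The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the unpadded RFC 4648 Base32 encoding of the first 16 bytes of the canonical hash, and each `pith_short_N` alias is its first N characters. This is an observation from the values shown here, not a documented rule of the Pith schema; the function name `pith_id_from_hash` is made up for this sketch.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "33eff15ee0f4eee35cbb167230a245751ccd247696963e60d90fa53cff881e9d"


def pith_id_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a SHA-256 hex digest, without padding."""
    first_16_bytes = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(first_16_bytes).decode("ascii").rstrip("=")


pith_id = pith_id_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": pith_id[:n] for n in (8, 12, 16)}

print(pith_id)  # GPX7CXXA6TXOGXF3CZZDBISFOU
for name, value in aliases.items():
    print(name, value)
```

If the relationship holds in general, a short alias can be checked against a full hash without contacting the resolver.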

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 33eff15ee0f4eee35cbb167230a245751ccd247696963e60d90fa53cff881e9d

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "cb75cb75920d79568c4cce055daca563fbae53b13cea3879ed77e41e712417df",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-08-19T17:48:08Z",
    "title_canon_sha256": "6d52f40ddef7d6f16dcc45e3011f7dd835c2e5bebf78daa88a6e7266486b643d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2408.10188",
    "kind": "arxiv",
    "version": 6
  }
}
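
The Python step of the verification command can also be run offline against a saved copy of the record. The sketch below applies the same canonicalization (sorted keys, compact separators, UTF-8) to the JSON shown above and compares the digest with the published hash. The helper name `canonical_sha256` is hypothetical, and the comparison result is printed rather than assumed, since the displayed record may differ in whitespace or escaping from what the resolver serves.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "cb75cb75920d79568c4cce055daca563fbae53b13cea3879ed77e41e712417df",
    "cross_cats_sorted": ["cs.CL"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-08-19T17:48:08Z",
    "title_canon_sha256": "6d52f40ddef7d6f16dcc45e3011f7dd835c2e5bebf78daa88a6e7266486b643d"
  },
  "schema_version": "1.0",
  "source": {"id": "2408.10188", "kind": "arxiv", "version": 6}
}
"""
PUBLISHED_HASH = "33eff15ee0f4eee35cbb167230a245751ccd247696963e60d90fa53cff881e9d"


def canonical_sha256(record: dict) -> str:
    """Hash a record after serializing it with sorted keys and no extra whitespace."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == PUBLISHED_HASH)
```

Because keys are sorted before hashing, the digest does not depend on the key order or indentation of the saved file.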