Pith Number

pith:EXNEXHS3

pith:2024:EXNEXHS34HBJHXDIZFWV4V4ATB

not attested · not anchored · not stored · refs resolved

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Balakrishnan Varadarajan, Bilge Soran, Changsheng Zhao, Chenchen Zhu, Fanyi Xiao, Florian Bordes, Hu Xu, Hyunwoo J. Kim, Jun Chen, Lemeng Wu, Mohamed Elhoseiny, Raghuraman Krishnamoorthi, Vikas Chandra, Xiaoqian Shen, Yunyang Xiong, Zechun Liu, Zhuang Liu

LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.

arxiv:2410.17434 v1 · 2024-10-22 · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{EXNEXHS34HBJHXDIZFWV4V4ATB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
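The bundle's actual merge rules are not given on this page. As a rough illustration of how a merge can be made deterministic, the sketch below sorts events into a total order before applying them, so any mirror holding the same events reaches the same state. The event fields `ts`, `field`, and `value` are hypothetical and are not taken from the Pith schema.

```python
import hashlib
import json


def event_key(event: dict) -> tuple:
    """Total order for events: timestamp first, content hash as tie-breaker.

    Assumption: events carry an ISO-8601 UTC 'ts' string (hypothetical field).
    """
    digest = hashlib.sha256(
        json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ).hexdigest()
    return (event["ts"], digest)


def merge(record: dict, events: list) -> dict:
    """Apply events to a copy of the record in a fixed, input-order-independent order."""
    state = dict(record)
    for ev in sorted(events, key=event_key):
        # Hypothetical event shape: set one field to one value (last writer wins).
        state[ev["field"]] = ev["value"]
    return state
```

Because the order depends only on event content, two mirrors that receive the same events in different sequences compute the same result.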

Claims

C1 · strongest claim

Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.

C2 · weakest assumption

The assumption that DINOv2 similarity reliably identifies redundant frames without discarding task-relevant visual information, and that text-guided cross-modal queries plus temporal dependency reduction preserve all necessary details for downstream understanding.

C3 · one line summary

LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.

References

35 extracted · 35 resolved · 25 Pith anchors

[1] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · arXiv:2404.14219

[2] GPT-4 Technical Report · arXiv:2303.08774

[3] Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens

[4] Token Merging: Your ViT But Faster · arXiv:2210.09461

[5] Language Models are Few-Shot Learners · arXiv:2005.14165

Formal links

2 machine-checked theorem links

Cited by

35 papers in Pith

NVILA: Efficient Frontier Visual Language Models

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

Swift Sampling: Selecting Temporal Surprises via Taylor Series

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Receipt and verification

First computed	2026-05-17T23:38:47.688103Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc
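For this record, the 26-character identifier matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash bytes. That relationship is an observation from the values on this page, not a documented rule, so treat the function below as a consistency check rather than the official derivation. The name `id_from_hash` is made up for illustration.

```python
import base64

CANONICAL_HASH = "25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc"
PITH_ID = "EXNEXHS34HBJHXDIZFWV4V4ATB"


def id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters.

    Assumption: the identifier is a Base32 prefix of the SHA-256 digest,
    inferred from this one record only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


print(id_from_hash(CANONICAL_HASH) == PITH_ID)  # prints True
```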

Aliases

arxiv: 2410.17434 · arxiv_version: 2410.17434v1 · doi: 10.48550/arxiv.2410.17434 · pith_short_12: EXNEXHS34HBJ · pith_short_16: EXNEXHS34HBJHXDI · pith_short_8: EXNEXHS3
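The aliases listed here follow a visible pattern: the short forms are prefixes of the full identifier, and the arXiv-derived forms are built from the source id and version. The helper below reproduces that pattern as a sketch; `build_aliases` is a hypothetical name, and the lowercase `arxiv` in the DOI simply copies how this page prints it.

```python
def build_aliases(pith_id: str, arxiv_id: str, version: int) -> dict:
    """Rebuild the alias set shown on this page from its three inputs."""
    return {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
        # Short forms are plain prefixes of the full identifier.
        "pith_short_8": pith_id[:8],
        "pith_short_12": pith_id[:12],
        "pith_short_16": pith_id[:16],
    }


aliases = build_aliases("EXNEXHS34HBJHXDIZFWV4V4ATB", "2410.17434", 1)
```

Shorter prefixes are easier to cite but collide sooner, which is presumably why three lengths are offered.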

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "2c655dcf5b26292ac4b16b56aefe6dbd68a6c412c51af52fcb02b16e3e68c63d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-22T21:21:37Z",
    "title_canon_sha256": "1b6ac9fd9476f5260c2a24fde0b0a0761b95e10915c781dc05fadc0f7ab7e229"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.17434",
    "kind": "arxiv",
    "version": 1
  }
}
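The verification one-liner above canonicalizes the record as JSON with sorted keys and no insignificant whitespace before hashing. The sketch below does the same serialization offline on the record as printed here. Whether the resolver's `canonical_record` field is byte-for-byte this object is an assumption, so the digest is not asserted to equal the canonical hash; compare the two yourself.

```python
import hashlib
import json

# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "2c655dcf5b26292ac4b16b56aefe6dbd68a6c412c51af52fcb02b16e3e68c63d",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-10-22T21:21:37Z",
        "title_canon_sha256": "1b6ac9fd9476f5260c2a24fde0b0a0761b95e10915c781dc05fadc0f7ab7e229",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.17434", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


print(canonical_sha256(record))
```

Sorting keys makes the digest independent of the order in which a mirror happens to store the fields.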