Pith Number

pith:EXNEXHS3

pith:2024:EXNEXHS34HBJHXDIZFWV4V4ATB
not attested · not anchored · not stored · refs resolved

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Balakrishnan Varadarajan, Bilge Soran, Changsheng Zhao, Chenchen Zhu, Fanyi Xiao, Florian Bordes, Hu Xu, Hyunwoo J. Kim, Jun Chen, Lemeng Wu, Mohamed Elhoseiny, Raghuraman Krishnamoorthi, Vikas Chandra, Xiaoqian Shen, Yunyang Xiong, Zechun Liu, Zhuang Liu

LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.

arxiv:2410.17434 v1 · 2024-10-22 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{EXNEXHS34HBJHXDIZFWV4V4ATB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.

C2 weakest assumption

The assumption that DINOv2 similarity reliably identifies redundant frames without discarding task-relevant visual information, and that the text-guided cross-modal query plus temporal-dependency reduction preserve the details needed for downstream understanding.

C3 one line summary

LongVU adaptively compresses long-video tokens using DINOv2-based frame deduplication, text-guided cross-modal feature selection, and spatial token reduction based on temporal dependencies, improving video-language understanding in MLLMs with minimal loss of visual detail.
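
The first stage named in the summary, similarity-based frame deduplication, can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: it assumes per-frame feature vectors have already been extracted (for example by DINOv2), and it uses a greedy comparison against the last kept frame. The function names and the 0.9 threshold are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant_frames(features, threshold=0.9):
    """Return indices of frames to keep.

    A frame is kept only if its similarity to the most recently kept
    frame is below `threshold` (a hypothetical value, not from the paper).
    """
    kept = []
    for index, feature in enumerate(features):
        if not kept or cosine(features[kept[-1]], feature) < threshold:
            kept.append(index)
    return kept


# Toy 2-D "features": frames 0 and 1 are near-duplicates, as are 2 and 3.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
kept = drop_redundant_frames(frames)
print(kept)
```

On the toy input the sketch keeps one frame from each near-duplicate pair. Real feature vectors are much higher-dimensional, and the threshold would need tuning on actual data.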

References

35 extracted · 35 resolved · 25 Pith anchors

[1] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · arXiv:2404.14219
[2] GPT-4 Technical Report · arXiv:2303.08774
[3] MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
[4] Token Merging: Your ViT But Faster · arXiv:2210.09461
[5] Language Models are Few-Shot Learners · arXiv:2005.14165

Formal links

2 machine-checked theorem links

Cited by

35 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:47.688103Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc

Aliases

arxiv: 2410.17434 · arxiv_version: 2410.17434v1 · doi: 10.48550/arxiv.2410.17434 · pith_short_12: EXNEXHS34HBJ · pith_short_16: EXNEXHS34HBJHXDI · pith_short_8: EXNEXHS3
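
The values on this page suggest how the identifiers relate: the Pith Number matches the first 26 characters of the standard Base32 encoding of the canonical hash, and each `pith_short_N` alias is an N-character prefix of it. The sketch below checks that observation against this record only. Whether this is the documented derivation rule is an assumption, and `pith_number_from_hash` is a hypothetical helper name.

```python
import base64

# Canonical hash as published on this record.
CANONICAL_HASH = "25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc"


def pith_number_from_hash(hex_digest, length=26):
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: inferred from the values on this page, not from a spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)
print(aliases)
```

For this record the derived string equals the Pith Number shown at the top of the page, and the three prefixes equal the listed short aliases.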
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/EXNEXHS34HBJHXDIZFWV4V4ATB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "2c655dcf5b26292ac4b16b56aefe6dbd68a6c412c51af52fcb02b16e3e68c63d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-22T21:21:37Z",
    "title_canon_sha256": "1b6ac9fd9476f5260c2a24fde0b0a0761b95e10915c781dc05fadc0f7ab7e229"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.17434",
    "kind": "arxiv",
    "version": 1
  }
}
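
The verification one-liner above can also be run offline against the record as printed here. The following sketch applies the same serialization settings (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and compares the result with the published hash. It assumes the JSON shown on this page is byte-for-byte what the API returns in `.canonical_record`; if it is not, the comparison will report a mismatch. `canonical_sha256` is a hypothetical helper name.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "2c655dcf5b26292ac4b16b56aefe6dbd68a6c412c51af52fcb02b16e3e68c63d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-22T21:21:37Z",
    "title_canon_sha256": "1b6ac9fd9476f5260c2a24fde0b0a0761b95e10915c781dc05fadc0f7ab7e229"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.17434",
    "kind": "arxiv",
    "version": 1
  }
}
"""

PUBLISHED_HASH = "25da4b9e5be1c293dc68c96d5e5780985e68ac4cf4cd275df3a443c98744cefc"


def canonical_sha256(record):
    """Serialize with sorted keys and no whitespace, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)

print(digest)
print("matches published hash:", digest == PUBLISHED_HASH)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON, which is what lets a mirror recompute the same value.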