{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:LWVTT7QDM7U2MZKY67CFNOK2OO","short_pith_number":"pith:LWVTT7QD","schema_version":"1.0","canonical_sha256":"5dab39fe0367e9a66558f7c456b95a738cc15ff70419b293c2e5ec8f7245c54c","source":{"kind":"arxiv","id":"2510.09608","version":1},"attestation_state":"computed","paper":{"title":"StreamingVLM: Real-Time Understanding for Infinite Video Streams","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A vision-language model achieves stable real-time understanding of arbitrarily long video streams through a streaming attention cache aligned with training on short clips.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Guangxuan Xiao, Kelly Peng, Liuning He, Ruyi Xu, Song Han, Yao Lu, Yukang Chen","submitted_at":"2025-10-10T17:59:58Z","abstract_excerpt":"Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2510.09608","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-10-10T17:59:58Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"b46c56aa5c0f7276e4ddc1686d851d012fd82e8fe2abf4e0a3fb109d49914448","abstract_canon_sha256":"87a6d0d45f33b663733e1d3ccab4840f56fea1a808eabf19e20b21ee3d318aa3"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.196276Z","signature_b64":"AwbssBV8OuJBsIJkz6hCpv+24Gh+ODYaN3jn9bFI2uf+mI5ogG4Fq3FeFIVTFLQpIWsjreGcepfsQ9Q5TZ0kDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5dab39fe0367e9a66558f7c456b95a738cc15ff70419b293c2e5ec8f7245c54c","last_reissued_at":"2026-05-17T23:38:14.195787Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.195787Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"StreamingVLM: Real-Time Understanding for Infinite Video Streams","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A vision-language model achieves stable real-time understanding of arbitrarily long video streams through a streaming attention cache aligned with training on short clips.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Guangxuan Xiao, Kelly Peng, Liuning He, Ruyi Xu, Song Han, Yao Lu, Yukang Chen","submitted_at":"2025-10-10T17:59:58Z","abstract_excerpt":"Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That supervised fine-tuning with full attention on short overlapped video chunks will produce stable coherence and performance when the same model is later run with the streaming KV cache on arbitrarily long, non-overlapped video streams.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"StreamingVLM enables stable real-time understanding of infinite video streams at up to 8 FPS using a streaming KV cache and aligned SFT on overlapped chunks, with a 66.18% win rate over GPT-4O mini on a new two-hour video benchmark.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A vision-language model achieves stable real-time understanding of arbitrarily long video streams through a streaming attention cache aligned with training on short clips.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5648812ab5122e4765157cc167940d09c1e98fbbaa513d195acd690b4165da07"},"source":{"id":"2510.09608","kind":"arxiv","version":1},"verdict":{"id":"544ab5f3-bc89-4a9c-af69-1225c41a838d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T11:47:01.498125Z","strongest_claim":"On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.","one_line_summary":"StreamingVLM enables stable real-time understanding of infinite video streams at up to 8 FPS using a streaming KV cache and aligned SFT on overlapped chunks, with a 66.18% win rate over GPT-4O mini on a new two-hour video benchmark.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That supervised fine-tuning with full attention on short overlapped video chunks will produce stable coherence and performance when the same model is later run with the streaming KV cache on arbitrarily long, non-overlapped video streams.","pith_extraction_headline":"A vision-language model achieves stable real-time understanding of arbitrarily long video streams through a streaming attention cache aligned with training on short clips."},"references":{"count":12,"sample":[{"doi":"","year":null,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":1,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":null,"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","ref_index":2,"cited_arxiv_id":"2406.07476","is_internal_anchor":true},{"doi":"","year":null,"title":"arXiv preprint arXiv:2503.00540 , year=","work_id":"f9b28e0b-f48b-484b-a271-22ecec990b86","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens","work_id":"41fabff1-11da-43da-8efd-2eb55186b9f2","ref_index":4,"cited_arxiv_id":"2402.13753","is_internal_anchor":true},{"doi":"","year":null,"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","ref_index":5,"cited_arxiv_id":"2405.21075","is_internal_anchor":true}],"resolved_work":12,"snapshot_sha256":"511e5fcf9d8e0de4ded0821cc6f59a4d9b37d2f04771beddcde774589ad51a78","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4fe17a9627caac04e6a62ee27aa9201af505c751465ad4d493cba9a9d27fae09"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2510.09608","created_at":"2026-05-17T23:38:14.195865+00:00"},{"alias_kind":"arxiv_version","alias_value":"2510.09608v1","created_at":"2026-05-17T23:38:14.195865+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2510.09608","created_at":"2026-05-17T23:38:14.195865+00:00"},{"alias_kind":"pith_short_12","alias_value":"LWVTT7QDM7U2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"LWVTT7QDM7U2MZKY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"LWVTT7QD","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":17,"internal_anchor_count":17,"sample":[{"citing_arxiv_id":"2511.14582","citing_title":"OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21334","citing_title":"Streaming Video Instruction Tuning","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14724","citing_title":"HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13310","citing_title":"Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2602.22455","citing_title":"Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09874","citing_title":"EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03351","citing_title":"VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24317","citing_title":"Don't Pause! Every prediction matters in a streaming video","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01858","citing_title":"Decouple and Cache: KV Cache Construction for Streaming Video Understanding","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11627","citing_title":"POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11411","citing_title":"Online Reasoning Video Object Segmentation","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07634","citing_title":"VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07897","citing_title":"Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06036","citing_title":"CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04419","citing_title":"BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10060","citing_title":"Mosaic: Cross-Modal Clustering for Efficient Video Understanding","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15188","citing_title":"VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models","ref_index":26,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO","json":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO.json","graph_json":"https://pith.science/api/pith-number/LWVTT7QDM7U2MZKY67CFNOK2OO/graph.json","events_json":"https://pith.science/api/pith-number/LWVTT7QDM7U2MZKY67CFNOK2OO/events.json","paper":"https://pith.science/paper/LWVTT7QD"},"agent_actions":{"view_html":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO","download_json":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO.json","view_paper":"https://pith.science/paper/LWVTT7QD","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2510.09608&json=true","fetch_graph":"https://pith.science/api/pith-number/LWVTT7QDM7U2MZKY67CFNOK2OO/graph.json","fetch_events":"https://pith.science/api/pith-number/LWVTT7QDM7U2MZKY67CFNOK2OO/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO/action/timestamp_anchor","attest_storage":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO/action/storage_attestation","attest_author":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO/action/author_attestation","sign_citation":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO/action/citation_signature","submit_replication":"https://pith.science/pith/LWVTT7QDM7U2MZKY67CFNOK2OO/action/replication_record"}},"created_at":"2026-05-17T23:38:14.195865+00:00","updated_at":"2026-05-17T23:38:14.195865+00:00"}