{"paper":{"title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"A parameter-free temporal pooling strategy lets image-language models extend directly to video dense captioning and question answering without added parameters or heavy retraining.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Daquan Zhou, Jiashi Feng, Lin Xu, See Kiong Ng, Yilin Zhao, Zhijie Lin","submitted_at":"2024-04-25T19:29:55Z","abstract_excerpt":"Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on vide"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PLLaVA achieves 3.48/5 on VideoChatGPT (9% above GPT-4V IG-VLM) and 58.1% on MVBench (14.5% above GPT-4V IG-VLM) by applying a parameter-free temporal pooling strategy that mitigates high-norm feature bias.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the performance drop when feeding multiple frames directly is caused primarily by high-norm visual feature bias rather than by other factors such as temporal modeling capacity or training data mismatch.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A parameter-free temporal pooling strategy lets image-language models extend directly to video dense captioning and question answering without added parameters or heavy retraining.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8a4a1585ddad7ccb6a1212c7f6ee77426fcbbde33839b6d7d1f901555d7a0dba"},"source":{"id":"2404.16994","kind":"arxiv","version":2},"verdict":{"id":"0ab945e4-32d6-48c8-a134-bce6542669c1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:18:37.088390Z","strongest_claim":"PLLaVA achieves 3.48/5 on VideoChatGPT (9% above GPT-4V IG-VLM) and 58.1% on MVBench (14.5% above GPT-4V IG-VLM) by applying a parameter-free temporal pooling strategy that mitigates high-norm feature bias.","one_line_summary":"A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the performance drop when feeding multiple frames directly is caused primarily by high-norm visual feature bias rather than by other factors such as temporal modeling capacity or training data mismatch.","pith_extraction_headline":"A parameter-free temporal pooling strategy lets image-language models extend directly to video dense captioning and question answering without added parameters or heavy retraining."},"references":{"count":53,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2021,"title":"Frozen in time: A joint video and image encoder for end-to-end retrieval","work_id":"2a136f10-92cd-4a8d-96ba-7aa9ab74f8d3","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Videollm: Modeling video sequence with large language models","work_id":"b2dab7c7-a0c3-46e2-99e3-b19a08e2436b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":4,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"","year":2023,"title":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality","work_id":"61034f5e-003f-4ba2-b05e-f332bf79c5d5","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":53,"snapshot_sha256":"835ef22f9bfc629c0781e381327ee441ddf0421407bcaf20da4642a10abe07dc","internal_anchors":9},"formal_canon":{"evidence_count":1,"snapshot_sha256":"ac4edc97a35346b99dd1c7e90cd9aca510e25c658d24c49337c19bcc59f27eb0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}