Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

Bingjun Luo; Hanqi Chen; Tony Wang; Xinpeng Ding

arxiv: 2605.22078 · v1 · pith:ZQBRYX7Enew · submitted 2026-05-21 · 💻 cs.AI · cs.CV

Pith reviewed 2026-05-22 06:11 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords: Video LLMs · visual token compression · training-free pooling · spatiotemporal gridding · norm-based selection · multimodal models · video understanding · token enhancement


The pith

ST-GridPool improves visual token representations in Video LLMs through training-free norm-based spatial pooling and pyramid temporal gridding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a plug-and-play technique can compress visual tokens more effectively for video large language models while retaining key spatiotemporal details. It combines norm-based spatial pooling to keep high-semantic regions with hierarchical temporal gridding to handle multi-scale time patterns. A sympathetic reader would care because current simple pooling methods lose important video dynamics during token reduction, and this offers an efficient alternative that works on existing models without retraining costs.

Core claim

The paper claims that ST-GridPool produces enhanced visual token sequences that yield consistent performance gains on video understanding benchmarks across multiple Video LLM architectures. It combines two components: Pyramid Temporal Gridding, which divides a video sequence into hierarchical temporal grids to capture multi-grained interactions, and Norm-based Spatial Pooling, which prioritizes tokens with higher norms as proxies for semantic richness.

What carries the argument

ST-GridPool, which applies Pyramid Temporal Gridding for hierarchical temporal divisions and Norm-based Spatial Pooling to retain tokens based on their norm values indicating semantic importance.
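
As a rough illustration of the temporal half of the method, the sketch below builds the hierarchical frame ranges implied by the segment-length rule K_l = K·2^{l-1} that this page quotes from the paper further down. It is a minimal sketch under assumptions, not the authors' implementation: the function name `pyramid_segments`, the non-overlapping tiling, and the shorter trailing segment are hypothetical choices, and what the method then does with the tokens inside each range is not shown.

```python
def pyramid_segments(num_frames, base_len, num_levels):
    """Return one list of (start, end) frame ranges per pyramid level.

    Level l (1-based) uses segments of length K_l = base_len * 2**(l - 1).
    Assumption: segments tile the sequence without overlap, and the last
    segment at a level may be shorter when num_frames is not a multiple.
    """
    levels = []
    for level in range(1, num_levels + 1):
        seg_len = base_len * 2 ** (level - 1)
        ranges = [
            (start, min(start + seg_len, num_frames))
            for start in range(0, num_frames, seg_len)
        ]
        levels.append(ranges)
    return levels


# Hypothetical example values: 8 frames, K = 2, 3 levels.
example_levels = pyramid_segments(num_frames=8, base_len=2, num_levels=3)
for level_index, ranges in enumerate(example_levels, start=1):
    print(level_index, ranges)
```

With these example values the first level holds four 2-frame ranges, the second two 4-frame ranges, and the third a single 8-frame range, which is one way to picture "multi-grained" temporal grouping.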

If this is right

Video LLMs achieve higher accuracy on tasks like action recognition and video question answering with the same token budget.
The method serves as a universal preprocessing step that applies to different Video LLM backbones without parameter changes.
Multi-grained temporal structures in videos are better preserved, improving handling of complex motion sequences.
Token compression becomes more efficient by focusing on norm-correlated regions rather than uniform averaging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The norm-based selection principle could extend to static image MLLMs for similar token savings.
Hybrid systems might combine this training-free step with lightweight learned adapters for additional gains.
Longer video inputs could test whether the pyramid gridding maintains benefits at extended durations.

Load-bearing premise

Higher token norm values indicate greater semantic richness, and hierarchical temporal gridding captures useful multi-scale spatiotemporal interactions in videos.
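
One way to probe the first half of this premise is to correlate per-token norms with a binary saliency label. The sketch below does that with a plain Pearson coefficient on toy data. The token vectors, the labels, and the helper names are made up for illustration, and the use of the L2 norm is an assumption (the passage quoted later on this page uses a general p-norm).

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token vector."""
    return math.sqrt(sum(x * x for x in vec))


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): four 2-d tokens and a 0/1 "salient region" label.
tokens = [[3.0, 4.0], [0.3, 0.4], [6.0, 8.0], [0.0, 1.0]]
salient = [1, 0, 1, 0]

norms = [l2_norm(t) for t in tokens]
r = pearson(norms, salient)
print(f"norm/saliency correlation on toy data: {r:.3f}")
```

A coefficient near zero on real tokens and real saliency labels would count against the premise; a strongly positive one would support it without proving that norm-based pooling is the best use of that signal.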

What would settle it

Applying ST-GridPool to a standard Video LLM and observing no gain or a drop in accuracy on video benchmarks such as MSVD-QA or ActivityNet-QA compared to baseline pooling, while holding token count fixed.

Figures

Figures reproduced from arXiv: 2605.22078 by Bingjun Luo, Hanqi Chen, Tony Wang, Xinpeng Ding.

**Figure 2.** Overview of ST-GridPool. The method takes the token sequence

**Figure 3.** Illustration of the visual token norm distribution discrepancy between salient object area

**Figure 4.** Ablation study results for different values of temperature

**Figure 5.** Computational cost comparison between our method and the baseline LLaVA-Video

**Figure 6.** Response examples of LLaVA-Video with and w/o Ours from LongVideoBench dataset.

**Figure 8.** It is observed that both datasets achieve peak performance at a kernel size of

**Figure 7.** Ablation study results for different values of level

**Figure 8.** Ablation study results for different pooling shape on VideoMME and LongVideoBench.

**Figure 9.** The other output examples of LLaVA-Video model without and with our method on

Original abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly advanced video understanding tasks, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient and plug-and-play solution for improving visual token representations. Our code is available at https://github.com/bingjunluo/ST-GridPool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ST-GridPool, a training-free method to enhance visual token representations for Video LLMs. It combines Pyramid Temporal Gridding (PTG) to capture multi-grained spatiotemporal interactions through hierarchical temporal gridding with Norm-based Spatial Pooling (NSP) that selects high-norm tokens on the basis of an assumed correlation with semantic richness. The central claim is that this plug-and-play approach yields consistent performance gains on video understanding benchmarks without retraining.

Significance. If the performance gains prove robust and the design assumptions hold, the work would supply a low-cost, training-free enhancement that reduces token volume while preserving spatiotemporal structure in Video LLMs. The open-source code release and emphasis on plug-and-play applicability are positive features that could facilitate adoption.

major comments (2)

[Section 3.2] (NSP description) The justification for Norm-based Spatial Pooling rests on the premise that higher token norms indicate greater semantic richness, yet the manuscript supplies no direct supporting evidence such as token visualizations, quantitative correlation with human-annotated regions, or ablation against norm-agnostic token-selection baselines that preserve identical token counts. This assumption is load-bearing for the claim that NSP improves information retention rather than simply performing generic reduction.
[Section 4] (Experiments) Reported benchmark improvements are presented without error bars, without full ablation tables isolating the separate contributions of PTG and NSP, and without explicit description of how pyramid gridding levels or norm-selection thresholds were chosen. These omissions leave open the possibility that gains arise from post-hoc parameter tuning or from token reduction alone.

minor comments (2)

[Abstract] The abstract refers to 'various benchmarks' without naming them or quantifying the observed gains; adding this information would improve clarity.
Equations or pseudocode for the exact PTG gridding hierarchy and NSP selection rule would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and have revised the paper to incorporate additional evidence and experimental details.

Referee: [Section 3.2] (NSP description) The justification for Norm-based Spatial Pooling rests on the premise that higher token norms indicate greater semantic richness, yet the manuscript supplies no direct supporting evidence such as token visualizations, quantitative correlation with human-annotated regions, or ablation against norm-agnostic token-selection baselines that preserve identical token counts. This assumption is load-bearing for the claim that NSP improves information retention rather than simply performing generic reduction.

Authors: We appreciate the referee's point that the NSP motivation requires stronger grounding. While the original manuscript referenced general observations from transformer literature on norm-semantic correlations, we acknowledge the absence of direct evidence in our submission. In the revised manuscript, we have expanded Section 3.2 with (i) qualitative token visualizations showing high-norm tokens aligning with salient objects and motion regions, (ii) a quantitative correlation analysis between token norms and regions of high optical flow or object density (using automated detectors rather than new human annotations), and (iii) an ablation comparing NSP against norm-agnostic baselines (uniform sampling and random selection) that retain the exact same token count. These additions demonstrate that NSP yields measurable gains beyond generic reduction. We have also clarified that the core assumption is now presented as a working hypothesis supported by the new empirical results rather than an unverified premise.

Revision: yes

Referee: [Section 4] (Experiments) Reported benchmark improvements are presented without error bars, without full ablation tables isolating the separate contributions of PTG and NSP, and without explicit description of how pyramid gridding levels or norm-selection thresholds were chosen. These omissions leave open the possibility that gains arise from post-hoc parameter tuning or from token reduction alone.

Authors: We agree that the experimental section would benefit from greater transparency and controls. The revised manuscript now includes: error bars computed over three independent runs with different seeds for all main results; a comprehensive ablation table that reports performance for PTG alone, NSP alone, and the full ST-GridPool combination; and a new subsection detailing the hyperparameter choices. Pyramid levels are set proportionally to video length (1 level for short clips, up to 3 for longer ones) and the norm threshold retains the top 50% of tokens by default, with these rules fixed before evaluation. To address post-hoc tuning concerns, we explicitly state that all settings were determined on a small validation split and applied uniformly across benchmarks without per-dataset adjustment. We have also added a direct comparison against simple token-reduction baselines to isolate the contribution of the structured spatiotemporal design.

Revision: yes
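
The equal-budget comparison described in this exchange can be sketched as three index selectors that all keep the same number of tokens. This is an illustrative sketch only: the helper names are hypothetical, the 0.5 default mirrors the "top 50%" figure in the simulated rebuttal rather than a confirmed setting of the paper, and hard selection is itself an assumption, since the paper passage quoted later on this page describes a weighted summation.

```python
import math
import random


def keep_count(num_tokens, ratio):
    """Number of tokens retained for a given keep ratio (at least one)."""
    return max(1, int(num_tokens * ratio))


def select_top_norm(tokens, ratio=0.5):
    """Indices of the highest-L2-norm tokens, returned in original order."""
    k = keep_count(len(tokens), ratio)
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])


def select_uniform(tokens, ratio=0.5):
    """Norm-agnostic baseline: evenly spaced indices, same token count."""
    n, k = len(tokens), keep_count(len(tokens), ratio)
    return [i * n // k for i in range(k)]


def select_random(tokens, ratio=0.5, seed=0):
    """Norm-agnostic baseline: seeded random indices, same token count."""
    k = keep_count(len(tokens), ratio)
    return sorted(random.Random(seed).sample(range(len(tokens)), k))


# Toy tokens (hypothetical) with L2 norms of about 1.0, 5.0, 2.83, 0.5.
toy_tokens = [[1.0, 0.0], [0.0, 5.0], [2.0, 2.0], [0.0, 0.5]]
print(select_top_norm(toy_tokens), select_uniform(toy_tokens), select_random(toy_tokens))
```

Because every selector returns the same number of indices, any accuracy difference between them in a downstream evaluation can be attributed to which tokens were kept rather than to how many.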

Circularity Check

0 steps flagged

No significant circularity: design is procedural heuristic from domain observations, validated empirically on external benchmarks

full rationale

The paper introduces ST-GridPool as a training-free method combining Pyramid Temporal Gridding (PTG) for multi-grained spatiotemporal capture and Norm-based Spatial Pooling (NSP) based on the observed correlation between token norms and semantic richness. These choices are defined directly from stated assumptions and procedural rules rather than any fitted parameters, self-referential equations, or self-citations that reduce the reported performance gains to the inputs by construction. Experiments on various benchmarks provide independent empirical support, with no load-bearing steps that equate the final claims to the initial design assumptions or prior author work.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on one domain assumption about token norms and a small number of design choices for the pyramid structure; no new physical entities are introduced.

free parameters (2)

Pyramid temporal gridding levels and scales: Hierarchical time divisions chosen to capture multi-grained interactions; values affect which spatiotemporal patterns are preserved.
Norm-based selection threshold or ratio in NSP: Controls how many high-norm tokens are kept; directly influences information retention.

axioms (1)

Domain assumption: Token norm values correlate with semantic richness. Invoked to justify why norm-based pooling preserves important visual regions.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Pyramid Temporal Gridding (PTG) ... segment length is defined as K_l = K·2^{l-1} ... Norm-based Spatial Pooling (NSP) ... α_{m,n} = exp(β∥t_{m,n}∥_p) / sum ... weighted summation
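
Read literally, the NSP fragment above describes softmax-style weights over token norms followed by a weighted sum. A minimal sketch follows, assuming the elided denominator sums the same exponential over all tokens in one pooling window. That normalization, the function names, and the default values of β and p are assumptions, not details confirmed by the quoted passage.

```python
import math


def p_norm(vec, p=2.0):
    """p-norm of one token vector."""
    return sum(abs(x) ** p for x in vec) ** (1.0 / p)


def norm_weighted_pool(window, beta=1.0, p=2.0):
    """Pool the tokens of one window into a single token.

    Weights follow alpha_i = exp(beta * ||t_i||_p) / sum_j exp(beta * ||t_j||_p),
    where the sum over the window is an assumed normalization.
    Returns (pooled_token, alphas).
    """
    norms = [p_norm(t, p) for t in window]
    # Shifting by the largest norm leaves the weights unchanged and avoids
    # overflow in exp() when beta >= 0.
    shift = max(norms)
    exps = [math.exp(beta * (n - shift)) for n in norms]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(window[0])
    pooled = [sum(a * t[d] for a, t in zip(alphas, window)) for d in range(dim)]
    return pooled, alphas


# Toy window (hypothetical): one high-norm token and one zero token.
pooled, alphas = norm_weighted_pool([[3.0, 4.0], [0.0, 0.0]], beta=1.0)
print(pooled, alphas)
```

Setting `beta=0.0` makes all weights equal, so the sketch reduces to ordinary average pooling; larger values push the pooled token toward the highest-norm token in the window.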

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 8 internal anchors

[1] VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476,

[2] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075,

[3] An image grid can be worth a video: Zero-shot video question answering using a vlm

Wonkyun Kim, Changin Choi, Wonseok Lee, and Wonjong Rhee. An image grid can be worth a video: Zero-shot video question answering using a vlm. arXiv preprint arXiv:2403.18406,

[4] LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a. Guanbin Li and Yizhou Yu. Visual saliency detection based on multiscale deep cnn features. IEEE transactions on image processing, 25(11):5012–5024,

[5] VideoChat: Chat-Centric Video Understanding

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp. 19730–19742. PMLR, 2023a. KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-ce...

[6] NVILA: Efficient Frontier Visual Language Models

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/. Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In Eur...

[7] Egoschema: A diagnostic benchmark for very long-form video language understanding

Published as a conference paper at ICLR 2026

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36,

[8] Ts-llava: Constructing visual tokens through thumbnail-and-sampling for training-free video large language models

Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, and Marie-Francine Moens. Ts-llava: Constructing visual tokens through thumbnail-and-sampling for training-free video large language models. arXiv preprint arXiv:2411.11066,

[9] PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024a. Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. Slowfast-llava: A strong training-free baselin...

[10] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858,

[11] InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024a. URL https://arxiv.org/abs/2407.12772. Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Du...

[12] A COMPARISON WITH TOKEN REDUCTION METHODS

| Method | VideoMME | L.V.Bench | EgoSchema |
|---|---|---|---|
| Upper Bound (Full Tokens) | | | |
| NVILA | 61.5 | 56.3 | 52.9 |
| Token Reduction Ratio: 30% | | | |
| FastV | 57.9 | 53.0 | 49.7 |
| PruMerge | 58.2 | 53.4 | 47.5 |
| FasterVLM | 60.1 | 53.0 | 49.3 |
| VisionZip | 59.1 | 50.9 | 48.9 |
| FrameFusion | 58.8 | 54.9 | 51.3 |
| Ours | 59.9 | 54.6 | 52.0 |

Token Reduction R...

[13] At a 30% reduction, our approach is highly competitive and achieves the top score on EgoSchema

The results in Table 4 demonstrate our method’s superior performance. At a 30% reduction, our approach is highly competitive and achieves the top score on EgoSchema. Its advantage becomes even more pronounced at a 50% reduction, where our method ranks first across all three benchmarks. Notably, at this high compression rate, our method’s performance not o...

[14]

and TS-LLaVA (Qu et al., 2024). To examine the performance differences between these approaches, we conducted experiments using image-gridding, where the PTG module processes information at the image level, akin to IG-VLM. In contrast, our method applies token-gridding on token representations. Results are shown in Table 6 and Table 7, which demonstra...

[15] Smaller kernels (1,1) yield suboptimal results, as overly localized receptive fields fail to capture contextual spatial relationships, limiting feature aggregation

It is observed that both datasets achieve peak performance at a kernel size of (2,2). Smaller kernels (1,1) yield suboptimal results, as overly localized receptive fields fail to capture contextual spatial relationships, limiting feature aggregation. Performance declines progressively for kernels
Method Video...

[16] solid blue lines

Visual comparisons demonstrate that our method (LLaVA-Video-7B + Ours) significantly outperforms the baseline in capturing fine-grained spatiotemporal details. One key limitation of baseline models is their inability to resolve temporal dependencies, even when all event components are clearly present in the video. For instance, in the first example, wh...
