FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

Feng Tang; Gang Huang; Jianjian Li; Junquan Fan; Nian Xie; Shitao Zhu; Songlin Liu; Wulong Liu; Yong Liao

arxiv: 2502.18512 · v1 · pith:3PIRTQXMnew · submitted 2025-02-22 · 💻 cs.CV · cs.AI

keywords: visual, models, text-oriented, compression, high-resolution, token, vllms, degradation

The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a lightweight self-distillation pre-training stage to compress the visual tokens, requiring a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins
cs.CV 2026-06 conditional novelty 7.0

OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.