Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

Bo Jiang; Haoli Bai; Huanyu Wang; Jushi Kai; Lu Hou; Zhouhan Lin; Ziwei He

arxiv: 2508.06038 · v3 · pith:POD5JHXGnew · submitted 2025-08-08 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-21 22:36 UTC · model grok-4.3


keywords: visual token compression · frequency domain · Fourier transform · vision-language models · parameter-free compression · inference efficiency · FFT · token reduction


The pith

Fourier Compressor applies FFT to visual tokens in VLMs to remove frequency-domain redundancy while retaining over 96% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the heavy compute load in vision-language models that stems from processing large numbers of vision tokens for high-resolution inputs. It establishes that semantic content clusters unevenly across frequency bands in these representations, so discarding lower-value bands can shrink the token set without much damage to the original distribution. The proposed module carries out this compression using a fast Fourier transform and its inverse, adding no learned parameters and only light overhead. A reader would care if the approach scales because it promises to make detailed visual reasoning practical on smaller hardware and for longer video sequences.

Core claim

Fourier Compressor transforms visual token representations into the frequency domain via FFT, identifies and removes redundant components according to their non-uniform semantic distribution, then returns the result via inverse FFT, yielding a compressed token sequence that preserves enough fidelity for downstream VLM tasks at high compression ratios.
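The transform, crop, and inverse-transform pipeline described in the core claim can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function name `fourier_compress`, the centred low-frequency crop, and the mean-preserving rescale are all assumptions made for the example.

```python
import numpy as np

def fourier_compress(tokens: np.ndarray, grid: tuple, keep: tuple) -> np.ndarray:
    """Low-pass compress a token grid in the frequency domain (illustrative only).

    tokens: (H*W, D) visual tokens in row-major grid order.
    grid:   (H, W) original grid size.
    keep:   (h, w) size of the retained low-frequency block, h <= H and w <= W.
    Returns an (h*w, D) array of compressed tokens.
    """
    H, W = grid
    h, w = keep
    x = tokens.reshape(H, W, -1)
    # 2D FFT over the two spatial axes, zero frequency shifted to the centre.
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(0, 1)), axes=(0, 1))
    # Keep the central h x w block, i.e. the lowest spatial frequencies.
    top, left = H // 2 - h // 2, W // 2 - w // 2
    low = spec[top:top + h, left:left + w]
    # Inverse transform on the smaller grid, then rescale so the mean is preserved
    # (NumPy's forward FFT is unnormalised, its inverse divides by h*w).
    out = np.fft.ifft2(np.fft.ifftshift(low, axes=(0, 1)), axes=(0, 1))
    out = out.real * (h * w) / (H * W)
    return out.reshape(h * w, -1)
```

With a 24×24 grid and `keep=(12, 12)`, the sketch returns 144 tokens in place of 576. Taking the real part discards any small imaginary residue left by an asymmetric crop.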

What carries the argument

Fourier Compressor module, which performs FFT-based frequency selection and inverse transform on visual representations to excise redundancy without extra parameters.

If this is right

Retains over 96% of original accuracy on image benchmarks.
Reduces inference FLOPs by up to 83.8%.
Increases generation speed by 31.2%.
Outperforms existing parameter-free token selection and merging baselines.
Applies without modification to both LLaVA and Qwen-VL families and extends to video inputs.
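To see why fewer visual tokens translate into fewer FLOPs, a rough per-layer cost model helps. The sketch below counts multiply-accumulates for a decoder with gated feed-forward layers; the formula is a common back-of-envelope estimate, and the default dimensions (hidden size 4096, FFN width 11008, 32 layers) are assumed values for a 7B LLaMA-style model. It is not the accounting used in the paper and does not attempt to reproduce the 83.8% figure.

```python
def prefill_macs(n_tokens: int, d: int = 4096, m: int = 11008, layers: int = 32) -> int:
    """Rough multiply-accumulate count for one forward pass over n_tokens.

    Per layer: 4*n*d^2 for the Q, K, V and output projections,
    2*n^2*d for the attention scores and weighted sum,
    3*n*d*m for a gated feed-forward block (assumed, LLaMA-style).
    """
    per_layer = 4 * n_tokens * d * d + 2 * n_tokens * n_tokens * d + 3 * n_tokens * d * m
    return layers * per_layer

def flops_reduction(visual_before: int, visual_after: int, text_tokens: int) -> float:
    """Fraction of compute saved when the visual token count shrinks."""
    base = prefill_macs(visual_before + text_tokens)
    comp = prefill_macs(visual_after + text_tokens)
    return 1.0 - comp / base

# Hypothetical setting: 576 visual tokens cut to 144, alongside 64 text tokens.
saving = flops_reduction(576, 144, 64)
```

Because the text tokens are untouched, the saving is always smaller than the raw visual-token reduction, which is why the prompt length matters when comparing reported FLOP numbers.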

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frequency-band analysis could be tested on audio or multimodal streams to check for analogous redundancy patterns.
Adaptive per-sample frequency thresholds might further improve the accuracy-efficiency curve beyond the fixed scheme presented.
Lower token counts could make higher-resolution or multi-image inputs feasible on edge devices without retraining the backbone.
The technique might combine with existing merging methods to reach even higher compression ratios while staying parameter-free.

Load-bearing premise

Semantic information in visual representations of VLMs is consistently non-uniform across frequency bands so that selected bands can be dropped without substantially altering the original representation distribution.

What would settle it

Running the method at the reported compression ratio on a standard image VLM benchmark and observing accuracy fall below 90% of the uncompressed baseline.
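The test above reduces to a single ratio. A small sketch, with hypothetical function names and made-up accuracy values used purely for illustration:

```python
def relative_retention(compressed_acc: float, baseline_acc: float) -> float:
    """Accuracy of the compressed model as a fraction of the uncompressed baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return compressed_acc / baseline_acc

def premise_survives(compressed_acc: float, baseline_acc: float, floor: float = 0.90) -> bool:
    """True if retention stays at or above the 90% floor named above."""
    return relative_retention(compressed_acc, baseline_acc) >= floor

# Illustrative numbers only, not results from the paper.
ok = premise_survives(58.0, 60.0)      # retention of about 0.967
broken = premise_survives(50.0, 60.0)  # retention of about 0.833
```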

Figures

Figures reproduced from arXiv: 2508.06038 by Bo Jiang, Haoli Bai, Huanyu Wang, Jushi Kai, Lu Hou, Zhouhan Lin, Ziwei He.

**Figure 1.** Heatmap visualization of the frequency spectra computed from vision encoder outputs of different images. Since only the magnitude is of interest, the absolute values across all hidden dimensions are averaged for each frequency component and then plotted on a logarithmic scale. (e) and (f) show the frequency spectra of the visual encoder outputs from LLaVA-v1.5, while (g) and (h) correspond to those from Qw…
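The spectrum described in the Figure 1 caption (magnitude averaged over hidden dimensions, then log-scaled) can be computed as follows. This is a sketch under assumptions: the function name, the row-major grid layout, and the small `eps` added before the logarithm are not taken from the paper.

```python
import numpy as np

def log_magnitude_spectrum(tokens: np.ndarray, grid: tuple, eps: float = 1e-8) -> np.ndarray:
    """Return an (H, W) log-magnitude spectrum of a token grid.

    tokens: (H*W, D) vision encoder outputs in row-major grid order (assumed layout).
    """
    H, W = grid
    x = tokens.reshape(H, W, -1)
    # 2D FFT over the spatial axes; fftshift puts the zero frequency at the centre.
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(0, 1)), axes=(0, 1))
    # Average the magnitude over all hidden dimensions for each frequency component.
    mag = np.abs(spec).mean(axis=-1)
    # Log scale; eps avoids log(0) for empty frequency bins.
    return np.log(mag + eps)
```

The resulting array can be passed directly to a heatmap plotting routine. A bright centre indicates energy concentrated at low frequencies.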

**Figure 2.** Illustration of the Fourier-VLM framework. After passing through the vision encoder, visual features are reshaped into a grid and transformed into the frequency domain. Darker colors indicate larger frequency magnitudes, while lighter colors represent smaller magnitudes. Only the low-frequency components are retained and subsequently converted back to the spatial domain, serving as the compressed visual fe…

**Figure 3.** Latency and KV cache usage of Fourier-LLaVA.

6.3. Applicability to Video Tasks. Fourier-LLaVA and Fourier-Qwen series, though trained only on single-image conversations, generalize well to zero-shot video tasks. We further evaluate on MVBench (Li et al., 2024a), a comprehensive multi-modal video understanding benchmark that encompasses 20 challenging video tasks. As shown in …

**Figure 4.** Model outputs with varying numbers of vision tokens. Here, our FFC module is directly applied to LLaVA-v1.5-7B without any additional training. Correct information is underlined, while incorrect text is highlighted in red.

Original abstract

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n^2 \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.
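One way to read the stated bound, assuming $n$ denotes the side length of an $n \times n$ token grid (the abstract does not define $n$): a 2D FFT is separable into $n$ row transforms followed by $n$ column transforms, each a length-$n$ 1D FFT costing $\mathcal{O}(n \log n)$ per hidden channel.

$$
T_{\text{2D}}(n) \;=\; \underbrace{n \cdot \mathcal{O}(n \log n)}_{\text{rows}} \;+\; \underbrace{n \cdot \mathcal{O}(n \log n)}_{\text{columns}} \;=\; \mathcal{O}(n^2 \log n)
$$

Under the same assumption, with $N = n^2$ tokens this is $\mathcal{O}(N \log N)$ per channel, which is small next to the quadratic attention cost over the same tokens.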

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Fourier Compressor shows a simple FFT-based way to prune visual tokens in VLMs for big FLOP savings with small accuracy loss, but the stability of those frequency patterns across inputs is the part that needs more checking.


The main takeaway is that this paper takes frequency-domain compression, already common in image codecs, and applies it directly to the visual tokens inside VLMs. They first measure the frequency content of the token representations, observe that semantic information sits unevenly across bands, and then use a plain FFT to drop the lower-value components before transforming back. The whole module adds no parameters and very little compute.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Fourier Compressor, a parameter-free module for visual token compression in Vision-Language Models. Motivated by frequency-domain analysis showing non-uniform semantic information distribution across bands in VLM representations, the method applies FFT to remove redundancy (with O(n² log n) complexity and no added parameters), aiming to reduce inference costs while preserving fidelity. Experiments claim retention of over 96% original accuracy, up to 83.8% FLOP reduction, 31.2% faster generation, consistent outperformance of other parameter-free baselines, generalization across LLaVA and Qwen-VL, and extension to video tasks.

Significance. If the central performance-efficiency claims hold with robust controls, the work would be significant for practical VLM deployment on high-resolution or video inputs. Strengths include the parameter-free design via standard FFT (no training overhead or invented entities), the explicit complexity bound, and reported generalization across architectures and modalities. These elements support easy integration and falsifiable benchmark comparisons.

major comments (2)

[§3.1] (frequency analysis): The claim of a consistent non-uniform semantic distribution permitting reliable fixed-band removal is load-bearing for the general 96% retention and 83.8% FLOP reduction figures. The manuscript should provide quantitative evidence (e.g., variance or per-sample statistics) that the concentration pattern does not vary substantially across images, models (LLaVA vs. Qwen-VL), or tasks (image vs. video); otherwise the efficiency-accuracy trade-off risks being input-dependent rather than general.
[§4.3] (video extension): The extension to video understanding is presented as evidence of broad applicability, yet the central assumption of stable frequency redundancy may be strained by temporal dynamics. Additional controls comparing frequency-band statistics between static images and video frames would be needed to substantiate that the same removal strategy preserves fidelity without larger distortion.
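The per-sample statistics requested in the major comments could take roughly the following shape. The sketch uses the fraction of spectral energy inside the retained low-frequency block as a stand-in for "semantic information"; that proxy, the function names, and the centred band definition are assumptions for illustration, not the paper's measurement.

```python
import numpy as np

def low_band_energy_fraction(tokens: np.ndarray, grid: tuple, keep: tuple) -> float:
    """Share of total spectral energy inside the central h x w low-frequency block.

    Assumes a non-zero input; an all-zero token grid would divide by zero.
    """
    H, W = grid
    h, w = keep
    x = tokens.reshape(H, W, -1)
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(0, 1)), axes=(0, 1))
    power = np.abs(spec) ** 2
    top, left = H // 2 - h // 2, W // 2 - w // 2
    low = power[top:top + h, left:left + w].sum()
    return float(low / power.sum())

def band_stability(samples, grid: tuple, keep: tuple) -> tuple:
    """Mean and standard deviation of the low-band energy fraction over samples."""
    fracs = np.array([low_band_energy_fraction(t, grid, keep) for t in samples])
    return float(fracs.mean()), float(fracs.std())
```

Running `band_stability` separately on image features, video-frame features, and each model family would give the comparison both major comments ask for: a small standard deviation and similar means would support a fixed band, while a wide spread would point toward input-dependent thresholds.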

minor comments (2)

[§3.2] Clarify in the method section whether the removed frequency bands are chosen via a fixed threshold or data-driven heuristic, and confirm the exact token reduction ratio used for the 83.8% FLOP figure.
[Table 2] Ensure all compared methods use identical compression ratios and report standard deviations over multiple runs to support the outperformance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the manuscript's significance, particularly the parameter-free design and generalization aspects. We address the major comments below and will revise the manuscript accordingly to strengthen the supporting evidence.


Referee: [§3.1] (frequency analysis): The claim of a consistent non-uniform semantic distribution permitting reliable fixed-band removal is load-bearing for the general 96% retention and 83.8% FLOP reduction figures. The manuscript should provide quantitative evidence (e.g., variance or per-sample statistics) that the concentration pattern does not vary substantially across images, models (LLaVA vs. Qwen-VL), or tasks (image vs. video); otherwise the efficiency-accuracy trade-off risks being input-dependent rather than general.

Authors: We agree that explicit quantitative evidence on the stability of the frequency distribution would further bolster the claims. While the current experiments already show consistent performance retention across LLaVA, Qwen-VL, and both image and video tasks, we will add per-sample statistics and variance analysis in the revised manuscript. This will include metrics such as the mean and standard deviation of retained semantic information per frequency band, computed over representative samples from the benchmarks, reported separately for each model and for image versus video inputs to confirm the pattern is sufficiently consistent for the fixed-band approach.

revision: yes

Referee: [§4.3] (video extension): The extension to video understanding is presented as evidence of broad applicability, yet the central assumption of stable frequency redundancy may be strained by temporal dynamics. Additional controls comparing frequency-band statistics between static images and video frames would be needed to substantiate that the same removal strategy preserves fidelity without larger distortion.

Authors: We acknowledge that temporal dynamics in video could influence frequency patterns and appreciate the call for direct controls. In the revision, we will incorporate additional analysis comparing frequency-band statistics (e.g., average energy distribution and variance of semantic importance) between static images and video frames. This will demonstrate that the redundancy patterns remain comparable, supporting that the same removal strategy preserves fidelity without substantially larger distortion in the video setting.

revision: yes

Circularity Check

0 steps flagged

No circularity: parameter-free FFT with external benchmark validation


The paper's chain begins with motivation from JPEG-style frequency transforms, followed by an empirical observation of non-uniform semantic distribution in VLM visual tokens, then applies standard FFT for redundancy removal. No parameters are fitted to data subsets, no predictions reduce to fitted inputs by construction, and no self-citations or uniqueness theorems are invoked as load-bearing premises. Performance results (96% accuracy retention, FLOP reductions) are measured on independent external benchmarks across LLaVA, Qwen-VL, and video tasks rather than derived from the method's own definitions or fits. The approach is self-contained against those benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation of frequency redundancy in visual representations and the effectiveness of selective removal in the frequency domain; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: Visual token representations in VLMs contain exploitable frequency redundancy with non-uniform semantic information distribution across bands.
This is stated as uncovered through systematic analysis in the abstract and underpins the compression strategy.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we apply a low-pass filter to the vision features using a two-dimensional Discrete Cosine Transform (DCT)... energy tends to concentrate in low-frequency components"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "O(N² log N) complexity... no additional parameters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
cs.RO 2026-04 unverdicted novelty 6.0

FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs
cs.CV 2026-05 unverdicted novelty 5.0

Fre-Res compresses video tokens by preserving spatial anchors and representing temporal dynamics with low-frequency residual tokens derived from 1D-DCT on inter-frame residuals, plus a Spatial-Guided Absorber to reinj...

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 2 Pith papers · 9 internal anchors

[1] Qwen2.5-VL Technical Report

URL https://arxiv.org/abs/2502.13923. Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., and Zhao, F. Are we on the right way for evaluating large vision-language models?,

[2] Are We on the Right Way for Evaluating Large Vision-Language Models?

URL https://arxiv.org/abs/2403.20330. Feng, H., Liu, Q., Liu, H., Tang, J., Zhou, W., Li, H., and Huang, C. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding,

[3] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R

URL https://arxiv.org/abs/2311.11810. Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. Mme: A comprehensive evaluation benchmark for multimodal large language models,

[4] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

URL https://arxiv.org/abs/2306.13394. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering,

[5]

URL https://arxiv.org/abs/1612.00837. He, Z., Yang, M., Feng, M., Yin, J., Wang, X., Leng, J., and Lin, Z. Fourier transformer: Fast long range modeling by removing sequence redundancy with fft operator. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8954–8966. Association for Computational Linguistics,

[6] doi: 10.18653/v1/2024.acl-long.841

doi: 10.18653/v1/2023.findings-acl.570. URL http://dx.doi.org/10.18653/v1/2023.findings-acl.570. Hu, W., Dou, Z.-Y., Li, L. H., Kamath, A., Peng, N., and Chang, K.-W. Matryoshka query transformer for large vision-language models,

[7] org/abs/2405.19315

URL https://arxiv.org/abs/2405.19315. Huang, M., Huang, R., Shi, H., Chen, Y., Zheng, C., Sun, X., Jiang, X., Li, Z., and Cheng, H. Efficient multi-modal large language models via visual token grouping,

[8] Hudson, D

URL https://arxiv.org/abs/2411.17773. Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering,

[9] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

URL https://arxiv.org/abs/1902.09506. Kai, J., Zeng, B., Wang, Y., Bai, H., He, Z., Jiang, B., and Lin, Z. Freqkv: Frequency domain key-value compression for efficient context window extension,

[10] Freqkv: Frequency domain key-value compression for efficient context window extension

URL https://arxiv.org/abs/2505.00570. Lee-Thorp, J., Ainslie, J., Eckstein, I., and Ontanon, S. Fnet: Mixing tokens with fourier transforms,

[11] Li, J., Li, D., Savarese, S., and Hoi, S

URL https://arxiv.org/abs/2105.03824. Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023a. URL https://arxiv.org/abs/2301.12597. Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., Wang, L., and Qiao, Y. Mvbench:...

[12]

URL https://arxiv.org/abs/2304.08485. Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning, 2024a. URL https://arxiv.org/abs/2310.03744. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024b. URL https://llava-vl.github.io/b...

[13] Learn to explain: Multimodal reasoning via thought chains for science question answering

URL https://arxiv.org/abs/2209.09513. Shang, Y., Cai, M., Xu, B., Lee, Y. J., and Yan, Y. Llava-prumerge: Adaptive token reduction for efficient large multimodal models,

[14] Llava-prumerge: Adaptive token reduction for efficient large multimodal models

URL https://arxiv.org/abs/2403.15388. Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read,

[15] Towards VQA Models That Can Read

URL https://arxiv.org/abs/1904.08920. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,

[16]

URL https://arxiv.org/abs/2409.12191. xAI. Grok-1.5V and RealWorldQA Evaluation. https://x.ai/news/grok-1.5v,

[17] Xu, K., Qin, M., Sun, F., Wang, Y., Chen, Y.-K., and Ren, F

Accessed: 2025-07-29. Xu, K., Qin, M., Sun, F., Wang, Y., Chen, Y.-K., and Ren, F. Learning in the frequency domain,

[18] Ye, X., Gan, Y., Ge, Y., Zhang, X.-P., and Tang, Y

URL https://arxiv.org/abs/2002.12416. Ye, X., Gan, Y., Ge, Y., Zhang, X.-P., and Tang, Y. Atp-llava: Adaptive token pruning for large vision language models,

[19] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

URL https://arxiv.org/abs/2311.16502. Zhang, K., Li, B., Zhang, P., Pu, F., Cahyono, J. A., Hu, K., Liu, S., Zhang, Y., Yang, J., Li, C., and Liu, Z. Lmms-eval: Reality check on the evaluation of large multimodal models,

[20] LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

URL https://arxiv.org/abs/2407.12772. Zhang, S., Fang, Q., Yang, Z., and Feng, Y. Llava-mini: Efficient image and video large multimodal models with one vision token,

[21]

URL https://arxiv.org/abs/2501.03895. A. Training Details The training details for Fourier-LLaVA and Fourier-Qwen are provided in Table
