Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

Chrysa Papagianni; Cyril Shih-Huan Hsu; Wig Yuan-Cheng Cheng

arxiv: 2604.26508 · v1 · submitted 2026-04-29 · 💻 cs.LG · cs.AI· cs.CV· cs.DC· cs.NI

Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

Cyril Shih-Huan Hsu , Wig Yuan-Cheng Cheng , Chrysa Papagianni This is my paper

Pith reviewed 2026-05-07 10:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CVcs.DCcs.NI

keywords semantic communicationedge-cloud collaborationvision-language modelsprogressive compressionautoencodernetwork latencysemantic fidelityresource-constrained inference

0 comments

The pith

A Meta AutoEncoder compresses visual tokens into progressive layers that edge devices can send to unmodified cloud VLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that edge devices can offload part of VLM inference to the cloud by transmitting visual information in controllable stages rather than all at once or not at all. It introduces a Meta AutoEncoder that turns visual tokens into representations that become more detailed as more bits arrive, while each partial version remains usable by standard VLMs. The design removes the need to retrain or modify the large models themselves. This matters because many real-world settings have limited uplink bandwidth, where sending raw images creates long delays and doing everything locally exceeds device power or memory limits. Experiments on an embedded platform and GPU server confirm that the scheme cuts latency at 1 Mbps while preserving semantic agreement on tasks.

Core claim

The central claim is that a progressive semantic communication framework built around a Meta AutoEncoder can compress visual tokens into adaptive, refinable representations. These representations support plug-and-play use with off-the-shelf VLMs at any chosen information level, delivering a tunable balance between transmitted data volume and final semantic fidelity without requiring model fine-tuning or architectural changes.

What carries the argument

Meta AutoEncoder that encodes visual tokens into a sequence of progressively refinable representations, each compatible with the input expectations of unmodified VLMs.

If this is right

The system provides a direct dial for trading off uplink bits against semantic accuracy under changing network conditions.
At 1 Mbps uplink the progressive scheme produces lower end-to-end latency than either full local execution or full raw-image upload to the cloud.
Semantic consistency remains high even when only a small fraction of the encoded information is transmitted.
No changes to the VLM weights or architecture are needed, allowing immediate use of current public models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same encoder could be tested on other vision encoders or multimodal models to check whether progressive compatibility is specific to current VLMs or more general.
In networks whose bandwidth fluctuates rapidly, the progressive stream might be paused or resumed mid-task without restarting inference from scratch.
If the encoder is made task-aware, the early layers could be tuned to preserve only the semantics needed for a given downstream query rather than generic visual content.

Load-bearing premise

The compressed representations at every progressive stage can be fed directly into existing VLMs with no retraining or architectural modification and still produce usable outputs.

What would settle it

Run the same VLM on original images versus on the Meta AutoEncoder outputs at full transmission; if task accuracy falls by more than a few percent, the compatibility claim does not hold.

Figures

Figures reproduced from arXiv: 2604.26508 by Chrysa Papagianni, Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng.

**Figure 1.** Figure 1: Progressively refinable latent representation via Meta AutoEncoder. The fixed-length latent is structured as an ordered token sequence, where prefixes provide increasingly accurate approximations of the full representation. This enables monotonic improvement in reconstruction quality as transmission progresses, supporting efficient incremental communication and precise control over the trade-off between ba… view at source ↗

**Figure 2.** Figure 2: System overview of the proposed progressive edge–cloud semantic communication framework. The edge encodes visual inputs into progressively refinable representations and transmits partial latent segments under bandwidth constraints. The cloud reconstructs the latent via a meta decoder and employs semantic quality-aware transmission control to adaptively request additional information until a target quality … view at source ↗

**Figure 3.** Figure 3: Example code of prefix masking module in PyTorch. view at source ↗

**Figure 4.** Figure 4: Progressive transmission with semantic quality-based control. The initial transmission of textual input is view at source ↗

**Figure 5.** Figure 5: SC-delay trade-off and SC-quality correlation. view at source ↗

**Figure 6.** Figure 6: Measured on-board power during a single Full-Edge inference cycle, using the same supply, sampling, and view at source ↗

**Figure 7.** Figure 7: Measured on-board power during a single Edge–Cloud inference cycle (supply voltage 11.7 V, sampled frame view at source ↗

read the original abstract

Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open-ep/ProSemComVLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a real edge-cloud system around a Meta AutoEncoder for progressive visual token compression in VLMs, but the performance claims lack any numbers or ablations to back them up.

read the letter

The core idea here is a Meta AutoEncoder that turns visual tokens into progressively refinable representations so you can send more or less detail depending on the network. This lets the edge device and cloud share VLM inference without shipping raw images or running the whole model locally. They actually wired it up on an NXP i.MX95 board talking to a GPU server and report that at 1 Mbps uplink the scheme cuts latency versus the obvious baselines while keeping semantic consistency decent under heavy compression. Releasing the code is useful for anyone who wants to reproduce or extend it.

Referee Report

2 major / 1 minor

Summary. The paper proposes a progressive semantic communication framework for edge-cloud Vision-Language Model (VLM) inference. A Meta AutoEncoder compresses visual tokens into adaptive, progressively refinable representations that support transmission at varying information levels for a controllable trade-off between communication cost and semantic fidelity. The design is claimed to be plug-and-play with unmodified off-the-shelf VLMs without fine-tuning or retraining. A full end-to-end system is implemented on an NXP i.MX95 embedded platform and GPU server over bandwidth-constrained networks, with the abstract claiming that at 1 Mbps uplink the scheme significantly reduces network latency relative to full-edge and full-cloud baselines while preserving high semantic consistency under high compression.

Significance. If the central claims hold, the work would offer a practical advance for deploying compute-intensive VLMs in bandwidth-limited edge settings by exploiting semantic redundancy through progressive transmission rather than fixed-size features. The real-hardware implementation on i.MX95 adds engineering value beyond simulation-only studies. However, the absence of verifiable quantitative metrics in the evaluation weakens the assessed significance until the experimental details are supplied.

major comments (2)

[Abstract and §3] Abstract and §3: The load-bearing claim that the Meta AutoEncoder produces representations that remain directly compatible with unmodified off-the-shelf VLMs at every progressive refinement level (without fine-tuning or retraining) is insufficiently justified. Progressive compression must be shown to preserve exact token count, embedding dimension, and positional encoding alignment with the VLM vision tower; otherwise the forward pass and attention patterns would break, directly undermining the reported semantic-consistency and latency results at 1 Mbps.
[Experimental evaluation (presumed §4–5)] Experimental evaluation (presumed §4–5): The abstract states positive outcomes on the i.MX95 + GPU testbed but supplies no quantitative metrics (exact latency values, semantic-consistency scores such as VQA accuracy or embedding similarity), error bars, or ablation results across progressive levels and compression ratios. Without these data the claim of significant latency reduction while maintaining high consistency cannot be verified and the trade-off analysis remains non-reproducible.

minor comments (1)

[Abstract] The manuscript promises code release at https://github.com/open-ep/ProSemComVLM upon publication; including a permanent DOI or snapshot link at submission time would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The two major comments raise important points about technical justification and experimental clarity. We address each below and will revise the manuscript to strengthen these aspects while preserving the core contributions.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3: The load-bearing claim that the Meta AutoEncoder produces representations that remain directly compatible with unmodified off-the-shelf VLMs at every progressive refinement level (without fine-tuning or retraining) is insufficiently justified. Progressive compression must be shown to preserve exact token count, embedding dimension, and positional encoding alignment with the VLM vision tower; otherwise the forward pass and attention patterns would break, directly undermining the reported semantic-consistency and latency results at 1 Mbps.

Authors: We appreciate the referee's emphasis on this foundational requirement. Section 3 describes the Meta AutoEncoder as operating on the vision encoder's output tokens and producing progressively refinable representations that are explicitly constructed to match the original token count, embedding dimension, and positional encodings at every refinement stage. This architectural choice enables direct plug-and-play use with unmodified VLMs. That said, we agree the justification can be made more rigorous and self-contained. In the revised version we will expand Section 3 with an explicit alignment proof sketch, a diagram of the token/embedding flow across refinement levels, and pseudocode confirming that the decoder always emits a representation whose shape and positional structure are identical to the vision tower output. This addition will directly address the concern without changing the reported results. revision: yes
Referee: [Experimental evaluation (presumed §4–5)] Experimental evaluation (presumed §4–5): The abstract states positive outcomes on the i.MX95 + GPU testbed but supplies no quantitative metrics (exact latency values, semantic-consistency scores such as VQA accuracy or embedding similarity), error bars, or ablation results across progressive levels and compression ratios. Without these data the claim of significant latency reduction while maintaining high consistency cannot be verified and the trade-off analysis remains non-reproducible.

Authors: We acknowledge that the abstract currently summarizes the outcomes qualitatively. Sections 4 and 5 already contain the underlying quantitative results from the NXP i.MX95 + GPU testbed, including measured uplink latencies at 1 Mbps, VQA accuracy and embedding-similarity scores for semantic consistency, error bars across multiple runs, and ablations over progressive refinement levels and compression ratios. To make these claims immediately verifiable from the abstract and to improve reproducibility, we will revise the abstract to report the key numerical values (e.g., latency reduction percentages and accuracy retention) and will ensure the experimental section presents the full tables and figures with error bars and ablation details in a more prominent format. These changes will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an engineering framework for progressive semantic communication via a Meta AutoEncoder that produces refinable visual token representations for plug-and-play use with unmodified VLMs. The central claims (latency reduction at 1 Mbps uplink while preserving semantic consistency) rest on end-to-end system measurements from an NXP i.MX95 + GPU testbed rather than any closed-form derivation or prediction that reduces to fitted parameters by construction. No equations or sections define the target performance metrics in terms of themselves, invoke self-citations as load-bearing uniqueness theorems, or rename known empirical patterns as novel results. The design choice of compatibility without fine-tuning is stated as an explicit architectural property and evaluated experimentally; it does not collapse into a tautology. This is the common honest case of a self-contained systems paper whose results are externally falsifiable via the released implementation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard assumptions from machine learning and communication theory plus one new component.

axioms (1)

domain assumption Off-the-shelf VLMs can process progressively compressed visual tokens without performance degradation when the compression is learned via meta-training.
Invoked in the plug-and-play claim without additional fine-tuning.

invented entities (1)

Meta AutoEncoder no independent evidence
purpose: To generate adaptive, progressively refinable visual token representations for semantic communication.
New component introduced to enable the progressive transmission scheme.

pith-pipeline@v0.9.0 · 5555 in / 1176 out tokens · 26542 ms · 2026-05-07T10:41:15.631697+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Distributed vlms: Efficient vision-language processing through cloud-edge collaboration,

Y . Li, D. Gumaste, M. K. T ¨urkcan, J. Ghaderi, G. Zussman, and Z. Kosti ´c, “Distributed vlms: Efficient vision-language processing through cloud-edge collaboration,”2025 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), pp. 280–286, 2025

2025
[2]

Collaborative edge-to-server inference for vision-language models,

S. Song and Y . Kim, “Collaborative edge-to-server inference for vision-language models,”arXiv preprint arXiv:2512.16349, 2025

work page arXiv 2025
[3]

VaVLM: Toward efficient edge-cloud video analytics with vision-language models,

Y . Zhang, H. Wang, Q. Bai, H. Liang, P. Zhu, G.-M. Muntean, and Q. Li, “VaVLM: Toward efficient edge-cloud video analytics with vision-language models,”IEEE Transactions on Broadcasting, vol. 71, no. 2, pp. 529–541, 2025

2025
[4]

Flamingo: a visual language model for few-shot learning,

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynoldset al., “Flamingo: a visual language model for few-shot learning,”Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022

2022
[5]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol
[6]

19 730–19 742

PMLR, 23–29 Jul 2023, pp. 19 730–19 742

2023
[7]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAdvances in Neural Information Processing Systems, vol. 36. Curran Associates, Inc., 2023, pp. 34 892–34 916

2023
[8]

Vila: On pre-training for visual language models,

J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 689–26 699

2024
[9]

Qwen3-VL Technical Report

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Geet al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review arXiv 2025
[10]

Witt: A wireless image transmission transformer for semantic communications,

K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “Witt: A wireless image transmission transformer for semantic communications,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[11]

Visual language model based cross-modal semantic communication systems,

F. Jiang, C. Tang, L. Dong, K. Wang, K. Yang, and C. Pan, “Visual language model based cross-modal semantic communication systems,” IEEE Transactions on Wireless Communications, 2025

2025
[12]

Token communications: A large model-driven framework for cross-modal context-aware semantic communications,

L. Qiao, M. B. Mashhadi, Z. Gao, R. Tafazolli, M. Bennis, and D. Niyato, “Token communications: A large model-driven framework for cross-modal context-aware semantic communications,”IEEE Wireless Communications, vol. 32, no. 5, pp. 80–88, 2025

2025
[13]

Vlf-msc: Vision-language feature-based multimodal semantic communication system,

G. Ahn, J. Seo, and J. Kang, “Vlf-msc: Vision-language feature-based multimodal semantic communication system,”arXiv preprint arXiv:2511.10074, 2025

work page arXiv 2025
[14]

Video tokencom: Textual intent-guided multi-rate video token communications with uep-based adaptive source-channel coding,

J. Men, M. B. Mashhadi, N. Wang, Y . Ma, M. Nilsson, and R. Tafazolli, “Video tokencom: Textual intent-guided multi-rate video token communications with uep-based adaptive source-channel coding,”arXiv preprint arXiv:2603.02470, 2026

work page arXiv 2026
[15]

Efficient deployment of vision-language models on mobile devices: A case study on oneplus 13r,

P. R. Guerrero, Y . Pan, and S. Kashyap, “Efficient deployment of vision-language models on mobile devices: A case study on oneplus 13r,”arXiv preprint arXiv:2507.08505, 2025

work page arXiv 2025
[16]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

2017
[17]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

2019
[18]

Perceptual losses for real-time style transfer and super-resolution,

J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” inEuropean conference on computer vision. Springer, 2016, pp. 694–711

2016
[19]

SmolVLM: Redefining small and efficient multimodal models,

A. Marafioti, O. Zohar, M. Farr ´e, M. noyan, E. Bakouch, P. M. C. Jim ´enez, C. Zakka, L. B. allal, A. Lozhkov, N. Tazi, V . Srivastav, J. Lochner, H. Larcher, M. Morlon, L. Tunstall, L. V . Werra, and T. Wolf, “SmolVLM: Redefining small and efficient multimodal models,” inSecond Conference on Language Modeling, 2025

2025
[20]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean conference on computer vision. Springer, 2014, pp. 740–755

2014
[21]

Evaluating object hallucination in large vision-language models,

Y . Li, Y . Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” in Proceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 292–305. Preprint. Under review. April 2026 APPENDIXA THEORETICALJUSTIFICATION OFPROGRESSIVEREFINABILITY We now provide a theore...

2023

[1] [1]

Distributed vlms: Efficient vision-language processing through cloud-edge collaboration,

Y . Li, D. Gumaste, M. K. T ¨urkcan, J. Ghaderi, G. Zussman, and Z. Kosti ´c, “Distributed vlms: Efficient vision-language processing through cloud-edge collaboration,”2025 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), pp. 280–286, 2025

2025

[2] [2]

Collaborative edge-to-server inference for vision-language models,

S. Song and Y . Kim, “Collaborative edge-to-server inference for vision-language models,”arXiv preprint arXiv:2512.16349, 2025

work page arXiv 2025

[3] [3]

VaVLM: Toward efficient edge-cloud video analytics with vision-language models,

Y . Zhang, H. Wang, Q. Bai, H. Liang, P. Zhu, G.-M. Muntean, and Q. Li, “VaVLM: Toward efficient edge-cloud video analytics with vision-language models,”IEEE Transactions on Broadcasting, vol. 71, no. 2, pp. 529–541, 2025

2025

[4] [4]

Flamingo: a visual language model for few-shot learning,

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynoldset al., “Flamingo: a visual language model for few-shot learning,”Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022

2022

[5] [5]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol

[6] [6]

19 730–19 742

PMLR, 23–29 Jul 2023, pp. 19 730–19 742

2023

[7] [7]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAdvances in Neural Information Processing Systems, vol. 36. Curran Associates, Inc., 2023, pp. 34 892–34 916

2023

[8] [8]

Vila: On pre-training for visual language models,

J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 689–26 699

2024

[9] [9]

Qwen3-VL Technical Report

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Geet al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review arXiv 2025

[10] [10]

Witt: A wireless image transmission transformer for semantic communications,

K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “Witt: A wireless image transmission transformer for semantic communications,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023

[11] [11]

Visual language model based cross-modal semantic communication systems,

F. Jiang, C. Tang, L. Dong, K. Wang, K. Yang, and C. Pan, “Visual language model based cross-modal semantic communication systems,” IEEE Transactions on Wireless Communications, 2025

2025

[12] [12]

Token communications: A large model-driven framework for cross-modal context-aware semantic communications,

L. Qiao, M. B. Mashhadi, Z. Gao, R. Tafazolli, M. Bennis, and D. Niyato, “Token communications: A large model-driven framework for cross-modal context-aware semantic communications,”IEEE Wireless Communications, vol. 32, no. 5, pp. 80–88, 2025

2025

[13] [13]

Vlf-msc: Vision-language feature-based multimodal semantic communication system,

G. Ahn, J. Seo, and J. Kang, “Vlf-msc: Vision-language feature-based multimodal semantic communication system,”arXiv preprint arXiv:2511.10074, 2025

work page arXiv 2025

[14] [14]

Video tokencom: Textual intent-guided multi-rate video token communications with uep-based adaptive source-channel coding,

J. Men, M. B. Mashhadi, N. Wang, Y . Ma, M. Nilsson, and R. Tafazolli, “Video tokencom: Textual intent-guided multi-rate video token communications with uep-based adaptive source-channel coding,”arXiv preprint arXiv:2603.02470, 2026

work page arXiv 2026

[15] [15]

Efficient deployment of vision-language models on mobile devices: A case study on oneplus 13r,

P. R. Guerrero, Y . Pan, and S. Kashyap, “Efficient deployment of vision-language models on mobile devices: A case study on oneplus 13r,”arXiv preprint arXiv:2507.08505, 2025

work page arXiv 2025

[16] [16]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

2017

[17] [17]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

2019

[18] [18]

Perceptual losses for real-time style transfer and super-resolution,

J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” inEuropean conference on computer vision. Springer, 2016, pp. 694–711

2016

[19] [19]

SmolVLM: Redefining small and efficient multimodal models,

A. Marafioti, O. Zohar, M. Farr ´e, M. noyan, E. Bakouch, P. M. C. Jim ´enez, C. Zakka, L. B. allal, A. Lozhkov, N. Tazi, V . Srivastav, J. Lochner, H. Larcher, M. Morlon, L. Tunstall, L. V . Werra, and T. Wolf, “SmolVLM: Redefining small and efficient multimodal models,” inSecond Conference on Language Modeling, 2025

2025

[20] [20]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean conference on computer vision. Springer, 2014, pp. 740–755

2014

[21] [21]

Evaluating object hallucination in large vision-language models,

Y . Li, Y . Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” in Proceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 292–305. Preprint. Under review. April 2026 APPENDIXA THEORETICALJUSTIFICATION OFPROGRESSIVEREFINABILITY We now provide a theore...

2023