Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models

Deepak Ganesan; Hui Guan; Lijun Zhang; Xiao Liu

arxiv: 2411.05961 · v2 · submitted 2024-11-08 · 💻 cs.CV · cs.AI

Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models

Xiao Liu , Lijun Zhang , Deepak Ganesan , Hui Guan This is my paper

Pith reviewed 2026-05-23 17:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language modelsedge-cloud collaborationvector quantizationfeature compressionvisual question answeringLLaVApartitioned executionmodel deployment

0 comments

The pith

The AlignedVQ algorithm compresses intermediate features in vision-language models by 1365 times, enabling edge-cloud collaboration that cuts transmission overhead by 96.8% and speeds up inference 2-15 times with accuracy within 2% of the原版

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that vision-language models can be partitioned across edge devices and the cloud by inserting a quantization step on intermediate features. The Aligned Vector Quantization method is designed to shrink those features while keeping the information required for downstream visual question answering tasks. This matters to a reader because it turns bandwidth-heavy image uploads into small feature packets and lets local hardware handle the first part of the work. Experiments across eight VQA datasets show the accuracy stays within a narrow band of the full cloud version while delivering large gains in speed and transmission cost.

Core claim

The paper claims that the Aligned Vector Quantization algorithm can be inserted into an existing LLaVA model to support partitioned execution, achieving an approximately 1365x compression rate on intermediate features. This reduces data transmission overhead by 96.8% compared with sending JPEG90-compressed images, delivers an inference speedup of 2-15x, and keeps accuracy within -2.23% to +1.6% of the original model across eight VQA datasets, all without requiring model retraining or further architectural changes.

What carries the argument

Aligned Vector Quantization (AlignedVQ), a quantization step placed after early layers that aligns codebook entries to preserve semantic content of intermediate activations for the cloud-side layers.

If this is right

Early layers run on the edge device and only quantized features travel to the cloud.
Transmission volume falls below even heavily compressed raw images.
End-to-end inference time decreases by a factor of 2 to 15.
Accuracy on visual question answering remains comparable to the cloud-only baseline.
The change requires no retraining and works on existing model weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quantization step could be tested on other vision-language architectures beyond LLaVA.
Lower transmission might allow collaborative inference on slower or metered networks.
Energy use on the edge device could drop if early layers are lighter than full forward passes.
Combining AlignedVQ with existing model pruning might produce further size reductions.

Load-bearing premise

The quantized intermediate features still contain enough semantic information for the remaining model layers to answer visual questions correctly.

What would settle it

Running LLaVA-AlignedVQ on the eight VQA datasets and observing accuracy drops larger than 2.23% relative to the unmodified LLaVA model would falsify the central accuracy claim.

Figures

Figures reproduced from arXiv: 2411.05961 by Deepak Ganesan, Hui Guan, Lijun Zhang, Xiao Liu.

**Figure 2.** Figure 2: Accuracy performance of LLAVA on VQA-v2 datasets with different VQ variants. (a) Residual VQ with different [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the Vision Language Model (VLM) with the proposed AlignedVQ. The model consists of a Vision [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Histograms of feature magnitudes for different [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The trade-off between VQA task accuracy and [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: The inference latency of the visual encoder [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Vision Language Models (VLMs) are central to Visual Question Answering (VQA) systems and are typically deployed in the cloud due to their high computational demands. However, this cloud-only approach underutilizes edge computational resources and requires significant bandwidth for transmitting raw images. In this paper, we introduce an edge-cloud collaborative VQA system, called LLaVA-AlignedVQ, which features a novel Aligned Vector Quantization algorithm (AlignedVQ) that efficiently compress intermediate features without compromising accuracy to support partitioned execution. Our experiments demonstrate that LLaVA-AlignedVQ achieves approximately 1365x compression rate of intermediate features, reducing data transmission overhead by 96.8% compared to transmitting JPEG90-compressed images to the cloud. LLaVA-AlignedVQ achieves an inference speedup of 2-15x while maintaining high accuracy, remaining within -2.23% to +1.6% of the original model's accuracy performance across eight VQA datasets, compared to the cloud-only solution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AlignedVQ shows a workable post-hoc quantization step for splitting LLaVA across edge and cloud, with concrete compression and accuracy numbers on eight datasets but thin method description.

read the letter

The main takeaway is that the authors split LLaVA so the edge runs the vision part, quantizes the intermediate features with AlignedVQ, and sends the result to the cloud for the language model. They report 1365x compression on those features, which cuts transmission by 96.8% versus JPEG90 images, delivers 2-15x end-to-end speedup, and keeps accuracy inside -2.23% to +1.6% of the original model across eight VQA datasets. Those numbers are the useful part. They actually measured transmitted bytes and wall-clock time rather than just reporting theoretical rates, and the accuracy band holds across multiple benchmarks, which gives a reader a realistic sense of the trade-off for mobile VQA. The experiments are the strongest element here. The soft spot is the method itself. The abstract and available description do not spell out the AlignedVQ algorithm, the alignment objective, training procedure, or how it differs from ordinary vector quantization in a way that would let someone reproduce it. Without pseudocode or ablations it is hard to judge whether the gains come from a genuine technical step or from careful tuning of an existing quantizer. If the full paper supplies those details and shows they are necessary, the contribution looks more solid; otherwise it reads as an engineering demonstration rather than a new primitive. This paper is for people working on partitioned VLM inference and edge deployment. A practitioner who needs to estimate bandwidth savings would get direct value from the reported sizes and speedups. It deserves peer review because the empirical claims are testable and address a real deployment constraint, even though the method section would likely need expansion during review.

Referee Report

2 major / 1 minor

Summary. The paper introduces LLaVA-AlignedVQ, an edge-cloud collaborative VQA system that inserts a novel post-hoc Aligned Vector Quantization (AlignedVQ) step after the vision encoder on the edge device. Quantized intermediate features are transmitted to the cloud for the language-model stage of LLaVA. Experiments on eight VQA datasets report ~1365× feature compression, 96.8% lower transmission cost versus JPEG90 images, 2–15× inference speedup, and accuracy within −2.23% to +1.6% of the unmodified model.

Significance. If reproducible, the work supplies a concrete, training-free mechanism for partitioning large VLMs across edge and cloud while preserving downstream task performance. The multi-dataset accuracy evaluation directly tests semantic fidelity of the compressed features and, together with the measured bandwidth and latency gains, constitutes a practical contribution to efficient VLM deployment.

major comments (2)

[Abstract and §3] Abstract and §3 (method): the central claim that AlignedVQ can be inserted post-hoc into a frozen LLaVA model without retraining or architectural changes rests on an empirical demonstration, yet the manuscript supplies neither pseudocode, the precise alignment objective, nor the training procedure used to learn the codebook. These omissions are load-bearing for verifying the reported compression ratios and accuracy band.
[§4] §4 (experiments): accuracy is stated to remain within −2.23% to +1.6% across eight datasets, but no standard deviations, number of runs, or statistical tests are reported. Without these, it is impossible to determine whether the observed deviations are within the variability of the original model or constitute a genuine degradation.

minor comments (1)

[Figures 3–5] Figure captions and axis labels use inconsistent terminology for “feature size” versus “bit rate”; clarify whether reported compression is measured in bytes or bits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments below and will make revisions to improve the clarity and completeness of the method description and experimental reporting.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method): the central claim that AlignedVQ can be inserted post-hoc into a frozen LLaVA model without retraining or architectural changes rests on an empirical demonstration, yet the manuscript supplies neither pseudocode, the precise alignment objective, nor the training procedure used to learn the codebook. These omissions are load-bearing for verifying the reported compression ratios and accuracy band.

Authors: We agree that providing these details will enhance the reproducibility of our work. Although the method is presented as post-hoc without retraining the LLaVA model, the codebook learning involves a separate procedure. In the revised manuscript, we will add pseudocode for AlignedVQ, explicitly define the alignment objective, and describe the training procedure for the codebook in Section 3. revision: yes
Referee: [§4] §4 (experiments): accuracy is stated to remain within −2.23% to +1.6% across eight datasets, but no standard deviations, number of runs, or statistical tests are reported. Without these, it is impossible to determine whether the observed deviations are within the variability of the original model or constitute a genuine degradation.

Authors: We recognize the value of statistical analysis for validating the accuracy claims. Our current experiments consist of single runs for each dataset due to the substantial computational resources required for VLM evaluation. We will update Section 4 to report the number of runs and include a discussion on the observed accuracy variations across the eight datasets, which remain small and consistent. We are unable to provide standard deviations or statistical tests without conducting additional runs. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical method (AlignedVQ) for feature compression in partitioned LLaVA execution and reports measured outcomes: 1365x compression, 96.8% transmission reduction vs JPEG90, 2-15x speedup, and accuracy within -2.23% to +1.6% across eight VQA datasets. These are direct experimental results on fixed models, not derivations, fitted-parameter predictions, or self-citation chains. No equations or load-bearing steps reduce by construction to inputs; the central claim is an external benchmark test of semantic preservation under quantization. The derivation chain is self-contained against the reported datasets and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review provides no equations or implementation details, so free parameters, axioms, and invented entities cannot be enumerated beyond the high-level claim of a new quantization algorithm.

invented entities (1)

AlignedVQ no independent evidence
purpose: Compress intermediate features for edge-cloud partitioned VLM inference
New algorithm introduced to support the collaborative system; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5710 in / 1263 out tokens · 41629 ms · 2026-05-23T17:06:44.164243+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 4 internal anchors

[1]

Coefficient of variation

Abdi, H. Coefficient of variation. Encyclopedia of research design, 1 0 (5): 0 169--171, 2010

work page 2010
[2]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Flamingo: a visual language model for few-shot learning

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022

work page 2022
[4]

S., Li, Y., and Xu, Y

Berahmand, K., Daneshfar, F., Salehi, E. S., Li, Y., and Xu, Y. Autoencoders and their applications in machine learning: a survey. Artificial Intelligence Review, 57 0 (2): 0 28, 2024

work page 2024
[5]

E., et al

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing gpt-4 with 90\ See https://vicuna. lmsys. org (accessed 14 April 2023), 2 0 (3): 0 6, 2023

work page 2023
[6]

and Baji \'c , I

Choi, H. and Baji \'c , I. V. Deep feature compression for collaborative object detection. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp.\ 3743--3747. IEEE, 2018

work page 2018
[7]

W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25 0 (70): 0 1--53, 2024

work page 2024
[8]

A., Choi, H., and Baji \'c , I

Cohen, R. A., Choi, H., and Baji \'c , I. V. Lightweight compression of neural network feature tensors for collaborative intelligence. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pp.\ 1--6. IEEE, 2020

work page 2020
[9]

Vq4dit: Efficient post-training vector quantization for diffusion transformers

Deng, J., Li, S., Wang, Z., Gu, H., Xu, K., and Huang, K. Vq4dit: Efficient post-training vector quantization for diffusion transformers. arXiv preprint arXiv:2408.17131, 2024

work page arXiv 2024
[10]

Taming transformers for high-resolution image synthesis

Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 12873--12883, 2021

work page 2021
[11]

and Gray, R

Gersho, A. and Gray, R. M. Vector quantization and signal compression, volume 159. Springer Science & Business Media, 2012

work page 2012
[12]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017

work page 2017
[13]

J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J

Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3608--3617, 2018

work page 2018
[14]

Clio: Enabling automatic compilation of deep learning pipelines across iot and cloud

Huang, J., Samplawski, C., Ganesan, D., Marlin, B., and Kwon, H. Clio: Enabling automatic compilation of deep learning pipelines across iot and cloud. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pp.\ 1--12, 2020

work page 2020
[15]

Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6700--6709, 2019

work page 2019
[16]

Packet-loss-tolerant split inference for delay-sensitive deep learning in lossy wireless networks

Itahara, S., Nishio, T., and Yamamoto, K. Packet-loss-tolerant split inference for delay-sensitive deep learning in lossy wireless networks. arXiv preprint arXiv:2104.13629, 2021

work page arXiv 2021
[17]

Neurosurgeon: Collaborative intelligence between the cloud and mobile edge

Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., and Tang, L. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In ACM SIGARCH Computer Architecture News, volume 45, pp.\ 615--629. ACM, 2017

work page 2017
[18]

Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[19]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp.\ 19730--19742. PMLR, 2023 a

work page 2023
[20]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric

Lin, H., Bai, H., Liu, Z., Hou, L., Sun, M., Song, L., Wei, Y., and Sun, Z. Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 27370--27380, 2024

work page 2024
[22]

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll \'a r, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.\ 740--755. Springer, 2014

work page 2014
[23]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in neural information processing systems, 36, 2023

work page 2023
[24]

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26296--26306, 2024 a

work page 2024
[25]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024 b . URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024
[26]

L., Cao, T., Li, C., and Yang, M

Liu, Y., Wen, J., Wang, Y., Ye, S., Zhang, L. L., Cao, T., Li, C., and Yang, M. Vptq: Extreme low-bit vector post-training quantization for large language models. arXiv preprint arXiv:2409.17066, 2024 c

work page arXiv 2024
[27]

Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pp.\ 216--233

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pp.\ 216--233. Springer, 2025

work page 2025
[28]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

work page 2021
[29]

Towards vqa models that can read

Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 8317--8326, 2019

work page 2019
[30]

J., Bathla, G., Mehta, M., Chhabra, G., and Singh, P

Singh, H. J., Bathla, G., Mehta, M., Chhabra, G., and Singh, P. Visual questions answering developments, applications, datasets and opportunities: A state-of-the-art survey. In 2023 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), pp.\ 778--785. IEEE, 2023

work page 2023
[31]

And the bit goes down: Revisiting the quantization of neural networks

Stock, P., Joulin, A., Gribonval, R., Graham, B., and J \'e gou, H. And the bit goes down: Revisiting the quantization of neural networks. In ICLR 2020-Eighth International Conference on Learning Representations, pp.\ 1--11, 2020

work page 2020
[32]

Distributed deep neural networks over the cloud, the edge and end devices

Teerapittayanon, S., McDanel, B., and Kung, H.-T. Distributed deep neural networks over the cloud, the edge and end devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp.\ 328--339. IEEE, 2017

work page 2017
[33]

Neural discrete representation learning

Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

work page 2017
[34]

Hifi-codec: Group-residual vector quantization for high fidelity audio codec

Yang, D., Liu, S., Huang, R., Tian, J., Weng, C., and Zou, Y. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. 2023

work page 2023
[35]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Soundstream: An end-to-end neural audio codec, 2021

Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. Soundstream: An end-to-end neural audio codec, 2021

work page 2021
[37]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[1] [1]

Coefficient of variation

Abdi, H. Coefficient of variation. Encyclopedia of research design, 1 0 (5): 0 169--171, 2010

work page 2010

[2] [2]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Flamingo: a visual language model for few-shot learning

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022

work page 2022

[4] [4]

S., Li, Y., and Xu, Y

Berahmand, K., Daneshfar, F., Salehi, E. S., Li, Y., and Xu, Y. Autoencoders and their applications in machine learning: a survey. Artificial Intelligence Review, 57 0 (2): 0 28, 2024

work page 2024

[5] [5]

E., et al

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing gpt-4 with 90\ See https://vicuna. lmsys. org (accessed 14 April 2023), 2 0 (3): 0 6, 2023

work page 2023

[6] [6]

and Baji \'c , I

Choi, H. and Baji \'c , I. V. Deep feature compression for collaborative object detection. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp.\ 3743--3747. IEEE, 2018

work page 2018

[7] [7]

W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25 0 (70): 0 1--53, 2024

work page 2024

[8] [8]

A., Choi, H., and Baji \'c , I

Cohen, R. A., Choi, H., and Baji \'c , I. V. Lightweight compression of neural network feature tensors for collaborative intelligence. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pp.\ 1--6. IEEE, 2020

work page 2020

[9] [9]

Vq4dit: Efficient post-training vector quantization for diffusion transformers

Deng, J., Li, S., Wang, Z., Gu, H., Xu, K., and Huang, K. Vq4dit: Efficient post-training vector quantization for diffusion transformers. arXiv preprint arXiv:2408.17131, 2024

work page arXiv 2024

[10] [10]

Taming transformers for high-resolution image synthesis

Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 12873--12883, 2021

work page 2021

[11] [11]

and Gray, R

Gersho, A. and Gray, R. M. Vector quantization and signal compression, volume 159. Springer Science & Business Media, 2012

work page 2012

[12] [12]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 6904--6913, 2017

work page 2017

[13] [13]

J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J

Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3608--3617, 2018

work page 2018

[14] [14]

Clio: Enabling automatic compilation of deep learning pipelines across iot and cloud

Huang, J., Samplawski, C., Ganesan, D., Marlin, B., and Kwon, H. Clio: Enabling automatic compilation of deep learning pipelines across iot and cloud. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pp.\ 1--12, 2020

work page 2020

[15] [15]

Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6700--6709, 2019

work page 2019

[16] [16]

Packet-loss-tolerant split inference for delay-sensitive deep learning in lossy wireless networks

Itahara, S., Nishio, T., and Yamamoto, K. Packet-loss-tolerant split inference for delay-sensitive deep learning in lossy wireless networks. arXiv preprint arXiv:2104.13629, 2021

work page arXiv 2021

[17] [17]

Neurosurgeon: Collaborative intelligence between the cloud and mobile edge

Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., and Tang, L. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In ACM SIGARCH Computer Architecture News, volume 45, pp.\ 615--629. ACM, 2017

work page 2017

[18] [18]

Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[19] [19]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp.\ 19730--19742. PMLR, 2023 a

work page 2023

[20] [20]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric

Lin, H., Bai, H., Liu, Z., Hou, L., Sun, M., Song, L., Wei, Y., and Sun, Z. Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 27370--27380, 2024

work page 2024

[22] [22]

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll \'a r, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.\ 740--755. Springer, 2014

work page 2014

[23] [23]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in neural information processing systems, 36, 2023

work page 2023

[24] [24]

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26296--26306, 2024 a

work page 2024

[25] [25]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024 b . URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024

[26] [26]

L., Cao, T., Li, C., and Yang, M

Liu, Y., Wen, J., Wang, Y., Ye, S., Zhang, L. L., Cao, T., Li, C., and Yang, M. Vptq: Extreme low-bit vector post-training quantization for large language models. arXiv preprint arXiv:2409.17066, 2024 c

work page arXiv 2024

[27] [27]

Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pp.\ 216--233

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pp.\ 216--233. Springer, 2025

work page 2025

[28] [28]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

work page 2021

[29] [29]

Towards vqa models that can read

Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 8317--8326, 2019

work page 2019

[30] [30]

J., Bathla, G., Mehta, M., Chhabra, G., and Singh, P

Singh, H. J., Bathla, G., Mehta, M., Chhabra, G., and Singh, P. Visual questions answering developments, applications, datasets and opportunities: A state-of-the-art survey. In 2023 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), pp.\ 778--785. IEEE, 2023

work page 2023

[31] [31]

And the bit goes down: Revisiting the quantization of neural networks

Stock, P., Joulin, A., Gribonval, R., Graham, B., and J \'e gou, H. And the bit goes down: Revisiting the quantization of neural networks. In ICLR 2020-Eighth International Conference on Learning Representations, pp.\ 1--11, 2020

work page 2020

[32] [32]

Distributed deep neural networks over the cloud, the edge and end devices

Teerapittayanon, S., McDanel, B., and Kung, H.-T. Distributed deep neural networks over the cloud, the edge and end devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp.\ 328--339. IEEE, 2017

work page 2017

[33] [33]

Neural discrete representation learning

Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

work page 2017

[34] [34]

Hifi-codec: Group-residual vector quantization for high fidelity audio codec

Yang, D., Liu, S., Huang, R., Tian, J., Weng, C., and Zou, Y. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. 2023

work page 2023

[35] [35]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

Soundstream: An end-to-end neural audio codec, 2021

Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. Soundstream: An end-to-end neural audio codec, 2021

work page 2021

[37] [37]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page