Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3
The pith
Newer LLM backbones do not always produce better Vision-Language Models, with gains depending on the specific task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the vision encoder, training data, and post-training algorithm are held identical across LLaMA-1, LLaMA-2, and LLaMA-3 based VLMs, newer LLM backbones do not always yield better VLMs; performance depends on the downstream task. In VQA, newer backbones solve different questions rather than more questions, a difference driven by how the models process information, including better calibrated confidence and more stable internal representations. Some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer backbone.
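To make the "different questions rather than more questions" distinction concrete, one can compare the sets of question IDs each VLM answers correctly. The sketch below is illustrative only; the function and the toy IDs are assumptions, not the paper's actual analysis code.

```python
# Illustrative sketch (not from the paper): quantify whether model B solves
# MORE questions than model A or DIFFERENT ones, from per-question correctness.
def overlap_report(correct_a: set, correct_b: set) -> dict:
    """correct_a / correct_b: question IDs answered correctly by each model."""
    both = correct_a & correct_b
    union = correct_a | correct_b
    flips = (correct_a - correct_b) | (correct_b - correct_a)
    return {
        "net_gain": len(correct_b) - len(correct_a),  # "more questions"
        "churn": len(flips),                          # "different questions"
        "jaccard": len(both) / len(union) if union else 1.0,
    }

# A large churn with a near-zero net gain is the signature the pith describes.
print(overlap_report({"q1", "q2", "q3"}, {"q2", "q3", "q4"}))
# -> {'net_gain': 0, 'churn': 2, 'jaccard': 0.5}
```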
What carries the argument
Controlled comparison of LLaMA-1/2/3 backbones in VLMs while holding vision encoder, training data, and post-training algorithm fixed.
If this is right
- In VQA tasks, newer LLM backbones tend to solve different questions rather than simply more questions.
- Performance differences arise from variations in information processing, such as better calibrated confidence and more stable internal representations.
- Some VLM capabilities emerge only with the newest LLM generation.
- Tasks that depend primarily on visual understanding show little improvement when the LLM backbone is upgraded.
Where Pith is reading between the lines
- VLM developers should evaluate backbone upgrades on a per-task basis instead of assuming newer LLMs will improve all results.
- Stability of internal representations may be a critical factor for successful multimodal alignment.
- The same controlled setup could be used to test whether non-LLaMA backbones produce comparable task-dependent patterns.
Load-bearing premise
Holding the vision encoder, training data, and post-training algorithm fixed across LLaMA-1/2/3 backbones fully isolates the causal effect of the LLM backbone on VLM performance.
What would settle it
Finding that the LLaMA-3 based VLM uniformly outperforms the LLaMA-2 and LLaMA-1 versions on every tested task, including those centered on visual understanding, under identical controlled conditions.
Original abstract
Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. However, the integration of new LLMs into VLMs, particularly how the evolving LLMs contribute to multimodal reasoning, alignment, and task-specific performance remains underexplored. Addressing this gap is important for VLM development, given the rapid evolution of pretrained LLM backbones. This study presents a controlled and systematic investigation of how changes in the pretrained LLM backbone affect downstream VLM task performance. By having the vision encoder, training data, and post-training algorithm remain same across LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs, we find that newer LLM backbones do not always lead to better VLMs, but the performance depends on the downstream VLM task. For example, in visual question and answering tasks, newer LLM backbones tend to solve different questions rather than just more questions, and our analysis shows this is driven by differences in how the models process information, including better calibrated confidence and more stable internal representations. We also find that some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer LLM backbone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a controlled empirical comparison of vision-language models built on LLaMA-1, LLaMA-2, and LLaMA-3 backbones, with the vision encoder, training data, and post-training algorithm held fixed. It claims that newer LLM backbones do not uniformly improve VLM performance; gains are task-dependent. In VQA, newer backbones solve different questions (rather than more questions) due to better-calibrated confidence and more stable internal representations, while some capabilities emerge only with the newest generation and purely visual tasks show little benefit from backbone upgrades.
Significance. If the controlled comparisons and internal analyses hold, the work provides useful evidence that simply swapping in newer pretrained LLMs is not a reliable path to better VLMs and that task-specific factors and integration details matter. The emphasis on representation stability and calibration offers mechanistic insight beyond accuracy numbers, which could inform more deliberate VLM design choices.
major comments (2)
- [Abstract / Experimental Setup] The central claim that the experiment isolates the causal effect of the LLM backbone rests on the statement (abstract) that 'the vision encoder, training data, and post-training algorithm remain same.' However, LLaMA-1/2/3 differ in tokenizer vocabulary, attention mechanism (grouped-query attention appears in LLaMA-3 at the model scales studied), feed-forward width, and pretraining context length. Any adaptation of the vision-to-text projection layer to accommodate these differences (or any differential freezing/unfreezing) would confound attribution of VQA differences, calibration improvements, and representation stability to the backbone itself rather than to integration choices. This point is load-bearing for the 'newer backbones solve different questions' and 'driven by differences in how the models process information' conclusions.
- [Results / Analysis sections] The paper reports that newer backbones 'tend to solve different questions rather than just more questions' in VQA and links this to 'better calibrated confidence and more stable internal representations,' but provides no quantitative details on the metrics (e.g., ECE for calibration, representation similarity measures, or statistical tests for question overlap). Without these, it is difficult to assess whether the observed patterns are robust or sensitive to unstated implementation choices.
minor comments (2)
- [Abstract] The abstract is somewhat dense; breaking the key findings into bullet points or adding one concrete performance delta (e.g., accuracy change on a named VQA benchmark) would improve readability.
- [Title] The title 'Back to the Barn with LLAMAs' is playful but does not clearly signal the technical contribution; a more descriptive subtitle could help readers locate the paper.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our controls and commitments to strengthen the manuscript through added details and quantitative reporting.
Point-by-point responses
- Referee: [Abstract / Experimental Setup] The central claim that the experiment isolates the causal effect of the LLM backbone rests on the statement (abstract) that 'the vision encoder, training data, and post-training algorithm remain same.' However, LLaMA-1/2/3 differ in tokenizer vocabulary, attention mechanism (grouped-query attention appears in LLaMA-3 at the model scales studied), feed-forward width, and pretraining context length. Any adaptation of the vision-to-text projection layer to accommodate these differences (or any differential freezing/unfreezing) would confound attribution of VQA differences, calibration improvements, and representation stability to the backbone itself rather than to integration choices. This point is load-bearing for the 'newer backbones solve different questions' and 'driven by differences in how the models process information' conclusions.
Authors: We acknowledge that the architectural differences across LLaMA generations require corresponding adaptations at the vision-language interface, chiefly to accommodate each generation's native tokenizer and embedding space. To maintain control, we used an identical connector architecture (a fixed two-layer MLP) and the same training procedure for the connector in all cases, with each LLM initialized from its respective pretrained checkpoint. The vision encoder and its output features remain unchanged, and the post-training data and algorithm are identical. We will revise the Experimental Setup section to explicitly document the connector details, tokenizer handling, and freezing strategy (LLM backbone frozen, connector trained), along with a brief discussion of how these adaptations are the minimal set necessary to integrate each backbone while preserving the core comparison. This addresses the attribution concern without altering the central finding that performance gains are task-dependent.
revision: yes
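A minimal sketch of the setup the rebuttal describes: a shared two-layer MLP connector projecting frozen vision features into the backbone's hidden space, with the LLM parameters frozen and only the connector trained. All dimensions, names, and the activation choice below are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class VisionToTextConnector(nn.Module):
    """Fixed two-layer MLP reused identically across LLaMA-1/2/3 backbones."""
    def __init__(self, vision_dim: int, llm_hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_hidden_dim),
            nn.GELU(),
            nn.Linear(llm_hidden_dim, llm_hidden_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the shared encoder
        return self.mlp(vision_feats)

def freeze(module: nn.Module) -> None:
    """Freezing strategy stated in the rebuttal: backbone frozen, connector trained."""
    for p in module.parameters():
        p.requires_grad = False

# Hypothetical usage: the 7B/8B LLaMA variants share a 4096-d hidden size,
# so per generation only the tokenizer/embedding interface changes.
connector = VisionToTextConnector(vision_dim=1024, llm_hidden_dim=4096)
```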
- Referee: [Results / Analysis sections] The paper reports that newer backbones 'tend to solve different questions rather than just more questions' in VQA and links this to 'better calibrated confidence and more stable internal representations,' but provides no quantitative details on the metrics (e.g., ECE for calibration, representation similarity measures, or statistical tests for question overlap). Without these, it is difficult to assess whether the observed patterns are robust or sensitive to unstated implementation choices.
Authors: We appreciate the call for greater quantitative rigor. Our internal analyses included Expected Calibration Error (ECE) computed over 10 bins on VQA confidence scores, average cosine similarity of layer-wise hidden representations on a held-out validation set to quantify stability, and Jaccard overlap between correctly answered question sets with McNemar's test for statistical significance of differences. These supported the claims of improved calibration and stability with newer backbones, but were summarized rather than fully tabulated. We will add a new subsection to the Results section (with a supporting table and figure) reporting the specific ECE values, similarity scores, overlap statistics, and p-values. This will make the mechanistic explanations more transparent and allow readers to evaluate robustness directly.
revision: yes
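For concreteness, hedged sketches of the three analyses named in the rebuttal follow: 10-bin Expected Calibration Error, an exact McNemar test on the discordant question pairs, and a layer-wise cosine-similarity stability score. The binning, pairing, and stability protocol are assumptions here, since the rebuttal does not fully specify them.

```python
import numpy as np
from scipy.stats import binomtest

def ece(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Expected Calibration Error: bin-weighted |accuracy - mean confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

def mcnemar_p(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Exact McNemar test: do the two models err on significantly different questions?"""
    b = int(((correct_a == 1) & (correct_b == 0)).sum())  # A right, B wrong
    c = int(((correct_a == 0) & (correct_b == 1)).sum())  # A wrong, B right
    return binomtest(b, b + c, 0.5).pvalue if (b + c) else 1.0

def layer_stability(h1: list, h2: list) -> list:
    """Mean cosine similarity per layer between two hidden-state traces,
    e.g. the same model on paired/perturbed inputs (one plausible protocol)."""
    out = []
    for a, b in zip(h1, h2):  # each: (tokens, dim) float arrays
        cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
        out.append(float(cos.mean()))
    return out
```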
Circularity Check
Empirical comparison with no derivations, equations, or self-referential logic
Full rationale
The paper is a controlled empirical study swapping LLaMA-1/2/3 backbones while fixing the vision encoder, training data, and post-training algorithm. It reports task-dependent performance differences, question-solving patterns, confidence calibration, and representation stability based on experimental measurements. No equations, fitted parameters, predictions derived from inputs, or derivations appear in the provided abstract or described structure. Claims rest on direct observations rather than on any self-definitional, fitted-input, or self-citation chain that reduces to the inputs by construction. The architectural differences noted by the referee affect causal attribution but do not create logical circularity in the reported results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the vision encoder, training data, and post-training algorithm remain identical across LLaMA-1, LLaMA-2, and LLaMA-3 based VLMs.