jina-vlm: Small Multilingual Vision Language Model

Andreas Koukounas; Florian Hönicke; Georgios Mastrapas; Guillaume Roncari; Han Xiao; Scott Martens; Sedigheh Eslami

arXiv: 2512.04032 · v3 · submitted 2025-12-03 · cs.CL · cs.AI · cs.CV

Pith reviewed 2026-05-17 01:58 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CV

keywords vision-language model · multilingual VQA · small VLMs · image tiling · attention pooling · data ablation · SigLIP2 · Qwen3

The pith

A 2.4B vision-language model reaches state-of-the-art multilingual VQA among open 2B-scale models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces jina-vlm, a 2.4 billion parameter vision-language model built by pairing a SigLIP2 vision encoder with a Qwen3 language decoder. It achieves leading results on multilingual visual question answering tasks compared to other open models of similar size through the use of image tiling and attention-pooling that keeps token counts low even for high-resolution inputs. The authors run leave-one-out ablations that remove entire categories of training data by task, domain, modality, and language to determine what is essential and what transfers. A sympathetic reader would care because this points to a route for strong multilingual visual understanding in compact, openly available models rather than only in much larger systems.

Core claim

jina-vlm is a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and makes use of image tiling and attention-pooling for token-efficient processing of arbitrary-resolution images. A leave-one-out data mixture ablation study systematically removes task, domain, modality, and language categories to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains.
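
The leave-one-out design can be sketched in a few lines of Python. This is only an illustration of the procedure as the abstract describes it: the category names, the integer weights, and the `toy_score` stand-in for a full train-and-evaluate run are all hypothetical, and the sketch takes no position on whether the paper renormalizes the remaining mixture after removing a category.

```python
from typing import Callable, Dict, Iterator, Tuple

Mixture = Dict[str, int]  # category name -> sampling weight (hypothetical units)


def leave_one_out(mixture: Mixture) -> Iterator[Tuple[str, Mixture]]:
    """Yield (held_out, reduced_mixture) with exactly one category removed."""
    for held_out in mixture:
        reduced = {name: w for name, w in mixture.items() if name != held_out}
        yield held_out, reduced


def ablation_deltas(
    mixture: Mixture, train_and_eval: Callable[[Mixture], float]
) -> Dict[str, float]:
    """Score change relative to the full mixture when each category is dropped."""
    baseline = train_and_eval(mixture)
    return {
        held_out: train_and_eval(reduced) - baseline
        for held_out, reduced in leave_one_out(mixture)
    }


# Hypothetical categories; these are placeholders, not the paper's data mixture.
full_mixture = {"captioning": 3, "document_qa": 2, "text_only": 1}


def toy_score(mixture: Mixture) -> float:
    """Toy scorer (sum of weights); a real run would train and benchmark a model."""
    return float(sum(mixture.values()))


deltas = ablation_deltas(full_mixture, toy_score)
```

Under this framing, a delta near zero would mark a category as redundant for the chosen metric and a large negative delta would mark it as necessary. Scoring each reduced run on several benchmark domains, rather than one, is what would expose cross-domain transfer.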

What carries the argument

Image tiling combined with attention-pooling on the outputs of the SigLIP2 vision encoder before they reach the Qwen3 decoder, which reduces token count while preserving detail for arbitrary image sizes and supports the multilingual VQA results.

If this is right

Carefully chosen small-scale open models can deliver competitive multilingual visual reasoning without scaling parameters further.
Leave-one-out ablations identify redundant training categories and show which task gains generalize across domains.
Token-efficient image handling allows arbitrary-resolution inputs without proportional compute growth.
Released weights and code enable direct fine-tuning or extension by others on additional languages or visual tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The efficiency techniques could support real-time visual assistance tools in low-resource language settings.
The same ablation approach might be reused to prune data mixtures for other multimodal models and reduce training costs.
Results suggest that future work could test whether the architecture maintains advantages when extended to video or additional modalities.

Load-bearing premise

The chosen VQA benchmarks and evaluation protocol fairly represent real-world multilingual performance across languages and domains without hidden biases in the test sets.

What would settle it

A new multilingual VQA test set drawn from underrepresented languages and everyday image domains where jina-vlm scores below other open 2B-scale models would falsify the state-of-the-art claim.

Figures

Figures reproduced from arXiv: 2512.04032 by Andreas Koukounas, Florian Hönicke, Georgios Mastrapas, Guillaume Roncari, Han Xiao, Scott Martens, Sedigheh Eslami.

**Figure 1.** Architecture of jina-vlm. Images are resized to fit a grid of up to 12 overlapping tiles, plus a global thumbnail. Each tile is a square 378×378 crop; adjacent tiles overlap by 112 pixels with a stride of 266 pixels between tile origins. A 4×3 grid therefore spans 1176×910 pixels, and images exceeding this effective resolution are downscaled to fit the tile budget. Each tile produces 729 patches via SigLIP…
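
The numbers in the Figure 1 caption can be checked with a little arithmetic. The sketch below is illustrative Python, not the released code: the function names are made up here, the 14-pixel patch size is inferred from the caption (729 patches is 27×27, and 378 / 27 = 14) rather than stated in it, and the final count assumes the thumbnail is encoded like any other tile.

```python
TILE = 378               # tile side in pixels (from the caption)
OVERLAP = 112            # overlap between adjacent tiles (from the caption)
STRIDE = TILE - OVERLAP  # 266 pixels between tile origins


def grid_span(cols: int, rows: int, tile: int = TILE, stride: int = STRIDE):
    """Pixel (width, height) covered by a cols x rows grid of overlapping tiles."""
    return (cols - 1) * stride + tile, (rows - 1) * stride + tile


def patches_per_tile(tile: int = TILE, patch: int = 14) -> int:
    """Patches per square tile; patch=14 is an inference, see the note above."""
    side = tile // patch
    return side * side


def raw_patch_count(num_tiles: int) -> int:
    """Patches before any pooling: the tiles plus one global thumbnail."""
    return (num_tiles + 1) * patches_per_tile()


span = grid_span(cols=4, rows=3)     # (1176, 910), matching the caption
per_tile = patches_per_tile()        # 729, matching the caption
max_raw = raw_patch_count(12)        # 13 * 729 = 9477 patches before pooling
```

The last figure shows why a pooling step before the decoder matters: a full 12-tile grid plus thumbnail yields several thousand patches per image before any reduction.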

**Figure 2.** Answer questions given web documents. Charts & Tables dataset: TAT-QA, Zhu et al. (2021). Question: Unrecognized Tax Benefits. Activity related to unrecognized tax benefits is as follows (in thousands): ... As of July 31, 2019, the Company has no income tax audits in progress in the U.S. or foreign jurisdictions. What was the increase in unrecognized tax benefits in 2019? Answer: $1.3 million

**Figure 3.** Financial table requiring numerical reasoning over text.

**Figure 4.** Document image with question about textual fields.

**Figure 5.** Photo with textual question needing OCR reading.

**Figure 6.** General visual question answering on natural images.

**Figure 7.** Scene requiring counting and spatial reasoning accuracy.

**Figure 8.** Synthetic shapes testing compositional spatial reasoning.

**Figure 9.** User interface screenshot with structured textual elements.

**Figure 10.** Microscopic pathology image for medical VQA.

**Figure 11.** Text-only tasks covering multiple languages.

Original abstract

We present jina-vlm, a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and makes use of image tiling and attention-pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

jina-vlm gives a new open 2.4B multilingual VLM plus data ablation diagnostics, but the SOTA claim needs tighter checks on whether baselines used identical test conditions.

The main takeaway is that this paper trains and releases a new 2.4B VLM called jina-vlm that pairs a SigLIP2 vision encoder with a Qwen3 decoder, adds image tiling and attention pooling for token efficiency, and reports state-of-the-art multilingual VQA results among open models of similar size. The extra piece is the leave-one-out ablation that removes entire categories of training data (task, domain, modality, or language) to measure what actually drives performance and whether benefits transfer across areas. That diagnostic is practical and gives readers concrete signals on data mixture choices without requiring them to rerun everything themselves. Public release of weights and code is also straightforward value; it lets others test or build on the model right away. The evaluation protocol and ablation results are presented clearly enough to make the work usable as a baseline.

The softer spot is the SOTA claim itself. It only holds if the numbers for other open 2B VLMs come from the same multilingual test sets, splits, languages, and scoring rules rather than being pulled from earlier papers that may have used narrower or differently translated data. The ablation study diagnoses internal data contributions but does not resolve that cross-model consistency question. If the full tables and exact exclusion rules are in the manuscript, the issue stays minor; otherwise it leaves some uncertainty about how much the ranking reflects real gains.

This paper is aimed at people who need small, deployable VLMs for non-English visual tasks or who run data-mixture experiments. It is incremental on the architecture side but adds verifiable empirical detail. I would send it to peer review because the model release and ablation results are grounded enough to deserve referee time, even if the evaluation section needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper introduces jina-vlm, a 2.4B parameter vision-language model that couples a SigLIP2 vision encoder with a Qwen3 language decoder and employs image tiling plus attention-pooling for token-efficient handling of arbitrary-resolution images. It claims state-of-the-art multilingual VQA performance among open 2B-scale VLMs and presents a leave-one-out ablation on training data mixtures (task, domain, modality, language) to diagnose contributions and transfer effects. Model weights and code are publicly released.

Significance. If the SOTA claim is supported by consistent re-evaluation of all baselines, the work would deliver a competitive open-source small multilingual VLM with practical efficiency techniques. The ablation study supplies concrete diagnostics on data mixture design that are useful for the field. Public release of weights and code is a clear strength for reproducibility.

major comments (1)

[Abstract] The central claim that jina-vlm achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs is load-bearing. The manuscript must explicitly document (in the experiments section or associated tables) whether every comparator model was re-run on the identical multilingual VQA test sets, language coverage, splits, and scoring rules, or whether some numbers were taken from prior papers that may have used English-only subsets or different protocols.

minor comments (2)

[Ablation study] The leave-one-out ablation would be clearer if each removed category were tied to the exact benchmark subsets and languages affected, so readers can judge cross-domain transfer.
[Results tables] Tables reporting VQA scores should include error bars or multiple-run statistics to allow assessment of result stability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that transparent documentation of evaluation protocols is necessary to support the central SOTA claim and will revise the paper accordingly.

Point-by-point responses

Referee: [Abstract] The central claim that jina-vlm achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs is load-bearing. The manuscript must explicitly document (in the experiments section or associated tables) whether every comparator model was re-run on the identical multilingual VQA test sets, language coverage, splits, and scoring rules, or whether some numbers were taken from prior papers that may have used English-only subsets or different protocols.

Authors: We agree that explicit documentation of evaluation protocols is essential for validating the SOTA claim. In the current version, baseline numbers were obtained through a combination of re-evaluation on our multilingual VQA test sets (where model weights were publicly available) and reported results from original papers. To address this, we will add a new subsection 'Evaluation Protocol for Baselines' in the Experiments section, accompanied by a table that specifies for each comparator: (i) whether it was re-run on the identical test sets, language coverage, and splits; (ii) the exact scoring rules applied; and (iii) any noted differences such as English-only subsets in prior work. This table will be included in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

Minor self-citations to component models; central SOTA claim is empirical and externally benchmarked

full rationale

The paper presents jina-vlm as an empirical model combining SigLIP2 encoder and Qwen3 decoder with image tiling, trained on a data mixture and evaluated on public multilingual VQA benchmarks. No derivation chain, equation, or first-principles prediction is claimed that reduces to inputs by construction. The leave-one-out ablation diagnoses data contributions but does not create fitted-input predictions. Self-citations to prior component models are not load-bearing for the performance claim, which rests on measured results against external test sets rather than internal definitions or self-referential uniqueness theorems.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The performance claim depends on the representativeness of standard VQA benchmarks and on the assumption that the chosen data mixture ratios generalize beyond the training distribution. No new physical entities or untested mathematical axioms are introduced.

free parameters (1)

data mixture ratios
Proportions of task, domain, modality, and language categories are selected and then ablated to determine necessity.

axioms (1)

domain assumption: Standard multilingual VQA benchmarks are unbiased proxies for real-world performance across languages.
Invoked when claiming state-of-the-art status without additional cross-validation on held-out languages or domains.

pith-pipeline@v0.9.0 · 5442 in / 1289 out tokens · 28738 ms · 2026-05-17T01:58:00.598514+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

jina-vlm achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1] Compute overlap-related sizes: $m_{\text{tot}} \leftarrow p \cdot (m_L + m_R)$ (total overlap margin in pixels); $s_{\text{win}} \leftarrow (\lfloor b_h / p \rfloor - (m_L + m_R)) \cdot p$ (tile stride in pixels).

[2] Select tiling on the margin-reduced image: $(t_h, t_w) \leftarrow \mathrm{SELECTTILINGWITHMINIMALSCALECHANGE}(h - m_{\text{tot}},\, w - m_{\text{tot}},\, s_{\text{win}},\, M)$.

[3] Resize image to exactly fit the chosen tiling plus margins: $H' \leftarrow t_h \cdot s_{\text{win}} + m_{\text{tot}}$; $W' \leftarrow t_w \cdot s_{\text{win}} + m_{\text{tot}}$; $I_{\text{grid}} \leftarrow \mathrm{RESIZE}(I, [H', W'])$.

[4] Extract overlapping tiles: $G \leftarrow \mathrm{EXTRACTTILES}(I_{\text{grid}}, (t_h, t_w), s_{\text{win}}, b_h)$, where $b_h$ is the tile height, equal to $b_w$ here.

[5] Build thumbnail and final tile list: $T \leftarrow \mathrm{RESIZE}(I, [b_h, b_w])$ (global thumbnail); $C \leftarrow [T] \mathbin{+\!\!+} G$ (concatenate thumbnail and tiles); return $(C, (t_h, t_w))$.

A.2 Training set examples. Captioning & Instruction dataset: VisualWebInstruct, Jia et al. (2025). Question: what is the meeting title? Answer: Conflict Resolution Meeting. Figure 2: Answer questions given web documents....
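
Read together, the five extracted steps describe a tiling procedure, which the following Python sketch walks through at the level of geometry only (no pixels are touched). Several pieces are assumptions: the 14-pixel patch size and the 4 + 4 patch split of the overlap margin are inferred from the 729 patches per tile and the 112-pixel overlap in Figure 1, and `select_tiling` is a hypothetical stand-in for SELECTTILINGWITHMINIMALSCALECHANGE, whose definition does not appear in the extracted text.

```python
from math import log


def overlap_sizes(tile=378, patch=14, m_left=4, m_right=4):
    """Step 1: total overlap margin and tile stride, both in pixels."""
    m_tot = patch * (m_left + m_right)
    s_win = (tile // patch - (m_left + m_right)) * patch
    return m_tot, s_win


def select_tiling(h, w, s_win, max_tiles):
    """Step 2 (hypothetical rule): pick the grid whose resize factors are closest to 1."""
    best, best_cost = None, float("inf")
    for t_h in range(1, max_tiles + 1):
        for t_w in range(1, max_tiles // t_h + 1):
            cost = abs(log(t_h * s_win / h)) + abs(log(t_w * s_win / w))
            if cost < best_cost:
                best, best_cost = (t_h, t_w), cost
    return best


def plan_tiles(height, width, tile=378, max_tiles=12):
    """Steps 1-4 as geometry: resized size, grid shape, and tile boxes."""
    m_tot, s_win = overlap_sizes(tile)
    # Guard against images smaller than the margin so log() stays defined.
    t_h, t_w = select_tiling(
        max(height - m_tot, 1), max(width - m_tot, 1), s_win, max_tiles
    )
    # Step 3: target size that the grid plus margins covers exactly.
    resized = (t_h * s_win + m_tot, t_w * s_win + m_tot)
    # Step 4: (top, left, bottom, right) boxes, row-major, stepping by the stride.
    boxes = [
        (r * s_win, c * s_win, r * s_win + tile, c * s_win + tile)
        for r in range(t_h)
        for c in range(t_w)
    ]
    return resized, (t_h, t_w), boxes


resized, grid, boxes = plan_tiles(height=910, width=1176)
# Step 5 would prepend one tile-sized global thumbnail to the list of crops.
num_crops = 1 + len(boxes)
```

With the defaults above, step 1 reproduces the 112-pixel margin and 266-pixel stride from Figure 1, and because the tile side equals stride plus margin, the last tile in each row and column ends exactly at the resized image border.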
