
arxiv: 2605.28820 · v1 · pith:72ZBC5UXnew · submitted 2026-05-27 · 💻 cs.CV

From Pixels to Words -- Towards Native One-Vision Models at Scale

Pith reviewed 2026-06-29 13:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords native vision-language model · one-vision architecture · end-to-end training · pixel-word correspondence · spatiotemporal modeling · multi-image understanding · video understanding · fine-grained visual perception

The pith

NEO-ov shows that a single native model, with no separate encoders, can learn pixel-word mappings end-to-end and close most of the gap to modular vision-language systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language models fragment pixel signals by stitching separate image encoders to language decoders through multi-stage alignment. NEO-ov replaces this with one unified architecture trained entirely end-to-end, allowing cross-frame and pixel-word correspondences to form inside the model without external components. The design produces fine-grained spatiotemporal modeling that emerges directly from the training process. On benchmarks the model narrows the performance gap to modular systems while excelling on tasks that require detailed visual perception. The work also supplies architectural analyses and training details to support further native multimodal models.

Core claim

NEO-ov is a native foundation model that learns cross-frame and pixel-word correspondence end-to-end without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, fine-grained and unified spatiotemporal modeling emerges natively inside the model, narrowing the gap to modular counterparts while excelling at fine-grained visual perception.

What carries the argument

The native one-vision architecture that integrates all processing in a single end-to-end trained network without separate encoders or module boundaries.
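
To make "no separate encoder" concrete, here is a minimal sketch. It is not NEO-ov's implementation: it assumes a ViT-style linear patch projection, and every name in it (`patchify`, `project`, `build_sequence`, `w_pixel`, `e_text`) and every size is hypothetical. Raw pixel patches from each frame are projected into the same embedding space as text tokens, and both are concatenated into one sequence that a single network would then process.

```python
import random

def patchify(frame, patch):
    """Split an H x W x C nested-list frame into flattened, non-overlapping patches."""
    h, w = len(frame), len(frame[0])
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [
                value
                for row in range(top, top + patch)
                for col in range(left, left + patch)
                for value in frame[row][col]
            ]
            patches.append(flat)
    return patches

def project(vec, weight):
    """Multiply a length-n vector by an n x d weight matrix."""
    return [sum(v * w_row[j] for v, w_row in zip(vec, weight))
            for j in range(len(weight[0]))]

def build_sequence(frames, text_ids, w_pixel, e_text, patch):
    """Assumed layout: all frame patches first, then text tokens, in one sequence."""
    seq = []
    for frame in frames:
        seq.extend(project(p, w_pixel) for p in patchify(frame, patch))
    seq.extend(e_text[i] for i in text_ids)
    return seq

# Illustrative sizes only; none of these come from the paper.
rng = random.Random(0)
patch, dim, vocab, channels = 4, 16, 100, 3
w_pixel = [[rng.gauss(0, 1) for _ in range(dim)]
           for _ in range(patch * patch * channels)]
e_text = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
frames = [[[[rng.random() for _ in range(channels)] for _ in range(8)]
           for _ in range(8)] for _ in range(2)]

seq = build_sequence(frames, [5, 17, 42], w_pixel, e_text, patch)
print(len(seq), len(seq[0]))  # 11 16
```

Two 8x8 frames with 4x4 patches give 8 pixel tokens, followed by 3 text tokens, all of width 16. In a modular system the pixel tokens would instead come from a separately trained image encoder plus an adapter; here the projection is just another set of weights in the same model.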

If this is right

  • Fine-grained spatiotemporal modeling develops inside the model from end-to-end training without explicit alignment stages.
  • Performance on vision-language tasks approaches that of modular systems at large scale.
  • The model excels on tasks requiring detailed visual perception.
  • Native architectures become viable for multi-image, video, and spatial-intelligence applications.
  • Architectural analyses and training recipes are released to guide future native multimodal work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A single-model design could reduce the engineering overhead of maintaining separate encoders and fusion modules.
  • Native cross-frame modeling may scale more naturally to longer video sequences than methods that rely on post-hoc fusion.
  • Direct pixel-word learning could preserve low-level visual details that modular pipelines tend to discard early.
  • The approach suggests that spatial-intelligence tasks may benefit from training that never breaks the image into an external representation.

Load-bearing premise

Removing all module boundaries and external encoders allows fine-grained spatiotemporal modeling and pixel-word correspondence to emerge natively from end-to-end training alone.

What would settle it

A benchmark result in which NEO-ov falls well behind modular models on fine-grained multi-frame perception tasks while using comparable compute would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.28820 by Bo Li, Dahua Lin, Haiwen Diao, HanMing Deng, Huchuan Lu, Jiahao Wang, Lei Yang, Lewei Lu, Linjun Dai, Mingxuan Li, Penghao Wu, Quan Wang, Silei Wu, Weichen Fan, Xuanyu Zheng, Yuanhan Zhang, Yue Zhu, Yuhao Dong, Yuwei Niu, Zhongang Cai, Ziwei Liu.

Figure 1: Overview of the NEO-ov model. Image or video inputs and text are encoded into token sequences via …
Figure 2: Overview of native rotary position embeddings and spatial-temporal attention. It unifies bidirectional …
Figure 3: Overview of three-stage training recipe. NEO-ov first aligns the Pre-Buffer with the post-LLM using …
Figure 5: Finetuned on SI data. [Plot: Avg. Accuracy (%); NEO-ov (2B), NEO-ov (9B); Stage 1, Stage 2]
Original abstract

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces NEO-ov, a native end-to-end vision-language foundation model that eliminates external image encoders, auxiliary adapters, post-hoc fusion, and the module boundaries between them. It claims to learn cross-frame and pixel-word correspondences directly from training, enabling unified spatiotemporal modeling for multi-image, video, and spatial tasks. The work asserts that NEO-ov largely narrows the performance gap to modular VLMs while excelling at fine-grained visual perception, and it provides systematic architectural analyses plus training recipes, with code and models released publicly.

Significance. If the empirical claims are substantiated by detailed results, this would be a meaningful contribution by demonstrating the viability of fully native one-vision architectures at scale and supplying practical guidance for future unified multimodal models. The public release of code and models strengthens potential impact and reproducibility.

major comments (2)
  1. Abstract: the central empirical claim that NEO-ov 'largely narrows the gap to modular counterparts' is stated without any metrics, baselines, ablation results, or benchmark tables, preventing assessment of whether the performance assertion is supported.
  2. Abstract: the assumption that fine-grained spatiotemporal modeling and pixel-word correspondence emerge natively solely from removing module boundaries and end-to-end training is presented as validated by results, but no experimental details, training dynamics, or comparisons are supplied to test this assumption, the weakest in the argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. The two major comments both concern the abstract's level of detail in supporting its claims. We address each below and note that revisions to the abstract are appropriate.

  1. Referee: Abstract: the central empirical claim that NEO-ov 'largely narrows the gap to modular counterparts' is stated without any metrics, baselines, ablation results, or benchmark tables, preventing assessment of whether the performance assertion is supported.

    Authors: We agree the abstract is concise and omits specific numbers. The full manuscript contains the requested details in the experimental sections and tables. To improve self-containment, we will revise the abstract to incorporate key quantitative results (e.g., performance deltas on multi-image and video benchmarks) while remaining within length limits.
    Revision: yes

  2. Referee: Abstract: the assumption that fine-grained spatiotemporal modeling and pixel-word correspondence emerge natively solely from removing module boundaries and end-to-end training is presented as validated by results, but no experimental details, training dynamics, or comparisons are supplied to test this weakest assumption.

    Authors: The abstract summarizes the hypothesis; the supporting evidence (architectural ablations, training curves, and cross-frame correspondence analyses) appears in the main body. We will revise the abstract to include a brief reference to these experiments so the claim is better grounded within the abstract itself.
    Revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation is empirical and self-contained


The provided abstract and context describe an empirical architecture (NEO-ov) trained end-to-end without external encoders, with performance claims presented as outcomes of that training rather than derived from fitted parameters or self-referential definitions. No equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the text. The central claim reduces to benchmark results from native training, which is externally falsifiable and does not reduce to its own inputs by construction. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities; all such elements are unknown from the given text.

pith-pipeline@v0.9.1-grok · 5792 in / 971 out tokens · 29411 ms · 2026-06-29T13:26:32.359394+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ERA: Entropy-Guided Visual Token Pruning with Rectified Attention for Efficient MLLMs

    cs.CV 2026-06 unverdicted novelty 5.0

    ERA proposes entropy-guided token pruning with bias-aware recycling and logit rectification to compress visual inputs in MLLMs while mitigating attention collapse.
