Scaling vision pre-training to 4k resolution

Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, et al · 2025 · arXiv 2503.19903

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

representative citing papers

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

cs.CV · 2025-07-08 · conditional · novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.

citing papers explorer

Showing 2 of 2 citing papers.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 35
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception cs.CV · 2025-11-25 · unverdicted · none · ref 23
CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.

Scaling vision pre-training to 4k resolution

fields

years

verdicts

representative citing papers

citing papers explorer