Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models cs.CV · 2026-05-18