{"total":99,"items":[{"citing_arxiv_id":"2606.23144","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri","primary_cat":"cs.CV","submitted_at":"2026-06-22T10:42:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01414","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent Skills Should Go Beyond Text: The Case for Visual Skills","primary_cat":"cs.CV","submitted_at":"2026-05-31T19:22:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00562","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepLatent: Think with Images via Parallel Latent Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-30T06:33:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark 
results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30244","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcement Learning with Robust Rubric Rewards","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:11:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RLR³ extends RLVR to criterion-level rubric verification via dual execution paths, minimal exposure masking, hierarchical aggregation, and saturation mitigation, delivering 4.7-point gains over base on 15 benchmarks with Qwen3-VL-30B-A3B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30170","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unveiling the Visual Counting Bottleneck in Vision-Language Models","primary_cat":"cs.MM","submitted_at":"2026-05-28T16:20:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs fail at visual counting extrapolation because they cannot project visual magnitudes onto symbolic tokens, despite intact perceptual representations, supporting a fractured magnitude hypothesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30126","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-28T15:57:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARCEL is a new visual tokenization architecture combining pool-anchored resampling with conditioned elastic queries to enhance 
performance-efficiency tradeoffs in LVLMs over prior matryoshka methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30027","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29662","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-28T09:23:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAFE-Pruner forecasts deep-layer token saliency in VLA models via semantic attention consistency and adaptive subtask detection to achieve up to 1.89x speedup with under 1.7% success rate loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29577","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning","primary_cat":"cs.CV","submitted_at":"2026-05-28T08:22:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Inverse 
dynamics prediction is added as an auxiliary task to reduce state aliasing in VLA models by directly supervising the vision encoder on action-relevant visual distinctions using only standard observation-action pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22812","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations","primary_cat":"cs.RO","submitted_at":"2026-05-21T17:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22098","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TextTeacher: What Can Language Teach About Images?","primary_cat":"cs.CV","submitted_at":"2026-05-21T07:36:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. 
on average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22089","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model","primary_cat":"cs.CV","submitted_at":"2026-05-21T07:31:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21300","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:29:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17954","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A More Word-like Image Tokenization for MLLMs","primary_cat":"cs.CV","submitted_at":"2026-05-18T07:09:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16713","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeoWorld-VLM: Geometry from World Models for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T23:52:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15735","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UAM: A Dual-Stream Perspective on Forgetting in VLA Training","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12369","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:38:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren 
Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631. [4] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. [5] Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami."},{"citing_arxiv_id":"2605.11567","ref_index":22,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Execution Commitment of Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:52:58+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[20] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. [21] Open X-Embodiment Collaboration, Abby O'Neill, et al. Open x-embodiment: Robotic learning datasets and rt-x models, 2023. arXiv preprint arXiv:2310.08864. [22] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 
[23] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ"},{"citing_arxiv_id":"2605.11564","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T05:49:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Carlo Sferrazza, Guanya Shi, Linda Shih, Jonathan Tseng, Zhen Wu, Lujie Yang, Brent Yi, and Yuanhang Zhang. Holosoma. URL https://github.com/amazon-far/holosoma. [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. [3] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 
[4] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox,"},{"citing_arxiv_id":"2605.11405","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone","primary_cat":"cs.LG","submitted_at":"2026-05-12T01:51:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16384","ref_index":89,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice","primary_cat":"cs.CV","submitted_at":"2026-05-11T10:51:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09719","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT","primary_cat":"cs.CV","submitted_at":"2026-05-10T19:38:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, 
yielding 8.7x lower latency with 54-72% performance retention on ScanNet and 3D-FRONT.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"uate the effectiveness of our 3D-specific distillation approach. First, we report results for the teacher model (LLaVA-3D-7B) [4], which provides an upper-bound performance. Second, we compare against lightweight 2D vision-language models of similar scale to demonstrate the benefit of 3D-specific distillation (Table I), including LLaVA-v1.5-7B (without 3D training), MobileVLM-2B [34], and PaliGemma-3B [35]. Finally, we include ablation variants of our model to analyze the contribution of individual components. These comparisons demonstrate that our 3D-specific distillation approach achieves better spatial reasoning than standard VLMs of similar size, validating the importance of 3D-aware training and knowledge transfer. B. Benchmarks Text Generation Performance: On a held-out test set of"},{"citing_arxiv_id":"2605.08560","ref_index":143,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZAYA1-VL-8B Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-08T23:41:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"4 77.3 63.2 71.0 87.3 89.2 Qwen3.5-4B 83.7 82.4 - - 81.1 85.3 80.4 82.3 56.9 76.6 56.8 74.2 84.8 84.2 InternVL3.5-4B 82.1 86.4 92.4 78 77.6 82.0 76.4 72.8 57.2 76.3 58.2 67.8 82.5 47.3 pipeline in achieving strong generalization while maintaining computational efficiency. 
Reproducibility notes. Results for all models are reproduced using VLMEvalKit [143], ensuring a consistent evaluation pipeline across models. For DocVQA and InfoVQA, we report scores from the original papers as these benchmarks require submission to an external evaluation server. We observe that some reproduced scores differ from those reported in the original works. For PixMo-Count, the official test set contains 540 examples, of which we were able to retrieve and evaluate"},{"citing_arxiv_id":"2605.08398","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Exploring and Exploiting Stability in Latent Flow Matching","primary_cat":"cs.LG","submitted_at":"2026-05-08T19:04:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07931","ref_index":4,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:04:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"future latent stream and the action trajectory under one model. • We report consistent gains over the π0 backbone on LIBERO, MetaWorld MT50, and a real Piper arm, supported by a per-frame bandwidth sweep and a latent-supervision ablation in the adaptation regime studied here. 
2 Related Work 2.1 Vision-Language-Action Models Driven by the rapid progress of Multimodal Large Language Models [4, 30, 42, 2, 1] and the emergence of large-scale robot datasets, Vision-Language-Action (VLA) models [7, 52, 27, 23, 49, 12, 6, 39, 35, 8, 51, 26, 16, 41, 37] have become a dominant paradigm in robot learning, fine-tuning multimodal LLMs to map language instructions and visual observations directly to actions. To better capture the multi-modal nature of robot actions, a line of work [14, 6, 23, 33] replaces deterministic"},{"citing_arxiv_id":"2605.07308","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T06:17:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"able at: https://sites.google.com/view/at-vla. 1. Introduction The development of Vision-Language-Action (VLA) models [6, 7, 9, 36, 39, 40, 45, 49] has greatly accelerated the progress toward generalist robotic agents. Empowered by large-scale manipulation datasets [10, 36, 41] and the emergence of foundation models [3, 38], VLAs demonstrate strong abilities that enable robots to ground language in perception and perform diverse tasks. 
However, when facing contact-rich manipulation scenarios that require precise understanding of physical interactions, these models remain limited, as they often overlook interaction feedback (e.g., tactile signals) that are essential for achieving intricate control"},{"citing_arxiv_id":"2605.08200","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits","primary_cat":"cs.AI","submitted_at":"2026-05-05T22:27:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness. 1 Introduction Vision-language models can answer richly compositional questions about images, yet routinely produce fluent mistakes: confident, well-formed answers that are not supported by the pixels they purport to describe [3, 18, 27]. For deployment in settings where errors carry cost (scientific image analysis, medical triage, robotic perception), we need reliability signals that are simultaneously predictive of correctness and mechanistically interpretable. 
This raises a sharp interpretability question: where, inside a VLM, is the information that distinguishes a correct answer from an incorrect"},{"citing_arxiv_id":"2605.03269","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025. [8] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. In Advances in Neural Information Processing Systems, 2022. [9] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 
[10] Raunaq Bhirangi, Venkatesh Pattabiraman, Enes Erciyes, Yifeng Cao, Tess Hellebrekers, and Lerrel Pinto."},{"citing_arxiv_id":"2605.00809","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Let ViT Speak: Generative Language-Image Pre-training","primary_cat":"cs.CV","submitted_at":"2026-05-01T17:51:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"encoder fundamentally determine the upper bound of an MLLM's visual understanding capability. As a result, large-scale Vision-Language Pre-training (VLP) on billions of image-text corpora have become the dominant approach for developing strong vision encoders. Contrastive learning based VLP methods, exemplified by CLIP [56] and SigLIP [78], are among the most widely adopted vision encoders in MLLMs [9, 62, 63]. These methods typically employ a dual-encoder This work was completed while Yan Fang and Mengcheng Lan were interns at ByteDance. Figure 1: Compared with prior vision-language pretraining methods that rely on complex two-tower designs, GenLIP adopts a substantially simpler architecture. 
In this figure, we use \"V\" and \"T\" to denote visual and textual inputs."},{"citing_arxiv_id":"2605.00321","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-01T01:00:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28192","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning","primary_cat":"cs.RO","submitted_at":"2026-04-30T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-vl technical report, 2025. [3] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 
[4] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair,"},{"citing_arxiv_id":"2604.27792","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MotuBrain: An Advanced World Action Model for Robot Control","primary_cat":"cs.RO","submitted_at":"2026-04-30T12:34:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new robots with 50-100 trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23272","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-25T12:28:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23121","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training","primary_cat":"cs.RO","submitted_at":"2026-04-25T03:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt 
guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"green block\" while evaluation reverses the concept order to \"stack green block on blue block\". Although the required stacking skill is unchanged, standard SFT repeats the training behavior, whereas DeLock follows the new instruction and succeeds. To examine the representation-level cause, we visualize vision-language cross-attention in the PaliGemma [61] backbone using instruction tokens as queries and image patches as keys (Figure 4(a)). Standard SFT shows a collapsed attention pattern, continuing to focus on the blue block regardless of the prompt. In contrast, DeLock exhibits a clear prompt-conditioned shift in attention between the blue and green blocks as their instructed roles are swapped, consistent with preserved visual grounding under low-data post-training."},{"citing_arxiv_id":"2604.22238","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-24T05:27:27+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22875","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SketchVLM: Vision language models can annotate images to explain thoughts and guide users","primary_cat":"cs.CV","submitted_at":"2026-04-23T22:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual 
reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Connect-the-Dots contains 100 images spanning three subsets: 21 randomly generated dot patterns, 30 connect-the-dots puzzles derived from silhouette SVGs, and 49 worksheet-style images collected from online sources. Models must locate each dot and connect them in order (Secs. B.1 and D.3 and Fig. 2). 2. Counting Objects contains 746 images drawn from CountBench [5, 34], and Pixmo-Count [14]. We include object counts from 0 to 10 and filter out unsuitable Pixmo-Count examples. Models must count the target objects and place numbered markers on each one (Secs. B.2 and D.4 and Fig. 5). 3. Drawing Shapes around Objects uses 1,000 images selected from the 5,000-image COCO validation set [25]. We choose images to balance object"},{"citing_arxiv_id":"2604.19728","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLA Foundry: A Unified Framework for Training Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:51:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We also include qualitative examples in Figure 3. Although in this instance we use a randomly initialized ViT and the in-house LLM, both could instead be replaced by off-the-shelf pre-trained components such as SigLIP [74] or DINO [47, 62] which would likely lead to improved model performance. 
Alternatively, the VLM itself can take advantage of pre-trained backbones such as PaliGemma2 [7] or Qwen3-VL [3]; this is precisely the route we take for Foundry-Qwen3VLA-2.1B-MT in Section 4.2. Here we show that VLA Foundry supports all stages of training and can produce a functional VLM backbone, giving full control to users to experiment with known training data and procedures, modify architectures, and train or fine-tune any part of the model."},{"citing_arxiv_id":"2604.19324","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PLaMo 2.1-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2026-04-21T10:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"PLaMo 2.1-VL reports 61.5 ROUGE-L on JA-VG-VQA-500, 85.2% on Japanese Ref-L4, 53.9% zero-shot factory accuracy, and raises anomaly detection F1 from 39.7 to 64.9 after fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18000","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-20T09:25:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[24] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. 
X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. [25] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. [26] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. [27] Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao,"},{"citing_arxiv_id":"2604.17880","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ST-$\\pi$: Structured SpatioTemporal VLA for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-20T06:48:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"tokens to preserve the decomposition capability learned previously. The model is then fine-tuned to generate executable action chunks conditioned on the 4D representation and the chunk-level action prompt, supervised by the flow-matching loss in Sec. 3.3. 5 Experiments 5.1 Experiment Setup Implementation Details. Our model adopts the pretrained weights of PaliGemma [3] from 𝜋0.5 [13] as the VLM backbone, DINOv2 [27] from VGGT [32] as the geometry encoder, and Gemma-300M [29] equipped with a structured spatiotemporal attention mechanism as the action expert. 
The model is trained on 8 NVIDIA RTX PRO 6000 GPUs, using the AdamW [25] optimizer. In training stage 1, we set 𝜆𝑠 = 1, 𝜆𝐿 = 𝜆𝜏 = 0, and the learning rate to 2e−5 with a batch"},{"citing_arxiv_id":"2604.17787","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-20T04:25:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots across LIBERO, CALVIN, and physical tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16079","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Amazing Stability of Flow Matching","primary_cat":"cs.CV","submitted_at":"2026-04-17T14:05:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16067","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-17T13:49:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AEGIS uses a 
pre-computed Gaussian anchor and layer-wise Gram-Schmidt orthogonal projections to isolate destructive gradients during VLA fine-tuning, preserving VQA performance without co-training or replay.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13803","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation","primary_cat":"cs.CV","submitted_at":"2026-04-15T12:38:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12148","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ViLL-E: Video LLM Embeddings for Retrieval","primary_cat":"cs.CV","submitted_at":"2026-04-13T23:54:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12012","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment","primary_cat":"cs.CV","submitted_at":"2026-04-13T20:00:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIPSv2 
improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11496","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference","primary_cat":"cs.CV","submitted_at":"2026-04-13T14:03:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10432","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement","primary_cat":"cs.RO","submitted_at":"2026-04-12T03:09:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09860","ref_index":4,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies","primary_cat":"cs.RO","submitted_at":"2026-04-10T19:42:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in 
task-generalist robotic policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09330","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis","primary_cat":"cs.RO","submitted_at":"2026-04-10T13:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"models [16, 50, 85] directly map visual observations to short-horizon actions through diffusion models or behavior cloning. On the other hand, a substantial line of research focuses on Vision-Language-Action (VLA) models [6, 10, 27, 28, 30-32, 35, 38, 39, 43, 48, 49, 51, 55, 58, 60, 63, 68, 73, 79, 81, 86] based on vision-language models (VLM) [4, 14, 15, 56, 74]. They operate in an iterative closed-loop manner: predicting 1∼2 seconds of ac- Figure 2. Architecture comparison of embodied models. (a) Vision-Language-Action (VLA) models iteratively predict and execute"}],"limit":50,"offset":0}