{"total":18,"items":[{"citing_arxiv_id":"2606.00656","ref_index":254,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Demystifying the Optimal Fair Classifier in Multi-Class Classification","primary_cat":"cs.LG","submitted_at":"2026-05-30T10:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Derives tractable optimal fair multi-class classifier and supplies in-processing and post-processing algorithms that converge to the accuracy-fairness Pareto frontier.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27582","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-26T18:52:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A zero-shot unified agent for VLN-CE, ObjectNav, EQA and Aerial-VLN on wheeled, quadruped, humanoid and UAV platforms that translates language and vision inputs into actions via MLLMs plus TDM and SCB mechanisms, matching trained foundation models on multiple benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17249","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language 
Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-17T04:12:56+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16080","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation","primary_cat":"cs.CV","submitted_at":"2026-05-15T15:43:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via contrastive image-text alignment to improve generalization on complex forgeries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12624","ref_index":77,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09053","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language 
Navigation","primary_cat":"cs.CV","submitted_at":"2026-05-09T16:56:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LCGNav improves online topological VLN-CE by converting local depth views to physically truncated 3D point clouds and applying selective dimension-preserving fusion, yielding consistent gains on R2R-CE and RxR-CE benchmarks with open code.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Spatial representation in VLN has been studied through both explicit maps and implicit memory. Explicit approaches, such as GridMM [11], CM2 [9], BEVBert [17], and OVL-MAP [18], construct top-down or hybrid spatial maps to model environment structure, while implicit approaches such as JanusVLN reduce mapping overhead by encoding history in latent memory [19]. Although dense metric maps provide rich spatial semantics, they often introduce higher computation and weaker transferability, whereas implicit methods may lack stable long-horizon structure. Online topological planning offers a practical middle ground, especially when panoramic RGB-D observations are available. 
However, existing topological methods still underuse this geometric infor-"},{"citing_arxiv_id":"2605.07931","ref_index":45,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:04:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[43] Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025. [44] Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, and Zheng Zhu. Vla-r1: Enhancing reasoning in vision-language-action models. arXiv preprint arXiv:2510.01623, 2025. [45] Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548, 2025. 12 [46] Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. 
Up-vla: A unified understanding and prediction model for embodied agent."},{"citing_arxiv_id":"2604.27620","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation","primary_cat":"cs.CV","submitted_at":"2026-04-30T09:09:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852 (2024). [76] Jenny Zhang, Samson Yu, Jiafei Duan, and Cheston Tan. 2023. Good time to ask: A learning framework for asking for help in embodied visual navigation. In 2023 20th International Conference on Ubiquitous Robots (UR). IEEE, 503-509. [77] Jenny Zhang, Samson Yu, Jiafei Duan, and Cheston Tan. 2023. Robustness of Utilizing Feedback in Embodied Visual Navigation. arXiv preprint arXiv:2303.15453 (2023). [78] Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. 2025. 
Mapnav: A novel memory representation via annotated semantic maps"},{"citing_arxiv_id":"2604.20358","ref_index":103,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval","primary_cat":"cs.CV","submitted_at":"2026-04-22T08:59:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via optimal transport, outperforming prior methods on FashionIQ and CIRR.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463-480. IEEE, 2015. 3 [102] Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548, 2025. [103] Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685, 2025. [104] Mengwei Xie, Shuang Zeng, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, and Xing Wei. 
Seqgrowgraph: Learning lane topology as a chain of graph expansions."},{"citing_arxiv_id":"2604.19536","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation","primary_cat":"cs.RO","submitted_at":"2026-04-21T14:55:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LiveVLN enables smoother vision-language navigation by overlapping action execution with ongoing observation processing, preserving benchmark scores while cutting real-world waiting time by up to 77.7 percent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19386","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval","primary_cat":"cs.CV","submitted_at":"2026-04-21T12:10:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Composed Image Retrieval (CIR) [71, 72] aims to retrieve a target image via a multimodal query, including a reference image and a modification text [73-75]. Research on CIR is expected to contribute to various applications, such as semantic understanding [76-84], and multimodal learning [85-92]. Recent methods typically utilize pre-trained models such as CLIP [93] and BLIP-2 [94] for feature alignment and composition [95, 96], achieving significant progress. 
However, the problem of Noisy Triplet Correspondence (NTC) [1], which is prevalent in real-world data, remains inadequately addressed. Unlike traditional Noisy Correspondence Learning (NCL) [57, 97-100], NTC in CIR involves semantic inconsistency within"},{"citing_arxiv_id":"2604.16298","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation","primary_cat":"cs.CV","submitted_at":"2026-04-17T17:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FineCog-Nav uses fine-grained cognitive modules driven by foundation models to outperform zero-shot baselines in UAV navigation and introduces the AerialVLN-Fine benchmark with refined instructions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vlm-rrt: Vision language model guided rrt search for autonomous uav navigation. 2025 International Conference on Unmanned Aircraft Systems (ICUAS), pages 633-640, 2025. 2 [41] Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target navigation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023. 2 [42] Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548, 2025. 
2 [43] Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng"},{"citing_arxiv_id":"2604.13453","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction","primary_cat":"cs.LG","submitted_at":"2026-04-15T04:05:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"FAST uses a Temporal-Spatial-Temporal structure with attention and Mamba modules plus learnable embeddings to achieve better accuracy on traffic prediction tasks than previous models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10096","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents","primary_cat":"cs.CV","submitted_at":"2026-04-11T08:33:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-language goals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04664","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration","primary_cat":"cs.RO","submitted_at":"2026-04-06T13:16:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ROSClaw is a hierarchical framework that unifies vision-language model control with 
e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-agent robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21058","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training","primary_cat":"cs.CR","submitted_at":"2026-03-22T04:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.05467","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation","primary_cat":"cs.CV","submitted_at":"2026-02-05T09:15:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17685","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous 
Driving","primary_cat":"cs.CV","submitted_at":"2025-05-23T09:55:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}