{"total":11,"items":[{"citing_arxiv_id":"2606.06294","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards One-to-Many Temporal Grounding","primary_cat":"cs.CV","submitted_at":"2026-06-04T15:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces OMTG benchmark with C-Acc and EtF1 metrics, a 56k dataset, and caption/temporal rewards, reaching 43.65% EtF1 SOTA on the new bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06121","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:30:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks QFSD and AgriInsect.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03485","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Languate Models","primary_cat":"cs.CV","submitted_at":"2026-05-05T08:20:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MHPR is a multidimensional benchmark for LVLM human-centric perception-reasoning with C-RD, SFT-D, RL-D, T-D data tiers and ACVG pipeline, showing training gains on Qwen2.5-VL-7B to near-parity with larger 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01324","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs","primary_cat":"cs.CV","submitted_at":"2026-05-02T08:41:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19218","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:24:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing interpretations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13602","ref_index":192,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy 
Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"First, Inherent Proxy Limitations prevent reward models from capturing holistic human preferences. Standard Bradley-Terry losses focus on pairwise comparisons and may ignore the absolute data distribution, rewarding superficial shortcut features [181]. If training data contains hidden noise or poisoning, the reward model may associate unnatural patterns with high scores [217]. Smaller-capacity RMs are particularly easy to bypass [192]. Furthermore, global scalar rewards fail to penalize localized distortions [182, 214]. Structural mismatches, such as using 2D proxies for 3D structures, create significant evaluation gaps [172, 210, 218]. Second, Optimization Amplification can guide the policy toward unintended outcomes. Theoretical analysis suggests that pure reward maximization inclines the policy"},{"citing_arxiv_id":"2604.08230","ref_index":117,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-09T13:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that organizes methods for cross-domain object detection into a taxonomy, analyzes domain shift across detection stages, and outlines persistent challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"interfaces (text or visual) that control domain or output set are rare [118]. It is unclear how prompts should interact with the detection pipeline (backbone, proposals, heads). 
Research questions include: How to design detection models that accept domain or task prompts and adjust behavior without fine-tuning [49, 118]? Can prompts modulate feature extraction, proposal scoring, or NMS [117]? How to evaluate prompt-based adaptation (same model, different prompts, multiple domains) [9, 18]? 9.7 Data-Centric Approaches Problem: CDOD remains largely model-centric, while data selection/synthesis and curation receive fragmented treatment. Data-centric approaches affect all stages via input distribution, preserving or improving proposal coverage and discriminativity"},{"citing_arxiv_id":"2604.04379","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-06T03:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024. 6 [49] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024. 2 [50] Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, and Jinqiao Wang. Vision-r1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013, 2025. 
1, 2 [51] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang,"},{"citing_arxiv_id":"2602.20913","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-02-24T13:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00748","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-07-01T13:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on average across seven other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.06958","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-04-09T15:09:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a 
video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}