LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
hub Mixed citations
Aria: An open multimodal native mixture-of- experts model
Mixed citation behavior. Most common role is background (50%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
AsyMoE adds hyperbolic geometry for cross-modal hierarchies and evidence-priority experts to address vision-language asymmetry in LVLMs, reporting 1.5% average gains and 25.45% fewer active parameters.
MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
citing papers explorer
No citing papers match the current filters.