hub Mixed citations

Aria: An open multimodal native mixture-of- experts model

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, Chongyan Zhu, Xiaoyi Ren, Chao Li, Yifan Ye, Peng Liu, Lihuan Zhang, Hanshu Yan, Guoyin Wang, Bei Chen, Junnan Li · 2025 · arXiv 2410.05993

Mixed citation behavior. Most common role is background (50%).

15 Pith papers citing it

Background 50% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 2 method 1

citation-polarity summary

background 3 baseline 2 use method 1

representative citing papers

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

cs.CV · 2026-02-24 · unverdicted · novelty 7.0

LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.

MMSearch-R1: Incentivizing LMMs to Search

cs.CV · 2025-06-25 · unverdicted · novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

cs.CV · 2025-01-23 · unverdicted · novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

AsyMoE adds hyperbolic geometry for cross-modal hierarchies and evidence-priority experts to address vision-language asymmetry in LVLMs, reporting 1.5% average gains and 25.45% fewer active parameters.

MobileMoE: Scaling On-Device Mixture of Experts

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

cs.CV · 2025-01-07 · conditional · novelty 6.0

Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

cs.CV · 2025-01-06 · unverdicted · novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

cs.CV · 2026-04-22 · unverdicted · novelty 5.0

A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

cs.CV · 2025-09-09 · unverdicted · novelty 5.0

Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

cs.CV · 2024-12-13 · accept · novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 4.0

LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.

Large Language Model-Brained GUI Agents: A Survey

cs.AI · 2024-11-27 · unverdicted · novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

citing papers explorer

Showing 15 of 15 citing papers.

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding cs.CV · 2026-02-24 · unverdicted · none · ref 20
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
MMSearch-R1: Incentivizing LMMs to Search cs.CV · 2025-06-25 · unverdicted · none · ref 33
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting search calls by over 30%.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos cs.CV · 2025-01-23 · unverdicted · none · ref 16
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models cs.CV · 2026-05-29 · unverdicted · none · ref 20
AsyMoE adds hyperbolic geometry for cross-modal hierarchies and evidence-priority experts to address vision-language asymmetry in LVLMs, reporting 1.5% average gains and 25.45% fewer active parameters.
MobileMoE: Scaling On-Device Mixture of Experts cs.LG · 2026-05-26 · unverdicted · none · ref 32
MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 69
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents cs.CV · 2026-04-21 · unverdicted · none · ref 26
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG cs.CV · 2026-04-07 · unverdicted · none · ref 82
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos cs.CV · 2025-01-07 · conditional · none · ref 49
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 19
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards cs.CV · 2026-04-22 · unverdicted · none · ref 18
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search cs.CV · 2025-09-09 · unverdicted · none · ref 16
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding cs.CV · 2024-12-13 · accept · none · ref 47
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence cs.CV · 2026-05-25 · unverdicted · none · ref 19
LLaVA-OV-2 uses codec-stream tokenization and a shared 3D RoPE to improve video, spatial, and tracking performance over Qwen3-VL-8B, while introducing the JumpScore benchmark for fine-grained motion localization.
Large Language Model-Brained GUI Agents: A Survey cs.AI · 2024-11-27 · unverdicted · none · ref 220
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Aria: An open multimodal native mixture-of- experts model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer