pith. sign in

arxiv: 2511.16672 · v4 · pith:RW52MPL5new · submitted 2025-11-20 · 💻 cs.CV

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

classification 💻 cs.CV
keywords evolmmmodelsmultimodalreasoningcontinuousdatafashionlarge
0
0 comments X
read the original abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains upto $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  2. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  3. Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

    cs.CV 2026-06 unverdicted novelty 6.0

    A self-evolving framework with proposer-solver-generator roles, Solver Token Entropy, and multi-scale internal evaluation improves unified LMMs on understanding and generation tasks using only self-derived consistency...

  4. Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

    cs.CV 2026-06 unverdicted novelty 6.0

    VISE is an unsupervised self-evolving method for LMMs that uses invariance rewards to improve visual conditioning, reporting gains on captioning and reduced hallucination across multiple models.

  5. EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.

  6. RISE: Reliable Improvement in Self-Evolving Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    RISE is a self-evolving framework for VLMs that adds fine-grained alternation, quality supervision, and dynamic balancing to produce reliable gains on seven benchmarks from unlabeled data.

  7. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 6.0

    Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

  8. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 5.0

    Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.