super hub Mixed citations

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team: Wenyi Hong, Guobing Gan, Guo Wang, Haomiao Tang, Wenmeng Yu, Xiaotao Gu · 2025 · cs.CV · arXiv 2507.01006

Mixed citation behavior. Most common role is background (57%).

145 Pith papers citing it

Background 57% of classified citations

open full Pith review browse 145 citing papers more from GLM-V Team: Wenyi Hong arXiv PDF

abstract

We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. A brief overview is available at https://z.ai/blog/glm-4.6v. Code, models and more information are released at https://github.com/zai-org/GLM-V.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 21 baseline 10 method 3 dataset 2 other 1

citation-polarity summary

background 21 baseline 10 use method 3 use dataset 2 unclear 1

claims ledger

abstract We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capabili

authors

GLM-V Team: Wenyi Hong Guobing Gan Guo Wang Haomiao Tang Wenmeng Yu Xiaotao Gu

co-cited works

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

Position Rebinding Cache Reuse: Replay-Free Visual Revisiting for Interleaved Multimodal Reasoning

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

PRCR enables replay-free visual revisiting in interleaved multimodal reasoning by storing raw visual KV caches with spatial coordinates and rebinding keys to position-compatible coordinates, matching replay performance while cutting computation by orders of magnitude.

Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

JudgeFit produces per-VLM physical video evaluation taxonomies that improve held-out accuracy by a mean 32% relative to a single global schema across 16 models from eight families.

CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays

cs.CV · 2026-06-19 · unverdicted · novelty 7.0

CheXpercept is a sequential multi-level perception benchmark showing VLMs perform adequately only on coarse lesion detection in chest X-rays while degrading sharply on finer tasks, with medical VLMs offering no advantage over general models.

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

cs.CV · 2026-06-17 · unverdicted · novelty 7.0 · 2 refs

PorTEXTO benchmark shows sharp real-world performance drops in pt-PT OCR and finds specialized multilingual data outperforms model size or resolution increases.

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

FORGE benchmark shows search-augmented LLMs recommend fake products at rates up to 27% from one polluted page and 73.8% from top-3 replacement across 12 models and 225 products.

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

NutriMLLM models fine-tuned on 1.1 million synthetic food image-nutrient triplets from population dietary recalls achieve near-complete coverage and competitive accuracy on real food images for comprehensive micronutrient estimation compared to proprietary MLLMs.

UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

UltraVR is a new diagnostic benchmark for evidence-grounded VQA on ultra-resolution images, with structured chain-of-thought annotations that localize failures in grounding, perception, and inference.

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

FindIt is the first comprehensive benchmark for evaluating generalist MLLMs on promptable object detection, referring expression detection, instance-level detection, and video detection with standardized parsable outputs.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

Towards Characterizing Scientific Image Utility and Upgradability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

cs.CV · 2026-06-01 · conditional · novelty 7.0

Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

ChartArena is a new benchmark dataset and evaluation protocol for chart parsing by MLLMs that covers numeric and diagrammatic charts in multiple languages and real-world visual conditions.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Mem-W: Latent Memory-Native GUI Agents cs.CL · 2026-05-10 · unverdicted · none · ref 3 · internal anchor
Mem-W embeds historical trajectories and working memory as compact latent tokens into GUI agents' continuous context via a trajectory-to-latent compressor, yielding up to +30 point gains on navigation benchmarks.
Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG cs.IR · 2026-04-30 · unverdicted · none · ref 12 · internal anchor
FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection cs.CV · 2026-05-05 · unverdicted · none · ref 15 · 3 links · internal anchor
VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer