hub Mixed citations

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang · 2025 · cs.CV · arXiv 2507.01006

Mixed citation behavior. Most common role is background (57%).

86 Pith papers citing it

Background 57% of classified citations

open full Pith review browse 86 citing papers arXiv PDF

abstract

We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. A brief overview is available at https://z.ai/blog/glm-4.6v. Code, models and more information are released at https://github.com/zai-org/GLM-V.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 21 baseline 10 method 3 dataset 2 other 1

citation-polarity summary

background 21 baseline 10 use method 3 use dataset 2 unclear 1

claims ledger

abstract We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capabili

co-cited works

representative citing papers

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

AgroTools is a new benchmark for tool-augmented multimodal agents in agriculture featuring 539 QA pairs, 1,097 images, five task families, and 14 tools, with evaluations showing major limitations in current models' tool planning and execution.

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

Mem-W: Latent Memory-Native GUI Agents

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

Mem-W embeds historical trajectories and working memory as compact latent tokens into GUI agents' continuous context via a trajectory-to-latent compressor, yielding up to +30 point gains on navigation benchmarks.

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.

SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.

SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frechet mean centering, holistic scene attention, and spherical geodesic pulling.

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.

MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

MolRecBench-Wild reveals that 18 existing OCSR models suffer severe performance drops on complex real-world academic molecular images compared with prior patent benchmarks.

Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

cs.IR · 2026-04-30 · unverdicted · novelty 7.0

FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.

OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

Towards Temporal Compositional Reasoning in Long-Form Sports Videos

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.

X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.

Hybrid Latent Reasoning with Decoupled Policy Optimization

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

citing papers explorer

Showing 50 of 86 citing papers.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark cs.CV · 2026-04-12 · unverdicted · none · ref 10 · 2 links · internal anchor
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing cs.CV · 2026-04-10 · accept · none · ref 20 · internal anchor
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems cs.CR · 2026-04-03 · unverdicted · none · ref 19 · internal anchor
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 137 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture cs.CV · 2026-05-21 · unverdicted · none · ref 19 · internal anchor
AgroTools is a new benchmark for tool-augmented multimodal agents in agriculture featuring 539 QA pairs, 1,097 images, five task families, and 14 tools, with evaluations showing major limitations in current models' tool planning and execution.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? cs.AI · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation cs.CV · 2026-05-16 · unverdicted · none · ref 37 · internal anchor
HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control cs.AI · 2026-05-15 · unverdicted · none · ref 14 · internal anchor
PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters cs.CV · 2026-05-12 · unverdicted · none · ref 76 · internal anchor
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
Mem-W: Latent Memory-Native GUI Agents cs.CL · 2026-05-10 · unverdicted · none · ref 3 · internal anchor
Mem-W embeds historical trajectories and working memory as compact latent tokens into GUI agents' continuous context via a trajectory-to-latent compressor, yielding up to +30 point gains on navigation benchmarks.
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning cs.CV · 2026-05-09 · unverdicted · none · ref 23 · internal anchor
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators cs.CL · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere cs.CV · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frechet mean centering, holistic scene attention, and spherical geodesic pulling.
RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI cs.RO · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition cs.AI · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
MolRecBench-Wild reveals that 18 existing OCSR models suffer severe performance drops on complex real-world academic molecular images compared with prior patent benchmarks.
Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG cs.IR · 2026-04-30 · unverdicted · none · ref 12 · internal anchor
FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents cs.CL · 2026-04-27 · unverdicted · none · ref 94 · internal anchor
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
Towards Temporal Compositional Reasoning in Long-Form Sports Videos cs.CV · 2026-04-24 · unverdicted · none · ref 30 · internal anchor
SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis cs.CV · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
Hybrid Latent Reasoning with Decoupled Policy Optimization cs.CV · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning cs.CV · 2026-04-18 · unverdicted · none · ref 18 · internal anchor
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror cs.AI · 2026-04-16 · unverdicted · none · ref 34 · internal anchor
MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 16 · internal anchor
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 40 · internal anchor
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL · 2026-04-10 · unverdicted · none · ref 49 · internal anchor
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CV · 2026-04-09 · unverdicted · none · ref 28 · internal anchor
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 70 · internal anchor
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions cs.CV · 2026-04-07 · unverdicted · none · ref 44 · internal anchor
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents cs.CL · 2026-04-07 · unverdicted · none · ref 28 · internal anchor
EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
Internalized Reasoning for Long-Context Visual Document Understanding cs.CV · 2026-03-31 · unverdicted · none · ref 58 · internal anchor
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence cs.CV · 2026-03-13 · unverdicted · none · ref 10 · internal anchor
VAEX-BENCH shows state-of-the-art MLLMs perform substantially worse on abstractive spatiotemporal reasoning tasks than on matched extractive tasks in video understanding.
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses cs.CV · 2026-02-26 · conditional · none · ref 18 · internal anchor
SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? cs.CV · 2026-02-04 · conditional · none · ref 6 · internal anchor
VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? cs.CV · 2026-02-03 · conditional · none · ref 2 · internal anchor
SpatiaLab benchmark shows state-of-the-art VLMs achieve 54.93% accuracy on multiple-choice spatial reasoning in real scenes versus 87.57% for humans.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL · 2026-02-02 · unverdicted · none · ref 88 · internal anchor
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space cs.CV · 2025-12-14 · unverdicted · none · ref 3 · internal anchor
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
Setting the Stage: Text-Driven Scene-Consistent Image Generation cs.CV · 2025-12-14 · conditional · none · ref 8 · internal anchor
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
The Art of Scaling Reinforcement Learning Compute for LLMs cs.LG · 2025-10-15 · unverdicted · none · ref 4 · internal anchor
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents cs.CR · 2025-10-11 · unverdicted · none · ref 38 · internal anchor
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 13 · internal anchor
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos cs.CV · 2026-05-18 · unverdicted · none · ref 12 · internal anchor
Artifact-Bench supplies a three-level artifact taxonomy and three evaluation tasks that show 19 MLLMs perform near or below random on AI-video realism detection and reasoning.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 100 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model cs.CR · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
MMGuard generates unlearnable multimodal examples via perturbations that exploit LVLM optimization shortcuts and disrupt cross-modal bindings, providing robust protection against unauthorized fine-tuning across threat models.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics cs.CV · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment cs.CV · 2026-05-08 · conditional · none · ref 55 · internal anchor
Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV · 2026-05-01 · unverdicted · none · ref 33 · 2 links · internal anchor
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
MAIC-UI: Making Interactive Courseware with Generative UI cs.CL · 2026-04-28 · unverdicted · none · ref 21 · internal anchor
MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluations showing usability gains and learning improvements.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning cs.LG · 2026-04-24 · unverdicted · none · ref 6 · internal anchor
SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards cs.CV · 2026-04-21 · unverdicted · none · ref 7 · internal anchor
AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer