arXiv: 2507.01006 · v6 · submitted 2025-07-01 · 💻 cs.CV · cs.AI · cs.LG

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi

Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Haochen Li, Jiale Zhu, Jiali Chen, Jiaxing Xu, Jiazheng Xu, Jing Chen, Jinghao Lin, Jinhao Chen, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Ruiliang Lyu, Shangqin Tu, Sheng Yang, Shengbiao Meng, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wei Jia, Wenkai Li, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyu Zhang, Xinyue Fan, Xuancheng Huang, Yadong Xue, Yanfeng Wang, Yanling Wang, Yanzi Wang, Yifan An, Yifan Du, Yiheng Huang, Yilin Niu, Yiming Shi, Yu Wang, Yuan Wang, Yuanchang Yue, Yuchen Li, Yusen Liu, Yutao Zhang, Yuting Wang, Yuxuan Zhang, Zhao Xue, Zhengxiao Du, Zhenyu Hou, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

Pith reviewed 2026-05-11 04:43 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords: vision-language models, multimodal reasoning, reinforcement learning, curriculum sampling, GLM-4.5V, open-source models, benchmark evaluation, GUI agents

The pith

Large-scale pre-training of a vision foundation model followed by reinforcement learning with curriculum sampling produces GLM-4.5V, which leads open-source models of similar size on nearly all of 42 public multimodal benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first builds a strong vision foundation model through extensive pre-training, which it treats as setting the performance ceiling. It then applies Reinforcement Learning with Curriculum Sampling (RLCS) to progressively train the model on harder examples drawn from a structured curriculum. This combination yields broad gains across STEM problem-solving, video understanding, coding, grounding, GUI agents, and long-document tasks. GLM-4.5V reaches state-of-the-art results among open-source models of comparable size and matches or exceeds closed-source systems such as Gemini-2.5-Flash on coding and agent benchmarks. The smaller GLM-4.1V-9B-Thinking variant even outperforms the much larger Qwen2.5-VL-72B on 29 of the evaluated benchmarks.

Core claim

The central claim is that a capable vision foundation model pre-trained at large scale can have its full potential realized through Reinforcement Learning with Curriculum Sampling, producing versatile multimodal reasoning that improves performance across a wide range of tasks without evident overfitting to specific benchmarks.

What carries the argument

Reinforcement Learning with Curriculum Sampling (RLCS), which samples training examples from a progressively harder curriculum to refine the pre-trained vision-language model for reasoning.
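The idea of drawing examples from a progressively harder curriculum can be illustrated with a small sketch. This is not the paper's algorithm: the schedule details are not given on this page, so everything below is an assumption made for illustration, including the function names, the difficulty scores in [0, 1], the linear target schedule, and the Gaussian-shaped weighting.

```python
import math
import random


def target_difficulty(step, total_steps, start=0.2, end=0.9):
    """Hypothetical linear schedule: the preferred difficulty rises from start to end."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + frac * (end - start)


def sampling_weights(difficulties, target, width=0.15):
    """Weight each example by closeness to the target difficulty, then normalize."""
    raw = [math.exp(-((d - target) ** 2) / (2 * width ** 2)) for d in difficulties]
    total = sum(raw)
    return [w / total for w in raw]


def sample_batch(difficulties, step, total_steps, batch_size, rng):
    """Draw example indices, favoring difficulties near the current target."""
    target = target_difficulty(step, total_steps)
    weights = sampling_weights(difficulties, target)
    return rng.choices(range(len(difficulties)), weights=weights, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed pool: 100 examples with difficulty evenly spread over [0, 1].
    pool = [i / 99 for i in range(100)]
    for step in (0, 500, 1000):
        batch = sample_batch(pool, step, 1000, 256, rng)
        mean_d = sum(pool[i] for i in batch) / len(batch)
        print(f"step={step:4d} mean sampled difficulty={mean_d:.2f}")
```

In this toy version the mean difficulty of sampled batches drifts upward as training proceeds. A real system would also need a way to estimate difficulty in the first place, which this sketch simply takes as given.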

If this is right

GLM-4.5V achieves state-of-the-art results among open-source models of similar size on nearly all of 42 public benchmarks spanning STEM, video, coding, GUI agents, and document understanding.
The 9B GLM-4.1V-Thinking variant surpasses the 72B Qwen2.5-VL on 29 benchmarks despite its smaller size.
The models demonstrate competitive or superior results to closed-source Gemini-2.5-Flash specifically on coding and GUI-agent tasks.
The GLM-4.6V series adds native tool use and a 128K context window.
Open-sourcing the 9B Thinking model and GLM-4.5V enables direct community inspection and further fine-tuning.
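Claims of the form "model A surpasses model B on N benchmarks" are head-to-head tallies. The sketch below shows one way such a tally could be computed, assuming higher scores are better on every benchmark. The function name, benchmark names, and scores are placeholders for illustration and are not taken from the paper.

```python
def head_to_head(scores_a, scores_b):
    """Count wins, ties, and losses for model A on benchmarks both models report."""
    shared = sorted(set(scores_a) & set(scores_b))
    wins = [name for name in shared if scores_a[name] > scores_b[name]]
    ties = [name for name in shared if scores_a[name] == scores_b[name]]
    losses = [name for name in shared if scores_a[name] < scores_b[name]]
    return {"shared": len(shared), "wins": wins, "ties": ties, "losses": losses}


# Placeholder numbers for illustration only; they are NOT results from the paper.
small_model = {"bench_a": 71.0, "bench_b": 55.5, "bench_c": 80.2}
large_model = {"bench_a": 69.4, "bench_b": 58.0, "bench_c": 80.2, "bench_d": 40.0}

result = head_to_head(small_model, large_model)
print(result["shared"], len(result["wins"]), len(result["ties"]), len(result["losses"]))
# prints: 3 1 1 1
```

A win count hides the size of each margin and depends on which benchmarks both models report, so it is best read alongside the per-benchmark scores.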

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Curriculum sampling during RL may prove more important than raw scale for avoiding overfitting in vision-language models.
The same pre-train-then-RLCS recipe could be tested on non-vision modalities to check whether the performance pattern generalizes.
If the method scales cleanly, future open models might routinely match or exceed closed models on agent-style tasks without requiring proprietary data.
The approach highlights a practical way to convert raw pre-training compute into measurable gains on long-horizon reasoning benchmarks.

Load-bearing premise

That the large-scale pre-trained vision foundation model already encodes a reliable upper bound on capability and that RLCS can unlock this bound without creating benchmark-specific overfitting or evaluation artifacts.

What would settle it

A new multimodal reasoning benchmark drawn from entirely unseen distributions would show whether GLM-4.5V maintains its reported performance edge or drops to levels comparable with prior open models.

Original abstract

We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. A brief overview is available at https://z.ai/blog/glm-4.6v. Code, models and more information are released at https://github.com/zai-org/GLM-V.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GLM-4.5V shows that curriculum RL on a strong base VLM can lift open models to competitive levels on reasoning tasks, with the 9B variant and code released for direct use.

The main point is that this paper demonstrates measurable gains from applying reinforcement learning with curriculum sampling (RLCS) to multimodal models, and the open release of GLM-4.1V-9B-Thinking and GLM-4.5V makes the claims testable. The authors start with a capable vision foundation model from large-scale pre-training, then use RLCS to improve performance across STEM, coding, GUI agents, video, and document tasks. The 9B model beats Qwen2.5-VL-72B on 29 benchmarks, while the larger GLM-4.5V is competitive with or better than closed models such as Gemini-2.5-Flash on challenging tasks including coding and GUI agents. That scale of open result is the practical contribution here.

The evaluation covers 42 public benchmarks, and the paper supplies training details plus ablations that address the usual questions about what actually moves the needle. Open-sourcing the checkpoints and GitHub repo is the part that stands out most, since it lets others run the models and check for contamination or overfitting directly. The soft spots are the standard ones for this kind of work: results still depend on the quality of the initial pre-training data and the exact curriculum schedule, and public benchmarks always carry some leakage risk even when the models are released. No internal contradictions or unverifiable scaling claims appear in the sections provided.

This paper is aimed at groups working on open VLMs or RL post-training for vision-language systems. Anyone evaluating new multimodal agents or looking for reproducible baselines will get value from the released artifacts. It has enough empirical substance and verification paths to warrant a serious referee rather than a desk reject.

Referee Report

1 major / 3 minor

Summary. The paper introduces GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models built from a large-scale pre-trained vision foundation model that is further improved via Reinforcement Learning with Curriculum Sampling (RLCS). It reports that GLM-4.5V achieves state-of-the-art results among open-source models of comparable size across 42 public benchmarks and is competitive with or superior to closed-source models such as Gemini-2.5-Flash on coding and GUI-agent tasks; the smaller GLM-4.1V-9B-Thinking outperforms the much larger Qwen2.5-VL-72B on 29 benchmarks. The work supplies training details, ablations, and benchmark tables, and open-sources the GLM-4.1V-9B-Thinking and GLM-4.5V checkpoints together with code.

Significance. If the empirical results hold, the paper demonstrates that scalable RL with curriculum sampling can substantially unlock multimodal reasoning potential in a strong vision foundation model, yielding open-source VLMs that rival or exceed larger open models and some closed systems on diverse tasks. The release of models, code, and detailed training information provides a valuable, reproducible baseline for the community.

major comments (1)

[Section 3] Section 3 (RLCS): the curriculum sampling schedule is described at a high level with free parameters listed in the method; the paper should report sensitivity analysis or default values used for the schedule, as these directly affect reproducibility of the claimed performance gains.
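A sensitivity analysis of the kind requested here could take the form of a one-at-a-time sweep around the default values. The sketch below is only an illustration of that procedure: the parameter names `start_difficulty` and `ramp_fraction` are hypothetical (the paper's actual schedule parameters are not listed on this page), and `toy_evaluate` is a made-up stand-in for a full train-and-evaluate run.

```python
def one_at_a_time_sweep(defaults, grid, evaluate):
    """Vary one parameter at a time around the defaults and record the score change."""
    baseline = evaluate(defaults)
    report = {}
    for name, values in grid.items():
        rows = []
        for value in values:
            # Copy the defaults and override a single parameter.
            config = dict(defaults, **{name: value})
            rows.append((value, evaluate(config) - baseline))
        report[name] = rows
    return baseline, report


def toy_evaluate(config):
    """Stand-in for a real training run; a made-up smooth function, not real data."""
    return (1.0
            - (config["start_difficulty"] - 0.3) ** 2
            - 0.5 * (config["ramp_fraction"] - 0.6) ** 2)


# Hypothetical defaults and grids, chosen only to exercise the sweep.
defaults = {"start_difficulty": 0.3, "ramp_fraction": 0.6}
grid = {"start_difficulty": [0.1, 0.3, 0.5], "ramp_fraction": [0.4, 0.6, 0.8]}

baseline, report = one_at_a_time_sweep(defaults, grid, toy_evaluate)
for name, rows in report.items():
    spread = max(delta for _, delta in rows) - min(delta for _, delta in rows)
    print(name, round(spread, 3))
```

One-at-a-time sweeps are cheap but miss interactions between parameters. A small grid or random search over the most sensitive ones would be the natural follow-up.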

minor comments (3)

[Table 1] Table 1 and benchmark tables: include error bars or standard deviations from multiple runs where available, and explicitly state the data splits or contamination checks performed for the 42 benchmarks.
[Abstract] Abstract and Section 1: the brief mention of GLM-4.6V (native tool use, 128K context) should be expanded with one sentence on how it differs from the 4.5V series to clarify the overall model family.
[Figures] Figure captions and training curves: ensure all axes are labeled with units and that the curves are referenced in the text when discussing ablation results.
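The [Table 1] comment asks for two things that are simple to compute once the raw data exist: spread across repeated runs, and a contamination check. The sketch below shows one common way to do each, assuming per-seed scores are available and that a word n-gram overlap is an acceptable text-only proxy for contamination. The function names, the default `n=8`, and the sample scores are assumptions for illustration, not values from the paper.

```python
import statistics


def mean_and_std(run_scores):
    """Mean and sample standard deviation over repeated runs (needs at least 2 runs)."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=8):
    """Fraction of the item's word n-grams that also occur in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # Item is shorter than n words; nothing to compare.
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Placeholder scores from three hypothetical seeds; not numbers from the paper.
mean, std = mean_and_std([70.0, 72.0, 71.0])
print(f"{mean:.1f} +/- {std:.1f}")
# prints: 71.0 +/- 1.0
```

A text overlap check of this kind says nothing about image-level leakage, so for multimodal benchmarks it would cover only the question and answer text.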

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. We address the single major comment below.

Referee: [Section 3] Section 3 (RLCS): the curriculum sampling schedule is described at a high level with free parameters listed in the method; the paper should report sensitivity analysis or default values used for the schedule, as these directly affect reproducibility of the claimed performance gains.

Authors: We agree that additional details on the curriculum sampling schedule parameters would improve reproducibility. In the revised manuscript we will explicitly list the default values employed for all free parameters in the RLCS formulation and include a concise sensitivity analysis on the most impactful hyperparameters, drawing from the ablation experiments already conducted during development.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims only

The paper presents an empirical pipeline: large-scale pre-training of a vision foundation model followed by RLCS training, with performance evaluated on 42 public benchmarks. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on benchmark results and open-sourced models rather than reducing to inputs by construction. This is the standard non-circular outcome for an applied ML report.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on standard assumptions of large-scale pretraining and RL optimization; no new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

free parameters (1)

Curriculum sampling schedule parameters
The RLCS method requires choices for how task difficulty increases over training; these are not quantified in the abstract but are necessary for the reported gains.

axioms (2)

Domain assumption: Large-scale pre-training produces a vision foundation model whose capabilities form an upper bound for subsequent RL fine-tuning.
Stated directly in the abstract as the first step before applying RLCS.
Domain assumption: Public multimodal benchmarks provide an unbiased measure of general reasoning capability.
The evaluation across 42 benchmarks is presented as comprehensive evidence of capability enhancement.

Forward citations

Cited by 58 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV 2026-04 accept novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
cs.CR 2026-04 unverdicted novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
cs.CV 2026-05 unverdicted novelty 7.0

Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
Mem-W: Latent Memory-Native GUI Agents
cs.CL 2026-05 unverdicted novelty 7.0

Mem-W embeds historical trajectories and working memory as compact latent tokens into GUI agents' continuous context via a trajectory-to-latent compressor, yielding up to +30 point gains on navigation benchmarks.
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators
cs.CL 2026-05 unverdicted novelty 7.0

SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere
cs.CV 2026-05 unverdicted novelty 7.0

SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frech...
RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
cs.RO 2026-05 unverdicted novelty 7.0

RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition
cs.AI 2026-05 unverdicted novelty 7.0

MolRecBench-Wild reveals that 18 existing OCSR models suffer severe performance drops on complex real-world academic molecular images compared with prior patent benchmarks.
VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
cs.CV 2026-05 unverdicted novelty 7.0

VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.
Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG
cs.IR 2026-04 unverdicted novelty 7.0

FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
cs.CL 2026-04 unverdicted novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
cs.CV 2026-04 unverdicted novelty 7.0

SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
cs.CV 2026-04 unverdicted novelty 7.0

X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
Hybrid Latent Reasoning with Decoupled Policy Optimization
cs.CV 2026-04 unverdicted novelty 7.0

HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror
cs.AI 2026-04 unverdicted novelty 7.0

MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
cs.AI 2026-04 unverdicted novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
cs.CL 2026-04 unverdicted novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
cs.CV 2026-04 unverdicted novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
cs.CV 2026-04 unverdicted novelty 7.0

DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
cs.CL 2026-04 unverdicted novelty 7.0

EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
Internalized Reasoning for Long-Context Visual Document Understanding
cs.CV 2026-03 unverdicted novelty 7.0

A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence
cs.CV 2026-03 unverdicted novelty 7.0

VAEX-BENCH shows state-of-the-art MLLMs perform substantially worse on abstractive spatiotemporal reasoning tasks than on matched extractive tasks in video understanding.
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
cs.CR 2026-05 unverdicted novelty 6.0

MMGuard generates unlearnable multimodal examples via perturbations that exploit LVLM optimization shortcuts and disrupt cross-modal bindings, providing robust protection against unauthorized fine-tuning across threat models.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
cs.CV 2026-05 unverdicted novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
cs.CV 2026-05 unverdicted novelty 6.0

VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
cs.CV 2026-05 conditional novelty 6.0

Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
cs.CV 2026-05 unverdicted novelty 6.0

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
MAIC-UI: Making Interactive Courseware with Generative UI
cs.CL 2026-04 unverdicted novelty 6.0

MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluat...
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
cs.CV 2026-04 unverdicted novelty 6.0

AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
cs.CL 2026-04 unverdicted novelty 6.0

SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
cs.CV 2026-04 unverdicted novelty 6.0

Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
UCCL-Zip: Lossless Compression Supercharged GPU Communication
cs.DC 2026-04 unverdicted novelty 6.0

UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.
Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions
cs.CV 2026-04 unverdicted novelty 6.0

GraG reconstructs dynamic 3D hand-object interactions from monocular video 6.4x faster than prior work by using compact Sum-of-Gaussians tracking initialized from large models and refined with 2D losses.
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
cs.AI 2026-04 unverdicted novelty 6.0

MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...
Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

SpreadsheetAgent uses incremental multi-format reading, structural sketching, and verification to raise spreadsheet benchmark accuracy from 35.27% to 38.16%.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
cs.CV 2026-04 unverdicted novelty 6.0

GameWorld is a new benchmark providing standardized interfaces, 34 games, 170 tasks, and verifiable outcome metrics to evaluate multimodal large language model agents in video game environments.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
cs.CV 2026-03 unverdicted novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
cs.CV 2026-05 unverdicted novelty 5.0

VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended...
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
cs.CV 2026-05 unverdicted novelty 5.0

VL-SAM-v3 augments open-world object detection with retrieval from a visual memory bank to generate instance-level spatial and class-aware contextual priors that improve performance on rare categories in zero-shot LVIS tests.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
cs.CV 2026-05 unverdicted novelty 5.0

VL-SAM-v3 improves open-world object detection on LVIS by retrieving visual prototypes from a memory bank to generate sparse spatial and dense contextual priors that are fused into detection prompts.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
cs.CV 2026-05 unverdicted novelty 5.0

PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
cs.CV 2026-04 unverdicted novelty 5.0

SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
cs.AI 2026-04 unverdicted novelty 5.0

DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
cs.AI 2026-04 unverdicted novelty 5.0

DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
cs.CL 2026-04 unverdicted novelty 5.0

MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choi...
An Empirical Study of Multi-Agent Collaboration for Automated Research
cs.MA 2026-03 unverdicted novelty 5.0

Subagent architectures deliver stable high-throughput optimization under tight time limits while agent teams enable deeper refactoring at the cost of higher fragility.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 53 Pith papers · 18 internal anchors

[1]

Flame-code-vlm. https://github.com/Flame-Code-VLM/Flame-Code-VLM

work page
[2]

Geobench. https://github.com/ccmdi/geobench

work page
[3]

A. Awadalla, L. Xue, O. Lo, M. Shu, H. Lee, E. Guha, S. Shen, M. Awadalla, S. Savarese, C. Xiong, et al. Mint-1t: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens. Advances in Neural Information Processing Systems, 37:36805–36828, 2024

work page 2024
[4]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023

work page arXiv 2023
[6]

J. Chen, F. Wei, J. Zhao, S. Song, B. Wu, Z. Peng, S.-H. G. Chan, and H. Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 513–524, 2025

work page 2025
[7]

L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024

work page internal anchor Pith review arXiv 2024
[8]

A. Fang, A. M. Jose, A. Jain, L. Schmidt, A. Toshev, and V. Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023

work page arXiv 2023
[9]

E. Fini, M. Shukor, X. Li, P. Dufter, M. Klein, D. Haldimann, S. Aitharaju, V. G. T. da Costa, L. Béthune, Z. Gan, et al. Multimodal autoregressive pre-training of large vision encoders. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9641–9654, 2025

work page 2025
[10]

C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv:2405.21075, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin, D. Roth, N. A. Smith, W.-C. Ma, and R. Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024

work page arXiv 2024
[12]

S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36:27092–27112, 2023

work page 2023
[13]

T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review arXiv 2024
[14]

J. Gu, X. Meng, G. Lu, L. Hou, N. Minzhe, X. Liang, L. Yao, R. Huang, W. Zhang, X. Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022

work page 2022
[15]

T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14375–14385, 2024

work page 2024
[16]

D. Guo, F. Wu, F. Zhu, F. Leng, G. Shi, H. Chen, H. Fan, J. Wang, J. Jiang, J. Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review arXiv 2025
[17]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024

work page arXiv 2024
[19]

W. Hong*, Y. Cheng*, Z. Yang*, W. Wang, L. Wang, X. Gu, S. Huang, Y. Dong, and J. Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models, 2024

work page 2024
[20]

W. Hong, W. Wang, M. Ding, W. Yu, Q. Lv, Y. Wang, Y. Cheng, S. Huang, J. Ji, Z. Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024

work page arXiv 2024
[21]

W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

work page 2024
[22]

K. Hu, P. Wu, F. Pu, W. Xiao, Y. Zhang, X. Yue, B. Li, and Z. Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. 2025

work page 2025
[23]

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

M. Jia, Z. Qi, S. Zhang, W. Zhang, X. Yu, J. He, H. Wang, and L. Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135, 2025

work page arXiv 2025
[25]

S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

work page 2014
[26]

A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016

work page 2016
[27]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023
[28]

K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo, L. Wang, and Y. Qiao. MVBench: A comprehensive multi-modal video understanding benchmark, 2023

work page 2023
[29]

Q. Li, Z. Chen, W. Wang, W. Wang, S. Ye, Z. Jin, G. Chen, Y. He, Z. Gao, E. Cui, et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024

work page arXiv 2024
[30]

Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023

work page internal anchor Pith review arXiv 2023
[31]

Y. Liu, Z. Li, M. Huang, B. Yang, W. Yu, C. Li, X.-C. Yin, C.-L. Liu, L. Jin, and X. Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12), Dec. 2024

work page 2024
[32]

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Y. Ma, L. Du, X. Shen, S. Chen, P. Li, Q. Ren, L. Ma, Y. Dai, P. Liu, and J. Yan. One rl to see them all: Visual triple unified reinforcement learning, 2025

work page 2025
[34]

Y. Ma, Y. Zang, L. Chen, M. Chen, Y. Jiao, X. Li, X. Lu, Z. Liu, Y. Ma, X. Dong, P. Zhang, L. Pan, Y.-G. Jiang, J. Wang, Y. Cao, and A. Sun. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations, 2024

work page 2024
[35]

A. Masry, M. S. Islam, M. Ahmed, A. Bajaj, F. Kabir, A. Kartha, M. T. R. Laskar, M. Rahman, S. Rahman, M. Shahmohammadi, M. Thakkar, M. R. Parvez, E. Hoque, and S. Joty. Chartqapro: A more diverse and challenging benchmark for chart question answering, 2025

work page 2025
[36]

OpenAI. Gpt-4o. 2024

work page 2024
[37]

R. Qiao, Q. Tan, G. Dong, M. Wu, C. Sun, X. Song, Z. GongQue, S. Lei, Z. Wei, M. Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

work page arXiv 2024
[38]

C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. Bishop, W. Li, F. Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv:2405.14573, 2024

work page internal anchor Pith review arXiv 2024
[39]

J. Roberts, M. R. Taesiri, A. Sharma, A. Gupta, S. Roberts, I. Croitoru, S.-V. Bogolin, J. Tang, F. Langer, V. Raina, et al. Zerobench: An impossible visual benchmark for contemporary large multimodal models. arXiv preprint arXiv:2502.09696, 2025

work page arXiv 2025
[40]

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022

work page 2022
[41]

R. Shao, S. S. Li, R. Xin, S. Geng, Y. Wang, S. Oh, S. S. Du, N. Lambert, S. Min, R. Krishna, et al. Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947, 2025

work page arXiv 2025
[42]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

C. Si, Y. Zhang, Z. Yang, R. Liu, and D. Yang. Design2code: How far are we from automating front-end engineering?, 2024. URL https://arxiv.org/abs/2403.03163

work page 2024
[44]

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[45]

L. Tang, G. Kim, X. Zhao, T. Lake, W. Ding, F. Yin, P. Singhal, M. Wadhwa, Z. L. Liu, Z. Sprague, et al. Chartmuseum: Testing visual reasoning capabilities of large vision-language models. arXiv preprint arXiv:2505.13444, 2025

work page arXiv 2025
[46]

C. Team, Z. Yue, Z. Lin, Y. Song, W. Wang, S. Ren, S. Gu, S. Li, P. Li, L. Zhao, L. Li, K. Bao, H. Tian, H. Zhang, G. Wang, D. Zhu, Cici, C. He, B. Ye, B. Shen, Z. Zhang, Z. Jiang, Z. Zheng, Z. Song, Z. Luo, Y. Yu, Y. Wang, Y. Tian, Y. Tu, Y. Yan, Y. Huang, X. Wang, X. Xu, X. Song, X. Zhang, X. Yong, X. Zhang, X. Deng, W. Yang, W. Ma, W. Lv, W. Zhu...

work page 2025
[47]

G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E...

work page 2023
[48]

G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review arXiv 2025
[50]

K. Team, A. Du, B. Yin, B. Xing, B. Qu, B. Wang, C. Chen, C. Zhang, C. Du, C. Wei, C. Wang, D. Zhang, D. Du, D. Wang, E. Yuan, E. Lu, F. Li, F. Sung, G. Wei, G. Lai, H. Zhu, H. Ding, H. Hu, H. Yang, H. Zhang, H. Wu, H. Yao, H. Lu, H. Wang, H. Gao, H. Zheng, J. Li, J. Su, J. Wang, J. Deng, J. Qiu, J. Xie, J. Wang, J. Liu, J. Yan, K. Ouyang, L. Chen, L. Sui...

work page 2025
[51]

S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, A. Wang, R. Fergus, Y. LeCun, and S. Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024

work page 2024
[52]

B. Wang, B. Wang, C. Wan, G. Huang, H. Hu, H. Jia, H. Nie, M. Li, N. Chen, S. Chen, et al. Step-3 is large yet affordable: Model-system co-design for cost-effective decoding. arXiv preprint arXiv:2507.19427, 2025

work page arXiv 2025
[53]

F. Wang, X. Fu, J. Y. Huang, Z. Li, Q. Liu, X. Liu, M. D. Ma, N. Xu, W. Zhou, K. Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024

work page arXiv 2024
[54]

H. Wang, X. Li, Z. Huang, A. Wang, J. Wang, T. Zhang, J. Zheng, S. Bai, Z. Kang, J. Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025

work page arXiv 2025
[55]

K. Wang, J. Pan, W. Shi, Z. Lu, M. Zhan, and H. Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv:2402.14804, 2024

work page arXiv 2024
[56]

M. Wang, S. Sunkara, G. Baechler, J. Lin, Y. Zhu, F. Zubach, L. Shu, and J. Chen. Webquest: A benchmark for multimodal qa on web page sequences, 2024

work page 2024
[57]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, S. Huang, B. Xu, Y. Dong, M. Ding, et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024

work page arXiv 2024
[59]

W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023

work page arXiv 2023
[60]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Proc. of Neural Information Processing Systems, 2022

work page 2022
[61]

Y. Xiao, E. Sun, T. Liu, and W. Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts, 2024

work page 2024
[62]

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, J. H. Toh, Z. Cheng, D. Shin, F. Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2025

work page 2025
[63]

H. Xu, S. Xie, X. E. Tan, P.-Y. Huang, R. Howes, V. Sharma, S.-W. Li, G. Ghosh, L. Zettlemoyer, and C. Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023

work page arXiv 2023
[64]

Y. Xu, H. Dong, L. Wang, D. Sahoo, J. Li, and C. Xiong. Scalable chain of thoughts via elastic reasoning. arXiv preprint arXiv:2505.05315, 2025

work page arXiv 2025
[65]

C.-H. Yeh, C. Wang, S. Tong, T.-Y. Cheng, R. Wang, T. Chu, Y. Zhai, Y. Chen, S. Gao, and Y. Ma. Seeing from another perspective: Evaluating multi-view understanding in mllms. arXiv preprint arXiv:2504.15280, 2025

work page arXiv 2025
[66]

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proc. of Computer Vision and Pattern Recognition, 2024

work page 2024
[68]

X. Yue, T. Zheng, Y. Ni, Y. Wang, K. Zhang, S. Tong, Y. Sun, B. Yu, G. Zhang, H. Sun, Y. Su, W. Chen, and G. Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024

work page internal anchor Pith review arXiv 2024
[69]

H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao. Glipv2: Unifying localization and vision-language understanding. Proc. of Neural Information Processing Systems, 35:36067–36080, 2022

work page 2022
[70]

R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y. Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024

work page 2024
[71]

Y. Zhao, L. Xie, H. Zhang, G. Gan, Y. Long, Z. Hu, T. Hu, W. Chen, C. Li, J. Song, Z. Xu, C. Wang, W. Pan, Z. Shangguan, X. Tang, Z. Liang, Y. Liu, C. Zhao, and A. Cohan. Mmvu: Measuring expert-level multi-discipline video understanding, 2025

work page 2025
[72]

J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[73]

W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, and Y. Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36:8958–8974, 2023

work page 2023
[74]

C. Zou, X. Guo, R. Yang, J. Zhang, B. Hu, and H. Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.

A Qualitative Examples

In this section, we demonstrate the capabilities and advantages of our model through various qualitative examples.

A.1 UI ...

work page arXiv 2024
[75]

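capillary tubes." When the two cups holding differently colored liquids on the left and right are connected to the empty middle cup by paper towels, the paper towels use their absorbency and the capillary action between their fibers to carry the red liquid on the left and the blue liquid on the right through the fiber gaps into the empty middle cup. As this transfer continues, the empty middle cup gradually fills with liquid, which produces the "water in the middle cup

Identify the experimental setup: three glasses, with colored liquid in the left and right ones and the middle one empty, connected in between by paper towels. 2. Determine the physical principle: capillarity (or capillary action); the porosity of the paper towel lets it absorb liquid and carry it through the channels between its fibers. 3. Explain the process: the paper towel absorbs the red liquid on the left and the blue liquid on the right and carries them by capillary action into the empty middle cup, so there is water in the middle cup (in fact, the two liquids flow into the middle cup through the paper towel's capillary action). So the answer should state that this is capillarity (or capillary action), and that the paper towel's ability to absorb and conduct water is why the middle cup has water.</think><answer>This phenomenon is caused by **capillarity (or capillary action)**. Specifically: a paper towel is a porous material, and the gaps between its fibers act like "capillary tubes." When the two cups holding differently colored liquids on the left and right are connected to the empty middle cup by paper towels, the paper towels use their absorbency and the capillary action between their fibers to carry the red liquid on the left and the blue liquid on the right through the fiber gaps into the empty middle cup...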

work page
[76]

Meeting" event - October 9th has a

The calendar displays dates from October 1-31, with days of the week labeled as S, M, T, W, T, F, S at the top. 3. Today appears to be October 15th, as indicated by the blue highlighted circle around the number 15 in the fourth row. 4. The task requires creating an event in two weeks from today (October 15th) which would be October 29th. 5. Looking at the...

work page
[77]

GLM-4.1V-9B-Thinking sets a new state-of-the-art across 23 out of 28 benchmarks among models under 10B parameters, including MiMo-VL [46], Kimi-VL [50], InternVL3 [72], and Qwen-VL series [4].

Task | Benchmark | GLM-4.1V-9B-Thinking | Qwen2.5-VL 7B | InternVL3 9B | Kimi-VL A3B-Thinking | MiMo-VL 7B-RL | Qwen2.5-VL 72B | GPT-4o 2024-11-20
General VQA | MMBench-V1.1-EN | 85.8 | 82.7 | 8...

work page