OProver: A Unified Framework for Agentic Formal Theorem Proving

David Ma; Enduo Zhao; Gavin Cheung; Jiaheng Liu; Jiajun Shi; Kaijing Ma; Shawn Guo; Yunfeng Shi; Zhaoxiang Zhang; Zili Wang

arxiv: 2605.17283 · v1 · pith:ZFIIU7OKnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

OProver: A Unified Framework for Agentic Formal Theorem Proving

David Ma , Kaijing Ma , Shawn Guo , Yunfeng Shi , Enduo Zhao , Jiajun Shi , Zhaoxiang Zhang , Gavin Cheung

show 2 more authors

Jiaheng Liu Zili Wang

This is my paper

Pith reviewed 2026-05-20 14:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords formal theorem provingLean 4agentic provingproof repairiterative post-trainingOProofs datasetwhole-proof generationretrieval-augmented proving

0 comments

The pith

OProver embeds agentic proof revision into the training loop itself, not just inference, for Lean 4 theorem proving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OProver as a framework that folds agentic proving into model training for formal theorem proving. It runs cycles of proof generation, retrieval of verified proofs, repair of failures using compiler feedback, and then feeds the resulting trajectories back as supervised data while routing hard cases to reinforcement learning. A 32B model trained this way records the highest Pass@32 scores on MiniF2F, ProverBench, and PutnamBench among open-weight whole-proof systems and places second on two others. Readers would care because the approach turns the prover's own mistakes and successes into a growing, self-improving data source rather than depending only on static external corpora.

Core claim

OProver is a unified Lean 4 framework in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback; it is trained via continued pretraining followed by iterative post-training that indexes new verified proofs into OProofs and retrieval memory, converts repair trajectories into SFT data, and routes unresolved cases to RL, yielding a 32B model with top Pass@32 results on three of five benchmarks and more leading placements than prior open-weight provers.

What carries the argument

The iterative post-training loop that runs agentic proving, indexes newly verified proofs, uses repair trajectories as SFT data, and unresolved hard cases for RL, supported by the OProofs dataset of 1.77M statements and 6.86M compiler-verified proofs with serialized trajectories.

If this is right

Repair trajectories extracted from the model's own agentic attempts supply higher-quality training signals than static proof corpora alone.
Self-generated verified proofs can be indexed and retrieved to bootstrap further capability without external human annotations.
Unresolved hard cases identified during agentic runs can be routed to RL to target remaining weaknesses directly.
A single framework can unify proof generation, verification, retrieval, and retraining into repeated closed loops.
Open-weight models can achieve leading placements across both introductory and olympiad-level benchmarks through this process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-plus-repair loop could be adapted to other formal languages if the compiler feedback and indexing components are ported.
Over multiple iterations the method may reduce reliance on any fixed external dataset by growing its own verified proof memory.
Measuring performance on problems whose statements lie far outside the distribution of OProofs would test whether gains reflect improved reasoning rather than memorization.
Combining the loop with even larger base models might extend strong results to harder open problems that current runs leave unresolved.

Load-bearing premise

The iterative post-training loop that indexes newly verified proofs and uses repair trajectories as SFT data plus unresolved cases for RL produces genuine capability gains rather than benchmark-specific overfitting or distribution shift from synthetic data.

What would settle it

Test the final 32B model on a fresh collection of theorems drawn from sources never appearing in OProofs construction or any post-training iteration and measure whether Pass@32 remains at the reported levels.

Figures

Figures reproduced from arXiv: 2605.17283 by David Ma, Enduo Zhao, Gavin Cheung, Jiaheng Liu, Jiajun Shi, Kaijing Ma, Shawn Guo, Yunfeng Shi, Zhaoxiang Zhang, Zili Wang.

**Figure 2.** Figure 2: Overview of OProver. The framework has three components: [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: OProofs Lean 4 corpus statistics. (a) Headline counts of statements, proofs, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Test-time scaling of OProver under a fixed total budget [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Performance under fixed total budget 𝐵 as a function of refinement depth 𝑅. Each curve plots Pass(𝑅, 𝐵/𝑅), isolating the trade-off between deeper interaction and wider sampling. On most benchmarks performance keeps improving as 𝑅 grows up to 𝑅 = 16, while on the hardest benchmark (PutnamBench) the optimum settles around 𝑅 = 8, indicating that very narrow sampling can hurt when single-chain success probabil… view at source ↗

**Figure 6.** Figure 6: MiniF2F-Test Pass@32 across post-training iterations. Both OProver-8B and OProver-32B improve monotonically. To examine whether the iterative co-evolution loop between OProver and OProofs (§2.3) yields measurable improvements at each step, we evaluate Pass@32 on MiniF2F-Test for every intermediate checkpoint [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%) more top placements than any prior open-weight whole-proof prover.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OProver folds agentic repair and retrieval into the training loop itself and posts the best or second-best Pass@32 numbers on five Lean benchmarks, but the evidence that the loop creates real capability gains rather than tighter synthetic fit is still thin.

read the letter

The main thing to know is that this paper moves agentic proving from inference-time trick to part of the training process. They run continued pretraining then iterative post-training: the model attempts proofs with retrieved verified examples, gets Lean compiler feedback, repairs failures, and the successful repair trajectories become SFT data while the still-unresolved cases feed RL. Verified proofs get indexed back into OProofs for the next round. The resulting 32B model leads on MiniF2F, ProverBench, and PutnamBench and comes in second on the other two, which is more top placements than earlier open-weight whole-proof systems.

Referee Report

3 major / 2 minor

Summary. The paper presents OProver, a unified framework for agentic formal theorem proving in Lean 4. It integrates agentic proving into training via continued pretraining followed by iterative post-training: each iteration performs agentic proving on statements, indexes newly verified proofs into the OProofs dataset and retrieval memory, uses repair trajectories as SFT data, and unresolved cases for RL. OProofs is constructed from public Lean resources, large-scale proof synthesis, and agentic traces, containing 1.77M statements and 6.86M compiler-verified proofs. The OProver-32B model reports the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), ranking second on MathOlympiad (22.8%) and ProofNet (33.2%), claiming more top placements than prior open-weight whole-proof provers.

Significance. If the central performance claims hold after addressing data and ablation issues, the work would be significant for formal theorem proving by moving agentic revision and retrieval from inference-only to a closed training loop. The explicit use of compiler-verified proofs and serialized trajectories in OProofs provides a reproducible data resource. The benchmark results, if shown to be robust, would represent a practical advance over prior open-weight provers that separate training from agentic search.

major comments (3)

[§4, Table 1] §4 (Experiments), Table 1: Pass@32 scores are reported as single point estimates (e.g., 93.3% on MiniF2F) with no standard deviations, multiple random seeds, or statistical tests against baselines; this leaves the claim of superior performance only partially supported and vulnerable to run-to-run variance.
[§3.3] §3.3 (OProofs Construction): The description of dataset sources (public Lean resources plus large-scale synthesis and agentic traces) does not state or verify that the five evaluation benchmarks were excluded from all synthesis pipelines or retrieval indices, creating a concrete risk that measured gains reflect reduced distribution shift rather than new reasoning capability.
[§3.2] §3.2 (Iterative Post-Training Loop): No ablation is presented that compares the full iterative loop (repair SFT + hard-case RL + retrieval update) against a baseline that simply scales SFT data volume without the agentic repair/RL cycle; without this isolation, it is unclear whether the reported gains are load-bearingly due to the unified framework or to data scale alone.

minor comments (2)

[Abstract] Abstract: The claim of 'more top placements than any prior open-weight whole-proof prover' would be clearer if the abstract briefly listed the exact prior systems and their placements for direct comparison.
[§4.1] §4.1: The baseline descriptions omit whether any prior systems were also given access to the same retrieval memory or compiler feedback during their reported evaluations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results and methodology without overstating what the current experiments demonstrate.

read point-by-point responses

Referee: [§4, Table 1] §4 (Experiments), Table 1: Pass@32 scores are reported as single point estimates (e.g., 93.3% on MiniF2F) with no standard deviations, multiple random seeds, or statistical tests against baselines; this leaves the claim of superior performance only partially supported and vulnerable to run-to-run variance.

Authors: We agree that single-point estimates limit the robustness of the performance claims. In the revised manuscript we will rerun the key Pass@32 evaluations on MiniF2F, ProverBench, and PutnamBench using at least three independent random seeds, report means together with standard deviations, and add pairwise statistical comparisons (e.g., McNemar or bootstrap tests) against the strongest published baselines. These additions will be placed in an expanded §4 and Table 1. revision: yes
Referee: [§3.3] §3.3 (OProofs Construction): The description of dataset sources (public Lean resources plus large-scale synthesis and agentic traces) does not state or verify that the five evaluation benchmarks were excluded from all synthesis pipelines or retrieval indices, creating a concrete risk that measured gains reflect reduced distribution shift rather than new reasoning capability.

Authors: We appreciate the referee’s emphasis on this reproducibility concern. All five evaluation benchmarks were in fact held out from every stage of OProofs construction: public Lean resources were filtered to exclude benchmark statements, synthesis prompts were generated only from non-benchmark theorems, and retrieval indices contained solely compiler-verified proofs from the training distribution. We will revise §3.3 to include an explicit paragraph describing the exclusion protocol and the verification steps performed, thereby removing any ambiguity about distribution shift. revision: yes
Referee: [§3.2] §3.2 (Iterative Post-Training Loop): No ablation is presented that compares the full iterative loop (repair SFT + hard-case RL + retrieval update) against a baseline that simply scales SFT data volume without the agentic repair/RL cycle; without this isolation, it is unclear whether the reported gains are load-bearingly due to the unified framework or to data scale alone.

Authors: We concur that a controlled ablation isolating the iterative agentic components from raw data scaling would be ideal. Training an additional 32B model on an SFT corpus matched in volume but stripped of repair trajectories, RL on unresolved cases, and dynamic retrieval updates is computationally prohibitive within our current budget. Instead, we will add a new subsection in §3.2 that quantifies incremental gains observed after each iteration of the loop and contrasts these trajectories against prior large-scale SFT-only provers. This analysis, together with the existing cross-benchmark comparisons, will clarify the contribution of the unified framework while acknowledging the absence of the exact requested ablation. revision: partial

Circularity Check

0 steps flagged

No significant circularity: performance claims rest on external benchmarks

full rationale

The paper presents an empirical framework whose central results are Pass@32 scores on fixed external benchmarks (MiniF2F, ProverBench, PutnamBench, MathOlympiad, ProofNet). OProofs construction and the iterative loop (agentic proving → indexing verified proofs → SFT on repairs → RL on hard cases) are described as training procedures whose outputs are then measured against those held-out benchmarks. No equation, uniqueness theorem, or self-citation is invoked to derive the reported numbers from the training distribution itself; the numbers are direct empirical measurements. Because the evaluation sets are standard public benchmarks and the paper does not claim a closed-form derivation that reduces to its own synthetic data, the logic does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claims rest on the Lean 4 compiler correctly identifying valid proofs and on the assumption that self-generated repair trajectories improve generalization. No new physical entities are postulated.

free parameters (1)

32B model scale
Chosen scale for the final reported model; affects compute and capacity.

axioms (1)

domain assumption Lean 4 compiler provides reliable and complete feedback on proof validity.
Invoked for all verification steps and feedback generation in the agentic loop.

invented entities (1)

OProofs dataset no independent evidence
purpose: Collection of statements, verified proofs, and serialized agentic trajectories for training.
Built from public resources plus model-generated traces; no independent external validation cited.

pith-pipeline@v0.9.0 · 5784 in / 1346 out tokens · 48198 ms · 2026-05-20T14:38:33.935894+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

171 extracted references · 171 canonical work pages · 30 internal anchors

[1]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[2]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Yi: Open Foundation Models by 01.AI

Yi: Open foundation models by 01. ai , author=. arXiv preprint arXiv:2403.04652 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

arXiv preprint arXiv:2405.19327 , year=

Map-neo: Highly capable and transparent bilingual large language model series , author=. arXiv preprint arXiv:2405.19327 , year=

work page arXiv
[5]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=

work page
[6]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[7]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Video-llava: Learning united visual representation by alignment before projection , author=. arXiv preprint arXiv:2311.10122 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

A survey on multimodal large language models for autonomous driving , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page
[9]

arXiv preprint arXiv:2503.11579 , year=

VAMBA: Understanding Hour-Long Videos with Hybrid Mamba-Transformers , author=. arXiv preprint arXiv:2503.11579 , year=

work page arXiv
[10]

arXiv preprint arXiv:2402.12451 , year=

The revolution of multimodal large language models: a survey , author=. arXiv preprint arXiv:2402.12451 , year=

work page arXiv
[11]

Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering , pages=

A Survey of Multimodel Large Language Models , author=. Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering , pages=

work page
[12]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[13]

arXiv preprint arXiv:2407.17379 , year=

MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models , author=. arXiv preprint arXiv:2407.17379 , year=

work page arXiv
[14]

arXiv preprint arXiv:2401.13478 , year=

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval , author=. arXiv preprint arXiv:2401.13478 , year=

work page arXiv
[15]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mvbench: A comprehensive multi-modal video understanding benchmark , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[16]

MLVU: Benchmarking Multi-task Long Video Understanding

Mlvu: A comprehensive benchmark for multi-task long video understanding , author=. arXiv preprint arXiv:2406.04264 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Advances in Neural Information Processing Systems , volume=

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding , author=. Advances in Neural Information Processing Systems , volume=

work page
[18]

2024 , journal=

Aria: An Open Multimodal Native Mixture-of-Experts Model , author=. 2024 , journal=

work page 2024
[19]

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling , author=. arXiv preprint arXiv:2501.12386 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

2025 , url =

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning , author =. 2025 , url =

work page 2025
[22]

CoRR , volume =

Ran Xin and Zeyu Zheng and Yanchen Nie and Kun Yuan and Xia Xiao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2509.06493 , eprinttype =. 2509.06493 , timestamp =

work page doi:10.48550/arxiv.2509.06493 2025
[23]

2024 , eprint =

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems , author =. 2024 , eprint =

work page 2024
[24]

European Conference on Computer Vision , year=

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models , author=. European Conference on Computer Vision , year=

work page
[25]

LLaVA-NeXT: A Strong Zero-shot Video Understanding Model , url=

Zhang, Yuanhan and Li, Bo and Liu, haotian and Lee, Yong jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan , month=. LLaVA-NeXT: A Strong Zero-shot Video Understanding Model , url=

work page
[26]

2024 , eprint=

LLaVA-OneVision: Easy Visual Task Transfer , author=. 2024 , eprint=

work page 2024
[27]

2025 , eprint=

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token , author=. 2025 , eprint=

work page 2025
[28]

2024 , eprint=

LongVILA: Scaling Long-Context Visual Language Models for Long Videos , author=. 2024 , eprint=

work page 2024
[29]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. arXiv:2410.17434 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

MiniCPM-V: A GPT-4V Level MLLM on Your Phone , author=. arXiv preprint arXiv:2408.01800 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

2024 , eprint=

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models , author=. 2024 , eprint=

work page 2024
[32]

2024 , eprint=

NVILA: Efficient Frontier Visual Language Models , author=. 2024 , eprint=

work page 2024
[33]

2025 , eprint=

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment , author=. 2025 , eprint=

work page 2025
[34]

2024 , eprint=

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , author=. 2024 , eprint=

work page 2024
[35]

2024 , eprint=

Phi-4 Technical Report , author=. 2024 , eprint=

work page 2024
[36]

2024 , eprint=

LIME: Less Is More for MLLM Evaluation , author=. 2024 , eprint=

work page 2024
[37]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

2023 , eprint=

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

work page 2023
[40]

2024 , eprint=

VideoChat: Chat-Centric Video Understanding , author=. 2024 , eprint=

work page 2024
[41]

2024 , eprint=

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. 2024 , eprint=

work page 2024
[42]

2024 , eprint=

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. 2024 , eprint=

work page 2024
[43]

2025 , eprint=

Qwen2.5-VL Technical Report , author=. 2025 , eprint=

work page 2025
[44]

2024 , eprint=

Long Context Transfer from Language to Vision , author=. 2024 , eprint=

work page 2024
[45]

2024 , eprint=

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. 2024 , eprint=

work page 2024
[46]

2024 , eprint=

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks , author=. 2024 , eprint=

work page 2024
[47]

2025 , eprint=

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. 2025 , eprint=

work page 2025
[48]

2017 , eprint=

The Kinetics Human Action Video Dataset , author=. 2017 , eprint=

work page 2017
[49]

something something

The "something something" video database for learning and evaluating visual common sense , author=. 2017 , eprint=

work page 2017
[50]

2024 , eprint=

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis , author=. 2024 , eprint=

work page 2024
[51]

2024 , eprint=

InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding , author=. 2024 , eprint=

work page 2024
[52]

2024 , eprint=

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs , author=. 2024 , eprint=

work page 2024
[53]

2024 , eprint=

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning , author=. 2024 , eprint=

work page 2024
[54]

arxiv , year=

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding , author=. arxiv , year=

work page
[55]

Proceedings of CVPR , year=

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI , author=. Proceedings of CVPR , year=

work page
[56]

arXiv preprint arXiv:2401.11944 , year=

Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark , author=. arXiv preprint arXiv:2401.11944 , year=

work page arXiv
[57]

SimpleVQA: Multimodal factuality evaluation for multimodal large lan- guage models.arXiv preprint arXiv:2502.13059,

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models , author=. arXiv preprint arXiv:2502.13059 , year=

work page arXiv
[58]

arXiv preprint arXiv:2311.16103 , year=

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models , author=. arXiv preprint arXiv:2311.16103 , year=

work page arXiv
[59]

Advances in Neural Information Processing Systems , volume=

Longvideobench: A benchmark for long-context interleaved video-language understanding , author=. Advances in Neural Information Processing Systems , volume=

work page
[60]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos , author=. arXiv preprint arXiv:2501.13826 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Phi-4 Technical Report

Phi-4 technical report , author=. arXiv preprint arXiv:2412.08905 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

2024 , publisher =

Microsoft , title =. 2024 , publisher =

work page 2024
[63]

NVILA: Efficient Frontier Visual Language Models

NVILA: Efficient frontier visual language models , author=. arXiv preprint arXiv:2412.04468 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[64]

Long Context Transfer from Language to Vision

Long context transfer from language to vision , author=. arXiv preprint arXiv:2406.16852 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[65]

Llava-mini: Efficient image and video large mul- timodal models with one vision token,

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token , author=. arXiv preprint arXiv:2501.03895 , year=

work page arXiv
[66]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding , author=. arXiv preprint arXiv:2501.13106 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[67]

European Conference on Computer Vision , pages=

Lita: Language instructed temporal-localization assistant , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[68]

InternVL2-8B , author =

work page
[69]

arXiv preprint arXiv:2410.05993 , year=

Aria: An open multimodal native mixture-of-experts model , author=. arXiv preprint arXiv:2410.05993 , year=

work page arXiv
[70]

arXiv preprint arXiv:2412.05237 , year=

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale , author=. arXiv preprint arXiv:2412.05237 , year=

work page arXiv
[71]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[72]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , author=

work page
[73]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models , author=. arXiv preprint arXiv:2304.10592 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[74]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs , author=. arXiv preprint arXiv:2406.07476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[75]

VideoChat: Chat-Centric Video Understanding

VideoChat: Chat-Centric Video Understanding , author=. arXiv preprint arXiv:2305.06355 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[76]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding , author =. arXiv preprint arXiv:2306.02858 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[77]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv e-prints , keywords =. doi:10.48550/arXiv.2010.11929 , archivePrefix =. 2010.11929 , primaryClass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.11929 2010
[78]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023
[79]

European Conference on Computer Vision , pages=

Llama-vid: An image is worth 2 tokens in large language models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[80]

arXiv preprint arXiv:2406.08085 , year=

Flash-vstream: Memory-based real-time understanding for long video streams , author=. arXiv preprint arXiv:2406.08085 , year=

work page arXiv

Showing first 80 references.

[1] [1]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page

[2] [2]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Yi: Open Foundation Models by 01.AI

Yi: Open foundation models by 01. ai , author=. arXiv preprint arXiv:2403.04652 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

arXiv preprint arXiv:2405.19327 , year=

Map-neo: Highly capable and transparent bilingual large language model series , author=. arXiv preprint arXiv:2405.19327 , year=

work page arXiv

[5] [5]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=

work page

[6] [6]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[7] [7]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Video-llava: Learning united visual representation by alignment before projection , author=. arXiv preprint arXiv:2311.10122 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

A survey on multimodal large language models for autonomous driving , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

work page

[9] [9]

arXiv preprint arXiv:2503.11579 , year=

VAMBA: Understanding Hour-Long Videos with Hybrid Mamba-Transformers , author=. arXiv preprint arXiv:2503.11579 , year=

work page arXiv

[10] [10]

arXiv preprint arXiv:2402.12451 , year=

The revolution of multimodal large language models: a survey , author=. arXiv preprint arXiv:2402.12451 , year=

work page arXiv

[11] [11]

Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering , pages=

A Survey of Multimodel Large Language Models , author=. Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering , pages=

work page

[12] [12]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[13] [13]

arXiv preprint arXiv:2407.17379 , year=

MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models , author=. arXiv preprint arXiv:2407.17379 , year=

work page arXiv

[14] [14]

arXiv preprint arXiv:2401.13478 , year=

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval , author=. arXiv preprint arXiv:2401.13478 , year=

work page arXiv

[15] [15]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mvbench: A comprehensive multi-modal video understanding benchmark , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[16] [16]

MLVU: Benchmarking Multi-task Long Video Understanding

Mlvu: A comprehensive benchmark for multi-task long video understanding , author=. arXiv preprint arXiv:2406.04264 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Advances in Neural Information Processing Systems , volume=

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding , author=. Advances in Neural Information Processing Systems , volume=

work page

[18] [18]

2024 , journal=

Aria: An Open Multimodal Native Mixture-of-Experts Model , author=. 2024 , journal=

work page 2024

[19] [19]

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling , author=. arXiv preprint arXiv:2501.12386 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

2025 , url =

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning , author =. 2025 , url =

work page 2025

[22] [22]

CoRR , volume =

Ran Xin and Zeyu Zheng and Yanchen Nie and Kun Yuan and Xia Xiao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2509.06493 , eprinttype =. 2509.06493 , timestamp =

work page doi:10.48550/arxiv.2509.06493 2025

[23] [23]

2024 , eprint =

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems , author =. 2024 , eprint =

work page 2024

[24] [24]

European Conference on Computer Vision , year=

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models , author=. European Conference on Computer Vision , year=

work page

[25] [25]

LLaVA-NeXT: A Strong Zero-shot Video Understanding Model , url=

Zhang, Yuanhan and Li, Bo and Liu, haotian and Lee, Yong jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan , month=. LLaVA-NeXT: A Strong Zero-shot Video Understanding Model , url=

work page

[26] [26]

2024 , eprint=

LLaVA-OneVision: Easy Visual Task Transfer , author=. 2024 , eprint=

work page 2024

[27] [27]

2025 , eprint=

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token , author=. 2025 , eprint=

work page 2025

[28] [28]

2024 , eprint=

LongVILA: Scaling Long-Context Visual Language Models for Long Videos , author=. 2024 , eprint=

work page 2024

[29] [29]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. arXiv:2410.17434 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

MiniCPM-V: A GPT-4V Level MLLM on Your Phone , author=. arXiv preprint arXiv:2408.01800 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

2024 , eprint=

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models , author=. 2024 , eprint=

work page 2024

[32] [32]

2024 , eprint=

NVILA: Efficient Frontier Visual Language Models , author=. 2024 , eprint=

work page 2024

[33] [33]

2025 , eprint=

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment , author=. 2025 , eprint=

work page 2025

[34] [34]

2024 , eprint=

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , author=. 2024 , eprint=

work page 2024

[35] [35]

2024 , eprint=

Phi-4 Technical Report , author=. 2024 , eprint=

work page 2024

[36] [36]

2024 , eprint=

LIME: Less Is More for MLLM Evaluation , author=. 2024 , eprint=

work page 2024

[37] [37]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

2023 , eprint=

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

work page 2023

[40] [40]

2024 , eprint=

VideoChat: Chat-Centric Video Understanding , author=. 2024 , eprint=

work page 2024

[41] [41]

2024 , eprint=

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. 2024 , eprint=

work page 2024

[42] [42]

2024 , eprint=

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. 2024 , eprint=

work page 2024

[43] [43]

2025 , eprint=

Qwen2.5-VL Technical Report , author=. 2025 , eprint=

work page 2025

[44] [44]

2024 , eprint=

Long Context Transfer from Language to Vision , author=. 2024 , eprint=

work page 2024

[45] [45]

2024 , eprint=

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. 2024 , eprint=

work page 2024

[46] [46]

2024 , eprint=

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks , author=. 2024 , eprint=

work page 2024

[47] [47]

2025 , eprint=

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. 2025 , eprint=

work page 2025

[48] [48]

2017 , eprint=

The Kinetics Human Action Video Dataset , author=. 2017 , eprint=

work page 2017

[49] [49]

something something

The "something something" video database for learning and evaluating visual common sense , author=. 2017 , eprint=

work page 2017

[50] [50]

2024 , eprint=

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis , author=. 2024 , eprint=

work page 2024

[51] [51]

2024 , eprint=

InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding , author=. 2024 , eprint=

work page 2024

[52] [52]

2024 , eprint=

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs , author=. 2024 , eprint=

work page 2024

[53] [53]

2024 , eprint=

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning , author=. 2024 , eprint=

work page 2024

[54] [54]

arxiv , year=

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding , author=. arxiv , year=

work page

[55] [55]

Proceedings of CVPR , year=

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI , author=. Proceedings of CVPR , year=

work page

[56] [56]

arXiv preprint arXiv:2401.11944 , year=

Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark , author=. arXiv preprint arXiv:2401.11944 , year=

work page arXiv

[57] [57]

SimpleVQA: Multimodal factuality evaluation for multimodal large lan- guage models.arXiv preprint arXiv:2502.13059,

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models , author=. arXiv preprint arXiv:2502.13059 , year=

work page arXiv

[58] [58]

arXiv preprint arXiv:2311.16103 , year=

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models , author=. arXiv preprint arXiv:2311.16103 , year=

work page arXiv

[59] [59]

Advances in Neural Information Processing Systems , volume=

Longvideobench: A benchmark for long-context interleaved video-language understanding , author=. Advances in Neural Information Processing Systems , volume=

work page

[60] [60]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos , author=. arXiv preprint arXiv:2501.13826 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[61] [61]

Phi-4 Technical Report

Phi-4 technical report , author=. arXiv preprint arXiv:2412.08905 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[62] [62]

2024 , publisher =

Microsoft , title =. 2024 , publisher =

work page 2024

[63] [63]

NVILA: Efficient Frontier Visual Language Models

NVILA: Efficient frontier visual language models , author=. arXiv preprint arXiv:2412.04468 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[64] [64]

Long Context Transfer from Language to Vision

Long context transfer from language to vision , author=. arXiv preprint arXiv:2406.16852 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[65] [65]

Llava-mini: Efficient image and video large mul- timodal models with one vision token,

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token , author=. arXiv preprint arXiv:2501.03895 , year=

work page arXiv

[66] [66]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding , author=. arXiv preprint arXiv:2501.13106 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[67] [67]

European Conference on Computer Vision , pages=

Lita: Language instructed temporal-localization assistant , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024

[68] [68]

InternVL2-8B , author =

work page

[69] [69]

arXiv preprint arXiv:2410.05993 , year=

Aria: An open multimodal native mixture-of-experts model , author=. arXiv preprint arXiv:2410.05993 , year=

work page arXiv

[70] [70]

arXiv preprint arXiv:2412.05237 , year=

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale , author=. arXiv preprint arXiv:2412.05237 , year=

work page arXiv

[71] [71]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[72] [72]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , author=

work page

[73] [73]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models , author=. arXiv preprint arXiv:2304.10592 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[74] [74]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs , author=. arXiv preprint arXiv:2406.07476 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[75] [75]

VideoChat: Chat-Centric Video Understanding

VideoChat: Chat-Centric Video Understanding , author=. arXiv preprint arXiv:2305.06355 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[76] [76]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding , author =. arXiv preprint arXiv:2306.02858 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[77] [77]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv e-prints , keywords =. doi:10.48550/arXiv.2010.11929 , archivePrefix =. 2010.11929 , primaryClass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.11929 2010

[78] [78]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023

[79] [79]

European Conference on Computer Vision , pages=

Llama-vid: An image is worth 2 tokens in large language models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024

[80] [80]

arXiv preprint arXiv:2406.08085 , year=

Flash-vstream: Memory-based real-time understanding for long video streams , author=. arXiv preprint arXiv:2406.08085 , year=

work page arXiv