pith. sign in

arxiv: 2605.17283 · v1 · pith:ZFIIU7OKnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

OProver: A Unified Framework for Agentic Formal Theorem Proving

Pith reviewed 2026-05-20 14:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords formal theorem provingLean 4agentic provingproof repairiterative post-trainingOProofs datasetwhole-proof generationretrieval-augmented proving
0
0 comments X

The pith

OProver embeds agentic proof revision into the training loop itself, not just inference, for Lean 4 theorem proving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OProver as a framework that folds agentic proving into model training for formal theorem proving. It runs cycles of proof generation, retrieval of verified proofs, repair of failures using compiler feedback, and then feeds the resulting trajectories back as supervised data while routing hard cases to reinforcement learning. A 32B model trained this way records the highest Pass@32 scores on MiniF2F, ProverBench, and PutnamBench among open-weight whole-proof systems and places second on two others. Readers would care because the approach turns the prover's own mistakes and successes into a growing, self-improving data source rather than depending only on static external corpora.

Core claim

OProver is a unified Lean 4 framework in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback; it is trained via continued pretraining followed by iterative post-training that indexes new verified proofs into OProofs and retrieval memory, converts repair trajectories into SFT data, and routes unresolved cases to RL, yielding a 32B model with top Pass@32 results on three of five benchmarks and more leading placements than prior open-weight provers.

What carries the argument

The iterative post-training loop that runs agentic proving, indexes newly verified proofs, uses repair trajectories as SFT data, and unresolved hard cases for RL, supported by the OProofs dataset of 1.77M statements and 6.86M compiler-verified proofs with serialized trajectories.

If this is right

  • Repair trajectories extracted from the model's own agentic attempts supply higher-quality training signals than static proof corpora alone.
  • Self-generated verified proofs can be indexed and retrieved to bootstrap further capability without external human annotations.
  • Unresolved hard cases identified during agentic runs can be routed to RL to target remaining weaknesses directly.
  • A single framework can unify proof generation, verification, retrieval, and retraining into repeated closed loops.
  • Open-weight models can achieve leading placements across both introductory and olympiad-level benchmarks through this process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-repair loop could be adapted to other formal languages if the compiler feedback and indexing components are ported.
  • Over multiple iterations the method may reduce reliance on any fixed external dataset by growing its own verified proof memory.
  • Measuring performance on problems whose statements lie far outside the distribution of OProofs would test whether gains reflect improved reasoning rather than memorization.
  • Combining the loop with even larger base models might extend strong results to harder open problems that current runs leave unresolved.

Load-bearing premise

The iterative post-training loop that indexes newly verified proofs and uses repair trajectories as SFT data plus unresolved cases for RL produces genuine capability gains rather than benchmark-specific overfitting or distribution shift from synthetic data.

What would settle it

Test the final 32B model on a fresh collection of theorems drawn from sources never appearing in OProofs construction or any post-training iteration and measure whether Pass@32 remains at the reported levels.

Figures

Figures reproduced from arXiv: 2605.17283 by David Ma, Enduo Zhao, Gavin Cheung, Jiaheng Liu, Jiajun Shi, Kaijing Ma, Shawn Guo, Yunfeng Shi, Zhaoxiang Zhang, Zili Wang.

Figure 1
Figure 1. Figure 1: Pass@32 performance on MiniF2F-Test, PutnamBench, and MathOlympiadBench. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of OProver. The framework has three components: [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: OProofs Lean 4 corpus statistics. (a) Headline counts of statements, proofs, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Test-time scaling of OProver under a fixed total budget [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance under fixed total budget 𝐵 as a function of refinement depth 𝑅. Each curve plots Pass(𝑅, 𝐵/𝑅), isolating the trade-off between deeper interaction and wider sampling. On most benchmarks performance keeps improving as 𝑅 grows up to 𝑅 = 16, while on the hardest benchmark (PutnamBench) the optimum settles around 𝑅 = 8, indicating that very narrow sampling can hurt when single-chain success probabil… view at source ↗
Figure 6
Figure 6. Figure 6: MiniF2F-Test Pass@32 across post-training iterations. Both OProver-8B and OProver-32B improve monotonically. To examine whether the iterative co-evolution loop between OProver and OProofs (§2.3) yields measurable improvements at each step, we eval￾uate Pass@32 on MiniF2F-Test for every inter￾mediate checkpoint [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
read the original abstract

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%) more top placements than any prior open-weight whole-proof prover.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents OProver, a unified framework for agentic formal theorem proving in Lean 4. It integrates agentic proving into training via continued pretraining followed by iterative post-training: each iteration performs agentic proving on statements, indexes newly verified proofs into the OProofs dataset and retrieval memory, uses repair trajectories as SFT data, and unresolved cases for RL. OProofs is constructed from public Lean resources, large-scale proof synthesis, and agentic traces, containing 1.77M statements and 6.86M compiler-verified proofs. The OProver-32B model reports the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), ranking second on MathOlympiad (22.8%) and ProofNet (33.2%), claiming more top placements than prior open-weight whole-proof provers.

Significance. If the central performance claims hold after addressing data and ablation issues, the work would be significant for formal theorem proving by moving agentic revision and retrieval from inference-only to a closed training loop. The explicit use of compiler-verified proofs and serialized trajectories in OProofs provides a reproducible data resource. The benchmark results, if shown to be robust, would represent a practical advance over prior open-weight provers that separate training from agentic search.

major comments (3)
  1. [§4, Table 1] §4 (Experiments), Table 1: Pass@32 scores are reported as single point estimates (e.g., 93.3% on MiniF2F) with no standard deviations, multiple random seeds, or statistical tests against baselines; this leaves the claim of superior performance only partially supported and vulnerable to run-to-run variance.
  2. [§3.3] §3.3 (OProofs Construction): The description of dataset sources (public Lean resources plus large-scale synthesis and agentic traces) does not state or verify that the five evaluation benchmarks were excluded from all synthesis pipelines or retrieval indices, creating a concrete risk that measured gains reflect reduced distribution shift rather than new reasoning capability.
  3. [§3.2] §3.2 (Iterative Post-Training Loop): No ablation is presented that compares the full iterative loop (repair SFT + hard-case RL + retrieval update) against a baseline that simply scales SFT data volume without the agentic repair/RL cycle; without this isolation, it is unclear whether the reported gains are load-bearingly due to the unified framework or to data scale alone.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'more top placements than any prior open-weight whole-proof prover' would be clearer if the abstract briefly listed the exact prior systems and their placements for direct comparison.
  2. [§4.1] §4.1: The baseline descriptions omit whether any prior systems were also given access to the same retrieval memory or compiler feedback during their reported evaluations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results and methodology without overstating what the current experiments demonstrate.

read point-by-point responses
  1. Referee: [§4, Table 1] §4 (Experiments), Table 1: Pass@32 scores are reported as single point estimates (e.g., 93.3% on MiniF2F) with no standard deviations, multiple random seeds, or statistical tests against baselines; this leaves the claim of superior performance only partially supported and vulnerable to run-to-run variance.

    Authors: We agree that single-point estimates limit the robustness of the performance claims. In the revised manuscript we will rerun the key Pass@32 evaluations on MiniF2F, ProverBench, and PutnamBench using at least three independent random seeds, report means together with standard deviations, and add pairwise statistical comparisons (e.g., McNemar or bootstrap tests) against the strongest published baselines. These additions will be placed in an expanded §4 and Table 1. revision: yes

  2. Referee: [§3.3] §3.3 (OProofs Construction): The description of dataset sources (public Lean resources plus large-scale synthesis and agentic traces) does not state or verify that the five evaluation benchmarks were excluded from all synthesis pipelines or retrieval indices, creating a concrete risk that measured gains reflect reduced distribution shift rather than new reasoning capability.

    Authors: We appreciate the referee’s emphasis on this reproducibility concern. All five evaluation benchmarks were in fact held out from every stage of OProofs construction: public Lean resources were filtered to exclude benchmark statements, synthesis prompts were generated only from non-benchmark theorems, and retrieval indices contained solely compiler-verified proofs from the training distribution. We will revise §3.3 to include an explicit paragraph describing the exclusion protocol and the verification steps performed, thereby removing any ambiguity about distribution shift. revision: yes

  3. Referee: [§3.2] §3.2 (Iterative Post-Training Loop): No ablation is presented that compares the full iterative loop (repair SFT + hard-case RL + retrieval update) against a baseline that simply scales SFT data volume without the agentic repair/RL cycle; without this isolation, it is unclear whether the reported gains are load-bearingly due to the unified framework or to data scale alone.

    Authors: We concur that a controlled ablation isolating the iterative agentic components from raw data scaling would be ideal. Training an additional 32B model on an SFT corpus matched in volume but stripped of repair trajectories, RL on unresolved cases, and dynamic retrieval updates is computationally prohibitive within our current budget. Instead, we will add a new subsection in §3.2 that quantifies incremental gains observed after each iteration of the loop and contrasts these trajectories against prior large-scale SFT-only provers. This analysis, together with the existing cross-benchmark comparisons, will clarify the contribution of the unified framework while acknowledging the absence of the exact requested ablation. revision: partial

Circularity Check

0 steps flagged

No significant circularity: performance claims rest on external benchmarks

full rationale

The paper presents an empirical framework whose central results are Pass@32 scores on fixed external benchmarks (MiniF2F, ProverBench, PutnamBench, MathOlympiad, ProofNet). OProofs construction and the iterative loop (agentic proving → indexing verified proofs → SFT on repairs → RL on hard cases) are described as training procedures whose outputs are then measured against those held-out benchmarks. No equation, uniqueness theorem, or self-citation is invoked to derive the reported numbers from the training distribution itself; the numbers are direct empirical measurements. Because the evaluation sets are standard public benchmarks and the paper does not claim a closed-form derivation that reduces to its own synthetic data, the logic does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claims rest on the Lean 4 compiler correctly identifying valid proofs and on the assumption that self-generated repair trajectories improve generalization. No new physical entities are postulated.

free parameters (1)
  • 32B model scale
    Chosen scale for the final reported model; affects compute and capacity.
axioms (1)
  • domain assumption Lean 4 compiler provides reliable and complete feedback on proof validity.
    Invoked for all verification steps and feedback generation in the agentic loop.
invented entities (1)
  • OProofs dataset no independent evidence
    purpose: Collection of statements, verified proofs, and serialized agentic trajectories for training.
    Built from public resources plus model-generated traces; no independent external validation cited.

pith-pipeline@v0.9.0 · 5784 in / 1346 out tokens · 48198 ms · 2026-05-20T14:38:33.935894+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

171 extracted references · 171 canonical work pages · 30 internal anchors

  1. [1]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  2. [2]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

  3. [3]

    Yi: Open Foundation Models by 01.AI

    Yi: Open foundation models by 01. ai , author=. arXiv preprint arXiv:2403.04652 , year=

  4. [4]

    arXiv preprint arXiv:2405.19327 , year=

    Map-neo: Highly capable and transparent bilingual large language model series , author=. arXiv preprint arXiv:2405.19327 , year=

  5. [5]

    Advances in neural information processing systems , volume=

    Visual instruction tuning , author=. Advances in neural information processing systems , volume=

  6. [6]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  7. [7]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Video-llava: Learning united visual representation by alignment before projection , author=. arXiv preprint arXiv:2311.10122 , year=

  8. [8]

    Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

    A survey on multimodal large language models for autonomous driving , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

  9. [9]

    arXiv preprint arXiv:2503.11579 , year=

    VAMBA: Understanding Hour-Long Videos with Hybrid Mamba-Transformers , author=. arXiv preprint arXiv:2503.11579 , year=

  10. [10]

    arXiv preprint arXiv:2402.12451 , year=

    The revolution of multimodal large language models: a survey , author=. arXiv preprint arXiv:2402.12451 , year=

  11. [11]

    Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering , pages=

    A Survey of Multimodel Large Language Models , author=. Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering , pages=

  12. [12]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  13. [13]

    arXiv preprint arXiv:2407.17379 , year=

    MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models , author=. arXiv preprint arXiv:2407.17379 , year=

  14. [14]

    arXiv preprint arXiv:2401.13478 , year=

    SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval , author=. arXiv preprint arXiv:2401.13478 , year=

  15. [15]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mvbench: A comprehensive multi-modal video understanding benchmark , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  16. [16]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Mlvu: A comprehensive benchmark for multi-task long video understanding , author=. arXiv preprint arXiv:2406.04264 , year=

  17. [17]

    Advances in Neural Information Processing Systems , volume=

    Mmbench-video: A long-form multi-shot benchmark for holistic video understanding , author=. Advances in Neural Information Processing Systems , volume=

  18. [18]

    2024 , journal=

    Aria: An Open Multimodal Native Mixture-of-Experts Model , author=. 2024 , journal=

  19. [19]

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling , author=. arXiv preprint arXiv:2501.12386 , year=

  20. [20]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. arXiv preprint arXiv:2412.05271 , year=

  21. [21]

    2025 , url =

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning , author =. 2025 , url =

  22. [22]

    CoRR , volume =

    Ran Xin and Zeyu Zheng and Yanchen Nie and Kun Yuan and Xia Xiao , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2509.06493 , eprinttype =. 2509.06493 , timestamp =

  23. [23]

    2024 , eprint =

    Lean Workbook: A large-scale Lean problem set formalized from natural language math problems , author =. 2024 , eprint =

  24. [24]

    European Conference on Computer Vision , year=

    LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models , author=. European Conference on Computer Vision , year=

  25. [25]

    LLaVA-NeXT: A Strong Zero-shot Video Understanding Model , url=

    Zhang, Yuanhan and Li, Bo and Liu, haotian and Lee, Yong jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan , month=. LLaVA-NeXT: A Strong Zero-shot Video Understanding Model , url=

  26. [26]

    2024 , eprint=

    LLaVA-OneVision: Easy Visual Task Transfer , author=. 2024 , eprint=

  27. [27]

    2025 , eprint=

    LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token , author=. 2025 , eprint=

  28. [28]

    2024 , eprint=

    LongVILA: Scaling Long-Context Visual Language Models for Long Videos , author=. 2024 , eprint=

  29. [29]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. arXiv:2410.17434 , year=

  30. [30]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone , author=. arXiv preprint arXiv:2408.01800 , year=

  31. [31]

    2024 , eprint=

    mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models , author=. 2024 , eprint=

  32. [32]

    2024 , eprint=

    NVILA: Efficient Frontier Visual Language Models , author=. 2024 , eprint=

  33. [33]

    2025 , eprint=

    Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment , author=. 2025 , eprint=

  34. [34]

    2024 , eprint=

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , author=. 2024 , eprint=

  35. [35]

    2024 , eprint=

    Phi-4 Technical Report , author=. 2024 , eprint=

  36. [36]

    2024 , eprint=

    LIME: Less Is More for MLLM Evaluation , author=. 2024 , eprint=

  37. [37]

    Qwen2.5-VL Technical Report

    Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

  38. [38]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

  39. [39]

    2023 , eprint=

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

  40. [40]

    2024 , eprint=

    VideoChat: Chat-Centric Video Understanding , author=. 2024 , eprint=

  41. [41]

    2024 , eprint=

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. 2024 , eprint=

  42. [42]

    2024 , eprint=

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. 2024 , eprint=

  43. [43]

    2025 , eprint=

    Qwen2.5-VL Technical Report , author=. 2025 , eprint=

  44. [44]

    2024 , eprint=

    Long Context Transfer from Language to Vision , author=. 2024 , eprint=

  45. [45]

    2024 , eprint=

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. 2024 , eprint=

  46. [46]

    2024 , eprint=

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks , author=. 2024 , eprint=

  47. [47]

    2025 , eprint=

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. 2025 , eprint=

  48. [48]

    2017 , eprint=

    The Kinetics Human Action Video Dataset , author=. 2017 , eprint=

  49. [49]

    something something

    The "something something" video database for learning and evaluating visual common sense , author=. 2017 , eprint=

  50. [50]

    2024 , eprint=

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis , author=. 2024 , eprint=

  51. [51]

    2024 , eprint=

    InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding , author=. 2024 , eprint=

  52. [52]

    2024 , eprint=

    How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs , author=. 2024 , eprint=

  53. [53]

    2024 , eprint=

    VideoVista: A Versatile Benchmark for Video Understanding and Reasoning , author=. 2024 , eprint=

  54. [54]

    arxiv , year=

    VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding , author=. arxiv , year=

  55. [55]

    Proceedings of CVPR , year=

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI , author=. Proceedings of CVPR , year=

  56. [56]

    arXiv preprint arXiv:2401.11944 , year=

    Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark , author=. arXiv preprint arXiv:2401.11944 , year=

  57. [57]

    SimpleVQA: Multimodal factuality evaluation for multimodal large lan- guage models.arXiv preprint arXiv:2502.13059,

    SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models , author=. arXiv preprint arXiv:2502.13059 , year=

  58. [58]

    Video-bench: A comprehensive benchmark and toolkit for evaluat- ing video-based large language models

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models , author=. arXiv preprint arXiv:2311.16103 , year=

  59. [59]

    Advances in Neural Information Processing Systems , volume=

    Longvideobench: A benchmark for long-context interleaved video-language understanding , author=. Advances in Neural Information Processing Systems , volume=

  60. [60]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos , author=. arXiv preprint arXiv:2501.13826 , year=

  61. [61]

    Phi-4 Technical Report

    Phi-4 technical report , author=. arXiv preprint arXiv:2412.08905 , year=

  62. [62]

    2024 , publisher =

    Microsoft , title =. 2024 , publisher =

  63. [63]

    NVILA: Efficient Frontier Visual Language Models

    NVILA: Efficient frontier visual language models , author=. arXiv preprint arXiv:2412.04468 , year=

  64. [64]

    Long Context Transfer from Language to Vision

    Long context transfer from language to vision , author=. arXiv preprint arXiv:2406.16852 , year=

  65. [65]

    Llava-mini: Efficient image and video large mul- timodal models with one vision token,

    LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token , author=. arXiv preprint arXiv:2501.03895 , year=

  66. [66]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding , author=. arXiv preprint arXiv:2501.13106 , year=

  67. [67]

    European Conference on Computer Vision , pages=

    Lita: Language instructed temporal-localization assistant , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  68. [68]

    InternVL2-8B , author =

  69. [69]

    arXiv preprint arXiv:2410.05993 , year=

    Aria: An open multimodal native mixture-of-experts model , author=. arXiv preprint arXiv:2410.05993 , year=

  70. [70]

    arXiv preprint arXiv:2412.05237 , year=

    Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale , author=. arXiv preprint arXiv:2412.05237 , year=

  71. [71]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  72. [72]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , author=

  73. [73]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models , author=. arXiv preprint arXiv:2304.10592 , year=

  74. [74]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs , author=. arXiv preprint arXiv:2406.07476 , year=

  75. [75]

    VideoChat: Chat-Centric Video Understanding

    VideoChat: Chat-Centric Video Understanding , author=. arXiv preprint arXiv:2305.06355 , year=

  76. [76]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding , author =. arXiv preprint arXiv:2306.02858 , year =

  77. [77]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv e-prints , keywords =. doi:10.48550/arXiv.2010.11929 , archivePrefix =. 2010.11929 , primaryClass =

  78. [78]

    International conference on machine learning , pages=

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

  79. [79]

    European Conference on Computer Vision , pages=

    Llama-vid: An image is worth 2 tokens in large language models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  80. [80]

    arXiv preprint arXiv:2406.08085 , year=

    Flash-vstream: Memory-based real-time understanding for long video streams , author=. arXiv preprint arXiv:2406.08085 , year=

Showing first 80 references.