pith. machine review for the scientific record.

arxiv: 2503.01743 · v2 · submitted 2025-03-03 · 💻 cs.CL · cs.AI · cs.LG

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Pith reviewed 2026-05-11 22:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords compact language models · synthetic data · mixture of LoRAs · multimodal reasoning · vision and speech integration · efficient inference · Phi-4 series

The pith

A 3.8-billion-parameter model matches models twice its size on complex math and coding reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Phi-4-Mini, a 3.8-billion-parameter language model trained on high-quality web and synthetic data, along with its multimodal extension Phi-4-Multimodal. It shows that careful curation of synthetic math and coding data allows the small model to outperform recent open-source models of similar size while matching the performance of models twice as large on tasks that require complex reasoning. The multimodal version adds vision and speech capabilities through a Mixture-of-LoRAs architecture that uses modality-specific routers and low-rank adapters, enabling combined inference modes without cross-modality interference. This setup also produces a top-ranked result on the OpenASR speech recognition leaderboard using only a 460-million-parameter speech adapter. An experimental further-trained version of the base model reaches reasoning levels comparable to larger distilled models such as DeepSeek-R1-Distill-Qwen-7B.

Core claim

Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data that significantly outperforms recent open-source models of similar size and matches the performance of models twice its size on math and coding tasks requiring complex reasoning. Phi-4-Multimodal integrates text, vision, and speech/audio inputs into a single model by leveraging LoRA adapters and modality-specific routers, supporting multiple inference modes without interference and outperforming larger vision-language and speech-language models on a wide range of tasks while ranking first on the OpenASR leaderboard.

What carries the argument

Mixture-of-LoRAs with modality-specific routers that attach separate low-rank adapters for vision and speech to a shared language-model backbone, allowing independent activation of modalities during inference.
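To make the mechanism concrete, here is a minimal sketch of a frozen linear layer with one low-rank adapter per modality. It assumes the router is a hard selection keyed on the input's modality tag, which is a simplification this page does not confirm. The class `RoutedLoRALinear` and its methods are hypothetical names for illustration, not the paper's implementation.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (rows x cols) by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class RoutedLoRALinear:
    """Frozen linear layer plus one low-rank adapter per modality (hypothetical)."""

    def __init__(self, weight: Matrix, scale: float = 1.0) -> None:
        self.weight = weight  # frozen base weight, shape (d_out, d_in)
        self.scale = scale    # LoRA scaling factor, usually written alpha / r
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def add_adapter(self, modality: str, a: Matrix, b: Matrix) -> None:
        # a has shape (r, d_in); b has shape (d_out, r)
        self.adapters[modality] = (a, b)

    def forward(self, x: Vector, modality: Optional[str] = None) -> Vector:
        y = matvec(self.weight, x)
        if modality is None:
            return y  # text-only path: the base model is left untouched
        # Assumed routing rule: pick exactly one adapter by modality tag.
        a, b = self.adapters[modality]
        delta = matvec(b, matvec(a, x))
        return [yi + self.scale * di for yi, di in zip(y, delta)]


layer = RoutedLoRALinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("vision", a=[[1.0, 1.0]], b=[[0.5], [0.0]])
layer.add_adapter("speech", a=[[1.0, -1.0]], b=[[0.0], [2.0]])

x = [2.0, 1.0]
print(layer.forward(x))            # base output only
print(layer.forward(x, "vision"))  # base output plus the vision adapter's update
print(layer.forward(x, "speech"))  # base output plus the speech adapter's update
```

Because the base weight is never modified, the text-only path reproduces the base model exactly, and each modality's update is computed from its own adapter alone. Under this assumed routing rule, that is what "without interference" would mean.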

If this is right

  • High-quality synthetic data focused on reasoning can close much of the performance gap between small and large language models.
  • Modality extensions can be added to an existing language model with only a few hundred million additional parameters while preserving base-model behavior.
  • Group-query attention and a 200K-token vocabulary improve efficiency for long sequences and multilingual use without increasing overall model size.
  • An additional phase of reasoning-focused training on a compact model can bring its capabilities in line with larger distilled reasoning models.
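The efficiency point about group query attention can be illustrated with a rough key-value cache calculation. The layer count, head counts, head dimension, and sequence length below are hypothetical values chosen for illustration; this page does not state the model's actual configuration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, for illustration only.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 24, 8, 128, 128_000

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)     # query heads share KV heads

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha / gqa:.1f}x")
```

With these assumed numbers the cache shrinks by the ratio of query heads to key-value heads, a factor of three here. The saving grows linearly with sequence length, which is why it matters most for long-sequence generation.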

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may lower the compute barrier for deploying capable multimodal systems in resource-constrained settings.
  • Similar router-based adapter designs could be tested for adding other input types such as video or sensor data.
  • If the synthetic-data advantage holds on out-of-distribution problems, it would indicate that data curation can serve as an alternative to continued parameter scaling for certain capabilities.

Load-bearing premise

The curated synthetic data produces genuine generalization on reasoning tasks rather than fitting to the specific benchmarks used for evaluation.

What would settle it

A controlled evaluation on a fresh set of math and coding problems that are structurally different from the synthetic training data, using identical prompting and decoding settings, in which Phi-4-Mini shows no advantage over other 3-4B open models.
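One way to decide whether a model "shows no advantage" on such a fresh problem set is a paired bootstrap over per-problem correctness. The sketch below is a generic statistical recipe, not a procedure taken from the paper. The function name, the 95% interval, and the resample count are assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff(correct_a: List[int], correct_b: List[int],
                          n_resamples: int = 2000,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed accuracy difference A - B, lower bound, upper bound).

    Inputs are 0/1 correctness flags for the same problems in the same order.
    Bounds are the 2.5th and 97.5th percentiles of the resampled differences.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty result lists over the same problems")
    n = len(correct_a)
    rng = random.Random(seed)
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each A/B pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * (n_resamples - 1))]
    upper = diffs[int(0.975 * (n_resamples - 1))]
    return observed, lower, upper
```

If the interval for the accuracy difference between Phi-4-Mini and a comparable 3-4B model contains zero on the fresh set, the data give no support for an advantage. Resampling pairs rather than independent samples controls for problem difficulty.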

Original abstract

We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Phi-4-Mini, a 3.8B-parameter language model trained on high-quality web and synthetic data. The authors claim that it significantly outperforms recent open-source models of similar size and matches the performance of models twice its size on math and coding tasks requiring complex reasoning. They attribute this to a curated synthetic data recipe; relative to Phi-3.5-Mini, the model also adds an expanded 200K-token vocabulary and group query attention. The paper also presents Phi-4-Multimodal, which extends the model to vision and speech via Mixture-of-LoRAs with modality-specific routers, reporting first place on the OpenASR leaderboard with a 460M-parameter speech LoRA and outperforming larger models on multimodal tasks. An experimental further-trained variant is claimed to reach reasoning performance on par with or exceeding 7B-8B models such as the DeepSeek-R1-Distill variants.

Significance. If the empirical claims are substantiated with full evaluation details, the work would provide concrete evidence that targeted synthetic data curation combined with efficient parameter-efficient adaptation (Mixture-of-LoRAs) can close the gap between compact and much larger models on reasoning and multimodal benchmarks. The explicit reporting of LoRA parameter counts and the modality-router design offer practical engineering contributions for deploying capable multimodal systems under resource constraints.

major comments (3)
  1. [Abstract] The central claims of outperforming similar-sized models and matching models twice as large on math and coding rest on unspecified benchmarks, shot counts, decoding parameters, and statistical significance. Without these, it is impossible to verify that the reported gains reflect the synthetic data recipe or architecture rather than evaluation differences.
  2. [Training] Training and data sections: The paper states that performance 'is driven by a carefully curated synthetic data recipe' but supplies no information on data sources, decontamination steps, exclusion of test-set-like problems, or contamination controls. This directly undermines the claim that gains represent genuine generalization on complex reasoning tasks.
  3. [Multimodal Architecture] Multimodal extension: While the 460M-parameter speech LoRA size is stated, the description of modality-specific routers preventing interference lacks ablation studies or quantitative metrics isolating the router contribution versus simple LoRA addition, which is load-bearing for the novelty claim of the Mixture-of-LoRAs approach.
minor comments (2)
  1. The paper would benefit from a dedicated reproducibility appendix listing exact prompt templates, evaluation harness versions, and hardware details for all reported benchmarks.
  2. Notation for the modality routers could be clarified with a small equation or pseudocode block to distinguish router gating from standard LoRA scaling.
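One possible notation for the distinction raised in minor comment 2, offered as a sketch rather than the paper's own formulation. The gate $g_m$, the per-modality ranks $r_m$, and the scalings $\alpha_m$ are assumed symbols that do not appear on this page.

```latex
% Standard LoRA on a frozen weight W_0 with rank r and scaling \alpha:
h = W_0 x + \frac{\alpha}{r} B A x

% Hypothetical modality-routed variant: g_m \in \{0, 1\} is a gate set by the
% input's modality m, while \alpha_m / r_m remains the usual LoRA scaling.
h = W_0 x + \sum_{m \in \{\text{vision},\, \text{speech}\}} g_m \, \frac{\alpha_m}{r_m} B_m A_m x
```

In this reading the gate decides whether an adapter participates at all, whereas the scaling only sets the magnitude of an active adapter's update. With every $g_m = 0$ the layer reduces to the frozen base model.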

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and commit to revising the paper to improve clarity, transparency, and rigor where feasible.

point-by-point responses
  1. Referee: [Abstract] The central claims of outperforming similar-sized models and matching models twice as large on math and coding rest on unspecified benchmarks, shot counts, decoding parameters, and statistical significance. Without these, it is impossible to verify that the reported gains reflect the synthetic data recipe or architecture rather than evaluation differences.

    Authors: We agree that greater specificity in the abstract would strengthen verifiability. In the revised manuscript, we will expand the abstract to name the primary benchmarks (MATH, GSM8K, HumanEval, MBPP), note the few-shot settings and decoding parameters used, and indicate that results include standard deviations from multiple runs where applicable.
    revision: yes

  2. Referee: [Training] Training and data sections: The paper states that performance 'is driven by a carefully curated synthetic data recipe' but supplies no information on data sources, decontamination steps, exclusion of test-set-like problems, or contamination controls. This directly undermines the claim that gains represent genuine generalization on complex reasoning tasks.

    Authors: We acknowledge the need for greater transparency on data practices to support generalization claims. The manuscript describes the high-level synthetic data recipe focused on math and coding, but we agree more detail is warranted. In revision we will add a dedicated subsection outlining general decontamination procedures, similarity-based exclusion of test-set overlaps, and high-level source categories. Full proprietary data sources cannot be disclosed for licensing and competitive reasons.
    revision: partial

  3. Referee: [Multimodal Architecture] Multimodal extension: While the 460M-parameter speech LoRA size is stated, the description of modality-specific routers preventing interference lacks ablation studies or quantitative metrics isolating the router contribution versus simple LoRA addition, which is load-bearing for the novelty claim of the Mixture-of-LoRAs approach.

    Authors: The modality-specific routers are central to the Mixture-of-LoRAs design for interference-free multi-modal inference. We agree that explicit ablations would better substantiate the novelty. In the revised manuscript we will include new ablation experiments comparing the full router-equipped setup against plain LoRA additions, reporting quantitative metrics on both task performance and cross-modal interference.
    revision: yes
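The "similarity-based exclusion of test-set overlaps" mentioned in response 2 could take many forms. The following is a minimal sketch of one common approach, word n-gram overlap filtering. The function names, the n-gram length, and the threshold are illustrative assumptions, not details from the paper or the rebuttal.

```python
import re
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase the text, drop punctuation, and return its set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(candidate: str, benchmark_item: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the candidate."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0  # item shorter than n words: nothing to compare
    return len(bench & ngrams(candidate, n)) / len(bench)


def filter_training_set(train: Iterable[str], benchmark: List[str],
                        n: int = 5, threshold: float = 0.5) -> List[str]:
    """Keep training samples whose overlap with every benchmark item is below threshold."""
    return [
        sample for sample in train
        if all(overlap_fraction(sample, item, n) < threshold for item in benchmark)
    ]


benchmark = ["If a train travels 60 miles in 2 hours, what is its average speed?"]
train = [
    "Problem: if a train travels 60 miles in 2 hours, what is its average speed? Answer: 30 mph",
    "Compute the derivative of x squared with respect to x.",
]
print(filter_training_set(train, benchmark))
```

Exact n-gram matching catches verbatim and lightly reformatted leaks but misses paraphrases. A referee would therefore reasonably ask which similarity measure and threshold the authors used.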

Circularity Check

0 steps flagged

No derivation chain or self-referential reductions present

The paper is an empirical technical report describing model architecture, training data curation, and benchmark results for Phi-4-Mini and Phi-4-Multimodal. It contains no equations, first-principles derivations, or predictive claims that could reduce to inputs by construction. Performance statements are direct comparisons to external models and benchmarks; the synthetic data recipe is described at a high level without any fitted-parameter-to-prediction loop. Self-references to prior Phi models are limited to factual comparisons and do not carry load-bearing uniqueness theorems or ansatzes. The central claims remain independent of any internal circular structure.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 1 invented entity

Performance claims rest on the unverified quality and lack of contamination in the synthetic math/coding data, plus the assumption that LoRA adapters plus routers produce non-interfering modality fusion without additional hidden costs.

free parameters (2)
  • Vocabulary size
    Expanded to 200K tokens to support multilingual use; chosen rather than derived.
  • Speech LoRA parameter count
    Set at 460 million parameters for the audio adapter; a design choice that directly affects the multimodal claim.
invented entities (1)
  • Mixture-of-LoRAs with modality-specific routers · no independent evidence
    purpose: Enable multiple inference modes across text, vision, and speech without interference
    New architectural pattern introduced to combine modalities; no independent evidence provided beyond the reported results.

pith-pipeline@v0.9.0 · 5931 in / 1203 out tokens · 38264 ms · 2026-05-11T22:17:26.602927+00:00 · methodology

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  2. HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    cs.SD 2026-04 unverdicted novelty 8.0

    HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...

  3. DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

    cs.AI 2026-04 unverdicted novelty 8.0

    DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

  4. MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.

  5. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  6. Trust Me, Import This: Dependency Steering Attacks via Malicious Agent Skills

    cs.CR 2026-05 unverdicted novelty 7.0

    Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.

  7. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  8. RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

    cs.RO 2026-05 unverdicted novelty 7.0

    RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.

  9. Multimodal Data Curation Through Ranked Retrieval

    cs.IR 2026-05 unverdicted novelty 7.0

    Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.

  10. SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass

    cs.IT 2026-05 unverdicted novelty 7.0

    SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.

  11. AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR

    cs.CL 2026-04 unverdicted novelty 7.0

    A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.

  12. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...

  13. MUSCAT: MUltilingual, SCientific ConversATion Benchmark

    cs.CL 2026-04 unverdicted novelty 7.0

    MUSCAT is a benchmark of bilingual scientific conversations designed to evaluate ASR systems on code-switching and domain-specific challenges.

  14. Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

    cs.CR 2026-04 unverdicted novelty 7.0

    AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

  15. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

  16. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  17. OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

    cs.CL 2026-03 unverdicted novelty 7.0

    OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.

  18. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  19. Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

    cs.CV 2026-05 unverdicted novelty 6.0

    Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.

  20. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

  21. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  22. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  23. All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

    cs.SD 2026-04 unverdicted novelty 6.0

    Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

  24. HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 6.0

    HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

  25. COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

  26. In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

    cs.CL 2026-04 unverdicted novelty 6.0

    Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...

  27. VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

    eess.AS 2026-04 unverdicted novelty 6.0

    VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.

  28. GroupDPO: Memory efficient Group-wise Direct Preference Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

  29. Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

    eess.AS 2026-04 unverdicted novelty 6.0

    Common-word acoustic cues and bias-word position prediction in speech LLMs cut rare-word transcription errors by 16.3% versus baselines, including out-of-domain cases.

  30. CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs reach 52.6% average success on text-based rodent neuroscience tasks, above random agents at 32.1% but below approximate rodent baselines at 78.9%.

  31. Differences in Text Generated by Diffusion and Autoregressive Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.

  32. Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

    cs.AI 2026-03 unverdicted novelty 6.0

    An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.

  33. Multimodal LLMs are not all you need for Pediatric Speech Language Pathology

    cs.CL 2026-04 unverdicted novelty 5.0

    Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.

  34. AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

    cs.CL 2026-04 unverdicted novelty 5.0

    AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.

  35. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  36. Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omnimodal models show reduced demographic bias in image and video tasks compared to substantial biases and lower performance in audio tasks.

  37. Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

    cs.CV 2026-04 unverdicted novelty 5.0

    CG-CLIP adds caption-guided memory refinement and token-based spatiotemporal aggregation to CLIP for video person ReID, outperforming SOTA on MARS, iLIDS-VID, SportsVReID and DanceVReID.

  38. CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs

    cs.CY 2026-04 unverdicted novelty 5.0

    CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...

  39. Low-Rank Adaptation Redux for Large Models

    cs.LG 2026-04 unverdicted novelty 3.0

    An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 39 Pith papers · 31 internal anchors

  1. [1]

    Phi-4 Technical Report

    [AAB+24] Marah Abdin, Jyoti Aneja, Harkirat Behl, S´ ebastien Bubeck, Ronen Eldan, Suriya Gu- nasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905 ,

  2. [2]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    [AJA+24] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,

  3. [3]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    [ALTdJ+23] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr´ on, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi- head checkpoints. arXiv preprint arXiv:2305.13245 ,

  4. [4]

    Program Synthesis with Large Language Models

    [AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 ,

  5. [5]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    [BBY+23] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,

  6. [6]

    Seamlessm4t: Massively multilingual & multimodal ma- chine translation,

    24 [BCM+23] Lo¨ ıc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul- Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t-massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596,

  7. [7]

    Piqa: Reasoning about physical commonsense in natural language

    [BZGC19] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641 ,

  8. [8]

    Training Verifiers to Solve Math Word Problems

    [CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  9. [9]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    [CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

  10. [10]

    Fleurs: Few-shot learning evaluation of universal representations of speech

    [CMK+23] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE,

  11. [11]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    [CWC+24] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271,

  12. [12]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    [CWT+24] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821,

  13. [13]

    Qwen2-Audio Technical Report

    [CXY+24] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759,

  14. [14]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    [DCL+24] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146,

  15. [15]

    The Llama 3 Herd of Models

    [DJP+24] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 ,

  16. [16]

    NVLM: Open frontier-class multimodal LLMs

    [DLW+24] Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv:2409.11402,

  17. [17]

    Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model

    [DZZ+24b] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420,

  18. [18]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    [FDL+24] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075,

  19. [19]

    Blink: Multimodal large language models can see but not perceive

    [FHL+24] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390,

  20. [20]

    Audiochatllama: Towards general-purpose speech abilities for llms

    [FWL+24] Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, and Mike Seltzer. Audiochatllama: Towards general-purpose speech abilities for llms. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...

  21. [21]

    Joint audio and speech understanding

    [GLL+23] Yuan Gong, Alexander H Liu, Hongyin Luo, Leonid Karlinsky, and James Glass. Joint audio and speech understanding. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU),

  22. [22]

    Conformer: Convolution-augmented transformer for speech recognition

    [GQC+20] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer for speech recognition. In 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, Oct...

  23. [23]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    [GYZ+25] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  24. [24]

    Measuring Massive Multitask Language Understanding

    [HBB+20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

  25. [25]

    GPT-4o System Card

    [HLG+24] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,

  26. [26]

    Llm2clip: Powerful language model unlocks richer visual representation

    [HWY+24] Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, et al. Llm2clip: Powerful language model unlocks richer visual representation. arXiv preprint arXiv:2411.04997,

  27. [27]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    [JHG+24] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

  28. [28]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    [LBX+24] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,

  29. [29]

    Let's Verify Step by Step

    [LKB+23] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050,

  30. [30]

    Red teaming visual language models

    [LLY+24] Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. arXiv preprint arXiv:2401.12915,

  31. [31]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    [LXWZ23] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210,

  32. [32]

    LLaVA-OneVision: Easy Visual Task Transfer

    [LZG+24] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326,

  33. [33]

    American invitational mathematics examination–aime

    [MAA24] MAA. American invitational mathematics examination–aime. In American Invitational Mathematics Examination–AIME 2024, February

  34. [34]

    Phi-3 safety post-training: Aligning language models with a “break-fix” cycle

    [Mic24] Microsoft. Phi-3 safety post-training: Aligning language models with a “break-fix” cycle. arXiv preprint arXiv:2407.13833,

  35. [35]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    [MLT+22] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May. Association for Computational Linguistics.

  36. [36]

    s1: Simple test-time scaling

    [MYS+25] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393,

  37. [37]

    Robust speech recognition via large-scale weak supervision

    [RKX+23] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202, pages 28492–28518. PMLR,

  38. [38]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    [SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641,

  39. [39]

    SocialIQA: Commonsense Reasoning about Social Interactions

    [SRC+19] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728,

  40. [40]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    [SRR+22] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

  41. [41]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    [STK+24] S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168,

  42. [42]

    Gemini: A Family of Highly Capable Multimodal Models

    [TAB+23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  43. [43]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    [TGL+24] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

  44. [44]

    Gemma 2: Improving Open Language Models at a Practical Size

    [TRP+24] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

  45. [45]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    [WBT+24] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

  46. [46]

    SQuALITY: Building a long-document summarization dataset the hard way

    [WPC+22] Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, and Samuel R. Bowman. SQuALITY: Building a long-document summarization dataset the hard way. arXiv preprint arXiv:2205.11465,

  47. [47]

    Covost 2 and massively multilingual speech translation

    [WWGP21] Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. Covost 2 and massively multilingual speech translation. In Proceedings of Interspeech 2021, pages 2247–2251,

  48. [48]

    Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

    [XW24] Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190,

  49. [49]

    Limo: Less is more for reasoning

    [YHX+25] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387,

  50. [50]

    AIR-bench: Benchmarking large audio-language models via generative comprehension

    [YXL+24] Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. AIR-bench: Benchmarking large audio-language models via generative comprehension. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1979–...

  51. [51]

    [YYZ+24] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

  52. [52]

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    [YZN+24] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813,

  53. [53]

    Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

    [YZY+18] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887,

  54. [54]

    Safety fine-tuning at (almost) no cost: A baseline for vision large language models

    [ZBY+24] Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207,

  55. [55]

    Internlm-xcomposer2.5-omnilive: A comprehensive multimodal system for long-term streaming video and audio interactions

    [ZDC+24] Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, et al. Internlm-xcomposer2.5-omnilive: A comprehensive multimodal system for long-term streaming video and audio interactions. arXiv preprint arXiv:2412.09596,

  56. [56]

    Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot

    [ZDL+24] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612,

  57. [57]

    Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition

    [ZDW+23] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112,

  58. [58]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    [ZVC+24] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877,