Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Pith reviewed 2026-05-11 22:17 UTC · model grok-4.3
The pith
A 3.8 billion parameter model matches models twice its size on complex math and coding reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data that significantly outperforms recent open-source models of similar size and matches the performance of models twice its size on math and coding tasks requiring complex reasoning. Phi-4-Multimodal integrates text, vision, and speech/audio inputs into a single model by leveraging LoRA adapters and modality-specific routers, supporting multiple inference modes without interference and outperforming larger vision-language and speech-language models on a wide range of tasks while ranking first on the OpenASR leaderboard.
What carries the argument
Mixture-of-LoRAs with modality-specific routers that attach separate low-rank adapters for vision and speech to a shared language-model backbone, allowing independent activation of modalities during inference.
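A minimal sketch of the idea, under stated assumptions: the class name, the dictionary of adapters, and the hard "pick one adapter by modality tag" rule are all hypothetical and are not taken from the report. It only illustrates why a frozen backbone with separately gated low-rank adapters leaves the text-only path unchanged.

```python
# Illustrative sketch only: names, shapes, and the hard routing rule are assumptions.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


class MixtureOfLoRALinear:
    """A frozen linear layer plus one low-rank adapter per modality."""

    def __init__(self, base_weight, adapters, alpha=1.0):
        self.base_weight = base_weight  # frozen W0, shape (d_out, d_in)
        self.adapters = adapters        # e.g. {"vision": (A, B), "speech": (A, B)}
        self.alpha = alpha              # LoRA scaling numerator

    def forward(self, x, modality="text"):
        out = matvec(self.base_weight, x)   # base path, always active
        if modality in self.adapters:       # hard router: at most one adapter fires
            A, B = self.adapters[modality]  # A: (r, d_in), B: (d_out, r)
            rank = len(A)
            delta = matvec(B, matvec(A, x))
            out = add(out, [self.alpha / rank * d for d in delta])
        return out


# Tiny rank-1 example with made-up numbers.
W0 = [[1.0, 0.0], [0.0, 1.0]]
vision = ([[1.0, 1.0]], [[0.5], [0.0]])
speech = ([[1.0, -1.0]], [[0.0], [2.0]])
layer = MixtureOfLoRALinear(W0, {"vision": vision, "speech": speech})
x = [2.0, 3.0]

print(layer.forward(x))            # text-only: identical to the base layer
print(layer.forward(x, "vision"))  # base output plus the vision adapter's update
print(layer.forward(x, "speech"))  # base output plus the speech adapter's update
```

In this toy setup the text-only call never touches an adapter, which is the property the "without interference" claim depends on; whether the real routers are hard switches like this one is an assumption.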
If this is right
- High-quality synthetic data focused on reasoning can close much of the performance gap between small and large language models.
- Modality extensions can be added to an existing language model with only a few hundred million additional parameters while preserving base-model behavior.
- Group query attention makes long-sequence generation more efficient, and a 200K-token vocabulary improves multilingual support, both within the same 3.8-billion-parameter budget.
- An additional phase of reasoning-focused training on a compact model can bring its capabilities in line with larger distilled reasoning models.
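To make the long-sequence efficiency point concrete, here is a back-of-the-envelope sketch of key/value cache size with and without group query attention. The layer count, head counts, head dimension, and sequence length below are illustrative assumptions, not figures stated on this page.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 accounts for storing both a key and a value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 24, 8, 128, 32_768

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)     # query heads share KV heads

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha / gqa:.0f}x")
# prints "MHA: 12.0 GiB, GQA: 4.0 GiB, ratio: 3x"
```

The saving scales with the ratio of query heads to key/value heads, so it grows in absolute terms as the sequence gets longer.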
Where Pith is reading between the lines
- The approach may lower the compute barrier for deploying capable multimodal systems in resource-constrained settings.
- Similar router-based adapter designs could be tested for adding other input types such as video or sensor data.
- If the synthetic-data advantage holds on out-of-distribution problems, it would indicate that data curation can serve as an alternative to continued parameter scaling for certain capabilities.
Load-bearing premise
The curated synthetic data produces genuine generalization on reasoning tasks rather than fitting to the specific benchmarks used for evaluation.
What would settle it
A controlled evaluation on a fresh set of math and coding problems that are structurally different from the synthetic training data, using identical prompting and decoding settings, in which Phi-4-Mini shows no advantage over other 3-4B open models.
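One way such a comparison could be scored is a paired bootstrap over per-problem results, sketched below. This is an assumed analysis procedure, not one described in the paper; the function name and the 95% interval are hypothetical choices, and the inputs are lists of per-problem pass/fail flags for two models evaluated on the same problems with the same settings.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, num_resamples=2000, seed=0):
    """Return (observed accuracy difference A - B, approximate 95% bootstrap interval)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two equal-length, non-empty result lists")
    n = len(correct_a)
    # Per-problem differences keep the pairing: both models saw the same problem.
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled = []
    for _ in range(num_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    low = resampled[int(0.025 * num_resamples)]
    high = resampled[int(0.975 * num_resamples) - 1]
    return observed, (low, high)
```

If the interval for "Phi-4-Mini minus baseline" comfortably contains zero on the fresh problem set, that would count as "no advantage" in the sense described above.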
Original abstract
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Phi-4-Mini, a 3.8B-parameter language model trained on high-quality web and synthetic data. The authors report that it significantly outperforms recent open-source models of similar size and matches the performance of models twice its size on math and coding tasks requiring complex reasoning. They attribute this to a curated synthetic data recipe; relative to Phi-3.5-Mini, the model also adopts an expanded 200K-token vocabulary for multilingual support and group query attention for more efficient long-sequence generation. The paper also presents Phi-4-Multimodal, which extends the model to vision and speech via Mixture-of-LoRAs with modality-specific routers, reporting first place on the OpenASR leaderboard with a 460M-parameter speech LoRA and outperforming larger models on multimodal tasks. An experimental further-trained variant is claimed to reach reasoning performance on par with or exceeding 7B-8B models such as DeepSeek-R1-Distill variants.
Significance. If the empirical claims are substantiated with full evaluation details, the work would provide concrete evidence that targeted synthetic data curation combined with parameter-efficient adaptation (Mixture-of-LoRAs) can close the gap between compact and much larger models on reasoning and multimodal benchmarks. The explicit reporting of LoRA parameter counts and the modality-router design offer practical engineering contributions for deploying capable multimodal systems under resource constraints.
major comments (3)
- [Abstract] Abstract: The central claims of outperforming similar-sized models and matching 2x larger models on math/coding rest on unspecified benchmarks, shot counts, decoding parameters, and statistical significance. Without these, it is impossible to verify that the reported gains reflect the synthetic data recipe or architecture rather than evaluation differences.
- [Training] Training and data sections: The paper states that performance 'is driven by a carefully curated synthetic data recipe' but supplies no information on data sources, decontamination steps, exclusion of test-set-like problems, or contamination controls. This directly undermines the claim that gains represent genuine generalization on complex reasoning tasks.
- [Multimodal Architecture] Multimodal extension: While the 460M-parameter speech LoRA size is stated, the description of modality-specific routers preventing interference lacks ablation studies or quantitative metrics isolating the router contribution versus simple LoRA addition, which is load-bearing for the novelty claim of the Mixture-of-LoRAs approach.
minor comments (2)
- The paper would benefit from a dedicated reproducibility appendix listing exact prompt templates, evaluation harness versions, and hardware details for all reported benchmarks.
- Notation for the modality routers could be clarified with a small equation or pseudocode block to distinguish router gating from standard LoRA scaling.
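As an illustration of what the last minor comment asks for, one possible notation is given below. It is an assumption about how the layer could be written, not the paper's own formula: $W_0$ is the frozen base weight, $A_m$ and $B_m$ are the low-rank factors for modality $m$, $\alpha_m / r_m$ is the usual fixed LoRA scaling, and $g_m$ is a router gate that depends only on which modalities are present in the input.

```latex
h = W_0 x + \sum_{m \in \{\text{vision},\, \text{speech}\}}
      g_m(\text{input}) \, \frac{\alpha_m}{r_m} \, B_m A_m x,
\qquad g_m(\text{input}) \in \{0, 1\}
```

Written this way, the scaling $\alpha_m / r_m$ is a constant hyperparameter, while the gate $g_m$ is the input-dependent part; for text-only input every gate is zero and $h = W_0 x$.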
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and commit to revising the paper to improve clarity, transparency, and rigor where feasible.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claims of outperforming similar-sized models and matching 2x larger models on math/coding rest on unspecified benchmarks, shot counts, decoding parameters, and statistical significance. Without these, it is impossible to verify that the reported gains reflect the synthetic data recipe or architecture rather than evaluation differences.
  Authors: We agree that greater specificity in the abstract would strengthen verifiability. In the revised manuscript, we will expand the abstract to name the primary benchmarks (MATH, GSM8K, HumanEval, MBPP), note the few-shot settings and decoding parameters used, and indicate that results include standard deviations from multiple runs where applicable.
  revision: yes
- Referee: [Training] Training and data sections: The paper states that performance 'is driven by a carefully curated synthetic data recipe' but supplies no information on data sources, decontamination steps, exclusion of test-set-like problems, or contamination controls. This directly undermines the claim that gains represent genuine generalization on complex reasoning tasks.
  Authors: We acknowledge the need for greater transparency on data practices to support generalization claims. The manuscript describes the high-level synthetic data recipe focused on math and coding, but we agree more detail is warranted. In revision we will add a dedicated subsection outlining general decontamination procedures, similarity-based exclusion of test-set overlaps, and high-level source categories. Full proprietary data sources cannot be disclosed for licensing and competitive reasons.
  revision: partial
- Referee: [Multimodal Architecture] Multimodal extension: While the 460M-parameter speech LoRA size is stated, the description of modality-specific routers preventing interference lacks ablation studies or quantitative metrics isolating the router contribution versus simple LoRA addition, which is load-bearing for the novelty claim of the Mixture-of-LoRAs approach.
  Authors: The modality-specific routers are central to the Mixture-of-LoRAs design for interference-free multi-modal inference. We agree that explicit ablations would better substantiate the novelty. In the revised manuscript we will include new ablation experiments comparing the full router-equipped setup against plain LoRA additions, reporting quantitative metrics on both task performance and cross-modal interference.
  revision: yes
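The "similarity-based exclusion of test-set overlaps" promised in the second response is not specified further. A minimal sketch of one common approach, word n-gram overlap filtering, is shown below; the function names, the n-gram length, and the threshold are hypothetical and should not be read as the authors' actual procedure.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(train_samples, test_problems, n=8, max_overlap=0.0):
    """Drop training samples that share too many n-grams with the test set."""
    test_grams = set()
    for problem in test_problems:
        test_grams |= word_ngrams(problem, n)
    kept = []
    for sample in train_samples:
        grams = word_ngrams(sample, n)
        # Fraction of this sample's n-grams that also occur in the test set.
        # Samples shorter than n words have no n-grams and are kept.
        overlap = len(grams & test_grams) / len(grams) if grams else 0.0
        if overlap <= max_overlap:
            kept.append(sample)
    return kept


# Made-up example data; n=3 keeps the toy strings short.
test_set = ["what is the sum of the first ten positive integers"]
train_set = [
    "compute the sum of the first ten primes",
    "a train travels sixty miles in two hours",
]
print(decontaminate(train_set, test_set, n=3))
# prints ['a train travels sixty miles in two hours']
```

Exact n-gram matching only catches near-verbatim leakage; paraphrased or structurally similar problems, which are the referee's main concern, would need a looser similarity measure.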
Circularity Check
No derivation chain or self-referential reductions present
Full rationale
The paper is an empirical technical report describing model architecture, training data curation, and benchmark results for Phi-4-Mini and Phi-4-Multimodal. It contains no equations, first-principles derivations, or predictive claims that could reduce to inputs by construction. Performance statements are direct comparisons to external models and benchmarks; the synthetic data recipe is described at a high level without any fitted-parameter-to-prediction loop. Self-references to prior Phi models are limited to factual comparisons and do not carry load-bearing uniqueness theorems or ansatzes. The central claims remain independent of any internal circular structure.
Axiom & Free-Parameter Ledger
free parameters (2)
- Vocabulary size
- Speech LoRA parameter count
invented entities (1)
- Mixture-of-LoRAs with modality-specific routers
  no independent evidence
Forward citations
Cited by 39 Pith papers
- SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
  SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
- DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
  DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
- MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
  MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.
- Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
  Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
- Trust Me, Import This: Dependency Steering Attacks via Malicious Agent Skills
  Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.
- How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
  DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
- RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
  RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
- Multimodal Data Curation Through Ranked Retrieval
  Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.
- SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
  SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
- AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR
  A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.
- Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
  Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...
- MUSCAT: MUltilingual, SCientific ConversATion Benchmark
  MUSCAT is a benchmark of bilingual scientific conversations designed to evaluate ASR systems on code-switching and domain-specific challenges.
- Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
  AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
- Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
  Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
- GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
  GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
- OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
  OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
  Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
  Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
  SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- HumanNet: Scaling Human-centric Video Learning to One Million Hours
  HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
- All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
  Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
- HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
  HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
- COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
  COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
- In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
  Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...
- VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
  VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
- GroupDPO: Memory efficient Group-wise Direct Preference Optimization
  GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
- Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction
  Common-word acoustic cues and bias-word position prediction in speech LLMs cut rare-word transcription errors by 16.3% versus baselines, including out-of-domain cases.
- CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
  LLMs reach 52.6% average success on text-based rodent neuroscience tasks, above random agents at 32.1% but below approximate rodent baselines at 78.9%.
- Differences in Text Generated by Diffusion and Autoregressive Language Models
  DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
- Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
  An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
- Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
  Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.
- AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
  AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
- UniMesh: Unifying 3D Mesh Understanding and Generation
  UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
- Demographic and Linguistic Bias Evaluation in Omnimodal Language Models
  Omnimodal models show reduced demographic bias in image and video tasks compared to substantial biases and lower performance in audio tasks.
- Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification
  CG-CLIP adds caption-guided memory refinement and token-based spatiotemporal aggregation to CLIP for video person ReID, outperforming SOTA on MARS, iLIDS-VID, SportsVReID and DanceVReID.
- CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs
  CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...
- Low-Rank Adaptation Redux for Large Models
  An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.
Reference graph
Works this paper leans on
-
[1]
[AAB+24] Marah Abdin, Jyoti Aneja, Harkirat Behl, S´ ebastien Bubeck, Ronen Eldan, Suriya Gu- nasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
[AJA+24] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
[ALTdJ+23] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr´ on, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi- head checkpoints. arXiv preprint arXiv:2305.13245 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Program Synthesis with Large Language Models
[AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
[BBY+23] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Seamlessm4t: Massively multilingual & multimodal ma- chine translation,
24 [BCM+23] Lo¨ ıc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul- Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. Seamlessm4t-massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596,
-
[7]
Piqa: Reasoning about physical commonsense in natural language
[BZGC19] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641 ,
-
[8]
Training Verifiers to Solve Math Word Problems
[CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Boolq: Exploring the surprising difficulty of natural yes/no questions
[CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...
work page 2019
-
[10]
Fleurs: Few-shot learning evaluation of universal representations of speech
[CMK+23] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE,
work page 2022
-
[11]
[CWC+24] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
[CWT+24] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821,
work page internal anchor Pith review arXiv
-
[13]
[CXY+24] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
[DCL+24] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146,
work page internal anchor Pith review arXiv
-
[15]
[DJP+24] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
NVLM: Open frontier-class multimodal LLMs
[DLW+24] Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier- class multimodal llms. arXiv preprint arXiv:2409.11402 ,
-
[17]
26 [DZZ+24b] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420,
-
[18]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
[FDL+24] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever com- prehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Blink: Multimodal large language models can see but not perceive
[FHL+24] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390 ,
-
[20]
Audiochatllama: Towards general-purpose speech abilities for llms
[FWL+24] Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, and Mike Seltzer. Audiochatllama: Towards general-purpose speech abilities for llms. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language...
work page 2024
-
[21]
Joint audio and speech understanding
[GLL+23] Yuan Gong, Alexander H Liu, Hongyin Luo, Leonid Karlinsky, and James Glass. Joint audio and speech understanding. In 2023 IEEE Automatic Speech Recognition and Un- derstanding Workshop (ASRU) ,
work page 2023
-
[22]
Conformer: Convolution-augmented transformer for speech recognition
[GQC+20] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer for speech recognition. In 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, Oct...
work page 2020
-
[23]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[GYZ+25] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qi- hao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Measuring Massive Multitask Language Understanding
[HBB+20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[25]
27 [HLG+24] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
[HWY+24] Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, et al. Llm2clip: Powerful language model unlock richer visual representation. arXiv preprint arXiv:2411.04997 ,
-
[27]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
[JHG+24] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Accessed: 2025-01-22. [LBX+24] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathe- matical reasoning of foundation models in visual contexts,
work page 2025
-
[29]
[LKB+23] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050 ,
[LLY+24] Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. arXiv preprint arXiv:2401.12915, 2024.
[LXWZ23] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.
[LZG+24] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[MAA24] MAA. American invitational mathematics examination–aime. In American Invitational Mathematics Examination–AIME 2024, February 2024.
[MLT+22] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[MYS+25] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
[RKX+23] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202, pages 28492–28518. PMLR, 2023.
[SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
[SRC+19] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
[SRR+22] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[STK+24] S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168, 2024.
[TAB+23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[TGL+24] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
[TRP+24] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
[WBT+24] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[WWGP21] Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. Covost 2 and massively multilingual speech translation. In Proceedings of Interspeech 2021, pages 2247–2251, 2021.
[XW24] Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190, 2024.
[YHX+25] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025.
[YXL+24] Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. AIR-bench: Benchmarking large audio-language models via generative comprehension. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–...
[YYZ+24] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[YZN+24] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813, 2024.
[YZY+18] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018.
[ZBY+24] Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207, 2024.
[ZDC+24] Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, et al. Internlm-xcomposer2.5-omnilive: A comprehensive multimodal system for long-term streaming video and audio interactions. arXiv preprint arXiv:2412.09596, 2024.
[ZDL+24] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612, 2024.
[ZDW+23] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
[ZVC+24] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.