Recognition: 2 theorem links
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Pith reviewed 2026-05-10 20:15 UTC · model grok-4.3
The pith
A 3.8 billion parameter model matches the performance of much larger models like Mixtral 8x7B while running on a phone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phi-3-mini is a 3.8 billion parameter language model trained on 3.3 trillion tokens whose performance rivals Mixtral 8x7B and GPT-3.5, achieving 69 percent on MMLU and 8.38 on MT-bench while fitting on a phone. Its training set is a scaled-up version of the phi-2 dataset built from heavily filtered public web data and synthetic data, followed by alignment for safety and chat use. The 7B and 14B phi-3 variants reach 75 percent and 78 percent on MMLU respectively, and the phi-3.5-MoE model with 6.6 billion active parameters exceeds similar-scale open models on language, math, and code tasks.
What carries the argument
The phi-3-mini model and its training dataset of heavily filtered web data plus synthetic data, which together enable high benchmark scores at small parameter count.
If this is right
- The 7B and 14B phi-3 models deliver higher scores than phi-3-mini on the same benchmarks, reaching 75 percent and 78 percent on MMLU.
- Phi-3.5-MoE with 6.6 billion active parameters outperforms other open models of similar scale on reasoning, math, and code.
- Phi-3.5-Vision handles both single-image and multi-image prompts in reasoning tasks at competitive levels.
- All models support local deployment, removing the need for constant cloud connectivity for inference.
Where Pith is reading between the lines
- Emphasis on data quality over raw scale may shift research priorities toward curation techniques that work across different model sizes.
- On-device models of this class could support privacy-sensitive applications where user data never leaves the phone.
- The same data-filtering recipe might be tested on other base architectures to check whether the performance-to-size gains transfer.
Load-bearing premise
The filtered web and synthetic data produce real capability gains instead of benchmark-specific optimization or undetected contamination.
What would settle it
Performance of phi-3-mini falling well below the reported levels on a new set of benchmarks created after the training data cutoff and free of any possible overlap with the training corpus.
read the original abstract
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with a 7B, 14B models trained for 4.8T tokens, called phi-3-small, phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Phi-3 family of language models. Phi-3-mini (3.8B parameters, trained on 3.3T tokens of filtered web + synthetic data) is reported to reach 69% MMLU and 8.38 MT-bench, rivaling Mixtral 8x7B and GPT-3.5 while fitting on a phone. Larger Phi-3-small/medium variants (7B/14B) and the Phi-3.5 series (mini, 16x3.8B MoE with 6.6B active params, and Vision) are also presented, with claims of superior or competitive performance on reasoning, math, code, multilingual, and multimodal tasks relative to open models of similar scale.
Significance. If the headline performance numbers reflect genuine generalization rather than contamination or benchmark-specific optimization, the work provides concrete evidence that carefully filtered and synthetic data can close much of the capability gap between small and large models, with direct implications for on-device deployment. The inclusion of parameter-scaling curves (mini to medium) and the MoE/vision extensions supplies useful empirical data points on efficiency trade-offs.
major comments (2)
- [Abstract / §2] Abstract and §2 (Training Data): the central claim that Phi-3-mini rivals much larger models rests on the training corpus being verifiably disjoint from MMLU, MT-bench, and related test sets. The text states only that the data consists of “heavily filtered publicly available web data and synthetic data” with no reported n-gram overlap statistics, embedding-based decontamination procedure, or per-benchmark contamination rates. Without these, the generalization interpretation of the 69% MMLU / 8.38 MT-bench numbers cannot be assessed.
- [§4 / Table 1] §4 (Evaluation) and Table 1: benchmark scores are given as point estimates (e.g., 69% MMLU, 8.38 MT-bench) with no error bars, standard deviations across runs, or explicit statement of the evaluation protocol and data-exclusion criteria. This makes it impossible to judge whether the reported rivalry with Mixtral 8x7B and GPT-3.5 is statistically robust.
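The decontamination statistics the first comment asks for are mechanical to define. A minimal sketch follows, with tiny hypothetical corpora standing in for the 3.3T-token training set and the benchmark test items; the word-level n-gram window is an assumption (13 is a common choice in LLM decontamination), not a detail from the report:

```python
# Hedged sketch of an n-gram overlap decontamination check of the kind the
# referee requests. Corpora here are illustrative stand-ins, not Phi-3 data.

def ngrams(text, n=13):
    """Return the set of word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_docs, test_items, n=13):
    """Fraction of test items sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)
```

At the full 3.3T-token scale one would hash the n-grams (e.g., into a Bloom filter) rather than materialize the set, which is exactly the computational cost the rebuttal below cites as prohibitive.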
minor comments (2)
- [Figure 1] Figure 1 and scaling plots: axis labels and legend entries for the 3.8B/7B/14B curves could be made more explicit to avoid ambiguity when comparing token counts and model sizes.
- [§3.2] §3.2 (Phi-3.5-MoE): the active-parameter count (6.6B) is stated but the routing details and expert utilization statistics during inference are not provided, which would help readers interpret the efficiency claims.
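The expert-utilization statistic the second minor comment requests can be computed directly from router logits. A hedged sketch, assuming standard top-k routing; k=2 and the array shapes are illustrative assumptions, not details from the report:

```python
import numpy as np

# Sketch of an expert-utilization statistic: given router logits for a batch
# of tokens, how evenly does top-k routing spread load across the experts?

def expert_utilization(router_logits, k=2):
    """Fraction of top-k routing slots assigned to each expert.

    router_logits: array of shape (num_tokens, num_experts).
    Returns an array of length num_experts summing to 1; a uniform result
    (1/num_experts everywhere) indicates perfectly balanced routing.
    """
    # indices of the k highest-scoring experts per token
    topk = np.argsort(router_logits, axis=-1)[:, -k:]
    num_experts = router_logits.shape[-1]
    counts = np.bincount(topk.ravel(), minlength=num_experts)
    return counts / counts.sum()
```

Reporting this histogram at inference time would let readers check whether the 6.6B active-parameter figure reflects balanced expert use or a few dominant experts.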
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on data transparency and evaluation robustness. We address each major comment below and will revise the manuscript accordingly to improve clarity without altering the core technical claims.
read point-by-point responses
-
Referee: [Abstract / §2] Abstract and §2 (Training Data): the central claim that Phi-3-mini rivals much larger models rests on the training corpus being verifiably disjoint from MMLU, MT-bench, and related test sets. The text states only that the data consists of “heavily filtered publicly available web data and synthetic data” with no reported n-gram overlap statistics, embedding-based decontamination procedure, or per-benchmark contamination rates. Without these, the generalization interpretation of the 69% MMLU / 8.38 MT-bench numbers cannot be assessed.
Authors: We agree that greater transparency on potential contamination would strengthen the presentation. In the revised version we will expand §2 with additional description of the filtering pipeline, including the specific heuristics and quality classifiers applied to the web data and the generation process for the synthetic portion. However, exhaustive n-gram overlap statistics or per-benchmark contamination rates were not computed at the full 3.3T-token scale, as doing so is computationally prohibitive; we therefore cannot supply those exact figures. The synthetic data is produced by teacher models whose training cutoffs predate the relevant benchmarks, which substantially reduces the risk of direct leakage. We view this as a partial but honest response to the concern. revision: partial
-
Referee: [§4 / Table 1] §4 (Evaluation) and Table 1: benchmark scores are given as point estimates (e.g., 69% MMLU, 8.38 MT-bench) with no error bars, standard deviations across runs, or explicit statement of the evaluation protocol and data-exclusion criteria. This makes it impossible to judge whether the reported rivalry with Mixtral 8x7B and GPT-3.5 is statistically robust.
Authors: We accept that the current presentation lacks sufficient detail on evaluation methodology. In the revision we will add an explicit subsection in §4 describing the evaluation protocol (standard benchmark implementations, prompt formats, and any data-exclusion rules applied). Because each model variant was trained only once, standard deviations across independent training runs are unavailable; this is standard practice for models of this scale given the prohibitive compute cost. Where internal multi-prompt or few-shot variance estimates exist, we will report them as approximate indicators of stability. These additions should allow readers to assess the robustness of the reported comparisons. revision: yes
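The "approximate indicators of stability" the authors offer in place of cross-run variance could take the form of a bootstrap standard error over per-question outcomes. A minimal sketch with hypothetical correctness data (the resample count and seed are arbitrary choices for illustration):

```python
import random

# Hedged sketch of a bootstrap stability estimate: resample per-question
# correctness indicators to approximate the standard error of mean accuracy,
# a cheap stand-in for variance across independent training runs.

def bootstrap_stderr(correct, n_boot=1000, seed=0):
    """Bootstrap standard error of mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_boot):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / n_boot
    var = sum((m - mu) ** 2 for m in means) / (n_boot - 1)
    return var ** 0.5
```

For a benchmark of N questions with accuracy p, this should land near the analytic value sqrt(p(1-p)/N), giving readers a rough error bar without retraining.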
Circularity Check
No circularity: empirical benchmark results with no derivations or self-referential fits
full rationale
The paper is a technical report on training and evaluating Phi-3 language models. It reports empirical performance numbers (e.g., 69% MMLU, 8.38 MT-bench for phi-3-mini) obtained from external benchmarks after training on filtered web + synthetic data. No mathematical derivations, equations, predictions from fitted parameters, or first-principles results are present. There are no self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via citation. All central claims rest on independent external evaluations rather than quantities defined internally by the paper itself. The data composition is described at a high level without any internal fitting loops that could create circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquation.washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data.
-
Foundation.PhiForcing.phi_equation · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench, rivaling Mixtral 8x7B and GPT-3.5
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...
-
Architecture Determines Observability of Transformers
Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
RULER: What's the Real Context Size of Your Long-Context Language Models?
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...
-
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-depend...
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models
FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...
-
Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
-
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
-
Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation
Clinical narrative format beats raw JSON for LLMs up to 8B parameters on medication reconciliation but raw JSON wins at 70B scale, with omissions as the main error type.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion
Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.
-
Towards Unconstrained Human-Object Interaction
Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
-
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
-
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
-
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
-
Language-Conditioned Visual Grounding with CLIP Multilingual
Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
-
Edge-Efficient Image Restoration: Transformer Distillation into State-Space Models
Hybrid transformer-SSM networks found by multi-objective search run 1.17x to 3.4x faster on edge CPUs for image restoration tasks with competitive quality.
-
DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion
DocSync fuses AST-aware retrieval with an iterative critic loop to update documentation, outperforming CodeT5-base on semantic alignment and automated judge scores in a proxy code-to-text task.
-
Test-Time Safety Alignment
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
-
Architecture Determines Observability of Transformers
Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
-
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.
-
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
-
Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
In long-context LLM serving, accuracy becomes speed via retry dynamics, and accuracy-aware routing reduces time-to-correct-answer.
-
SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding
SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models
MAGE builds a memory graph from a user anchor to generate its own supervision signals for corpus-free unlearning, matching the effectiveness of methods that use external reference data on TOFU and RWKU benchmarks.
-
MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging
MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.
-
Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads
Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.
-
MLLM-as-a-Judge Exhibits Model Preference Bias
MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.
-
EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...
-
Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...
-
Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
A feedforward graph of heterogeneous frozen LLMs linked by linear projections in a shared latent space outperforms single models on ARC-Challenge, OpenBookQA, and MMLU using just 17.6M trainable parameters.
-
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs
Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.
-
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
-
SAM 3D: 3Dfy Anything in Images
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
-
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
[AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
-
[2]
Piqa: Reasoning about physical commonsense in natural language
[BZGC19] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641.
-
[3]
Training Verifiers to Solve Math Word Problems
[CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
-
[4]
Boolq: Exploring the surprising difficulty of natural yes/no questions
[CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...
-
[5]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
[CWT+24] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821.
-
[6]
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
[DZZ+24b] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512.
-
[7]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
[FDL+24] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075.
-
[8]
BLINK: Multimodal Large Language Models Can See but Not Perceive
[FHL+24] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390.
-
[9]
Textbooks Are All You Need
[GZA+23] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Gustavo de Rosa, Piero Kauffmann, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv pre...
-
[10]
Training Compute-Optimal Large Language Models
[HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Eliza Rutherford, Trevor Cai, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and ...
-
[11]
Scaling Laws for Neural Language Models
[KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
-
[12]
Textbooks Are All You Need II: phi-1.5 technical report
[LBE+23] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.
[LDL+23] Liyuan Liu, Chengyu Dong, Xiaodong Liu, Bin Yu, and Jianfeng Gao. Bridging discrete and backpropagation: Straight-through and beyond. arXiv preprint arXiv:2304.08612, 2023.
[LGC23] Liyuan Liu, Jianfeng Gao, and Weizhu Chen. Sparse backpropagation for moe training. arXiv preprint arXiv:2310.00811, 2023.
[LLLL23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[LLY+24] Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, and Qi Liu. Red teaming visual language models. arXiv preprint arXiv:2401.12915, 2024.
[LZZ+24] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.
[MGF+24] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint, 2024.
[MLT+22] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[MRB+23] Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264, 2023.
[SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
[SRR+22] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[TAB+23] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[ZBY+24] Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207, 2024.
[ZCS+23] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.