super hub Canonical reference

Mixtral of Experts

Albert Q · 2024 · cs.LG · arXiv 2401.04088

Canonical reference. 81% of citing Pith papers cite this work as background.

327 Pith papers citing it

Background 81% of classified citations

open full Pith review browse 327 citing papers more from Albert Q arXiv PDF

abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 51 baseline 4 method 3 dataset 2 other 2

citation-polarity summary

background 50 baseline 4 unclear 3 use method 3 use dataset 2

claims ledger

abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tok

authors

Albert Q

co-cited works

representative citing papers

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

cs.LG · 2026-03-22 · conditional · novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

cs.DC · 2026-06-23 · unverdicted · novelty 7.0 · 2 refs

CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.

Toward Calibrated Mixture-of-Experts Under Distribution Shift

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Expert calibration suffices for MoE calibration under distribution shifts in hard-routed models but not soft-routed ones; adversarial reweighting improves the accuracy-calibration tradeoff across models and shifts.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.

MOSAIC: A Workload-Driven Simulation and Design-Space Exploration Framework for Heterogeneous NPUs

cs.AR · 2026-06-03 · unverdicted · novelty 7.0

MOSAIC is a simulation and DSE framework for heterogeneous NPUs that finds designs achieving 46.91% mean iso-area energy savings over homogeneous baselines on 20 workloads.

Knowledge Index of Noah's Ark

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

cs.DC · 2026-05-30 · unverdicted · novelty 7.0

ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Next-Billion AI Index: The compass for AI utility and adoption in the global majority

cs.CY · 2026-05-29 · unverdicted · novelty 7.0

Introduces nexbax, a diagnostic framework with three themes and 10 dimensions for evaluating AI economic viability, operational practicality, and societal integrity in next-billion-user contexts.

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

citing papers explorer

Showing 35 of 35 citing papers after filters.

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks cs.CV · 2026-06-09 · unverdicted · none · ref 53 · internal anchor
The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation cs.CV · 2026-05-06 · unverdicted · none · ref 24 · internal anchor
BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
Multi-Domain Learning with Global Expert Mapping cs.CV · 2026-04-20 · unverdicted · none · ref 47 · internal anchor
GEM replaces learned routers in MoE models with a global scheduler based on linear programming relaxation and hierarchical rounding, achieving SOTA on the UODB multi-domain benchmark with gains on rare domains.
A Sanity Check on Composed Image Retrieval cs.CV · 2026-04-14 · unverdicted · none · ref 14 · internal anchor
The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 40 · internal anchor
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
LFX: Towards Unified Light Field Dense Semantic Segmentation and Salient Object Detection cs.CV · 2025-03-02 · unverdicted · none · ref 21 · internal anchor
LFX is the first unified framework for light field perception that adapts to heterogeneous representations via a representation-invariant feature modulation space and FoP-ASM, achieving SOTA results on three benchmarks including 84.37 mIoU for segmentation and 0.029/0.027 MAE for detection.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? cs.CV · 2024-03-21 · conditional · none · ref 28 · internal anchor
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
PARE: Pruning and Adaptive Routing for Efficient Video Generation cs.CV · 2026-05-26 · unverdicted · none · ref 13 · internal anchor
PARE applies structure-aware head pruning and timestep/content-conditioned block routing to compress video DiTs, reducing per-step compute while preserving quality on Wan2.1-14B.
Reinforcing Few-step Generators via Reward-Tilted Distribution Matching cs.CV · 2026-05-25 · unverdicted · none · ref 28 · internal anchor
RTDMD unifies KL minimization to a reward-tilted teacher into distribution matching plus reward terms, using AC-DMD in stage one and hybrid GRPO-style gradients plus SubGRPO in stage two to reach new SOTA on preference, aesthetic, and compositional metrics with 4-step generation on SD3, SD3.5, and F
Motion-Compensated Weight Compression cs.CV · 2026-05-23 · unverdicted · none · ref 27 · internal anchor
MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.
Translating Signals to Languages for sEMG-Based Activity Recognition cs.CV · 2026-05-21 · unverdicted · none · ref 37 · internal anchor
LLM-sEMG maps sEMG signals to language via a dedicated mechanism to enable LLMs to perform accurate activity recognition.
World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks cs.CV · 2026-05-19 · unverdicted · none · ref 73 · internal anchor
Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
Text-to-CAD Retrieval: a Strong Baseline cs.CV · 2026-05-07 · unverdicted · none · ref 31 · internal anchor
Text-to-CAD retrieval is introduced as a cross-modal task with a baseline that learns joint embeddings from CAD construction sequences, point clouds, and text queries via a masked feature decoder.
Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 29 · internal anchor
ETC scales concept erasure to thousands of concepts in T2I diffusion models via tMM modeling, affine optimal transport, and a robust MoEraser module.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 44 · internal anchor
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving cs.CV · 2025-05-22 · unverdicted · none · ref 44 · internal anchor
DriveMoE applies scene-specialized Vision MoE and skill-specialized Action MoE to a VLA baseline to achieve SOTA closed-loop performance on Bench2Drive.
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems cs.CV · 2025-03-19 · unverdicted · none · ref 27 · internal anchor
MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 12 · internal anchor
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation cs.CV · 2024-09-06 · unverdicted · none · ref 8 · internal anchor
VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction cs.CV · 2024-07-11 · unverdicted · none · ref 28 · internal anchor
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 19 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
Rosetta: Composable Native Multimodal Pretraining cs.CV · 2026-07-01 · unverdicted · none · ref 22 · internal anchor
Rosetta proposes a composable multimodal pretraining method with MAOP to prevent catastrophic forgetting when expanding modalities beyond standard MoE and MoT approaches.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 41 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing cs.CV · 2026-05-15 · unverdicted · none · ref 40 · internal anchor
Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts cs.CV · 2026-04-23 · unverdicted · none · ref 17 · internal anchor
Teacher-guided routing supplies pseudo-supervision from a dense model's intermediate features to stabilize expert selection in sparse vision MoE models.
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding cs.CV · 2025-01-09 · unverdicted · none · ref 25 · internal anchor
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models cs.CV · 2024-03-27 · unverdicted · none · ref 29 · internal anchor
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model cs.CV · 2024-01-29 · unverdicted · none · ref 38 · internal anchor
InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach cs.CV · 2026-05-27 · unverdicted · none · ref 110 · internal anchor
Proposes cross-attention audio-video fusion and VE-MD latent-space models for group emotion recognition that avoid individual cues and report competitive performance via ablation studies on synthetic and real data.
CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification cs.CV · 2026-05-12 · unverdicted · none · ref 21 · 2 links · internal anchor
CADS is a conformal-prediction-driven adaptive system that cascades models for image classification to achieve up to 12x lower computational cost while maintaining accuracy.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs cs.CV · 2024-06-11 · unverdicted · none · ref 21 · internal anchor
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 68 · internal anchor
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale cs.CV · 2026-04-20 · unreviewed · ref 44 · internal anchor
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding cs.CV · 2026-04-09 · unreviewed · ref 22 · internal anchor

Mixtral of Experts

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer