super hub Canonical reference

Mixtral of Experts

Albert Q · 2024 · cs.LG · arXiv 2401.04088

Canonical reference. 80% of citing Pith papers cite this work as background.

258 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 258 citing papers more from Albert Q arXiv PDF

abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 50 baseline 4 method 3 dataset 2 other 2

citation-polarity summary

background 49 baseline 4 unclear 3 use method 3 use dataset 2

claims ledger

abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tok

authors

Albert Q

co-cited works

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

cs.LG · 2026-03-22 · conditional · novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

cs.DC · 2026-06-23 · unverdicted · novelty 7.0 · 2 refs

CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Next-Billion AI Index: The compass for AI utility and adoption in the global majority

cs.CY · 2026-05-29 · unverdicted · novelty 7.0

Introduces nexbax, a diagnostic framework with three themes and 10 dimensions for evaluating AI economic viability, operational practicality, and societal integrity in next-billion-user contexts.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Latent Performance Profiling of Large Language Models

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.

Fine-grained Claim-level RAG Benchmark for Law

cs.CL · 2026-05-20 · unverdicted · novelty 7.0 · 6 refs

ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Evaluation artifacts substantially inflate the measured unsolvability ceiling in multi-LLM routing, leading to distorted router training and overstated headroom.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

cs.DC · 2026-05-05 · unverdicted · novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

citing papers explorer

Showing 30 of 30 citing papers after filters.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery cs.AI · 2024-08-12 · unverdicted · none · ref 45 · internal anchor
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 19 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
Machine Collective Intelligence for Explainable Scientific Discovery cs.AI · 2026-04-30 · unverdicted · none · ref 44 · internal anchor
Machine collective intelligence uses coordinated AI agents to evolve symbolic hypotheses and recover governing equations from observations in deterministic, stochastic, and uncharacterized systems, achieving up to six orders of magnitude better extrapolation than neural networks with 5-40 parameters
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality cs.AI · 2026-04-15 · conditional · none · ref 19 · internal anchor
Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks cs.AI · 2026-04-09 · unverdicted · none · ref 10 · internal anchor
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving cs.AI · 2025-09-22 · unverdicted · none · ref 23 · internal anchor
EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 19 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents cs.AI · 2026-05-21 · unverdicted · none · ref 50 · internal anchor
Introduces ToM-PD task and ToM-BPD dataset plus TTBYS dual-knowledge framework, with Qwen3-8B outperforming GPT-5 on desire, belief, and strategy prediction.
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models cs.AI · 2026-05-20 · unverdicted · none · ref 15 · internal anchor
PALS adds dynamic GPU power capping to LLM serving frameworks like vLLM, jointly tuning it with batch size via offline models and feedback control to improve energy efficiency up to 26.3% and cut QoS violations 4-7x on dense and MoE models.
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code cs.AI · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.
TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction cs.AI · 2026-05-18 · unverdicted · none · ref 24 · internal anchor
TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE cs.AI · 2026-05-14 · conditional · none · ref 10 · internal anchor
BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction cs.AI · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation cs.AI · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection cs.AI · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
Fisher information geometry supplies parameterization-invariant metrics for MoE specialization dynamics and early failure prediction with strong empirical correlations.
Geometric Routing Enables Causal Expert Control in Mixture of Experts cs.AI · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.
REAM: Merging Improves Pruning of Experts in LLMs cs.AI · 2026-04-06 · unverdicted · none · ref 3 · internal anchor
REAM merges experts in MoE LLMs rather than pruning them, often matching uncompressed performance by tuning the mix of calibration data.
Token-Level LLM Collaboration via FusionRoute cs.AI · 2026-01-08 · unverdicted · none · ref 13 · internal anchor
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.
The Impact of Off-Policy Training Data on Probe Generalisation cs.AI · 2025-11-21 · unverdicted · none · ref 19 · internal anchor
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 296 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 17 · internal anchor
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
Do Linear Probes Generalize Better in Persona Coordinates? cs.AI · 2026-05-10 · unverdicted · none · ref 2 · 2 links · internal anchor
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving cs.AI · 2026-01-29 · unverdicted · none · ref 13 · internal anchor
An MLLM interpreter generates concise CDL descriptions from diagrams, enabling an off-the-shelf LLM to solve plane geometry problems competitively after training on only 5.5k examples.
Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents cs.AI · 2025-08-11 · unverdicted · none · ref 1 · internal anchor
SkillNav decomposes VLN into skill-specific agents trained on synthetic data and routed by a VLM, achieving competitive benchmark results and SOTA generalization on GSA-R2R.
MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making cs.AI · 2025-07-25 · unverdicted · none · ref 47 · internal anchor
MAC framework selects Pareto-optimal LLM agents and masks low cross-consistency outputs for adaptive collaboration in medical decision-making.
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics cs.AI · 2026-05-09 · unverdicted · none · ref 179 · internal anchor
The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using established economic theories.
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care cs.AI · 2026-05-08 · unverdicted · none · ref 34 · internal anchor
Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.
AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting cs.AI · 2025-02-24 · unverdicted · none · ref 25 · internal anchor
An LLM agent with grounding, personalization, and marketing modules generates real estate descriptions that human buyers prefer over expert-written ones while matching factual accuracy.
A Survey on Foundation Models for Personalized Federated Intelligence cs.AI · 2025-05-11 · unverdicted · none · ref 92 · internal anchor
The survey introduces personalized federated intelligence (PFI) as a framework integrating federated learning and foundation models to support privacy-aware personalization of AI models.
Entry-level guide to the use of large language models for medical research cs.AI · 2024-10-24 · unverdicted · none · ref 9 · internal anchor
A tutorial guide outlining phases for integrating LLMs into medical research, including task formulation, model choice, prompt engineering, fine-tuning, and deployment with ethical considerations.

Mixtral of Experts

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer