super hub Canonical reference

Mixtral of Experts

Albert Q · 2024 · cs.LG · arXiv 2401.04088

Canonical reference. 80% of citing Pith papers cite this work as background.

263 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 263 citing papers more from Albert Q arXiv PDF

abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 50 baseline 4 method 3 dataset 2 other 2

citation-polarity summary

background 49 baseline 4 unclear 3 use method 3 use dataset 2

claims ledger

abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tok

authors

Albert Q

co-cited works

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

cs.LG · 2026-03-22 · conditional · novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

cs.DC · 2026-06-23 · unverdicted · novelty 7.0 · 2 refs

CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Next-Billion AI Index: The compass for AI utility and adoption in the global majority

cs.CY · 2026-05-29 · unverdicted · novelty 7.0

Introduces nexbax, a diagnostic framework with three themes and 10 dimensions for evaluating AI economic viability, operational practicality, and societal integrity in next-billion-user contexts.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Latent Performance Profiling of Large Language Models

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.

Fine-grained Claim-level RAG Benchmark for Law

cs.CL · 2026-05-20 · unverdicted · novelty 7.0 · 6 refs

ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Evaluation artifacts substantially inflate the measured unsolvability ceiling in multi-LLM routing, leading to distorted router training and overstated headroom.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

cs.DC · 2026-05-05 · unverdicted · novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

citing papers explorer

Showing 50 of 263 citing papers.

SpikingBrain: Spiking Brain-inspired Large Models cs.LG · 2025-09-05 · unverdicted · none · ref 15 · internal anchor
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
Training-Free Multimodal Large Language Model Orchestration cs.CL · 2025-08-06 · unverdicted · none · ref 19 · 2 links · internal anchor
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 296 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 18 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Large Language Models for Combinatorial Optimization of Design Structure Matrix cs.CE · 2025-06-11 · unverdicted · none · ref 55 · internal anchor
LLM framework combines network topology and domain knowledge for iterative DSM sequencing optimization and outperforms stochastic and deterministic baselines on convergence speed and solution quality.
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving cs.CV · 2025-05-22 · unverdicted · none · ref 44 · internal anchor
DriveMoE applies scene-specialized Vision MoE and skill-specialized Action MoE to a VLA baseline to achieve SOTA closed-loop performance on Bench2Drive.
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training cs.DC · 2025-04-14 · unverdicted · none · ref 36 · internal anchor
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems cs.CV · 2025-03-19 · unverdicted · none · ref 27 · internal anchor
MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts cs.LG · 2025-03-07 · conditional · none · ref 12 · internal anchor
Capacity-aware dropping techniques mitigate load imbalance in MoE inference, delivering up to 1.85x speedup with 0.2% or less performance change on models including Mixtral-8x7B.
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data cs.LG · 2025-02-08 · unverdicted · none · ref 251 · internal anchor
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning cs.LG · 2024-11-26 · unverdicted · none · ref 20 · internal anchor
CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.
Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection cs.LG · 2024-11-13 · unverdicted · none · ref 9 · internal anchor
Lynx exploits training-induced batch-level expert activation skews via AffinityBinning to reduce invoked experts per batch, delivering up to 1.30x throughput with under 1% accuracy loss across four model families.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models cs.CL · 2024-11-07 · conditional · none · ref 16 · internal anchor
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models cs.CL · 2024-11-02 · unverdicted · none · ref 13 · internal anchor
DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 12 · internal anchor
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation cs.CV · 2024-09-06 · unverdicted · none · ref 8 · internal anchor
VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction cs.CV · 2024-07-11 · unverdicted · none · ref 28 · internal anchor
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
RouteLLM: Learning to Route LLMs with Preference Data cs.LG · 2024-06-26 · unverdicted · none · ref 18 · internal anchor
Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
LaMSUM: Amplifying Voices Against Harassment through LLM Guided Extractive Summarization of User Incident Reports cs.CL · 2024-06-22 · unverdicted · none · ref 21 · internal anchor
LaMSUM is a novel multi-level LLM framework with voting methods for extractive summarization of large incident report collections that outperforms prior extractive methods.
Mixture-of-Agents Enhances Large Language Model Capabilities cs.CL · 2024-06-07 · unverdicted · none · ref 12 · 2 links · internal anchor
A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 21 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 14 · internal anchor
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
Argumentative Large Language Models for Explainable and Contestable Claim Verification cs.CL · 2024-05-03 · unverdicted · none · ref 27 · internal anchor
ArgLLMs build argumentation frameworks from LLMs to support explainable and contestable formal reasoning for claim verification.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 19 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models cs.CR · 2024-03-28 · accept · none · ref 22 · internal anchor
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cs.LG · 2024-01-26 · unverdicted · none · ref 21 · internal anchor
EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 17 · internal anchor
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism cs.DC · 2026-05-06 · unverdicted · none · ref 8
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts cs.LG · 2026-05-06 · unverdicted · none · ref 23
AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.
Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing cs.LG · 2026-05-30 · unverdicted · none · ref 41 · internal anchor
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts cs.LG · 2026-05-30 · unverdicted · none · ref 43 · internal anchor
CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting cs.LG · 2026-05-28 · unverdicted · none · ref 18 · internal anchor
GC-MoE improves MAE on four traffic forecasting benchmarks by routing nodes to combinations of frozen spatio-temporal GNN experts via a graph-conditioned lightweight router, training only ~17K parameters atop 1.5M frozen weights.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 41 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload cs.CL · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain losslessly.
CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution cs.LG · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
CoX-MoE achieves up to 7.1x higher throughput than FlexGen for MoE inference via coalesced expert execution and AMX-enabled CPU-GPU orchestration with static expert stratification.
Mixture of Experts for Low-Resource LLMs cs.CL · 2026-05-17 · unverdicted · none · ref 4 · internal anchor
Pre-trained MoE models exhibit deep-layer routing collapse for low-resource languages like Hebrew, largely corrected by continual pre-training on balanced bilingual data, with consistent patterns observed in Japanese.
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing cs.CV · 2026-05-15 · unverdicted · none · ref 40 · internal anchor
Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.
Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching cs.CL · 2026-05-13 · accept · none · ref 4 · internal anchor
At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.
Do Linear Probes Generalize Better in Persona Coordinates? cs.AI · 2026-05-10 · unverdicted · none · ref 2 · 2 links · internal anchor
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning cs.LG · 2026-05-10 · unverdicted · none · ref 27 · internal anchor
FLAME is an MoE architecture using modality-specific routers and low-rank compression of expert knowledge to support efficient continual multimodal multi-task learning while reducing catastrophic forgetting.
SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization cs.CL · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 36 · 2 links · internal anchor
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
Efficient Pre-Training with Token Superposition cs.CL · 2026-05-07 · unverdicted · none · ref 28 · 2 links · internal anchor
Token-Superposition Training combines multiple tokens into bags for multi-hot cross-entropy pre-training followed by a recovery phase, yielding up to 2.5x reduction in training time at 10B scale under equal-loss conditions.
Cubit: Token Mixer with Kernel Ridge Regression cs.LG · 2026-05-07 · unverdicted · none · ref 35 · 2 links · internal anchor
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
TIDE: Every Layer Knows the Token Beneath the Context cs.CL · 2026-05-07 · unverdicted · none · ref 73 · internal anchor
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training cs.DC · 2026-04-30 · unverdicted · none · ref 14 · internal anchor
ZipCCL delivers up to 1.35x faster communication and 1.18x end-to-end speedup in LLM training through lossless compression of near-Gaussian collectives on 64-GPU clusters.
FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving cs.DC · 2026-04-29 · unverdicted · none · ref 11 · internal anchor
FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.
A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks cs.LG · 2026-04-27 · unverdicted · none · ref 3 · 2 links · internal anchor
Comparative study applies UCB-V, UCB-Tuned, UCB-Bayes and UCB-BwK to ADNN early-exit selection on ResNet and MobileViT using CIFAR-10/100, reporting sub-linear regret with UCB-Bayes fastest and UCB-V/UCB-Tuned best on accuracy-energy and accuracy-latency Pareto fronts.
LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs cs.LG · 2026-04-23 · unverdicted · none · ref 4 · internal anchor
LayerBoost selectively replaces or removes attention in non-critical transformer layers to cut inference latency up to 68% while recovering quality via brief distillation.
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts cs.CV · 2026-04-23 · unverdicted · none · ref 17 · internal anchor
Teacher-guided routing supplies pseudo-supervision from a dense model's intermediate features to stabilize expert selection in sparse vision MoE models.

Mixtral of Experts

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer