pith. sign in

Jetmoe: Reaching llama2 performance with 0.1 m dollars

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

citation-role summary

background 2 dataset 1

citation-polarity summary

years

2026 4 2024 1

representative citing papers

Hierarchical Mixture-of-Experts with Two-Stage Optimization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

citing papers explorer

Showing 5 of 5 citing papers.

  • Fast MoE Inference via Predictive Prefetching and Expert Replication cs.LG · 2026-05-12 · conditional · none · ref 20

    Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.

  • Hierarchical Mixture-of-Experts with Two-Stage Optimization cs.LG · 2026-05-08 · unverdicted · none · ref 33

    Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.

  • ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 32

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  • Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models cs.CL · 2024-11-07 · conditional · none · ref 35

    MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.

  • Cubit: Token Mixer with Kernel Ridge Regression cs.LG · 2026-05-07 · unverdicted · none · ref 72 · 2 links

    Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.