Mixed citations

TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos · 2025 · arXiv 2410.06511

Mixed citation behavior. Most common role is background (60%).

11 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 11 citing papers

citation-role summary

background 3 method 2

citation-polarity summary

background 3 use method 2

representative citing papers

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

cs.LG · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CODA re-expresses most non-attention Transformer computations as GEMM-plus-epilogue programs using a constrained set of composable primitives to keep intermediate results on-chip and cut global memory traffic.

CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure

cs.DC · 2026-05-07 · unverdicted · novelty 7.0

CCL-Bench packages traces and metadata to compute detailed compute, memory, and communication efficiency metrics, surfacing performance insights unavailable from end-to-end benchmarks.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

cs.CL · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.

BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

cs.LG · 2025-12-13 · unverdicted · novelty 6.0

BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

cs.GR · 2025-06-17 · unverdicted · novelty 6.0

FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.

MAGI-1: Autoregressive Video Generation at Scale

cs.CV · 2025-05-19 · unverdicted · novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models

cs.DC · 2026-07-02 · unverdicted · novelty 5.0

MoP combines specialized parallelisms and a new optimizer step for memory-efficient MoE training, delivering 4.7x-8.2x higher per-GPU throughput than FSDP2 while supporting 1M-token contexts on 12 8x H200 nodes.

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

cs.DC · 2026-05-20 · unverdicted · novelty 5.0

DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

cs.DC · 2026-04-21 · unverdicted · novelty 5.0

UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML

cs.DC · 2026-04-19 · unverdicted · novelty 5.0

Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.

citing papers explorer

Showing 11 of 11 citing papers.

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs cs.LG · 2026-05-19 · unverdicted · none · ref 11 · 2 links
CODA re-expresses most non-attention Transformer computations as GEMM-plus-epilogue programs using a constrained set of composable primitives to keep intermediate results on-chip and cut global memory traffic.
CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure cs.DC · 2026-05-07 · unverdicted · none · ref 35
CCL-Bench packages traces and metadata to compute detailed compute, memory, and communication efficiency metrics, surfacing performance insights unavailable from end-to-end benchmarks.
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale cs.LG · 2026-05-11 · unverdicted · none · ref 47 · 2 links
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation cs.CL · 2026-04-29 · unverdicted · none · ref 21 · 2 links
Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models cs.LG · 2025-12-13 · unverdicted · none · ref 14
BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space cs.GR · 2025-06-17 · unverdicted · none · ref 29
FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.
MAGI-1: Autoregressive Video Generation at Scale cs.CV · 2025-05-19 · unverdicted · none · ref 25
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models cs.DC · 2026-07-02 · unverdicted · none · ref 7
MoP combines specialized parallelisms and a new optimizer step for memory-efficient MoE training, delivering 4.7x-8.2x higher per-GPU throughput than FSDP2 while supporting 1M-token contexts on 12 8x H200 nodes.
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling cs.DC · 2026-05-20 · unverdicted · none · ref 7
DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training cs.DC · 2026-04-21 · unverdicted · none · ref 21
UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
Flint: Compiler Enabled Cluster-Free Design Space Exploration for Distributed ML cs.DC · 2026-04-19 · unverdicted · none · ref 30
Flint generates compiler-derived workload graphs that support cluster-free design space exploration for distributed machine learning systems.

TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer