Canonical reference

InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, Yuxiong He · 2020 · arXiv 4486.340670

Canonical reference. 71% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 71% of classified citations

read on arXiv browse 23 citing papers

citation-role summary

background 6 method 1

citation-polarity summary

background 5 unclear 1 use method 1

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

RATrain: A Resource-Aware Training Runtime for Large Language Models on Bandwidth-Constrained Heterogeneous Supercomputing Platforms

cs.DC · 2026-06-09 · unverdicted · novelty 7.0

RATrain introduces a resource-aware scheduler and MT-3000-specific backend for 1F1B LLM training that achieves 1.35x speedup and 97% scaling efficiency while preserving training correctness.

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

SKIM is an adaptive multi-resolution soft-token framework that compresses procedural skills while aiming to preserve logical dependencies and task performance better than prior compression methods.

KoRe: Compact Knowledge Representations for Large Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

KoRe encodes 1-hop knowledge graph subgraphs as compact discrete tokens for injection into LLMs, achieving competitive benchmark performance with up to 10x token reduction.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

cs.DC · 2026-05-15 · conditional · novelty 6.0

PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

cs.DC · 2026-05-09 · unverdicted · novelty 6.0

MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

UCCL-Zip: Lossless Compression Supercharged GPU Communication

cs.DC · 2026-04-19 · unverdicted · novelty 6.0

UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.

Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

cs.NI · 2026-04-18 · unverdicted · novelty 6.0

Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.

Llemma: An Open Language Model For Mathematics

cs.CL · 2023-10-16 · unverdicted · novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

cs.CL · 2023-08-03 · unverdicted · novelty 6.0

Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

cs.CL · 2022-11-09 · unverdicted · novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

cs.CL · 2022-04-14 · accept · novelty 6.0

GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models

cs.LG · 2026-06-29 · unverdicted · novelty 5.0 · 2 refs

HSAP introduces a hierarchical framework and sequence-aware algorithm with JIT-optimized NCCL communication to enable correct causal attention computation on hybrid-context packed sequences without limiting parallelism.

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

cs.DC · 2026-06-10 · unverdicted · novelty 5.0

Eidola is a gem5 extension that emulates cycle-level peer-to-peer GPU writes via real-application timing profiles to simulate traffic and synchronization in multi-GPU AI systems.

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.

torchtune: PyTorch native post-training library

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

cs.RO · 2026-04-21 · unverdicted · novelty 5.0

VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

Bias-constrained multimodal intelligence for equitable and reliable clinical AI

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

BiasCareVL is a bias-aware vision-language framework trained on 3.44 million medical samples that outperforms prior methods on clinical tasks like diagnosis and segmentation while aiming for equitable performance under data imbalances.

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

cs.DC · 2025-10-17 · unverdicted · novelty 5.0

PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

cs.DC · 2026-05-03 · unverdicted · novelty 4.0

On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization

cs.DC · 2026-03-26 · unverdicted · novelty 4.0

DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.

citing papers explorer

Showing 23 of 23 citing papers.

ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 122
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 140
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
RATrain: A Resource-Aware Training Runtime for Large Language Models on Bandwidth-Constrained Heterogeneous Supercomputing Platforms cs.DC · 2026-06-09 · unverdicted · none · ref 22
RATrain introduces a resource-aware scheduler and MT-3000-specific backend for 1F1B LLM training that achieves 1.35x speedup and 97% scaling efficiency while preserving training correctness.
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment cs.CL · 2026-05-13 · unverdicted · none · ref 60
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models cs.CL · 2026-06-10 · unverdicted · none · ref 65
SKIM is an adaptive multi-resolution soft-token framework that compresses procedural skills while aiming to preserve logical dependencies and task performance better than prior compression methods.
KoRe: Compact Knowledge Representations for Large Language Models cs.CL · 2026-05-19 · unverdicted · none · ref 25
KoRe encodes 1-hop knowledge graph subgraphs as compact discrete tokens for injection into LLMs, achieving competitive benchmark performance with up to 10x token reduction.
A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM cs.DC · 2026-05-15 · conditional · none · ref 26
PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production cs.DC · 2026-05-09 · unverdicted · none · ref 45
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
UCCL-Zip: Lossless Compression Supercharged GPU Communication cs.DC · 2026-04-19 · unverdicted · none · ref 46
UCCL-Zip adds lossless compression to GPU communication to reduce LLM bottlenecks while preserving exact numerical correctness.
Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations cs.NI · 2026-04-18 · unverdicted · none · ref 69
Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.
Llemma: An Open Language Model For Mathematics cs.CL · 2023-10-16 · unverdicted · none · ref 79
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models cs.CL · 2023-08-03 · unverdicted · none · ref 89
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 124
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
GPT-NeoX-20B: An Open-Source Autoregressive Language Model cs.CL · 2022-04-14 · accept · none · ref 77
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models cs.LG · 2026-06-29 · unverdicted · none · ref 72 · 2 links
HSAP introduces a hierarchical framework and sequence-aware algorithm with JIT-optimized NCCL communication to enable correct causal attention computation on hybrid-context packed sequences without limiting parallelism.
Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads cs.DC · 2026-06-10 · unverdicted · none · ref 36
Eidola is a gem5 extension that emulates cycle-level peer-to-peer GPU writes via real-application timing profiles to simulate traffic and synchronization in multi-GPU AI systems.
CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations cs.CL · 2026-05-25 · unverdicted · none · ref 41
CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.
torchtune: PyTorch native post-training library cs.LG · 2026-05-20 · unverdicted · none · ref 70
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models cs.RO · 2026-04-21 · unverdicted · none · ref 57
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
Bias-constrained multimodal intelligence for equitable and reliable clinical AI cs.CV · 2026-04-18 · unverdicted · none · ref 51
BiasCareVL is a bias-aware vision-language framework trained on 3.44 million medical samples that outperforms prior methods on clinical tasks like diagnosis and segmentation while aiming for equitable performance under data imbalances.
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training cs.DC · 2025-10-17 · unverdicted · none · ref 32
PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.
Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips cs.DC · 2026-05-03 · unverdicted · none · ref 18
On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.
DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization cs.DC · 2026-03-26 · unverdicted · none · ref 57
DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.

InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer