arXiv:2309.10818

Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing · 2024 · arXiv 2309.10818

9 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 1 · use dataset: 1

representative citing papers

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

cs.LG · 2025-07-21 · unverdicted · novelty 7.0

An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.

A3: an Analytical Low-Rank Approximation Framework for Attention

cs.CL · 2025-05-19 · conditional · novelty 6.0

A3 splits Transformer layers into QK, OV, and MLP components and derives analytical low-rank approximations that reduce hidden dimensions while minimizing each component's functional loss, yielding better perplexity than prior low-rank methods on LLaMA models.

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

cs.LG · 2025-02-17 · unverdicted · novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.

Refresh-Scaling the Memory of Balanced Adam

cs.LG · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Setting β in balanced Adam to achieve a refresh count R_β ≈1000 based on effective learning horizon T_ES improves validation robustness over fixed-β baselines across 11 vision and language experiments.

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

cs.CL · 2025-01-02 · unverdicted · novelty 5.0

SEDD delivers a distributed GPU deduplication system that reports up to 158x speedup over CPU baselines and 7.8x over NeMo Curator on 30M documents while preserving MinHash fidelity above 0.95 Jaccard.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Bias in Large Language Models: Origin, Evaluation, and Mitigation

cs.CL · 2024-11-16 · unverdicted · novelty 2.0

A literature review that categorizes bias in LLMs, surveys evaluation and mitigation techniques, and discusses ethical implications.

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

cs.CL · 2026-05-15 · 2 refs

citing papers explorer

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training · cs.LG · 2025-07-21 · unverdicted · none · ref 31
A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases · cs.LG · 2026-05-09 · unverdicted · none · ref 22
A3: an Analytical Low-Rank Approximation Framework for Attention · cs.CL · 2025-05-19 · conditional · none · ref 13
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws · cs.LG · 2025-02-17 · unverdicted · none · ref 42
Refresh-Scaling the Memory of Balanced Adam · cs.LG · 2026-05-11 · unverdicted · none · ref 7 · 2 links
SEDD: Scalable and Efficient Dataset Deduplication with GPUs · cs.CL · 2025-01-02 · unverdicted · none · ref 12
A Survey of Large Language Models · cs.CL · 2023-03-31 · accept · none · ref 253
Bias in Large Language Models: Origin, Evaluation, and Mitigation · cs.CL · 2024-11-16 · unverdicted · none · ref 63
Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation · cs.CL · 2026-05-15 · unreviewed · ref 13 · 2 links
