Fewer Truncations Improve Language Modeling

Anoop Deoras; Dan Roth; Giovanni Paolini; Hantian Ding; Stefano Soatto; Varun Kumar; Zijian Wang

arxiv: 2404.10830 · v2 · pith:37GDJQ4Fnew · submitted 2024-04-16 · 💻 cs.CL · cs.AI · cs.LG


Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto


keywords: documents, method, training, truncations, concatenation, context, efficiency, language



In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.
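The abstract describes Best-fit Packing only as "length-aware combinatorial optimization" and does not spell out the algorithm. As a rough illustration of the idea, the sketch below uses the classic best-fit-decreasing heuristic for bin packing: each document goes into the training sequence whose remaining space fits it most tightly. The function name `best_fit_pack`, the choice of heuristic, and the step that splits over-long documents into chunks of at most `max_len` tokens are all assumptions made for this example, not details taken from the abstract.

```python
from bisect import bisect_left, insort


def best_fit_pack(doc_lengths, max_len):
    """Pack documents (given by token length) into sequences of capacity max_len.

    Hypothetical helper: a best-fit-decreasing sketch, not the paper's code.
    Returns a list of bins; each bin is a list of (chunk_length, doc_id).
    """
    # Assumption: a document longer than max_len cannot fit in any sequence,
    # so it is cut into chunks of at most max_len tokens. Shorter documents
    # are never cut.
    chunks = []
    for doc_id, n in enumerate(doc_lengths):
        for start in range(0, n, max_len):
            chunks.append((min(max_len, n - start), doc_id))

    # Place the longest chunks first.
    chunks.sort(reverse=True)

    bins = []  # bins[i] holds the chunks assigned to sequence i
    free = []  # sorted list of (remaining_space, bin_index) for non-full bins
    for length, doc_id in chunks:
        # First entry whose remaining space is >= length, i.e. the tightest fit.
        i = bisect_left(free, (length, -1))
        if i < len(free):
            space, b = free.pop(i)
        else:
            # No open bin can hold this chunk, so start a new sequence.
            b = len(bins)
            bins.append([])
            space = max_len
        bins[b].append((length, doc_id))
        if space - length > 0:
            insort(free, (space - length, b))
    return bins


if __name__ == "__main__":
    for seq in best_fit_pack([5, 3, 8, 2, 6], max_len=8):
        print(seq)
```

The sorted list keeps the example short, but `free.pop(i)` and `insort` are linear in the number of open bins; a corpus-scale implementation would likely need a faster structure for the tightest-fit lookup.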



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
cs.LG 2024-05 unverdicted novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

Parcae: Scaling Laws For Stable Looped Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

FlowTrain: Flow-Based Decoupled Training for Industrial-Grade Vision-Language Models
cs.LG 2026-06 unverdicted novelty 5.0

FlowTrain decouples encoder and backbone training via flow-based scheduling and heterogeneous allocation to reach over 50% MFU and 1.7x throughput versus prior VLM approaches.

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?
cs.CL 2026-04 unverdicted novelty 5.0

Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.

Mellum2 Technical Report
cs.CL 2026-05 unverdicted novelty 3.0

Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.