John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira
4 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1)
Citation polarities: unclear (1)
Citing papers

- LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
  LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and a Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
- Faster Superword Tokenization
  Frequency aggregation of supermerge candidates and a two-phase formulation make BoundlessBPE and SuperBPE training over 600x faster on 1 GB of data while preserving identical results, with open-source Python and Rust code.
- DataComp-LM: In search of the next generation of training sets for language models
  The DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
- Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer
  Applying muP allows Probabilistic Transformers to scale to 0.4B parameters with transferred hyperparameters and to outperform standard transformers on MLM tasks under equal parameter budgets.