LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
hub
Liu, and Matt Gardner
22 Pith papers cite this work. Polarity classification is still indexing.
abstract
We present a novel method for obtaining high-quality, domain-targeted multiple choice questions from crowd workers. Generating these questions can be difficult without trading away originality, relevance or diversity in the answer options. Our method addresses these problems by leveraging a large corpus of domain-specific text and a small set of existing questions. It produces model suggestions for document selection and answer distractor choice which aid the human question generation process. With this method we have assembled SciQ, a dataset of 13.7K multiple choice science exam questions (Dataset available at http://allenai.org/data.html). We demonstrate that the method produces in-domain questions by providing an analysis of this new dataset and by showing that humans cannot distinguish the crowdsourced questions from original questions. When using SciQ as additional training data to existing questions, we observe accuracy improvements on real science exams.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
L2R improves MoE performance by routing in a low-rank space with Lipschitz-controlled saturated inner-product scoring and multi-anchor mechanisms.
A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness, and application.
Output prefilling with a structured prefix steers LLMs to produce cleaner first tokens in MCQA, raising accuracy and calibration over standard first-token probability.
FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
DoGraph models LLM data scheduling as a graph optimization problem based on gradient-domain connections and delivers competitive results on GPT-2 models of different sizes.
The paper claims a selective fine-tuning method that identifies and freezes core parameters to mitigate catastrophic forgetting in LLMs while improving domain adaptation, shown in experiments with GPT-J and LLaMA-3.
citing papers explorer
-
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
Deep Delta Learning
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
-
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
-
BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models
BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
-
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
-
L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
L2R improves MoE performance by routing in a low-rank space with Lipschitz-controlled saturated inner-product scoring and multi-anchor mechanisms.
-
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
-
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
-
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis
An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
-
Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
-
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
KnowledgeBerg benchmark shows open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16-44% accuracy on knowledge-grounded compositional reasoning, with persistent failures in completeness, awareness, and application.
-
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Output Prefilling
Output prefilling with a structured prefix steers LLMs to produce cleaner first tokens in MCQA, raising accuracy and calibration over standard first-token probability.
-
FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Rethinking Data Mixing from the Perspective of Large Language Models
DoGraph models LLM data scheduling as a graph optimization problem based on gradient-domain connections and delivers competitive results on GPT-2 models of different sizes.
-
Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization
The paper claims a selective fine-tuning method that identifies and freezes core parameters to mitigate catastrophic forgetting in LLMs while improving domain adaptation, shown in experiments with GPT-J and LLaMA-3.