Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.
hub
ArXiv , year=
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
citing papers explorer
-
How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits
Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.
-
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
-
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
-
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.
-
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
-
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
- KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models