MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.
hub
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.
FLEXITOKENS replaces rigid subword tokenizers and fixed-compression auxiliary losses with a simplified boundary-prediction objective in byte-level models, yielding lower over-fragmentation and up to 10-point gains on multilingual and domain-adaptation tasks.
An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
In compute-optimal regimes, language model parameter count scales proportionally with data bytes rather than tokens, and the optimal compression rate decreases with increasing compute.
Decisive combines document-grounded option scoring with adaptive Bayesian preference elicitation to achieve up to 20% higher decision accuracy than LLMs and existing frameworks across domains.
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
ProbScale finds layer subsets in SLMs like RoBERTa-Large and T5-Base that cut parameters 5-10x while retaining 95-98% of original task performance by maximizing aggregated probe scores under a budget.
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
LoRA-FA freezes LoRA's A matrix and trains only B with gradient corrections to approximate full fine-tuning gradients more closely.
LLM-extracted patterns merging logical structures and linguistic cues yield statistically significant gains in fallacy classification over zero-shot baselines with cross-dataset generalization.
MMoA adds LSTM recurrence to Mixture-of-Agents routing, reaching 58.0% win rate on AlpacaEval 2.0 versus 59.8% for baseline MoA while cutting runtime by up to 4.6%.
Fine-tuned PEGASUS achieves state-of-the-art ROUGE scores on XL-Sum English corpus with 4.04% ROUGE-1, 15.25% ROUGE-2, and 3.39% ROUGE-L gains over mT5 baseline.
citing papers explorer
-
LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning
LoRA-FA freezes LoRA's A matrix and trains only B with gradient corrections to approximate full fine-tuning gradients more closely.