TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
Title resolution pending
23 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank refinement.
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
With enough compute, large models benefit from training on unfiltered data that includes low-quality and distractor examples instead of requiring high-quality filtered data.
GTLM injects graph-aware attention biases into LLMs using only 0.015% extra parameters, enabling native graph processing that matches 7B models with a 1B model on text-attributed graph benchmarks.
TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
Decision theory shows that LLM cascades are structurally limited by always incurring the cheap model's cost before deciding to escalate, with the best performance given by the envelope of pairwise cascades rather than fixed chains or many stages.
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger scales up to 7B models and 425B tokens.
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.
MASS-RAG uses distinct agents for evidence summarization, extraction, and reasoning, then synthesizes their outputs to improve answer quality over standard RAG baselines on four benchmarks, especially when evidence is distributed.
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.