For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
hub
Neurocomputing , volume=
28 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
Img2CADSeq generates standard CAD sequences from images via a multi-stage pipeline with three-level hierarchical codebook encoding, importance-guided compression, and contrastive point-cloud conditioning of a VQ-Diffusion model, outperforming prior methods on new CAD-220K and PrintCAD datasets.
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
Introduces TEDBench benchmark and MiAE self-supervised framework that outperforms baselines for large-scale protein fold classification.
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
HEXST applies a hexagonal shifted-window Transformer with rotary positional encodings, contrast-sensitive training objectives, and single-cell priors to predict gene expression from histology slides, outperforming prior models on seven datasets while preserving spatial heterogeneity.
DiffMAS jointly optimizes latent communication and reasoning in multi-agent LLM systems via parameter-efficient supervised training on trajectories, yielding consistent gains over baselines on math, science, and code benchmarks.
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
Introduces dual pose-image representation, cross-modal alignment, and iterative construction to improve prompt alignment and diversity in multi-person text-to-image generation.
TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.
Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
citing papers explorer
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.