hub

Neurocomputing , volume=

Roformer: Enhanced transformer with rotary position embedding , author= · 2024

28 Pith papers cite this work. Polarity classification is still indexing.

28 Pith papers citing it

browse 28 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Scaling Limits of Long-Context Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.

Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Img2CADSeq generates standard CAD sequences from images via a multi-stage pipeline with three-level hierarchical codebook encoding, importance-guided compression, and contrastive point-cloud conditioning of a VQ-Diffusion model, outperforming prior methods on new CAD-220K and PrintCAD datasets.

Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Towards Understanding Self-Pretraining for Sequence Classification

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

Protein Fold Classification at Scale: Benchmarking and Pretraining

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Introduces TEDBench benchmark and MiAE self-supervised framework that outperforms baselines for large-scale protein fold classification.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.

Generative 3D Gaussians with Learned Density Control

cs.GR · 2026-05-08 · unverdicted · novelty 6.0

DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

HEXST: Hexagonal Shifted-Window Transformer for Spatial Transcriptomics Gene Expression Prediction

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

HEXST applies a hexagonal shifted-window Transformer with rotary positional encodings, contrast-sensitive training objectives, and single-cell priors to predict gene expression from histology slides, outperforming prior models on seven datasets while preserving spatial heterogeneity.

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

DiffMAS jointly optimizes latent communication and reasoning in multi-agent LLM systems via parameter-efficient supervised training on trajectories, yielding consistent gains over baselines on math, science, and code benchmarks.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

q-bio.NC · 2026-04-20 · unverdicted · novelty 6.0

OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

mHC: Manifold-Constrained Hyper-Connections

cs.CL · 2025-12-31 · unverdicted · novelty 6.0

mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

Muon is Scalable for LLM Training

cs.LG · 2025-02-24 · unverdicted · novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

cs.CV · 2024-08-12 · unverdicted · novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

Introduces dual pose-image representation, cross-modal alignment, and iterative construction to improve prompt alignment and diversity in multi-person text-to-image generation.

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.

When and Why Grouping Attention Heads Accelerates Muon Optimization

cs.LG · 2026-05-09 · unverdicted · novelty 5.0

Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

citing papers explorer

Showing 28 of 28 citing papers.

Scaling Limits of Long-Context Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 69
For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 8
Img2CADSeq generates standard CAD sequences from images via a multi-stage pipeline with three-level hierarchical codebook encoding, importance-guided compression, and contrastive point-cloud conditioning of a VQ-Diffusion model, outperforming prior methods on new CAD-220K and PrintCAD datasets.
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases cs.LG · 2026-05-10 · unverdicted · none · ref 3
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 66
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Towards Understanding Self-Pretraining for Sequence Classification cs.LG · 2026-05-20 · unverdicted · none · ref 42
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
Protein Fold Classification at Scale: Benchmarking and Pretraining cs.LG · 2026-05-18 · unverdicted · none · ref 22
Introduces TEDBench benchmark and MiAE self-supervised framework that outperforms baselines for large-scale protein fold classification.
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction cs.AI · 2026-05-13 · unverdicted · none · ref 4
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm cs.CL · 2026-05-11 · unverdicted · none · ref 24
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
Generative 3D Gaussians with Learned Density Control cs.GR · 2026-05-08 · unverdicted · none · ref 79
DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
Large Vision-Language Models Get Lost in Attention cs.AI · 2026-05-07 · unverdicted · none · ref 78
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 146
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
HEXST: Hexagonal Shifted-Window Transformer for Spatial Transcriptomics Gene Expression Prediction cs.LG · 2026-05-06 · unverdicted · none · ref 19
HEXST applies a hexagonal shifted-window Transformer with rotary positional encodings, contrast-sensitive training objectives, and single-cell priors to predict gene expression from histology slides, outperforming prior models on seven datasets while preserving spatial heterogeneity.
Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems cs.AI · 2026-04-23 · unverdicted · none · ref 39
DiffMAS jointly optimizes latent communication and reasoning in multi-agent LLM systems via parameter-efficient supervised training on trajectories, yielding consistent gains over baselines on math, science, and code benchmarks.
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models cs.LG · 2026-04-20 · unverdicted · none · ref 220
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens q-bio.NC · 2026-04-20 · unverdicted · none · ref 21
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
mHC: Manifold-Constrained Hyper-Connections cs.CL · 2025-12-31 · unverdicted · none · ref 18
mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time cs.CV · 2025-09-29 · unverdicted · none · ref 45
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 154
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 27
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer cs.CV · 2024-08-12 · unverdicted · none · ref 46
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes cs.CV · 2026-05-22 · unverdicted · none · ref 53
Introduces dual pose-image representation, cross-modal alignment, and iterative construction to improve prompt alignment and diversity in multi-person text-to-image generation.
Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice cs.CV · 2026-05-11 · unverdicted · none · ref 18
TaTok is a theoretically grounded adaptive tokenization method that uses global tokens and cumulative conditional entropy filtering to reduce redundancy while improving reconstruction quality over fixed-rate patch tokenization.
When and Why Grouping Attention Heads Accelerates Muon Optimization cs.LG · 2026-05-09 · unverdicted · none · ref 16
Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention cs.LG · 2026-05-07 · unverdicted · none · ref 23
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 29
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 78
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning cs.GR · 2026-05-16 · unreviewed · ref 228
GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding cs.LG · 2026-05-14 · unreviewed · ref 18

Neurocomputing , volume=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer