Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
Mixed citation behavior. Most common role is background (62%).
citing papers explorer
-
Sumi: Open Uniform Diffusion Language Model from Scratch
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
-
On the Geometry of Positional Encodings in Transformers
Transformers without positional signals cannot solve order-sensitive tasks; optimal encodings are approximated by classical MDS on Hellinger distance, with ALiBi achieving lower stress than sinusoidal or RoPE and effective rank at most n-1.
-
On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
-
Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data
ORiGAMi synthesizes sparse semi-structured mixed-type JSON data using path-encoded autoregressive tokenization and schema constraints, outperforming flattened tabular baselines on 17 of 18 fidelity, detection, and utility metrics while keeping privacy above 96%.
-
SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE
SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.
-
Prime Fourier Embeddings: A Principled Basis for Modular Arithmetic
Prime Fourier Embeddings provide a group-theoretic basis for integer representations in which modular arithmetic becomes channel selection, with Schur's lemma guaranteeing block-diagonal equivariant maps and empirical confirmation of prime-channel specialization on square-free moduli.
-
Adaptive Volumetric Mechanical Property Fields Invariant to Resolution
AdaVoMP predicts accurate dense spatially-varying Young's modulus, Poisson's ratio and density for 3D objects using an adaptive sparse voxel structure generated by a sparse transformer encoder-decoder at 16^3 higher resolution than prior fixed-voxel methods.
-
Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering
Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.
-
Attention by Synchronization in Coupled Oscillator Networks
Kuramoto synchronization dynamics implement a provably unique and globally attractive attention mechanism that replaces softmax for physical substrates and shows competitive empirical performance.
-
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
-
Leyline: KV Cache Directives for Agentic Inference
Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.
-
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
-
Parallax: Parameterized Local Linear Attention for Language Modeling
Parallax is a scalable parameterized local linear attention variant that improves LLM pretraining perplexity at 0.6B/1.7B scales with a hardware-aware kernel and shows gains under parameter- and compute-matched controls.
-
BodyReLux: Temporally Consistent Full-Body Video Relighting
BodyReLux achieves photorealistic, temporally consistent full-body video relighting via a diffusion model with token-based lighting conditioning trained on a hybrid static-dynamic capture dataset.
-
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
iTryOn is a diffusion-based framework that adds spatial 3D hand guidance and semantic action-aware embeddings to handle complex garment deformations during human-clothing interactions in videos.
-
WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer
A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.
-
Very Efficient Listwise Multimodal Reranking for Long Documents
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
-
ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs
ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.
-
Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics
Transpose-invariant spectral diagnostics on attention operators are orientation-blind, and a φ-G two-axis diagnostic distinguishes hallucination modes with 0.62-0.84 LC-AUROC and predicted polarity reversal.
-
TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis
TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.
-
A framework for analyzing concept representations in neural models
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention in fixed-precision transformers introduces a second past operator in linear temporal logic, strictly increasing expressivity over global attention alone, with hybrids being most expressive.
-
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
-
NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation
NEAT achieves state-of-the-art 3D molecular generation on QM9 and GEOM-Drugs via a neighborhood-guided autoregressive set transformer that ensures atom-level permutation invariance and offers a significant speed advantage.
-
SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference
SeKV introduces resolution-adaptive semantic KV caching with GPU-CPU hierarchy and selective zoom-in reconstruction, achieving 5.9% average improvement over semantic baselines and 53.3% GPU memory reduction at 128K context.
-
Internal Data Repetition Destroys Language Models
Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.
-
Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse
Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
-
Attention mechanism for scalable mesh-based neural surrogates of free-surface fluids
Self-attention mechanisms are used to build mesh-preserving neural surrogates that approximate PFEM dynamics for free-surface flows, delivering accurate transient predictions and improved scalability on 2D and 3D benchmarks.
-
Scalable Physics-Inspired Transformers for Spin Glasses
A physics-inspired transformer with sparse attention and FlashAttention enables up to 100x faster sampling of large spin-glass systems, providing distributions, free energies, and overlaps for SK and EA models where prior ML methods fail at some temperatures.
-
Controllable Texture Tiling with Transformed RoPE-Enhanced Diffusion Models
A Diffusion Transformer framework applies coordinate-transformed RoPE and disjoint attention masks to achieve controllable, high-fidelity texture tiling that preserves reference structure and scene lighting.
-
Multi-Task Bayesian In-Context Learning
A transformer trained on sequences of prior and target tasks performs amortized Bayesian inference that adapts to new priors via in-context prefixes and matches oracle performance at much higher speed.
-
Variable-Width Transformers
×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.
-
RSRank: Learning Relevance from Representational Shifts
RSRank learns calibrated relevance scores from alignment between representational shifts induced by candidate documents and those from oracle document sets, enabling zero-threshold filtering.
-
nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding
nD-RoPE derives an isotropic n-dimensional RoPE from a translation-invariant Hilbert-space formulation and instantiates it via multi-scale regular-simplex wave vectors, reporting gains on multi-dimensional data.
-
Next-Token Prediction Learns Generalisable Representations of Sleep Physiology
Next-token prediction on multi-modal tokenized sleep signals yields embeddings that match supervised performance with far fewer labels and generalize to daytime heart data.
-
Multi-Hop Knowledge Composition is Bound by Pretraining Exposure
Controlled experiments show implicit multi-hop reasoning in LLMs requires prior exposure to compositional contexts during pretraining and does not transfer to unexposed individuals.
-
Data-Driven Forecasting of three-Component Seismograms Using Transformer Architectures
SeismoGPT is a transformer autoregressive model achieving median normalized cross-correlation above 0.93 when forecasting synthetic three-component seismograms up to 240 s ahead from P- and S-wave context.
-
STAP: A Shuffle-Tokenized App Predictor with Ultra Long Context for Vocabulary-Free Mobile App Prediction
A Transformer model with app-identity shuffling and ultra-long context achieves vocabulary-free next-app prediction with cross-dataset zero-shot capability and competitive cold-start performance.
-
Sampling Triangulations and Calabi-Yau Threefolds with Autoregressive GNNs
Introduces dualGNN, an autoregressive message-passing GNN using signed circuits to sample uniform fine regular triangulations of lattice polytopes, applied to Calabi-Yau threefolds at h^{1,1}=86 and 128.
-
Geometry-Aware Tabular Diffusion
GATD adds explicit geometric relational supervision to tabular diffusion, achieving SOTA benchmark wins with substantially fewer parameters across ten datasets.
-
Chessformer: A Unified Architecture for Chess Modeling
Chessformer is a unified encoder-only transformer for chess that uses square tokens, geometric attention bias, and an attention-based policy head to set new records in human move prediction accuracy, playing strength, and interpretability.
-
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
-
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.
-
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
RTPurbo converts full-attention LLMs to sparse attention by retaining full KV for retrieval heads and using a low-dimensional dynamic indexer, achieving near-lossless accuracy after minimal adaptation.
-
Deep Pre-Alignment for VLMs
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
-
Understanding and Accelerating the Training of Masked Diffusion Language Models
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
-
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
DVD applies discrete diffusion directly to voxel occupancy for 3D generation, uncertainty estimation via entropy, and single-round editing via block perturbation fine-tuning.
-
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Rephrasing web text into structured formats such as tables, math problems, FAQs, and tutorials produces higher-quality synthetic pretraining data than curated web baselines or prior synthetic methods, as demonstrated by trillion-token experiments and the resulting FinePhrase dataset that reduces gen