Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
Citation behavior is mixed; the most common role is background (62%).
representative citing papers
Transformers without positional signals cannot solve order-sensitive tasks; optimal encodings are approximated by classical MDS on Hellinger distance, with ALiBi achieving lower stress than sinusoidal or RoPE and effective rank at most n-1.
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
ORiGAMi synthesizes sparse semi-structured mixed-type JSON data using path-encoded autoregressive tokenization and schema constraints, outperforming flattened tabular baselines on 17 of 18 fidelity, detection, and utility metrics while keeping privacy above 96%.
SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.
Prime Fourier Embeddings provide a group-theoretic basis for integer representations in which modular arithmetic becomes channel selection, with Schur's lemma guaranteeing block-diagonal equivariant maps and empirical confirmation of prime-channel specialization on square-free moduli.
AdaVoMP predicts accurate dense spatially-varying Young's modulus, Poisson's ratio and density for 3D objects using an adaptive sparse voxel structure generated by a sparse transformer encoder-decoder at 16^3 higher resolution than prior fixed-voxel methods.
Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.
Kuramoto synchronization dynamics implement a provably unique and globally attractive attention mechanism that replaces softmax for physical substrates and shows competitive empirical performance.
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
Parallax is a scalable parameterized local linear attention variant that improves LLM pretraining perplexity at 0.6B/1.7B scales with a hardware-aware kernel and shows gains under parameter- and compute-matched controls.
BodyReLux achieves photorealistic, temporally consistent full-body video relighting via a diffusion model with token-based lighting conditioning trained on a hybrid static-dynamic capture dataset.
iTryOn is a diffusion-based framework that adds spatial 3D hand guidance and semantic action-aware embeddings to handle complex garment deformations during human-clothing interactions in videos.
A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.
Transpose-invariant spectral diagnostics on attention operators are orientation-blind, and a φ-G two-axis diagnostic distinguishes hallucination modes with 0.62-0.84 LC-AUROC and predicted polarity reversal.
TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
Local attention in fixed-precision transformers introduces a second past operator in linear temporal logic, strictly increasing expressivity over global attention alone, with hybrids being most expressive.
A fitted iso-depth scaling law estimates that one recurrence in looped transformers is worth r^0.46 unique blocks in terms of validation loss.
NEAT achieves state-of-the-art 3D molecular generation on QM9 and GEOM-Drugs via a neighborhood-guided autoregressive set transformer that ensures atom-level permutation invariance and offers a significant speed advantage.
citing papers explorer
-
Short Data, Long Context: Distilling Positional Knowledge in Transformers
Long-context retrieval transfers to student models through logit-based distillation on packed short sequences, aided by phase-wise RoPE scaling and observable positional propagation to output logits.
-
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
-
LinStereo: Linear-Complexity Global Attention for Multi-Scale Iterative Stereo Matching
LinStereo uses Position-Aware Linear Attention, Hierarchical Semantic Cost Volumes, and Depth Prior Initialization to enable global aggregation in iterative stereo matching at linear complexity, showing improved performance on standard and underwater benchmarks.
-
Sakana Fugu Technical Report
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
-
Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models
A 355M-parameter byte-level LM on 80B multilingual tokens exhibits UTF-8 validity converging after 4.2B tokens versus 2.1B for perplexity, with higher validity on rare characters than common ones.
-
Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation
Sparse2Act pretrains sparse 3D encoders via masked action-alignment supervision, yielding reusable representations that reach 86.9% success on LIBERO-10 and enable cross-domain transfer.
-
A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation
The recipe combines GRPO with teacher-guided on-policy distillation and introduces the LongBlocks dataset, yielding more stable long-context reasoning than either method alone.
-
Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates
Explicit E(3)-equivariance in neural CFD surrogates improves generalization on diverse-geometry hemodynamics benchmarks but degrades in-distribution performance on strongly aligned aerodynamics data, consistently beating data augmentation.
-
A-THENA: Early Intrusion Detection for IoT with Time-Aware Hybrid Encoding and Network-Specific Augmentation
A-THENA improves averaged IoT intrusion detection accuracy by 3.69-6.88 percentage points over baselines on three datasets using time-aware hybrid encoding and network-specific augmentation, with near-zero false alarms and real-time deployment on Raspberry Pi Zero 2 W.
-
LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel
LaplacianFormer uses a Laplacian kernel with an injective feature map and efficient approximations to achieve linear attention that preserves mid-range interactions better than Gaussian-based linear attention in vision transformers.
-
CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation
CXRMate-2 improves chest X-ray report generation via temporal embeddings and tractable RL, delivering metric gains and 45% acceptability in radiologist review with no significant preference difference on most findings.
-
Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts
STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.
-
Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models
Fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 and other clinical outcome predictions, while certain temporal encodings like event order match or exceed time tokens with shorter sequences.
-
Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation
RoPE-Perturbed Self-Distillation improves positional robustness during long-context fine-tuning of LLMs by training models to produce consistent outputs across RoPE-perturbed views of the input.
-
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
ProxyCoT transfers CoT reasoning from proxy short contexts to full long contexts through RL/distillation followed by SFT, outperforming baselines with lower overhead and generalizing out-of-domain.
-
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
-
ZONOS2 Technical Report
ZONOS2 8B is a scaled MoE TTS model with 900M active parameters trained on 6M hours of data that reports competitive SOTA results on naturalness, speaker similarity, WER, and a new ZTTS1-Eval benchmark while releasing weights and code.
-
PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design
PRISM is a position-encoded autoregressive transformer that solves the inverse design of multilayer thin films via spectrum prefix conditioning and cumulative-depth RoPE, reporting over 50% MAE reduction versus baselines with fewer parameters.
-
Can Muon Fine-tune Adam-Pretrained Models?
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
-
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
-
Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series
Bielik v3 models achieve better Polish language modeling efficiency by switching to a dedicated tokenizer, FOCUS initialization, multi-stage pretraining, and post-training with SFT, DPO, and GRPO.
-
Legal Domain Adaptation of Modern BERT Models
Further pre-training ModernBERT on US court opinions improves results on legal datasets compared to the base model, with gains similar to early BERT domain adaptation work.
-
K-Quantization and its Impact on Output Performance
An empirical evaluation of quantization effects on eight LLMs across bit widths shows that performance generally declines at lower precision, with model-size-dependent resilience and acceptable accuracy at 2 bits in many cases.