Mixed citations

Title resolution pending

RoFormer: Enhanced transformer with Rotary Position Embedding · 2024 · arXiv 2023.127063

Mixed citation behavior. Most common role is background (62%).

38 Pith papers citing it

Background 62% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 5 · method 2 · dataset 1

citation-polarity summary

background 5 · use method 2 · use dataset 1

representative citing papers

On the Geometry of Positional Encodings in Transformers

cs.LG · 2026-04-06 · unverdicted · novelty 8.0

Transformers without positional signals cannot solve order-sensitive tasks; optimal encodings are approximated by classical MDS on Hellinger distance, with ALiBi achieving lower stress than sinusoidal or RoPE and effective rank at most n-1.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

cs.LG · 2026-03-02 · conditional · novelty 8.0

ORiGAMi synthesizes sparse semi-structured mixed-type JSON data using path-encoded autoregressive tokenization and schema constraints, outperforming flattened tabular baselines on 17 of 18 fidelity, detection, and utility metrics while keeping privacy above 96%.

BodyReLux: Temporally Consistent Full-Body Video Relighting

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

BodyReLux achieves photorealistic, temporally consistent full-body video relighting via a diffusion model with token-based lighting conditioning trained on a hybrid static-dynamic capture dataset.

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.

WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer

cs.GR · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.

Very Efficient Listwise Multimodal Reranking for Long Documents

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.

TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis

cs.CL · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.

A framework for analyzing concept representations in neural models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · conditional · novelty 7.0 · 2 refs

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation

cs.LG · 2025-12-05 · unverdicted · novelty 7.0

NEAT achieves state-of-the-art 3D molecular generation on QM9 and GEOM-Drugs via a neighborhood-guided autoregressive set transformer that ensures atom-level permutation invariance and offers a significant speed advantage.

Chessformer: A Unified Architecture for Chess Modeling

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Chessformer is a unified encoder-only transformer for chess that uses square tokens, geometric attention bias, and an attention-based policy head to set new records in human move prediction accuracy, playing strength, and interpretability.

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

cs.CL · 2026-05-18 · conditional · novelty 6.0

RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

cs.CL · 2026-05-16 · unverdicted · novelty 6.0

RTPurbo exploits intrinsic sparsity in full-attention LLMs to achieve near-lossless sparse inference after only a few hundred training steps via retrieval-head identification and a lightweight token indexer.

Deep Pre-Alignment for VLMs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

Understanding and Accelerating the Training of Masked Diffusion Language Models

cs.LG · 2026-05-13 · conditional · novelty 6.0

Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.

RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Rephrasing web text into structured formats such as tables, math problems, FAQs, and tutorials produces higher-quality synthetic pretraining data than curated web baselines or prior synthetic methods, as demonstrated by trillion-token experiments and the resulting FinePhrase dataset that reduces gen

Short Data, Long Context: Distilling Positional Knowledge in Transformers

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

Long-context retrieval transfers to student models through logit-based distillation on packed short sequences, aided by phase-wise RoPE scaling and observable positional propagation to output logits.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

citing papers explorer

Showing 38 of 38 citing papers.

On the Geometry of Positional Encodings in Transformers cs.LG · 2026-04-06 · unverdicted · none · ref 8
Transformers without positional signals cannot solve order-sensitive tasks; optimal encodings are approximated by classical MDS on Hellinger distance, with ALiBi achieving lower stress than sinusoidal or RoPE and effective rank at most n-1.
On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication cs.LG · 2026-03-30 · unverdicted · none · ref 16
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data cs.LG · 2026-03-02 · conditional · none · ref 38
ORiGAMi synthesizes sparse semi-structured mixed-type JSON data using path-encoded autoregressive tokenization and schema constraints, outperforming flattened tabular baselines on 17 of 18 fidelity, detection, and utility metrics while keeping privacy above 96%.
BodyReLux: Temporally Consistent Full-Body Video Relighting cs.CV · 2026-05-20 · unverdicted · none · ref 90
BodyReLux achieves photorealistic, temporally consistent full-body video relighting via a diffusion model with token-based lighting conditioning trained on a hybrid static-dynamic capture dataset.
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance cs.CV · 2026-05-20 · unverdicted · none · ref 101
iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.
WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer cs.GR · 2026-05-14 · unverdicted · none · ref 9 · 2 links
A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.
Very Efficient Listwise Multimodal Reranking for Long Documents cs.IR · 2026-05-12 · unverdicted · none · ref 39
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 45
ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.
TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis cs.CL · 2026-05-03 · unverdicted · none · ref 9 · 2 links
TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.
A framework for analyzing concept representations in neural models cs.CL · 2026-05-02 · unverdicted · none · ref 92
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
Characterizing the Expressivity of Local Attention in Transformers cs.CL · 2026-05-01 · conditional · none · ref 33 · 2 links
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models cs.LG · 2026-04-22 · unverdicted · none · ref 19
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation cs.LG · 2025-12-05 · unverdicted · none · ref 20
NEAT achieves state-of-the-art 3D molecular generation on QM9 and GEOM-Drugs via a neighborhood-guided autoregressive set transformer that ensures atom-level permutation invariance and offers a significant speed advantage.
Chessformer: A Unified Architecture for Chess Modeling cs.LG · 2026-05-18 · unverdicted · none · ref 17
Chessformer is a unified encoder-only transformer for chess that uses square tokens, geometric attention bias, and an attention-based policy head to set new records in human move prediction accuracy, playing strength, and interpretability.
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language cs.CL · 2026-05-18 · conditional · none · ref 67
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps cs.CL · 2026-05-16 · unverdicted · none · ref 19
RTPurbo exploits intrinsic sparsity in full-attention LLMs to achieve near-lossless sparse inference after only a few hundred training steps via retrieval-head identification and a lightweight token indexer.
Deep Pre-Alignment for VLMs cs.CV · 2026-05-14 · unverdicted · none · ref 129
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
Understanding and Accelerating the Training of Masked Diffusion Language Models cs.LG · 2026-05-13 · conditional · none · ref 65
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
DVD: Discrete Voxel Diffusion for 3D Generation and Editing cs.CV · 2026-05-08 · unverdicted · none · ref 55
DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List cs.LG · 2026-05-08 · unverdicted · none · ref 21
LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control cs.CV · 2026-05-07 · unverdicted · none · ref 14
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data cs.CL · 2026-04-15 · unverdicted · none · ref 5
Rephrasing web text into structured formats such as tables, math problems, FAQs, and tutorials produces higher-quality synthetic pretraining data than curated web baselines or prior synthetic methods, as demonstrated by trillion-token experiments and the resulting FinePhrase dataset that reduces gen
Short Data, Long Context: Distilling Positional Knowledge in Transformers cs.CL · 2026-04-07 · unverdicted · none · ref 2
Long-context retrieval transfers to student models through logit-based distillation on packed short sequences, aided by phase-wise RoPE scaling and observable positional propagation to output logits.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 255
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates cs.LG · 2026-05-12 · unverdicted · none · ref 13
Explicit E(3)-equivariance in neural CFD surrogates improves generalization on diverse-geometry hemodynamics benchmarks but degrades in-distribution performance on strongly aligned aerodynamics data, consistently beating data augmentation.
A-THENA: Early Intrusion Detection for IoT with Time-Aware Hybrid Encoding and Network-Specific Augmentation cs.CR · 2026-04-23 · unverdicted · none · ref 65
A-THENA improves averaged IoT intrusion detection accuracy by 3.69-6.88 percentage points over baselines on three datasets using time-aware hybrid encoding and network-specific augmentation, with near-zero false alarms and real-time deployment on Raspberry Pi Zero 2 W.
LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel cs.CV · 2026-04-22 · unverdicted · none · ref 57
LaplacianFormer uses a Laplacian kernel with an injective feature map and efficient approximations to achieve linear attention that preserves mid-range interactions better than Gaussian-based linear attention in vision transformers.
CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation cs.CV · 2026-04-21 · unverdicted · none · ref 45
CXRMate-2 improves chest X-ray report generation via temporal embeddings and tractable RL, delivering metric gains and 45% acceptability in radiologist review with no significant preference difference on most findings.
Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts cs.SE · 2026-04-20 · conditional · none · ref 38
STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.
Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models cs.LG · 2026-04-18 · unverdicted · none · ref 36
Fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 and other clinical outcome predictions, while certain temporal encodings like event order match or exceed time tokens with shorter sequences.
Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation cs.CL · 2026-04-15 · unverdicted · none · ref 4
RoPE-Perturbed Self-Distillation improves positional robustness during long-context fine-tuning of LLMs by training models to produce consistent outputs across RoPE-perturbed views of the input.
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning cs.CL · 2026-04-06 · unverdicted · none · ref 22 · 2 links
ProxyCoT transfers CoT reasoning from proxy short contexts to full long contexts through RL/distillation followed by SFT, outperforming baselines with lower overhead and generalizing out-of-domain.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 186
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
Can Muon Fine-tune Adam-Pretrained Models? cs.LG · 2026-05-11 · unverdicted · none · ref 66
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task cs.CL · 2026-04-16 · unverdicted · none · ref 61
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series cs.CL · 2026-04-12 · unverdicted · none · ref 2
Bielik v3 models achieve better Polish language modeling efficiency by switching to a dedicated tokenizer, FOCUS initialization, multi-stage pretraining, and post-training with SFT, DPO, and GRPO.
K-Quantization and its Impact on Output Performance cs.CL · 2026-05-19 · unverdicted · none · ref 51
Empirical evaluation of quantization effects on eight LLMs across bit widths, showing performance generally declines at lower precision but with model-size-dependent resilience and acceptable accuracy at 2 bits for many cases.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unreviewed · ref 144
