Mixed citations

Title resolution pending

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, Yunfeng Liu · 2024 · arXiv 2023.127063

Mixed citation behavior. Most common role is background (62%).

73 Pith papers citing it

Background 62% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background: 5 · method: 2 · dataset: 1

citation-polarity summary

background: 5 · use method: 2 · use dataset: 1

representative citing papers

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

On the Geometry of Positional Encodings in Transformers

cs.LG · 2026-04-06 · unverdicted · novelty 8.0

Transformers without positional signals cannot solve order-sensitive tasks; optimal encodings are approximated by classical MDS on Hellinger distance, with ALiBi achieving lower stress than sinusoidal or RoPE and effective rank at most n-1.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

cs.LG · 2026-03-02 · conditional · novelty 8.0

ORiGAMi synthesizes sparse semi-structured mixed-type JSON data using path-encoded autoregressive tokenization and schema constraints, outperforming flattened tabular baselines on 17 of 18 fidelity, detection, and utility metrics while keeping privacy above 96%.

SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.

Prime Fourier Embeddings: A Principled Basis for Modular Arithmetic

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Prime Fourier Embeddings provide a group-theoretic basis for integer representations in which modular arithmetic becomes channel selection, with Schur's lemma guaranteeing block-diagonal equivariant maps and empirical confirmation of prime-channel specialization on square-free moduli.

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

cs.CV · 2026-06-16 · unverdicted · novelty 7.0

AdaVoMP predicts accurate dense spatially-varying Young's modulus, Poisson's ratio and density for 3D objects using an adaptive sparse voxel structure generated by a sparse transformer encoder-decoder at 16^3 higher resolution than prior fixed-voxel methods.

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.

Attention by Synchronization in Coupled Oscillator Networks

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Kuramoto synchronization dynamics implement a provably unique and globally attractive attention mechanism that replaces softmax for physical substrates and shows competitive empirical performance.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

Leyline: KV Cache Directives for Agentic Inference

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

cs.LG · 2026-05-29 · conditional · novelty 7.0

Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.

Parallax: Parameterized Local Linear Attention for Language Modeling

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Parallax is a scalable parameterized local linear attention variant that improves LLM pretraining perplexity at 0.6B/1.7B scales with a hardware-aware kernel and shows gains under parameter- and compute-matched controls.

BodyReLux: Temporally Consistent Full-Body Video Relighting

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

BodyReLux achieves photorealistic, temporally consistent full-body video relighting via a diffusion model with token-based lighting conditioning trained on a hybrid static-dynamic capture dataset.

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

iTryOn is a diffusion-based framework that adds spatial 3D hand guidance and semantic action-aware embeddings to handle complex garment deformations during human-clothing interactions in videos.

WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer

cs.GR · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.

Very Efficient Listwise Multimodal Reranking for Long Documents

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Transpose-invariant spectral diagnostics on attention operators are orientation-blind, and a φ-G two-axis diagnostic distinguishes hallucination modes with 0.62-0.84 LC-AUROC and predicted polarity reversal.

TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis

cs.CL · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.

A framework for analyzing concept representations in neural models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · unverdicted · novelty 7.0 · 3 refs

Local attention in fixed-precision transformers introduces a second past operator in linear temporal logic, strictly increasing expressivity over global attention alone, with hybrids being most expressive.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation

cs.LG · 2025-12-05 · unverdicted · novelty 7.0

NEAT achieves state-of-the-art 3D molecular generation on QM9 and GEOM-Drugs via a neighborhood-guided autoregressive set transformer that ensures atom-level permutation invariance and offers a significant speed advantage.

citing papers explorer

Showing 21 of 21 citing papers after filters.

Sumi: Open Uniform Diffusion Language Model from Scratch · cs.CL · 2026-06-17 · unverdicted · none · ref 33
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering · cs.CL · 2026-06-15 · unverdicted · none · ref 21
Multimodal KB-VQA exhibits a primacy bias where gold passages at prompt start outperform those at the end by 16-26 points, flipping the text-only lost-in-the-middle pattern.
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding · cs.CL · 2026-06-03 · unverdicted · none · ref 89
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis · cs.CL · 2026-05-03 · unverdicted · none · ref 9 · 2 links
TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.
A framework for analyzing concept representations in neural models · cs.CL · 2026-05-02 · unverdicted · none · ref 92
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
Characterizing the Expressivity of Local Attention in Transformers · cs.CL · 2026-05-01 · unverdicted · none · ref 33 · 3 links
Local attention in fixed-precision transformers introduces a second past operator in linear temporal logic, strictly increasing expressivity over global attention alone, with hybrids being most expressive.
SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference · cs.CL · 2026-06-30 · unverdicted · none · ref 48
SeKV introduces resolution-adaptive semantic KV caching with GPU-CPU hierarchy and selective zoom-in reconstruction, achieving 5.9% average improvement over semantic baselines and 53.3% GPU memory reduction at 128K context.
Variable-Width Transformers · cs.CL · 2026-06-16 · conditional · none · ref 38
×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.
Multi-Hop Knowledge Composition is Bound by Pretraining Exposure · cs.CL · 2026-06-08 · unverdicted · none · ref 2
Controlled experiments show implicit multi-hop reasoning in LLMs requires prior exposure to compositional contexts during pretraining and does not transfer to unexposed individuals.
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language · cs.CL · 2026-05-18 · conditional · none · ref 67
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps · cs.CL · 2026-05-16 · unverdicted · none · ref 19 · 2 links
RTPurbo converts full-attention LLMs to sparse attention by retaining full KV for retrieval heads and using a low-dimensional dynamic indexer, achieving near-lossless accuracy after minimal adaptation.
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data · cs.CL · 2026-04-15 · unverdicted · none · ref 5
Rephrasing web text into structured formats such as tables, math problems, FAQs, and tutorials produces higher-quality synthetic pretraining data than curated web baselines or prior synthetic methods, as demonstrated by trillion-token experiments and the resulting FinePhrase dataset that reduces gen
Short Data, Long Context: Distilling Positional Knowledge in Transformers · cs.CL · 2026-04-07 · unverdicted · none · ref 2
Long-context retrieval transfers to student models through logit-based distillation on packed short sequences, aided by phase-wise RoPE scaling and observable positional propagation to output logits.
Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models · cs.CL · 2026-06-12 · unverdicted · none · ref 18
A 355M-parameter byte-level LM on 80B multilingual tokens exhibits UTF-8 validity converging after 4.2B tokens versus 2.1B for perplexity, with higher validity on rare characters than common ones.
A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation · cs.CL · 2026-05-12 · unverdicted · none · ref 5
Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.
Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation · cs.CL · 2026-04-15 · unverdicted · none · ref 4
RoPE-Perturbed Self-Distillation improves positional robustness during long-context fine-tuning of LLMs by training models to produce consistent outputs across RoPE-perturbed views of the input.
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning · cs.CL · 2026-04-06 · unverdicted · none · ref 22 · 2 links
ProxyCoT transfers CoT reasoning from proxy short contexts to full long contexts through RL/distillation followed by SFT, outperforming baselines with lower overhead and generalizing out-of-domain.
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task · cs.CL · 2026-04-16 · unverdicted · none · ref 61
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series · cs.CL · 2026-04-12 · unverdicted · none · ref 2
Bielik v3 models achieve better Polish language modeling efficiency by switching to a dedicated tokenizer, FOCUS initialization, multi-stage pretraining, and post-training with SFT, DPO, and GRPO.
Legal Domain Adaptation of Modern BERT Models · cs.CL · 2026-06-26 · unverdicted · none · ref 31
Further pre-training ModernBERT on US court opinions improves results on legal datasets compared to the base model, with gains similar to early BERT domain adaptation work.
K-Quantization and its Impact on Output Performance · cs.CL · 2026-05-19 · unverdicted · none · ref 51
Empirical evaluation of quantization effects on eight LLMs across bit widths, showing performance generally declines at lower precision but with model-size-dependent resilience and acceptable accuracy at 2 bits for many cases.
