Advances in Neural Information Processing Systems

Attention is all you need

Canonical reference. 86% of citing Pith papers cite this work as background.

143 Pith papers citing it

Background 86% of classified citations

citation-role summary

background: 6 · method: 1

citation-polarity summary

background: 6 · use method: 1

authors

Attention is all you need author=

co-cited works

representative citing papers

Quotient-Space Diffusion Models

cs.LG · 2026-04-23 · unverdicted · novelty 8.0

Quotient-space diffusion models generate correct symmetric distributions by removing redundancy on the quotient space, simplifying learning and improving results on small molecules and proteins under SE(3) symmetry.

Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

stat.ML · 2026-05-22 · unverdicted · novelty 7.0

Regularized Muon induces a damped Hamiltonian flow on probability measures over matrix parameters, yielding exponential convergence under gradient dominance assumptions.

Learning Causal Orderings for In-Context Tabular Prediction

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

TabOrder learns unsupervised causal variable orderings and enforces them with order-constrained attention for tabular prediction and imputation under distribution shifts.

ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

ConTact decomposes CDR design into surface fingerprint learning, contact prediction, and contact-gated sequence generation using distance-biased attention and weighted loss, reporting 7% RMSD and 10% F1 gains on CHIMERA-Bench.

BrepForge: Factorized B-rep Synthesis via Wireframe Composition and Boundary-Conditioned Surface Instantiation

cs.GR · 2026-05-19 · unverdicted · novelty 7.0

BrepForge factorizes B-rep synthesis into face-aware autoregressive wireframe composition followed by boundary-conditioned surface instantiation using learning-free geometric priors.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

Convergence of difference inclusions via a diameter criterion

math.OC · 2026-05-14 · unverdicted · novelty 7.0

A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.

ViT-K: A Few-Shot Learning Model for Coupled Fluid-Porous Media Flows with Interface Conditions

math.NA · 2026-05-13 · unverdicted · novelty 7.0

ViT-K uses Vision Transformers and Koopman operators to learn stable long-term spatiotemporal dynamics of coupled fluid-porous media flows from sparse data.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

cs.CV · 2026-05-11 · conditional · novelty 7.0

OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.

Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.

The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.

Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.

Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.

Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

cs.AI · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.

Sinkhorn Treatment Effects: A Causal Optimal Transport Measure

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

The Sinkhorn treatment effect is a new entropic optimal transport measure of divergence between counterfactual distributions that admits first- and second-order pathwise differentiability, debiased estimators, and asymptotically valid tests for distributional treatment effects.

LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification

cs.AI · 2026-05-08 · conditional · novelty 7.0

LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.

EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

Randomness is sometimes necessary for coordination

cs.AI · 2026-05-07 · conditional · novelty 7.0

Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.

Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

cs.LG · 2026-05-07 · accept · novelty 7.0

Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

Transformers with Selective Access to Early Representations

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

Generative Modeling with Orbit-Space Particle Flow Matching

cs.GR · 2026-05-04 · unverdicted · novelty 7.0

OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.

citing papers explorer

Showing 50 of 143 citing papers.

Quotient-Space Diffusion Models cs.LG · 2026-04-23 · unverdicted · none · ref 13
Quotient-space diffusion models generate correct symmetric distributions by removing redundancy on the quotient space, simplifying learning and improving results on small molecules and proteins under SE(3) symmetry.
Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer stat.ML · 2026-05-22 · unverdicted · none · ref 13
Regularized Muon induces a damped Hamiltonian flow on probability measures over matrix parameters, yielding exponential convergence under gradient dominance assumptions.
Learning Causal Orderings for In-Context Tabular Prediction cs.LG · 2026-05-21 · unverdicted · none · ref 52
TabOrder learns unsupervised causal variable orderings and enforces them with order-constrained attention for tabular prediction and imputation under distribution shifts.
ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning cs.LG · 2026-05-20 · unverdicted · none · ref 203
ConTact decomposes CDR design into surface fingerprint learning, contact prediction, and contact-gated sequence generation using distance-biased attention and weighted loss, reporting 7% RMSD and 10% F1 gains on CHIMERA-Bench.
BrepForge: Factorized B-rep Synthesis via Wireframe Composition and Boundary-Conditioned Surface Instantiation cs.GR · 2026-05-19 · unverdicted · none · ref 106
BrepForge factorizes B-rep synthesis into face-aware autoregressive wireframe composition followed by boundary-conditioned surface instantiation using learning-free geometric priors.
Functionalization via Structure Completion and Motion Rectification cs.CV · 2026-05-18 · unverdicted · none · ref 144
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 50
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Convergence of difference inclusions via a diameter criterion math.OC · 2026-05-14 · unverdicted · none · ref 129
A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.
ViT-K: A Few-Shot Learning Model for Coupled Fluid-Porous Media Flows with Interface Conditions math.NA · 2026-05-13 · unverdicted · none · ref 43
ViT-K uses Vision Transformers and Koopman operators to learn stable long-term spatiotemporal dynamics of coupled fluid-porous media flows from sparse data.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 57 · 2 links
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
OpenSGA: Efficient 3D Scene Graph Alignment in the Open World cs.CV · 2026-05-11 · conditional · none · ref 9
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition cs.LG · 2026-05-11 · unverdicted · none · ref 1
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently cs.LG · 2026-05-11 · unverdicted · none · ref 122
Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration cs.SD · 2026-05-11 · unverdicted · none · ref 62
Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases cs.LG · 2026-05-10 · unverdicted · none · ref 1
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations cs.AI · 2026-05-09 · unverdicted · none · ref 34 · 2 links
CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.
Sinkhorn Treatment Effects: A Causal Optimal Transport Measure stat.ML · 2026-05-08 · unverdicted · none · ref 36
The Sinkhorn treatment effect is a new entropic optimal transport measure of divergence between counterfactual distributions that admits first- and second-order pathwise differentiability, debiased estimators, and asymptotically valid tests for distributional treatment effects.
LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification cs.AI · 2026-05-08 · conditional · none · ref 26
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding cs.CV · 2026-05-08 · unverdicted · none · ref 56
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding cs.AI · 2026-05-08 · unverdicted · none · ref 67 · 2 links
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
Randomness is sometimes necessary for coordination cs.AI · 2026-05-07 · conditional · none · ref 12
Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes.
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters cs.LG · 2026-05-07 · accept · none · ref 271
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
Transformers with Selective Access to Early Representations cs.LG · 2026-05-05 · unverdicted · none · ref 5 · 2 links
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
Generative Modeling with Orbit-Space Particle Flow Matching cs.GR · 2026-05-04 · unverdicted · none · ref 124
OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 11
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
TimeTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization cs.AI · 2026-05-02 · unverdicted · none · ref 58
TimeTok is a unified framework using hierarchical tokenization for granularity-controllable time-series generation that achieves state-of-the-art performance in standard tasks and shows transferability across heterogeneous datasets.
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis cs.CL · 2026-05-02 · unverdicted · none · ref 103
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
New Bounds for Kernel Sums via Fast Spherical Embeddings cs.DS · 2026-05-02 · unverdicted · none · ref 54
New query-time bound of tilde O(d + epsilon Delta squared + 1/epsilon cubed) for Gaussian kernel mean estimation, improving prior bounds for small epsilon and intermediate diameter via a fast spherical embedding theorem.
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost? cs.CL · 2026-05-01 · unverdicted · none · ref 1
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 28
LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences cs.LG · 2026-04-22 · unverdicted · none · ref 89
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Participatory provenance as representational auditing for AI-mediated public consultation cs.AI · 2026-04-22 · unverdicted · none · ref 29
Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
Long-Text-to-Image Generation via Compositional Prompt Decomposition cs.CV · 2026-04-20 · unverdicted · none · ref 58
PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models while generalizing better to prompts over 500 tokens.
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning cs.RO · 2026-04-19 · unverdicted · none · ref 32
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads cs.IR · 2026-04-19 · unverdicted · none · ref 33
HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 33
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data cs.LG · 2024-06-06 · conditional · none · ref 65
Absorbing discrete diffusion models the conditional distributions of clean data; reparameterizing yields a time-independent RADD that unifies with AO-ARMs and reaches SOTA perplexity among diffusion models on zero-shot language benchmarks.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 88
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Eliciting Latent Predictions from Transformers with the Tuned Lens cs.LG · 2023-03-14 · accept · none · ref 88
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
Fast Inference from Transformers via Speculative Decoding cs.LG · 2022-11-30 · accept · none · ref 30
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring eess.IV · 2026-05-22 · unverdicted · none · ref 37
DGNO parameterizes integral kernels with discontinuous Galerkin elements for heterogeneous defocus deblurring in pathology images and reports superior performance over prior methods.
Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation cs.RO · 2026-05-22 · unverdicted · none · ref 24
IDEA is a TTA framework for VLN that builds a dynamic asset library from Fisher-weighted soft prompts and domain coordinates, then uses convex-hull projection for cross-domain bridging and training-free adaptation.
Convex Optimization for Alignment and Preference Learning on a Single GPU cs.LG · 2026-05-22 · unverdicted · none · ref 57
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics cs.LG · 2026-05-21 · unverdicted · none · ref 188
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
AgForce Enables Antigen-conditioned Generative Antibody Design cs.LG · 2026-05-20 · unverdicted · none · ref 203
AgForce improves antigen-conditioned antibody design by using framework dropout, gated bottlenecks, hyperbolic cross attention, MDN sequence head with Potts-like coupling, annealed MCL, and antigen cycle consistency to achieve 8% better amino acid recovery and superior binding metrics on CHIMERA-BEN
EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation cs.LG · 2026-05-20 · unverdicted · none · ref 203
EvoStruct integrates evolutionary priors from a protein language model with structural priors from an E(3)-equivariant GNN to raise amino acid recovery by 16% and diversity by 2.3x on CHIMERA-Bench while cutting perplexity 43%.
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models cs.CL · 2026-05-19 · unverdicted · none · ref 26
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability cs.LG · 2026-05-19 · unverdicted · none · ref 105
ST-TGExplainer disentangles stability and transition patterns in temporal graphs via a self-explainable TGNN guided by a disentangled information bottleneck objective to produce more faithful explanations.
Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance cs.GR · 2026-05-19 · unverdicted · none · ref 44
A geometry-aware retargeting method uses Transformer-refined adaptive anchors and a graph autoencoder to preserve interaction semantics like self-contact across characters with exaggerated proportions.
Spatial Blindness in Whole-Slide Multiple Instance Learning cs.CV · 2026-05-17 · unverdicted · none · ref 14
Standard MIL models for whole-slide pathology images exhibit spatial blindness under coordinate permutation; ResTopoMIL separates appearance and spatial learning to restore sensitivity and improve classification and survival prediction.
