Differential transformer

2024 · arXiv 2410.05258

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

cs.LG · 2025-12-08 · conditional · novelty 7.0

FDA differentially subtracts function-word cross-attention from original attention heads to cut attack success rates by 18-90% across models and tasks while dropping performance by at most 0.6%.

IAFormer: Interaction-Aware Transformer network for collider data analysis

hep-ph · 2025-05-06 · unverdicted · novelty 7.0

IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an order of magnitude fewer parameters than prior Particle Transformer models.

Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

cs.LG · 2026-05-20 · conditional · novelty 6.0

Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.

SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing

cs.DB · 2026-04-16 · unverdicted · novelty 6.0

SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% token budget on benchmarks like QuALITY-hard.

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

Brain-OF is a multimodal foundation model for fMRI, EEG and MEG using any-resolution sampling, DINT attention with sparse MoE, and masked temporal-frequency pretraining on ~40 datasets to achieve superior downstream performance.

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

cs.LG · 2025-11-03 · unverdicted · novelty 6.0

Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.

LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

cs.MA · 2025-07-20 · unverdicted · novelty 6.0

An LLM-enhanced MARL system with differential attention critic produces lower economic costs and voltage violations than baselines in simulated real-time P2P electricity trading.

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

CLMM is a two-stage contrastive learning framework using CNN-DiffTransformer encoders and dual-branch fusion to improve multimodal human activity recognition under limited labels.

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

cs.LG · 2026-03-21 · unverdicted · novelty 5.0

GraphDiffMed integrates dual-scale differential attention with pharmacological graph priors to improve medication recommendation quality, ranking, and safety balance on MIMIC-III data.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma

cs.AI · 2026-05-02 · unverdicted · novelty 3.0

AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.

The General Theory of Localization Methods

cs.LG · 2026-05-20 · unreviewed

citing papers explorer

Showing 15 of 15 citing papers.

Pay Less Attention to Function Words for Free Robustness of Vision-Language Models · cs.LG · 2025-12-08 · conditional · none · ref 22
IAFormer: Interaction-Aware Transformer network for collider data analysis · hep-ph · 2025-05-06 · unverdicted · none · ref 46
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor · cs.LG · 2026-05-20 · conditional · none · ref 50
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning · cs.CL · 2026-05-11 · unverdicted · none · ref 38
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing · cs.DB · 2026-04-16 · unverdicted · none · ref 50
Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG · cs.LG · 2026-02-26 · unverdicted · none · ref 14
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants · cs.LG · 2025-11-03 · unverdicted · none · ref 22
LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading · cs.MA · 2025-07-20 · unverdicted · none · ref 23
Beyond Similarity: Temporal Operator Attention for Time Series Analysis · cs.LG · 2026-05-11 · unverdicted · none · ref 31
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer · cs.LG · 2026-04-25 · unverdicted · none · ref 30
Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data · cs.LG · 2026-04-25 · unverdicted · none · ref 27
GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation · cs.LG · 2026-03-21 · unverdicted · none · ref 17
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights · cs.CL · 2025-10-06 · unverdicted · none · ref 62
A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma · cs.AI · 2026-05-02 · unverdicted · none · ref 34
The General Theory of Localization Methods · cs.LG · 2026-05-20 · unreviewed · ref 145
