Differential transformer

Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei · 2024 · arXiv 2410.05258

22 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

PithTrain: A Compact and Agent-Native MoE Training System

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

PithTrain is a compact agent-native MoE training system that matches production throughput and improves agent-task efficiency by up to 62% fewer turns and 64% less GPU time on the new ATE-Bench.

Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

cs.LG · 2025-12-08 · conditional · novelty 7.0

FDA differentially subtracts function-word cross-attention from original attention heads to cut attack success rates by 18-90% across models and tasks while dropping performance by at most 0.6%.

IAFormer: Interaction-Aware Transformer network for collider data analysis

hep-ph · 2025-05-06 · unverdicted · novelty 7.0

IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an order of magnitude fewer parameters than prior Particle Transformer models.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

Proposes HADT, a heterogeneous multi-agent differential transformer with relational observations-actions tokenization for model-free RL-based autonomous resource management in EO satellite clusters, claiming gains over baselines and adaptability to cluster size changes.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

cs.LG · 2026-05-20 · conditional · novelty 6.0

Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

TOA augments attention with learnable sequence-space operators and stochastic regularization to enable signed temporal mixing, yielding gains on forecasting and related benchmarks when added to PatchTST and iTransformer.

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.

SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing

cs.DB · 2026-04-16 · unverdicted · novelty 6.0

SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% token budget on benchmarks like QuALITY-hard.

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

Brain-OF is a multimodal foundation model for fMRI, EEG and MEG using any-resolution sampling, DINT attention with sparse MoE, and masked temporal-frequency pretraining on ~40 datasets to achieve superior downstream performance.

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

cs.LG · 2025-11-03 · unverdicted · novelty 6.0

Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.

LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

cs.MA · 2025-07-20 · unverdicted · novelty 6.0

An LLM-enhanced MARL system with differential attention critic produces lower economic costs and voltage violations than baselines in simulated real-time P2P electricity trading.

SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models

cs.LG · 2026-06-20 · unverdicted · novelty 5.0 · 2 refs

SamatNext v0.2-B reaches 100% on Stage 5 and retains 98.8% of Stage 3 behavior versus 97.6% and 6% for the Transformer baseline in a controlled curriculum setting.

Building The Ph(ysical)AI Layer Of Machine Intelligence

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

A principle-driven RF encoder achieves 77.7% average accuracy across 15 cross-modal tasks, performing better on physically grounded tasks than semantic ones.

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

CLMM is a two-stage contrastive learning framework using CNN-DiffTransformer encoders and dual-branch fusion to improve multimodal human activity recognition under limited labels.

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

cs.LG · 2026-03-21 · unverdicted · novelty 5.0

GraphDiffMed integrates dual-scale differential attention with pharmacological graph priors to improve medication recommendation quality, ranking, and safety balance on MIMIC-III data.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

Bridging the Gap Between Natural Language and Market Dynamics via High-Dimensional Representation Learning

cs.LG · 2026-05-28 · unverdicted · novelty 3.0

Siamese-optimized high-dimensional FinBERT embeddings outperform scalar sentiment baselines and raw embeddings for short-term stock price prediction on the FNSPID dataset.

The General Theory of Localization Methods

cs.LG · 2026-05-20 · unverdicted · novelty 3.0 · 2 refs

The localization method is presented as a unifying framework connecting kernel methods, MeanShift, Hopfield networks, LLE, fuzzy inference, denoising autoencoders, and Transformers via local models and the localization trick.

A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma

cs.AI · 2026-05-02 · unverdicted · novelty 3.0

AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.

citing papers explorer

Showing 20 of 20 citing papers after filters.

PithTrain: A Compact and Agent-Native MoE Training System
cs.LG · 2026-05-29 · unverdicted · none · ref 43
PithTrain is a compact agent-native MoE training system that matches production throughput and improves agent-task efficiency by up to 62% fewer turns and 64% less GPU time on the new ATE-Bench.

IAFormer: Interaction-Aware Transformer network for collider data analysis
hep-ph · 2025-05-06 · unverdicted · none · ref 46
IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an order of magnitude fewer parameters than prior Particle Transformer models.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers
cs.LG · 2026-05-29 · unverdicted · none · ref 23
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster
cs.AI · 2026-05-29 · unverdicted · none · ref 26
Proposes HADT, a heterogeneous multi-agent differential transformer with relational observations-actions tokenization for model-free RL-based autonomous resource management in EO satellite clusters, claiming gains over baselines and adaptability to cluster size changes.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
cs.CV · 2026-05-27 · unverdicted · none · ref 68
ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

Beyond Similarity: Temporal Operator Attention for Time Series Analysis
cs.LG · 2026-05-11 · unverdicted · none · ref 31 · 2 links
TOA augments attention with learnable sequence-space operators and stochastic regularization to enable signed temporal mixing, yielding gains on forecasting and related benchmarks when added to PatchTST and iTransformer.

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
cs.CL · 2026-05-11 · unverdicted · none · ref 38
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.

SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing
cs.DB · 2026-04-16 · unverdicted · none · ref 50
SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% token budget on benchmarks like QuALITY-hard.

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG
cs.LG · 2026-02-26 · unverdicted · none · ref 14
Brain-OF is a multimodal foundation model for fMRI, EEG and MEG using any-resolution sampling, DINT attention with sparse MoE, and masked temporal-frequency pretraining on ~40 datasets to achieve superior downstream performance.

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
cs.LG · 2025-11-03 · unverdicted · none · ref 22
Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.

LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading
cs.MA · 2025-07-20 · unverdicted · none · ref 23
An LLM-enhanced MARL system with differential attention critic produces lower economic costs and voltage violations than baselines in simulated real-time P2P electricity trading.

SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models
cs.LG · 2026-06-20 · unverdicted · none · ref 8 · 2 links
SamatNext v0.2-B reaches 100% on Stage 5 and retains 98.8% of Stage 3 behavior versus 97.6% and 6% for the Transformer baseline in a controlled curriculum setting.

Building The Ph(ysical)AI Layer Of Machine Intelligence
cs.LG · 2026-06-02 · unverdicted · none · ref 25
A principle-driven RF encoder achieves 77.7% average accuracy across 15 cross-modal tasks, performing better on physically grounded tasks than semantic ones.

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
cs.LG · 2026-04-25 · unverdicted · none · ref 30
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data
cs.LG · 2026-04-25 · unverdicted · none · ref 27
CLMM is a two-stage contrastive learning framework using CNN-DiffTransformer encoders and dual-branch fusion to improve multimodal human activity recognition under limited labels.

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation
cs.LG · 2026-03-21 · unverdicted · none · ref 17
GraphDiffMed integrates dual-scale differential attention with pharmacological graph priors to improve medication recommendation quality, ranking, and safety balance on MIMIC-III data.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
cs.CL · 2025-10-06 · unverdicted · none · ref 62
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

Bridging the Gap Between Natural Language and Market Dynamics via High-Dimensional Representation Learning
cs.LG · 2026-05-28 · unverdicted · none · ref 26
Siamese-optimized high-dimensional FinBERT embeddings outperform scalar sentiment baselines and raw embeddings for short-term stock price prediction on the FNSPID dataset.

The General Theory of Localization Methods
cs.LG · 2026-05-20 · unverdicted · none · ref 22 · 2 links
The localization method is presented as a unifying framework connecting kernel methods, MeanShift, Hopfield networks, LLE, fuzzy inference, denoising autoencoders, and Transformers via local models and the localization trick.

A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
cs.AI · 2026-05-02 · unverdicted · none · ref 34
AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.
