Differential transformer
22 Pith papers cite this work. Polarity classification is still indexing.
roles
background 4
polarities
background 4
representative citing papers
FDA differentially subtracts function-word cross-attention from original attention heads to cut attack success rates by 18-90% across models and tasks while dropping performance by at most 0.6%.
Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.
citing papers explorer
-
PithTrain: A Compact and Agent-Native MoE Training System
PithTrain is a compact agent-native MoE training system that matches production throughput and improves agent-task efficiency on the new ATE-Bench, using up to 62% fewer turns and 64% less GPU time.
-
IAFormer: Interaction-Aware Transformer network for collider data analysis
IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an order of magnitude fewer parameters than prior Particle Transformer models.
-
Contribution Weights: A Geometrical Analysis of Self-Attention Transformers
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
-
HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster
Proposes HADT, a heterogeneous multi-agent differential transformer with relational observations-actions tokenization for model-free RL-based autonomous resource management in EO satellite clusters, claiming gains over baselines and adaptability to cluster size changes.
-
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.
-
Beyond Similarity: Temporal Operator Attention for Time Series Analysis
TOA augments attention with learnable sequence-space operators and stochastic regularization to enable signed temporal mixing, yielding gains on forecasting and related benchmarks when added to PatchTST and iTransformer.
-
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.
-
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing
SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% token budget on benchmarks like QuALITY-hard.
-
Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG
Brain-OF is a multimodal foundation model for fMRI, EEG and MEG using any-resolution sampling, DINT attention with sparse MoE, and masked temporal-frequency pretraining on ~40 datasets to achieve superior downstream performance.
-
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.
-
LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading
An LLM-enhanced MARL system with differential attention critic produces lower economic costs and voltage violations than baselines in simulated real-time P2P electricity trading.
-
SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models
SamatNext v0.2-B reaches 100% on Stage 5 and retains 98.8% of Stage 3 behavior versus 97.6% and 6% for the Transformer baseline in a controlled curriculum setting.
-
Building The Ph(ysical)AI Layer Of Machine Intelligence
A principle-driven RF encoder achieves 77.7% average accuracy across 15 cross-modal tasks, performing better on physically grounded tasks than semantic ones.
-
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.
-
Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data
CLMM is a two-stage contrastive learning framework using CNN-DiffTransformer encoders and dual-branch fusion to improve multimodal human activity recognition under limited labels.
-
GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation
GraphDiffMed integrates dual-scale differential attention with pharmacological graph priors to improve medication recommendation quality, ranking, and safety balance on MIMIC-III data.
-
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.
-
Bridging the Gap Between Natural Language and Market Dynamics via High-Dimensional Representation Learning
Siamese-optimized high-dimensional FinBERT embeddings outperform scalar sentiment baselines and raw embeddings for short-term stock price prediction on the FNSPID dataset.
-
The General Theory of Localization Methods
The localization method is presented as a unifying framework connecting kernel methods, MeanShift, Hopfield networks, LLE, fuzzy inference, denoising autoencoders, and Transformers via local models and the localization trick.
-
A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.