pith. sign in

hub

Differential transformer

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

hub tools

citation-role summary

background 4

citation-polarity summary

years

2026 17 2025 5

roles

background 4

polarities

background 4

clear filters

representative citing papers

PithTrain: A Compact and Agent-Native MoE Training System

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

PithTrain is a compact agent-native MoE training system that matches production throughput and improves agent-task efficiency by up to 62% fewer turns and 64% less GPU time on the new ATE-Bench.

IAFormer: Interaction-Aware Transformer network for collider data analysis

hep-ph · 2025-05-06 · unverdicted · novelty 7.0

IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an order of magnitude fewer parameters than prior Particle Transformer models.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

Brain-OF is a multimodal foundation model for fMRI, EEG and MEG using any-resolution sampling, DINT attention with sparse MoE, and masked temporal-frequency pretraining on ~40 datasets to achieve superior downstream performance.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

The General Theory of Localization Methods

cs.LG · 2026-05-20 · unverdicted · novelty 3.0 · 2 refs

The localization method is presented as a unifying framework connecting kernel methods, MeanShift, Hopfield networks, LLE, fuzzy inference, denoising autoencoders, and Transformers via local models and the localization trick.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.