mHC: Manifold-Constrained Hyper-Connections
32 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.
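The abstract does not name the manifold. A citing-paper summary on this page refers to doubly stochastic matrices (the Birkhoff polytope), so the sketch below assumes that constraint and uses Sinkhorn-style alternating normalization as a stand-in projection. The function name `sinkhorn`, the stream count, the depth, and the random inputs are all hypothetical; this is an illustration of why such a constraint keeps a deep stack of stream-mixing matrices well behaved, not the paper's implementation.

```python
import numpy as np

def sinkhorn(logits, n_iters=200):
    """Alternately normalize rows and columns of exp(logits).

    Assumption: the result only approximates a doubly stochastic matrix;
    columns are exact after the last step, rows are exact in the limit.
    """
    m = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # make each row sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # make each column sum to 1
    return m

rng = np.random.default_rng(0)
n_streams, n_layers = 4, 32          # hypothetical sizes
x = rng.normal(size=(n_streams, 8))  # hypothetical residual streams

# Compose one constrained mixing matrix per layer.
product = np.eye(n_streams)
for _ in range(n_layers):
    h = sinkhorn(rng.normal(size=(n_streams, n_streams)))
    product = h @ product

row_err = np.abs(product.sum(axis=1) - 1).max()
col_err = np.abs(product.sum(axis=0) - 1).max()
# Unit column sums mean the average over streams passes through unchanged.
mean_drift = np.abs((product @ x).mean(axis=0) - x.mean(axis=0)).max()
# Spectral norm of the composed map; it stays near 1 instead of growing.
spec_norm = np.linalg.norm(product, 2)

print(row_err, col_err, mean_drift, spec_norm)
```

Under these assumptions the composed matrix keeps row and column sums close to 1 at any depth, which is one way to read the "identity mapping property" the abstract says is restored: the stream average is carried through the stack without amplification or decay.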
years
2026: 32
roles
background: 4
citing papers explorer
- TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes
TBP-mHC proposes parameterizations of the Birkhoff polytope via transportation polytopes that achieve exact double stochasticity for hyper-connections using only (n-1)^2 degrees of freedom.
- Delta Attention Residuals
Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals across 220M-7.6B models.
- Efficient and provably convergent end-to-end training of deep neural networks with linear constraints
An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.
- FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
- Transformers with Selective Access to Early Representations
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
- Can an MLP Absorb Its Own Skip Connection?
Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
- Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
- LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
- Deep Delta Learning
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
- Rethinking Cross-Layer Information Routing in Diffusion Transformers
DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.
- AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training
AOT-POT adaptively reshapes complex PDE solution operators via input-dependent transformations and parallel stream mixing to enable effective large-scale pre-training, yielding SOTA results on 12 benchmarks with minimal added parameters.
- Optimistic Dual Averaging Unifies Modern Optimizers
SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.
- The EΔ-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality
The EΔ-MHC-Geo Transformer achieves input-adaptive unconditionally orthogonal residual connections via a Cayley-based rotation that works for all parameters, combined with a learned hybrid gate for reflections.
- Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS
Graph Normalization is a convergent dynamical system that approximates MWIS by always reaching a binary maximum independent set via majorization-minimization and evolutionary game equivalence.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- Beyond the Laplacian: Doubly Stochastic Matrices for Graph Neural Networks
DsmNet substitutes Laplacian matrices with approximated doubly stochastic matrices in GNNs, using Neumann truncation and residual mass compensation to achieve O(K|E|) efficiency and bound Dirichlet energy decay for reduced over-smoothing.
- ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
- LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
- SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
- The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
- S³GNN: Efficient Global Mixing and Local Message Passing for Long-Range Graph Learning
S³GNN mitigates oversquashing in message-passing networks via lightweight global mixing without strong prior assumptions, yielding up to 10x error reduction and 50% fewer parameters across multiple domains.
- mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
- Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
- Hyperloop Transformers
Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
- Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
- Attention Residuals
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
- YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference
YOCO++ enhances YOCO by adding weighted residual KV connections from bottom layers, delivering state-of-the-art results among cross-layer compression methods at 50% KV cache reduction and outperforming the standard Transformer.
- Multi-Gate Residuals
Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.
- SNLP: Layer-Parallel Inference via Structured Newton Corrections
- Exact Linear Attention
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
- Beyond Linearity in Attention Projections: The Case for Nonlinear Queries