Title resolution pending

ResiDual: Transformer with Dual Residual Connections · 2023 · arXiv 2304.14802

7 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Transformers Provably Learn to Internalize Chain-of-Thought

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

Depth-Attention: Cross-Layer Value Mixing for Language Models

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

Depth-Attention mixes values from earlier layers into the current attention value by having the query attend to previous-layer keys at the same position, yielding lower perplexity and up to 2.3 points higher average accuracy than vanilla transformers on Qwen3-style models with negligible extra FLOPs.

Neural posterior estimation of the neutrino direction in IceCube using transformer-encoded normalizing flows on the sphere

hep-ex · 2026-04-21 · unverdicted · novelty 7.0

A transformer-encoded spherical normalizing flow achieves state-of-the-art angular resolution for IceCube neutrino tracks and showers, improving median resolution by factors of 1.3-2.5 over B-spline likelihoods at 100 TeV and outperforming prior ML methods for muons.

AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

AOT-POT adaptively reshapes complex PDE solution operators via input-dependent transformations and parallel stream mixing to enable effective large-scale pre-training, yielding SOTA results on 12 benchmarks with minimal added parameters.

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.

Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

Hyper-Connections models show stream collapse to a dominant stream with near-identity residual mixing after seeding; symmetry-breaking initialization mitigates dominance and raises performance.

Understanding the Prompt Sensitivity

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.

citing papers explorer

Showing 7 of 7 citing papers after filters.

Transformers Provably Learn to Internalize Chain-of-Thought · cs.LG · 2026-05-27 · unverdicted · none · ref 52
Depth-Attention: Cross-Layer Value Mixing for Language Models · cs.CL · 2026-06-03 · unverdicted · none · ref 14
Neural posterior estimation of the neutrino direction in IceCube using transformer-encoded normalizing flows on the sphere · hep-ex · 2026-04-21 · unverdicted · none · ref 49
AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training · cs.LG · 2026-05-15 · unverdicted · none · ref 61
SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm · cs.LG · 2026-02-08 · unverdicted · none · ref 24
Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation · cs.LG · 2026-06-02 · unverdicted · none · ref 2
Understanding the Prompt Sensitivity · cs.CL · 2026-04-20 · unverdicted · none · ref 31
