Title resolution pending

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Depth-Attention: Cross-Layer Value Mixing for Language Models

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

Depth-Attention mixes values from earlier layers into the current attention value by having the query attend to previous-layer keys at the same position, yielding lower perplexity and up to 2.3 points higher average accuracy than vanilla transformers on Qwen3-style models with negligible extra FLOPs.

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.

Understanding the Prompt Sensitivity

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.

citing papers explorer

Showing 7 of 7 citing papers after filters.

  • Transformers Provably Learn to Internalize Chain-of-Thought cs.LG · 2026-05-27 · unverdicted · none · ref 52

    L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

  • Depth-Attention: Cross-Layer Value Mixing for Language Models cs.CL · 2026-06-03 · unverdicted · none · ref 14

    Depth-Attention mixes values from earlier layers into the current attention value by having the query attend to previous-layer keys at the same position, yielding lower perplexity and up to 2.3 points higher average accuracy than vanilla transformers on Qwen3-style models with negligible extra FLOPs.

  • Neural posterior estimation of the neutrino direction in IceCube using transformer-encoded normalizing flows on the sphere hep-ex · 2026-04-21 · unverdicted · none · ref 49

    A transformer-encoded spherical normalizing flow achieves state-of-the-art angular resolution for IceCube neutrino tracks and showers, improving median resolution by factors of 1.3-2.5 over B-spline likelihoods at 100 TeV and outperforming prior ML methods for muons.

  • AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training cs.LG · 2026-05-15 · unverdicted · none · ref 61

    AOT-POT adaptively reshapes complex PDE solution operators via input-dependent transformations and parallel stream mixing to enable effective large-scale pre-training, yielding SOTA results on 12 benchmarks with minimal added parameters.

  • SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm cs.LG · 2026-02-08 · unverdicted · none · ref 24

    SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.

  • Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation cs.LG · 2026-06-02 · unverdicted · none · ref 2

    Hyper-Connections models show stream collapse to a dominant stream with near-identity residual mixing after seeding; symmetry-breaking initialization mitigates dominance and raises performance.

  • Understanding the Prompt Sensitivity cs.CL · 2026-04-20 · unverdicted · none · ref 31

    LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.