Mechanistic design and scaling of hybrid architectures

Mechanistic design and scaling of hybrid architectures · 2024 · arXiv 2403.17844

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Towards Understanding Self-Pretraining for Sequence Classification

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Flash PD-SSM achieves FSA-level expressivity by discretely selecting one matrix from a trainable set of structured sparse transition matrices at each time step while preserving the runtime and memory efficiency of standard structured SSMs.

RubiConv -- Efficient Boundary-Respecting Convolutions

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RubiConv enables boundary-respecting convolutions on packed sequences using an efficient algorithm that outperforms both attention and standard FFT baselines in speed.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

cs.CL · 2025-10-06 · unverdicted · novelty 4.0

This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.

citing papers explorer

Showing 6 of 6 citing papers.

Towards Understanding Self-Pretraining for Sequence Classification · cs.LG · 2026-05-20 · unverdicted · none · ref 92
Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models · cs.LG · 2026-05-18 · unverdicted · none · ref 41
RubiConv -- Efficient Boundary-Respecting Convolutions · cs.LG · 2026-05-08 · unverdicted · none · ref 16
ZAYA1-8B Technical Report · cs.AI · 2026-05-06 · unverdicted · none · ref 51
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution · cs.CL · 2025-09-17 · unverdicted · none · ref 189
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights · cs.CL · 2025-10-06 · unverdicted · none · ref 40
