Attention Is All You Need (NeurIPS)

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

cs.CL · 2024-12-30 · unverdicted · novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A gated residual KAN framework called Temporal Functional Circuits maps edge functions to input lags, ranks them by activation, and validates faithfulness via interventions showing that learned B-splines add predictive value beyond base activations.

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

LILAC+ combines context-based, adaptation-speed, and budget-to-state safety constraints to reduce violations in continual RL under nonstationary conditions, demonstrated in simulated driving tasks.

citing papers explorer

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL · 2024-12-30 · unverdicted · none · ref 163
Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting
cs.LG · 2026-05-07 · unverdicted · none · ref 20
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
cs.LG · 2026-04-23 · unverdicted · none · ref 2
Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints
cs.LG · 2026-05-13 · unverdicted · none · ref 122
