Advances in Neural Information Processing Systems

3 Pith papers cite this work. Polarity classification is still indexing.

fields

cs.LG 3

years

2026 3

representative citing papers

Can Transformers Learn to Verify During Backtracking Search?

cs.LG · 2026-05-21 · conditional · novelty 7.0

Decoder-only transformers fail to base verification decisions solely on current search state in cumulative traces because of scattered retrieval and history entanglement; Selective State Attention enforces state-only decisions via a fixed mask.

Mechanisms of Misgeneralization in Physical Sequence Modeling

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a kernel…

citing papers explorer

Showing 3 of 3 citing papers.

  • Can Transformers Learn to Verify During Backtracking Search? cs.LG · 2026-05-21 · conditional · none · ref 57

    Decoder-only transformers fail to base verification decisions solely on current search state in cumulative traces because of scattered retrieval and history entanglement; Selective State Attention enforces state-only decisions via a fixed mask.

  • Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why cs.LG · 2026-05-11 · unverdicted · none · ref 19

    Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

  • Mechanisms of Misgeneralization in Physical Sequence Modeling cs.LG · 2026-05-19 · unverdicted · none · ref 112

    Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a kernel…