pith. machine review for the scientific record.

arxiv: 2604.18780 · v1 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Streaming Structured Inference with Flash-SemiCRF

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords semi-CRF · structured inference · streaming algorithms · memory efficiency · Triton kernel · sequence labeling · exact inference · prefix sums

The pith

Flash-SemiCRF replaces the large edge-potential tensor with on-the-fly prefix-sum lookup and streaming forward-backward to make exact semi-CRF inference feasible on long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Semi-CRFs label entire segments instead of single tokens, which gives clean uncertainty at boundaries but forces every implementation to store an edge-potential tensor whose size grows with sequence length, maximum segment length, and label count. For speech or genomic data this tensor quickly exceeds GPU memory. The paper removes the tensor entirely by recovering the needed potentials on the fly from a compact prefix-sum array, pairing this with a streaming forward-backward pass that checkpoints only at segment boundaries, and stabilizing the arithmetic with zero-centered cumulative scores. The resulting fused Triton kernel runs exact inference and exact gradients on problem sizes that were previously intractable.
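To make the lookup concrete, here is a minimal sketch of the prefix-sum idea (our illustration, not the paper's kernel), assuming the common parameterization in which a segment's content score is the sum of per-position label scores:

    import numpy as np

    # One (L+1, Y) prefix-sum array stands in for a materialized
    # (L, K, Y) segment-score tensor (shapes are our assumption).
    L, Y = 12, 3                      # sequence length, label count
    rng = np.random.default_rng(0)
    phi = rng.normal(size=(L, Y))     # per-position, per-label scores

    P = np.zeros((L + 1, Y))
    P[1:] = np.cumsum(phi, axis=0)

    def segment_score(i, j, y):
        """Content score of segment [i, j) with label y: two lookups."""
        return P[j, y] - P[i, y]

    # Agrees with direct summation over the segment.
    assert np.isclose(segment_score(3, 9, 1), phi[3:9, 1].sum())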

Core claim

Exact semi-CRF inference on long sequences is achieved by evaluating edge potentials from a prefix-sum array rather than materializing them, combined with a streaming forward-backward algorithm whose working memory remains sublinear in sequence length while gradients stay exact.

What carries the argument

Flash-SemiCRF, the fused Triton kernel that performs prefix-sum lookup for edge potentials, streaming forward-backward with checkpoint-boundary normalization, and zero-centered cumulative scores.
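As a sense check on the streaming half of that machinery, the sketch below runs the forward pass for a simplified semi-CRF (label-pair transitions omitted; that simplification and all shapes are ours, not the paper's) with an O(K × Y) ring buffer and a scalar accumulator that banks normalization constants, so the log-partition value stays exact; the backward pass streams analogously:

    import numpy as np

    def lse(a, axis):
        m = a.max(axis=axis, keepdims=True)
        return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

    def streaming_log_partition(phi, K):
        """Forward pass buffering only the last K+1 boundary states."""
        L, Y = phi.shape
        P = np.vstack([np.zeros((1, Y)), np.cumsum(phi, axis=0)])
        buf = np.full((K + 1, Y), -np.inf)  # buf[i % (K+1)] = beta(i) - P[i]
        buf[0] = -P[0]                      # beta(0) = 0: empty prefix
        shift = 0.0                         # banked normalization constants
        for j in range(1, L + 1):
            rows = np.stack([buf[i % (K + 1)]
                             for i in range(max(0, j - K), j)])
            alpha_j = P[j] + lse(rows, axis=0)   # alpha(j, y) for all y
            beta_j = lse(alpha_j, axis=-1)       # total weight of prefix j
            buf[j % (K + 1)] = beta_j - P[j]
            buf -= beta_j    # recenter live entries near zero ...
            shift += beta_j  # ... and bank the constant; later values
        return float(shift)  # inherit the shift, so the total is exact

    # Agreement check against a naive forward pass that keeps all of beta.
    rng = np.random.default_rng(1)
    phi, K = rng.normal(size=(40, 4)), 6
    L, Y = phi.shape
    P = np.vstack([np.zeros((1, Y)), np.cumsum(phi, axis=0)])
    beta = np.full(L + 1, -np.inf)
    beta[0] = 0.0
    for j in range(1, L + 1):
        rows = np.stack([beta[i] + P[j] - P[i]
                         for i in range(max(0, j - K), j)])
        beta[j] = lse(rows.reshape(-1), axis=0)
    assert np.isclose(streaming_log_partition(phi, K), beta[L])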

If this is right

  • Exact segment-level inference becomes practical for sequences longer than 100,000 positions and for label sets that were previously too large.
  • Memory footprint shrinks by a factor proportional to the product of maximum segment length and label count (worked numbers follow this list).
  • Working memory during training stays sublinear in sequence length while gradients remain exact.
  • Zero-centered scores stabilize numerics and automatically induce an adaptive duration prior when labels are imbalanced.
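A back-of-envelope illustration of that factor, with shapes assumed for concreteness rather than taken from the paper: at sequence length L = 100,000, maximum segment length K = 64, and |Y| = 50 labels, an edge tensor of shape (L, K, |Y|, |Y|) holds 1.6 × 10^10 entries, roughly 64 GB in fp32, while a prefix-sum array of shape (L, |Y|) holds 5 × 10^6 entries, roughly 20 MB. The ratio, K · |Y| = 3,200, is exactly the stated product of maximum segment length and label count.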

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prefix-sum and streaming pattern could be applied to other segment-based structured models outside semi-CRFs.
  • Integration with existing PyTorch models for genomic annotation would become straightforward once the kernel is available.
  • Empirical tests on real long-sequence benchmarks would show whether the induced duration prior improves boundary accuracy.

Load-bearing premise

Edge potentials can be exactly recovered from a compact prefix-sum array without numerical drift or storage that grows with sequence length and label count.

What would settle it

Run both Flash-SemiCRF and a reference semi-CRF implementation on a sequence of length 5,000 with 50 labels where both fit in memory and verify that the computed log-likelihood and parameter gradients agree to within floating-point tolerance.
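That comparison is mechanical to set up. The sketch below runs it at toy scale for the simplified transition-free model used in the sketches above (sizes shrunk so the naive baseline is quick; the function names are ours, and the real test would swap Flash-SemiCRF in for prefix_logZ at the stated L = 5,000, |Y| = 50):

    import torch

    def naive_logZ(phi, K):
        """Reference: materializes every segment score (O(L*K*Y) memory)."""
        L, Y = phi.shape
        seg = {(i, j): phi[i:j].sum(dim=0)        # the edge potentials
               for i in range(L) for j in range(i + 1, min(i + K, L) + 1)}
        beta = [phi.new_zeros(())]
        for j in range(1, L + 1):
            terms = torch.stack([beta[i] + seg[(i, j)]
                                 for i in range(max(0, j - K), j)])
            beta.append(torch.logsumexp(terms.reshape(-1), dim=0))
        return beta[L]

    def prefix_logZ(phi, K):
        """Same quantity via prefix-sum lookup; no stored edge tensor."""
        L, Y = phi.shape
        P = torch.cat([phi.new_zeros(1, Y), phi.cumsum(dim=0)])
        beta = [phi.new_zeros(())]
        for j in range(1, L + 1):
            terms = torch.stack([beta[i] + P[j] - P[i]
                                 for i in range(max(0, j - K), j)])
            beta.append(torch.logsumexp(terms.reshape(-1), dim=0))
        return beta[L]

    torch.manual_seed(0)
    phi = torch.randn(64, 5, dtype=torch.double, requires_grad=True)
    a, b = naive_logZ(phi, K=8), prefix_logZ(phi, K=8)
    # Compare both the log-likelihood values and the parameter gradients.
    (ga,), (gb,) = torch.autograd.grad(a, phi), torch.autograd.grad(b, phi)
    assert torch.allclose(a, b) and torch.allclose(ga, gb)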

read the original abstract

Semi-Markov Conditional Random Fields (semi-CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference over segment-level features and principled uncertainty estimates at their boundaries. However, existing implementations must materialize a large edge potential tensor whose size grows with sequence length, maximum segment length, and label count, becoming prohibitive for speech-scale state spaces and intractable at genomic scales where sequences can exceed 100,000 positions. This memory bottleneck has limited the adoption of exact segment-level inference for long sequences and large label sets. We identify that the core inefficiency is materializing edge potentials that can instead be evaluated on-the-fly from a compact prefix-sum array, and make several improvements. First, replacing the stored edge tensor with prefix-sum lookup reduces the memory footprint by a factor proportional to the product of segment length and label count. Second, a streaming forward-backward pass with checkpoint-boundary normalization keeps working memory sublinear in sequence length while preserving exact gradients. Third, zero-centered cumulative scores control numerical drift and induce an adaptive duration prior under label imbalance. We integrate these ideas into Flash-SemiCRF, a fused Triton kernel that enables exact semi-CRF inference on previously intractable problem sizes. Available at https://github.com/biobenkj/flash-semicrf.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims to enable exact semi-CRF inference on long sequences (e.g., >100k positions) and large label spaces by replacing materialization of the edge-potential tensor with on-the-fly evaluation from a compact prefix-sum array, using a streaming forward-backward pass with checkpoint-boundary normalization to achieve sublinear working memory while preserving exact gradients, and applying zero-centered cumulative scores to control drift and induce an adaptive duration prior. These are integrated into a fused Triton kernel called Flash-SemiCRF.

Significance. If the exactness and stability claims hold, the work would remove a key memory barrier that has restricted exact segment-level inference in semi-CRFs, making principled boundary uncertainty estimates practical for speech-scale and genomic-scale problems where only approximate methods were previously feasible. The memory reduction scaling with segment length and label count, together with the open-source kernel, would be a concrete advance for structured prediction on long sequences.

major comments (2)
  1. [Abstract] The central claim that 'exact gradients are preserved' and that the method enables 'exact semi-CRF inference' rests on recovering full edge potentials from the prefix-sum array under streaming normalization, yet no derivation, equivalence proof, or floating-point error bound is supplied for sequences with L ≫ 10^4; this directly undermines the 'exact' qualifier.
  2. [Abstract] The assumption that edge potentials are exactly recoverable from the compact prefix-sum array, without numerical drift or extra storage scaling with sequence length and label count, is load-bearing for the memory and exactness claims, but the manuscript provides neither an error analysis nor empirical verification of stability under repeated normalization steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for identifying the need for stronger justification of the exactness and stability claims. We address each major comment below and will revise the manuscript to incorporate the requested derivations, proofs, and analyses.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'exact gradients are preserved' and that the method enables 'exact semi-CRF inference' rests on recovering full edge potentials from the prefix-sum array under streaming normalization, yet no derivation, equivalence proof, or floating-point error bound is supplied for sequences with L ≫ 10^4; this directly undermines the 'exact' qualifier.

    Authors: We agree that an explicit derivation is needed to substantiate the exactness claim. The streaming forward-backward with checkpoint-boundary normalization is constructed so that each local normalization factor cancels in the final marginals and gradients, recovering the same values as the standard algorithm (modulo floating-point rounding). In the revision we will add a formal equivalence proof in an appendix, together with a floating-point error analysis that bounds accumulated drift for L > 10^4 under the zero-centered cumulative scores. We will also report empirical differences versus a naive implementation on sequences up to 200k positions; the cancellation argument is sketched after these responses. revision: yes

  2. Referee: [Abstract] The assumption that edge potentials are exactly recoverable from the compact prefix-sum array, without numerical drift or extra storage scaling with sequence length and label count, is load-bearing for the memory and exactness claims, but the manuscript provides neither an error analysis nor empirical verification of stability under repeated normalization steps.

    Authors: The prefix-sum representation stores only cumulative sums of edge potentials, so recovery is exact in exact arithmetic and storage scales linearly with sequence length (not with max-segment-length × label-count). The zero-centered cumulative scores are introduced precisely to bound numerical drift. We will add both a theoretical error analysis showing that repeated normalizations do not introduce bias beyond machine epsilon and new experiments that measure the L1 difference from the non-streaming baseline across increasing sequence lengths and numbers of normalization steps; a telescoping identity consistent with this claim is sketched below. revision: yes
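Two short derivations make these responses concrete; both are our renderings under stated assumptions, not excerpts from the paper.

First, the cancellation claim in response 1 is an instance of the standard scaled-forward identity: dividing every live forward message at checkpoint k by a constant c_k divides the weight w of every complete segmentation, and hence the partition function Z, by the same product, so

    \hat{Z} = \frac{Z}{\prod_k c_k}, \qquad
    \log Z = \log \hat{Z} + \sum_k \log c_k, \qquad
    \mu(i, j, y) = \frac{w(i, j, y)}{Z} = \frac{\hat{w}(i, j, y)}{\hat{Z}} .

Marginals are ratios and therefore scale-free, and since the gradients of log Z with respect to the potentials are exactly these marginals, they are preserved as well; log Z itself is recovered by banking the constants.

Second, under one reading of 'zero-centered cumulative scores' (our assumption: centering subtracts the per-position mean over labels, m_t = |Y|^{-1} \sum_y C_t(y), giving \tilde{C}_t(y) = C_t(y) - m_t), the drift control is an exact invariance. For any segmentation 0 = i_0 < i_1 < \dots < i_n = L with labels y_1, \dots, y_n,

    \sum_{k=1}^{n} \big[ \tilde{C}_{i_k}(y_k) - \tilde{C}_{i_{k-1}}(y_k) \big]
      = \sum_{k=1}^{n} \big[ C_{i_k}(y_k) - C_{i_{k-1}}(y_k) \big] - (m_L - m_0),

because the mean terms telescope over any contiguous partition. Every segmentation's score shifts by the same constant, so marginals and gradients are untouched while the stored cumulative values stay near zero.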

Circularity Check

0 steps flagged

No circularity; algorithmic replacements are independent of inputs

full rationale

The derivation chain consists of concrete algorithmic substitutions: edge potentials evaluated on-the-fly from prefix sums, streaming forward-backward with checkpoint normalization, and zero-centered scores. These steps are presented as direct engineering replacements that reduce memory while preserving exact gradients, without any equation reducing to a fitted parameter, self-definition, or self-citation chain. No load-bearing uniqueness theorem or ansatz is imported; the central claim of exact inference on large sizes follows from the stated data structures and passes rather than tautological re-expression of the original dynamic program.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the standard semi-CRF forward-backward recurrence and the assumption that edge potentials admit an exact prefix-sum representation; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The semi-CRF edge potentials can be exactly recovered from a compact prefix-sum array without loss of information.
    Stated as the core inefficiency fix in the abstract.
  • domain assumption Checkpoint-boundary normalization preserves exact gradients during the streaming forward-backward pass.
    Claimed to keep working memory sublinear while preserving exactness.

pith-pipeline@v0.9.0 · 5536 in / 1288 out tokens · 32300 ms · 2026-05-10T05:32:27.744353+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 10 canonical work pages · 1 internal anchor

  1. Sarawagi, S., Cohen, W.W.: Semi-Markov conditional random fields for information extraction. In: Advances in Neural Information Processing Systems, vol. 17 (2004)

  2. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proc. NAACL-HLT, pp. 260–270 (2016). https://doi.org/10.18653/v1/N16-1030

  3. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proc. ACL, pp. 1064–1074 (2016). https://doi.org/10.18653/v1/P16-1101

  4. Kong, L., Dyer, C., Smith, N.A.: Segmental recurrent neural networks. In: Proc. ICLR (2016)

  5. Ye, Z., Ling, Z.-H.: Hybrid semi-Markov CRF for neural sequence labeling. In: Proc. ACL, pp. 235–240 (2018). https://doi.org/10.18653/v1/P18-2038

  6. Rush, A.: Torch-struct: Deep structured prediction library. In: Proc. ACL: System Demonstrations, pp. 335–342 (2020). https://doi.org/10.18653/v1/2020.acl-demos.38

  7. Oxford Nanopore Technologies: Bonito: A PyTorch Basecaller for Oxford Nanopore Reads. https://github.com/nanoporetech/bonito

  8. Oxford Nanopore Technologies: Dorado: Oxford Nanopore's Basecaller. https://github.com/nanoporetech/dorado

  9. Semwal, A., Morrison, J., Beddows, I., Palmer, T., Majewski, M.F., Jang, H.J., Johnson, B.K., Shen, H.: Tranquillyzer: A flexible neural network framework for structural annotation and demultiplexing of long-read transcriptomes. bioRxiv (2025). https://doi.org/10.1101/2025.07.25.666829

  10. DeCaprio, D., Vinson, J.P., Pearson, M.D., Montgomery, P., Doherty, M., Galagan, J.E.: Conrad: Gene prediction using conditional random fields. Genome Research 17(9), 1389–1398 (2007). https://doi.org/10.1101/gr.6558107

  11. Bernal, A., Crammer, K., Hatzigeorgiou, A., Pereira, F.: Global discriminative learning for higher-accuracy computational gene prediction. PLoS Computational Biology 3(3), e54 (2007). https://doi.org/10.1371/journal.pcbi.0030054

  12. Zaratiana, U., Holat, P., Tomeh, N., Charnois, T.: Filtered semi-Markov CRF. In: Findings of EMNLP, pp. 13553–13564 (2023)

  13. Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In: Advances in Neural Information Processing Systems, vol. 35 (2022)

  14. Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: Proc. ICML (2024). Preprint arXiv:2312.00752 (2023)

  15. Efron, B.: Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, Cambridge (2010)

  16. Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., Zue, V.: DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. National Institute of Standards and Technology (NIST), Gaithersburg, MD (1993)

  17. Lee, K.-F., Hon, H.-W.: Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing 37(11), 1641–1648 (1989)

  18. Salvi, G.: Segment boundary detection via class entropy measurements in connectionist phoneme recognition. Speech Communication 48(12), 1666–1676 (2006). https://doi.org/10.1016/j.specom.2006.07.009

  19. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27(3), 379–423 (1948). https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

  20. Goodman, J.: Semiring parsing. Computational Linguistics 25(4), 573–605 (1999)

  21. Chen, T., Xu, B., Zhang, C., Guestrin, C.: Training deep nets with sublinear memory cost. arXiv:1604.06174 (2016)

  22. Blelloch, G.E.: Prefix sums and their applications. Technical Report CMU-CS-90-190, Carnegie Mellon University (1990)

  23. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: Proceedings of the 24th National Conference of the ACM, pp. 157–172 (1969)

  24. George, A., Liu, J.W.H.: Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ (1981)