Dynamic Chunking for Diffusion Language Models

Debing Zhang; James Kwok; Peng Zhao; Weiyu Chen; Xiaoming Shi; Yichen Zhu

arxiv: 2605.15676 · v1 · pith:HAV7A7ULnew · submitted 2026-05-15 · 💻 cs.CL

Yichen Zhu, Xiaoming Shi, Peng Zhao, Weiyu Chen, Debing Zhang, James Kwok

Pith reviewed 2026-05-20 19:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords diffusion language models · dynamic chunking · semantic chunks · chunking attention · discrete diffusion · autoregressive factorization · content-based partitioning · block diffusion

The pith

Dynamic chunking replaces fixed positional blocks in diffusion language models with content-defined semantic clusters, improving benchmark performance at parameter scales up to 1.5B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Block discrete diffusion language models factorize sequences over rigid positional blocks, which often split related tokens and group unrelated ones. The paper introduces Dynamic Chunking Diffusion Models that learn semantic chunks instead through a Chunking Attention layer. This layer assigns tokens to clusters using learnable subspaces, all optimized by the diffusion training objective. The resulting chunk-causal mask allows the denoiser to autoregress over meaningful content units rather than position. Experiments show consistent gains over both unstructured diffusion and fixed-block baselines, with the edge appearing early and holding across scales.

Core claim

Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks. DCDM replaces these with content-defined semantic chunks produced by Chunking Attention, a differentiable layer that routes tokens into K clusters parameterized by learnable subspaces. The cluster assignments induce a chunk-causal attention mask under which the discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing the positional-block approach.
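
A minimal sketch of how per-token cluster assignments could induce a chunk-causal mask, in plain Python. It assumes that clusters are ordered by their index and that a token may attend to every token in its own chunk or in an earlier chunk. The function name `chunk_causal_mask` and the example assignment are hypothetical, and the paper's actual mask (for instance, how noised and clean tokens interact) may differ.

```python
def chunk_causal_mask(assignments):
    """Return mask[i][j] = True when token i may attend to token j.

    Assumption: attention is allowed when the cluster index of token j
    is not larger than the cluster index of token i.
    """
    return [[cj <= ci for cj in assignments] for ci in assignments]


# Hypothetical cluster ids for a 6-token sequence with K = 3 clusters.
assignments = [0, 2, 0, 1, 1, 2]
mask = chunk_causal_mask(assignments)

# Render the mask: "x" marks an allowed (query row, key column) pair.
rendered = ["".join("x" if allowed else "." for allowed in row) for row in mask]
```

Under this assumption, tokens in the same chunk see each other in both directions, while attention across chunks runs only from later chunks to earlier ones, regardless of where the tokens sit in the sequence.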

What carries the argument

Chunking Attention, a differentiable layer that routes tokens into K clusters via learnable subspaces shaped end-to-end by the diffusion objective to produce a content-based chunk-causal mask.
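
The routing idea can be illustrated with a small sketch. This is not the paper's layer: it assumes each cluster is described by a short list of orthonormal basis vectors, scores a token by the squared norm of its projection onto each subspace, and turns scores into soft assignments with a softmax. The names `route`, `subspaces`, and `temperature` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def route(token, subspaces, temperature=1.0):
    """Return (soft assignment, hard cluster id) for one token vector.

    Assumption: the score of a cluster is the squared length of the
    token's projection onto that cluster's (orthonormal) basis.
    """
    scores = [sum(dot(token, b) ** 2 for b in basis) for basis in subspaces]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, probs.index(max(probs))


# Two hypothetical clusters in a 3-dimensional model space.
subspaces = [
    [[1.0, 0.0, 0.0]],                   # cluster 0: a line (one basis vector)
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # cluster 1: a plane (two basis vectors)
]
probs, hard = route([0.1, 0.9, 0.4], subspaces)
```

With a single basis vector per cluster the score reduces to a squared dot product with one direction, which is one way to read the "point" versus "subspace" distinction in Figure 4. In a trained model the basis vectors would be learnable parameters rather than fixed lists.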

Load-bearing premise

The Chunking Attention layer, when trained end-to-end with the diffusion objective, will produce cluster assignments that induce a chunk-causal mask meaningfully better than fixed positional blocks.

What would settle it

A direct comparison in which the learned cluster assignments are replaced by random or fixed groupings while keeping all other components identical, and measuring whether the performance advantage disappears.
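
A sketch of the control groupings such a comparison would need, assuming assignments are represented as one cluster id per token. All function names are hypothetical. The shuffled variant keeps cluster sizes identical to the positional baseline, so any difference would come from which tokens are grouped rather than from how many.

```python
import random


def positional_blocks(length, block_size):
    """Fixed grouping: token i belongs to block i // block_size."""
    return [i // block_size for i in range(length)]


def shuffled_blocks(length, block_size, seed=0):
    """Random grouping with the same cluster sizes as positional blocks."""
    groups = positional_blocks(length, block_size)
    random.Random(seed).shuffle(groups)
    return groups


def random_groups(length, num_clusters, seed=0):
    """Random grouping with unconstrained cluster sizes."""
    rng = random.Random(seed)
    return [rng.randrange(num_clusters) for _ in range(length)]


fixed = positional_blocks(8, 4)
control = shuffled_blocks(8, 4, seed=1)
```

Swapping either control in place of the learned assignments, with every other component unchanged, is the kind of ablation the paragraph above describes.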

Figures

Figures reproduced from arXiv: 2605.15676 by Debing Zhang, James Kwok, Peng Zhao, Weiyu Chen, Xiaoming Shi, Yichen Zhu.

**Figure 1.** Overview of DCDM. Left: The denoiser stacks N−1 DiT blocks on top of a single Chunking Attention layer that produces the content-defined partition consumed by all downstream blocks. Right: Operationally, the chunking attention takes the noisy input together with a noise mask and emits a per-token cluster assignment (color-coded as Green, Orange, Yellow), which induces the chunk-causal attention mask used b…

**Figure 2.** Training loss against training steps for the …

**Figure 3.** Scaling and training efficiency of the three dense diffusion models (MDLM, BDLM, DCDM) on the …

**Figure 4.** Point (h = 1) vs. subspace (h = 48) chunking attention at the 0.1B scale; all other training settings are identical. Left: Diffusion training loss. The subspace variant reaches a lower final loss (2.304 vs. 2.544) and follows a smoother trajectory throughout training. Right: Cluster violation, a measure of deviation from uniform centroid usage (lower values indicate more uniform routing). The subspace vari…

Original abstract

Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into $K$ clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.
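
The factorization described in the abstract can be written out as follows. The notation is an assumption made for illustration and is not taken from the paper: $c_i$ is the cluster of token $i$, $C_k$ is the set of tokens in the $k$-th chunk, and $B$ is a block size.

```latex
% Assumed notation: c_i \in \{1,\dots,K\} is the cluster of token i,
% C_k = \{\, i : c_i = k \,\} is the k-th chunk,
% x_{C_{<k}} collects the tokens of all earlier chunks.
\log p_\theta(x) \;=\; \sum_{k=1}^{K} \log p_\theta\!\left(x_{C_k} \mid x_{C_{<k}}\right)

% Positional blocks of size B are recovered as the special case
c_i \;=\; \lceil i / B \rceil , \qquad i = 1, \dots, L .
```

In block diffusion, each conditional is itself modelled by a diffusion process, so a training objective of this form would bound the terms of the sum rather than evaluate them exactly. The special case in the second line is one way to read the claim that content-defined chunks generalize positional blocks.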

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper replaces fixed positional blocks with learned semantic chunks via Chunking Attention in discrete diffusion LMs and claims stable gains, but the abstract leaves the source of those gains unclear. Chunking Attention routes tokens into K clusters using learnable subspaces, with the diffusion objective shaping the assignments to produce a chunk-causal mask. This setup strictly generalizes prior block discrete diffusion by letting the partitions depend on content rather than position alone. The end-to-end training ties the clustering directly to the denoising task, which is a clean extension of the existing framework.

They report consistent improvements over unstructured and positional-block baselines on downstream benchmarks up to 1.5B parameters, with the advantage appearing early and holding across scales.

The main soft spot is the lack of evidence that the clusters actually stay semantic. Nothing in the diffusion loss explicitly prevents subspace collapse or forces the chunks to respect meaning over positional or frequency patterns, so the extra parameters in the attention layer could explain the gains instead. The abstract also omits training details, exact baselines, ablations on the chunking component, and any statistical checks, which makes it hard to judge how much the dynamic masking contributes.

This work is for researchers focused on non-autoregressive generation and ways to add structure to diffusion language models. A reader looking for ideas on parallel decoding would find the mechanism worth a look even if the numbers require verification.

I would send it for peer review. The generalization is straightforward and the claims are specific enough that referees can check whether the chunks deliver real semantic benefit beyond added capacity.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Dynamic Chunking Diffusion Model (DCDM), which replaces fixed positional blocks in discrete diffusion language models with content-defined semantic chunks. A Chunking Attention layer routes tokens into K clusters via learnable subspaces, inducing a chunk-causal attention mask that is trained end-to-end with the diffusion objective. The paper claims this strictly generalizes block discrete diffusion and yields consistent empirical gains on downstream benchmarks at scales up to 1.5B parameters over both unstructured and positional-block baselines, with the advantage appearing early in training and remaining stable across scales.

Significance. If the reported gains are robust and attributable to semantically coherent dynamic chunking rather than additional parameters or implementation artifacts, the approach could advance diffusion language models by allowing the denoising factorization to better respect natural data structure. The end-to-end optimization of chunk boundaries without hand-crafted rules is a clean generalization of prior block-based methods and could improve both efficiency and modeling quality at scale.

major comments (2)

[§4] §4 (Experiments): The central claim of consistent improvements on downstream benchmarks lacks supporting details on training hyperparameters, baseline implementations, number of random seeds or statistical significance tests, and ablations that isolate the Chunking Attention component from the mere addition of extra parameters. This omission makes it impossible to determine whether the gains are reproducible or specifically due to content-defined chunks.
[§3.2] §3.2 (Chunking Attention): The routing mechanism performs soft assignment to K learnable subspaces followed by a hard chunk-causal mask. The diffusion loss contains no explicit term to penalize subspace collapse or to encourage clusters to capture semantic rather than positional or frequency-based structure. Without quantitative analysis of cluster diversity, assignment entropy, or qualitative inspection of chunk boundaries, it remains possible that the learned masks are close to fixed positional blocks, undermining the explanation for the observed performance advantage.
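
The diagnostics named in the second major comment could be sketched as below. These are illustrative definitions, not ones from the manuscript: the entropy is taken over empirical cluster usage from hard assignments (normalized by log K), and mask overlap assumes the simple rule that token i may attend to token j when the cluster id of j is not larger than that of i. All function names are hypothetical.

```python
import math
from collections import Counter


def assignment_entropy(assignments, num_clusters):
    """Entropy of cluster usage, normalized to [0, 1]; needs num_clusters >= 2."""
    counts = Counter(assignments)
    n = len(assignments)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_clusters)


def cluster_size_variance(assignments, num_clusters):
    """Population variance of cluster sizes (0 means perfectly balanced)."""
    counts = Counter(assignments)
    sizes = [counts.get(k, 0) for k in range(num_clusters)]
    mean = sum(sizes) / num_clusters
    return sum((s - mean) ** 2 for s in sizes) / num_clusters


def mask_overlap(a, b):
    """Fraction of (i, j) pairs on which two chunk-causal masks agree."""
    n = len(a)
    agree = sum(
        (a[j] <= a[i]) == (b[j] <= b[i]) for i in range(n) for j in range(n)
    )
    return agree / (n * n)


positional = [0, 0, 1, 1]  # fixed blocks of size 2
learned = [0, 1, 0, 1]     # a hypothetical content-based assignment
overlap = mask_overlap(positional, learned)
```

A normalized entropy near 0 would point to collapse onto a few clusters, and a mask overlap near 1 against the positional baseline would suggest the learned chunks are close to fixed blocks.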

minor comments (2)

[§3] The notation for the subspace projection matrices and the soft-to-hard assignment function should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
[Figure 2] Figure 2 or the corresponding results table would benefit from error bars or standard deviations across runs to support the claim of stable advantages across scales.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's positive assessment of the potential impact of dynamic chunking and address each major comment below. We will revise the paper to improve reproducibility and provide supporting analyses as outlined.

Referee: [§4] §4 (Experiments): The central claim of consistent improvements on downstream benchmarks lacks supporting details on training hyperparameters, baseline implementations, number of random seeds or statistical significance tests, and ablations that isolate the Chunking Attention component from the mere addition of extra parameters. This omission makes it impossible to determine whether the gains are reproducible or specifically due to content-defined chunks.

Authors: We agree that the current manuscript would benefit from expanded experimental details to support reproducibility and isolate the contribution of Chunking Attention. In the revised version, Section 4 will be updated to include a full table of training hyperparameters (learning rate schedules, batch sizes, diffusion timesteps, and optimizer settings), precise descriptions of baseline implementations ensuring matched parameter counts and training protocols, results from multiple random seeds (at least three) with standard deviations and statistical significance tests (e.g., paired t-tests), and a dedicated ablation comparing DCDM to a parameter-matched fixed-block variant. These additions will demonstrate that performance gains are attributable to content-defined chunks rather than capacity differences or implementation choices. revision: yes
Referee: [§3.2] §3.2 (Chunking Attention): The routing mechanism performs soft assignment to K learnable subspaces followed by a hard chunk-causal mask. The diffusion loss contains no explicit term to penalize subspace collapse or to encourage clusters to capture semantic rather than positional or frequency-based structure. Without quantitative analysis of cluster diversity, assignment entropy, or qualitative inspection of chunk boundaries, it remains possible that the learned masks are close to fixed positional blocks, undermining the explanation for the observed performance advantage.

Authors: The referee is correct that the manuscript lacks an explicit regularization term against subspace collapse and does not yet provide quantitative or qualitative analysis of the learned clusters. While the diffusion objective implicitly favors chunk boundaries that improve denoising efficiency, this alone does not fully rule out degenerate solutions. In the revision, we will add to Section 3.2: (i) plots of assignment entropy and cluster-size variance over the course of training, (ii) quantitative comparison of learned masks against fixed positional blocks (e.g., via mask overlap metrics), and (iii) qualitative examples of chunk boundaries on held-out text demonstrating semantic coherence beyond positional or frequency patterns. These analyses will strengthen the claim that the observed gains arise from semantically meaningful dynamic chunking. revision: yes
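
The seed-level comparison promised in the first response (paired t-tests over at least three seeds) can be sketched with the standard library alone. The scores below are placeholder values chosen for easy arithmetic, not results from the paper, and the function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired samples.

    Each position pairs two systems trained with the same random seed.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation, needs n >= 2
    return mean / (sd / math.sqrt(n)), n - 1


# Placeholder per-seed scores for two systems (NOT numbers from the paper).
model_scores = [3.0, 5.0, 7.0]
baseline_scores = [2.0, 3.0, 4.0]
t_stat, dof = paired_t_statistic(model_scores, baseline_scores)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom. With only three seeds the test has little power, which is worth keeping in mind when reading any reported significance.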

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via end-to-end training of new components

The paper defines Chunking Attention as a new differentiable routing layer with K learnable subspaces whose parameters are optimized directly by the discrete diffusion objective to induce content-based chunk-causal masks. This architecture generalizes positional-block diffusion without reducing any claimed prediction or uniqueness result to a prior fit, self-citation chain, or ansatz smuggled from the authors' own work. The central modeling step (soft assignment into subspaces followed by hard mask) is not equivalent by construction to its inputs; the diffusion loss provides an external training signal that can in principle discover semantic structure. No load-bearing self-citation or renaming of known results appears in the provided derivation. The model therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of a differentiable clustering mechanism whose assignments improve the diffusion factorization, plus standard assumptions about discrete diffusion training dynamics.

free parameters (2)

K (number of clusters)
The number of semantic chunks is a hyperparameter chosen by the authors.
learnable subspace parameters
Parameters of the Chunking Attention layer are fitted during training.

axioms (1)

domain assumption: Cluster assignments produced by Chunking Attention can be used to construct a valid chunk-causal attention mask that strictly generalizes positional blocks.
Invoked when the paper states that the resulting mask allows autoregressive factorization over semantic chunks.
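
The "generalizes positional blocks" part of this assumption can be checked on a toy case. The sketch assumes the simple mask rule that token i may attend to token j when the cluster id of j is not larger than that of i; both function names are hypothetical and the paper's mask may be defined differently.

```python
def chunk_causal_mask(assignments):
    """mask[i][j] is True when cluster(j) <= cluster(i)."""
    return [[cj <= ci for cj in assignments] for ci in assignments]


def block_causal_mask(length, block_size):
    """Positional baseline: compare block indices i // block_size."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(length)]
        for i in range(length)
    ]


length, block_size = 6, 2
# Choosing cluster ids by position reproduces the positional-block mask.
by_position = [i // block_size for i in range(length)]
same = chunk_causal_mask(by_position) == block_causal_mask(length, block_size)

# A content-based assignment can produce masks no block size can express.
interleaved = chunk_causal_mask([0, 1, 0, 1])
```

Under this rule the positional mask is one particular choice of assignment, while interleaved assignments such as `[0, 1, 0, 1]` fall outside what contiguous blocks can represent.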

invented entities (1)

Chunking Attention layer (no independent evidence)
purpose: Differentiable routing of tokens into K content-defined clusters
New component introduced to replace fixed positional partitioning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Chunking Attention... routes tokens into K clusters parameterized by learnable subspaces... induces a chunk-causal attention mask"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "DCDM... strictly generalizing block discrete diffusion"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 21 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov

Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models,

work page
[3]

URLhttps://arxiv.org/abs/2503.09573

work page internal anchor Pith review Pith/arXiv arXiv
[4]

and Ho, Jonathan and Tarlow, Daniel and van den Berg, Rianne , year =

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces, 2023. URLhttps://arxiv.org/abs/2107.03006

work page arXiv 2023
[5]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Efficient Training of Language Models to Fill in the Middle

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle.arXiv preprint arXiv:2207.14255, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

work page 2020
[8]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020. 9 APREPRINT- MAY18, 2026

work page 1901
[9]

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Preprint arXiv:1312.3005, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[10]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457v1, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. Preprint arXiv:1804.05685, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

work page 2022
[15]

Openwebtext corpus

Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/ OpenWebTextCorpus, 2019

work page 2019
[16]

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models.arXiv preprint arXiv:2410.17891, 2024

work page internal anchor Pith review arXiv 2024
[17]

Gemini diffusion, 2025

Google DeepMind. Gemini diffusion, 2025. URLhttps://deepmind.google/models/gemini-diffusion/. Accessed: 2026-04-21

work page 2025
[18]

Aligning ai with shared human values.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values.Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[19]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[20]

Autoregressive diffusion models.arXiv preprint arXiv:2110.02037, 2021

Emiel Hoogeboom, Alexey A Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. Autoregressive diffusion models.arXiv preprint arXiv:2110.02037, 2021

work page arXiv 2021
[21]

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.arXiv preprint arXiv:1611.01144, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[22]

A mini-batch training strategy for deep subspace clustering networks.arXiv preprint arXiv:2507.19917, 2025

Yuxuan Jiang, Chenwei Yu, Zhi Lin, and Xiaolan Liu. A mini-batch training strategy for deep subspace clustering networks.arXiv preprint arXiv:2507.19917, 2025

work page arXiv 2025
[23]

Mercury: Ultra-Fast Language Models Based on Diffusion

Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, et al. Mercury: Ultra-fast language models based on diffusion.arXiv preprint arXiv:2506.17298, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Truthfulqa: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

work page 2022
[26]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Adablock-dllm: Semantic-aware diffusion llm inference via adaptive block size.arXiv preprint arXiv:2509.26432, 2025

Guanxi Lu, Hao Mark Chen, Yuto Karashima, Zhican Wang, Daichi Fujiki, and Hongxiang Fan. Adablock-dllm: Semantic-aware diffusion llm inference via adaptive block size.arXiv preprint arXiv:2509.26432, 2025

work page arXiv 2025
[28]

Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank.Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology. org/J93-2004/

work page 1993
[29]

Attention-based clustering.arXiv preprint arXiv:2505.13112, 2025

Rodrigo Maulen-Soto, Pierre Marion, and Claire Boyer. Attention-based clustering.arXiv preprint arXiv:2505.13112, 2025. 10 APREPRINT- MAY18, 2026

work page arXiv 2025
[30]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017

work page 2017
[31]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

The lambada dataset: Word prediction requiring a broad discourse context

D Paperno, G Kruszewski, A Lazaridou, QN Pham, Raffaella Bernardi, S Pezzelle, M Baroni, G Boleda, and R Fernández. The lambada dataset: Word prediction requiring a broad discourse context. In54th Annual Meeting of the Association for Computational Linguistics, ACL 2016-Long Papers, volume 3, pages 1525–1534. Association for Computational Linguistics (ACL), 2016

work page 2016
[33]

Subspace clustering for high dimensional data: a review

Lance Parsons, Ehtesham Haque, and Huan Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explor. Newsl., 6(1):90–105, June 2004. ISSN 1931-0145. doi: 10.1145/1007730.1007731. URL https://doi.org/10.1145/1007730.1007731

work page doi:10.1145/1007730.1007731 2004
[34]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

work page 2019
[35]

Simple and effective masked diffusion language models, 2024

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models, 2024. URLhttps: //arxiv.org/abs/2406.07524

work page arXiv 2024
[36]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

work page 2021
[37]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- rageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

Simplified and generalized masked diffusion for discrete data.Advances in neural information processing systems, 37:103131–103167, 2024

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data.Advances in neural information processing systems, 37:103131–103167, 2024

work page 2024
[39]

Training and inference on any-order autoregressive models the right way.Advances in Neural Information Processing Systems, 35:2762–2775, 2022

Andy Shih, Dorsa Sadigh, and Stefano Ermon. Training and inference on any-order autoregressive models the right way.Advances in Neural Information Processing Systems, 35:2762–2775, 2022

work page 2022
[40]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017
[42]

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding.arXiv preprint arXiv:2505.22618, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Deep structure and attention aware subspace clustering

Wenhao Wu, Weiwei Wang, and Shengjiang Kong. Deep structure and attention aware subspace clustering. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 139–150. Springer, 2023

work page 2023
[45]

Dream 7B: Diffusion Large Language Models

Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?arXiv preprint arXiv:1905.07830, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[47]

Character-level convolutional networks for text classification

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. 2015.

A Pseudocode

Algorithm 1 reproduces the chunking attention layer of Section 4.1 as a single-batch computation. The notation follows the main text: L is the sequence length, d the model dimension, K the number of clusters...

