arxiv: 2606.19939 · v1 · pith:5JYYZUB5new · submitted 2026-06-18 · 💻 cs.CV

DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

Pith reviewed 2026-06-26 17:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords handwritten mathematical expression generation · latent diffusion · structural priors from LaTeX · data augmentation for OCR · graph-aware generation · symbol-aware regularization

The pith

DiffMath generates handwritten math expressions from LaTeX hierarchies alone by encoding them as compact symbol-relation-depth triplets instead of using bounding-box labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops DiffMath to solve the problem of creating realistic handwritten mathematical expressions whose two-dimensional layouts are hard to capture without expensive position annotations. It extracts the existing tree structure from LaTeX or MathML into short sequences that record each symbol, its spatial relation to others, and its nesting depth. A variational autoencoder then learns a latent space that keeps both symbol identity and these spatial relations intact, after which a diffusion transformer denoises new samples in that space while an adaptive normalization layer injects a global count of symbols for extra coherence. The resulting images are structurally consistent and, when added to training sets, raise the accuracy of downstream math OCR systems.

Core claim

DiffMath is a symbol- and graph-aware latent diffusion framework that uses the hierarchical structure inherent in LaTeX as a structural prior. It first converts expressions via Relational Abstract Syntax Tree (RelAST) into triplet sequences [S, R, D], trains MathVAE with symbol-aware and relation-aware perceptual regularization to obtain structure-preserving latents, and runs MathDiT for conditional denoising guided by a symbol-count prior through Adaptive Layer Normalization (AdaLN).
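
To make the symbol-count conditioning concrete, here is a minimal pure-Python sketch of adaptive layer normalization driven by a scalar count. It is an illustration under stated assumptions, not the paper's implementation: the sinusoidal `count_embedding`, the DiT-style `(1 + gamma)` modulation, the toy dimension, and all weights are hypothetical choices, and nothing on this page says how MathDiT actually embeds the count.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Plain layer normalization without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def count_embedding(count: int, dim: int) -> List[float]:
    """Sinusoidal embedding of a scalar symbol count (an assumed choice)."""
    emb: List[float] = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb += [math.sin(count * freq), math.cos(count * freq)]
    return emb


def linear(x: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Dense layer: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def adaln(x, cond, W_scale, b_scale, W_shift, b_shift):
    """AdaLN: normalize x, then apply a scale and shift regressed from cond."""
    gamma = linear(cond, W_scale, b_scale)
    beta = linear(cond, W_shift, b_shift)
    return [(1.0 + g) * h + s for h, g, s in zip(layer_norm(x), gamma, beta)]


# Toy features and hypothetical weights; the same features are modulated
# by the embeddings of two different symbol counts.
x = [1.0, 2.0, 3.0, 4.0]
dim = len(x)
W = [[0.1] * dim for _ in range(dim)]
b = [0.0] * dim
out_6 = adaln(x, count_embedding(6, dim), W, b, W, b)
out_7 = adaln(x, count_embedding(7, dim), W, b, W, b)
print(out_6)
print(out_7)
```

With all-zero weights the modulation vanishes and `adaln` reduces to plain layer normalization, which is why the conditioning can be read as a learned, count-dependent correction on top of the normalized features.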

What carries the argument

RelAST, a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D] where each token encodes symbol identity, spatial relation, or nesting depth.
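
As a rough illustration of what a symbol-relation-depth encoding can look like, the sketch below flattens a small layout tree into three parallel sequences. Everything here is an assumption made for illustration: the `Node` class, the relation names (`sup`, `right`, `above`, `below`), and the rule that a `right` neighbor keeps its parent's depth are hypothetical, and the paper's actual RelAST vocabulary and traversal order are not given on this page.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical layout-tree node: a symbol plus relation-tagged children."""
    symbol: str
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


def flatten(root: Node) -> Tuple[List[str], List[str], List[int]]:
    """Depth-first walk that emits parallel S, R, D sequences."""
    S: List[str] = []
    R: List[str] = []
    D: List[int] = []

    def visit(node: Node, relation: str, depth: int) -> None:
        S.append(node.symbol)
        R.append(relation)
        D.append(depth)
        for rel, child in node.children:
            # A "right" neighbor stays on the same level; other relations nest.
            visit(child, rel, depth if rel == "right" else depth + 1)

    visit(root, "root", 0)
    return S, R, D


# Toy tree for x^{2} + \frac{a}{b}
expr = Node("x", [
    ("sup", Node("2")),
    ("right", Node("+", [
        ("right", Node("\\frac", [
            ("above", Node("a")),
            ("below", Node("b")),
        ])),
    ])),
])

S, R, D = flatten(expr)
print(S)  # ['x', '2', '+', '\\frac', 'a', 'b']
print(R)  # ['root', 'sup', 'right', 'right', 'above', 'below']
print(D)  # [0, 1, 0, 0, 1, 1]
```

Note that no coordinates appear anywhere in the output: position is implied only by the relation and depth tokens, which is the property the claim depends on.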

If this is right

  • Generated expressions maintain correct spatial topology without any bounding-box supervision during training or inference.
  • The method achieves higher scores than prior approaches on standard generation metrics for handwritten math.
  • Synthetic images produced by the model improve accuracy when used to augment training data for downstream OCR systems.
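
The augmentation use case in the last bullet can be sketched as a simple mixing step. This is a hypothetical helper, not the paper's protocol: the `Sample` format, the `synth_ratio` parameter, and the sampling policy are assumptions, and the page does not state what ratio of synthetic to real data the authors used.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, str]  # (image identifier, LaTeX label); hypothetical format


def mix_training_set(real: List[Sample], synthetic: List[Sample],
                     synth_ratio: float, seed: int = 0) -> List[Sample]:
    """Keep every real sample and add up to synth_ratio * len(real) synthetic ones."""
    if synth_ratio < 0:
        raise ValueError("synth_ratio must be non-negative")
    rng = random.Random(seed)
    n_synth = min(len(synthetic), round(synth_ratio * len(real)))
    mixed = real + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed


# Placeholder identifiers and labels, not data from the paper.
real = [("r0", "x"), ("r1", "y"), ("r2", "z"), ("r3", "w")]
synthetic = [("s%d" % i, "a") for i in range(10)]
mixed = mix_training_set(real, synthetic, synth_ratio=0.5)
print(len(mixed))  # 6
```

Fixing the seed keeps the mixed set reproducible, which matters when comparing an OCR model trained with and without the synthetic portion.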

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The triplet encoding could be applied to other hierarchically structured generation problems such as chemical diagrams or circuit schematics where explicit coordinates are costly.
  • Because the approach removes the need for position labels, it may allow creation of much larger synthetic datasets covering rare symbols or unusual layouts.
  • The learned latent space might support controlled editing, such as changing one sub-expression while keeping overall structure fixed.

Load-bearing premise

The hierarchical structure inherent in LaTeX can be distilled into compact triplet sequences [S, R, D] that preserve spatial topology sufficiently well to replace explicit positional supervision.

What would settle it

The claim would fail if generated expressions show frequent spatial errors, such as misplaced superscripts or unbalanced fractions, on visual inspection, or if adding the synthetic samples to an OCR training set produces no measurable accuracy gain over training on real data alone.
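
The second test can be made operational with an expression-level exact-match rate, a metric commonly used in handwritten math recognition. The sketch below is an assumption about how one might run the comparison: the whitespace tokenization in `normalize` is a simplification, and the strings are made-up placeholders, not results from the paper.

```python
from typing import List


def normalize(latex: str) -> List[str]:
    """Whitespace tokenization; a simplification of real LaTeX normalization."""
    return latex.split()


def exp_rate(preds: List[str], targets: List[str]) -> float:
    """Fraction of expressions whose predicted token sequence matches exactly."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    hits = sum(normalize(p) == normalize(t) for p, t in zip(preds, targets))
    return hits / len(targets)


# Placeholder outputs of two hypothetical OCR models on the same test set:
# one trained on real data only, one trained on real plus synthetic data.
targets = ["x ^ { 2 }", "\\frac { a } { b }", "a + b", "\\sqrt { x }"]
baseline = ["x ^ { 2 }", "\\frac { a } { 6 }", "a + b", "\\sqrt x"]
augmented = ["x ^ { 2 }", "\\frac { a } { b }", "a + b", "\\sqrt x"]

gain = exp_rate(augmented, targets) - exp_rate(baseline, targets)
print(exp_rate(baseline, targets), exp_rate(augmented, targets), gain)
```

A gain that stays near zero across seeds and test sets would count against the augmentation claim; a single toy comparison like this one settles nothing.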

Figures

Figures reproduced from arXiv: 2606.19939 by Dezhi Peng, Hiuyi Cheng, Huiguo He, Lianwen Jin, Minghui Liao, Wei Pan, Xuhan Zheng, Yilin Shi.

Figure 1. Comparison of DiffMath (Ours) and two-stage generation paradigms. Unlike decoupled two-stage approaches that require explicit position-level supervision, DiffMath adopts a streamlined end-to-end framework to directly map LaTeX to formula pixels, reducing data dependency while improving global structural consistency.
Figure 2. Overview of the DiffMath Framework. (a) LaTeX is parsed into a structured representation (symbols, relations, depths) to provide explicit structural guidance. (b) MathVAE compresses raw trajectories into a latent space, utilizing perceptual losses (L_sym, L_rel) to encode geometries and topologies. (c) MathDiT reconstructs the clean latent x̂_0 from noise, conditioned on structural tokens and global counts …
Figure 3. Overview of the RelAST construction process.
Figure 4. Qualitative comparison with SOTA methods.
Figure 5. MathDiT generation with VAE variants. Red/blue boxes mark content/style errors. MathVAE produces more accurate and consistent results.
Figure 6. Ablation study of MathDiT. Red/blue boxes mark content/structure errors. Full structural inputs reduce errors. Symbol counts further improve completeness.
Figure 7. Representative failure cases. Most errors occur in LaTeX expressions with rare symbols, dense layouts, or deep nesting, where the model may omit small components, confuse similar symbols, or misplace superscripts, subscripts, and fraction elements. These cases indicate that compact and complex mathematical structures remain challenging for generation.
Figure 8. More qualitative comparisons with SOTA methods.
Figure 9. Additional samples generated by DiffMath (Ours).

Original abstract

Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.
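
The abstract names the two regularizers but this page does not reproduce the training objective. One plausible shape for such an objective is written out below purely as an assumption: the reconstruction and KL terms are the usual VAE ingredients, the weights β, λ_sym, and λ_rel are hypothetical, and only the symbol-aware and relation-aware loss terms themselves are attested (they are named in the Figure 2 caption).

```latex
\mathcal{L}_{\mathrm{MathVAE}}
  = \mathcal{L}_{\mathrm{rec}}
  + \beta \, \mathcal{L}_{\mathrm{KL}}
  + \lambda_{\mathrm{sym}} \, \mathcal{L}_{\mathrm{sym}}
  + \lambda_{\mathrm{rel}} \, \mathcal{L}_{\mathrm{rel}}
```

Under this reading, the last two terms are what would push the latent space to keep symbol identity and spatial relations recoverable, while the first two keep it a usable generative latent.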

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DiffMath, a symbol- and graph-aware latent diffusion framework for Handwritten Mathematical Expression Generation (HMEG). It introduces a Relational Abstract Syntax Tree (RelAST) that distills MathML/LaTeX trees into compact [S, R, D] triplet sequences, a MathVAE that learns structure-preserving latent representations via symbol-aware and relation-aware perceptual regularization, and a MathDiT that performs conditional denoising in this latent space with global symbol-count guidance via Adaptive Layer Normalization (AdaLN). The central claim is that this approach eliminates the need for explicit positional supervision such as symbol-level bounding boxes while producing structurally consistent expressions, outperforming existing methods, and improving downstream OCR accuracy through synthetic data augmentation.

Significance. If the quantitative claims hold, the work could meaningfully lower annotation costs for spatial supervision in HMEG datasets and supply higher-quality synthetic data for training mathematical OCR models. The use of an external LaTeX structural prior to replace explicit geometry supervision is a potentially high-impact direction if the topology is shown to be preserved.

major comments (2)
  1. [Abstract] The abstract asserts superior performance and downstream OCR gains but supplies no quantitative metrics, baseline comparisons, ablation results, or dataset details; the claims cannot be verified from the given text.
  2. [Abstract] The central claim requires that RelAST triplets distilled from MathML/LaTeX trees encode spatial relations and nesting sufficiently to replace explicit positional supervision (bounding boxes). The representation converts trees to compact sequences where each token is a symbol, a relation, or a depth; however, flattening a 2D layout graph into a linear triplet stream can lose alignment, adjacency, and long-range spatial constraints (e.g., horizontal positioning in matrices or vertical centering in fractions). If this occurs, MathVAE perceptual regularization and MathDiT denoising must implicitly recover the missing geometry, which the abstract does not demonstrate is possible without additional supervision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments. We address each major point below and indicate planned revisions where appropriate. The full manuscript contains the supporting experiments and ablations referenced in the abstract.

  1. Referee: [Abstract] The abstract asserts superior performance and downstream OCR gains but supplies no quantitative metrics, baseline comparisons, ablation results, or dataset details; the claims cannot be verified from the given text.

    Authors: We agree the abstract is too terse to allow verification of the claims. In the revised manuscript we will expand the abstract to include the primary quantitative results (e.g., the main HMEG metric and the downstream OCR accuracy gain) together with the dataset names and a brief statement of the strongest baseline. revision: yes

  2. Referee: [Abstract] The central claim requires that RelAST triplets distilled from MathML/LaTeX trees encode spatial relations and nesting sufficiently to replace explicit positional supervision (bounding boxes). The representation converts trees to compact sequences where each token is a symbol, a relation, or a depth; however, flattening a 2D layout graph into a linear triplet stream can lose alignment, adjacency, and long-range spatial constraints (e.g., horizontal positioning in matrices or vertical centering in fractions). If this occurs, MathVAE perceptual regularization and MathDiT denoising must implicitly recover the missing geometry, which the abstract does not demonstrate is possible without additional supervision.

    Authors: RelAST explicitly encodes spatial relations via the R component of each triplet and nesting via D; the linear sequence therefore retains the topology that would otherwise be supplied by bounding boxes. The symbol-aware and relation-aware perceptual losses in MathVAE are designed to enforce preservation of this topology in the latent space, while MathDiT’s conditional denoising and AdaLN symbol-count guidance further promote global structural consistency. Section 4 and the associated ablations show that the resulting generations are structurally coherent and improve downstream OCR without any bounding-box supervision. We will add one sentence to the abstract clarifying that the perceptual regularizers recover the necessary geometry. revision: partial
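
The disagreement in point 2 turns on whether a linear triplet stream keeps the tree's topology. For a toy encoding this can be checked directly with a round trip, sketched below. The tuple-based tree format, the relation names, and the convention that a `right` child keeps its parent's depth and is listed last are all assumptions for illustration; the check says nothing about the paper's actual RelAST, and it concerns topology only, not metric geometry such as alignment or centering.

```python
from typing import List, Tuple

Tree = Tuple[str, list]  # (symbol, [(relation, subtree), ...]); hypothetical format
Triplet = Tuple[str, str, int]


def encode(tree: Tree) -> List[Triplet]:
    """Preorder walk emitting (symbol, relation, depth) triplets.

    Convention: a "right" child stays at its parent's depth and is listed last.
    """
    out: List[Triplet] = []

    def visit(node: Tree, rel: str, depth: int) -> None:
        sym, children = node
        out.append((sym, rel, depth))
        for child_rel, child in children:
            visit(child, child_rel, depth if child_rel == "right" else depth + 1)

    visit(tree, "root", 0)
    return out


def decode(triplets: List[Triplet]) -> Tree:
    """Rebuild the tree: each token attaches to the latest token at its parent depth."""
    latest = {}  # depth -> most recently seen node at that depth
    root = None
    for sym, rel, depth in triplets:
        node = (sym, [])
        if rel == "root":
            root = node
        else:
            parent_depth = depth if rel == "right" else depth - 1
            latest[parent_depth][1].append((rel, node))
        latest[depth] = node
    return root


# Toy tree for x^{2} + \frac{a}{b}
expr = ("x", [
    ("sup", ("2", [])),
    ("right", ("+", [
        ("right", ("\\frac", [
            ("above", ("a", [])),
            ("below", ("b", [])),
        ])),
    ])),
])

triplets = encode(expr)
print(triplets)
print(decode(triplets) == expr)  # True
```

The round trip succeeding shows that parent-child structure survives flattening in this toy scheme. The referee's residual worry, that quantities like horizontal alignment in matrices never enter the sequence at all, is untouched by such a check and has to be answered empirically.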

Circularity Check

0 steps flagged

No circularity; derivation uses external LaTeX prior and independent training

The paper introduces RelAST as a distillation of standard MathML/LaTeX trees into [S, R, D] triplets, an external structural prior rather than a self-defined quantity. MathVAE perceptual regularization and MathDiT denoising operate on this input representation with no equations shown that equate outputs to fitted parameters or prior self-citations by construction. Performance claims rest on downstream experiments and OCR augmentation, which are falsifiable outside the method definition. No load-bearing step reduces to tautology or self-referential fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the assumption that LaTeX trees supply adequate spatial information and on several new architectural modules whose hyperparameters are not disclosed.

free parameters (1)
  • diffusion and VAE training hyperparameters
    Standard but unspecified model knobs that control latent space quality and denoising behavior.
axioms (1)
  • domain assumption: LaTeX/MathML trees encode sufficient spatial topology for generation without explicit bounding boxes
    Invoked to justify removal of positional supervision.
invented entities (3)
  • RelAST (no independent evidence)
    purpose: Compact triplet encoding of symbols, relations, and depth
    New intermediate representation distilled from MathML.
  • MathVAE (no independent evidence)
    purpose: Structure-preserving latent encoder with symbol and relation regularizers
    New VAE variant tailored to the task.
  • MathDiT (no independent evidence)
    purpose: Conditional latent diffusion transformer guided by symbol count via AdaLN
    New diffusion backbone variant.
