DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

Dezhi Peng; Hiuyi Cheng; Huiguo He; Lianwen Jin; Minghui Liao; Wei Pan; Xuhan Zheng; Yilin Shi

arxiv: 2606.19939 · v1 · pith:5JYYZUB5new · submitted 2026-06-18 · 💻 cs.CV

Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin

Pith reviewed 2026-06-26 17:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords handwritten mathematical expression generation · latent diffusion · structural priors from LaTeX · data augmentation for OCR · graph-aware generation · symbol-aware regularization

The pith

DiffMath generates handwritten math expressions from LaTeX hierarchies alone by encoding them as compact symbol-relation-depth triplets instead of using bounding-box labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops DiffMath to solve the problem of creating realistic handwritten mathematical expressions whose two-dimensional layouts are hard to capture without expensive position annotations. It extracts the existing tree structure from LaTeX or MathML into short sequences that record each symbol, its spatial relation to others, and its nesting depth. A variational autoencoder then learns a latent space that keeps both symbol identity and these spatial relations intact, after which a diffusion transformer denoises new samples in that space while an adaptive normalization layer injects a global count of symbols for extra coherence. The resulting images are structurally consistent and, when added to training sets, raise the accuracy of downstream math OCR systems.
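The page does not reproduce the training objectives, so the following is only a sketch of the shape they might take. It assumes a standard VAE loss extended with the two perceptual terms named in Figure 2 (L_sym, L_rel), and a denoiser that predicts the clean latent x̂_0 given structural tokens c and symbol count n. The weights β, λ_sym, and λ_rel, and the squared-error form of the diffusion loss, are assumptions rather than values or choices taken from the paper.

```latex
\mathcal{L}_{\text{MathVAE}}
  = \mathcal{L}_{\text{rec}}
  + \beta\,\mathcal{L}_{\text{KL}}
  + \lambda_{\text{sym}}\,\mathcal{L}_{\text{sym}}
  + \lambda_{\text{rel}}\,\mathcal{L}_{\text{rel}},
\qquad
\mathcal{L}_{\text{MathDiT}}
  = \mathbb{E}_{x_0,\,t,\,\epsilon}
    \bigl[\,\lVert \hat{x}_0(x_t, t, c, n) - x_0 \rVert_2^2\,\bigr]
```

Under this reading, the perceptual terms shape the latent space before the diffusion model ever sees it, which is why the VAE ablation in Figure 5 is a natural place to look for their effect.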

Core claim

DiffMath is a symbol- and graph-aware latent diffusion framework that uses the hierarchical structure inherent in LaTeX as a structural prior. It first converts expressions via Relational Abstract Syntax Tree (RelAST) into triplet sequences [S, R, D], trains MathVAE with symbol-aware and relation-aware perceptual regularization to obtain structure-preserving latents, and runs MathDiT for conditional denoising guided by a symbol-count prior through Adaptive Layer Normalization (AdaLN).
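To make the AdaLN conditioning concrete, here is a minimal pure-Python sketch of how a scalar symbol count could modulate a normalized feature vector. It is not the paper's implementation: the sinusoidal `count_embedding`, the single linear projection per modulation term, and the `(1 + scale)` form (the convention used in DiT-style models) are all assumptions made for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def count_embedding(n_symbols, dim):
    """Sinusoidal embedding of a scalar count (an assumed choice)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    sines = [math.sin(n_symbols * f) for f in freqs]
    cosines = [math.cos(n_symbols * f) for f in freqs]
    return sines + cosines


def linear(weights, bias, x):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def adaln(hidden, cond, w_scale, b_scale, w_shift, b_shift):
    """Layer-normalize `hidden`, then apply a scale and shift predicted from `cond`."""
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(hidden)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


# Toy usage: with zero-initialized projections the layer is plain layer norm.
hidden = [1.0, 2.0, 3.0, 4.0]
cond = count_embedding(n_symbols=7, dim=4)
zeros = [[0.0] * 4 for _ in range(4)]
out_identity = adaln(hidden, cond, zeros, [0.0] * 4, zeros, [0.0] * 4)
out_shifted = adaln(hidden, cond, zeros, [0.0] * 4, zeros, [1.0] * 4)
```

The point of the design is that the count enters every block as a global signal rather than as one more token in the sequence, so it can bias the whole layout toward the expected number of symbols.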

What carries the argument

RelAST, a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D] where each token encodes symbol identity, spatial relation, or nesting depth.
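As an illustration of what such a triplet sequence might look like, the sketch below linearizes a tiny expression tree into parallel symbol, relation, and depth streams. The `Node` class, the relation names, the traversal order, and the rule that a `right` neighbour stays at the same depth are all hypothetical; the page does not specify the paper's actual RelAST vocabulary or construction.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    # Maps a relation name to the subtree attached through that relation.
    children: dict = field(default_factory=dict)


# Hypothetical relation vocabulary; "right" is visited last so that nested
# parts of a symbol are emitted before the baseline continues.
RELATION_ORDER = ("sup", "sub", "above", "below", "inside", "right")


def linearize(node, relation="root", depth=0):
    """Depth-first walk returning (symbol, relation, depth) triplets."""
    triplets = [(node.symbol, relation, depth)]
    for rel in RELATION_ORDER:
        child = node.children.get(rel)
        if child is not None:
            # A right neighbour continues the current baseline, so it does
            # not nest deeper; every other relation opens a new level.
            next_depth = depth if rel == "right" else depth + 1
            triplets.extend(linearize(child, rel, next_depth))
    return triplets


# Toy input for x^{2} + y
expr = Node("x", {"sup": Node("2"),
                  "right": Node("+", {"right": Node("y")})})
triplets = linearize(expr)
S, R, D = (list(column) for column in zip(*triplets))
print(S)  # ['x', '2', '+', 'y']
print(R)  # ['root', 'sup', 'right', 'right']
print(D)  # [0, 1, 0, 0]
```

Note that nothing in the three streams records pixel positions or sizes; the sequence carries only which symbol attaches to which, and how.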

If this is right

Generated expressions maintain correct spatial topology without any bounding-box supervision during training or inference.
The method achieves higher scores than prior approaches on standard generation metrics for handwritten math.
Synthetic images produced by the model improve accuracy when used to augment training data for downstream OCR systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The triplet encoding could be applied to other hierarchically structured generation problems such as chemical diagrams or circuit schematics where explicit coordinates are costly.
Because the approach removes the need for position labels, it may allow creation of much larger synthetic datasets covering rare symbols or unusual layouts.
The learned latent space might support controlled editing, such as changing one sub-expression while keeping overall structure fixed.

Load-bearing premise

The hierarchical structure inherent in LaTeX can be distilled into compact triplet sequences [S, R, D] that preserve spatial topology sufficiently well to replace explicit positional supervision.

What would settle it

The premise would be undermined if expressions generated by the model show frequent spatial errors, such as misplaced superscripts or unbalanced fractions, when inspected by eye, or if adding the synthetic samples to an OCR training set produces no measurable accuracy gain over real data alone.
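The second half of that test can be phrased as a simple measurement. The sketch below computes expression-level exact-match accuracy (often called ExpRate in the recognition literature) for a recognizer trained on real data only and for one trained on real plus synthetic data, then reports the difference. The function names and the toy predictions are made up for illustration; the page does not say which metric or datasets the paper uses.

```python
def normalize(latex):
    """Collapse runs of whitespace so token spacing does not affect matching."""
    return " ".join(latex.split())


def exp_rate(predictions, references):
    """Fraction of expressions whose predicted LaTeX matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


def augmentation_gain(preds_real_only, preds_augmented, references):
    """Accuracy change attributable to adding synthetic training data."""
    return exp_rate(preds_augmented, references) - exp_rate(preds_real_only, references)


# Made-up toy outputs from two hypothetical recognizers on four test items.
references = ["x ^ 2", "\\frac { a } { b }", "y _ i", "a + b"]
preds_real_only = ["x ^ 2", "\\frac { a } { 6 }", "y i", "a + b"]
preds_augmented = ["x ^ 2", "\\frac { a } { b }", "y i", "a  +  b"]

gain = augmentation_gain(preds_real_only, preds_augmented, references)
print(gain)  # 0.25
```

A gain near zero across several seeds and test sets would count against the augmentation claim; a single run on one split would not settle it either way.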

Figures

Figures reproduced from arXiv: 2606.19939 by Dezhi Peng, Hiuyi Cheng, Huiguo He, Lianwen Jin, Minghui Liao, Wei Pan, Xuhan Zheng, Yilin Shi.

**Figure 1.** Comparison of DiffMath (Ours) and two-stage generation paradigms. Unlike decoupled two-stage approaches that require explicit position-level supervision, DiffMath adopts a streamlined end-to-end framework to directly map LaTeX to formula pixels, reducing data dependency while improving global structural consistency.

**Figure 2.** Overview of the DiffMath Framework. (a) LaTeX is parsed into a structured representation (symbols, relations, depths) to provide explicit structural guidance. (b) MathVAE compresses raw trajectories into a latent space, utilizing perceptual losses (L_sym, L_rel) to encode geometries and topologies. (c) MathDiT reconstructs the clean latent x̂_0 from noise, conditioned on structural tokens and global counts …

**Figure 3.** Overview of the RelAST construction process.

**Figure 4.** Qualitative comparison with SOTA methods.

**Figure 5.** MathDiT generation with VAE variants. Red/blue boxes mark content/style errors. MathVAE produces more accurate and consistent results.

**Figure 6.** Ablation study of MathDiT. Red/blue boxes mark content/structure errors. Full structural inputs reduce errors. Symbol counts further improve completeness.

**Figure 7.** Representative failure cases. Most errors occur in LaTeX expressions with rare symbols, dense layouts, or deep nesting, where the model may omit small components, confuse similar symbols, or misplace superscripts, subscripts, and fraction elements. These cases indicate that compact and complex mathematical structures remain challenging for generation.

**Figure 8.** More qualitative comparisons with SOTA methods.

**Figure 9.** Additional samples generated by DiffMath (Ours).

Original abstract

Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiffMath uses LaTeX-derived RelAST triplets to drive a latent diffusion model for handwritten math without bounding boxes, which is a clean way to cut annotation cost, but the spatial fidelity of the triplet flattening remains the main open question.

The paper's main contribution is a generation-oriented representation called RelAST that turns MathML trees into compact [S, R, D] sequences and then trains a MathVAE with symbol-aware and relation-aware perceptual losses before running conditional denoising in a MathDiT with AdaLN symbol-count conditioning. This setup lets the model use the hierarchical structure already present in LaTeX as supervision instead of requiring symbol-level boxes.

The approach is new in its specific combination of triplet linearization, dual perceptual regularizers, and AdaLN guidance inside a latent diffusion transformer for this task. It directly targets the annotation bottleneck in handwritten mathematical expression generation and shows a practical route to synthetic data that can augment OCR training.

The soft spot is the assumption that the linear triplet stream preserves enough 2D layout information. Flattening a tree can lose adjacency, alignment, and long-range positioning details that matter for fractions, matrices, and superscripts. The perceptual losses and diffusion process are meant to recover the missing geometry, but the abstract supplies no quantitative metrics, baseline tables, or ablation results to show whether that recovery actually happens at usable quality. The downstream OCR improvement claim is also stated without numbers.

The work is aimed at groups building synthetic training data for document OCR and math recognition. Readers working on structured diffusion or annotation-efficient vision methods will find the conditioning and regularization choices worth examining.

It deserves peer review because the method is distinct from prior HMEG work and the practical motivation is clear, even though the experiments will need close checking on spatial accuracy and reported gains.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DiffMath, a symbol- and graph-aware latent diffusion framework for Handwritten Mathematical Expression Generation (HMEG). It introduces a Relational Abstract Syntax Tree (RelAST) that distills MathML/LaTeX trees into compact [S, R, D] triplet sequences, a MathVAE that learns structure-preserving latent representations via symbol-aware and relation-aware perceptual regularization, and a MathDiT that performs conditional denoising in this latent space with global symbol-count guidance via Adaptive Layer Normalization (AdaLN). The central claim is that this approach eliminates the need for explicit positional supervision such as symbol-level bounding boxes while producing structurally consistent expressions, outperforming existing methods, and improving downstream OCR accuracy through synthetic data augmentation.

Significance. If the quantitative claims hold, the work could meaningfully lower annotation costs for spatial supervision in HMEG datasets and supply higher-quality synthetic data for training mathematical OCR models. The use of an external LaTeX structural prior to replace explicit geometry supervision is a potentially high-impact direction if the topology is shown to be preserved.

major comments (2)

[Abstract] The abstract asserts superior performance and downstream OCR gains but supplies no quantitative metrics, baseline comparisons, ablation results, or dataset details; the claims cannot be verified from the given text.
[Abstract] The central claim requires that RelAST triplets distilled from MathML/LaTeX trees encode spatial relations and nesting well enough to replace explicit positional supervision (bounding boxes). The representation converts trees to compact sequences in which each token is a symbol, a relation, or a depth; however, flattening a 2D layout graph into a linear triplet stream can lose alignment, adjacency, and long-range spatial constraints (e.g., horizontal positioning in matrices or vertical centering in fractions). If this occurs, MathVAE perceptual regularization and MathDiT denoising must implicitly recover the missing geometry, and the abstract does not demonstrate that this is possible without additional supervision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments. We address each major point below and indicate planned revisions where appropriate. The full manuscript contains the supporting experiments and ablations referenced in the abstract.

Referee: [Abstract] The abstract asserts superior performance and downstream OCR gains but supplies no quantitative metrics, baseline comparisons, ablation results, or dataset details; the claims cannot be verified from the given text.

Authors: We agree the abstract is too terse to allow verification of the claims. In the revised manuscript we will expand the abstract to include the primary quantitative results (e.g., the main HMEG metric and the downstream OCR accuracy gain) together with the dataset names and a brief statement of the strongest baseline. (revision: yes)

Referee: [Abstract] The central claim requires that RelAST triplets distilled from MathML/LaTeX trees encode spatial relations and nesting well enough to replace explicit positional supervision (bounding boxes). The representation converts trees to compact sequences in which each token is a symbol, a relation, or a depth; however, flattening a 2D layout graph into a linear triplet stream can lose alignment, adjacency, and long-range spatial constraints (e.g., horizontal positioning in matrices or vertical centering in fractions). If this occurs, MathVAE perceptual regularization and MathDiT denoising must implicitly recover the missing geometry, and the abstract does not demonstrate that this is possible without additional supervision.

Authors: RelAST explicitly encodes spatial relations via the R component of each triplet and nesting via D; the linear sequence therefore retains the topology that would otherwise be supplied by bounding boxes. The symbol-aware and relation-aware perceptual losses in MathVAE are designed to enforce preservation of this topology in the latent space, while MathDiT’s conditional denoising and AdaLN symbol-count guidance further promote global structural consistency. Section 4 and the associated ablations show that the resulting generations are structurally coherent and improve downstream OCR without any bounding-box supervision. We will add one sentence to the abstract clarifying that the perceptual regularizers recover the necessary geometry. (revision: partial)
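The disagreement in this exchange can be probed in miniature. The toy sketch below linearizes a small expression tree into (symbol, relation, depth) triplets and then rebuilds the tree from the sequence alone; if the round trip succeeds, the topology survives flattening. Everything here is an editorial assumption, not the paper's RelAST: the `Node` class, the relation names, the traversal order, and the rule that a `right` neighbour keeps the current depth.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: dict = field(default_factory=dict)  # relation name -> subtree


# Hypothetical relation vocabulary, with "right" visited last.
RELATION_ORDER = ("sup", "sub", "above", "below", "inside", "right")


def linearize(node, relation="root", depth=0):
    """Depth-first walk returning (symbol, relation, depth) triplets."""
    triplets = [(node.symbol, relation, depth)]
    for rel in RELATION_ORDER:
        child = node.children.get(rel)
        if child is not None:
            next_depth = depth if rel == "right" else depth + 1
            triplets.extend(linearize(child, rel, next_depth))
    return triplets


def rebuild(triplets):
    """Recover the tree using only the triplet sequence."""
    root = None
    last_at_depth = {}  # most recently emitted node at each depth
    for symbol, relation, depth in triplets:
        node = Node(symbol)
        if relation == "root":
            root = node
        elif relation == "right":
            # Same baseline: attach to the latest node at this depth.
            last_at_depth[depth].children["right"] = node
        else:
            # Nested relation: attach to the latest node one level up.
            last_at_depth[depth - 1].children[relation] = node
        last_at_depth[depth] = node
    return root


# Toy input for \frac{a+b}{c} = x^{2}
expr = Node("frac", {
    "above": Node("a", {"right": Node("+", {"right": Node("b")})}),
    "below": Node("c"),
    "right": Node("=", {"right": Node("x", {"sup": Node("2")})}),
})
sequence = linearize(expr)
round_trip_ok = rebuild(sequence) == expr
print(round_trip_ok)  # True
```

In this toy scheme the attachment structure is fully recoverable, which is consistent with the authors' topology argument. What the sequence does not contain is any metric geometry (symbol sizes, offsets, column alignment), so the referee's question of whether the model learns that part without positional labels is left open by a check of this kind.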

Circularity Check

0 steps flagged

No circularity; derivation uses external LaTeX prior and independent training

full rationale

The paper introduces RelAST as a distillation of standard MathML/LaTeX trees into [S, R, D] triplets, an external structural prior rather than a self-defined quantity. MathVAE perceptual regularization and MathDiT denoising operate on this input representation with no equations shown that equate outputs to fitted parameters or prior self-citations by construction. Performance claims rest on downstream experiments and OCR augmentation, which are falsifiable outside the method definition. No load-bearing step reduces to tautology or self-referential fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the assumption that LaTeX trees supply adequate spatial information and on several new architectural modules whose hyperparameters are not disclosed.

free parameters (1)

diffusion and VAE training hyperparameters
Standard but unspecified model knobs that control latent space quality and denoising behavior.

axioms (1)

domain assumption: LaTeX/MathML trees encode sufficient spatial topology for generation without explicit bounding boxes
Invoked to justify removal of positional supervision.

invented entities (3)

RelAST (no independent evidence)
purpose: Compact triplet encoding of symbols, relations, and depth
New intermediate representation distilled from MathML.
MathVAE (no independent evidence)
purpose: Structure-preserving latent encoder with symbol and relation regularizers
New VAE variant tailored to the task.
MathDiT (no independent evidence)
purpose: Conditional latent diffusion transformer guided by symbol count via AdaLN
New diffusion backbone variant.

Reference graph

Works this paper leans on

48 extracted references · 14 canonical work pages · 5 internal anchors

[1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Bhunia, A.K., Khan, S., Cholakkal, H., Anwer, R.M., Khan, F.S., Shah, M.: Hand- writing transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1086–1094 (October 2021)

2021
[2]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chen, Y., Gao, F., Zhang, Y., Qiao, M., Wang, N.: Generating handwritten mathe- matical expressions from symbol graphs: An end-to-end pipeline. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15675–15685 (June 2024)

2024
[3]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Dai, G., Zhang, Y., Ke, Q., Guo, Q., Huang, S.: One-DM: One-shot diffusion mimicker for handwritten text generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 410–427. Springer Nature Switzerland, Cham (2025)

2024
[4]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Dai, G., Zhang, Y., Qin, Y., Guo, Q., Huang, S., Yan, S.: Beyond isolated words: Diffusion brush for handwritten text-line generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 19054– 19064 (October 2025)

2025
[5]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Dai, G., Zhang, Y., Wang, Q., Du, Q., Yu, Z., Liu, Z., Huang, S.: Disentangling writer and character styles for handwriting generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5977–5986 (June 2023)

2023
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

Fogel, S., Averbuch-Elor, H., Cohen, S., Mazor, S., Litman, R.: ScrabbleGAN: Semi-supervised varying length handwritten text generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

2020
[7]

Gan, J., Li, B., Zhang, Y.M., Leng, J., Wang, W., Gao, X.: Stylized handwriting generation of arbitrary structures and OOV expressions: A decoupled approach via layout-offsets (2025),https://openreview.net/forum?id=SuLp0J2uan

2025
[8]

Proceedings of the AAAI Conference on Artificial In- telligence35(9), 7484–7492 (May 2021).https://doi.org/10.1609/aaai.v35i9

Gan,J.,Wang,W.:HiGAN:Handwritingimitationconditionedonarbitrary-length texts and disentangled styles. Proceedings of the AAAI Conference on Artificial In- telligence35(9), 7484–7492 (May 2021).https://doi.org/10.1609/aaai.v35i9. 16917

work page doi:10.1609/aaai.v35i9 2021
[9]

Gervais, A

Gervais, P., Fadeeva, A., Maksai, A.: MathWriting: A dataset for handwritten mathematical expression recognition. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. p. 5459–5469. KDD ’25, Association for Computing Machinery, New York, NY, USA (2025).https: //doi.org/10.1145/3711896.3737436

work page doi:10.1145/3711896.3737436 2025
[10]

Generating Sequences With Recurrent Neural Networks

Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[11]

Proceedings of the 23rd International Conference on Machine Learning , series =

Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. p. 369–376. ICML ’06, Association for Computing Machinery, New York, NY, USA (2006).https://doi.org/10.1145/1143844.1143891

work page doi:10.1145/1143844.1143891 2006
[12]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Guan, T., Lin, C., Shen, W., Yang, X.: PosFormer: Recognizing complex handwrit- ten mathematical expression with position forest transformer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 130–147. Springer Nature Switzerland, Cham (2025)

2024
[13]

In: 16 W

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: 16 W. Pan et al. Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, In...

2017
[14]

In: NeurIPS 2021 Work- shop on Deep Generative Models and Downstream Applications (2021),https: //openreview.net/forum?id=qw8AKxfYbI

Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS 2021 Work- shop on Deep Generative Models and Downstream Applications (2021),https: //openreview.net/forum?id=qw8AKxfYbI

2021
[15]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025)

2025
[16]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T

Li, B., Yuan, Y., Liang, D., Liu, X., Ji, Z., Bai, J., Liu, W., Bai, X.: When counting meets HMER: Counting-aware network for handwritten mathematical expression recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 197–214. Springer Nature Switzerland, Cham (2022)

2022
[18]

In: The Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems (2025),https://openreview.net/forum?id=oHbVboLXz6

Li,Y.,Jiang,J.,Zhu,J.,Peng,S.,Wei,B.,Zhou,Y.,Gao,L.:Uni-MuMER:Unified multi-task fine-tuning of vision-language model for handwritten mathematical ex- pression recognition. In: The Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems (2025),https://openreview.net/forum?id=oHbVboLXz6

2025
[19]

IEEE transactions on neural networks and learning systems34(11), 8503–8515 (2022)

Luo, C., Zhu, Y., Jin, L., Li, Z., Peng, D.: SLOGAN: handwriting style synthesis for arbitrary-length and out-of-vocabulary text. IEEE transactions on neural networks and learning systems34(11), 8503–8515 (2022)

2022
[20]

In: 2019 International Confer- ence on Document Analysis and Recognition (ICDAR)

Mahdavi, M., Zanibbi, R., Mouchere, H., Viard-Gaudin, C., Garain, U.: Ic- dar 2019 crohme + tfd: Competition on recognition of handwritten mathemat- ical expressions and typeset formula detection. In: 2019 International Confer- ence on Document Analysis and Recognition (ICDAR). pp. 1533–1538 (2019). https://doi.org/10.1109/ICDAR.2019.00247

work page doi:10.1109/icdar.2019.00247 2019
[21]

https://github.com/brucemiller/LaTeXML(2026), accessed: 2026-03-05

Miller, B.: LaTeXML: a tex and latex to xml/html/epub/mathml translator. https://github.com/brucemiller/LaTeXML(2026), accessed: 2026-03-05

2026
[22]

In: 2014 14th International Conference on Frontiers in Handwriting Recognition

Mouchère, H., Viard-Gaudin, C., Zanibbi, R., Garain, U.: Icfhr 2014 competition on recognition of on-line handwritten mathematical expressions (crohme 2014). In: 2014 14th International Conference on Frontiers in Handwriting Recognition. pp. 791–796 (2014).https://doi.org/10.1109/ICFHR.2014.138

work page doi:10.1109/icfhr.2014.138 2014
[23]

In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)

Mouchère, H., Viard-Gaudin, C., Zanibbi, R., Garain, U.: Icfhr2016 crohme: Com- petition on recognition of online handwritten mathematical expressions. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR). pp. 607–612 (2016).https://doi.org/10.1109/ICFHR.2016.0116

work page doi:10.1109/icfhr.2016.0116 2016
[24]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Nikolaidou, K., Retsinas, G., Sfikas, G., Liwicki, M.: DiffusionPen: Towards con- trolling the style of handwritten text generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 417–434. Springer Nature Switzerland, Cham (2025)

2024
[25]

In: The Fourteenth International Conference on Learning Representations (2026),https: //openreview.net/forum?id=XKOEQFKFdL DiffMath 17

Pan, W., He, H., Cheng, H., Shi, Y., Jin, L.: DiffInk: Glyph- and style-aware latent diffusion transformer for text to online handwriting generation. In: The Fourteenth International Conference on Learning Representations (2026),https: //openreview.net/forum?id=XKOEQFKFdL DiffMath 17

2026
[26]

Peebles,W.,Xie,S.:Scalablediffusionmodelswithtransformers.In:Proceedingsof the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195– 4205 (October 2023)

2023
[27]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pippi, V., Cascianelli, S., Cucchiara, R.: Handwritten text generation from visual archetypes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22458–22467 (June 2023)

2023
[28]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pippi, V., Quattrini, F., Cascianelli, S., Tonioni, A., Cucchiara, R.: Zero-shot styled text image generation, but make it autoregressive. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7910–7919 (June 2025)

2025
[29]

In: The Twelfth International Conference on Learning Representa- tions (2024),https://openreview.net/forum?id=di52zR8xgf

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution im- age synthesis. In: The Twelfth International Conference on Learning Representa- tions (2024),https://openreview.net/forum?id=di52zR8xgf

2024
[30]

In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=DhHIw9Nbl1

Ren, M., Zhang, Y.M., yi chen: Decoupling layout from glyph in online chinese handwriting generation. In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=DhHIw9Nbl1

2025
[31]

In: Encyclopedia of biometrics, pp

Reynolds, D.: Gaussian mixture models. In: Encyclopedia of biometrics, pp. 827–
[32]

Song,J.,Meng,C.,Ermon,S.:Denoisingdiffusionimplicitmodels.In:International Conferenceon LearningRepresentations(2021),https://openreview.net/forum? id=St1giarCHLP

2021
[33]

In: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding

Springstein, M., Müller-Budack, E., Ewerth, R.: Unsupervised training data gen- eration of handwritten formulas using generative adversarial networks with self- attention. In: Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding. p. 46–54. MMPT ’21, Association for Computing Machinery, New York, NY, USA (2021).https://...

work page doi:10.1145/3463945 2021
[34]

IEEE Transactions on Image Processing 34, 5228–5240 (2025).https://doi.org/10.1109/TIP.2025.3593974

Tang, L., Chai, T., Zhang, Z., Zhang, M., Wu, X.: PalmDiff: When palmprint gen- eration meets controllable diffusion model. IEEE Transactions on Image Processing 34, 5228–5240 (2025).https://doi.org/10.1109/TIP.2025.3593974

work page doi:10.1109/tip.2025.3593974 2025
[35]

Team, Q.: Qwen3 technical report (2025),https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Team, Z.I.: Z-Image: An efficient image generation foundation model with single- stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Wang, B., Wu, F., Ouyang, L., Gu, Z., Zhang, R., Xia, R., Shi, B., Zhang, B., He, C.: Image over text: Transforming formula recognition evaluation with character detection matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19681–19690 (June 2025)

2025
[38]

In: Yin, X.C., Karatzas, D., Lopresti, D

Wang, Y., Wei, H., Wang, H., Sun, B.: VMF-Net: Visual-aware multi- representation fusion network for artifact-free handwritten mathematical expres- sions generation. In: Yin, X.C., Karatzas, D., Lopresti, D. (eds.) Document Anal- ysis and Recognition – ICDAR 2025. pp. 257–269. Springer Nature Switzerland, Cham (2026)

2025
[39]

In: Yin, X.C., Karatzas, D., Lopresti, D

Wang, Y., Wei, H., Wang, H., Sun, S.: SFRD: Handwritten mathematical ex- pressions generation by spatial-aware feature refinement diffusion. In: Yin, X.C., Karatzas, D., Lopresti, D. (eds.) Document Analysis and Recognition – ICDAR
[40]

pp. 414–428. Springer Nature Switzerland, Cham (2026)

2026
[41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16133–16142 (June 2023) 18 W. Pan et al

2023
[42]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

generation: Taming optimization dilemma in latent diffusion models

Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15703–15712 (June 2025)

2025
[44]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Yuan, Y., Liu, X., Dikubab, W., Liu, H., Ji, Z., Wu, Z., Bai, X.: Syntax-aware network for handwritten mathematical expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4553–4562 (June 2022)

[45]

Zhang, J., Du, J., Yang, Y., Song, Y.Z., Wei, S., Dai, L.: A tree-structured decoder for image-to-markup generation. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 11076–11085. PMLR (13–18 Jul 2020), https://proceedings.mlr.press/v119/zhang20g.html

[46]

Zhao, W., Gao, L.: CoMER: Modeling coverage for transformer-based handwritten mathematical expression recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 392–408. Springer Nature Switzerland, Cham (2022)

[47]

Zhao, W., Gao, L., Yan, Z., Peng, S., Du, L., Zhang, Z.: Handwritten mathematical expression recognition with bidirectionally trained transformer. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition – ICDAR 2021. pp. 570–584. Springer International Publishing, Cham (2021)

[48]

Zhu, J., Zhao, W., Li, Y., Hu, X., Gao, L.: TAMER: Tree-aware transformer for handwritten mathematical expression recognition. Proceedings of the AAAI Conference on Artificial Intelligence 39(10), 10950–10958 (Apr 2025). https://doi.org/10.1609/aaai.v39i10.33190

