pith. sign in

arxiv: 2606.04366 · v1 · pith:QPRHYW32new · submitted 2026-06-03 · 💻 cs.LG · cs.NA· math.NA

MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers

Pith reviewed 2026-06-28 07:04 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NA
keywords adaptive tokenizationPDE transformersmultiscale modelingmesh refinementneural PDE solverssequence modelingefficiency-accuracy trade-off
0
0 comments X

The pith

MeshTok generates heterogeneous multiscale tokens by refining sharp regions on a fixed grid so a single Transformer sequence can handle both global context and local PDE details more efficiently than uniform patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MeshTok, an adaptive tokenization method inspired by adaptive mesh refinement. It selectively creates finer tokens only where gradients are sharp or features are transient, while keeping coarser tokens elsewhere on the same fixed grid. These tokens of varying sizes are then fed into one standard Transformer, allowing the model to focus computation on physically important areas without custom layers or separate branches. Experiments on multiple PDE families show this yields a better accuracy-to-compute ratio than uniform-grid tokenization. The approach treats the modest increase in token count as a useful bias rather than a guaranteed optimum.

Core claim

MeshTok produces a heterogeneous collection of multiscale tokens on a fixed simulation grid by refining regions with sharp gradients or multiscale structures, then processes the entire collection inside one unified Transformer sequence; this targeted allocation of tokens improves the efficiency-accuracy trade-off over uniform spatial partitions across several PDE benchmarks.

What carries the argument

Heterogeneous multiscale tokens generated by selective refinement on a fixed grid and fed into a single Transformer sequence.

If this is right

  • PDE solutions with localized sharp features can be modeled with fewer total tokens while preserving accuracy.
  • The same Transformer backbone works for both smooth and multiscale problems without redesign.
  • Token count grows only where needed, offering a practical way to scale to larger domains.
  • The method supplies an inductive bias that favors physically informative regions rather than uniform effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-grid constraint may limit applicability to problems where the underlying mesh itself must move or deform.
  • Extending the refinement criterion beyond gradient magnitude to other indicators such as residual error could further improve results.
  • Because tokens remain on a fixed underlying grid, post-processing to recover a continuous field may require additional interpolation steps not detailed in the work.

Load-bearing premise

A standard Transformer can process the mixed-size tokens from the adaptive scheme without extra architectural machinery and still extract both coarse global and fine local information.

What would settle it

Run the same PDE benchmarks with uniform-grid tokenization and with MeshTok; if the uniform version matches or exceeds MeshTok on accuracy per token or per FLOPs across the test set, the claimed improvement disappears.

Figures

Figures reproduced from arXiv: 2606.04366 by Congcong Zhu, Jiamin Jiang, Jingrun Chen, Xiaoyu Peng, Yanshun Zhao.

Figure 1
Figure 1. Figure 1: An illustration of our model architecture. Given input PDE states, an indicator predicts patch-level refinement scores on a coarse grid. Selected patches are recursively refined to generate a set of multi-scale tokens, which are encoded and merged into a unified sequence with geometry-aware positional encodings. A Transformer backbone processes the resulting tokens to model spatiotemporal dependencies, and… view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of tokenizations for a 128 × 128 PDE field. Left: uniform 8 × 8 patch grid (patch size 16 × 16). Middle: MeshTok refinement that further splits a subset of coarse patches (25% in this example), concentrating refined tokens near high￾variation regions. Right: uniform 16 × 16 patch grid (patch size 8 × 8), shown with major (8 × 8) and minor (16 × 16) grid lines. (similar to BCAT (Liu et al., 20… view at source ↗
Figure 4
Figure 4. Figure 4: One-step prediction error versus model scale under different refinement settings. shown in [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Benefit of pretraining across four PDE families. Each subplot compares training from scratch against fine-tuning from a pretrained model for BCAT and MeshTok. Lower relative ℓ2 error is better. These results suggest that pretraining can provide a useful initialization that transfers beyond the source equations and improves sample efficiency in several downstream settings. They also show that MeshTok retain… view at source ↗
Figure 5
Figure 5. Figure 5: summarizes the Pareto-style trade-off between per-step runtime and one-step prediction error under the OURS-BIG model. The plot includes a baseline line con￾necting No refinement and Full refinement, representing the reference trade-off achieved by uniformly changing the re￾finement level. The gray region highlights configurations that dominate this baseline (simultaneously lower error and lower runtime), … view at source ↗
Figure 6
Figure 6. Figure 6: Time–error trade-off across model scales. Each subplot reports one-step relative ℓ2 error (y-axis, lower is better) versus average forward runtime after warmup (x-axis, in milliseconds). Subplots correspond to SMALL (left), BIG (middle), and LARGE (right). Within each subplot, the three operating points are ordered as No refinement → Ours (AMR 25%) → Full refinement, where moving right increases compute an… view at source ↗
read the original abstract

Conventional patchified Transformers operate on uniform spatial partitions, distributing computational effort evenly across the domain irrespective of local features. This inflexible tokenization scheme is inherently limited in its ability to efficiently represent and process solutions to complex PDEs. To address this, we propose MeshTok, an adaptive mesh refinement (AMR)-inspired tokenization and sequence modeling framework. This method selectively refines spatial regions exhibiting sharp gradients, transient features, or multiscale structures, generating a heterogeneous set of multiscale tokens defined on a fixed simulation grid. These tokens are processed within a unified Transformer sequence, enabling the model to simultaneously capture coarse-grained global context and fine-grained local details without requiring specialized architectural components. Although adaptive refinement moderately increases token count, it promotes a more targeted allocation of computational resources to physically informative regions, which we view as a practical inductive bias rather than a formal optimality guarantee. Experimental evaluations across multiple PDE families and benchmark datasets demonstrate that MeshTok consistently improves the efficiency-accuracy trade-off compared to uniform-grid baselines. This suggests adaptive multiscale tokenization as a scalable and generalizable design principle for neural PDE modeling. Code is available at https://github.com/SCAILab-USTC/MeshTok.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MeshTok, an AMR-inspired adaptive tokenization scheme for PDE-solving Transformers. It generates heterogeneous multiscale tokens on a fixed simulation grid, selectively refining regions with sharp gradients or transient features, and feeds these tokens into a single unified Transformer sequence to capture both coarse global context and fine local details. The central claim is that this yields a better efficiency-accuracy trade-off than uniform-grid baselines across multiple PDE families, without requiring specialized architectural components, and the authors release public code.

Significance. If the central claim holds, MeshTok supplies a practical inductive bias for allocating compute to physically relevant regions in neural PDE models, which could improve scalability for multiscale problems. The public code release is a clear strength that enables direct verification and extension.

major comments (2)
  1. [Abstract, Section 3] Abstract and Section 3 (Method): The claim that the heterogeneous multiscale tokens are processed "without requiring specialized architectural components" is load-bearing for attributing any observed gains to the tokenization scheme alone. The manuscript must explicitly state (with pseudocode or architecture diagram) whether standard Transformer components—token embeddings, positional encodings, and attention—are used unmodified or whether scale-specific projections, padding, or attention masks are introduced to accommodate variable token sizes.
  2. [Section 4] Section 4 (Experiments): The abstract asserts "consistent improvements" in the efficiency-accuracy trade-off, yet no quantitative metrics, error bars, dataset sizes, or ablation tables are referenced in the provided text. Without these, it is impossible to assess whether the reported gains survive controls for token count or whether they are driven by the adaptive refinement itself.
minor comments (1)
  1. [Abstract] The abstract would benefit from a single sentence summarizing the magnitude of the reported gains (e.g., relative error reduction or FLOPs savings) even if full tables appear later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying our architectural claims and strengthening the experimental reporting. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract, Section 3] Abstract and Section 3 (Method): The claim that the heterogeneous multiscale tokens are processed "without requiring specialized architectural components" is load-bearing for attributing any observed gains to the tokenization scheme alone. The manuscript must explicitly state (with pseudocode or architecture diagram) whether standard Transformer components—token embeddings, positional encodings, and attention—are used unmodified or whether scale-specific projections, padding, or attention masks are introduced to accommodate variable token sizes.

    Authors: We agree that the claim requires explicit support. In the revision we will add both an architecture diagram and pseudocode to Section 3. These will show that (i) all tokens receive the same linear embedding projection, (ii) positional encodings are computed from token-center coordinates using the standard sinusoidal formulation, and (iii) a vanilla multi-head self-attention layer is applied to the concatenated sequence with no scale-specific projections, padding tokens, or attention masks. The only heterogeneity resides in the tokenization step itself; the Transformer treats every token identically. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): The abstract asserts "consistent improvements" in the efficiency-accuracy trade-off, yet no quantitative metrics, error bars, dataset sizes, or ablation tables are referenced in the provided text. Without these, it is impossible to assess whether the reported gains survive controls for token count or whether they are driven by the adaptive refinement itself.

    Authors: Section 4 of the full manuscript already contains the requested elements: relative L2 errors with standard deviations over five random seeds, explicit dataset cardinalities, and ablations that match total token count between MeshTok and uniform baselines. We will revise the abstract to cite these results directly (e.g., “12 % lower relative error at matched FLOPs, Table 3”) so that the quantitative support is visible without reading the full experimental section. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical design choice evaluated on benchmarks

full rationale

The paper proposes MeshTok as an AMR-inspired adaptive tokenization method for PDE Transformers. It frames the approach as a practical inductive bias for better efficiency-accuracy trade-off, demonstrates gains via experiments across PDE families and datasets, and releases public code. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations exist. The central claim rests on empirical results rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal explicit parameters or entities; the primary domain assumption is the practical benefit of targeted refinement.

axioms (1)
  • domain assumption Adaptive refinement moderately increases token count but promotes targeted allocation as a practical inductive bias rather than a formal optimality guarantee.
    Stated directly in the abstract as the authors' framing of the method.

pith-pipeline@v0.9.1-grok · 5755 in / 1193 out tokens · 30457 ms · 2026-06-28T07:04:16.754253+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 10 canonical work pages

  1. [1]

    Journal of Computational Physics , keywords =

    URL https://api.semanticscholar. org/CorpusID:119491298. Bar-Sinai, Y ., Hoyer, S., Hickey, J., and Brenner, M. P. Learning data-driven discretizations for partial differen- tial equations.Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019. Bengio, Y ., Ducharme, R., and Vincent, P. A neural prob- abilistic language model. InProcee...

  2. [2]

    org/CorpusID:218971783

    URL https://api.semanticscholar. org/CorpusID:218971783. Cao, S. Choose a transformer: Fourier or galerkin. In Ranzato, M., Beygelzimer, A., Dauphin, Y ., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, vol- ume 34, pp. 24924–24940. Curran Associates, Inc.,

  3. [3]

    2021 IEEE/CVF International Conference on Computer Vision (ICCV) , author=

    URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ d0921d442ee91b896ad95059d13df618-Paper. pdf. Chen, C.-F. R., Fan, Q., and Panda, R. Crossvit: Cross- attention multi-scale vision transformer for image clas- sification. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 347–356, 2021. doi: 10.1109/ICCV48922.2021.00041. ...

  4. [4]

    URL https://www.sciencedirect.com/ science/article/pii/S002199912300476X

    doi: https://doi.org/10.1016/j.jcp.2023.112381. URL https://www.sciencedirect.com/ science/article/pii/S002199912300476X. Freymuth, N., Dahlinger, P., W¨urth, T., Reisch, S., K¨arger, L., and Neumann, G. Swarm reinforcement learning for adaptive mesh refinement.Advances in neural informa- tion processing systems, 36:73312–73347, 2023. Gillette, A., Keith,...

  5. [5]

    org/CorpusID:13905106

    URL https://api.semanticscholar. org/CorpusID:13905106. Guibas, J., Mardani, M., Li, Z., Tao, A., Anandkumar, A., and Catanzaro, B. Adaptive fourier neural operators: Efficient token mixers for transformers.arXiv preprint arXiv:2111.13587, 2021. Guo, X., Li, W., and Iorio, F. Convolutional neural networks for steady flow approximation. InProceed- ings of ...

  6. [6]

    Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang

    URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 854d9fca60b4bd07f9bb215d59ef5561-Paper. pdf. Hao, Z., Su, C., Liu, S., Berner, J., Ying, C., Su, H., Anand- kumar, A., Song, J., and Zhu, J. Dpot: auto-regressive denoising operator transformer for large-scale pde pre- training. InProceedings of the 41st International Confer- ence on Machine...

  7. [7]

    cc/paper_files/paper/2020/file/ 4b21cf96d4cf612f239a6c322b10c8fe-Paper

    URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 4b21cf96d4cf612f239a6c322b10c8fe-Paper. pdf. Li, Z., Kovachki, N. B., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A. M., and Anandkumar, A. Fourier neural operator for parametric partial differential equations. In9th International Conference on Learn- ing Representations, ICLR 20...

  8. [8]

    org/CorpusID:14337532

    URL https://api.semanticscholar. org/CorpusID:14337532. Loshchilov, I. and Hutter, F. Fixing weight decay regularization in Adam.ArXiv, abs/1711.05101,

  9. [9]

    org/CorpusID:3312944

    URL https://api.semanticscholar. org/CorpusID:3312944. Lu, L., Jin, P., and Karniadakis, G. E. Deeponet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of opera- tors.arXiv preprint arXiv:1910.03193, 2019. Masliaev, M., Gusarov, D., Markov, I., and Hvatov, A. To- wards universal neural oper...

  10. [10]

    URL https://www.sciencedirect.com/ science/article/pii/S0045782524003657

    doi: https://doi.org/10.1016/j.cma.2024.117109. URL https://www.sciencedirect.com/ science/article/pii/S0045782524003657. 12 MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers Perez, E., Strub, F., De Vries, H., Dumoulin, V ., and Courville, A. Film: Visual reasoning with a general con- ditioning layer. InProceedings of the AAAI con...

  11. [11]

    2008 , isbn =

    URL https://www.sciencedirect.com/ science/article/pii/S0021999118307125. Rao, Y ., Zhao, W., Liu, B., Lu, J., Zhou, J., and Hsieh, C.-J. Dynamicvit: Efficient vision transformers with dynamic token sparsification.Advances in neural information processing systems, 34:13937–13949, 2021. Roohi, E. and Mahdavi, A. Shock-aware physics-guided fusion-deeponet o...

  12. [12]

    doi: 10.24963/ijcai.2024/

    International Joint Conferences on Artificial Intel- ligence Organization, 8 2024. doi: 10.24963/ijcai.2024/

  13. [13]

    2021/205

    URL https://doi.org/10.24963/ijcai. 2024/573. Main Track. Wu, H., Luo, H., Wang, H., Wang, J., and Long, M. Tran- solver: A fast transformer solver for pdes on general geometries. InInternational Conference on Machine Learning, 2024. Xu, Z., Liu, J., Chen, K., Chen, Y ., Hu, Z., and Ni, B. Amr- transformer: Enabling efficient long-range interaction for co...

  14. [14]

    Zhu, Y ., Zabaras, N., Koutsourelakis, P.-S., and Perdikaris, P

    URL https://www.sciencedirect.com/ science/article/pii/S0021999118302341. Zhu, Y ., Zabaras, N., Koutsourelakis, P.-S., and Perdikaris, P. Physics-constrained deep learning for high-dimensional surrogate modeling and uncer- tainty quantification without labeled data.Journal of Computational Physics, 394:56–81, 2019. ISSN 0021-9991. doi: https://doi.org/10...

  15. [15]

    14 MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers A

    URL https://www.sciencedirect.com/ science/article/pii/S0021999119303559. 14 MeshTok: Efficient Multi-Scale Tokenization for Scalable PDE Transformers A. Theoretical Analysis A.1. Theoretical Analysis of AMR A substantial body of prior work has demonstrated that modern neural architectures—including convolutional encoder– decoder surrogates and neural-ope...

  16. [16]

    IfC qo = 1, then E(P2;g) =E ⋆(N(P2);g)≤ E(P 1;g)

  17. [17]

    Proof.We decompose the argument into four conceptual steps

    More generally, for arbitraryC qo ≥1, E(P2;g)≤C qo E ⋆(N(P2);g)≤C qo E(P1;g). Proof.We decompose the argument into four conceptual steps. Since refinement never reduces the token count, we have N(P1)≤N(P 2). Therefore, the uniform partition P1 is a feasible candidate in the definition ofE ⋆(N(P2);g). By definition of the infimum, E ⋆(N(P2);g) = inf ˜P:N( ...

  18. [18]

    error-producing

    The model consists of 8 stacked spectral convolution layers and operates directly on the full spatial grid, without relying on patch-based tokenization. • ViTThe Vision Transformer baseline follows a standard encoder-only Transformer architecture. It uses an embedding dimension of 512 and a feed-forward network dimension of 2048. The input field is tokeni...