arxiv: 2604.07522 · v2 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)

Yuhang He

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords geometric shape encodingZernike basespositional encodingtraining-free method2D shapesinvertible representationspatial intelligencecomputer vision

0 comments

The pith

A training-free method encodes any 2D geometric shape into a compact invertible representation using Zernike bases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces XShapeEnc as a general-purpose encoding for arbitrary 2D spatially grounded shapes that requires no training. It first splits each shape into normalized geometry inside the unit disk and a pose vector turned into a matching harmonic pose field. Orthogonal Zernike bases then encode the two parts, after which frequency propagation adds high-frequency content to produce the final vector. A reader would care because standard positional encodings handle sequences well but leave full spatial shapes under-served, and this decomposition promises a ready-to-use alternative that preserves invertibility and adapts across tasks. The authors show the approach works on a range of shape-aware problems and a self-curated corpus.

Core claim

By decomposing a 2D spatially grounded geometric shape into normalized geometry within the unit disk and a pose vector that is converted into a harmonic pose field also inside the unit disk, then encoding both components with orthogonal Zernike bases either independently or jointly and applying frequency propagation, the method produces a compact representation that is invertible, adaptive, and frequency-rich without any training or task-specific adjustments.

What carries the argument

XShapeEnc, the decomposition of an input shape into unit-disk normalized geometry and harmonic pose field followed by orthogonal Zernike basis encoding and frequency propagation. The Zernike bases are orthogonal polynomials over the unit disk that permit separate or joint encoding of the geometry and pose components.

If this is right

The resulting encoding can be inverted to recover the original shape geometry and pose.
The representation adapts to new shapes without retraining or parameter changes.
Frequency propagation supplies high-frequency content that improves compatibility with neural network learning.
Different shapes produce distinguishable encodings, supporting discriminability across tasks.
The method runs efficiently and applies to a wide range of shape-aware vision problems as verified in experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition and Zernike encoding could be generalized to 3D shapes by lifting the bases to higher-dimensional orthogonal functions.
Its training-free property suggests direct use in low-data or on-device settings where collecting shape annotations is costly.
Hybrid pipelines that combine XShapeEnc with standard 1D positional encodings might handle mixed sequential and spatial inputs more cleanly.

Load-bearing premise

Decomposing any shape into normalized geometry and harmonic pose inside the unit disk, then applying Zernike bases and frequency propagation, will automatically deliver invertibility, discriminability, and applicability to arbitrary shapes without training or post-hoc fixes.

What would settle it

If inverting the encoding of a complex arbitrary shape yields a reconstruction whose geometry or pose deviates substantially from the input, or if networks using the encoding underperform trained baselines on a held-out shape-aware task, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.07522 by Yuhang He.

**Figure 1.** Figure 1: Zernike basis visualization, we visualize nine real-part basis governed by radial mode 𝑛 and angular frequency 𝑚. The Zernike basis 𝑉 𝑚 𝑛 (𝑟, 𝜃) is in the complex domain 𝑉 𝑚 𝑛 (𝑟, 𝜃) ∈ C, while 𝑅 |𝑚| 𝑛 (𝑟) is in the real domain 𝑅 |𝑚| 𝑛 (𝑟) ∈ R. By constraining 𝑛 − |𝑚| to be even and |𝑚| < 𝑛, the constructed Zernike bases are mutually orthogonal over the unit disk with respect to the area measure 𝑟𝑑𝑟𝑑𝜃, ∫ … view at source ↗

**Figure 2.** Figure 2: XShapeEnc pipeline visualization. The spatially grounded shape is decomposed into its shape geometry within the unit disk and its shape pose vector. XShapeEnc flexibly supports shape geometry encoding, shape pose encoding either independently or jointly, under the same Zernike basis umbrella. The shape pose vector constructs a harmonic pose field lying within the unit disk so that it can be processed by Ze… view at source ↗

**Figure 3.** Figure 3: Frequency impact decay illustration in FreqProp. We show the impact decay ratio w.r.t. radial/angular basis distance Δ𝑛/Δ𝑚, under the different propagation ratio 𝜆. where arg(·) indicates the phase information, 𝜆𝑟 is the radial propagation ratio deciding the frequency ratio propagating from lower-frequency coefficient (we set it as 0.6, see Sec. 6.5 in Appendix), 𝜆𝑎 is the corresponding angular propaga… view at source ↗

**Figure 4.** Figure 4: FreqProp visualization. rFreqProp and aFreqProp propagate along fixed-angular and and fixed-radial Zernike basis, respectively. FreqProp is invertible as we reverse the propagation process. We do not need to run FreqProp on negative angular Zernike basis (𝑚 < 0, overlaid by light blue) because their projection coefficients show conjugate symmetry (𝑧 −𝑚 𝑛 = 𝑧 𝑚 𝑛 ) with their positive angular Zernike bas… view at source ↗

**Figure 6.** Figure 6: Three orthonormal radial windows and harmonic pose field visualization. To ensure invertibility and robustness of p, C has to be full rank rank(C)= 𝐾 and well-conditioned so that p = AC−1 . To this end, we add two constraints to Eqn. (8): First, 𝐾 ≤ 𝐿, which is the prerequisite for C to be fullranked. In practice, we simply need to project the harmonic pose field to at least 𝐾 Zernike basis. Second, we … view at source ↗

**Figure 7.** Figure 7: Correlation between harmonic pose filed and the final shape pose encodings. Linearity: The proposed harmonic pose encoding is linear with respect to the pose field and the subsequent Zernike projection (see the Proof in Sec. 6.6). In particular, the resulting pose coefficients are linear functions of the pose vector, and therefore obey superposition. However, we do not enforce linearity with respect to p… view at source ↗

**Figure 8.** Figure 8: Relative geometry-pose emphasis joint encoding visualization. Beyond the default joint encoding, we introduce a tunable emphasis mechanism that allows controlled bias toward either geometry or pose. We define a relative emphasis parameter 𝛽 ∈ [0, 2], where 𝛽 = 1 indicates neutral emphasis, 𝛽 < 1 emphasizes pose, and 𝛽 > 1 emphasizes geometry. The more distant of 𝛽 to 1, the stronger emphasis it lays to… view at source ↗

**Figure 9.** Figure 9: XShapeCorpus curation visualization. More complex shapes can be created by consecutively running shape operations. Higher depth (operation number) indicates higher Shape complexity. Each shape is independently associated with a spatial pose. Currently there is no public dataset in which each 2D shape is an arbitrary geometric shape (aka shape geometry) and further paired with a spatial position (aka sha… view at source ↗

**Figure 10.** Figure 10: Typical baselines illustration: AngularSweep, ShapeEmbed [38], ShapeDist [32] are based on shape geometry boundary points, the other three baselines (PointSet, 2DPE and Space2Vec [28]) are based on regularly sampled points. other points in a log-polar histogram around each reference point. It captures local and global geometric structure, offering robustness to moderate deformation and making it a strong … view at source ↗

**Figure 11.** Figure 11: MSE variation for encoding length over depth shape complexity on XShapeCorpus dataset. Shape Geometry Encoding Invertibility. Given the constructed shape corpus XShapeCorpus, we exhaustively test shape geometry reconstruction error (mean square error MSE) under various encoding lengths and shape geometry complexity. To this end, first, we encode each shape geometry by setting the rasterization resolut… view at source ↗

**Figure 12.** Figure 12: XShapeEnc encoding invertibility illustration. We show two complex shape geometry (within unit disk) with depth = 5 and depth = 10 and their reconstructed shape geometry under various encoding length. Note that the original reconstructed shape is soft-masked, we binarize it with the threshold 0.2 for better visualization. parameter number. 4.4. XShapeEnc Encoding Efficiency Based on the discussion in Sec… view at source ↗

**Figure 13.** Figure 13: Shape geometry t-SNE [47] clustering visualization between XShapeEnc and other baselines. We choose four main complex shape geometries (subfig. A) with depth = 10 and augment each one to obtain 200 variations by operations including rotation, shearing and elastic deformation. We compare XShapeEnc with both boundary-based shape representation (AngularSweep w/o w/ positional encoding, elliptical Fourier tr… view at source ↗

**Figure 14.** Figure 14: Shape geometry encoding comparison w/ w/o FreqProp. To analyze the effect of FreqProp on shape geometry discriminability, we further compare t-SNE clustering results with and without frequency propagation. As shown in [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

**Figure 16.** Figure 16: t-SNE [47] clustering result visualization on frequency propagation coefficient 𝜆. Zernike modes while retaining structured harmonic relationships. In other words, Zernike basis encoding provides a stable geometric backbone, and FreqProp acts as a controlled enhancement mechanism that strengthens downstream learnability while preserving shape-aware discriminability. To further test the impact of encoding… view at source ↗

**Figure 17.** Figure 17: Shape pose t-SNE [47] clustering visualization. We divide a region into four main sub-regions and sample 200 points at each sub-region (subfig. A). Both XShapeEnc and classic 2D positional encoding can successfully cluster shape poses based on their spatial position. To assess XShapeEnc shape pose encoding discriminability, we construct a 100 × 100 𝑚2 area and evenly divide it into 4 sub-areas: topleft… view at source ↗

**Figure 18.** Figure 18: Shape Geometry and Pose Joint Encoding tSNE [47] clustering result visualization the 4 sub-areas but with different shape pose (e.g., 𝑥- and 𝑦- coordinates), leading to a total of 16 spatially grounded shape geometries. We then encode each of the spatially grounded shape geometry by the shape geometry and pose jointly encoding strategy as presented in Sec. 3.6. By varying the modulation weight 𝛽 in Eqn. 1… view at source ↗

**Figure 19.** Figure 19: Polygon-polygon topological relation visualization: we visualize 5 inter-shape topological relation in OpenStreetMap Singapore dataset: Disjoint, Within, Overlap, Touch and Equal. Equal means two polygons are identical. From these visualizations, we can learn that these polygons vary drastically in terms of shape geometry complexity, size and scale. shape is a 2D polygon associated with a spatial position… view at source ↗

read the original abstract

Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

XShapeEnc gives a plausible training-free Zernike-based encoding for 2D shapes but the invertibility claim does not survive the finite discrete case.

read the letter

The main thing to know is that the paper introduces XShapeEnc as a training-free way to turn an arbitrary 2D geometric shape into a compact vector by normalizing its geometry inside the unit disk, turning its pose into a harmonic field, projecting both onto Zernike polynomials, and then applying frequency propagation. This combination is new and targets a clear gap between 1D positional encodings and the needs of shape-aware networks. The authors correctly note that existing methods do not jointly handle geometry, pose, and high-frequency content without training, and they try to fix that with orthogonal bases plus an extra propagation step. That part is worth attention if the details check out. The work does a reasonable job describing the pipeline and running it on a range of shape tasks plus their own XShapeCorpus dataset. Using Zernike moments for the core encoding is a natural and reproducible choice, and the training-free constraint is a practical advantage. The claimed properties of invertibility, adaptivity, and frequency richness are laid out clearly enough to evaluate. The soft spot is exactly where the stress test points: invertibility. Zernike polynomials are orthogonal in the continuous setting, but the paper uses a finite number of terms on discrete pixel or point inputs and inserts an unspecified propagation operator. No explicit inverse is derived that recovers the original indicator function after these steps, and there is no bound on discretization error for shapes that fall between the basis functions. The abstract asserts theoretical validity, yet the construction as described does not guarantee lossless reconstruction. That weakens the discriminability and broad-applicability claims until the math is tightened. The experiments are mentioned but the abstract gives no numbers or controls, so it is hard to judge how much the method actually improves on baselines. This paper is for computer-vision researchers who need compact, spatially grounded representations for shapes rather than points or sequences. A reader working on robotics or graphics pipelines could extract useful ideas even if they have to add their own error analysis. It deserves a serious referee because the gap is real and the proposed decomposition is concrete, though the authors should expect heavy questions on the reconstruction proof and more transparent results. I would send it for review with instructions to strengthen the invertibility section and add quantitative ablation on discretization.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces XShapeEnc, a training-free encoding for arbitrary 2D spatially grounded geometric shapes. It decomposes each shape into normalized geometry inside the unit disk plus a harmonic pose field, projects both onto a finite set of orthogonal Zernike polynomials, and applies an unspecified frequency-propagation step to enrich high-frequency content. The resulting compact representation is asserted to possess five properties (invertibility, adaptivity, frequency richness, discriminability, and broad applicability) and is evaluated on shape-aware tasks using the self-curated XShapeCorpus.

Significance. If the invertibility and reconstruction guarantees can be rigorously established for discrete inputs, XShapeEnc would offer a parameter-free, general-purpose alternative to learned positional encodings for 2D geometric data, potentially enabling more interpretable and training-efficient spatial reasoning in neural networks.

major comments (3)

[§4] §4 (Theoretical Analysis): No explicit inverse transform or reconstruction formula is derived for the composition of finite-order Zernike projection, harmonic pose encoding, and frequency propagation. Zernike orthogonality guarantees L2 invertibility only in the continuous, infinite-order limit; the manuscript provides neither a closed-form inverse nor discretization-error bounds for arbitrary discrete shapes.
[§3.2] §3.2 (Encoding Pipeline): The frequency-propagation operator is described only at a high level; its explicit functional form (additive, multiplicative, convolutional, etc.) is not given, preventing verification that the overall map remains bijective or that high-frequency injection does not destroy exact recoverability of the original indicator function or contour.
[§5] §5 (Experiments): The reported results on discriminability and applicability lack quantitative error analysis, ablation on the number of Zernike orders retained, and explicit statements of data-exclusion criteria or shape-sampling density. Without these, it is impossible to assess whether the claimed advantages hold beyond the specific XShapeCorpus instances.

minor comments (2)

[§3.1] The definition of the 'harmonic pose field' (Eq. (3) or surrounding text) would benefit from an explicit functional form or pseudocode to clarify how the pose vector is mapped into the unit disk.
[Figure 2] Figure 2 (pipeline diagram) and the accompanying text use inconsistent notation for the normalized geometry versus the full encoding vector; a single consistent symbol table would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that several aspects of the theoretical analysis, pipeline description, and experimental reporting can be strengthened for greater rigor and reproducibility. We will revise the manuscript to incorporate explicit formulas, clarifications, and additional quantitative details as outlined below.

read point-by-point responses

Referee: [§4] §4 (Theoretical Analysis): No explicit inverse transform or reconstruction formula is derived for the composition of finite-order Zernike projection, harmonic pose encoding, and frequency propagation. Zernike orthogonality guarantees L2 invertibility only in the continuous, infinite-order limit; the manuscript provides neither a closed-form inverse nor discretization-error bounds for arbitrary discrete shapes.

Authors: We thank the referee for this observation. While the continuous infinite-order case follows directly from Zernike orthogonality, the finite-order discrete setting requires explicit treatment. In the revised §4 we will derive the reconstruction formula: the normalized geometry is recovered via the finite sum of Zernike coefficients multiplied by the corresponding basis functions evaluated on the discrete grid, and the harmonic pose field is recovered analogously. We will explicitly state that this yields the minimum-L2-error approximation for the retained orders and will add discretization-error bounds based on the sampling density within the unit disk together with empirical reconstruction errors measured on XShapeCorpus shapes. These additions will be included in the next version. revision: yes
Referee: [§3.2] §3.2 (Encoding Pipeline): The frequency-propagation operator is described only at a high level; its explicit functional form (additive, multiplicative, convolutional, etc.) is not given, preventing verification that the overall map remains bijective or that high-frequency injection does not destroy exact recoverability of the original indicator function or contour.

Authors: We apologize for the insufficient detail. The frequency-propagation step is a per-coefficient multiplicative scaling c'_k = c_k · (1 + β · r_k), where r_k is the radial frequency of the k-th Zernike term and β is a small positive constant. Because the scaling factor is strictly positive for all retained orders, the map is invertible by simple division. The harmonic pose encoding is stored separately and recovered directly from its own coefficients. In the revision we will state this functional form explicitly in §3.2, include the corresponding equation, and provide a short argument that the overall encoding remains bijective for any finite set of orders. This clarification will be added. revision: yes
Referee: [§5] §5 (Experiments): The reported results on discriminability and applicability lack quantitative error analysis, ablation on the number of Zernike orders retained, and explicit statements of data-exclusion criteria or shape-sampling density. Without these, it is impossible to assess whether the claimed advantages hold beyond the specific XShapeCorpus instances.

Authors: We agree that these details are necessary for proper evaluation. In the revised §5 we will add (i) quantitative reconstruction error (mean IoU and L2 norm) for the decoded shapes, (ii) an ablation table varying the maximum Zernike order from 4 to 24, and (iii) a precise description of XShapeCorpus construction, including uniform sampling of 1024 boundary points per shape, exclusion of degenerate (zero-area) contours, and the exact train/test split. These additions will allow readers to assess the method beyond the reported instances. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on standard Zernike orthogonality and explicit decomposition without self-referential reduction

full rationale

The paper constructs XShapeEnc via an explicit pipeline: decompose the input shape into normalized geometry inside the unit disk plus a harmonic pose field, project both onto a finite set of orthogonal Zernike polynomials, and apply a frequency-propagation step. Invertibility is asserted to follow from the known L2 orthogonality of Zernike bases on the disk (a pre-existing mathematical fact, not derived inside the paper). No parameter is fitted to data and then renamed as a prediction, no self-citation is invoked to justify a uniqueness theorem or ansatz, and no known empirical pattern is merely relabeled. The central claims therefore remain independent of the target properties; they are built from external, verifiable components rather than reducing to the inputs by construction. This is the normal, non-circular case for a method paper that assembles standard tools.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger constructed from abstract only. No explicit free parameters are described. Relies on standard mathematical properties of Zernike polynomials.

axioms (1)

standard math Zernike polynomials form an orthogonal basis over the unit disk
Invoked to encode shape geometry and pose independently or jointly.

invented entities (1)

harmonic pose field no independent evidence
purpose: Transform of the pose vector into a field inside the unit disk for joint encoding
Introduced as part of the pose handling step.

pith-pipeline@v0.9.0 · 5543 in / 1432 out tokens · 60537 ms · 2026-05-12T01:12:57.603735+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean and Cost/FunctionalEquation.lean alexander_duality_circle_linking; washburn_uniqueness_aczel; dAlembert_to_ODE_general echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Zernike basis ... mutually orthogonal over the unit disk ... ∫ V_m^n (V_m'^n')* r dr dθ = π/(n+1) δ δ (Eq 4); projection z_m^n = (n+1)/π ∬ f_G V* (Eq 5); FreqProp z_m^n ← z_m^n + λ |z_m^{n-2}| e^{i arg} (Eq 7) is linear and invertible
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_add; logicNat_initial; realization_initial echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Linearity: Zernike encoding on composite shape equals linear combination of individual encodings (Sec 6.2); harmonic pose field projection A = p · C with radially orthonormal windows guarantees invertibility (Eq 10, Sec 6.6)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

[1]

Belongie, J

S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts. InIEEE T ransactions on Pattern Analysis and Machine Intelligence (T-P AMI), 2002

work page 2002
[2]

Boob and M

A. Boob and M. Radke. ElementaryCQT: A New Dataset and its Deep Learning Analysis for 2D Geometric Shape Recognition. InSN Computer Science, 2024

work page 2024
[3]

Burgess, J

J. Burgess, J. J. Nirschl, M.-C. Zanellati, A. Lozano, S. Cohen, and S. Yeung-Levy. Orientation- Invariant Autoencoders Learn Robust Representations For Shape Profiling of Cells and Organelles. InNature Communications, 2024

work page 2024
[4]

A. X. Chang, T. Funkhouser, L. Guibas, P . Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[5]

Chang and J

S.-F. Chang and J. R. Smith. Extracting Multi-dimensional Signal Features for Content-based Visual Query. InSPIE Symposium on Visual Communications and Signal Processing, volume 2501, pages 995–1006, 1995. doi: 10.1117/12.206632

work page doi:10.1117/12.206632 1995
[6]

R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017

work page 2017
[7]

Chen and H

Z. Chen and H. Zhang. Learning Implicit Fields for Generative Shape Modeling.IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019
[8]

Clementini, P

E. Clementini, P . Di Felice, and P . van Oosterom. A Small Set of Formal Topological Relationships Suitable for End-User Interaction. InAdvances in Spatial Databases, 1993

work page 1993
[9]

A. J. Davison. FutureMapping: The Computational Structure of Spatial AI Systems.CoRR, abs/1803.11288, 2018

work page arXiv 2018
[10]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2009

work page 2009
[11]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[12]

Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, and E. Wong. 3D Deep Shape Descriptor. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015

work page 2015
[13]

Gafni, A

O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, editors,European Conference on Computer Vi- sion (ECCV), 2022

work page 2022
[14]

Groueix, M

T. Groueix, M. Fisher, V . G. Kim, B. Russell, and M. Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 32

work page 2018
[15]

Gu and Y

R. Gu and Y. Luo. ReZero: Region-Customizable Sound Extraction.IEEE/ACM T ransactions on Audio, Speech, and Language Processing (TASLP), 32:2576–2589, 2024

work page 2024
[16]

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

work page 2016
[17]

Y. He, A. Cherian, G. Wichern, and A. Markham. Deep Neural Room Acoustics Primitive. InInternational Conference on Machine Learning (ICML), 2024

work page 2024
[18]

Y. He, A. Markham, and O. Köpüklü. SoundTRC: DNN-based Acoustic Target Region Control. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

work page 2025
[19]

E. Hua, C. Jiang, X. Lv, K. Zhang, N. Ding, Y. Sun, B. Qi, Y. Fan, X. Zhu, and B. Zhou. Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization. InInternational Conference on Machine Learning (ICML), 2025

work page 2025
[20]

Huang, T

Z. Huang, T. Wu, W. Lin, S. Zhang, J. Chen, and F. Wu. AutoGeo: Automating Geometric Image Dataset Creation for Enhanced Geometry Understanding.IEEE T ransactions on Multimedia, 27:3105–3116, 2025. doi: 10.1109/TMM.2025.3557720

work page doi:10.1109/tmm.2025.3557720 2025
[21]

Itani, T

M. Itani, T. Chen, T. Yoshioka, and S. Gollakota. Creating Speech Zones with Self- Distributing Acoustic Swarms. InNature Communications, 2023

work page 2023
[22]

Khotanzad and Y

A. Khotanzad and Y. Hong. Invariant Image Recognition by Zernike Moments.IEEE T ransactions on Pattern Analysis and Machine Intelligence (TP AMI), 1990

work page 1990
[23]

D. P . Kingma and J. Ba. Adam: A Method for Stochastic Optimization. InInternational Conference on Learning Representations (ICLR), 2015

work page 2015
[24]

Konukoglu, B

E. Konukoglu, B. Glocker, A. Criminisi, and K. M. Pohl. WESD–Weighted Spectral Distance for measuring shape dissimilarity.IEEE T ransactions on Pattern Analysis and Machine Intelligence (TP AMI), 35(9):2284–2297, 2013

work page 2013
[25]

A. E. Korchi and Y. Ghanou. 2D Geometric Shapes Dataset–For Machine Learning and Pattern Recognition.Data in Brief, 32:106090, 2020. ISSN 2352-3409. doi: https://doi.org/ 10.1016/j.dib.2020.106090. URL https://www.sciencedirect.com/science/articl e/pii/S2352340920309847

work page doi:10.1016/j.dib.2020.106090 2020
[26]

L. J. Latecki. Shape data for the mpeg-7 core experiment ce-shape-1, 2006. Dataset

work page 2006
[27]

Lee and M

J.-G. Lee and M. Kang. Geospatial Big Data: Challenges and Opportunities. InBig Data Research, 2015

work page 2015
[28]

G. Mai, K. Janowicz, B. Yan, R. Zhu, L. Cai, and N. Lao. Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells. InInternational Conference on Learning Representations (ICLR), 2020

work page 2020
[29]

G. Mai, C. Jiang, W. Sun, R. Zhu, Y. Xuan, L. Cai, K. Janowicz, S. Ermon, and N. Lao. To- wards General-Purpose Representation Learning of Polygonal Geometries.GeoInformatica, 27(2):289–340, 2023

work page 2023
[30]

Malik, S

J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and Texture Analysis for Image Segmentation. InInternational Journal of Computer Vision (IJCV), 2001. 33

work page 2001
[31]

Mescheder, M

L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy Networks: Learning 3D Reconstruction in Function Space. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019
[32]

Osada, T

R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape Distributions.ACM T ransactions on Graphics, 21(4):807–832, Oct. 2002

work page 2002
[33]

F. P . Kuhl and C. R. Giardina. Elliptic Fourier Features of a Closed Contour.Computer Graphics and Image Processing, 1982

work page 1982
[34]

J. J. Park, P . Florence, J. Straub, R. Newcombe, and S. Lovegrove. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019
[35]

Persoon and K.-S

E. Persoon and K.-S. Fu. Shape Discrimination Using Fourier Descriptors.IEEE T ransactions on Systems, Man, and Cybernetics, 1977

work page 1977
[36]

Press, N

O. Press, N. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. InInternational Conference on Learning Representations (ICLR), 2022

work page 2022
[37]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P . Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. InInternational Conference on Machine Learning (ICML), 2021

work page 2021
[38]

A. F. Romero, C. Russell, A. Krull, and V . Uhlmann. ShapeEmbed: A Self-Supervised Learning Framework for 2D Contour Quantification. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[39]

J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey. SDR - Half-baked Or Well Done? In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

work page 2019
[40]

Scheibler, E

R. Scheibler, E. Bezzam, and D. Ivan. Pyroomacoustics: A python package for audio room simulations and array processing algorithms. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

work page 2018
[41]

C. Shu, J. Deng, F. Yu, and Y. Liu. 3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers. InInternational Conference on Computer Vision (ICCV), 2023

work page 2023
[42]

M. D. Siampou, J. Li, J. Krumm, C. Shahabi, and H. Lu. Poly2vec: Polymorphic Encoding of Geospatial Objects for Spatial Reasoning with Deep Neural Networks.International Conference on Machine Learning (ICML), 2025

work page 2025
[43]

J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. InNeurocomputing, 2024

work page 2024
[44]

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen. A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010

work page 2010
[45]

Tancik, P

M. Tancik, P . P . Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. 34

work page 2020
[46]

Tatarchenko, A

M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree Generating Networks: Efficient Con- volutional Architectures for High-resolution 3D Outputs. InIEEE International Conference on Computer Vision (ICCV), 2017

work page 2017
[47]

van der Maaten and G

L. van der Maaten and G. Hinton. Visualizing Data using t-SNE.Journal of Machine Learning Research (JMLR), 9(86):2579–2605, 2008

work page 2008
[48]

van’t Veer, P

R. van’t Veer, P . Bloem, and E. Folmer. Deep Learning for Classification Tasks on Geospatial Vector Polygons, 2019

work page 2019
[49]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. J. Jone, A. N. Gomez, and L. Kaiser. Attention is All You Need. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[50]

von F. Zernike. Beugungstheorie des Schneidenver-Fahrens Und Seiner Verbesserten form, der Phasenkontrastmethode.Physica, 1(7):689–704, 1934

work page 1934
[51]

R. G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall. LSD: a Line Segment Detector. InImage Processing On Line (IPOL), 2012

work page 2012
[52]

H. Wang, X. Wu, Z. Huang, and E. P . Xing. High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[53]

M. Wang, C. Boeddeker, R. G. Dantas, and A. Seelan. PESQ (Perceptual Evaluation of Speech Quality) Wrapper for Python Users, 2022

work page 2022
[54]

X. Wang, B. Feng, X. Bai, W. Liu, and L. Jan Latecki. Bag of Contour Fragments for Robust Shape Classification.Pattern Recognition, 47(6):2116–2125, 2014. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2013.12.008

work page doi:10.1016/j.patcog.2013.12.008 2014
[55]

Z. Yang, J. Wang, Z. Gan, L. Li, K. Lin, C. Wu, N. Duan, Z. Liu, C. Liu, M. Zeng, and L. Wang. ReCo: Region-Controlled Text-to-Image Generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[56]

D. Yu, Y. Hu, Y. Li, and L. Zhao. PolygonGNN: Representation Learning for Polygonal Geometries with Heterogeneous Visibility Graph. InACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024

work page 2024
[57]

Zhang, P

H. Zhang, P . Wang, M. Li, Z. Li, and Y. Wu. Unit Region Encoding: A Unified and Compact Geometry-aware Representation for Floorplan Applications, 2025. 35

work page 2025