GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation
Pith reviewed 2026-05-08 09:32 UTC · model grok-4.3
The pith
A dataset of 23,148 real glaze recipes enables AI to predict fired surface properties and generate matching images from raw materials.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GlazyBench supplies 23,148 real glaze formulations that allow models to learn the mapping from ingredient combinations to post-firing color, transparency, and visual appearance, with experiments on property prediction and image generation yielding promising but imperfect results.
What carries the argument
The GlazyBench dataset of real-world glaze recipes paired with their fired properties and images, used as training and test data for property prediction and image generation models.
Load-bearing premise
The 23,148 collected formulations accurately represent the range of possible glazes and their outcomes without collection biases that would limit model reliability on new designs.
What would settle it
Collect new glaze recipes outside the dataset, fire them under controlled conditions, and check whether the AI predictions of color, transparency, and generated images match the actual fired results.
Original abstract
Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale datasets required to train these models. We propose GlazyBench, the first dataset for AI-assisted glaze design. Comprising 23,148 real glaze formulations, GlazyBench supports two primary tasks: predicting post-firing surface properties, such as color and transparency, from raw materials, and generating accurate visual representations of the glaze based on these properties. We establish comprehensive baselines for property prediction using traditional machine learning and large language models, alongside image generation benchmarks using deep generative and large multimodal models. Our experiments demonstrate promising yet challenging results. GlazyBench pioneers a new research direction in AI-assisted material design, providing a standardized benchmark for systematic evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GlazyBench, a dataset of 23,148 real glaze formulations sourced from user-submitted repositories, positioned as the first benchmark for AI-assisted ceramic glaze design. It defines two core tasks: (1) predicting post-firing properties such as color and transparency from raw material compositions, with baselines using traditional ML and LLMs, and (2) generating visual representations of fired glazes using deep generative and large multimodal models. The authors report promising yet challenging baseline results and claim the resource enables systematic evaluation in a new research direction.
Significance. If the dataset proves representative and labels reliable, this benchmark could meaningfully advance AI applications in material design by reducing costly trial-and-error for ceramic artists and providing a standardized testbed. The release of baselines for both prediction and generation tasks is a constructive starting point that lowers the barrier for follow-on work.
major comments (2)
- [Data Collection] Data Collection and Validation: The manuscript provides insufficient documentation on sourcing the 23,148 formulations (e.g., from Glazy.org), including any deduplication procedures, validation of user-reported post-firing properties against actual firing outcomes, inter-rater reliability for labels such as color and transparency, or quantitative coverage metrics (e.g., diversity in oxide compositions via PCA or firing schedule distributions). This directly undermines the central claim that models trained on GlazyBench will yield reliable predictions and generations for new designs.
- [Experiments] Baseline Experiments: No quantitative performance metrics, error breakdowns, train/validation/test splits, or statistical validation details are reported for the property prediction or image generation baselines. Without these, the statement of 'promising yet challenging results' cannot be evaluated and does not yet support the benchmark's claimed utility.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., best MAE or FID score) to ground the 'promising' claim.
- [Methods] Notation for input features (oxide compositions) and output properties should be defined consistently in a table or early section to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide detailed responses below, along with plans for revisions to improve the clarity and rigor of the paper.
-
Referee: [Data Collection] Data Collection and Validation: The manuscript provides insufficient documentation on sourcing the 23,148 formulations (e.g., from Glazy.org), including any deduplication procedures, validation of user-reported post-firing properties against actual firing outcomes, inter-rater reliability for labels such as color and transparency, or quantitative coverage metrics (e.g., diversity in oxide compositions via PCA or firing schedule distributions). This directly undermines the central claim that models trained on GlazyBench will yield reliable predictions and generations for new designs.
Authors: We agree that additional documentation on data collection would strengthen the manuscript. In the revised version, we will expand the Data section to include: (1) details on sourcing from Glazy.org, including how formulations were collected via their public API or repository; (2) deduplication procedures, such as normalizing compositions to 100% and removing entries with identical oxide percentages; (3) quantitative coverage metrics, including PCA visualizations of the oxide composition space and distributions of firing schedules (temperature and hold times). For validation, since the properties are user-reported based on their firing experiences, we cannot provide independent lab validation for the entire dataset due to resource constraints. We will explicitly discuss this as a limitation of the benchmark, noting that Glazy.org entries often include photos and community feedback which provide some corroboration. Inter-rater reliability is not available as each formulation has a single reporter. These additions will better contextualize the dataset's strengths and limitations without overstating its reliability. revision: partial
-
Referee: [Experiments] Baseline Experiments: No quantitative performance metrics, error breakdowns, train/validation/test splits, or statistical validation details are reported for the property prediction or image generation baselines. Without these, the statement of 'promising yet challenging results' cannot be evaluated and does not yet support the benchmark's claimed utility.
Authors: We acknowledge that the experimental results section would benefit from more detailed quantitative reporting. We will revise the Experiments section to include: specific performance metrics such as mean absolute error (MAE) and root mean square error (RMSE) for property predictions (e.g., for color in CIELAB space and transparency), along with breakdowns by key factors like dominant oxides or firing temperature ranges. For image generation, we will report Fréchet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), and other relevant metrics, supported by statistical analysis including confidence intervals. We will clearly describe the train/validation/test splits used (e.g., random 70/15/15 split with stratification to ensure diversity), and any cross-validation procedures. These details will allow readers to fully evaluate the baselines and the benchmark's utility. The 'promising yet challenging' characterization will be supported by these numbers. revision: yes
- Not resolved by the planned revisions: complete independent validation of all user-submitted post-firing properties against controlled laboratory experiments, which the authors rule out because of the scale (23k entries) and crowdsourced nature of the data.
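The deduplication step proposed in the authors' first response (normalize each composition to 100%, then drop entries with identical oxide percentages) could be sketched as follows. This is only an illustration: the `Recipe` type, the function names, and the two-decimal rounding used to decide that two compositions are "identical" are assumptions, not details from the paper.

```python
from typing import Dict, List

Recipe = Dict[str, float]  # hypothetical type: oxide name -> amount


def normalize_to_100(recipe: Recipe) -> Recipe:
    """Rescale oxide amounts so that they sum to 100."""
    total = sum(recipe.values())
    if total <= 0:
        raise ValueError("recipe has no positive oxide amounts")
    return {oxide: 100.0 * amount / total for oxide, amount in recipe.items()}


def dedup_key(recipe: Recipe, decimals: int = 2) -> tuple:
    """Hashable fingerprint: sorted (oxide, rounded percentage) pairs.

    Rounding to `decimals` places is an assumed tolerance; oxides that
    round to zero are dropped so they do not affect the comparison.
    """
    normalized = normalize_to_100(recipe)
    return tuple(
        sorted(
            (oxide, round(pct, decimals))
            for oxide, pct in normalized.items()
            if round(pct, decimals) > 0
        )
    )


def deduplicate(recipes: List[Recipe], decimals: int = 2) -> List[Recipe]:
    """Keep the first recipe seen for each normalized composition."""
    seen = set()
    unique = []
    for recipe in recipes:
        key = dedup_key(recipe, decimals)
        if key not in seen:
            seen.add(key)
            unique.append(recipe)
    return unique
```

Keying on the normalized composition means two entries that differ only in batch size collapse into one, which matches the stated intent; whether firing conditions should also be part of the key is left open here.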
Circularity Check
No significant circularity in derivation chain
The paper is a dataset release and benchmark paper that introduces GlazyBench comprising 23,148 real glaze formulations and establishes baselines for property prediction and image generation tasks. There are no mathematical derivations, equations, fitted parameters, or predictions that reduce to their own inputs by construction. The central claims rest on data collection and experimental baselines rather than any self-definitional, self-citation load-bearing, or ansatz-smuggled steps. This is the most common honest finding for benchmark papers and receives the default low score.
Reference graph
Works this paper leans on
- [1] Ahmmad, S.K., Jabeen, N., Ahmed, S.T.U., Ahmed, S.A., Rahman, S.: Artificial intelligence density model for oxide glasses. Ceramics International 47(6), 7946–7956 (2021)
- [2] Belciu, M.I., Velea, A.: Ensemble machine learning for the prediction and understanding of the refractive index in chalcogenide glasses. Molecules 30(8), 1745 (2025)
- [3] Bo, Y., Liu, Q., Huang, X., Pan, Y.: Real-time hard-rock tunnel prediction model for rock mass classification using CatBoost integrated with sequential model-based optimization. Tunnelling and Underground Space Technology 124, 104448 (2022)
- [4] Bondioli, F., Manfredini, T., Romagnoli, M.: Color matching algorithms in ceramic tile production. Journal of the European Ceramic Society 26(3), 311–316 (2006)
- [5] Castela, A., Fonseca, A., Mantas, P.: Development of coloured glazes for tile applications using Taguchi’s method. Journal of the European Ceramic Society 30(12), 2451–2455 (2010)
- [6] Chakraborty, S., Björk, J., Dahlqvist, M., Rosen, J., Heintz, F.: A survey of AI-supported materials informatics. Computer Science Review 59, 100845 (2026)
- [7] Chen, S.S.C., Cui, H., Tan, P., Sun, X., Ji, Y., Duh, H.: Cantonese porcelain image generation using user-guided generative adversarial networks. IEEE Computer Graphics and Applications 40(5), 100–107 (2020)
- [8] Ding, X., Wang, Y., Xu, Z., Welch, W.J., Wang, Z.J.: CcGAN: Continuous conditional generative adversarial networks for image generation. In: International Conference on Learning Representations (2021)
- [9] Feng, H., Zhang, G., Cao, J., Ren, H., Wan, W., Liu, D.: Application of WOA optimized LightGBM in lithology identification of igneous logging. Progress in Geophysics 40(1), 230–242 (2025)
- [10] Feng, L., Wang, F., Luo, H., Zhu, J., Wang, M., Yang, C., Sun, J., Wang, T.: Phase-separated tenmoku “blue” glaze: Microstructure and coloring mechanism. Journal of the European Ceramic Society 43(14), 6581–6589 (2023)
- [11] Fu, Z.: Digital color enhancement in ceramic imagery using graph-guided residual learning and adaptive scattering models. Journal of Computational Methods in Sciences and Engineering p. 14727978251391297 (2025)
- [12] Fujinuma, N., DeCost, B., Hattrick-Simpers, J., Lofland, S.E.: Why big data and compute are not necessarily the path to big materials science. Communications Materials 3(1), 59 (2022)
- [13] Gao, W., Zhang, X., Yang, L., Liu, H.: An improved Sobel edge detection. In: 2010 3rd International Conference on Computer Science and Information Technology. vol. 5, pp. 67–71. IEEE (2010)
- [14] Glazy Contributors: Glazy. https://glazy.org/ (2026), accessed: 2026-02-01
- [15] Imer, C., Günay, E., Öveçoğlu, M.: Effects of firing temperatures and compositions on the formation of nano particles in lustre layers on a lead-alkali glaze. Ceramics International 42(15), 17222–17228 (2016)
- [16] Jin, Q., Luo, X., Shi, Y., Kita, K.: Image generation method based on improved condition GAN. In: 2019 6th International Conference on Systems and Informatics (ICSAI). pp. 1290–1294. IEEE (2019)
- [17] Li, Y., Fu, R., Meng, X., Jin, W., Shao, F.: A SAR-to-optical image translation method based on conditional generation adversarial network (cGAN). IEEE Access 8, 60338–60343 (2020)
- [18] Liu, H., Fu, Z., Yang, K., Xu, X., Bauchy, M.: Machine learning for glass science and engineering: A review. Journal of Non-Crystalline Solids 557, 119419 (2021)
- [19] Mao, L.x., He, F., Li, L., Xu, W., Wang, Y., Liu, Q.f.: A quantitative study of phase assemblage in cement-fly ash-slag ternary systems using machine learning-assisted BSE-EDS image analysis. Construction and Building Materials 498, 143712 (2025)
- [20] Mues, M., Kraemer, D., Styn, D.M.E.: Using machine learning classifiers together with discrimination diagrams for validation of rock classification labels. Applied Computing and Geosciences p. 100288 (2025)
- [21] Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002)
- [22] Riedel, H., Mokdad, S., Schulz, I., Kocer, C., Rosendahl, P.L., Schneider, J., Kraus, M.A., Drass, M.: Automated quality control of vacuum insulated glazing by convolutional neural network image classification. Automation in Construction 135, 104144 (2022)
- [23] Romagnoli, M., Bondioli, F., Barattini, M., et al.: Neural network approach for color matching of ceramic glazes. In: International Congress of Ceramic Materiali. vol. 1, pp. xx–xx. ECERS (2008)
- [24] Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23(3), 309–314 (2004)
- [25] Ruiyi, H., Zhuwen, W., Wenhua, W., Fanghui, X., Xinghua, Q., Yitong, C.: Lithology identification of igneous rocks based on XGBoost and conventional logging curves, a case study of the eastern depression of Liaohe Basin. Journal of Applied Geophysics 195, 104480 (2021)
- [26] Rumble Jr, J.R.: Accessing materials data: challenges and directions in the digital era. Integrating Materials and Manufacturing Innovation 6(2), 172–186 (2017)
- [27] Santos, T., Hennetier, L., Costa, V.A., Costa, L.C.: Temperature assessment through decal color in microwave-fired porcelain. Journal of Manufacturing and Materials Processing 9(7), 213 (2025)
- [28] Schabbach, L., Bondioli, F., Fredel, M.: Colouring of opaque ceramic glaze with zircon pigments: Formulation with simplified Kubelka–Munk model. Journal of the European Ceramic Society 31(5), 659–664 (2011)
- [29] Schabbach, L., Bondioli, F., Fredel, M.: Color prediction with simplified Kubelka–Munk model in glazes containing Fe2O3–ZrSiO4 coral pink pigments. Dyes and Pigments 99(3), 1029–1035 (2013)
- [30] Trott, M., Leybourne, M., Hall, L., Layton-Matthews, D.: Random forest rock type classification with integration of geochemical and photographic data. Applied Computing and Geosciences 15, 100090 (2022)
- [31] Ueki, K., Hino, H., Kuwatani, T.: Geochemical discrimination and characteristics of magmatic tectonic settings: A machine-learning-based approach. Geochemistry, Geophysics, Geosystems 19(4), 1327–1347 (2018)
- [32] Vasić, M.V., Awoyera, P.O., Fadugba, O.G., Barišić, I., Grubeša, I.N.: Advanced machine learning models for the prediction of ceramic tiles’ properties during the firing stage. Scientific Reports 15(1), 31397 (2025)
- [33] Wang, Y., Zhang, G.: Lightweight text-to-image generation model based on contrastive language-image pre-training embeddings and conditional variational autoencoders. Electronics 14(11), 2185 (2025)
- [34] Wei, J., Hao, Y., Fu, Y., Yang, L., Gan, J., Li, H.: Experimental study on glaze icing detection of 110 kV composite insulators using fiber Bragg gratings. Sensors 20(7), 1834 (2020)
- [35] Wu, B., Zhao, W., Ren, X., Liu, X., Li, B., Feng, S., Feng, X., Zhao, H.: Firing process and colouring mechanism of black glaze and brown glaze porcelains from the Yuan and Ming dynasties from the Qingliang Temple kiln in Baofeng, Henan, China. Ceramics International 47(23), 32817–32827 (2021)
- [36] Xie, Y., Wang, X.: Prediction of thermal and optical properties of oxyfluoride glasses based on interpretable machine learning. Nanomaterials 15(11), 860 (2025)
- [37] Yamagiwa, A., Goto, M., et al.: An analytical model using CVAE-based image generation from product descriptions and image data. Industrial Engineering & Management Systems 24(4), 650–662 (2025)
- [38] Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2Image: Conditional image generation from visual attributes. In: European Conference on Computer Vision. pp. 776–791. Springer (2016)
- [39] Zhang, C., Neu, R.W.: Understanding the role of glaze layer with aligned images from multiple surface characterization techniques. Wear 477, 203837 (2021)
- [40] Zhang, P., Xi, X., Wang, B.C.: Geochemical signatures and element interactions of volcanic-hosted agates: Insights from interpretable machine learning. Minerals 15(9), 923 (2025)
- [41] Zhang, Y., Li, A., Deng, B., Hughes, K.K.: Data-driven predictive models for chemical durability of oxide glass under different chemical conditions. npj Materials Degradation 4(1), 14 (2020)
- [42] Zhao, L., Zhang, Y.: Revealing the individual effects of firing temperature and chemical composition on Raman parameters of celadon glaze. Ceramics 6(2), 1263–1276 (2023)
Appendix A: Data Preprocessing Details
A.1 Color Annotation Methodology
Transparency and surface-texture labels are obtained directly from structured dropdown menus on th...
- Reference model (ensemble construction). We train and compare four machine-learning models to learn the recipe-to-color mapping from the manually labeled data. The two best-performing models, Random Forest and XGBoost, are retained and combined into an ensemble for downstream color selection.
- RGB-based agreement and selection. The two models independently predict an RGB color. Let the two predicted candidates be $c_1, c_2 \in \mathbb{R}^3$, and let $\bar{c}_{\mathrm{pred}}$ denote their centroid. We compute Euclidean distances $d_k = \lVert c_k - \bar{c}_{\mathrm{pred}} \rVert_2$, $k \in \{1, 2\}$, and select $\arg\min_k d_k$. Intuitively, this step prefers the candidate closer to the consensus of the two predictors.
- Ambiguity filtering. If $|d_1 - d_2| < 10$, the two candidates are considered equally plausible and the sample is marked as ambiguous and discarded. After filtering, 12,175 training samples remain with validated color annotations.
Sanity check. All 3,097 samples previously marked as uncertain during manual curation are removed by the above filtering pipeline, supp...
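The selection and ambiguity rules above can be illustrated with a small sketch. It is hedged in one important respect: the midpoint of exactly two points is equidistant from both, so the sketch does not derive the consensus color from the two candidates itself. Instead it takes the consensus color as a separate argument, which is an assumption about how the extracted description fits together. The function name `select_candidate` and its return convention are likewise hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

RGB = Sequence[float]


def select_candidate(
    c1: RGB,
    c2: RGB,
    c_pred: RGB,
    ambiguity_threshold: float = 10.0,
) -> Optional[Tuple[int, RGB]]:
    """Pick the candidate closer to the consensus color c_pred.

    Returns (k, c_k) for the selected candidate, or None when
    |d1 - d2| is below the threshold and the sample is ambiguous.
    """
    d1 = math.dist(c1, c_pred)  # Euclidean distance in RGB space
    d2 = math.dist(c2, c_pred)
    if abs(d1 - d2) < ambiguity_threshold:
        return None  # equally plausible candidates: discard the sample
    return (1, c1) if d1 < d2 else (2, c2)


# Usage: the first candidate is much closer to the consensus color.
choice = select_candidate((200, 30, 30), (90, 90, 90), (190, 40, 40))
```

The threshold of 10 is the value quoted in the text; it is in raw RGB distance units, so its strictness depends on the 0–255 scale assumed here.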
- Chemical composition (wt.% oxides). All oxide weight percentages larger than 0.01% are listed in the format Oxide: value% (comma-separated), e.g., SiO2: 45.20%, Al2O3: 12.80%, CaO: 8.50%,
- UMF formula. All UMF entries larger than 0.01 are listed as Oxide: value and prefixed by UMF Formula:
- Firing parameters. If available, we include cone information (Cone: N or Cone Range: N–M) and atmosphere (Oxidation or Reduction). Otherwise, the field is set to No additional firing parameters available.
C.3 Prompt Design
For each task, we use a unified prompt template that supports both zero-shot and few-shot evaluation. The template consists of:
- a role declaration and task instruction
- an explicit, enumerated label set with short descriptions
- domain rules connecting oxides/firing conditions to visual properties
- an optional few-shot block {few_shot_examples}
- the query sample (three input blocks as above)
- a strict output constraint: output exactly one label from the allowed set.
For zero-shot evaluation (K = 0), the few-shot block is omitted. For K-shot evaluation, the block is populated as described in Section C.4.
Task-specific instantiations. The three tasks share the same structure but differ in label sets and domain rules:
• Transparency (4 classes). Lab...
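To make the three input blocks and the six-part template concrete, here is a minimal sketch of how a sample might be serialized and a prompt assembled. The field formats follow the text (thresholds of 0.01, the `UMF Formula:` prefix, the fallback firing string); everything else is an assumption for illustration, including the function names, the two-decimal UMF formatting, the separators between fields and sections, and the wording of the label header.

```python
from typing import Dict, Optional, Sequence, Tuple


def format_oxides(oxides: Dict[str, float]) -> str:
    """Oxide wt.% above 0.01, as comma-separated 'Oxide: value%'."""
    return ", ".join(
        f"{name}: {pct:.2f}%" for name, pct in oxides.items() if pct > 0.01
    )


def format_umf(umf: Dict[str, float]) -> str:
    """UMF entries above 0.01, prefixed by 'UMF Formula:'."""
    body = ", ".join(f"{name}: {val:.2f}" for name, val in umf.items() if val > 0.01)
    return f"UMF Formula: {body}"


def format_firing(
    cone: Optional[str] = None,
    cone_range: Optional[Tuple[str, str]] = None,
    atmosphere: Optional[str] = None,
) -> str:
    """Cone or cone range plus atmosphere, or the fallback string."""
    parts = []
    if cone_range is not None:
        parts.append(f"Cone Range: {cone_range[0]}–{cone_range[1]}")
    elif cone is not None:
        parts.append(f"Cone: {cone}")
    if atmosphere:
        parts.append(atmosphere)  # "Oxidation" or "Reduction"
    return ", ".join(parts) if parts else "No additional firing parameters available"


def build_prompt(
    instruction: str,
    labels: Dict[str, str],
    domain_rules: str,
    query_blocks: Sequence[str],
    few_shot_block: str = "",
) -> str:
    """Assemble the template; an empty few-shot block means zero-shot."""
    label_lines = "\n".join(f"- {name}: {desc}" for name, desc in labels.items())
    sections = [instruction, f"Allowed labels:\n{label_lines}", domain_rules]
    if few_shot_block:
        sections.append(few_shot_block)
    sections.append("\n".join(query_blocks))
    sections.append("Output exactly one label from the allowed set.")
    return "\n\n".join(sections)
```

Passing `few_shot_block=""` reproduces the zero-shot case (K = 0) described above, since the block is simply left out of the assembled sections.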
- Collect training samples that (i) have valid labels for the target task and (ii) contain non-empty chemical composition data. Group them by class.
- Iterate classes in insertion order and draw one example per class in sequence until K examples are obtained. Classes with no remaining samples are removed from the rotation.
- Serialize each selected example using the same three-block format as the query, followed by Answer: {label}.
This procedure encourages class coverage in-context, ensuring up to min(K, |C|) distinct classes appear in the prompt. This is particularly relevant for imbalanced tasks (e.g., surface texture, where Glossy accounts for 49% of samples). Few-shot block...
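The class-balanced selection just described amounts to a round-robin over per-class queues. The sketch below shows one way to write it; the sample layout (`"labels"` and `"oxides"` keys) and the function name are hypothetical, and examples are drawn in data order because the text does not say whether the draw within a class is random.

```python
from collections import deque
from typing import Dict, List


def select_few_shot(samples: List[dict], task: str, k: int) -> List[dict]:
    """Round-robin selection of up to k examples, one class at a time."""
    # Step 1: keep samples with a valid label and non-empty composition,
    # grouped by class in insertion order (dicts preserve insertion order).
    by_class: Dict[str, deque] = {}
    for sample in samples:
        label = sample.get("labels", {}).get(task)
        if label and sample.get("oxides"):
            by_class.setdefault(label, deque()).append(sample)

    # Step 2: cycle through the classes, drawing one example per class,
    # and drop a class from the rotation once it is exhausted.
    selected: List[dict] = []
    rotation = list(by_class)
    while len(selected) < k and rotation:
        for label in list(rotation):
            if len(selected) == k:
                break
            queue = by_class[label]
            selected.append(queue.popleft())
            if not queue:
                rotation.remove(label)
    return selected
```

With K = 3 and two eligible classes, the result contains both classes before any class repeats, which is the min(K, |C|) coverage property noted above.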
- Strip leading/trailing whitespace and quotation characters, then extract the first line.
- Iterate through the ordered list of valid labels and return the first label whose lowercase form appears as a substring of the lowercase response line.
- For multi-word labels (e.g., Semi-opaque, Satin-matte, Smooth Matte), we accept both hyphenated and space-separated variants.
- Outputs that match none of the valid labels are recorded as parsing failures and excluded from metric computation.
Appendix D: Specifications of Image-Generation Baselines
This appendix reports the technical specifications of two baseline models for the conditional glaze image generation task (Task D), including the problem formulation, mo...
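Looking back at the four response-parsing rules that close Appendix C (strip, take the first line, ordered substring match, hyphen/space variants), a minimal sketch might look like this. The function name, the exact set of quotation characters stripped, and the example label list are assumptions for illustration rather than the authors' implementation.

```python
from typing import Optional, Sequence

QUOTE_CHARS = "\"'`“”‘’"  # assumed set of quotation characters


def parse_label(response: str, valid_labels: Sequence[str]) -> Optional[str]:
    """Map a raw model response to one valid label, or None on failure."""
    # Rule 1: strip whitespace and quotes, then keep only the first line.
    stripped = response.strip().strip(QUOTE_CHARS).strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].lower()

    # Rules 2 and 3: first label (in list order) found as a substring,
    # accepting hyphenated and space-separated spellings.
    for label in valid_labels:
        lowered = label.lower()
        variants = {lowered, lowered.replace("-", " "), lowered.replace(" ", "-")}
        if any(variant in first_line for variant in variants):
            return label

    # Rule 4: no match is a parsing failure, excluded from metrics upstream.
    return None


# Illustrative label order: "Semi-opaque" must precede "Opaque", because
# "opaque" is itself a substring of "semi-opaque".
LABELS = ["Semi-opaque", "Opaque", "Translucent", "Transparent"]
```

Because matching is by substring, the order of the label list matters: placing the longer, more specific label first keeps a response such as "semi opaque" from being read as "Opaque".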
- Resize to 128×128 using Lanczos resampling.
- Normalize pixel values to [−1, 1] via (x/255 − 0.5)/0.5.
- Apply random horizontal flipping (probability 0.5) to training images only.
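The three preprocessing steps can be sketched with NumPy and Pillow. The choice of these two libraries, the function names, and the H x W x C array layout are assumptions, not details given in the paper; `Image.Resampling.LANCZOS` requires Pillow 9.1 or later, and Pillow is imported lazily so the numeric helpers can be used without it.

```python
import numpy as np


def load_and_resize(path: str, size: int = 128) -> np.ndarray:
    """Load an image as RGB and resize it to size x size with Lanczos."""
    from PIL import Image  # lazy import: only needed when reading files

    with Image.open(path) as img:
        resized = img.convert("RGB").resize((size, size), Image.Resampling.LANCZOS)
        return np.asarray(resized, dtype=np.uint8)


def normalize(pixels: np.ndarray) -> np.ndarray:
    """Map values in [0, 255] to [-1, 1] via (x/255 - 0.5)/0.5."""
    return (pixels.astype(np.float32) / 255.0 - 0.5) / 0.5


def maybe_flip(
    image: np.ndarray,
    rng: np.random.Generator,
    training: bool,
    p: float = 0.5,
) -> np.ndarray:
    """Flip an H x W x C image left-right with probability p, training only."""
    if training and rng.random() < p:
        return image[:, ::-1, :]
    return image


# Usage on an in-memory array standing in for a resized image.
rng = np.random.default_rng(0)
dummy = np.zeros((128, 128, 3), dtype=np.uint8)
batch_item = maybe_flip(normalize(dummy), rng, training=True)
```

Keeping the flip behind a `training` flag mirrors the statement that augmentation applies to training images only, so validation and test images pass through unchanged.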