
arxiv: 2605.23777 · v1 · pith:7EQWCUKQnew · submitted 2026-05-22 · 💻 cs.CV

Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset

Pith reviewed 2026-05-25 04:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords emerald grading · gemstone classification · machine learning · image processing · public dataset · computer vision · classification accuracy

The pith

A machine learning framework automates emerald gemstone grading by matching stones to reference images and reports 98 percent accuracy on its own 192-image dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to replace subjective manual grading of emeralds by gemologists with an automated system that uses image processing and machine learning. It builds a complete pipeline from capturing images in a controlled chamber to categorizing stones based on extracted features. A sympathetic reader would care because this could make grading consistent and objective, reducing disagreements between specialists. The work also provides the first public dataset for this task, enabling further research.

Core claim

The framework acquires images in a dedicated chamber, extracts features, and applies a machine learning classifier to categorize emeralds according to reference stones. It reports 98% accuracy on a dataset of 192 images, outperforming a deep learning approach, and the dataset is made public.

What carries the argument

Image acquisition chamber combined with extracted and pre-processed features fed into a machine learning classifier for matching to reference stones.
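To make the "features in, nearest reference out" idea concrete, here is a minimal sketch. It does not reproduce the paper's method: the abstract names neither the features nor the classifier, so the single-channel histogram, the Bhattacharyya distance, the nearest-reference rule, and every name and pixel value below are assumptions chosen only for illustration.

```python
import math

def channel_histogram(values, bins=8):
    """Normalized histogram of 8-bit channel values (0..255); hypothetical feature."""
    counts = [0] * bins
    for v in values:
        counts[min(v * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def bhattacharyya_distance(p, q):
    """Distance between two normalized histograms; 0 when they are identical."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to avoid log(0) for disjoint histograms and rounding above 1.
    coefficient = min(max(coefficient, 1e-12), 1.0)
    return -math.log(coefficient)

def match_reference(stone_hist, reference_hists):
    """Return the name of the reference whose histogram is closest to the stone's."""
    return min(
        reference_hists,
        key=lambda name: bhattacharyya_distance(stone_hist, reference_hists[name]),
    )

# Made-up green-channel values standing in for chamber images.
references = {
    "reference_1": channel_histogram([200] * 50 + [180] * 50),
    "reference_2": channel_histogram([60] * 50 + [90] * 50),
}
stone = channel_histogram([190] * 80 + [100] * 20)
best = match_reference(stone, references)
```

A trained classifier over many features would replace `match_reference` in practice; the sketch only shows where the load-bearing premise enters, namely in the choice of what the feature vector measures.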

Load-bearing premise

The image acquisition chamber and the extracted features encode the same grading criteria that human specialists use when comparing stones to references.

What would settle it

Testing the framework on a fresh set of emeralds graded by multiple independent specialists and finding frequent mismatches with the majority human consensus.

Original abstract

The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones: specialists visually inspect them and decide which of the available reference stones is most similar to the inspected stone. This procedure is very subjective, as different specialists may end up with different grading choices. This work proposes a complete framework that runs from image acquisition to the final stone categorization. The proposal automates the entire process apart from placing the stone in the chamber created for image acquisition. It discards the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we also create and publish the dataset used, which contains 192 images of emerald stones along with their extracted and pre-processed features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an end-to-end machine-learning framework for emerald gemstone grading that combines a custom image-acquisition chamber, hand-crafted feature extraction, and a classifier. It reports 98% accuracy on a newly created public dataset of 192 images and claims to outperform a deep-learning baseline while removing human subjectivity from the reference-stone matching process.

Significance. A reproducible, objective grading system for emeralds would address a long-standing practical problem in gemology. The release of the 192-image dataset with extracted features is a concrete contribution that could enable future benchmarking. However, the absence of any reported validation protocol, label provenance, or statistical controls means the 98% figure cannot yet be treated as evidence that the framework encodes the same criteria used by specialists.
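As a rough illustration of why statistical controls matter at this sample size, the sketch below computes a Wilson score interval for a reported accuracy. It assumes, purely for illustration, that the 98% figure was measured over all 192 images (about 188 correct); the paper's actual evaluation set size is not stated, and the function name and the 38-image held-out example are hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width

# Assumption: 98% of 192 images, rounded to 188 correct.
low_all, high_all = wilson_interval(188, 192)

# Hypothetical alternative: the same accuracy on a 38-image held-out split.
low_split, high_split = wilson_interval(37, 38)

print(f"all 192 images : [{low_all:.3f}, {high_all:.3f}]")
print(f"38-image split : [{low_split:.3f}, {high_split:.3f}]")
```

The interval widens noticeably when the same accuracy comes from a smaller test split, which is why the size of the evaluation set needs to be reported alongside the headline number.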

major comments (3)
  1. [Abstract / Results] Abstract and Results section: the central claim of 98% accuracy (and superiority over deep learning) is presented without any description of the train-test split, cross-validation procedure, number of folds, or error bars. On a dataset of only 192 images this information is load-bearing for interpreting whether the result reflects generalization or overfitting.
  2. [Dataset / Methods] Dataset creation paragraph: although the text acknowledges that emerald grading is subjective and that specialists may disagree on reference-stone matches, no information is supplied on how the ground-truth labels for the 192 images were produced (single grader, consensus of several, or reference to an external standard) and no inter-rater agreement statistic is reported. High label noise would render both the 98% figure and the deep-learning comparison difficult to interpret.
  3. [Results] Comparison to deep learning: the manuscript states that the proposed framework outperforms a deep-learning approach, yet supplies no details on the architecture, training regime, data augmentation, or hyper-parameter search used for the baseline. Without these specifics the performance comparison cannot be evaluated.
minor comments (2)
  1. [Abstract] The abstract claims the framework 'discards the subjective decisions made by specialists,' but the system still relies on human-labeled training data; this tension should be clarified.
  2. [Figures / Tables] Figure captions and table headings should explicitly state the number of images per class and the exact feature dimensionality after preprocessing.
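The validation protocol requested in major comment 1 could look something like the following sketch: stratified k-fold cross-validation reporting mean accuracy and its standard deviation across folds. Everything here is an assumption for illustration, including the four balanced classes, the synthetic one-dimensional features, and the nearest-centroid stand-in classifier; none of it is taken from the paper.

```python
import random
import statistics
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def fit_centroids(xs, ys):
    """Stand-in classifier: one mean feature value per class."""
    grouped = defaultdict(list)
    for x, y in zip(xs, ys):
        grouped[y].append(x)
    return {label: statistics.mean(values) for label, values in grouped.items()}

def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def cross_validate(features, labels, k=5, seed=0):
    """Return (mean accuracy, standard deviation) over k held-out folds."""
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit_centroids([features[i] for i in train],
                              [labels[i] for i in train])
        hits = sum(predict_centroid(model, features[i]) == labels[i]
                   for i in held_out)
        scores.append(hits / len(held_out))
    return statistics.mean(scores), statistics.stdev(scores)

# Synthetic stand-in data: 192 samples, 4 hypothetical grades of 48 each.
rng = random.Random(42)
labels = [grade for grade in range(4) for _ in range(48)]
features = [grade * 10 + rng.gauss(0, 1) for grade in labels]
mean_acc, std_acc = cross_validate(features, labels, k=5)
```

Reporting `mean_acc` together with `std_acc`, plus the value of `k` and the seed, gives readers enough to judge whether a headline accuracy reflects generalization or a favorable split.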

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate the requested information where it is currently absent.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results section: the central claim of 98% accuracy (and superiority over deep learning) is presented without any description of the train-test split, cross-validation procedure, number of folds, or error bars. On a dataset of only 192 images this information is load-bearing for interpreting whether the result reflects generalization or overfitting.

    Authors: We agree that the validation protocol must be described explicitly. The revised manuscript will add a clear account of the train-test split used, whether cross-validation was performed and with how many folds, and any error bars or confidence intervals accompanying the 98% accuracy figure. This will allow readers to assess generalization versus potential overfitting on the small dataset.
    revision: yes

  2. Referee: [Dataset / Methods] Dataset creation paragraph: although the text acknowledges that emerald grading is subjective and that specialists may disagree on reference-stone matches, no information is supplied on how the ground-truth labels for the 192 images were produced (single grader, consensus of several, or reference to an external standard) and no inter-rater agreement statistic is reported. High label noise would render both the 98% figure and the deep-learning comparison difficult to interpret.

    Authors: We will revise the Dataset section to specify exactly how the ground-truth labels were assigned (including the number of graders and the procedure followed) and will report any available inter-rater agreement statistics or explicitly discuss the limitations of the labeling process used. This addresses the concern about potential label noise.
    revision: yes

  3. Referee: [Results] Comparison to deep learning: the manuscript states that the proposed framework outperforms a deep-learning approach, yet supplies no details on the architecture, training regime, data augmentation, or hyper-parameter search used for the baseline. Without these specifics the performance comparison cannot be evaluated.

    Authors: The revised Results section will include complete specifications of the deep-learning baseline: the network architecture, training regime, data augmentation strategies, and hyper-parameter search procedure. These details will make the performance comparison reproducible and evaluable.
    revision: yes
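One common inter-rater agreement statistic of the kind discussed in point 2 is Cohen's kappa, which corrects raw agreement between two graders for agreement expected by chance. The sketch below is a minimal implementation under that assumption; the paper does not say which statistic, if any, will be reported, and the two graders and their labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labeled independently at their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented labels from two hypothetical graders for eight stones.
grader_1 = ["A", "A", "B", "B", "C", "C", "A", "B"]
grader_2 = ["A", "A", "B", "C", "C", "C", "A", "A"]
kappa = cohens_kappa(grader_1, grader_2)
```

In this toy example the graders agree on 6 of 8 stones (75%), yet kappa is noticeably lower because part of that agreement is expected by chance. With more than two graders, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.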

Circularity Check

0 steps flagged

No significant circularity; standard empirical ML evaluation on created dataset.

Full rationale

The paper describes an image acquisition chamber, feature extraction, ML model training on a new 192-image emerald dataset, and reports classification accuracy against ground-truth labels. No derivation chain, equations, or self-citations are presented that reduce the accuracy claim to a fitted input by construction. The 98% figure is the direct output of supervised training and evaluation on the authors' own data, which is the expected reporting format for such work rather than a tautological redefinition. No load-bearing self-citation, ansatz smuggling, or uniqueness theorem is invoked. The central claim remains an empirical result on the provided dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that image features can substitute for expert visual comparison and on standard supervised-learning assumptions that the training data distribution matches future stones.

axioms (1)
  • domain assumption: Visual features extracted from controlled images are sufficient to represent the grading criteria used by gemologists.
    The framework depends on this mapping between pixel data and subjective grade labels.



