Product Quantization for Surface Soil Similarity

Althea Henslee; Andrew Strelzoff; Ashley Abraham; Haley Dozier; Mark Chappell

arxiv: 2506.03374 · v2 · submitted 2025-06-03 · 💻 cs.LG


Pith reviewed 2026-05-19 10:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords product quantization · soil taxonomy · machine learning · surface soil · data-driven classification · feature vector similarity · application-specific taxonomy

The pith

A machine learning pipeline using product quantization generates accurate, flexible soil taxonomies tailored to specific applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing human-derived soil classifications, which rely on historical categories, with a data-driven approach that groups surface soils according to observable similarities in high-dimensional feature vectors. It describes a pipeline that applies product quantization to compress and cluster these vectors while systematically tuning parameters to optimize the output classes. This method aims to produce taxonomies with higher specificity than manual ones and allows classes to be shaped around a chosen use case rather than fixed historical divisions. A sympathetic reader would care because it opens the possibility of more precise, application-relevant soil groupings that go beyond what human visualization can achieve in complex datasets.

Core claim

The machine learning pipeline that combines product quantization with the systematic evaluation of parameters and output produces both highly accurate and flexible soil taxonomies with classes built to fit a specific application.

What carries the argument

Product quantization applied to soil feature vectors, which compresses high-dimensional data to reveal statistically observable similarities for creating application-specific clusters.
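
A minimal sketch of the mechanism named above may help: each vector is split into sub-vectors, a small k-means codebook is learned per subspace, and a vector is stored as one centroid index per subspace. Everything here is an assumption for illustration (the function names, the synthetic Gaussian data, and the choices `m=4`, `k=16`); the paper's actual features, parameters, and implementation are not specified in this review.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its previous centroid).
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vec, m):
    """Split vec into m contiguous sub-vectors of equal length."""
    if len(vec) % m:
        raise ValueError("vector length must be divisible by m")
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def train_pq(data, m, k, seed=0):
    """Learn one k-centroid codebook per subspace."""
    parts = [split(v, m) for v in data]
    return [kmeans([p[s] for p in parts], k, seed=seed + s) for s in range(m)]

def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def decode(code, codebooks):
    """Approximate the original vector by concatenating the chosen centroids."""
    return [x for idx, cb in zip(code, codebooks) for x in cb[idx]]

def distortion(data, codebooks):
    """Mean squared reconstruction error over the dataset."""
    return sum(sq_dist(v, decode(encode(v, codebooks), codebooks))
               for v in data) / len(data)

# Synthetic stand-in for soil feature vectors (NOT real soil data).
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_pq(data, m=4, k=16)
code = encode(data[0], codebooks)
print(code, distortion(data, codebooks))
```

Vectors that share a code, or whose codes decode to nearby points, are similar under this approximation. How the paper turns such codes into taxonomy classes is not described in the material reviewed here.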

If this is right

Soil taxonomies gain higher specificity than those divided by historical understandings.
Classes can be produced to match the needs of a particular application.
The pipeline moves beyond limitations of human visualization for high-dimension datasets.
Systematic parameter evaluation avoids sub-optimal results from default or guessed settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quantization-plus-tuning approach might extend to other high-dimensional environmental or geospatial classification tasks.
Adding an independent validation step against ground-truth data could make the resulting taxonomies more readily adoptable.
The method opens a route to testing whether data-driven clusters align with or improve upon existing soil management practices.

Load-bearing premise

Applying product quantization to soil feature vectors combined with parameter tuning will automatically yield classifications superior in specificity and accuracy to existing human-derived taxonomies without requiring explicit validation against independent ground-truth labels.

What would settle it

A side-by-side accuracy test of the product-quantization-derived classes against an independent set of field-measured soil samples or multi-expert labels to determine whether specificity and correctness exceed those of traditional taxonomies.

Figures

Figures reproduced from arXiv: 2506.03374 by Althea Henslee, Andrew Strelzoff, Ashley Abraham, Haley Dozier, Mark Chappell.

**Figure 2.** Global map colored by soil classes found using product quantization.

Abstract

The use of machine learning (ML) techniques has allowed rapid advancements in many scientific and engineering fields. One of these fields is surface soil taxonomy, a research area previously hindered by the reliance on human-derived classifications, which are mostly dependent on dividing a dataset based on historical understandings of that data rather than data-driven, statistically observable similarities. Using an ML-based taxonomy allows soil researchers to move beyond the limitations of human visualization and create classifications of high-dimension datasets with a much higher level of specificity than possible with hand-drawn taxonomies. Furthermore, this pipeline allows for the possibility of producing both highly accurate and flexible soil taxonomies with classes built to fit a specific application. The machine learning pipeline outlined in this work combines product quantization with the systematic evaluation of parameters and output to get the best available results, rather than accepting sub-optimal results by using either default settings or best-guess settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper applies product quantization to soil taxonomy with parameter tuning but skips validation against existing classifications.

The main point to take away is that the authors apply product quantization to soil data to generate taxonomies that are meant to be more specific and better tailored to an application than traditional human-derived ones, but they provide no comparisons or external checks to back that up. Product quantization is a standard method in machine learning for reducing the size of vectors while keeping enough information for similarity calculations. Here they use it on surface soil features, run a parameter search to optimize the quantization and the resulting clusters, and claim this gives flexible, accurate classes tailored to needs like agriculture.

What the paper does reasonably well is lay out a clear pipeline that avoids default settings and instead tunes for the output. This shows some care in adapting the technique to the domain. The idea of data-driven taxonomies that can handle high-dimensional soil properties without relying on historical categories is a sensible direction.

The main weakness is the lack of validation. The claims rest on the assumption that optimizing internal PQ parameters will automatically produce better taxonomies, but no results are shown against baselines such as existing soil classification systems, and there are no tests of how well the classes perform in a real task with independent labels. This leaves the practical advantage unproven.

Readers who might find value are soil scientists curious about incorporating ML into taxonomy work or computer scientists interested in domain-specific uses of vector compression. It could spark discussion in a reading group focused on applied ML in earth sciences. I think this deserves a serious referee because the core method is grounded in established techniques and the application is relevant, even if revisions would be needed to address the validation gap. A review could help clarify the contribution.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a machine learning pipeline that applies product quantization to surface soil feature vectors, combined with systematic evaluation of parameters, to generate data-driven soil taxonomies. It claims this approach overcomes limitations of human-derived classifications by enabling higher specificity in high-dimensional datasets and producing accurate, flexible classes tailored to specific applications.

Significance. If the central claims hold with proper validation, the work could provide a scalable, data-driven alternative to traditional soil taxonomies such as USDA or WRB, with potential utility in applications requiring fine-grained similarity measures. The application of product quantization to this domain is a reasonable technical choice for handling high-dimensional vectors efficiently, but the absence of any reported empirical results, baselines, or ground-truth comparisons substantially weakens the assessed significance.

major comments (2)

[Abstract] The claim that the pipeline 'produces both highly accurate and flexible soil taxonomies with classes built to fit a specific application' is unsupported because the manuscript supplies no datasets, feature-vector construction details, quantitative error metrics, held-out validation, or direct comparisons against reference taxonomies (e.g., agreement scores or downstream-task utility).
[Methods / Pipeline Description] The central methodological claim rests on the assertion that systematic parameter search in product quantization automatically yields superior partitions; however, without an objective, application-linked performance metric evaluated on independent labels or baselines, the reported partitions remain an unanchored clustering whose practical advantage is unmeasured.
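
One concrete form the "agreement scores" in the first comment could take is the adjusted Rand index (ARI) between the data-driven classes and a reference taxonomy. The sketch below is an illustration only: ARI is a choice made here, not a metric the manuscript or the referee names, and the label lists are made-up toy values, not soil data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same samples.

    1.0 means identical partitions (up to renaming of classes); values near 0
    mean agreement no better than chance; negative values are possible.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs of samples grouped together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one class.
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Toy example: hypothetical data-driven classes vs. hypothetical reference labels.
pq_classes = [0, 0, 1, 1, 2, 2]
reference = ["A", "A", "B", "B", "B", "C"]
print(adjusted_rand_index(pq_classes, reference))
```

A score like this measures agreement with an existing taxonomy, which is not the same as usefulness: a data-driven taxonomy could legitimately disagree with the reference, so a downstream-task metric would still be needed to address the second comment.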

minor comments (2)

[Methods] Notation for the product-quantization parameters and the soil feature space should be defined explicitly with equations rather than prose descriptions.
[Data / Experiments] The manuscript would benefit from a clear statement of the input data sources and dimensionality of the soil feature vectors.
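
For the first minor comment, the explicit notation might follow the standard product-quantization formulation of Jégou et al. The symbols below ($D$, $M$, $K$, $N$, $\mathcal{C}_m$) are assumptions chosen for illustration and are not taken from the manuscript.

```latex
% A D-dimensional soil feature vector split into M sub-vectors
x = \bigl(x^{1}, \dots, x^{M}\bigr), \qquad x^{m} \in \mathbb{R}^{D/M}

% One sub-quantizer per subspace, each with a codebook of K centroids
q_{m}(x^{m}) = \arg\min_{c \in \mathcal{C}_{m}} \lVert x^{m} - c \rVert_{2}^{2},
\qquad \lvert \mathcal{C}_{m} \rvert = K

% The product quantizer concatenates the sub-quantizer outputs
q(x) = \bigl(q_{1}(x^{1}), \dots, q_{M}(x^{M})\bigr)

% Distortion (mean squared reconstruction error) over N samples
\mathrm{MSE}(q) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_{i} - q(x_{i}) \rVert_{2}^{2}
```

Under this notation the tunable quantities are at least $M$ and $K$, and the distortion is the objective that the simulated rebuttal says the parameter search minimizes.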

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for strengthening the empirical grounding of the work. We have revised the manuscript accordingly to provide greater transparency on datasets, feature construction, and evaluation metrics while preserving the core methodological contribution.

Referee: [Abstract] The claim that the pipeline 'produces both highly accurate and flexible soil taxonomies with classes built to fit a specific application' is unsupported because the manuscript supplies no datasets, feature-vector construction details, quantitative error metrics, held-out validation, or direct comparisons against reference taxonomies (e.g., agreement scores or downstream-task utility).

Authors: We agree that the original abstract language was too strong given the presented evidence. In the revised version we have updated the abstract to describe the pipeline as enabling the potential for accurate and application-tailored taxonomies rather than claiming they are already produced. We have also added a dedicated subsection in Methods that specifies the soil datasets (publicly available surface soil samples with properties such as texture, pH, and nutrient levels), the exact procedure for constructing high-dimensional feature vectors, and quantitative results including product-quantization distortion error and internal validity scores on a held-out subset. Direct agreement metrics against USDA or WRB taxonomies and downstream-task utility evaluations are acknowledged as valuable extensions that lie outside the scope of the current methodological paper.

revision: yes

Referee: [Methods / Pipeline Description] The central methodological claim rests on the assertion that systematic parameter search in product quantization automatically yields superior partitions; however, without an objective, application-linked performance metric evaluated on independent labels or baselines, the reported partitions remain an unanchored clustering whose practical advantage is unmeasured.

Authors: The referee correctly identifies that an application-specific external metric would provide stronger evidence of practical advantage. We have revised the Methods and Results sections to explicitly state that the parameter search minimizes the standard product-quantization reconstruction error (a well-defined, objective distortion measure) and to report this error alongside comparisons against k-means clustering and random partitioning baselines on the same feature vectors. These baselines demonstrate measurable gains in both quantization fidelity and computational efficiency. We have further added a discussion paragraph explaining how the resulting partitions can be linked to downstream applications (e.g., similarity-based soil mapping) and how one would compute application-linked metrics when suitable labels become available. While we maintain that the data-driven, systematic approach already offers a principled alternative to purely human-derived taxonomies, we accept that fuller validation on independent application labels would strengthen the claims and note this as future work.

revision: partial
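
The "systematic parameter search" discussed in this exchange can be sketched as an exhaustive grid search over candidate settings. This is a generic illustration, not the authors' procedure: the parameter names `m` and `k`, the grid values, and `placeholder_score` are all hypothetical. In a real pipeline the score function would train product quantization with the given settings and return a distortion or an application-linked metric.

```python
from itertools import product

def grid_search(score, grid):
    """Evaluate score(**params) on every combination in grid; lower is better.

    Returns the best parameters, the best score, and every (score, params)
    pair so that defaults and runner-up settings can be compared directly.
    """
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score(**params), params))
    best_score, best_params = min(results, key=lambda r: r[0])
    return best_params, best_score, results

def placeholder_score(m, k):
    # Stand-in only, with no physical meaning: replace with a function that
    # trains PQ using m subspaces and k centroids and returns its error.
    return abs(m - 4) + abs(k - 16)

best_params, best_score, results = grid_search(
    placeholder_score, {"m": [2, 4, 8], "k": [4, 16, 64]}
)
print(best_params, best_score)
```

One design caution: reconstruction error measured on the training data tends to keep falling as codebooks grow, so minimizing it alone would tend to select the largest setting in the grid. Scoring on held-out data or on an application-linked metric, as the referee requests, avoids that bias.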

Circularity Check

0 steps flagged

No circularity: standard PQ technique applied to new domain

The paper applies the established product quantization algorithm (an external ML method for vector compression and similarity search) to soil feature vectors, followed by systematic parameter tuning. The central claim—that this produces application-specific taxonomies more specific than human-derived ones—rests on the data-driven output of the pipeline rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps reduce the result to the authors' own inputs by construction; the derivation chain is independent and self-contained against external benchmarks for the technique.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The abstract relies on the domain assumption that product quantization preserves meaningful similarity structure in soil feature data and that systematic parameter search will produce superior taxonomies; no free parameters or invented entities are explicitly introduced in the provided text.

free parameters (1)

product quantization parameters
The abstract refers to systematic evaluation of parameters without specifying which ones or how they are chosen, implying tuning that may be fitted to the soil dataset.
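
In the standard formulation of product quantization, the main tunable quantities are the number of subspaces and the number of centroids per subspace. Whether these are the parameters the paper tunes is an assumption; the abstract does not say. The short calculation below shows how the two settings determine the stored code size, using illustrative values (`dim=128`, `m=8`, `k=256`, 32-bit floats) that do not come from the paper.

```python
from math import ceil, log2

def pq_code_bits(m, k):
    """Bits needed to store one PQ code: m indices, each into k centroids."""
    if m < 1 or k < 2:
        raise ValueError("need m >= 1 subspaces and k >= 2 centroids")
    return m * ceil(log2(k))

def compression_ratio(dim, m, k, bits_per_value=32):
    """Size of the raw vector divided by the size of its PQ code."""
    return dim * bits_per_value / pq_code_bits(m, k)

bits = pq_code_bits(m=8, k=256)                 # 8 indices of 8 bits each: 64 bits
ratio = compression_ratio(dim=128, m=8, k=256)  # 4096 raw bits / 64 code bits
print(bits, ratio)
```

The same two settings also bound the number of distinct codes at `k ** m`, which is one reason their choice shapes how fine-grained any resulting grouping can be.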

axioms (1)

domain assumption: Product quantization applied to high-dimensional soil data yields classifications that are more specific and application-flexible than human-derived taxonomies
This premise underpins the entire pipeline described in the abstract.


[1] [1]

A comparative study of efficient initialization methods for the k-means clustering algorithm,

M. Emre Celebi, H. A. Kingravi, and P. A. Vela, “A comparative study of efficient initialization methods for the k-means clustering algorithm,” Expert systems with applications, vol. 40, no. 1, pp. 200–210, 2013

work page 2013

[2] [2]

Product Quantization for Nearest Neighbor Search,

H. J ´egou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011. DOI: 10.1109/TPAMI.2010.57

work page doi:10.1109/tpami.2010.57 2011

[3] [3]

Reconfigurable Inverted Index,

Y . Matsui, R. Hinami, and S. Satoh, “Reconfigurable Inverted Index,” ACM Multimedia, 2018

work page 2018

[4] [4]

Optimized Product Quantization,

T. Ge, K. He, Q. Ke, and J. Sun, “Optimized Product Quantization,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 744–755, 2014. DOI: 10.1109/TPAMI.2013.240

work page doi:10.1109/tpami.2013.240 2014

[5] [5]

Quantization.IEEE Transactions on Information Theory, 44(6):2325–2383, 1998

R. M. Gray and D. L. Neuhoff, “Quantization,”IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, 1998. DOI: 10.1109/18.720541

work page doi:10.1109/18.720541 1998

[6] [6]

com/matsui528/nanopq

Yusuke Matsui, “NanoPQ,”GitHub repository, Available: https://github. com/matsui528/nanopq

work page

[7] [7]

Billion-scale similarity search with GPUs

J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with GPUs,”arXiv preprint arXiv:1702.08734, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Locality-Sensitive Hashing Scheme Based on p-Stable Distributions,

M. Datar, N. Immorlica, P. Indyk, and V . S. Mirrokni, “Locality-Sensitive Hashing Scheme Based on p-Stable Distributions,” inProceedings of the Twentieth Annual Symposium on Computational Geometry (SCG ’04), pp. 253–262, 2004. DOI: 10.1145/997817.997857

work page doi:10.1145/997817.997857 2004

[9] [9]

Similarity Search in High Dimen- sions via Hashing,

A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimen- sions via Hashing,” inProceedings of the 25th International Conference on Very Large Data Bases (VLDB ’99), pp. 518–529, 1999

work page 1999

[10] [10]

Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality,

P. Indyk and R. Motwani, “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality,” inProceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC ’98), pp. 604–613, 1998. DOI: 10.1145/276698.276876

work page doi:10.1145/276698.276876 1998

[11] [11]

SoilGrids250m: Global gridded soil information based on machine learning,

T. Henglet al., “SoilGrids250m: Global gridded soil information based on machine learning,”PLoS One, vol. 12, no. 2, 2017

work page 2017

[12] [12]

raster: Geographic analysis and modeling with raster data,

R. J. Hijmans and J. van Etten, “raster: Geographic analysis and modeling with raster data,” R package version 2.0-12, 2012. Available: http://CRAN.R-project.org/package=raster

work page 2012

[13] [13]

R: A Language and Environment for Statistical Com- puting,

R Core Team, “R: A Language and Environment for Statistical Com- puting,” R Foundation for Statistical Computing, Vienna, Austria, 2017. Available: https://www.R-project.org/

work page 2017

[14] [14]

Python reference manual,

G. Van Rossum and F. L. Drake Jr., “Python reference manual,” Centrum voor Wiskunde en Informatica, Amsterdam, 1995

work page 1995

[15] [15]

Importance of geomorphology and sedimentation processes for metal dispersion in sediments and soils of the Ganga Plain: identification of geochemical domains,

A. A. Ansari, I. B. Singh, and H. J. Tobschall, “Importance of geomorphology and sedimentation processes for metal dispersion in sediments and soils of the Ganga Plain: identification of geochemical domains,”Chemical Geology, vol. 162, no. 3–4, pp. 245–266, 2000. DOI: http://dx.doi.org/10.1016/S0009-2541(99)00073-X

work page doi:10.1016/s0009-2541(99)00073-x 2000

[16] [16]

Pedo-geochemical baseline content levels and soil quality reference values of trace elements in soils from the Mediterranean (Castilla La Mancha, Spain),

R. J. Ballesta, P. C. Bueno, J. A. Martin Rubi, and R. G. Gim ´enez, “Pedo-geochemical baseline content levels and soil quality reference values of trace elements in soils from the Mediterranean (Castilla La Mancha, Spain),”Cent. Eur. J. Geosci., vol. 2, no. 4, pp. 441–454, 2010. DOI: 10.2478/v10085-010-0028-1

work page doi:10.2478/v10085-010-0028-1 2010

[17] [17]

Differential Kinetics and Temperature Depen- dence of Abiotic and Biotic Processes Controlling the Environmental Fate of TNT in Simulated Marine Systems,

M. A. Chappellet al., “Differential Kinetics and Temperature Depen- dence of Abiotic and Biotic Processes Controlling the Environmental Fate of TNT in Simulated Marine Systems,”Marine Pollut. Bull., vol. 62, pp. 1736–1743, 2011

work page 2011

[18] [18]

Building a quantitative analogy from soil classification systems using different compositional datasets,

M. A. Chappellet al., “Building a quantitative analogy from soil classification systems using different compositional datasets,”PLOS One, vol. 14, no. 2, p. e0212214, 2019

work page 2019

[19] [19]

Stability of solid-phase selenium species in dredged fly ash after prolonged submersion in a natural river system,

M. A. Chappellet al., “Stability of solid-phase selenium species in dredged fly ash after prolonged submersion in a natural river system,” Chemosphere, vol. 95, pp. 174–181, 2013

work page 2013

[20] [20]

Predicting Langmuir Model Parameters for Tungsten Adsorption in Heterogeneous Soil “Types” Using Compositional Signatures,

M. A. Chappell et al., “Predicting Langmuir Model Parameters for Tungsten Adsorption in Heterogeneous Soil “Types” Using Compositional Signatures,” Geoderma, In Press, 2022

work page 2022

[21] [21]

Predicting 2,4-dinitroanisole (DNAN) sorption on various soil “types” using different compositional datasets,

M. A. Chappell et al., “Predicting 2,4-dinitroanisole (DNAN) sorption on various soil “types” using different compositional datasets,” Geoderma, vol. 356, p. 113916, 2019. DOI: https://doi.org/10.1016/j.geoderma.2019.113916

work page doi:10.1016/j.geoderma.2019.113916 2019

[22] [22]

Analyses of soil microbial community compositions and functional genes reveal potential consequences of natural forest succession,

J. Cong et al., “Analyses of soil microbial community compositions and functional genes reveal potential consequences of natural forest succession,” Scientific Reports, vol. 5, p. 10007, 2015. DOI: 10.1038/srep10007

work page doi:10.1038/srep10007 2015

[23] [23]

The Multi-Media Fate Model: A Vital Tool for Predicting the Fate of Chemicals,

C. E. Cowan et al., “The Multi-Media Fate Model: A Vital Tool for Predicting the Fate of Chemicals,” in Society of Environmental Toxicology and Chemistry (SETAC), 1995

work page 1995

[24] [24]

Variations in microbial community composition through two soil depth profiles,

N. Fierer, J. P. Schimel, and P. A. Holden, “Variations in microbial community composition through two soil depth profiles,” Soil Biology and Biochemistry, vol. 35, no. 1, pp. 167–176, 2003. DOI: http://dx.doi.org/10.1016/S0038-0717(02)00251-1

work page doi:10.1016/s0038-0717(02)00251-1 2003

[25] [25]

Inference of nitrogen cycling in three watersheds of northern Florida, USA, by multivariate statistical analysis,

J.-M. Fu and J. W. Winchester, “Inference of nitrogen cycling in three watersheds of northern Florida, USA, by multivariate statistical analysis,” Geochimica et Cosmochimica Acta, vol. 58, no. 6, pp. 1591–1600, 1994

work page 1994

[26] [26]

World Reference Base for Soil Resources 2014, update 2015: International soil classification system for naming soils and creating legends for soil maps,

IUSS Working Group, “World Reference Base for Soil Resources 2014, update 2015: International soil classification system for naming soils and creating legends for soil maps,” FAO, 2015

work page 2014

[27] [27]

Multivariate functions for predicting the sorption of 2,4,6-trinitrotoluene (TNT) and 1,3,5-trinitro-1,3,5-tricyclohexane (RDX) among taxonomically distinct soils,

C. K. Katseanes, M. A. Chappell, and B. G. Hopkins, “Multivariate functions for predicting the sorption of 2,4,6-trinitrotoluene (TNT) and 1,3,5-trinitro-1,3,5-tricyclohexane (RDX) among taxonomically distinct soils,” J. Environ. Manag., vol. 182, pp. 101–110, 2016

work page 2016

[28] [28]

Multivariate functions for predicting the degradation kinetics of 2,4,6-trinitrotoluene (TNT) and 1,3,5-trinitro-1,3,5-tricyclohexane (RDX) among taxonomically distinct soils,

C. K. Katseanes et al., “Multivariate functions for predicting the degradation kinetics of 2,4,6-trinitrotoluene (TNT) and 1,3,5-trinitro-1,3,5-tricyclohexane (RDX) among taxonomically distinct soils,” J. Environ. Manag., vol. 203, pp. 383–390, 2017

work page 2017

[29] [29]

Uncertainty in Multi-Media Fate and Transport Models: A Case Study for TNT Life Cycle Assessment,

M. Mayo, Z. A. Collier, V. Hoang, and M. A. Chappell, “Uncertainty in Multi-Media Fate and Transport Models: A Case Study for TNT Life Cycle Assessment,” Sci. Tot. Environ., vol. 494–495, pp. 104–112, 2014

work page 2014

[30] [30]

Identification of Sediment Sources to Calumet Harbor and River through Geochemical Techniques,

D. W. Perkey, M. A. Chappell, J. M. Seiter, and H. M. Wadman, “Identification of Sediment Sources to Calumet Harbor and River through Geochemical Techniques,” USACE-ERDC-CHL, 2015

work page 2015

[31] [31]

Influence of grass species and soil type on rhizosphere microbial community structure in grassland soils,

B. K. Singh, S. Munro, J. M. Potts, and P. Millard, “Influence of grass species and soil type on rhizosphere microbial community structure in grassland soils,” Applied Soil Ecology, vol. 36, no. 2–3, pp. 147–155, 2007. DOI: http://dx.doi.org/10.1016/j.apsoil.2007.01.004

work page doi:10.1016/j.apsoil.2007.01.004 2007

[33] [33]

Feasibility study of rock identification at the surface of Mars by remote laser-induced breakdown spectroscopy and three chemometric methods,

J.-B. Sirven et al., “Feasibility study of rock identification at the surface of Mars by remote laser-induced breakdown spectroscopy and three chemometric methods,” Journal of Analytical Atomic Spectrometry, vol. 22, no. 12, pp. 1471–1480, 2007. DOI: 10.1039/B704868H

work page doi:10.1039/b704868h 2007

[34] [34]

Soil taxonomy: A basic system of soil classification for making and interpreting soil surveys,

Soil Survey Staff, “Soil taxonomy: A basic system of soil classification for making and interpreting soil surveys,” Natural Resources Conservation Service, 1999

work page 1999

[35] [35]

Keys to Soil Taxonomy,

Soil Survey Staff, “Keys to Soil Taxonomy,” NRCS, 2014. Available: https://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/class/?cid=nrcs142p2_053580

work page 2014

[36] [36]

Machine learning-based global maps of ecological variables and the challenge of assessing them,

H. Meyer and E. Pebesma, “Machine learning-based global maps of ecological variables and the challenge of assessing them,” Nat Communications, vol. 13, 2022. Available: https://doi.org/10.1038/s41467-022-29838-9

work page 2022

[37] [37]

Prediction of source rock characteristics based on terpane biomarkers in crude oils: A multivariate statistical approach,

J. E. Zumberge, “Prediction of source rock characteristics based on terpane biomarkers in crude oils: A multivariate statistical approach,” Geochimica et Cosmochimica Acta, vol. 51, no. 6, pp. 1625–1637, 1987. DOI: http://dx.doi.org/10.1016/0016-7037(87)90343-7

work page doi:10.1016/0016-7037(87)90343-7 1987