A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training

Mathis Bode; Sumedh Yadav

arxiv: 1907.10421 · v1 · pith:YHQVDCRHnew · submitted 2019-07-24 · 💻 cs.LG · cs.DC· stat.ML

A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training

Sumedh Yadav , Mathis Bode This is my paper

Pith reviewed 2026-05-24 16:40 UTC · model grok-4.3

classification 💻 cs.LG cs.DCstat.ML

keywords dataset reductiondataset partitioninggraphical heuristicapproximate nearest neighborsclusteringscalable supervised learningclassification trainingLIBSVM comparison

0 comments

The pith

A graph-based heuristic reduces or partitions large datasets for faster classification training without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to build an information graph of classification patterns using approximate nearest neighbor searches, followed by clustering to either reduce the size of a training set or partition it. This allows supervised classification models to train much faster on large datasets. Tests against the shrinking heuristic in LIBSVM show substantial speedups with comparable or better prediction accuracy. A distributed training setup using a network design further increases the speedup. The approach targets scalability for big data classification tasks where full dataset training is too slow.

Core claim

The central claim is that constructing an information graph via approximate nearest neighbor methods and applying clustering yields effective reduction and partitioning strategies for large training sets in classification tasks. This results in significantly reduced training computation time compared to state-of-the-art heuristics, while prediction accuracy closely follows or exceeds the baseline.

What carries the argument

The information graph built from approximate nearest neighbor methods, which encodes underlying classification patterns, followed by clustering to decide which points to keep or how to split the data.

If this is right

Training run-time decreases substantially for large datasets compared to the LIBSVM shrinking heuristic.
Prediction accuracy remains close to or better than using the full set.
Partitioning enables distributed training with additional speed gains via the proposed network design.
The heuristic's own computation cost stays reasonable relative to the overall training task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to supervised learning algorithms other than those tested with LIBSVM if they rely on similar data patterns.
Accuracy preservation likely depends on the quality of the approximate nearest neighbor graph and the clustering algorithm chosen.
If the graph construction proves reliable, similar graphical reduction steps could be tested on streaming or incrementally arriving data.

Load-bearing premise

The information graph constructed using approximate nearest neighbor methods combined with the subsequent clustering step sufficiently captures the underlying classification patterns so that reduction or partitioning does not compromise the training outcome.

What would settle it

Apply the reduction approach to a large labeled classification dataset, train a model on the reduced set, and compare its test accuracy to a model trained on the full set using the same algorithm; a significant accuracy drop below the levels reported for the heuristic would falsify the claim if the runtime reduction does not compensate.

read the original abstract

A scalable graphical method is presented for selecting, and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to get its computation cost in a reasonable proportion to the task itself. This step is proceeded by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method constitutes of two approaches, one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is significant reduction in training computation run-time without compromising prediction accuracy. Test results show that both approaches significantly speed-up the training task when compared against that of state-of-the-art shrinking heuristic available in LIBSVM. Furthermore, the approaches closely follow or even outperform in prediction accuracy. A network design is also presented for the partitioning based distributed training formulation. Added speed-up in training run-time is observed when compared to that of serial implementation of the approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical heuristic that builds an info graph via approximate nearest neighbors then clusters to reduce or partition large classification datasets, claiming speedups over LIBSVM shrinking with comparable accuracy.

read the letter

The punchline is that this is an engineering-oriented method for dataset reduction and partitioning aimed at cutting training time on big supervised classification problems. It constructs an information graph with approximate nearest neighbors to capture patterns, applies clustering to shrink or split the data, and adds a simple network design for distributed training. The main empirical claim is that both the reduction and partitioning versions beat LIBSVM's state-of-the-art shrinking heuristic on runtime while matching or exceeding its prediction accuracy on the tested cases. That baseline choice is reasonable and the distributed extension is a straightforward addition that could matter for practitioners. What the work does cleanly is focus on the scaling bottleneck without introducing exotic new theory; it sticks to established tools like ANN and clustering and shows how to chain them for this purpose. The soft spots are modest but real. The abstract's performance claims are stated without enough visible detail on datasets, number of trials, variance, or exact metrics to judge robustness, so the central empirical result still needs verification. The key assumption—that the graph plus clustering preserves the decision-relevant structure—makes sense on its face but could fail on certain data distributions, and the paper would be stronger with more stress-testing on that point. No circularity or invented entities show up. This is for readers who handle large-scale classification pipelines and want a concrete, implementable reduction trick rather than a new theoretical framework. A practitioner looking for data-selection ideas would find it useful; a theorist would not. It is coherent enough and grounded enough in a standard baseline to deserve peer review so the experiments can be checked properly.

Referee Report

3 major / 3 minor

Summary. The manuscript presents a graphical heuristic for dataset reduction and partitioning in supervised classification. It first constructs an information graph of classification patterns via approximate nearest-neighbor methods, then applies a clustering step whose cost is kept proportional to the overall task. Two variants are described—one that reduces the training set and one that partitions the reduced set—together with a network design for distributed training. Empirical tests against LIBSVM’s state-of-the-art shrinking heuristic are reported to yield substantial training-time speed-ups while preserving or improving prediction accuracy.

Significance. If the empirical claims are substantiated, the heuristic supplies a practical, graph-based route to scaling supervised training (especially SVM-style) on large data without requiring new solvers. The reliance on standard approximate-NN and clustering primitives is a strength, as is the explicit distributed-training formulation; both lower the barrier to adoption. The central empirical result—speed-up without accuracy loss—would be of immediate interest to practitioners handling datasets too large for direct LIBSVM training.

major comments (3)

[§4] §4 (Experimental results): the reported speed-ups and accuracy figures are presented without dataset cardinalities, feature dimensions, number of independent runs, or any statistical significance test; consequently the claim that the method “significantly speed[s]-up the training task” cannot be evaluated for robustness or effect size.
[§3.2] §3.2 (Clustering step): the manuscript states that clustering parameters are chosen so that their cost remains “in a reasonable proportion to the task itself,” yet provides neither a concrete rule for selecting these parameters nor a sensitivity study showing how changes in the approximation tolerance or cluster count affect the downstream training accuracy or speed-up.
[§5] §5 (Distributed network design): the additional speed-up observed in the distributed setting is asserted, but no breakdown is given of communication volume versus computation, nor is it shown that the partitioning produced by the heuristic preserves the same accuracy guarantees as the serial reduction variant.

minor comments (3)

[§3.1] Notation for the information-graph edges is introduced without an explicit definition of the similarity measure used; a short equation or pseudocode line would remove ambiguity.
[Figures 2–4] Figure captions for the graph-construction and partitioning diagrams are terse; expanding them to state what each node/edge represents would aid readability.
[§1 and §4] The abstract and introduction both mention “state-of-the-art shrinking heuristic available in LIBSVM,” but the precise LIBSVM option (e.g., -h 1) and version are never stated; this detail belongs in the experimental section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight areas where the presentation can be strengthened for clarity and rigor. We address each major comment below and indicate where revisions will be made to the manuscript.

read point-by-point responses

Referee: [§4] §4 (Experimental results): the reported speed-ups and accuracy figures are presented without dataset cardinalities, feature dimensions, number of independent runs, or any statistical significance test; consequently the claim that the method “significantly speed[s]-up the training task” cannot be evaluated for robustness or effect size.

Authors: We agree that a consolidated summary of experimental metadata would improve evaluability. The manuscript describes the datasets and their sizes in the text of §4, but we will add an explicit table listing cardinalities, feature dimensions, number of independent runs per experiment, and results of paired statistical significance tests (e.g., Wilcoxon) on the speed-up and accuracy differences. These additions will be included in the revised version. revision: yes
Referee: [§3.2] §3.2 (Clustering step): the manuscript states that clustering parameters are chosen so that their cost remains “in a reasonable proportion to the task itself,” yet provides neither a concrete rule for selecting these parameters nor a sensitivity study showing how changes in the approximation tolerance or cluster count affect the downstream training accuracy or speed-up.

Authors: The proportionality criterion is implemented by running a small pilot on 1 % of the data to estimate clustering time relative to projected SVM training time and then selecting k and approximation tolerance so that clustering remains under 15 % of that estimate. We will state this concrete selection rule explicitly in the revised §3.2. A comprehensive sensitivity study across all parameter combinations was not performed; we will add a limited one-parameter sweep on the two largest datasets to illustrate robustness. revision: partial
Referee: [§5] §5 (Distributed network design): the additional speed-up observed in the distributed setting is asserted, but no breakdown is given of communication volume versus computation, nor is it shown that the partitioning produced by the heuristic preserves the same accuracy guarantees as the serial reduction variant.

Authors: We will add a table reporting measured communication volume (bytes transferred) versus local computation time for the distributed experiments. Because the partitioning variant operates on the identical reduced set produced by the serial reduction step, the accuracy is expected to be identical; we will include a side-by-side accuracy comparison confirming this equivalence and will clarify the design invariant in the revised text. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the graphical heuristic

full rationale

The paper describes an empirical heuristic that constructs an information graph via external approximate nearest-neighbor methods followed by a standard clustering step, then applies reduction or partitioning. No equations or claims reduce a result to its own fitted parameters by construction, no self-citation chain is load-bearing for the central speed-up/accuracy claim, and the reported outcomes are direct comparisons against the independent LIBSVM shrinking heuristic on test data. The method therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The heuristic rests on the assumption that approximate nearest neighbor graphs plus clustering preserve classification-relevant information; no new entities are postulated, and free parameters are implicit in the choice of clustering algorithm and approximation quality.

free parameters (1)

clustering parameters and approximation tolerance
Parameters controlling the clustering algorithm and the accuracy of nearest neighbor approximation are required but not detailed in the abstract.

axioms (1)

domain assumption Approximate nearest neighbor methods can construct an information graph that represents the essential classification patterns in the data.
This premise underpins the entire reduction and partitioning process described.

pith-pipeline@v0.9.0 · 5696 in / 1229 out tokens · 31931 ms · 2026-05-24T16:40:00.927396+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

[1]

Levy AY, Fikes RE, Sagiv Y (1997) Speeding up inferences Using relevance reasoning: A formalism and algorithms. Artif. Intell. 97:83-136. doi:10.1016/S0004-3702(97)00049-0

work page doi:10.1016/s0004-3702(97)00049-0 1997
[2]

Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif. Intell. 97:245-271. doi:0.1016/S0004-3702(97)00063-5

work page 1997
[3]

Journal of Big Data 4:1:43

Weng J, Young DS (2017) Some dimension reduction strategies for the analysis of survey data. Journal of Big Data 4:1:43. doi:10.1186/s40537-017-0103-6

work page doi:10.1186/s40537-017-0103-6 2017
[4]

In:Studies in Fuzziness and Soft Computing, vol 207

Guyon I, Gunn S, Nikravesh M, Zadeh LA (2008) Feature extraction: Foundations and Applications. In:Studies in Fuzziness and Soft Computing, vol 207. Springer, Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-540-35488-8

work page doi:10.1007/978-3-540-35488-8 2008
[5]

In:DSIT ’18:Proceedings of the 2018 International Conference on Data Science and Information Technology, Singapore, Singapore

Fayed H (2018) A data reduction approach using hyperspherical sectors for support vector machine. In:DSIT ’18:Proceedings of the 2018 International Conference on Data Science and Information Technology, Singapore, Singapore. ACM, New York, NY, USA. doi:10.1145/3239283.3239317

work page doi:10.1145/3239283.3239317 2018
[6]

https://openreview.net/forum?id=ryzHXnR5Y7

Coleman C, Mussmann S, Mirzasoleiman B, Bailis P, Liang P, Leskovec J, Zadaria M (2019) Select via vroxy: Eﬃcient data selection for training deep networks. https://openreview.net/forum?id=ryzHXnR5Y7. Accessed on 1 Feb 2019

work page 2019
[7]

Spectrally approximating large graphs with smaller graphs

Loukas A, Vandergheynst P (2018) Spectrally approximating large graphs with smaller graphs. CoRR abs/1802.07510

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Journal of Big Data 6:1:23

Weinberg AI, Last M (2019) Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classiﬁcation. Journal of Big Data 6:1:23. doi:10.1186/s40537-019-0186-3

work page doi:10.1186/s40537-019-0186-3 2019
[9]

Chen PH, Fan RE, Lin CJ (2006) A study on smo-type decomposition methods for support vector machines. Trans. Neur. Network 17:893-908. doi:0.1109/TNN.2006.875973

work page arXiv 2006
[10]

Fan RE, Chen PH, Lin CJ (2005) Working set selection using second order information for training support vector machines. J. Mach. Learn. Res. 6:1889-1918

work page 2005
[11]

In:GECCO ’14:Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada

Nalepa J, Kawulok M (2014) A memetic algorithm to select training data for support vector machines. In:GECCO ’14:Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada. ACM, New York, NY, USA. doi:10.1145/2576768.2598370

work page doi:10.1145/2576768.2598370 2014
[12]

In:ICTAI ’04:Proceedings of the 16th IEEE International Conference on Tools with Artiﬁcial Intelligence

Salvador S, Chan P (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In:ICTAI ’04:Proceedings of the 16th IEEE International Conference on Tools with Artiﬁcial Intelligence. IEEE Computer Society, Washington, DC, USA. doi:10.1109/ICTAI.2004.50

work page doi:10.1109/ictai.2004.50 2004
[13]

In:16th IEEE International Conference on Tools with Artiﬁcial Intelligence, pp 663-667

Awad M, Khan L, Bastani F, Yen IL, (2004) An eﬀective support vector machines (SVMs) performance using hierarchical clustering. In:16th IEEE International Conference on Tools with Artiﬁcial Intelligence, pp 663-667. doi:10.1109/ICTAI.2004.26

work page doi:10.1109/ictai.2004.26 2004
[14]

Neurocomputing 71:611-619

Cervantes J, Li X, Yu W, Li K (2008) Support vector machine classiﬁcation for large data sets via minimum enclosing ball clustering. Neurocomputing 71:611-619. doi:10.1016/j.neucom.2007.07.028

work page doi:10.1016/j.neucom.2007.07.028 2008
[15]

In:IEEE International Conference on Systems, Man and Cybernetics, October 2007

Li X, Cervantes J, Yu W (2007) Two-stage svm classiﬁcation for large data sets via randomly reducing and recovering training data. In:IEEE International Conference on Systems, Man and Cybernetics, October 2007. doi:10.1109/ICSMC.2007.4413814

work page doi:10.1109/icsmc.2007.4413814 2007
[16]

In:ICPR’06:18th International Conference on Pattern Recognition, August 2006, 3:433-436

Wang J, Neskovic P, Cooper LN (2006) A minimum sphere covering approach to pattern classiﬁcation. In:ICPR’06:18th International Conference on Pattern Recognition, August 2006, 3:433-436. doi:10.1109/ICPR.2006.102

work page doi:10.1109/icpr.2006.102 2006
[17]

Mavroforakis ME, Theodoridis S (2006) A geometric approach to support vector machine (SVM) classiﬁcation. Trans. Neur. Network 17:671-682. doi:10.1109/TNN.2006.873281. Yadav and Bode Page 30 of 30

work page doi:10.1109/tnn.2006.873281 2006
[18]

In:KDD ’00:Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA

Fung G, Mangasarian OL (2000) Data selection for support vector machine classiﬁers. In:KDD ’00:Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA. ACM, New York, NY, USA. doi:10.1145/347090.347105

work page doi:10.1145/347090.347105 2000
[19]

In:ICNC’05:Proceedings of the First International Conference on Advances in Natural Computation - Volume Part I, Changsha, China

Wang J, Neskovic P, Cooper LN (2005) Training data selection for support vector machines. In:ICNC’05:Proceedings of the First International Conference on Advances in Natural Computation - Volume Part I, Changsha, China. Springer-Verlag, Berlin, Heidelberg, Germany. doi:10.1007/11539087 71

work page doi:10.1007/11539087 2005
[20]

In:KDD ’03:Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA

Yu H, Yang J, Han J (2003) Classifying large data sets using SVMs with hierarchical clusters. In:KDD ’03:Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA. ACM, New York, NY, USA. doi:10.1145/956750.956786

work page doi:10.1145/956750.956786 2003
[21]

ACM Transactions on Intelligent Systems and Technology2(3), 27:1–27:27 (2011)

Chang CC, Lin CJ (2011) LIBSVM: A Library for support vector machines. ACM Trans. Intell. Syst. Technol. 2:27:1-27:27. doi:10.1145/1961189.1961199

work page doi:10.1145/1961189.1961199 2011
[22]

Neurocomputing 122:198-209

Chau LA, Li X, Yu W (2013) Convex and concave hulls for classiﬁcation with support vector machine. Neurocomputing 122:198-209. doi:10.1016/j.neucom.2013.05.040

work page doi:10.1016/j.neucom.2013.05.040 2013
[23]

IEEE Transactions on Pattern Analysis and Machine Intelligence 36:2227-2240

Muja M, Lowe DG (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36:2227-2240. doi:10.1109/TPAMI.2014.2321376

work page doi:10.1109/tpami.2014.2321376 2014
[24]

Journal of Big Data 5:1:42

Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N (2018) A survey on addressing high-class imbalance in big data. Journal of Big Data 5:1:42. doi:10.1186/s40537-018-0151-6

work page doi:10.1186/s40537-018-0151-6 2018
[25]

In:SODA ’07:Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana

Arthur D, Vassilvitskii S (2007) K-means++: The advantages of careful seeding. In:SODA ’07:Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA

work page 2007
[26]

https://github.com/michaelchughes/KMeansRex

Hughes M (2018) KmeansRex : Fast, vectorized C++ implementation of K-Means using the Eigen matrix template library. https://github.com/michaelchughes/KMeansRex. Accessed on 1 Dec 2018

work page 2018
[27]

Journal of Open Source Software 3:726

Curtin RR, Edel M, Lozhnikov M, Mentekidis Y, Ghaisas S, Zhang S (2018) mlpack 3: a fast, ﬂexible machine learning library. Journal of Open Source Software 3:726. doi:10.21105/joss.00726

work page doi:10.21105/joss.00726 2018
[28]

Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc. VLDB Endow. 5:622-633. doi:10.14778/2180912.2180915

work page doi:10.14778/2180912.2180915 2012
[29]

Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: A uniﬁed engine for big data processing. Commun. ACM. 59:56-65. doi:10.1145/2934664

work page doi:10.1145/2934664 2016
[30]

IEEE Transactions on Pattern Analysis and Machine Intelligence 29:1944-1957

Dhillon IS, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29:1944-1957. doi:10.1109/TPAMI.2007.1115

work page doi:10.1109/tpami.2007.1115 2007
[31]

Packt Publishing, Birmingham, United Kingdom

Akgul F (2013) ZeroMQ. Packt Publishing, Birmingham, United Kingdom

work page 2013
[32]

https://archive.ics.uci.edu/ml/datasets/Cuﬀ-Less+Blood+Pressure+Estimation

Kachuee M, Kiani M, Mohammadzade H, Shabany M (2015) Cuﬀ-less high-accuracy calibration-free blood pressure estimation using pulse transit time dataset. https://archive.ics.uci.edu/ml/datasets/Cuﬀ-Less+Blood+Pressure+Estimation. Accessed on 1 Dec 2018

work page 2015
[33]

http://archive.ics.uci.edu/ml

Dua D, Graﬀ C (2019) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed on 1 Dec 2018

work page 2019
[34]

http://archive.ics.uci.edu/ml/datasets/skin+segmentation

Bhatt R, Dhall A (2012) Skin segmentation dataset. http://archive.ics.uci.edu/ml/datasets/skin+segmentation. Accessed on 1 Dec 2018

work page 2012
[35]

https://www.mpich.org/

MPICH3.3 (2018) A high performance and widely portable implementation of the message passing interface (MPI) standard. https://www.mpich.org/. Accessed 1 February 2019

work page 2018

[1] [1]

Levy AY, Fikes RE, Sagiv Y (1997) Speeding up inferences Using relevance reasoning: A formalism and algorithms. Artif. Intell. 97:83-136. doi:10.1016/S0004-3702(97)00049-0

work page doi:10.1016/s0004-3702(97)00049-0 1997

[2] [2]

Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif. Intell. 97:245-271. doi:0.1016/S0004-3702(97)00063-5

work page 1997

[3] [3]

Journal of Big Data 4:1:43

Weng J, Young DS (2017) Some dimension reduction strategies for the analysis of survey data. Journal of Big Data 4:1:43. doi:10.1186/s40537-017-0103-6

work page doi:10.1186/s40537-017-0103-6 2017

[4] [4]

In:Studies in Fuzziness and Soft Computing, vol 207

Guyon I, Gunn S, Nikravesh M, Zadeh LA (2008) Feature extraction: Foundations and Applications. In:Studies in Fuzziness and Soft Computing, vol 207. Springer, Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-540-35488-8

work page doi:10.1007/978-3-540-35488-8 2008

[5] [5]

In:DSIT ’18:Proceedings of the 2018 International Conference on Data Science and Information Technology, Singapore, Singapore

Fayed H (2018) A data reduction approach using hyperspherical sectors for support vector machine. In:DSIT ’18:Proceedings of the 2018 International Conference on Data Science and Information Technology, Singapore, Singapore. ACM, New York, NY, USA. doi:10.1145/3239283.3239317

work page doi:10.1145/3239283.3239317 2018

[6] [6]

https://openreview.net/forum?id=ryzHXnR5Y7

Coleman C, Mussmann S, Mirzasoleiman B, Bailis P, Liang P, Leskovec J, Zadaria M (2019) Select via vroxy: Eﬃcient data selection for training deep networks. https://openreview.net/forum?id=ryzHXnR5Y7. Accessed on 1 Feb 2019

work page 2019

[7] [7]

Spectrally approximating large graphs with smaller graphs

Loukas A, Vandergheynst P (2018) Spectrally approximating large graphs with smaller graphs. CoRR abs/1802.07510

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Journal of Big Data 6:1:23

Weinberg AI, Last M (2019) Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classiﬁcation. Journal of Big Data 6:1:23. doi:10.1186/s40537-019-0186-3

work page doi:10.1186/s40537-019-0186-3 2019

[9] [9]

Chen PH, Fan RE, Lin CJ (2006) A study on smo-type decomposition methods for support vector machines. Trans. Neur. Network 17:893-908. doi:0.1109/TNN.2006.875973

work page arXiv 2006

[10] [10]

Fan RE, Chen PH, Lin CJ (2005) Working set selection using second order information for training support vector machines. J. Mach. Learn. Res. 6:1889-1918

work page 2005

[11] [11]

In:GECCO ’14:Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada

Nalepa J, Kawulok M (2014) A memetic algorithm to select training data for support vector machines. In:GECCO ’14:Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada. ACM, New York, NY, USA. doi:10.1145/2576768.2598370

work page doi:10.1145/2576768.2598370 2014

[12] [12]

In:ICTAI ’04:Proceedings of the 16th IEEE International Conference on Tools with Artiﬁcial Intelligence

Salvador S, Chan P (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In:ICTAI ’04:Proceedings of the 16th IEEE International Conference on Tools with Artiﬁcial Intelligence. IEEE Computer Society, Washington, DC, USA. doi:10.1109/ICTAI.2004.50

work page doi:10.1109/ictai.2004.50 2004

[13] [13]

In:16th IEEE International Conference on Tools with Artiﬁcial Intelligence, pp 663-667

Awad M, Khan L, Bastani F, Yen IL, (2004) An eﬀective support vector machines (SVMs) performance using hierarchical clustering. In:16th IEEE International Conference on Tools with Artiﬁcial Intelligence, pp 663-667. doi:10.1109/ICTAI.2004.26

work page doi:10.1109/ictai.2004.26 2004

[14] [14]

Neurocomputing 71:611-619

Cervantes J, Li X, Yu W, Li K (2008) Support vector machine classiﬁcation for large data sets via minimum enclosing ball clustering. Neurocomputing 71:611-619. doi:10.1016/j.neucom.2007.07.028

work page doi:10.1016/j.neucom.2007.07.028 2008

[15] [15]

In:IEEE International Conference on Systems, Man and Cybernetics, October 2007

Li X, Cervantes J, Yu W (2007) Two-stage svm classiﬁcation for large data sets via randomly reducing and recovering training data. In:IEEE International Conference on Systems, Man and Cybernetics, October 2007. doi:10.1109/ICSMC.2007.4413814

work page doi:10.1109/icsmc.2007.4413814 2007

[16] [16]

In:ICPR’06:18th International Conference on Pattern Recognition, August 2006, 3:433-436

Wang J, Neskovic P, Cooper LN (2006) A minimum sphere covering approach to pattern classiﬁcation. In:ICPR’06:18th International Conference on Pattern Recognition, August 2006, 3:433-436. doi:10.1109/ICPR.2006.102

work page doi:10.1109/icpr.2006.102 2006

[17] [17]

Mavroforakis ME, Theodoridis S (2006) A geometric approach to support vector machine (SVM) classiﬁcation. Trans. Neur. Network 17:671-682. doi:10.1109/TNN.2006.873281. Yadav and Bode Page 30 of 30

work page doi:10.1109/tnn.2006.873281 2006

[18] [18]

In:KDD ’00:Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA

Fung G, Mangasarian OL (2000) Data selection for support vector machine classiﬁers. In:KDD ’00:Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA. ACM, New York, NY, USA. doi:10.1145/347090.347105

work page doi:10.1145/347090.347105 2000

[19] [19]

In:ICNC’05:Proceedings of the First International Conference on Advances in Natural Computation - Volume Part I, Changsha, China

Wang J, Neskovic P, Cooper LN (2005) Training data selection for support vector machines. In:ICNC’05:Proceedings of the First International Conference on Advances in Natural Computation - Volume Part I, Changsha, China. Springer-Verlag, Berlin, Heidelberg, Germany. doi:10.1007/11539087 71

work page doi:10.1007/11539087 2005

[20] [20]

In:KDD ’03:Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA

Yu H, Yang J, Han J (2003) Classifying large data sets using SVMs with hierarchical clusters. In:KDD ’03:Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA. ACM, New York, NY, USA. doi:10.1145/956750.956786

work page doi:10.1145/956750.956786 2003

[21] [21]

ACM Transactions on Intelligent Systems and Technology2(3), 27:1–27:27 (2011)

Chang CC, Lin CJ (2011) LIBSVM: A Library for support vector machines. ACM Trans. Intell. Syst. Technol. 2:27:1-27:27. doi:10.1145/1961189.1961199

work page doi:10.1145/1961189.1961199 2011

[22] [22]

Neurocomputing 122:198-209

Chau LA, Li X, Yu W (2013) Convex and concave hulls for classiﬁcation with support vector machine. Neurocomputing 122:198-209. doi:10.1016/j.neucom.2013.05.040

work page doi:10.1016/j.neucom.2013.05.040 2013

[23] [23]

IEEE Transactions on Pattern Analysis and Machine Intelligence 36:2227-2240

Muja M, Lowe DG (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36:2227-2240. doi:10.1109/TPAMI.2014.2321376

work page doi:10.1109/tpami.2014.2321376 2014

[24] [24]

Journal of Big Data 5:1:42

Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N (2018) A survey on addressing high-class imbalance in big data. Journal of Big Data 5:1:42. doi:10.1186/s40537-018-0151-6

work page doi:10.1186/s40537-018-0151-6 2018

[25] [25]

In:SODA ’07:Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana

Arthur D, Vassilvitskii S (2007) K-means++: The advantages of careful seeding. In:SODA ’07:Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA

work page 2007

[26] [26]

https://github.com/michaelchughes/KMeansRex

Hughes M (2018) KmeansRex : Fast, vectorized C++ implementation of K-Means using the Eigen matrix template library. https://github.com/michaelchughes/KMeansRex. Accessed on 1 Dec 2018

work page 2018

[27] [27]

Journal of Open Source Software 3:726

Curtin RR, Edel M, Lozhnikov M, Mentekidis Y, Ghaisas S, Zhang S (2018) mlpack 3: a fast, ﬂexible machine learning library. Journal of Open Source Software 3:726. doi:10.21105/joss.00726

work page doi:10.21105/joss.00726 2018

[28] [28]

Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc. VLDB Endow. 5:622-633. doi:10.14778/2180912.2180915

work page doi:10.14778/2180912.2180915 2012

[29] [29]

Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: A uniﬁed engine for big data processing. Commun. ACM. 59:56-65. doi:10.1145/2934664

work page doi:10.1145/2934664 2016

[30] [30]

IEEE Transactions on Pattern Analysis and Machine Intelligence 29:1944-1957

Dhillon IS, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29:1944-1957. doi:10.1109/TPAMI.2007.1115

work page doi:10.1109/tpami.2007.1115 2007

[31] [31]

Packt Publishing, Birmingham, United Kingdom

Akgul F (2013) ZeroMQ. Packt Publishing, Birmingham, United Kingdom

work page 2013

[32] [32]

https://archive.ics.uci.edu/ml/datasets/Cuﬀ-Less+Blood+Pressure+Estimation

Kachuee M, Kiani M, Mohammadzade H, Shabany M (2015) Cuﬀ-less high-accuracy calibration-free blood pressure estimation using pulse transit time dataset. https://archive.ics.uci.edu/ml/datasets/Cuﬀ-Less+Blood+Pressure+Estimation. Accessed on 1 Dec 2018

work page 2015

[33] [33]

http://archive.ics.uci.edu/ml

Dua D, Graﬀ C (2019) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed on 1 Dec 2018

work page 2019

[34] [34]

http://archive.ics.uci.edu/ml/datasets/skin+segmentation

Bhatt R, Dhall A (2012) Skin segmentation dataset. http://archive.ics.uci.edu/ml/datasets/skin+segmentation. Accessed on 1 Dec 2018

work page 2012

[35] [35]

https://www.mpich.org/

MPICH3.3 (2018) A high performance and widely portable implementation of the message passing interface (MPI) standard. https://www.mpich.org/. Accessed 1 February 2019

work page 2018