pith. machine review for the scientific record. sign in

arxiv: 1906.03543 · v1 · submitted 2019-06-08 · 💻 cs.LG · stat.ML

Recognition: unknown

apricot: Submodular selection for data summarization in Python

Authors on Pith no claims yet
classification 💻 cs.LG stat.ML
keywords apricotdataselectionsubmodularalgorithmapplicablebroadlycode
0
0 comments X
read the original abstract

We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. Apricot is extremely efficient, using both algorithmic speedups such as the lazy greedy algorithm and code optimizers such as numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. This paper presents an explanation of submodular selection, an overview of the features in apricot, and an application to several data sets. The code and tutorial Jupyter notebooks are available at https://github.com/jmschrei/apricot

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.