PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.
Space/time trade-offs in hash coding with allowable errors,
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Bloom filter encodings convert data samples to bit arrays that support comparable classifier performance to raw data across text, time-series, tabular, and image datasets while delivering consistent memory savings.
citing papers explorer
-
PipeANN-Filter: An Efficient Filtered Vector Search System on SSD
PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Bloom Filter Encoding for Machine Learning
Bloom filter encodings convert data samples to bit arrays that support comparable classifier performance to raw data across text, time-series, tabular, and image datasets while delivering consistent memory savings.