archive
Every paper Pith has read.
446 papers in cs.DB · page 6
-
Native graph index cuts multi-vector search latency up to 14 times
Unified and Efficient Approach for Multi-Vector Similarity Search
-
DCO shortcuts unstable in vector search benchmarks
Distance Comparison Operations Are Not Silver Bullets in Vector Similarity Search: A Benchmark Study on Their Merits and Limits
-
Clustering gives LLMs full dataset context for semantic tasks
Semantic Data Processing with Holistic Data Understanding
-
ReCAP makes relational DBs up to 400000x faster on constrained path queries
Efficient Path Query Processing in Relational Database Systems
-
Hybrid query system cuts cost while raising accuracy on mixed structured-text tables
OmniTQA: A Cost-Aware System for Hybrid Query Processing over Semi-Structured Data
-
Bucket collector speeds large-k ANN search up to 3.8x
BBC: Improving Large-k Approximate Nearest Neighbor Search with a Bucket-based Result Collector
-
Hybrid memory beats state-of-the-art LLM agent methods
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
-
LLM tool adds database functions 34 percent more accurately
Automating Database-Native Function Code Synthesis with LLMs
-
Unifying 8,000 atomistic simulations into one queryable graph
Ontology-based knowledge graph infrastructure for interoperable atomistic simulation data
-
GPU bucketing delivers 240x faster hybrid searches
GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
-
Query focus cuts RAG response time by 40 percent
QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference
-
Platform structures electrospinning data including failures for predictive use
Electrospinning-Data.org: A FAIR, Structured Knowledge Resource for Nanofiber Fabrication
-
Enzyme cuts daily pipeline compute by billions of CPU seconds
Enzyme: Incremental View Maintenance for Data Engineering
-
Streaming context cuts LLM first-token latency by up to 11x
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)
-
Streaming overlaps cut LLM first response time by 11x
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)
-
Ontology encodes EU Data Act rules for SPARQL compliance checks
DAOnt: A Formal Ontology for EU Data Act Compliance
-
Refutational normalization speeds up complete JSON schema checks
JSON Schema Inclusion through Refutational Normalization: Reconciling Efficiency and Completeness
-
Survey maps NLIDB methods for spatial-temporal databases
Natural Language Interfaces for Spatial and Temporal Databases: A Comprehensive Overview of Methods, Taxonomy, and Future Directions
-
Value-based quadtree cuts spatial query time by 90%
Spatial Analysis on Value-Based Quadtrees of Rasterized Vector Data
-
Value-based quadtree cuts point-in-polygon latency by 90%
Spatial Analysis on Value-Based Quadtrees of Rasterized Vector Data
-
Embedding random tests inside the DBMS finds 23 bugs with higher true positives
DIRT: Database-Integrated Random Testing
-
Hybrid decoding speeds robot VLA models up to 2.45x
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
-
Assumptions enable clear dynamic relationships in object event logs
Detecting Dynamic Relationships in Object-Centric Event Logs
-
Itemset mining groups cities by shared land use patterns
Exploring Urban Land Use Patterns by Pattern Mining and Unsupervised Learning
-
Proxy models cut AI query costs by over 100x
100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
-
Fixes for one agent model improve 13 others across seven families
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI
-
Catalog system converts natural language to PromQL queries in 1.1 seconds
From Natural Language to PromQL: A Catalog-Driven Framework with Dynamic Temporal Resolution for Cloud-Native Observability
-
Dynamic counters deliver sublinear error for growing stream sketches
Sublime: Sublinear Error & Space for Unbounded Skewed Streams
-
Jaguar evaluates queries in N to the submodular width plus epsilon
Jaguar: A Primal Algorithm for Conjunctive Query Evaluation in Submodular-Width Time
-
DSL lets LLMs produce consistent sensor triggers
A Domain-Specific Language for LLM-Driven Trigger Generation in Multimodal Data Collection
-
One graph index covers every range for filtered ANN search
RNSG: A Range-Aware Graph Index for Efficient Range-Filtered Approximate Nearest Neighbor Search
-
Optimal sampling for matrix, star, and chain join-project queries
Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries
-
Toolkit auto-generates standard APIs for materials datasets
optimade-maker: Automated generation of interoperable materials APIs from static datasets
-
Toolkit turns raw materials data into standard APIs
optimade-maker: Automated generation of interoperable materials APIs from static datasets
-
Self-evolved cycles reach 83.1% MongoDB query accuracy
Draft-Refine-Optimize: Self-Evolved Learning for Natural Language to MongoDB Query Generation
-
Real-time terminology queries improve LLM metadata accuracy
Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent
-
GeoBenchr benchmarks spatiotemporal DBs on real workloads
GeoBenchr: An Application-Centric Benchmarking Suite for Spatiotemporal Database Platforms
-
LLM index tuning outperforms DTA in some cases
Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor
-
OMA retains Kubernetes crash evidence past the evidence horizon
Operational Memory Architecture for Kubernetes: Preserving Causal Context Across the Evidence Horizon
-
DEBISS corpus supplies annotated spoken debates for NLP tasks
DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
-
LLM templates let GNNs run 28x faster on huge graphs with 98% less memory
An LLM-Guided Query-Aware Inference System for GNN Models on Large Knowledge Graphs
-
Mined constraints create realistic SQL query test cases
SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
-
Synthetic pre-training produces relational in-context learner
Relational In-Context Learning via Synthetic Pre-training with Structural Prior
-
Graphs gain normal forms that cover edge dependencies
Graph-Native Normalization
-
Taxonomy groups LLM database operators into five categories
Large Language Model-Enhanced Relational Operators: Taxonomy, Benchmark, and Analysis
-
Python functions generate ocean RDF without semantic web tools
A Pythonic Functional Approach for Semantic Data Harmonisation in the ILIAD Project
-
Item-level data required for rigorous AI evaluation
AI Evaluation Should Require Standardized Item-Level Data Releases
-
Item-level data releases required for valid AI benchmarks
AI Evaluation Should Require Standardized Item-Level Data Releases
-
LLM agents generate large table datasets for recognition
TableNet: A Large-Scale Table Dataset with LLM-Powered Autonomous
-
On-disk vector search matches in-memory speed
AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search