archive
Every paper Pith has read.
446 papers in cs.DB · page 5
-
OffloadFS moves database compaction to storage nodes for 3.36x speedup
OffloadFS: Leveraging Disaggregated Storage for Computation Offloading
-
Benchmark reveals 9% drop for Indic languages in Text-to-SQL
IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
-
Self-healing LLM loop boosts NL-to-SQL accuracy up to 9 points
SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation
-
DySkew cuts UDF skew delays with runtime data swaps
DySkew: Dynamic Data Redistribution for Skew-Resilient Snowpark UDF Execution
-
Log anomalies detected directly on compressed bytes
CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations
-
ROSE judges NL2SQL by user intent not reference SQL match
ROSE: An Intent-Centered Evaluation Metric for NL2SQL
-
Panoramic 3D datasets reach 96 percent place categorization accuracy
Multi-modal panoramic 3D outdoor datasets for place categorization
-
Workflow builds reproducible 582k-paper chemistry corpus
Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining
-
Three verification layers catch outsourced anonymization errors
VeriX-Anon: A Multi-Layered Framework for Mathematically Verifiable Outsourced Target-Driven Data Anonymization
-
Benchmark reveals accuracy and efficiency gaps in NL2SQL methods
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
-
GraphAlg Core simulated in matrix language with induction
Foundations of the GraphAlg Language
-
Ozone unifies four traffic datasets to cut experiment setup time 85%
Ozone: A Unified Platform for Transportation Research
-
NL queries must sometimes invent their own data target
Natural Language to What? A Vision for Intermediate Representations in NL-to-X Querying
-
Fine-grained GPU model scales subgraph matching to larger queries
gMatch: Fine-Grained and Hardware-Efficient Subgraph Matching on GPUs
-
Knowledge graph unifies AI artifact management across platforms
Gypscie: A Cross-Platform AI Artifact Management System
-
Benchmark lets dialogue clarify ambiguous table questions
ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification
-
New benchmark and alignment model raise VCOD performance on motion-heavy video
YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection
-
Dynamic programming chooses semantic filter positions to cut hybrid query costs
PLOP: Cost-Based Placement of Semantic Operators in Hybrid Query Plans
-
Catalog defines 35 data error types in three categories
A Catalog of Data Errors
-
Decoupling vectors from indexes cuts storage by up to 58%
Decoupling Vector Data and Index Storage for Space Efficiency
-
Decoupling vectors from indexes cuts storage by up to 59%
Decoupling Vector Data and Index Storage for Space Efficiency
-
Proprietary tools top data quality metrics and LLM features
Evaluating Data Quality Tools: Measurement Capabilities and LLM Integration
-
Constraint solver matches patients to 32-72% more trials
SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching
-
Constraint-guided LLMs generate graph queries with 31.6% F1 gains
Graph Query Generation with Constraint-guided Large Language Agents
-
Fine-tuned LLMs translate QoS to QoE and back with strong accuracy
QoS-QoE Translation with Large Language Model
-
Color coding counts hypergraphlets faster on (α,β)-nice hypergraphs
Counting HyperGraphlets via Color Coding: a Quadratic Barrier and How to Break It
-
Dynamic graph method speeds up large language model training
GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization
-
FK graph traversal yields diverse SQL workloads for optimizer training
SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking
-
PostRI gives DP medians higher utility with post-release error intervals
Interpreting the Error of Differentially Private Median Queries through Randomization Intervals
-
Agent views help AI write complex SQL queries
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views
-
VulGD builds dynamic graph of vulnerabilities with LLM embeddings
VulGD: A LLM-Powered Dynamic Open-Access Vulnerability Graph Database
-
Small models outperform rules and LLMs in SQL query rewriting
LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
-
LLMs vary SQL structure even when results match
SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation
-
CubeGraph stitches per-cell vector graphs for fast hybrid spatial search
CubeGraph: Efficient Retrieval-Augmented Generation for Spatial and Temporal Data
-
Non-forking chains cut Verkle Trie storage by 97.8%
SonicDB S6: A Storage-Efficient Verkle Trie for High-Throughput Blockchains
-
Verkle Trie storage slashed 98% for 300ms blocks
SonicDB S6: A Storage-Efficient Verkle Trie for High-Throughput Blockchains
-
Co-evolved evaluators find 6.8x faster database algorithms
AI-Driven Research for Databases
-
Bayesian net models missingness to create probabilistic DB for queries
Database Querying under Missing Values Governed by Missingness Mechanisms
-
Toolkit pairs Python pipelines with AI chat for data harmonization
BDI-Kit Demo: A Toolkit for Programmable and Conversational Data Harmonization
-
ICT models become useful when treated as traceable open graphs
All LCA models are wrong. Are some of them useful? Towards open computational LCA in ICT
-
Equivalence proofs compose correct database stores
CobbleDB: Modelling Levelled Storage by Composition
-
Few central vectors poison nearly all top-k results
Can You Trust the Vectors in Your Vector Database? Black-Hole Attack from Embedding Space Defects
-
Spatiotemporal warehouse lifts entity extraction F1 by 4.37%
STIndex: A Context-Aware Multi-Dimensional Spatiotemporal Information Extraction System
-
PANDA derives optimal query plans from information bounds
Query Optimization and Evaluation via Information Theory: A Tutorial
-
Adaptive probing estimates high-dim similarity cardinalities
Cardinality Estimation for High Dimensional Similarity Queries with Adaptive Bucket Probing
-
LLM operators bring semantic processing to text streams
VectraFlow: Long-Horizon Semantic Processing over Data and Event Streams with LLMs
-
LLM turns changing web pages into verified JSON via embedding checks
Method for Aggregating Unstructured Data Using Large Language Models
-
Pessimistic sync cuts redundant I/Os in disaggregated KV stores
CIDER: Boosting Memory-Disaggregated Key-Value Stores with Pessimistic Synchronization
-
Workshop maps LLM-graph integration for data systems
LLM+Graph@VLDB'2025 Workshop Summary