archive
Every paper Pith has read. Search by title, abstract, or pith.
446 papers in cs.DB · page 4
-
Bounded self-joins make fact relevance as easy as query evaluation
How Hard is it to Decide if a Fact is Relevant to a Query?
-
Unified model pivots database migration across heterogeneous systems
A Model-Driven Approach to Database Migration with a Unified Data Model
-
Maximal-clique index speeds filtered nearest-neighbor search
MCI: A Maximal Clique Index for Efficient Arbitrary-Filtered Approximate Nearest Neighbor Search
-
ESPRESSO scales keyword search over Solid pods with privacy
Implementation and Privacy Guarantees for Scalable Keyword Search on SOLID-based Decentralized Data with Granular Visibility Constraints
-
Best tabular embedding model varies by task and level
Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks
-
ASP(Q) first implements globally-optimal repairs for inconsistent data
Using ASP(Q) to Handle Inconsistent Prioritized Data
-
Delta Lake loads fastest, Iceberg saves most space
Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems
-
Query algebra and wrappers replace LLM agents for enterprise data
An Alternate Agentic AI Architecture (It's About the Data)
-
SQLyzr adds diverse metrics and realism to text-to-SQL evaluation
A Demonstration of SQLyzr: A Platform for Fine-Grained Text-to-SQL Evaluation and Analysis
-
Only 150k scientific posters shared across 86 platforms
The State of Scientific Poster Sharing and Reuse
-
FPGA level-wise batch search speeds B+ tree lookups 4.9x
Efficient Batch Search Algorithm for B+ Tree Index Structures with Level-Wise Traversal on FPGAs
-
ShEx and SHACL match on large recursive fragments via duality
Common Foundations for Recursive Shape Languages
-
ShEx and SHACL fragments match via fixpoint duality
Common Foundations for Recursive Shape Languages
-
Self-aware embeddings double RAG accuracy on versioned queries
Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge
-
New framework checks isolation levels without database internals
Making Transaction Isolation Checking Practical
-
Vision-based tactile dataset scales bimanual robot data
VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation
-
Low-dim stats cut noise in private power-law exponent estimates
Estimating Power-Law Exponent with Edge Differential Privacy
-
ML model predicts query slot-time before execution
Pre-Execution Query Slot-Time Prediction in Cloud Data Warehouses: A Feature-Scoped Machine Learning Approach
-
LLM agent finds minimal data sets for analysis at 83% F1
An Agentic Approach to Metadata Reasoning
-
Garfield cuts RFANNS index size 4.4x and raises throughput 120x
A GPU-Accelerated Framework for Multi-Attribute Range Filtered Approximate Nearest Neighbor Search
-
First GPU Datalog engine uses WCOJ to avoid memory blowup
Scaling Worst-Case Optimal Datalog to GPUs
-
GPU pipeline speeds 3D polyhedral spatial joins by 9x
3DPipe: A Pipelined GPU Framework for Scalable Generalized Spatial Join over Polyhedral Objects
-
RaBitQ outperforms TurboQuant on most quantization tasks
Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments
-
Online schema alignment recovers full results in decentralized queries
Demonstrating Online Schema Alignment in Decentralized Knowledge Graphs Querying
-
Monotonic embeddings prune more vertices in subgraph matching
LIVE: Learnable Monotonic Vertex Embedding for Efficient Exact Subgraph Matching (Technical Report)
-
Heuristic partitioning cuts multi-tenant query P95 latency from 61s to 2s
Heuristic Search Space Partitioning for Low-Latency Multi-Tenant Cloud Queries
-
Tool-augmented LLMs beat static ones on warehouse graph reasoning
DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
-
Open data model v3 fixes wastewater surveillance data sharing
The Public Health and Environmental Surveillance Open Data Model (PHES-ODM) Version 3: An Open, Relational Data Model and Interoperability Framework for Wastewater Surveillance
-
Modular adapters beat fine-tuning on hard SQL queries
LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL
-
Topology grouping cuts token use 50-90% in LLM social simulations
Topology-Aware LLM-Driven Social Simulation: A Unified Framework for Efficient and Realistic Agent Dynamics
-
Syntactic tests flag contamination in old NL2SQL benchmarks
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
-
Database probing and rule checks raise text-to-SQL accuracy 5%
PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents
-
Branchable databases slow reads up to 4000x as agent branches deepen
BranchBench: Aligning Database Branching with Agentic Demands
-
New benchmark finds AI agents falter on complex personalized home tasks
PersonalHomeBench: Evaluating Agents in Personalized Smart Homes
-
Agents falter in smart homes as tasks grow complex
PersonalHomeBench: Evaluating Agents in Personalized Smart Homes
-
Flipped indexing delivers 6.5x lower GPU query latency with dynamic updates
FliX: Flipped-Indexing for Scalable GPU Queries and Updates
-
QMutBench gives 700k quantum mutants to benchmark tests
QMutBench: A Dataset of Quantum Circuit Mutants
-
Policy structure dictates database optimizer plans
Compliance in Databases: A Study of Structural Policies and Query Optimization
-
Agent autonomy pushes humans to supervisor roles in visual analytics
Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows
-
Event cameras enable lip-motion speaker ID across new views and lights
NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition
-
Response feedback backpropagates to refine KG-RAG by 7.34%
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
-
Small model attention prunes long docs to 10% for big QA
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing
-
Layer treats LLMs and web as databases for natural language data queries
Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications
-
SQL and Python agreement on tiny database picks correct queries
DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency
-
Four-layer architecture unifies reconciliation and anomaly detection
Data Engineering Patterns for Cross-System Reconciliation in Regulated Enterprises: Architecture, Anomaly Detection, and Governance
-
PP-FP-tree finds top keyword k-core communities in public-private graphs
Efficient Community Search on Attributed Public-Private Graphs
-
RELOAD is a learned query optimizer that reduces individual query performance regressions…
RELOAD: A Robust and Efficient Learned Query Optimizer for Database Systems
-
PIM hardware speeds R-tree queries up to 3.66x with less energy
Parallel R-tree-based Spatial Query Processing on a Commercial Processing-in-Memory System
-
Beliefs and policies declaratively control LLM pipelines
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
-
GLOW hybrid boosts open-world QA on incomplete KGs by 38% on average
Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs