archive
Every paper Pith has read. Search by title, abstract, or pith.
446 papers in cs.DB · page 3
-
Event languages mapped to one Temporal Datalog engine for streams
Efficient Temporal Datalog Materialisation for Composite Event Recognition
-
eBPF scheduler doubles throughput for time-sensitive DB tasks
Unfair by design: eBPF-based scheduling of mixed database workloads
-
2-bit vectors build ANN graphs for 16x faster search
QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization
-
2-bit quantization builds ANN graphs without training
QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization
-
Binary quantization builds ANN graphs for 88% recall
QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization
-
Dual HNSW graphs enable fast search for any Lp metric
U-HNSW: An Efficient Graph-based Solution to ANNS Under Universal Lp Metrics
-
Predictions let private query streams reach near-offline utility
LAPRAS: Learning-Augmented PRivate Answering for linear query Streams
-
Decentralized geohash sampling cuts geospatial stream latency
Decentralized Stratified Sampling for Low-Latency Approximate Geospatial Data Stream Processing in Edge-Cloud Architectures
-
Prompt-conditioned masking traces RAG poison to exact characters
Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence
-
This paper proposes Action Units as structured extensions to knowledge representations…
Actionable Understanding: Action Units for Bridging the Knowledge-Action Gap in Post-FAIR Knowledge Infrastructures
-
Lattice merges co-accessed vectors to cut authorized search cost
Don't Stir the Pot! Authorized Vector Data Retrieval via Access-Aware Indexing
-
Lattice method balances duplication and search for authorized vectors
Don't Stir the Pot! Authorized Vector Data Retrieval via Access-Aware Indexing
-
Five patterns decouple writes from reads in search engines
Write-Read Decoupling in Modern Large-Scale Search Engines: Architectures, Techniques, and Emerging Approaches
-
Team projects built into database courses raise grades and teamwork scores
Complete Integration of Team Project-based Learning into a Database Syllabus
-
One abstraction unifies database evolution
Living Databases: A Unified Model for Continuous Schema Evolution, Versioning, and Transformations
-
Execution-verified renamings recover Text-to-SQL accuracy on noisy schemas
EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement
-
Framework makes shuffle-DP protocols resist poisoning
Defense against Poisoning Attacks under Shuffle-DP
-
SPARQL multiset patterns match Datalog and relational algebra
Multiset semantics in SPARQL, Relational Algebra and Datalog
-
Two-phase sampling cuts online aggregation cost up to 3x
Index-Assisted Stratified Sampling for Online Aggregation
-
Tailwind speeds TPC-H queries 1.38x on average
Tailwind: A Practical Framework for Query Accelerators
-
Templates from past queries boost Text-to-SQL accuracy 36%
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding
-
GUI agents hit exact states only 23 percent of the time
FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting
-
ObjectGraph cuts agent document tokens by 95% without accuracy loss
ObjectGraph: From Document Injection to Knowledge Traversal -- A Native File Format for the Agentic Era
-
Synthetic databases reveal 3-14 percent drops in text-to-SQL accuracy
SynSQL: Synthesizing Relational Databases for Robust Evaluation of Text-to-SQL Systems
-
One model unifies table discovery from text and table queries
Unified Data Discovery across Query Modalities and User Intents
-
Graphify turns GraphQL into single optimized Gremlin queries in linear time
Graphify: Automated Synthesis of Type-Safe Graph Backends via $O(S)$ GraphQL-to-Gremlin Transpilation
-
Non-speech audio reveals spurious correlations in speech data
A Toolkit for Detecting Spurious Correlations in Speech Datasets
-
LLM search aligns pivot table schemas at 88% accuracy
PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search
-
LLM assistant cuts big data support tickets by 20.8%
SiriusHelper: An LLM Agent-Based Operations Assistant for Big Data Platforms
-
Evergreen converts verification of claims in LLM-generated semantic aggregates into…
Evergreen: Efficient Claim Verification for Semantic Aggregates
-
CacheRAG turns stateless KGQA planning into cached learning
CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering
-
Semantic search runs on 166 million clinical notes at $4k per month
Health System Scale Semantic Search Across Unstructured Clinical Notes
-
Streaming sampler approximates graphlets in constant passes
An Efficient Streaming Algorithm for Approximating Graphlet Distributions
-
Negative patterns raise viral classification accuracy
Mining Negative Sequential Patterns to Improve Viral Genomic Feature Representation and Classification
-
VisualNeo connects visual queries to Neo4j for graph searches
VisualNeo: Bridging the Gap between Visual Query Interfaces and Graph Query Engines
-
Algorithms hide all sensitive cross-level utility patterns without fakes
Cross-level Privacy Preserving Utility Mining
-
RL learns to clean tabular data for foundation model priors
Prior-Aligned Data Cleaning for Tabular Foundation Models
-
Fixed-input lock keeps Spark policy outputs identical under repartitioning
Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark
-
Dynamic attacks slow ALEX lookups up to 2.8x
Poisoning Learned Index Structures: Static and Dynamic Adversarial Attacks on ALEX
-
Autoencoder rewrites speed hybrid vector queries 2x on average
BoomHQ: Learning to Boost Multiple Hybrid Queries on Vector DBMSs
-
Sliding window finds dense patterns exactly without gap parameters
Exact Mining of Dense Patterns via Direct Evaluation of Local Interval Frequency Using a Sliding Window
-
Late materialization slashes storage for long user sequences in DLRMs
Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale
-
IM chat turns natural language into complete data reports
DataClaw: An Autonomous Data Agent with Instant Messaging Integration
-
RL distills agentic reasoning into private product mapping models
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
-
SEMA-SQL mixes SQL with LLM semantics for natural language database questions
SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models
-
SEMA-SQL mixes SQL with LLM reasoning for semantic database queries
SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models
-
Branchwidth approximates submodular width to within 3/2
Cuts and Gauges for Submodular Width
-
Dataset released for 10,000 early AI agents on Ethereum
A dataset of early blockchain-registered AI agents on Ethereum
-
Atomic RDF Datasets can serve as standardized messages for streaming
It's Time to Standardize RDF Messages
-
Formal library verifies chase as universal model
The Chase in Lean -- Crafting a Formal Library for Existential Rule Research