archive
Every paper Pith has read.
446 papers in cs.DB · page 7
-
Tree structure lifts document QA accuracy 6-61 percent
MoDora: Tree-Based Semi-Structured Document Analysis System
-
DPSQL+ adds minimum frequency rule to private SQL queries
DPSQL+: A Differentially Private SQL Library with a Minimum Frequency Rule
-
Thresholds on modal operators decompose fuzzy contexts into independent subcontexts
Decomposition of contexts into independent subcontexts based on thresholds
-
Fuzzy contexts split into independent subcontexts via lattice blocks
Independent subcontexts and blocks of concept lattices. Definitions and relationships to decompose fuzzy contexts
-
Text-to-SQL benchmarks miss big-data cost penalties
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
-
SmartNIC offloads Parquet decoding to speed lake queries
Should I Hide My Duck in the Lake?
-
172 open datasets found in learning analytics papers
Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE
-
Sonar-TS searches time series with SQL then verifies with code
Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
-
Algorithm maps workflow nets to POWL models preserving behavior
Hierarchical Decomposition of Separable Workflow-Nets
-
Algorithm decides SPARQL pattern satisfiability on Façade-X
Towards a theory of Façade-X data access: satisfiability of SPARQL basic graph patterns
-
Variable local prompts cut conflicts in federated vision learning
SDFed: Bridging Local Global Discrepancy via Subspace Refinement and Divergence Control in Federated Prompt Learning
-
Original papers outperform tutorials for system design mastery
The Computer System Trail
-
KRONE turns flat logs into hierarchies for 10% better anomaly F1
KRONE: Scalable LLM-Augmented Log Anomaly Detection via Hierarchical Abstraction
-
LLMs need intrinsic strategies for data insight agency
Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
-
Primary access hints speed Ethereum replay 25x
Ira: Efficient Transaction Replay for Distributed Systems
-
Context packs raise LLM schema matching accuracy
ConStruM: A Structure-Guided LLM Framework for Context-Aware Schema Matching
-
HyEm lets Euclidean indexes retrieve hyperbolic ontology embeddings
HyEm: Query-Adaptive Hyperbolic Retrieval for Biomedical Ontologies via Euclidean Vector Indexing
-
Self-refinement and voting reach 86 percent SQL accuracy
LLM-Based SQL Generation: Prompting, Self-Refinement, and Adaptive Weighted Majority Voting
-
New predict operator lets SQL run LLM calls inside the database
iPDB -- Optimizing Semantic SQL Queries
-
Algorithm turns math database schemes into relational apps
Translating database mathematical schemes into relational database software applications with MatBase
-
Natural language turns into SQL for cross-domain data exploration
TiInsight: A SQL-based Automated Exploratory Data Analysis System through Large Language Models
-
Honeynet dataset logs 132k attacks across four Azure regions
Descriptor: Multi-Regional Cloud Honeypot Dataset (MURHCAD)
-
Online tool mines MLCS from sequences up to length 5000
OVT-MLCS: An Online Visual Tool for MLCS Mining from Long or Big Sequences
-
Temporal attribution tracks dataflow dependencies lightly over time
Toward Temporal Attribution Analytics in Dataflows
-
PANDAExpress drops polylog factor from query runtime
PANDAExpress: a Simpler and Faster PANDA Algorithm
-
OSM+ delivers billion-vertex global road graph for city experiments
OSM+: Billion-Level OpenStreetMap Dataset for City-wide Experiments
-
LLMs auto-swapped for cheaper models on repeated tasks
Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement
-
PyTorch I/O tweaks yield 3x faster distributed GPU queries
PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage
-
LLM agents repair 8771 invalid MOF database entries
LitMOF: An LLM Multi-Agent for Literature-Validated Metal-Organic Frameworks Database Correction and Expansion
-
Algorithm converts math data models to entity-relationship diagrams
MatBase algorithm for translating (E)MDM schemes into E-R data models
-
Tokenized context speeds edge LLM responses by up to 14%
DisCEdge: Distributed Context Management for Large Language Models at the Edge
-
Neural query models match path counting after relaxation
Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
-
Hybrid index cuts tail latency 98% under mixed workloads
HIRE: A Hybrid Learned Index for Robust and Efficient Performance under Mixed Workloads
-
Answer-set programs compute sufficient explanations for database queries
Sufficient Explanations in Databases and their Connections to Database Repairs
-
Compiler derives pruning rules for tree queries
Bonsai: Compiling Queries to Pruned Tree Traversals
-
Gradient descent finds join orders with cost matching or beating discrete search
Gradient-Based Join Ordering
-
GNN-PE scales exact subgraph matching to distributed clusters
Efficient Distributed Exact Subgraph Matching via GNN-PE: Load Balancing, Cache Optimization, and Query Plan Ranking
-
AISQL speeds semantic queries 2-70x via cost-aware planning and cascades
Cortex AISQL: A Production SQL Engine for Unstructured Data
-
Learned models break entropy barrier in static functions
Learned Static Function Data Structures
-
DGAI separates vectors from graphs for 8x faster ANN updates
DGAI: Decoupled On-Disk Graph-Based ANN Index for Efficient Updates and Queries
-
Orchestration reaches 89.8% Text-to-SQL accuracy on Spider
DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework
-
Database feedback and memory improve multi-turn SQL accuracy
MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training
-
Algorithms mine low-utility sequences faster with less memory
Efficient Mining of Low-Utility Sequential Patterns
-
Partial orders from event data yield sound process models
Revealing Inherent Concurrency in Event Data: A Partial Order Approach to Process Discovery
-
ScaleDoc filters 85% of LLM calls on large document sets
ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
-
TurtleKV adapts key-value stores to shifting read and write demands
Dynamic read & write optimization with TurtleKV
-
Users cannot tell acted idle animations from genuine ones
Evaluating Idle Animation Believability: a User Perspective
-
Optimizer picks best LLM sort paths at runtime
Access Paths for Efficient Ordering with Large Language Models
-
Versioned views and when-then rules make prompts adaptive in LLM pipelines
Making Prompts First-Class Citizens for Adaptive LLM Pipelines
-
Merged T1D data yields 149 million glucose readings from 2510 subjects
Presenting DiaData for Research on Type 1 Diabetes