SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models
Canonical reference. 100% of citing Pith papers cite this work as background.
roles
background 6
polarities
background 6
representative citing papers
Learned static functions combine per-key ML-predicted prefix codes with classic static function storage to compress static key-value mappings beyond zero-order entropy limits.
TurtleKV uses a balanced TurtleTree on-disk structure and flexible memory tuning knobs to deliver strong performance across inserts, mixed workloads, point queries, and scans in YCSB tests, matching or beating SplinterDB, RocksDB, and WiredTiger.
First dedicated survey organizing diffusion and flow matching models for tabular data synthesis, imputation, anomaly detection, and related tasks, covering literature from 2015 to 2026 and highlighting open problems.
citing papers explorer
-
ReSequel: Robust LLM-assisted Query Rewriting and Optimization using Templatization and Sampling
ReSequel uses LLMs guided by metadata-derived templates and sampling-based verification to rewrite SQL queries, delivering up to 16x workload speedups over native DBMSs and 22x over prior LLM baselines across eight benchmarks and three systems.
-
A Fast Gaussian Mechanism under Continual Observation, with Applications
A new data structure samples any entry of the noise vector in constant time while exactly reproducing the binary tree Gaussian mechanism distribution, applied to DP CountSketches for improved range counting and join size estimation.
-
Arbitrage-free Data Pricing
The paper shows that arbitrage-free information pricing is computationally hard in general, provides a branch-and-bound algorithm, and proves that for threshold utilities arbitrage-freeness reduces to Blackwell dominance, unifying prior query and model pricing results.
-
Generative Conversational Recommender System
A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.
-
DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG
DARE-EEG is a self-supervised EEG foundation model that enforces mask-invariance via contrastive mask alignment and momentum anchor alignment, plus conv-linear-probing for heterogeneous setups, achieving SOTA accuracy and cross-dataset portability.
-
Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning
A rule-based strikingness measure is added to TKGR metrics to weight rare events higher, revealing that models weaken on striking events and ensemble gains come mostly from trivial fits.
-
U-HNSW: An Efficient Graph-based Solution to ANNS Under Universal Lp Metrics
U-HNSW is the first graph-based index for approximate nearest neighbor search under all Lp metrics (0 < p <= 2) simultaneously, using L1/L2 HNSW graphs plus early-termination verification to beat MLSH query times.
-
Limitations of LTI Koopman Modeling for Nonlinear Control Systems
Exact LTI Koopman models for nonlinear control systems require affine linear dynamics under controllability and coordinate inclusion assumptions.
-
SynHAT: A Two-stage Coarse-to-Fine Diffusion Framework for Synthesizing Human Activity Traces
SynHAT uses a novel two-stage spatio-temporal diffusion framework with Latent Spatio-Temporal U-Net to synthesize realistic human activity traces, outperforming baselines by 52% on spatial and 33% on temporal metrics across four cities.
-
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
-
GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.
-
Sublime: Sublinear Error & Space for Unbounded Skewed Streams
Sublime generalizes Count-Min and Count Sketch with dynamically elongating counters and expanding counter arrays to deliver sublinear error growth and lower memory use on skewed unbounded streams.
-
An LLM-Guided Query-Aware Inference System for GNN Models on Large Knowledge Graphs
KG-WISE decomposes GNN models and uses LLM-generated query templates for partial loading of relevant components, achieving up to 28x faster inference and 98% lower memory on KGs with up to 42 million nodes while preserving accuracy.
-
EcoTable: Cost-effective Table Integration in Data Lakes for Natural Language Queries
EcoTable is the first NL-based data integration framework that builds a join-likelihood graph, uses two-stage schema linking and Steiner tree search to find paths, then generates transformations with LLMs, reporting >30% accuracy gain and 5x lower cost on four real-world datasets.
-
Can Aggregate Invariants Accelerate Continuous Subgraph Matching? Limits, Laws, and a Dynamic Spectral Index
Spectral aggregate tests prune up to 51% of candidates in CSM but leave enumeration intermediates unchanged beyond initial bindings across tested workloads.
-
Disk-Based Interval Indexes Under the Increasing Ending Time Assumption
CEB and TIDE are two-layer append-only B+-tree indexes for intervals under the increasing ending time assumption that claim smaller size, faster insertions, and superior query performance over prior art.
-
A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction
Formulates pre-hoc fine-tuning prediction as stochastic estimation, proves lower bound on optimization variance decay rate, and introduces a three-regime predictability phase diagram.
-
SIDInspector: A Mapping-First Diagnostic Resource for Semantic-ID Tokenizers
SIDInspector provides a standardized adapter contract and mapping-level probes for Semantic-ID tokenizers, with empirical contrasts showing high aliasing in GRID-style exports and superior prefix alignment from deterministic controls on Musical items.
-
ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing
ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.
-
ANN Search: Recall What Matters
ANN search quality is better assessed by 1/Ratio@k than Recall@k because the former tracks downstream task utility more closely while allowing substantially lower computational cost.
-
Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics
LMs compare unit quantities via number-specific and unit-specific heuristics rather than unified scale conversion, evidenced by degraded accuracy near boundaries, linear surrogate predictions, and causal subspace interventions.
-
SOLANET: Distributed Neighbor Graph Construction on GPU-Accelerated Systems
SOLANET is a distributed GPU toolkit for neighbor graph construction that reports 11X speedup on 512 APUs for 1B points and 6.9X for 2B points.
-
LASAR: Latent Adaptive Semantic Aligned Reasoning for Generative Recommendation
LASAR uses two-stage supervised training plus reinforcement learning to ground semantic IDs, align latent reasoning trajectories to CoT hidden states via KL divergence, and adaptively choose reasoning depth, halving average steps while improving quality on three datasets.
-
Generalized Category Discovery in Federated Graph Learning
GCD-FGL mitigates neighborhood absorption and global semantic inconsistency in federated generalized category discovery, delivering +4.86 average HRScore gain over baselines on five graph datasets.
-
GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization
GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.
-
Towards Efficient and Generalizable Retrieval: Adaptive Semantic Quantization and Residual Knowledge Transfer
SA²CRQ uses sequential adaptive residual quantization based on path entropy plus anchored curriculum regularization from head items to improve both efficiency and cold-start performance in generative retrieval.
-
Towards Federated Long-Tailed Graph Learning: An Energy-Guided Dual Decoupling Approach
FedEPD decouples topological purification from semantic recalibration using energy-guided pruning and prototype injection to improve minority performance in federated long-tailed graph learning.
-
TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins
TUNEAHEAD predicts fine-tuning performance from meta-features and short probes, reporting RMSE 1.47 and 95.1% of predictions within 3 points on 370 held-out runs of Qwen2.5-7B.
-
HRNN: A Hybrid Graph Index for Approximate Reverse k-Nearest Neighbor Search on High-Dimensional Vectors
HRNN combines a navigation graph, ranked KNN graph, and reverse-neighbor lists with proxy-based candidate generation and materialized kNN-radii to achieve up to 10x higher throughput for approximate RkNN on datasets up to 10M vectors.
-
SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching
SemStruct models tables as heterogeneous graphs with GNNs on frozen PLM embeddings to incorporate row co-occurrences for schema matching and reports SOTA results on Valentine and SOTAB-SM benchmarks.
-
Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory
Co-design of 14.5x compacted index, asynchronous scheduler, and multiplication-free kernel for PIM-based graph ANNS delivers up to 20x CPU and 17.1x GPU throughput on billion-scale benchmarks.
-
TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning
TwiSTAR learns to switch between fast SID retrieval and slow rationale-generating reasoning in generative recommendation, yielding better accuracy-latency trade-offs on three datasets.
-
TabEmb: Joint Semantic-Structure Embedding for Table Annotation
TabEmb decouples LLM-based semantic column embeddings from graph-based structural modeling to produce joint representations that improve table annotation tasks.
-
LogCopilot: Automating Log Aggregation Analysis through Large Language Models
LogCopilot is an LLM framework that builds a hierarchical knowledge base from logs and generates/executes LogQL queries from natural language instructions, reporting 76.8% average accuracy across four datasets.
-
Bridging Short Videos and Live Streams: Reasoning-Guided Multimodal LLMs for Cross-Domain Representation Learning
RGCD-Rep distills cross-domain reasoning from a frozen MLLM teacher and learns decomposed transferable item representations via two-stage training, yielding gains in offline experiments and production A/B tests on a live streaming platform.
-
Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection
CoAD unifies outlier exposure classification and masked autoencoder reconstruction in a cooperative loop to detect subtle and prolonged time series anomalies.
-
A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification
MountDB extends RocksDB with Memtable-level model reuse and a block-aware learned disk index, reporting up to 1.5X write and 2.1X read throughput over state-of-the-art on large-scale workloads.
-
To GPU or Not to GPU: Vector Search in Relational Engines
Relational engines achieve faster SQL+vector-search queries on GPU than CPU when using compact vector indexes and fast interconnects, reversing the CPU-only design in current systems.
-
PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection
PaAno+ extends the original PaAno with multiscale feature extraction, cross-variable fusion attention, and a temporal patch sorting pretext task to report state-of-the-art results on the TSB-AD benchmark for univariate and multivariate anomaly detection.