Releases a large real-world dataset of dirty postal addresses with ground truth for benchmarking data cleaning algorithms.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
citing papers explorer
- Clean Me If You Can: A Large Collection of Real-World Addresses for Data Cleaning Benchmarking
  Releases a large real-world dataset of dirty postal addresses with ground truth for benchmarking data cleaning algorithms.
- EcoTable: Cost-effective Table Integration in Data Lakes for Natural Language Queries
  EcoTable is the first NL-based data integration framework that builds a join-likelihood graph, uses two-stage schema linking and Steiner tree search to find paths, then generates transformations with LLMs, reporting >30% accuracy gain and 5x lower cost on four real-world datasets.
- Collaborative Large and Small Language Models for Accurate and Scalable Data Repair
  LasRepair++ pairs an LLM instructor with an SLM corrector, refines context via EM, and down-weights uncertain repairs using column-calibrated confidence, reporting 18.1% average F1 gain over baselines on data repair tasks.
- Federated Learning for Multivariate Time Series Anomaly Detection in Industrial Automation
  Introduces a cyclic-dynamics dataset for industrial MTSAD and benchmarks federated anomaly detection methods on it and a public dataset.