Language models are realistic tabular data generators
11 Pith papers cite this work.
citing papers explorer
- Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data
  ICL in LLMs shows a sharp ceiling on categorical distributions for high-cardinality tabular data, failing to reproduce rare classes despite examples, while numerical fidelity improves.
- Concordia: Self-Improving Synthetic Tables for Federated LLMs
  Concordia aligns synthetic table generation with federated validation utility via client-side utility scorers and group-relative policy optimization to improve LLM adaptation on non-IID tabular tasks.
- LLM-Driven Performance-Space Augmentation for Meta-Learning-Based Algorithm Selection
  LLM-generated synthetic datasets steered uniformly across a 2D performance space defined by two landmark algorithms improve meta-learner performance on algorithm selection for regression tasks.
- AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
  AnomalyVFM converts vision foundation models into zero-shot anomaly detectors via three-stage synthetic dataset generation plus low-rank adapters and weighted pixel loss, reaching 94.1% average image AUROC across nine datasets.
- When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation
  LLM tabular generators leak memorized numeric strings, allowing a no-box attack to achieve near-perfect membership inference on some state-of-the-art models.
- Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning
  DiffICL breaks the quality-privacy tradeoff in small-data tabular synthesis by using in-context learning on pretrained structural priors to generate data that is both higher quality and less memorizing of training samples.
- The Power of Order: Fooling LLMs with Adversarial Table Permutations
  Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
  Proposes three metrics for inter-column logical relationships in synthetic tabular data and reports that current generators often fail to preserve them on an industrial dataset.
- PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization
  PSyGenTAB is a constrained-optimization framework that generates privacy-preserving synthetic clinical tabular data while preserving clinical relationships and downstream model performance.
- Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training
  TabGRAA applies group-relative advantage alignment in an iterative reward-guided post-training loop to improve tabular language model generators on fidelity, utility, and privacy trade-offs across five benchmarks.
- Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation
  CTGAN and LLMs generate synthetic student data that passes statistical and predictive utility checks for learning analytics.