
arxiv: 2605.17758 · v1 · pith:3NCPVYDQnew · submitted 2026-05-18 · 💻 cs.LG

Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

Pith reviewed 2026-05-20 12:07 UTC · model grok-4.3

classification 💻 cs.LG
keywords synthetic data · healthcare · tabular datasets · large language models · data generation · evaluation metrics · privacy · fairness

The pith

Memisis uses a language model agent to orchestrate synthetic data generation and evaluation for tabular health datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Memisis as a tool that brings together existing synthetic data generators, large language models, and evaluation metrics into one workflow for health data. It aims to let users describe what kind of synthetic data they need rather than adjusting technical settings themselves. This matters because creating good synthetic data that protects privacy while keeping useful patterns is hard, especially for medical research and decisions. The demonstration uses a schizophrenia dataset and shows similar results from three different generators.

Core claim

Memisis orchestrates and evaluates synthetic data for tabular health datasets by using an interactive agent driven by a local large language model. Users express their goals for the synthetic data, and the agent handles selecting among synthesizers like CTGAN, TVAE, and GaussianCopula, setting parameters such as training size and epochs, generating the data, and running evaluations for privacy, utility, and fairness. This creates a unified process instead of separate steps for generation and checking.

What carries the argument

An interactive agent powered by a local language model that interprets user goals to select, configure, and evaluate synthetic data tools.
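
To make the orchestration idea concrete, here is a minimal sketch of how a free-text goal could be turned into a run plan. It is not the paper's implementation: the paper uses a local language model, whereas this stand-in uses keyword rules. `SynthesisPlan`, `plan_from_goal`, and the routing keywords are all hypothetical names introduced for illustration. Only the three synthesizer names and the three user-controlled settings come from the paper.

```python
from dataclasses import dataclass

# Synthesizers named in the paper; everything else in this sketch is hypothetical.
SYNTHESIZERS = ("CTGAN", "TVAE", "GaussianCopula")


@dataclass(frozen=True)
class SynthesisPlan:
    synthesizer: str
    train_size: int     # rows of real data used for training
    epochs: int         # training epochs
    num_rows: int       # synthetic rows to sample
    evaluations: tuple  # evaluation families to run after generation


def plan_from_goal(goal: str, train_size: int, epochs: int, num_rows: int) -> SynthesisPlan:
    """Toy stand-in for the agent: map a free-text goal to a run plan."""
    if min(train_size, epochs, num_rows) <= 0:
        raise ValueError("train_size, epochs and num_rows must be positive")
    text = goal.lower()
    # Hypothetical routing rules, chosen only to make the sketch concrete.
    if "fast" in text or "quick" in text:
        synthesizer = "GaussianCopula"
    elif "tvae" in text:
        synthesizer = "TVAE"
    else:
        synthesizer = "CTGAN"
    return SynthesisPlan(synthesizer, train_size, epochs, num_rows,
                         ("privacy", "utility", "fairness"))
```

Keeping the plan as an explicit, immutable record is one way to make the agent's choice auditable: whatever interprets the goal, the selected synthesizer and settings can be logged and compared against what an expert would have picked.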

If this is right

  • Users specify goals for synthetic data instead of tuning individual parameters.
  • The system runs necessary evaluations for privacy, utility, and fairness automatically.
  • Control is retained over training size, epochs, and number of synthetic samples.
  • Comparable performance across synthesizers is observed in the schizophrenia dataset example.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Non-experts in data synthesis could more easily produce suitable synthetic health datasets for their needs.
  • Keeping the language model local helps avoid sending health-related instructions to external services.
  • Similar orchestration ideas might apply to synthetic data tasks in other fields like finance or social sciences.

Load-bearing premise

The local language model agent can accurately figure out the user's intentions and choose the correct synthesizers and settings without errors or added biases.

What would settle it

Running Memisis with a clear user goal on a test dataset and finding that the produced synthetic data scores much worse on utility or fairness than data made by directly using the synthesizers with expert settings.
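
The test described above could be scored with a check along these lines. This is a sketch under stated assumptions: every metric is scaled so that higher is better, the 0.05 margin is an arbitrary placeholder rather than a threshold from the paper, and `materially_worse` is a hypothetical helper name.

```python
def materially_worse(agent_scores, expert_scores, margin=0.05):
    """List the metrics where the agent-configured run trails the
    expert-configured run by more than `margin`.

    Assumes higher is better for every metric. Metrics present in only
    one of the two dictionaries are ignored.
    """
    shared = agent_scores.keys() & expert_scores.keys()
    return sorted(m for m in shared if expert_scores[m] - agent_scores[m] > margin)
```

An empty result would mean the agent-configured run stayed within the margin on every shared metric. A non-empty result names the metrics on which the load-bearing premise fails for that goal and dataset.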

Figures

Figures reproduced from arXiv: 2605.17758 by Aadi Sharma, Amir M. Rahmani, Arshia Harish Puthran, Ian Harris, Mahdi Bagheri, Muhjaazee Love, Nitish Nagesh, Pengbao Zhou.

Figure 1: Memisis separates synthesis from evaluation so scoring cannot influence generation. The supervisor routes the user’s goal to a generator subgraph (CTGAN, TVAE, or GaussianCopula via SDV) and a separate evaluator subgraph (SDMetrics quality + Fairlearn FPR by group). After evaluation the supervisor compares the composite synth_score against repository-derived thresholds and either reports results or issues …
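
The caption's "FPR by group" refers to the false positive rate computed separately for each value of a protected attribute. The figure names Fairlearn for this step. The sketch below uses plain Python instead so that it runs without dependencies, and it illustrates the metric only, not the paper's evaluator. The function names are hypothetical, and labels are assumed to be binary with 1 as the positive class.

```python
from collections import defaultdict


def false_positive_rate_by_group(y_true, y_pred, groups):
    """FPR per group = FP / (FP + TN), over rows whose true label is 0."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:
            negatives[group] += 1
            if pred == 1:
                false_positives[group] += 1
    # Groups with no true negatives have an undefined FPR and are omitted.
    return {g: false_positives[g] / n for g, n in negatives.items()}


def fpr_gap(rates):
    """Largest difference in FPR between any two groups."""
    return max(rates.values()) - min(rates.values())
```

A gap near zero means a downstream classifier raises false alarms at similar rates across groups. How that gap feeds into the composite synth_score is not specified in the caption, so it is left out here.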
Figure 2: Memisis deployment stack. Users (researchers, data owners) interact via Streamlit. All interfaces share one FastAPI service. LangGraph agents (ReACT and multi-agent supervisor) sit alongside the different metrics. Llama3.2 is the Large Language Model (LLM) under consideration. Model checkpoints are stored as appropriate.
… fidelity (0.91) does not translate to a superior composite score when the downstream c…
Original abstract

Synthetic data is widely used in healthcare to create datasets that are similar to original data but without the privacy concerns. Generating and evaluating synthetic data across privacy, utility and fairness is crucial for facilitating high quality data availability for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by leveraging existing synthetic data tools, the power of large language models and state-of-the-art evaluation metrics. Our tool creates a unified workflow for data generation, validation and evaluation. Users have control over the training size, training epochs and the number of synthetic rows to sample. Instead of knobs to tune synthetic data, the interactive agent allows users to specify their synthetic data generation goals and the tool will orchestrate the workflow by leveraging existing tools while performing the requisite evaluation. For the demo, we use an open source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE and GaussianCopula have comparable performance across fairness and utility metrics. The workflow allows users flexibility and control over the data generation and evaluation process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Memisis, a tool that orchestrates and evaluates synthetic data for tabular health datasets. It integrates existing synthesizers (CTGAN, TVAE, GaussianCopula), a local LLM-powered interactive agent to interpret user-specified goals, and state-of-the-art metrics for privacy, utility, and fairness. Users control training size, epochs, and synthetic row count. A single demo on an open schizophrenia dataset with race and gender attributes reports comparable performance across the three synthesizers on fairness and utility metrics, claiming a unified workflow without manual knob tuning.

Significance. If the LLM agent's goal interpretation and synthesizer selection prove reliable without introducing unmeasured biases, Memisis would offer a practical, accessible framework for synthetic health data generation that lowers barriers for downstream clinical tasks. The explicit leverage of pre-existing open tools and metrics is a strength that supports reproducibility and reduces reinvention.

major comments (2)
  1. [Abstract] Abstract: the central claim that the interactive agent 'accurately interpret[s] user goals and reliably select[s] and configure[s] existing synthesizers' without new biases rests on an untested assumption; no accuracy, consistency, or bias metrics for the agent itself, nor any comparison against expert or exhaustive baselines, are reported.
  2. [Demo] Demo description: the observation that CTGAN, TVAE, and GaussianCopula 'have comparable performance across fairness and utility metrics' is presented as a single high-level result without statistical tests, error analysis, multiple runs, or validation beyond the observation, limiting support for the evaluation component of the unified workflow.
minor comments (2)
  1. The manuscript would benefit from an explicit description of the prompt templates or decision logic used by the local LLM agent to map user goals to synthesizer configurations.
  2. Clarify the exact privacy metrics employed and how they are computed within the evaluation pipeline, as this is central to health-data applications.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recognition of Memisis as a practical framework. We address each major comment below with honest revisions where the manuscript requires strengthening.

  1. Referee: [Abstract] Abstract: the central claim that the interactive agent 'accurately interpret[s] user goals and reliably select[s] and configure[s] existing synthesizers' without new biases rests on an untested assumption; no accuracy, consistency, or bias metrics for the agent itself, nor any comparison against expert or exhaustive baselines, are reported.

    Authors: We agree that the manuscript does not report quantitative metrics for the LLM agent's accuracy, consistency, or potential biases in goal interpretation and synthesizer selection. The presented work emphasizes the orchestration workflow and a single-dataset demonstration rather than a dedicated agent evaluation study. We will revise the abstract to qualify or remove the phrasing implying reliable selection without new biases. We will also add explicit discussion of the agent as an interface layer whose performance is not yet benchmarked, along with a clear statement of this limitation and planned future comparisons to expert baselines. These changes will appear in the revised manuscript.

    Revision: yes

  2. Referee: [Demo] Demo description: the observation that CTGAN, TVAE, and GaussianCopula 'have comparable performance across fairness and utility metrics' is presented as a single high-level result without statistical tests, error analysis, multiple runs, or validation beyond the observation, limiting support for the evaluation component of the unified workflow.

    Authors: We acknowledge that the demo presents a single high-level observation without statistical tests, error analysis, or multiple runs. The section was intended to illustrate the end-to-end workflow on an open schizophrenia dataset rather than to serve as a comprehensive benchmark. We will revise the demo section to include results aggregated over multiple independent runs, report means and standard deviations for the fairness and utility metrics, and add basic statistical comparisons (e.g., paired tests where applicable) to provide stronger support for the evaluation claims. These enhancements will be incorporated in the next version.

    Revision: yes
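
The aggregation promised in the second response can be sketched with the standard library alone. This is an illustration of the general technique, not the authors' analysis code. It assumes each run of one synthesizer is paired with a run of another under the same seed or data split, and the helper names are hypothetical. It returns only the paired t statistic; turning that into a p-value needs the t distribution with n − 1 degrees of freedom, which is not computed here.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation over independent runs."""
    return mean(scores), stdev(scores)


def paired_t_statistic(a, b):
    """Paired-samples t statistic: mean(d) / (stdev(d) / sqrt(n)),
    where d holds the per-run differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))
```

For example, `summarize` would be applied to one metric's scores for a single synthesizer across runs, and `paired_t_statistic` to the same metric for two synthesizers whose runs share seeds.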

Circularity Check

0 steps flagged

No circularity; tool composes external components without self-referential reductions.


The manuscript presents Memisis as an orchestration layer over pre-existing synthesizers (CTGAN, TVAE, GaussianCopula), a local LLM agent, and standard privacy/utility/fairness metrics. No equations, fitted parameters renamed as predictions, or derivation chains appear. The single demo observation of comparable metrics on one dataset is an empirical report, not a quantity forced by construction or by self-citation. The central workflow claim is a composition of independent open-source tools and does not reduce to any input defined inside the paper itself.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 1 invented entity

The contribution centers on a new orchestration layer rather than new mathematics or data; it depends on user-specified controls and the effectiveness of prior synthesizers without introducing fitted constants or new entities with independent evidence.

free parameters (3)
  • training size
    User-controlled amount of real data used to train the synthesizers.
  • training epochs
    User-controlled number of training iterations for the generators.
  • number of synthetic rows
    User-controlled quantity of output synthetic samples.
axioms (1)
  • Domain assumption: existing synthesizers (CTGAN, TVAE, GaussianCopula) and standard fairness/utility metrics are appropriate for protected health attributes such as race and gender.
    Invoked when claiming comparable performance in the demo without new validation of these components.
invented entities (1)
  • Memisis interactive agent (no independent evidence)
    purpose: To translate natural-language user goals into synthesizer selection and workflow execution via LLM.
    New component presented as the core of the orchestration system.



