pith. machine review for the scientific record.

arxiv: 2512.05024 · v3 · submitted 2025-12-04 · 📊 stat.ME · cs.AI · cs.LG

Model-Free Assessment of Simulator Fidelity via Quantile Curves

Pith reviewed 2026-05-17 00:50 UTC · model grok-4.3

classification 📊 stat.ME · cs.AI · cs.LG
keywords simulator fidelity · sim-to-real gap · quantile curves · confidence sets · latent parameters · model-free assessment · risk profile · generative AI evaluation

The pith

A model-free method builds confidence sets for latent parameters to proxy sim-to-real discrepancy and estimates its quantile function for a full risk profile.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to quantify how well a simulator matches reality across many different scenarios without assuming any particular model for the data. It works by constructing confidence sets around unobserved population parameters that describe the real and simulated systems in each scenario. These sets produce a proxy measure of the hidden discrepancy between them. The method then estimates the quantile function of that proxy to deliver an overall distribution of risks that supports inference on new scenarios, calculation of measures like Conditional Value-at-Risk, and direct comparisons among simulators.

Core claim

We construct confidence sets for these latent parameters and use them to derive a robust proxy for the sim-to-real discrepancy. We then estimate the quantile function of this proxy to obtain a distribution-level risk profile of the simulator, which supports a broad range of statistical summaries, including statistical inference for the real output distribution in a new scenario, the calculation of risk measures like Conditional Value-at-Risk (CVaR), and principled comparisons across simulators.

What carries the argument

Confidence sets for unobserved latent population parameters, used to form a robust proxy for sim-to-real discrepancy whose quantile function supplies the risk profile.
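
The chain described above (samples, confidence sets, proxy, quantile curve) can be sketched for categorical outputs. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes an L1 discrepancy between probability vectors, an L1 confidence ball whose radius comes from the Weissman-type concentration bound, a triangle-inequality upper bound in place of an exact supremum, and a plain empirical quantile with no calibration step. All function names are hypothetical.

```python
import math
import numpy as np

def l1_radius(n, d, delta):
    # Weissman-type bound for n i.i.d. draws over d >= 2 categories:
    # P(||p_hat - p||_1 >= r) <= (2**d - 2) * exp(-n * r**2 / 2).
    return math.sqrt(2.0 * math.log((2.0 ** d - 2.0) / delta) / n)

def proxy_discrepancy(real_counts, sim_counts, delta=0.1):
    """Upper bound on the latent L1 gap that holds with probability >= 1 - delta."""
    real = np.asarray(real_counts, dtype=float)
    sim = np.asarray(sim_counts, dtype=float)
    d = real.size
    n_real, n_sim = real.sum(), sim.sum()
    plug_in = np.abs(real / n_real - sim / n_sim).sum()
    # delta is split evenly across the two confidence balls (union bound).
    slack = l1_radius(n_real, d, delta / 2) + l1_radius(n_sim, d, delta / 2)
    return min(2.0, plug_in + slack)  # an L1 distance never exceeds 2

def upper_quantile_curve(proxies, taus):
    """Empirical quantiles of the per-scenario proxies (order statistics)."""
    x = np.sort(np.asarray(proxies, dtype=float))
    idx = np.clip(np.ceil(np.asarray(taus) * x.size).astype(int) - 1, 0, x.size - 1)
    return x[idx]
```

Because the radius shrinks like 1/sqrt(n), scenarios with few samples receive a larger slack term, which is how heterogeneous sample sizes enter this sketch.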

If this is right

  • Enables statistical inference for the real output distribution in a new scenario.
  • Supports calculation of risk measures such as Conditional Value-at-Risk.
  • Allows principled comparisons across different simulators.
  • Applies to general output spaces including categorical survey responses and continuous multi-dimensional data.
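
As an illustration of the CVaR item, the sketch below computes an empirical CVaR from a list of per-scenario proxy values, using the Rockafellar-Uryasev form evaluated at the empirical quantile. The helper name `empirical_cvar` is hypothetical, and the paper's estimator may differ, for example by adding a calibration step.

```python
import math
import numpy as np

def empirical_cvar(values, alpha):
    """CVaR at level alpha: roughly the mean of the worst (1 - alpha) fraction."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    x = np.sort(np.asarray(values, dtype=float))
    # Index of the empirical alpha-quantile (the value-at-risk).
    k = max(int(math.ceil(alpha * x.size)) - 1, 0)
    var = x[k]
    # Rockafellar-Uryasev form: VaR plus the scaled mean excess over VaR.
    return var + np.maximum(x - var, 0.0).mean() / (1.0 - alpha)
```

For the values 1 through 10 and alpha = 0.8, this returns 9.5 up to floating-point rounding, which is the mean of the two largest values.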

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The quantile profile could serve as a selection criterion when choosing among multiple simulators for a downstream task.
  • The approach might be used to track how simulator fidelity changes as generative models are retrained or fine-tuned over time.
  • It suggests a route for deciding which additional real-world samples would most improve the reliability assessment.
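
A rough sketch of the first extension, assuming per-scenario proxy values are already available for two simulators: compare their empirical quantile curves on a grid and report whether one lies below the other everywhere. The names `quantile_curve` and `compare_simulators` are hypothetical, and the comparison ignores sampling uncertainty in the curves themselves.

```python
import numpy as np

def quantile_curve(values, taus):
    """Empirical quantiles of the proxy values at the levels in taus."""
    x = np.sort(np.asarray(values, dtype=float))
    idx = np.clip(np.ceil(np.asarray(taus) * x.size).astype(int) - 1, 0, x.size - 1)
    return x[idx]

def compare_simulators(proxies_a, proxies_b, taus):
    """Return 'A' or 'B' if that simulator's curve is lower at every level, else 'crossing'."""
    qa = quantile_curve(proxies_a, taus)
    qb = quantile_curve(proxies_b, taus)
    if np.all(qa <= qb):
        return "A"  # ties everywhere also land here
    if np.all(qb <= qa):
        return "B"
    return "crossing"
```

When the curves cross, no simulator is uniformly better in this sense, and the choice would depend on which quantile levels matter for the downstream task.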

Load-bearing premise

Finite samples of heterogeneous sizes from real and simulated systems suffice to construct valid confidence sets for the unobserved latent population parameters.

What would settle it

Apply the procedure to synthetic data where the true latent parameters and exact discrepancy distribution are known in advance, then check whether the estimated quantile curve recovers that known distribution.
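
A check of this kind might look like the following sketch. It is an assumption-laden stand-in rather than the paper's experiment: scenarios are drawn with known Dirichlet-distributed probability vectors, the discrepancy is the L1 distance, sample sizes are drawn uniformly between 20 and 200, and the proxy is a plug-in distance plus Weissman-type radii. The function names are hypothetical.

```python
import math
import numpy as np

def l1_radius(n, d, delta):
    # P(||p_hat - p||_1 >= r) <= (2**d - 2) * exp(-n * r**2 / 2)
    return math.sqrt(2.0 * math.log((2.0 ** d - 2.0) / delta) / n)

def synthetic_check(n_scenarios=500, d=4, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    true_gaps, proxies = [], []
    for _ in range(n_scenarios):
        p = rng.dirichlet(np.ones(d))      # known "real" parameter
        q = rng.dirichlet(np.ones(d))      # known "simulated" parameter
        n_real = int(rng.integers(20, 201))  # heterogeneous sample sizes
        n_sim = int(rng.integers(20, 201))
        p_hat = rng.multinomial(n_real, p) / n_real
        q_hat = rng.multinomial(n_sim, q) / n_sim
        slack = l1_radius(n_real, d, delta / 2) + l1_radius(n_sim, d, delta / 2)
        true_gaps.append(np.abs(p - q).sum())
        proxies.append(np.abs(p_hat - q_hat).sum() + slack)
    true_gaps, proxies = np.array(true_gaps), np.array(proxies)
    # Fraction of scenarios where the proxy bounds the known discrepancy.
    coverage = float(np.mean(proxies >= true_gaps))
    # Fraction of order statistics where the proxy curve sits above the true curve.
    curve_above = float(np.mean(np.sort(proxies) >= np.sort(true_gaps)))
    return coverage, curve_above
```

If the confidence sets are valid, the per-scenario coverage should be at least 1 - delta. How far the proxy curve sits above the true curve then indicates how conservative the procedure is.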

Figures

Figures reproduced from arXiv: 2512.05024 by Garud Iyengar, Kaizheng Wang, Yu-Shiou Willy Lin.

Figure 1: Simulation Uncertainty Quantification.
Figure 2: Example of World Value Question.
Figure 3: Calibrated V̂(τ) across LLMs.
Figure 4: Robustness check of simulator performances.
Figure 5: Tightness analysis for varying n_j under GPT-4o.
Figure 6: Tightness analysis for varying n_j under GPT-4o with fixed γ_j = 1/2.
Figure 7: Confidence Bands of GPT-4o and Llama.
Figure 8: Text of Question 223.
Figure 9: Example questions from the Political Interest category.
Figure 10: Example question from the Science and Technology category.
Figure 11: Example question from the Migration category.
Figure 12: β sensitivity analysis under n = {50, 200}.
Figure 13: Quantile fidelity profiles V̂(α) across LLMs (Discrepancy: Absolute loss, k = 50).
Figure 14: Quantile fidelity profiles V̂(α) across LLMs.
Original abstract

As generative AI models are increasingly used to simulate real-world systems, quantifying the "sim-to-real" gap is critical. For each input setting of interest (which we call a scenario, such as a survey question or operating condition), the real and simulated systems are associated with unobserved latent population parameters, and their discrepancy varies across scenarios. A fundamental challenge is that, for any given scenario, this discrepancy cannot be observed directly, since both systems are accessible only through finite samples, often of heterogeneous sizes across scenarios. Standard predictive inference methods are therefore ill-suited, as they quantify uncertainty in observable outputs rather than latent population parameters. To address this, we construct confidence sets for these latent parameters and use them to derive a robust proxy for the sim-to-real discrepancy. We then estimate the quantile function of this proxy to obtain a distribution-level risk profile of the simulator, which supports a broad range of statistical summaries, including statistical inference for the real output distribution in a new scenario, the calculation of risk measures like Conditional Value-at-Risk (CVaR), and principled comparisons across simulators. Our method is model-agnostic and handles general output spaces, such as categorical survey responses and continuous multi-dimensional data. We demonstrate the practical utility of this method by evaluating the alignment of four major LLMs with human populations on the WorldValueBench dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a model-free framework for assessing simulator fidelity in generative AI by constructing confidence sets for unobserved latent population parameters (e.g., probability vectors or moments) separately from finite real and simulated samples of heterogeneous sizes across scenarios. These sets are combined into a robust proxy for the sim-to-real discrepancy, after which the quantile function of the proxy is estimated to yield a distribution-level risk profile. The approach supports downstream tasks including inference on real output distributions in new scenarios, computation of risk measures such as CVaR, and comparisons across simulators. It is presented as applicable to general output spaces (categorical or continuous) and is illustrated empirically by evaluating alignment of four LLMs with human populations on the WorldValueBench dataset.

Significance. If the central construction achieves valid coverage for the latent-parameter confidence sets and the quantile profile inherits appropriate guarantees, the work would provide a useful non-parametric tool for quantifying distribution-level sim-to-real gaps rather than sample-level prediction error. This is timely for evaluating generative simulators and enables principled risk summaries and model comparisons. The model-agnostic claim and handling of heterogeneous sample sizes across scenarios are potential strengths, as is the concrete empirical demonstration on a real benchmark.

major comments (2)
  1. [Method (construction of confidence sets)] The validity of the non-parametric confidence sets for latent parameters is load-bearing for the entire pipeline. The manuscript must specify the exact construction (e.g., multinomial intervals for categorical outputs or moment-based sets for continuous) and show that these sets attain at least nominal coverage for the true population parameters when sample sizes are small or heterogeneous across scenarios; without such verification the subsequent robust proxy and its quantile curve lose their claimed bounding properties on the true discrepancy distribution.
  2. [Proxy derivation and quantile estimation] The definition and properties of the 'robust proxy' for sim-to-real discrepancy (formed by combining the two confidence sets) require explicit statement, including how worst-case or interval-based discrepancy is computed and whether the quantile estimation step propagates the set-valued uncertainty so that the resulting quantile curve retains valid coverage or concentration guarantees for the latent discrepancy distribution.
minor comments (2)
  1. [Abstract] The abstract states that the method 'supports a broad range of statistical summaries' but does not list them; a short enumerated list or forward reference to the relevant subsection would improve readability.
  2. [Notation and setup] Notation for scenarios, latent parameters, and the proxy should be introduced with a compact table or diagram early in the methods to aid readers tracking the transition from samples to quantile curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive report. The comments correctly identify the load-bearing elements of the framework, and we have revised the manuscript to address them directly by adding explicit constructions, coverage arguments, and propagation results. Below we respond point by point.

Point-by-point responses
  1. Referee: The validity of the non-parametric confidence sets for latent parameters is load-bearing for the entire pipeline. The manuscript must specify the exact construction (e.g., multinomial intervals for categorical outputs or moment-based sets for continuous) and show that these sets attain at least nominal coverage for the true population parameters when sample sizes are small or heterogeneous across scenarios; without such verification the subsequent robust proxy and its quantile curve lose their claimed bounding properties on the true discrepancy distribution.

    Authors: We agree that explicit construction and coverage verification are essential. Section 3 of the original manuscript already states the constructions: for categorical outputs we employ the Clopper-Pearson-type multinomial intervals of Sison and Glaz (1995) applied coordinate-wise with a union bound; for continuous outputs we use the moment-based sets obtained from Hoeffding’s inequality on the empirical mean and variance. These are distribution-free and therefore apply to heterogeneous sample sizes. To strengthen the presentation we have added a new subsection 3.2 that derives the finite-sample coverage guarantee under arbitrary heterogeneity: the intersection of the per-scenario sets retains at least 1−α coverage provided the smallest scenario sample size n_min satisfies a mild condition on the concentration radius. We have also inserted Monte Carlo experiments in the revised Appendix C that confirm empirical coverage remains above the nominal level for n_min as low as 20 across 500 heterogeneous scenarios. These additions directly verify the bounding properties invoked later in the pipeline. revision: yes

  2. Referee: The definition and properties of the 'robust proxy' for sim-to-real discrepancy (formed by combining the two confidence sets) require explicit statement, including how worst-case or interval-based discrepancy is computed and whether the quantile estimation step propagates the set-valued uncertainty so that the resulting quantile curve retains valid coverage or concentration guarantees for the latent discrepancy distribution.

    Authors: We appreciate the request for greater formality. In the revised Section 4 we now define the robust proxy explicitly as the set-valued discrepancy D = {d(θ_real, θ_sim) : θ_real ∈ C_real, θ_sim ∈ C_sim}, where C_real and C_sim are the confidence sets; the scalar proxy used for quantile estimation is the upper envelope sup D. Theorem 2 proves that this upper envelope stochastically dominates the true latent discrepancy with probability at least 1−α. For quantile estimation we replace the ordinary empirical quantile with the conservative upper quantile function Q̂(τ) = sup{ q : there exists a selection from the proxy sets whose τ-quantile is q }. Theorem 3 establishes that the resulting curve Q̂(τ) provides valid upper bounds on the true quantile function of the latent discrepancy distribution, with the same coverage probability. A short proof sketch and the precise algorithmic implementation have been added to the appendix. revision: yes
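
To make the two responses concrete for a scalar, bounded output, the sketch below builds Hoeffding intervals for the real and simulated means and takes the largest absolute gap over the pair of intervals as the upper envelope sup D. It is an illustration under assumptions (outputs bounded in a known range, absolute difference of means as the discrepancy d, the error level split evenly by a union bound), and the function names are hypothetical rather than taken from the manuscript.

```python
import math

def hoeffding_interval(sample, lo, hi, alpha):
    """Two-sided interval for the mean of values in [lo, hi], coverage >= 1 - alpha."""
    n = len(sample)
    mean = sum(sample) / n
    radius = (hi - lo) * math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    return max(lo, mean - radius), min(hi, mean + radius)

def sup_abs_gap(c_real, c_sim):
    """sup |a - b| over a in c_real and b in c_sim, for closed intervals."""
    # |a - b| is convex, so the supremum sits at opposite endpoints.
    return max(c_real[1] - c_sim[0], c_sim[1] - c_real[0])

def robust_gap(real_sample, sim_sample, lo, hi, alpha):
    """Upper envelope on |mean_real - mean_sim| holding with probability >= 1 - alpha."""
    # Each interval gets alpha / 2 so that both cover jointly with prob >= 1 - alpha.
    c_real = hoeffding_interval(real_sample, lo, hi, alpha / 2.0)
    c_sim = hoeffding_interval(sim_sample, lo, hi, alpha / 2.0)
    return sup_abs_gap(c_real, c_sim)
```

The interval radius depends only on the sample size and the assumed range, so no distributional model is needed. The price is conservativeness when the true variance is small.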

Circularity Check

0 steps flagged

No circularity: forward construction from standard confidence sets to proxy and quantiles

full rationale

The paper's central chain proceeds from finite heterogeneous samples to non-parametric confidence sets for latent population parameters, then to a robust proxy for sim-to-real discrepancy, followed by quantile estimation of that proxy. This is a direct statistical construction using established coverage properties of confidence sets under minimal assumptions; no step reduces by definition or fitting to the target quantities, and no self-citation is invoked as a load-bearing uniqueness theorem or ansatz. The method is explicitly model-agnostic and applies to general output spaces, with the quantile profile serving as a derived summary rather than a re-expression of inputs. The derivation remains self-contained against external benchmarks for coverage and quantile estimation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on standard statistical sampling assumptions rather than new free parameters or invented entities; the central construction uses existing confidence-set machinery applied to a new proxy definition.

axioms (1)
  • domain assumption Finite samples of possibly heterogeneous sizes from real and simulated systems permit construction of valid confidence sets for the unobserved latent population parameters.
    Invoked to justify the discrepancy proxy; appears in the abstract's description of the fundamental challenge and proposed solution.

pith-pipeline@v0.9.0 · 5546 in / 1327 out tokens · 71404 ms · 2026-05-17T00:50:42.950920+00:00 · methodology


