arxiv: 2606.20536 · v1 · pith:LA2DOKCHnew · submitted 2026-06-18 · 💻 cs.CV

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Pith reviewed 2026-06-26 17:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords FID, generative models, evaluation, variance, reproducibility, ImageNet, training seeds, classifier-free guidance

The pith

Retraining a generative model with a different seed shifts FID 3.2 times more than resampling from a fixed model with a different sampling seed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the FID score as a random variable across training seeds and generation seeds, then measures its variance by training several hundred SiT networks on class-conditional ImageNet 256x256. It finds that the training-seed component dominates, driven by random initialization, data ordering, and the Gaussian noise in the flow-matching loss. Increasing model size or compute does not shrink the relative spread, which stays inside a 1-2 percent coefficient of variation. Per-cell tuning of classifier-free guidance halves the spread but reorders which seeds perform best. The authors therefore recommend evaluating under per-cell optimal guidance, treating gaps below the observed 1.3 percent CoV as inconclusive, and reporting error bars over multiple training seeds.

Core claim

On a two-axis panel of training and generation seeds, retraining the model moves FID 3.2 times farther in Inception feature space than redrawing samples from a fixed network. The gap is produced by random initialization, data ordering, and per-step Gaussian noise. The coefficient of variation stays inside 1-2 percent even when compute or model size grows. Per-cell guidance tuning halves the spread while reshuffling rankings, and a lucky seed reaches a target FID with up to 2 times less compute than an unlucky one.

What carries the argument

Two-axis panel of training seeds and generation seeds on which FID is treated as a random variable and measured directly across hundreds of trained networks.
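
A minimal sketch of how such a panel could be summarized, assuming the FID values are stored as one row per training seed and one column per sampling seed. The function name `decompose_panel`, the toy numbers, and the choice of estimators (sample standard deviation of per-seed means for the between-seed spread, pooled per-row variance for the within-seed spread) are illustrative assumptions, not the paper's stated procedure.

```python
import math
import statistics


def decompose_panel(panel):
    """Summarize a panel where panel[i][j] is the FID of training seed i
    evaluated with sampling seed j (hypothetical layout)."""
    seed_means = [statistics.fmean(row) for row in panel]
    # Between-seed spread: sample std of the per-training-seed means.
    # Note: this still contains a small sampling-noise term (sigma_within^2 / K).
    sigma_between = statistics.stdev(seed_means)
    # Within-seed spread: square root of the average per-row sample variance.
    sigma_within = math.sqrt(
        statistics.fmean(statistics.variance(row) for row in panel)
    )
    grand_mean = statistics.fmean(seed_means)
    return {
        "sigma_between": sigma_between,
        "sigma_within": sigma_within,
        "ratio": sigma_between / sigma_within,
        "cov": sigma_between / grand_mean,
    }


# Toy panel: 3 training seeds x 3 sampling seeds (made-up numbers).
toy_panel = [
    [34.0, 34.1, 33.9],
    [34.6, 34.5, 34.7],
    [35.1, 35.0, 35.2],
]
stats = decompose_panel(toy_panel)
print(stats)
```

On a real panel, the `ratio` entry would play the role of the 3.2x between-to-within figure and `cov` the role of the 1-2 percent coefficient of variation, under the estimator assumptions stated above.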

If this is right

  • Per-cell classifier-free-guidance tuning halves the observed FID spread.
  • Any reported FID gap smaller than the measured 1.3 percent coefficient of variation should be treated as inconclusive.
  • A lucky training seed can reach the same FID value with up to 2 times less compute than an unlucky seed.
  • Evaluations should report an error bar over several training seeds rather than a single FID number.
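
The last three recommendations can be sketched as two small helpers. The half-width formula is the normal-approximation expression quoted in the Figure 16 caption (CI95 = 1.96 · CoV · F / √N). Reading "gap below the CoV" as a relative gap measured against the midpoint of the two scores is an assumption made here for illustration, as are the function names and the example FID value of 34.6.

```python
import math


def seed_ci_half_width(mean_fid, cov, n_models):
    """95% half-width of a mean FID over n_models independent training
    seeds, using the normal approximation 1.96 * CoV * F / sqrt(N)."""
    return 1.96 * cov * mean_fid / math.sqrt(n_models)


def gap_is_inconclusive(fid_a, fid_b, cov=0.013):
    """True when the relative gap between two FID scores is smaller than
    the seed-induced CoV (assumed reading of the recommendation)."""
    reference = (fid_a + fid_b) / 2.0
    return abs(fid_a - fid_b) / reference < cov


# Hypothetical mean FID of 34.6 with the 1.26% unguided CoV from Figure 5.
half_width_1 = seed_ci_half_width(34.6, 0.0126, n_models=1)
half_width_5 = seed_ci_half_width(34.6, 0.0126, n_models=5)
print(round(half_width_1, 3), round(half_width_5, 3))
print(gap_is_inconclusive(34.0, 34.2), gap_is_inconclusive(34.0, 35.0))
```

Because the half-width shrinks with √N, moving from one to five training seeds narrows the interval by a factor of about 2.2 rather than 5.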

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current generative-model leaderboards that publish single FID numbers are likely to contain many inconclusive comparisons.
  • The same hidden training-seed variance may affect other popular metrics such as CLIP score or precision-recall.
  • Averaging FID over a small number of independent training runs could produce more stable model rankings at modest extra cost.
  • The finding raises the question whether similar seed-driven variance appears in non-diffusion generative models.
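
The third extension can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that each training run's FID is an independent Gaussian draw around a model's true FID with standard deviation CoV × FID; the two true FID values, the 1.3 percent CoV, and the function name `ranking_accuracy` are hypothetical choices, not results from the paper.

```python
import random


def ranking_accuracy(true_a, true_b, cov, n_runs, trials=20000, seed=0):
    """Fraction of simulated comparisons in which the mean FID over
    n_runs training seeds ranks two models in their true order."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        mean_a = sum(rng.gauss(true_a, cov * true_a) for _ in range(n_runs)) / n_runs
        mean_b = sum(rng.gauss(true_b, cov * true_b) for _ in range(n_runs)) / n_runs
        if (mean_a < mean_b) == (true_a < true_b):
            correct += 1
    return correct / trials


# Two hypothetical models whose true FIDs differ by about 1 percent.
acc_single = ranking_accuracy(34.65, 35.0, cov=0.013, n_runs=1)
acc_five = ranking_accuracy(34.65, 35.0, cov=0.013, n_runs=5)
print(acc_single, acc_five)
```

Under these assumptions, averaging over five runs recovers the true ordering noticeably more often than a single run does. How well the Gaussian model matches real seed-to-seed variation is an open question.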

Load-bearing premise

The variance patterns and driving factors observed on SiT networks trained on class-conditional ImageNet 256x256 generalize to other generative architectures, datasets, and training recipes.

What would settle it

Repeating the full two-axis experiment on a different architecture or dataset and obtaining a training-seed to sampling-seed variance ratio far from 3.2 would falsify the central quantitative claim.

Figures

Figures reproduced from arXiv: 2606.20536 by Alexei A. Efros, Nicolas Dufour, Patrick Pérez.

Figure 1: All sources of randomness behind a generative model. Training and then sampling a generative model is a chain of pseudo-random draws. They fall into two lotteries. The training lottery (left) is drawn from four sources: the random weight initialisation (1), the order in which examples are visited (2), the fresh Gaussian noise the flow-matching loss injects at every gradient step (3), and the bitwise non-deter…
Figure 2: The FID lottery in SiT-B/2 at 400k steps. Each violin is one of 25 independently trained SiT-B/2 models, sorted by per-seed mean Inception FID. Small dots are the 250 individual sampling-seed evaluations and short black ticks are per-seed means. The two highlighted markers pick out the single best (33.59) and worst (35.69) FID across the panel, a 2.10-point gap produced purely by changing seeds. Each violi…
Figure 3: Variance decomposition of the training-seed lottery (SiT-B/2, 400k, no CFG). (a) Per-seed mean Inception FID under the three single-source conditions plus the fully-stochastic baseline (vary all) and the same seed control. Each dot is one training seed, boxes show the 25/50/75 percentiles. (b) Between-seed σ (coral) versus within-seed sampling σ (sage) per condition. The four random-source conditions are o…
Figure 4: What does the FID spectrum look like? Each row is one scene rendered by a SiT-XL model whose Inception FID falls log-uniformly from 43 (left) to 3.6 (right), a 12× range: quality improves toward the right, as FID goes down. FID is defined at the distribution level. It’s a Fréchet distance between Gaussians fit to the Inception features of the reference distribution (the ImageNet dataset) and 50,000 generated im…
Figure 5: Per-cell guidance tuning halves the seed-induced FID spread, but reshuffles which seeds rank best. Per-(training, sampling)-seed golden-section CFG search (GS-FID) across 25 SiT-B/2 training seeds (400k steps, 10 sampling seeds per cell). (a) Per-seed violins of guided Inception FID, sorted by per-seed mean. The relative spread tightens to CoV = 0.67%, about half the 1.26% measured unguided on the same pan…
Figure 6: The seed lottery across compute and model size. (a) Inception FID over training: thin pastel lines are individual training-seed trajectories, bold lines are per-step means. The spread between seeds stays wide at every checkpoint and does not shrink as training converges. (b) Coefficient of variation σ/µ over training: all four models stay near a 1–2% band. Bigger models do not yield proportionally tighter …
Figure 7: The luck of the draw: a 1.2–2.0× convergence gap. For each model the dashed horizontal line marks the target T, the FID reached by the unluckiest of ∼20 seeds at 2M. The green dot is the step at which the luckiest seed first crosses T. The coral dot sits at 2M where the unlucky seed finally reaches it. The amber band between them is the training compute the unlucky seed wastes catching up. The per-step sha…
Figure 8: µP-coordinated LR sweep at 100k for SiT-S/B/L/XL. Solid lines are the per-LR mean Inception FID across 10 training seeds, the shaded envelope is the per-LR seed min–max, and the open ring on each curve circles each size’s best-mean-FID dot. (a) Unguided FID is monotone in LR for every size, so the highlighted dot sits at 5×10^−4 (the edge of training stability) for all four. (b) GS-FID has flat-bottomed val…
Figure 9: The FID lottery, drawn as two slot machines. A casino-themed rendering of the same two lotteries diagrammed from first principles in …
Figure 10: (a) Vary all (σ_between = 0.438). All three randomness sources are free. The panel is the same data as …
Figure 11: (b) Vary noise (σ_between = 0.336). Init and data order are fixed. Only the per-step Gaussian noise of the flow-matching loss varies between training runs. Noise alone reproduces ≈77% of the baseline between-seed σ of (a). [Plot residue: panel (c) vary init; axes are Inception FID versus training seed, sorted by mean Inception FID.]
Figure 12: (c) Vary init (σ_between = 0.294). Data order and noise are fixed. Only the parameter initialisation varies. Init alone reproduces ≈67% of the baseline between-seed σ of (a). … scale), and it tells us the generation lottery is set by the converged SiT-B/2 weights, not by which init or which data pairing produced them. In aggregate, an init looks like a constant offset. The aggregate picture is misleading. Co…
Figure 13: (d) Vary data (σ_between = 0.221). Init and noise are fixed. Only the data-loader order varies. Data order alone reproduces ≈51% of the baseline between-seed σ of (a). All four panels share the same y-range and a consistent layout: every violin is one training seed showing the Gaussian KDE of its 10 sampling-seed evaluations, the small dots are individual evaluations, and the black tick is the per-seed mea…
Figure 14: Rank stability of the 25 training seeds (SiT-B/2, 400k). (a) Bump chart: each line is one training seed traced through three ranking criteria: mean, min, and max FID across its 10 sampling seeds. Crossings dominate the picture: the best seed by mean is rarely the best by min or max. Coral and teal highlight the seeds that are best- and worst-by-mean to make their rank trajectories visible. (b) Spearman ρ…
Figure 15: Seed optimality: are “good” init seeds universal? (SiT-B/2, 400k, no CFG.) (a) Heatmap of mean Inception FID over a 10×15 grid of init seeds (rows) and (data, noise) pairings (columns). Cells span ∼33.9–35.9. A single init does not consistently shade greenest across rows. (b) Bump chart of the same data: each line is one init seed traced through its rank within each (data, noise) pair. Heavy crossings ma…
Figure 16: Seed-induced 95% confidence interval as a function of the reported Inception FID. For a mean FID computed from N independently trained models with K = 10 sampling seeds each, the normal-approximation half-width is CI95 = 1.96 · CoV · F / √N, where CoV = σ_between/µ is the scale-invariant noise floor reported for Inception FID across the 76 (model, step) cells of the scaling sweep (…
Figure 17: Golden-section search on FID(ω) (referenced briefly from Sec. 4.3). (a) Algorithm 1: pseudocode for the bracket-contraction loop (the bracket-contraction sequence itself is illustrated in …
Figure 18: … (b) Two interior probes x1 = b − ρ(b − a) and x2 = a + ρ(b − a) with ρ = 1/φ ≈ 0.618 split the bracket [a, b]. The side with the larger f-value is discarded. [Plot residue: panel "Bracket contraction across iterations"; axes are guidance scale ω versus iteration n; annotations mark the reused probe ω⋆ and L_{n+1} = ρ L_n.]
Figure 19: The four PRDC metrics on a 2D toy. Marker convention: circles (•) are real points and triangles (▲) are generated points. The larger marker in each panel is the entity the metric tests. In panels (a, c, d) the metric is a per-point indicator, drawn green (• / ▲ = counted / covered) or red and hollow (◦ / △ = not counted / uncovered). Lavender (▲) marks generated points whose coverage status is not relevan…
Figure 20: Seed lottery on the same SiT-B/2 panel under DINOv2 FID. Each violin is one of 25 trained models, sorted left-to-right by per-seed mean. Violin shape traces the within-seed sampling distribution. The black tick is the per-seed mean. The training-seed mean ticks span 694.8 to 718.8 around grand mean 708.6. The between-to-within ratio is σ_between/σ_within = 4.82×, against 3.19× for Inception FID, so the stai…
Figure 21: Seed lottery under Inception Precision. Y-axis values are multiplied by 100 for readability. Per-seed means span 0.480 to 0.491 around grand mean 0.485. Violin height (within-seed sampling spread) and the staircase of mean ticks (between-seed training spread) are comparable, since σ_between/σ_within = 1.14×. A multi-sampling-seed CI therefore covers roughly half of the seed-induced envelope on precision, in…
Figure 22: Seed lottery under Inception Recall (the inversion case). Y-axis values are multiplied by 100 for readability. Per-seed means span 0.312 to 0.317 around grand mean 0.314. Each violin is taller than the staircase of mean ticks: σ_between/σ_within = 0.28×, an inversion of the FID asymmetry. On Inception recall the right CI to report on a fixed trained model is the sampling-only one. This is the opposite recom…
Figure 23: Coefficient of variation across compute and scale, for six metrics. Each panel plots between-seed CoV = σ/µ versus training step on the variance-over-training panel for SiT-S/B/L/XL with the clean seed sets of …
Figure 24: Rank stability across compute, for six metrics. Each panel plots Spearman ρ between the seed ranking at training step t and at 2M, one curve per model size, mirroring the layout of Figure 6c. Dashed horizontal lines mark ρ = 0.8 (the high-stability target) and ρ = 0 (random). Inception FID, DINOv2 FID and Inception density approach ρ ≈ 0.8 by ∼1.4M on most models. Inception precision and coverage stabili…
Figure 25: Guided FID gallery – golden retriever. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 26: Guided FID gallery – tabby cat. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 27: Guided FID gallery – macaw. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 28: Guided FID gallery – flamingo. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 29: Guided FID gallery – cheeseburger. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 30: Guided FID gallery – ice cream. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 31: Guided FID gallery – volcano. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 32: Guided FID gallery – alp. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 33: Guided FID gallery – geyser. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 34: Guided FID gallery – daisy. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 35: Unguided FID gallery – golden retriever. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 36: Unguided FID gallery – tabby cat. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 37: Unguided FID gallery – macaw. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 38: Unguided FID gallery – flamingo. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 39: Unguided FID gallery – cheeseburger. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 40: Unguided FID gallery – ice cream. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 41: Unguided FID gallery – volcano. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 42: Unguided FID gallery – alp. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 43: Unguided FID gallery – geyser. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
Figure 44: Unguided FID gallery – daisy. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.
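
The captions of Figures 17 and 18 describe a golden-section search over the guidance scale ω: two interior probes at x1 = b − ρ(b − a) and x2 = a + ρ(b − a) with ρ = 1/φ, discarding the side with the larger value. A minimal sketch of that bracket-contraction loop follows. The quadratic `toy_fid` stand-in, the bracket [1.0, 2.0], and the tolerance are hypothetical; in practice each call to `f` would be a full FID evaluation, and the paper's Algorithm 1 may differ in its details.

```python
import math

RHO = (math.sqrt(5) - 1) / 2  # 1/phi, approximately 0.618


def golden_section_min(f, a, b, tol=1e-3, max_iter=100):
    """Minimize a unimodal function f on [a, b] by golden-section search."""
    x1 = b - RHO * (b - a)
    x2 = a + RHO * (b - a)
    f1, f2 = f(x1), f(x2)
    for _ in range(max_iter):
        if b - a < tol:
            break
        if f1 > f2:
            # Left side has the larger value: keep [x1, b].
            # The old x2 becomes the new x1, so only one new evaluation is needed.
            a, x1, f1 = x1, x2, f2
            x2 = a + RHO * (b - a)
            f2 = f(x2)
        else:
            # Right side has the larger value: keep [a, x2].
            # The old x1 becomes the new x2.
            b, x2, f2 = x2, x1, f1
            x1 = b - RHO * (b - a)
            f1 = f(x1)
    return (a + b) / 2


def toy_fid(omega):
    """Hypothetical smooth FID(omega) curve with its minimum at omega = 1.4."""
    return 30.0 + 4.0 * (omega - 1.4) ** 2


best_omega = golden_section_min(toy_fid, 1.0, 2.0)
print(best_omega)
```

Reusing one probe per iteration matters because every evaluation of FID(ω) requires generating and scoring a full sample set. The bracket shrinks by the factor ρ each step, matching the L_{n+1} = ρ L_n annotation in the Figure 18 residue.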

Original abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper treats FID as a random variable over training and sampling seeds and measures its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. It reports that training-seed variance is 3.2x larger than sampling-seed variance in Inception space, identifies three driving factors (random initialization, data ordering, per-step Gaussian noise), finds that the coefficient of variation remains inside a 1-2% band even with increased compute or model size, shows that per-cell classifier-free-guidance tuning halves the spread, and recommends a new evaluation protocol: per-cell optimal guidance, treat gaps below ~1.3% CoV as inconclusive, and report error bars over training seeds rather than a single FID.

Significance. If the empirical variance measurements hold, the work provides direct evidence that single FID numbers are unreliable and that training randomness dominates sampling randomness. The scale of the experiment (hundreds of models) and the use of direct empirical computations without fitted parameters or self-referential definitions are strengths. The protocol recommendations would meaningfully change evaluation standards in generative modeling if the observed variance structure generalizes.

major comments (2)
  1. [Abstract and experimental results] The central quantitative claims (3.2x ratio, 1-2% CoV) and the three protocol recommendations are derived exclusively from SiT networks on class-conditional ImageNet 256x256; no results are shown for other architectures (GANs, diffusion, autoregressive), losses, or datasets. This makes the generalization premise for the recommendations unverified and load-bearing for the paper's broader impact.
  2. [Methods and appendix] Limited detail is provided on the statistical tests supporting the variance claims, the exact rules for model exclusion, and controls for confounding factors (e.g., training duration, hyperparameter stability). This leaves room for unstated post-hoc choices that could affect the reported ratios and CoV bounds.
minor comments (1)
  1. [Experimental setup] The manuscript would benefit from an explicit statement of the precise number of models, seeds, and samples per cell in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for acknowledging the scale of the experiments and the direct empirical approach. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract and experimental results] The central quantitative claims (3.2x ratio, 1-2% CoV) and the three protocol recommendations are derived exclusively from SiT networks on class-conditional ImageNet 256x256; no results are shown for other architectures (GANs, diffusion, autoregressive), losses, or datasets. This makes the generalization premise for the recommendations unverified and load-bearing for the paper's broader impact.

    Authors: We agree that all quantitative results and protocol recommendations are derived solely from SiT models on class-conditional ImageNet 256x256. The manuscript frames the work as a detailed empirical study on this benchmark rather than a universal claim. We will revise the abstract, introduction, and conclusion to explicitly limit the scope of the claims and recommendations to the studied setting, removing any implication of broader generalization. We cannot add results for other architectures or datasets without new experiments. revision: yes

  2. Referee: [Methods and appendix] Limited detail is provided on the statistical tests supporting the variance claims, the exact rules for model exclusion, and controls for confounding factors (e.g., training duration, hyperparameter stability). This leaves room for unstated post-hoc choices that could affect the reported ratios and CoV bounds.

    Authors: We will expand the methods section and appendix with additional detail on the statistical procedures used to compute variances and ratios, the exact exclusion criteria applied to trained models, and controls for training duration and hyperparameter stability. These additions will include explicit statements confirming that no post-hoc selections were made that could bias the reported 3.2x ratio or 1-2% CoV bounds. revision: yes

standing simulated objections not resolved
  • Empirical verification of the variance structure and protocol recommendations on architectures other than SiT, different losses, or datasets other than class-conditional ImageNet 256x256.

Circularity Check

0 steps flagged

No circularity; all claims are direct empirical measurements on trained models

full rationale

The paper reports variance ratios, CoV bounds, and driving factors obtained by explicitly training several hundred SiT networks on ImageNet 256x256 under varied seeds and computing FID directly in Inception space. No derivation, equation, or prediction reduces to a fitted parameter, self-definition, or self-citation chain; the 3.2x ratio and 1-2% CoV are computed quantities, not outputs forced by the paper's own formalism. The protocol recommendations follow from these measurements without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is purely empirical and relies on standard statistical assumptions for variance and coefficient of variation; no free parameters are fitted to produce the headline ratios, and no new entities are postulated.

axioms (2)
  • Domain assumption: FID computed in Inception-v3 feature space is a stable and meaningful measure of distribution distance between real and generated images.
    Invoked as the basis for all variance measurements throughout the abstract.
  • Domain assumption: The SiT training recipe and ImageNet 256x256 setup are representative enough for the observed variance patterns to be informative.
    Underlies the generalization of the 3.2x factor and CoV band to broader practice.
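
For reference, the first assumption rests on the standard closed form of the Fréchet distance between two Gaussians, which the Figure 4 caption describes as being fit to Inception features of the reference set and of 50,000 generated images. The symbols below (μ_r, Σ_r for the reference features, μ_g, Σ_g for the generated ones) are notation chosen here, not taken from the paper.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

% One-dimensional special case, with standard deviations \sigma_r and \sigma_g:
\mathrm{FID}_{1\text{-D}} = (\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2
```

Both the mean term and the covariance term are estimated from finite samples, which is one route by which the sampling seed can move the score.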

pith-pipeline@v0.9.1-grok · 5814 in / 1465 out tokens · 42084 ms · 2026-06-26T17:40:02.653738+00:00 · methodology

Reference graph

Works this paper leans on

101 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    Sinjini Banerjee, Tim Marrinan, Reilly Cannon, Tony Chiang, and Anand D. Sarwate. Measuring training variability from stochastic optimization using robust nonparametric testing.arXiv preprint arXiv:2406.08307, 2024

  2. [2]

    A Note on the Inception Score

    Shane Barratt and Rishi Sharma. A note on the inception score.arXiv preprint arXiv:1801.01973, 2018

  3. [3]

    Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis.Journal of Machine Learning Research, 2017

    Alessio Benavoli, Giorgio Corani, Janez Demšar, and Marco Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis.Journal of Machine Learning Research, 2017

  4. [4]

    Quantifying the uncertainty of model-based synthetic image quality metrics.arXiv preprint arXiv:2504.03623, 2025

    Ciaran Bench and Spencer Angus Thomas. Quantifying the uncertainty of model-based synthetic image quality metrics.arXiv preprint arXiv:2504.03623, 2025

  5. [5]

    Princeton University Press, 2007

    Rajendra Bhatia.Positive Definite Matrices. Princeton University Press, 2007

  6. [6]

    Sutherland, Michael Arbel, and Arthur Gretton

    Mikołaj Bi´nkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In6th International Conference on Learning Representations (ICLR), 2018

  7. [7]

    Pros and cons of GAN evaluation measures.Computer Vision and Image Under- standing, 2019

    Ali Borji. Pros and cons of GAN evaluation measures.Computer Vision and Image Under- standing, 2019

  8. [8]

    Pros and cons of GAN evaluation measures: New developments.Computer Vision and Image Understanding, 2022

    Ali Borji. Pros and cons of GAN evaluation measures: New developments.Computer Vision and Image Understanding, 2022

  9. [9]

    Unreproducible research is reproducible

    Xavier Bouthillier, César Laurent, and Pascal Vincent. Unreproducible research is reproducible. InProceedings of the 36th International Conference on Machine Learning (ICML), 2019

  10. [10]

    Accounting for variance in machine learning benchmarks

    Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram V oleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, and Pascal Vincent. Accounting for variance in machine learning benchmarks. InProceedings of Machine Lear...

  11. [11]

    Brent.Algorithms for Minimization without Derivatives

    Richard P. Brent.Algorithms for Minimization without Derivatives. Prentice-Hall, 1973

  12. [12]

    Effectively unbiased FID and inception score and where to find them

    Min Jin Chong and David Forsyth. Effectively unbiased FID and inception score and where to find them. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  13. [13]

    Questionable answers in question answering research: Reproducibility and vari- ability of published results.Transactions of the Association for Computational Linguistics, 2018

    Matt Crane. Questionable answers in question answering research: Reproducibility and vari- ability of published results.Transactions of the Association for Computational Linguistics, 2018

  14. [14]

    Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Nataraj...

  15. [15]

    Lucas Degeorge, Arijit Ghosh, Nicolas Dufour, David Picard, and Vicky Kalogeiton. How far can we go with ImageNet for text-to-image generation? In Advances in Neural Information Processing Systems 38 (NeurIPS), 2025

  16. [16]

    Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 2006

  17. [17]

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009

  18. [18]

    Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems 34 (NeurIPS), 2021

  19. [19]

    Thomas J. DiCiccio and Bradley Efron. Bootstrap confidence intervals. Statistical Science, 1996

  20. [20]

    Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 1998

  21. [21]

    Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. In Proceedings of EMNLP-IJCNLP, 2019

  22. [22]

    Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020

  23. [23]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

  24. [24]

    Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, 2018

  25. [25]

    Olive Jean Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 1961

  26. [26]

    B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 1979

  27. [27]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  28. [28]

    Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 2014

  29. [29]

    Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019

  30. [30]

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019

  31. [31]

    Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020

  32. [32]

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010

  33. [33]

    Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018

  34. [34]

    Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/. Thinking Machines Lab blog; accessed 2026

  36. [36]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015

  37. [37]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  38. [38]

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018

  39. [39]

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010....

  40. [40]

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017

  41. [41]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017

  42. [42]

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  43. [43]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020

  44. [44]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  45. [45]

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  46. [46]

    Keller Jordan. Calibrated chaos: Variance between runs of neural network training is harmless and inevitable. arXiv preprint arXiv:2304.01910, 2023

  47. [47]

    Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations (ICLR), 2024

  48. [48]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  49. [49]

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  50. [50]

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35 (NeurIPS), 2022

  51. [51]

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  52. [52]

    M. G. Kendall and B. Babington Smith. The problem of m rankings. The Annals of Mathematical Statistics, 1939

  53. [53]

    J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 1953

  54. [54]

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015

  55. [55]

    Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems 34 (NeurIPS), 2021

  56. [56]

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019

  57. [57]

    Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet inception distance. In The Eleventh International Conference on Learning Representations (ICLR), 2023

  58. [58]

    Doudou LaLoudouana, Mambobo Bonouliqui Tarare, Lupano Tecallonou Center, and GUANA Selacie. Data set selection. Journal of Machine Learning Gossip, 2003

  59. [59]

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

  60. [60]

    Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024

  61. [61]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023

  62. [62]

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations (ICLR), 2023

  63. [63]

    Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018

  64. [64]

    Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In Computer Vision – ECCV 2024, 2024

  65. [65]

    Pranava Madhyastha and Rishabh Jain. On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 2019

  66. [66]

    Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations (ICLR), 2018

  67. [67]

    Stanislav Morozov, Andrey Voynov, and Artem Babenko. On self-supervised image representations for GAN evaluation. In 9th International Conference on Learning Representations (ICLR), 2021

  68. [68]

    Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations (ICLR), 2021

  69. [69]

    Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In Proceedings of the 37th International Conference on Machine Learning, 2020

  70. [70]

    Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019

  71. [71]

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021

  72. [72]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patric...

  73. [73]

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  74. [74]

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  75. [75]

    Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. Problems and opportunities in training deep learning software systems: An analysis of variance. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020

  76. [76]

    David Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. arXiv preprint arXiv:2109.08203, 2021

  77. [77]

    Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 2021

  78. [78]

    Edward Raff. A step toward quantifying independently reproducible machine learning research. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019

  79. [79]

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019

  80. [80]

    Nils Reimers and Iryna Gurevych. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017

Showing first 80 references.