The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Alexei A. Efros; Nicolas Dufour; Patrick Pérez

arxiv: 2606.20536 · v1 · pith:LA2DOKCHnew · submitted 2026-06-18 · 💻 cs.CV


Pith reviewed 2026-06-26 17:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords FID, generative models, evaluation, variance, reproducibility, ImageNet, training seeds, classifier-free guidance


The pith

Retraining generative models with different seeds shifts FID 3.2 times more than changing the sampling seed from a fixed model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the FID score as a random variable across training seeds and generation seeds, then measures its variance by training several hundred SiT networks on class-conditional ImageNet 256x256. It finds that the training-seed component dominates, driven by random initialization, data ordering, and the Gaussian noise in the flow-matching loss. Increasing model size or compute does not shrink the relative spread, which stays inside a 1-2 percent coefficient of variation. Per-cell tuning of classifier-free guidance halves the spread but reorders which seeds perform best. The authors therefore recommend evaluating under per-cell optimal guidance, treating gaps below the observed 1.3 percent CoV as inconclusive, and reporting error bars over multiple training seeds.

Core claim

On a two-axis panel of training and generation seeds, retraining the model moves FID 3.2 times farther in Inception feature space than redrawing samples from a fixed network. The gap is produced by random initialization, data ordering, and per-step Gaussian noise. The coefficient of variation stays inside 1-2 percent even when compute or model size grows. Per-cell guidance tuning halves the spread while reshuffling seed rankings. Separately, a lucky training seed reaches a target FID with as little as half the compute an unlucky seed needs.

What carries the argument

Two-axis panel of training seeds and generation seeds on which FID is treated as a random variable and measured directly across hundreds of trained networks.
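
A minimal sketch of how such a two-axis panel could be summarised, assuming the panel is stored as one row of FID values per training seed and one column per sampling seed. The function name `decompose_panel`, the pooled within-seed estimator, and the synthetic numbers are illustrative assumptions, not the paper's code or data.

```python
import random
import statistics


def decompose_panel(panel):
    """Split FID spread into between-training-seed and within-seed parts.

    panel: list of rows, one row per training seed; each row holds the FID
    values obtained with different sampling seeds on that trained model.
    """
    seed_means = [statistics.fmean(row) for row in panel]
    grand_mean = statistics.fmean(seed_means)
    # Between-seed spread: standard deviation of the per-training-seed means.
    sigma_between = statistics.stdev(seed_means)
    # Within-seed spread: square root of the average per-row sample variance
    # (a pooled estimate; assumes every row has the same number of samples).
    sigma_within = statistics.fmean([statistics.variance(row) for row in panel]) ** 0.5
    return {
        "grand_mean": grand_mean,
        "sigma_between": sigma_between,
        "sigma_within": sigma_within,
        "ratio": sigma_between / sigma_within,
        "cov": sigma_between / grand_mean,
    }


# Synthetic panel (made-up numbers): 25 training seeds x 10 sampling seeds.
rng = random.Random(0)
panel = []
for _ in range(25):
    offset = rng.gauss(0.0, 0.45)  # hypothetical training-seed effect
    panel.append([34.5 + offset + rng.gauss(0.0, 0.14) for _ in range(10)])

stats = decompose_panel(panel)
print({key: round(value, 3) for key, value in stats.items()})
```

The `ratio` entry plays the role of the between-to-within ratio quoted as 3.2 in the text, and `cov` is the between-seed σ divided by the grand mean. Whether the paper pools the within-seed variance in exactly this way is not stated here.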

If this is right

Per-cell classifier-free-guidance tuning halves the observed FID spread.
Any reported FID gap smaller than the measured 1.3 percent coefficient of variation should be treated as inconclusive.
A lucky training seed can reach the same FID value with as little as half the compute an unlucky seed needs.
Evaluations should report an error bar over several training seeds rather than a single FID number.
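
The reporting recommendations in this list can be sketched as a small helper. This assumes the gap between two FIDs is measured relative to their mean; the paper's exact definition of the gap is not given on this page. The function names and the example FID values are hypothetical.

```python
import statistics

COV_FLOOR = 0.013  # ~1.3% CoV quoted above for this SiT / ImageNet setup


def error_bar(fids_per_training_seed):
    """Mean and sample standard deviation over independently trained models."""
    mean = statistics.fmean(fids_per_training_seed)
    sd = statistics.stdev(fids_per_training_seed)
    return mean, sd


def gap_is_inconclusive(fid_a, fid_b, cov_floor=COV_FLOOR):
    """True when the relative FID gap is below the seed-noise floor.

    Assumption: the gap is normalised by the mean of the two FIDs.
    """
    rel_gap = abs(fid_a - fid_b) / ((fid_a + fid_b) / 2)
    return rel_gap < cov_floor


# Hypothetical FIDs from five training seeds of one model.
mean, sd = error_bar([33.9, 34.4, 34.8, 35.1, 34.2])
print(f"FID = {mean:.2f} +/- {sd:.2f} (5 training seeds)")
print(gap_is_inconclusive(34.0, 34.3))  # relative gap of about 0.9%
print(gap_is_inconclusive(34.0, 36.0))  # relative gap of about 5.7%
```

The threshold is only meaningful for setups where a comparable CoV has been measured; for other architectures or datasets it would need to be re-estimated.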

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Current generative-model leaderboards that publish single FID numbers are likely to contain many inconclusive comparisons.
The same hidden training-seed variance may affect other popular metrics such as CLIP score or precision-recall.
Averaging FID over a small number of independent training runs could produce more stable model rankings at modest extra cost.
The finding raises the question of whether similar seed-driven variance appears in non-diffusion generative models.

Load-bearing premise

The variance patterns and driving factors observed on SiT networks trained on class-conditional ImageNet 256x256 generalize to other generative architectures, datasets, and training recipes.

What would settle it

Repeating the full two-axis experiment on a different architecture or dataset and obtaining a between-seed to within-seed spread ratio (σbetween/σwithin) far from 3.2 would falsify the central quantitative claim.

Figures

Figures reproduced from arXiv: 2606.20536 by Alexei A. Efros, Nicolas Dufour, Patrick Pérez.

**Figure 1.** All sources of randomness behind a generative model. Training and then sampling a generative model is a chain of pseudo-random draws. They fall into two lotteries. The training lottery (left) is drawn from four sources: the random weight initialisation (1), the order in which examples are visited (2), the fresh Gaussian noise the flow-matching loss injects at every gradient step (3), and the bitwise non-deter…

**Figure 2.** The FID lottery in SiT-B/2 at 400k steps. Each violin is one of 25 independently trained SiT-B/2 models, sorted by per-seed mean Inception FID. Small dots are the 250 individual sampling-seed evaluations and short black ticks are per-seed means. The two highlighted markers pick out the single best (33.59) and worst (35.69) FID across the panel, a 2.10-point gap produced purely by changing seeds. Each violi…

**Figure 3.** Variance decomposition of the training-seed lottery (SiT-B/2, 400k, no CFG). (a) Per-seed mean Inception FID under the three single-source conditions plus the fully-stochastic baseline (vary all) and the same-seed control. Each dot is one training seed, boxes show the 25/50/75 percentiles. (b) Between-seed σ (coral) versus within-seed sampling σ (sage) per condition. The four random-source conditions are o…

**Figure 4.** What does the FID spectrum look like? Each row is one scene rendered by a SiT-XL model whose Inception FID falls log-uniformly from 43 (left) to 3.6 (right), a 12× range: quality improves toward the right, as FID goes down. FID is defined at the distribution level. It is a Fréchet distance between Gaussians fit to the Inception features of the reference distribution (the ImageNet dataset) and 50,000 generated im…
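
For reference, the Fréchet distance between two Gaussians mentioned in this caption has a standard closed form, written out below. The symbols are assumptions introduced here, not notation taken from the paper: (μ_r, Σ_r) are the mean and covariance of the Inception features of the reference images, and (μ_g, Σ_g) those of the generated images.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because both terms are estimated from finite samples and from one particular set of trained weights, the value inherits randomness from both the sampling seed and the training seed.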

**Figure 5.** Per-cell guidance tuning halves the seed-induced FID spread, but reshuffles which seeds rank best. Per-(training, sampling)-seed golden-section CFG search (GS-FID) across 25 SiT-B/2 training seeds (400k steps, 10 sampling seeds per cell). (a) Per-seed violins of guided Inception FID, sorted by per-seed mean. The relative spread tightens to CoV = 0.67%, about half the 1.26% measured unguided on the same pan…

**Figure 6.** The seed lottery across compute and model size. (a) Inception FID over training: thin pastel lines are individual training-seed trajectories, bold lines are per-step means. The spread between seeds stays wide at every checkpoint and does not shrink as training converges. (b) Coefficient of variation σ/µ over training: all four models stay near a 1–2% band. Bigger models do not yield proportionally tighter …

**Figure 7.** The luck of the draw: a 1.2–2.0× convergence gap. For each model the dashed horizontal line marks the target T, the FID reached by the unluckiest of ∼20 seeds at 2M. The green dot is the step at which the luckiest seed first crosses T. The coral dot sits at 2M where the unlucky seed finally reaches it. The amber band between them is the training compute the unlucky seed wastes catching up. The per-step sha…
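
The construction in this caption can be sketched in a few lines. This is a sketch under stated assumptions: the "luckiest" seed is taken to be whichever seed first reaches the target, FID is logged only at discrete checkpoints, and the trajectories below are invented numbers, not values from the paper.

```python
def convergence_gap(trajectories, steps):
    """Return (target FID, first step any seed reaches it, compute ratio).

    trajectories: dict mapping seed name -> list of FID values, one per
    checkpoint in `steps` (same order; the last entry is the final checkpoint).
    """
    # Target T: the final FID of the unluckiest seed (the worst final value).
    target = max(traj[-1] for traj in trajectories.values())
    # Earliest checkpoint at which some seed is already at or below T.
    # Every seed reaches T by the final step at the latest, so next() succeeds.
    first_cross = min(
        next(step for step, fid in zip(steps, traj) if fid <= target)
        for traj in trajectories.values()
    )
    return target, first_cross, steps[-1] / first_cross


# Hypothetical checkpoints and FID curves for two training seeds.
steps = [500_000, 1_000_000, 1_500_000, 2_000_000]
trajectories = {
    "lucky": [12.0, 9.0, 8.5, 8.2],
    "unlucky": [13.0, 10.5, 9.6, 9.0],
}
target, first_cross, ratio = convergence_gap(trajectories, steps)
print(target, first_cross, ratio)
```

With coarse checkpoints the crossing step is only resolved to the checkpoint spacing, so the ratio is an approximation of the true gap.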

**Figure 8.** µP-coordinated LR sweep at 100k for SiT-S/B/L/XL. Solid lines are the per-LR mean Inception FID across 10 training seeds, the shaded envelope is the per-LR seed min–max, and the open ring on each curve circles each size’s best-mean-FID dot. (a) Unguided FID is monotone in LR for every size, so the highlighted dot sits at 5×10⁻⁴ (the edge of training stability) for all four. (b) GS-FID has flat-bottomed val…

**Figure 9.** The FID lottery, drawn as two slot machines. A casino-themed rendering of the same two lotteries diagrammed from first principles in …

**Figure 10.** (a) Vary all (σbetween = 0.438). All three randomness sources are free. The panel is the same data as …

**Figure 11.** (b) Vary noise (σbetween = 0.336). Init and data order are fixed. Only the per-step Gaussian noise of the flow-matching loss varies between training runs. Noise alone reproduces ≈77% of the baseline between-seed σ of (a). [Plot axes: Inception FID versus training seed, sorted by mean Inception FID.]

**Figure 12.** (c) Vary init (σbetween = 0.294). Data order and noise are fixed. Only the parameter initialisation varies. Init alone reproduces ≈67% of the baseline between-seed σ of (a). …

**Figure 13.** (d) Vary data (σbetween = 0.221). Init and noise are fixed. Only the data-loader order varies. Data order alone reproduces ≈51% of the baseline between-seed σ of (a). All four panels share the same y-range and a consistent layout: every violin is one training seed showing the Gaussian KDE of its 10 sampling-seed evaluations, the small dots are individual evaluations, and the black tick is the per-seed mea…

**Figure 14.** Rank stability of the 25 training seeds (SiT-B/2, 400k). (a) Bump chart: each line is one training seed traced through three ranking criteria: mean, min, and max FID across its 10 sampling seeds. Crossings dominate the picture: the best seed by mean is rarely the best by min or max. Coral and teal highlight the seeds that are best- and worst-by-mean to make their rank trajectories visible. (b) Spearman ρ…

**Figure 15.** Seed optimality: are “good” init seeds universal? (SiT-B/2, 400k, no CFG.) (a) Heatmap of mean Inception FID over a 10×15 grid of init seeds (rows) and (data, noise) pairings (columns). Cells span ∼33.9–35.9. A single init does not consistently shade greenest across rows. (b) Bump chart of the same data: each line is one init seed traced through its rank within each (data, noise) pair. Heavy crossings ma…

**Figure 16.** Seed-induced 95% confidence interval as a function of the reported Inception FID. For a mean FID computed from N independently trained models with K = 10 sampling seeds each, the normal-approximation half-width is CI95 = 1.96 · CoV · F / √N, where CoV = σbetween/µ is the scale-invariant noise floor reported for Inception FID across the 76 (model, step) cells of the scaling sweep (…
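
The half-width formula in this caption is easy to evaluate directly. A short sketch follows. The default CoV of 1.3% is taken from the abstract's headline figure rather than from this truncated caption, so treat it as an assumption, and `models_needed` is a hypothetical helper that simply inverts the same formula.

```python
import math


def seed_ci95_half_width(mean_fid, n_models, cov=0.013):
    """Normal-approximation 95% half-width: 1.96 * CoV * F / sqrt(N)."""
    return 1.96 * cov * mean_fid / math.sqrt(n_models)


def models_needed(mean_fid, target_half_width, cov=0.013):
    """Smallest N whose half-width is at or below the target (inverted formula)."""
    return math.ceil((1.96 * cov * mean_fid / target_half_width) ** 2)


# Half-width for a reported FID of 10.0 as the number of trained models grows.
for n in (1, 3, 5, 10):
    print(n, round(seed_ci95_half_width(10.0, n), 3))
```

Because the half-width shrinks only as 1/√N, halving the error bar takes four times as many independently trained models.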

**Figure 17.** Golden-section search on FID(ω) (referenced briefly from Sec. 4.3). (a) Algorithm 1: pseudocode for the bracket-contraction loop (the bracket-contraction sequence itself is illustrated in …

**Figure 18.** (b) Two interior probes x1 = b − ρ(b − a) and x2 = a + ρ(b − a), with ρ = 1/φ ≈ 0.618, split the bracket [a, b]. The side with the larger f-value is discarded. [Plot: bracket contraction across iterations; axes: guidance scale ω versus iteration n; annotations: ω⋆ reused, Ln+1 = ρ Ln.]
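
The bracket contraction described in this caption can be sketched as follows. The objective `toy_fid`, the bracket [1.0, 3.0], and the tolerance are placeholders chosen for illustration. In the paper's setting each call to `f` would presumably be a full FID evaluation at guidance scale ω, and the sketch assumes FID(ω) has a single minimum inside the bracket.

```python
import math

RHO = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618


def golden_section_minimize(f, a, b, tol=1e-3, max_iter=100):
    """Minimise a unimodal function f on [a, b] by golden-section search."""
    x1 = b - RHO * (b - a)
    x2 = a + RHO * (b - a)
    f1, f2 = f(x1), f(x2)
    for _ in range(max_iter):
        if b - a <= tol:
            break
        if f1 > f2:
            # Left probe is worse: discard [a, x1] and reuse x2 as the new x1.
            a = x1
            x1, f1 = x2, f2
            x2 = a + RHO * (b - a)
            f2 = f(x2)
        else:
            # Right probe is worse: discard [x2, b] and reuse x1 as the new x2.
            b = x2
            x2, f2 = x1, f1
            x1 = b - RHO * (b - a)
            f1 = f(x1)
    return (a + b) / 2


def toy_fid(omega):
    """Hypothetical stand-in for FID as a function of guidance scale."""
    return 30.0 + 8.0 * (omega - 1.4) ** 2


best_omega = golden_section_minimize(toy_fid, 1.0, 3.0, tol=1e-4)
print(round(best_omega, 3))
```

Each iteration shrinks the bracket by the factor ρ and needs only one new function evaluation, since one of the two probes is reused. That matters when every evaluation means generating and scoring a full sample set.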

**Figure 19.** The four PRDC metrics on a 2D toy. Marker convention: circles (•) are real points and triangles (▲) are generated points. The larger marker in each panel is the entity the metric tests. In panels (a, c, d) the metric is a per-point indicator, drawn green (• / ▲ = counted / covered) or red and hollow (◦ / △ = not counted / uncovered). Lavender (▲) marks generated points whose coverage status is not relevan…

**Figure 20.** Seed lottery on the same SiT-B/2 panel under DINOv2 FID. Each violin is one of 25 trained models, sorted left-to-right by per-seed mean. Violin shape traces the within-seed sampling distribution. The black tick is the per-seed mean. The training-seed mean ticks span 694.8 to 718.8 around grand mean 708.6. The between-to-within ratio is σbetween/σwithin = 4.82×, against 3.19× for Inception FID, so the stai…

**Figure 21.** Seed lottery under Inception Precision. Y-axis values are multiplied by 100 for readability. Per-seed means span 0.480 to 0.491 around grand mean 0.485. Violin height (within-seed sampling spread) and the staircase of mean ticks (between-seed training spread) are comparable, since σbetween/σwithin = 1.14×. A multi-sampling-seed CI therefore covers roughly half of the seed-induced envelope on precision, in…

**Figure 22.** Seed lottery under Inception Recall (the inversion case). Y-axis values are multiplied by 100 for readability. Per-seed means span 0.312 to 0.317 around grand mean 0.314. Each violin is taller than the staircase of mean ticks: σbetween/σwithin = 0.28×, an inversion of the FID asymmetry. On Inception recall the right CI to report on a fixed trained model is the sampling-only one. This is the opposite recom…

**Figure 23.** Coefficient of variation across compute and scale, for six metrics. Each panel plots between-seed CoV = σ/µ versus training step on the variance-over-training panel for SiT-S/B/L/XL with the clean seed sets of …

**Figure 24.** Rank stability across compute, for six metrics. Each panel plots Spearman ρ between the seed ranking at training step t and at 2M, one curve per model size, mirroring the layout of Figure 6c. Dashed horizontal lines mark ρ = 0.8 (the high-stability target) and ρ = 0 (random). Inception FID, DINOv2 FID and Inception density approach ρ ≈ 0.8 by ∼1.4M on most models. Inception precision and coverage stabili…
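
The rank-stability curve in this caption rests on Spearman's ρ, which is the Pearson correlation of the two rank vectors. A self-contained sketch is below. The per-seed FID values are invented for illustration, and ties are handled with average ranks, which is an assumption about a detail the caption does not specify.

```python
import math


def ranks(values):
    """1-based ranks with ties replaced by their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical per-seed mean FIDs at an early checkpoint and at the final one.
fid_early = [36.1, 35.4, 36.8, 35.9, 36.3]
fid_final = [34.2, 34.0, 34.9, 34.5, 34.4]
print(round(spearman_rho(fid_early, fid_final), 3))
```

A value near 1 means the early checkpoint already orders the seeds the way the final checkpoint does; a value near 0 means the early ranking carries no information about the final one.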

**Figure 25.** Guided FID gallery – golden retriever. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 26.** Guided FID gallery – tabby cat. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 27.** Guided FID gallery – macaw. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 28.** Guided FID gallery – flamingo. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 29.** Guided FID gallery – cheeseburger. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 30.** Guided FID gallery – ice cream. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 31.** Guided FID gallery – volcano. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 32.** Guided FID gallery – alp. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 33.** Guided FID gallery – geyser. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 34.** Guided FID gallery – daisy. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 35.** Unguided FID gallery – golden retriever. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 36.** Unguided FID gallery – tabby cat. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 37.** Unguided FID gallery – macaw. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 38.** Unguided FID gallery – flamingo. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 39.** Unguided FID gallery – cheeseburger. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 40.** Unguided FID gallery – ice cream. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 41.** Unguided FID gallery – volcano. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 42.** Unguided FID gallery – alp. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 43.** Unguided FID gallery – geyser. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

**Figure 44.** Unguided FID gallery – daisy. Rows are 6 fixed initial-noise seeds, and columns step through 8 FID-ordered checkpoints. The two halves continue along the same FID axis. Pastel colorbars report per-column FID. See Appendix G for the full setup.

Original abstract

The Fréchet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Training seed changes shift FID 3.2x more than sampling changes on these SiT runs, with CoV locked at 1-2%, but the numbers come only from that one setup.


The key point is that retraining with a new seed moves FID 3.2 times farther in feature space than just resampling from a fixed model, and the coefficient of variation stays inside a 1-2% band even when you scale compute or model size. They also show that tuning guidance per seed cuts the spread in half but changes which seeds win, and that a good seed can match a bad seed's FID with as little as half the compute.

The paper does solid empirical work by training several hundred SiT models on class-conditional ImageNet 256x256 and measuring the variances directly across seeds. It pins the extra variance on three concrete sources: random initialization, data ordering, and the per-step Gaussian noise in the flow-matching loss. Those measurements are new at this scale, and the 3.2x ratio plus the stable CoV band are specific results not in the earlier literature they cite. The protocol they suggest (per-cell guidance, treat gaps under 1.3% as inconclusive, report training-seed error bars) follows from the data they collected.

The soft spot is the narrow scope. Everything is measured on SiT networks for one dataset and resolution. No results appear for GANs, standard diffusion models, autoregressive generators, or other datasets, so the 3.2x factor and the 1-2% CoV band could be specific to this recipe rather than general. The stress-test note is right on that point. The abstract also gives little detail on exact exclusion rules or statistical tests, which would need checking in the full text.

This paper is for researchers who train and compare generative models and currently report single FID numbers. Anyone in that group would get practical value from the variance numbers and the suggested reporting changes. It deserves a serious referee because the measurements are direct and the evaluation question is real for the field, even if later work has to test whether the ratios hold elsewhere.

Referee Report

2 major / 1 minor

Summary. The paper treats FID as a random variable over training and sampling seeds and measures its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. It reports that the training-seed spread is 3.2x the sampling-seed spread in Inception space (a ratio of standard deviations, σbetween/σwithin), identifies three driving factors (random initialization, data ordering, per-step Gaussian noise), finds that the coefficient of variation remains inside a 1-2% band even with increased compute or model size, shows that per-cell classifier-free-guidance tuning halves the spread, and recommends a new evaluation protocol: per-cell optimal guidance, treat gaps below ~1.3% CoV as inconclusive, and report error bars over training seeds rather than a single FID.

Significance. If the empirical variance measurements hold, the work provides direct evidence that single FID numbers are unreliable and that training randomness dominates sampling randomness. The scale of the experiment (hundreds of models) and the use of direct empirical computations without fitted parameters or self-referential definitions are strengths. The protocol recommendations would meaningfully change evaluation standards in generative modeling if the observed variance structure generalizes.

major comments (2)

[Abstract and experimental results] The central quantitative claims (3.2x ratio, 1-2% CoV) and the three protocol recommendations are derived exclusively from SiT networks on class-conditional ImageNet 256x256; no results are shown for other architectures (GANs, diffusion, autoregressive), losses, or datasets. This makes the generalization premise for the recommendations unverified and load-bearing for the paper's broader impact.
[Methods and appendix] Limited detail is provided on the statistical tests supporting the variance claims, the exact rules for model exclusion, and controls for confounding factors (e.g., training duration, hyperparameter stability). This leaves room for unstated post-hoc choices that could affect the reported ratios and CoV bounds.

minor comments (1)

[Experimental setup] The manuscript would benefit from an explicit statement of the precise number of models, seeds, and samples per cell in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for acknowledging the scale of the experiments and the direct empirical approach. We respond to each major comment below.


Referee: [Abstract and experimental results] The central quantitative claims (3.2x ratio, 1-2% CoV) and the three protocol recommendations are derived exclusively from SiT networks on class-conditional ImageNet 256x256; no results are shown for other architectures (GANs, diffusion, autoregressive), losses, or datasets. This makes the generalization premise for the recommendations unverified and load-bearing for the paper's broader impact.

Authors: We agree that all quantitative results and protocol recommendations are derived solely from SiT models on class-conditional ImageNet 256x256. The manuscript frames the work as a detailed empirical study on this benchmark rather than a universal claim. We will revise the abstract, introduction, and conclusion to explicitly limit the scope of the claims and recommendations to the studied setting, removing any implication of broader generalization. We cannot add results for other architectures or datasets without new experiments. revision: yes
Referee: [Methods and appendix] Limited detail is provided on the statistical tests supporting the variance claims, the exact rules for model exclusion, and controls for confounding factors (e.g., training duration, hyperparameter stability). This leaves room for unstated post-hoc choices that could affect the reported ratios and CoV bounds.

Authors: We will expand the methods section and appendix with additional detail on the statistical procedures used to compute variances and ratios, the exact exclusion criteria applied to trained models, and controls for training duration and hyperparameter stability. These additions will include explicit statements confirming that no post-hoc selections were made that could bias the reported 3.2x ratio or 1-2% CoV bounds. revision: yes

standing simulated objections not resolved

Empirical verification of the variance structure and protocol recommendations on architectures other than SiT, different losses, or datasets other than class-conditional ImageNet 256x256.

Circularity Check

0 steps flagged

No circularity; all claims are direct empirical measurements on trained models


The paper reports variance ratios, CoV bounds, and driving factors obtained by explicitly training several hundred SiT networks on ImageNet 256x256 under varied seeds and computing FID directly in Inception space. No derivation, equation, or prediction reduces to a fitted parameter, self-definition, or self-citation chain; the 3.2x ratio and 1-2% CoV are computed quantities, not outputs forced by the paper's own formalism. The protocol recommendations follow from these measurements without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is purely empirical and relies on standard statistical assumptions for variance and coefficient of variation; no free parameters are fitted to produce the headline ratios, and no new entities are postulated.

axioms (2)

Domain assumption: FID computed in Inception-v3 feature space is a stable and meaningful measure of distribution distance between real and generated images.
Invoked as the basis for all variance measurements throughout the abstract.
Domain assumption: The SiT training recipe and ImageNet 256x256 setup are representative enough for the observed variance patterns to be informative.
Underlies the generalization of the 3.2x factor and CoV band to broader practice.


Reference graph

Works this paper leans on

101 extracted references · 14 canonical work pages · 5 internal anchors

[1] Sinjini Banerjee, Tim Marrinan, Reilly Cannon, Tony Chiang, and Anand D. Sarwate. Measuring training variability from stochastic optimization using robust nonparametric testing. arXiv preprint arXiv:2406.08307, 2024.

[2] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.

[3] Alessio Benavoli, Giorgio Corani, Janez Demšar, and Marco Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research, 2017.

[4] Ciaran Bench and Spencer Angus Thomas. Quantifying the uncertainty of model-based synthetic image quality metrics. arXiv preprint arXiv:2504.03623, 2025.

[5] Rajendra Bhatia. Positive Definite Matrices. Princeton University Press, 2007.

[6] Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In 6th International Conference on Learning Representations (ICLR), 2018.

[7] Ali Borji. Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 2019.

[8] Ali Borji. Pros and cons of GAN evaluation measures: New developments. Computer Vision and Image Understanding, 2022.

[9] Xavier Bouthillier, César Laurent, and Pascal Vincent. Unreproducible research is reproducible. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

[10] Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, and Pascal Vincent. Accounting for variance in machine learning benchmarks. In Proceedings of Machine Lear... (2021)

[11] Richard P. Brent. Algorithms for Minimization without Derivatives. Prentice-Hall, 1973.

[12] Min Jin Chong and David Forsyth. Effectively unbiased FID and inception score and where to find them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[13] Matt Crane. Questionable answers in question answering research: Reproducibility and variability of published results. Transactions of the Association for Computational Linguistics, 2018.

[14] Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Nataraj... (2022)

[15] Lucas Degeorge, Arijit Ghosh, Nicolas Dufour, David Picard, and Vicky Kalogeiton. How far can we go with ImageNet for text-to-image generation? In Advances in Neural Information Processing Systems 38 (NeurIPS), 2025.

[16] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 2006.

[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[18] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems 34 (NeurIPS), 2021.

[19] Thomas J. DiCiccio and Bradley Efron. Bootstrap confidence intervals. Statistical Science, 1996.

[20] Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 1998.

[21] Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. In Proceedings of EMNLP-IJCNLP, 2019.

[22] Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020.

[23] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

[24] Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, 2018.

[25] Olive Jean Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 1961.

[26] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 1979.

[27] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.

[28] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 2014.

[29] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.

[30] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.

[31] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

[32] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[33] Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

[34] Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/, Thinking Machines Lab blog; accessed 2026.

[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.

[37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[38] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

[39] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010....

[40] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.

[41] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.

[42] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[43] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.

[44] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre... (2022)

[45] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[46] Keller Jordan. Calibrated chaos: Variance between runs of neural network training is harmless and inevitable. arXiv preprint arXiv:2304.01910, 2023.

[47] Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations (ICLR), 2024.

[48] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[49] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[50] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35 (NeurIPS), 2022.

[51] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[52] M. G. Kendall and B. Babington Smith. The problem of m rankings. The Annals of Mathematical Statistics, 1939.

[53] J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 1953.

[54] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015.

[55] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems 34 (NeurIPS), 2021.

[56] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

[57] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet inception distance. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

[58] Doudou LaLoudouana, Mambobo Bonouliqui Tarare, Lupano Tecallonou Center, and GUANA Selacie. Data set selection. Journal of Machine Learning Gossip, 2003.

[59] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024.

[60] Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024.

[61] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

[62] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

[63] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.

[64] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In Computer Vision – ECCV 2024, 2024.

[65] Pranava Madhyastha and Rishabh Jain. On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 2019.

[66] Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations (ICLR), 2018.

[67] Stanislav Morozov, Andrey Voynov, and Artem Babenko. On self-supervised image representations for GAN evaluation. In 9th International Conference on Learning Representations (ICLR), 2021.

[68] Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations (ICLR), 2021.

[69] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In Proceedings of the 37th International Conference on Machine Learning, 2020.

[70] Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

[71] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.

[72] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patric... (2024)

[73] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[74] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[75] Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. Problems and opportunities in training deep learning software systems: An analysis of variance. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020.

[76] David Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. arXiv preprint arXiv:2109.08203, 2021.

[77] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 2021.

[78] Edward Raff. A step toward quantifying independently reproducible machine learning research. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

[79] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

[80] Nils Reimers and Iryna Gurevych. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.


Showing first 80 references.

[1] [1]

Sinjini Banerjee, Tim Marrinan, Reilly Cannon, Tony Chiang, and Anand D. Sarwate. Measuring training variability from stochastic optimization using robust nonparametric testing.arXiv preprint arXiv:2406.08307, 2024

work page arXiv 2024

[2] [2]

A Note on the Inception Score

Shane Barratt and Rishi Sharma. A note on the inception score.arXiv preprint arXiv:1801.01973, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis.Journal of Machine Learning Research, 2017

Alessio Benavoli, Giorgio Corani, Janez Demšar, and Marco Zaffalon. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis.Journal of Machine Learning Research, 2017

2017

[4] [4]

Quantifying the uncertainty of model-based synthetic image quality metrics.arXiv preprint arXiv:2504.03623, 2025

Ciaran Bench and Spencer Angus Thomas. Quantifying the uncertainty of model-based synthetic image quality metrics.arXiv preprint arXiv:2504.03623, 2025

work page arXiv 2025

[5] [5]

Princeton University Press, 2007

Rajendra Bhatia.Positive Definite Matrices. Princeton University Press, 2007

2007

[6] [6]

Sutherland, Michael Arbel, and Arthur Gretton

Mikołaj Bi´nkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In6th International Conference on Learning Representations (ICLR), 2018

2018

[7] [7]

Pros and cons of GAN evaluation measures.Computer Vision and Image Under- standing, 2019

Ali Borji. Pros and cons of GAN evaluation measures.Computer Vision and Image Under- standing, 2019

2019

[8] [8]

Pros and cons of GAN evaluation measures: New developments.Computer Vision and Image Understanding, 2022

Ali Borji. Pros and cons of GAN evaluation measures: New developments.Computer Vision and Image Understanding, 2022

2022

[9] [9]

Unreproducible research is reproducible

Xavier Bouthillier, César Laurent, and Pascal Vincent. Unreproducible research is reproducible. InProceedings of the 36th International Conference on Machine Learning (ICML), 2019

2019

[10] [10]

Accounting for variance in machine learning benchmarks

Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram V oleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, and Pascal Vincent. Accounting for variance in machine learning benchmarks. InProceedings of Machine Lear...

2021

[11] [11]

Brent.Algorithms for Minimization without Derivatives

Richard P. Brent.Algorithms for Minimization without Derivatives. Prentice-Hall, 1973

1973

[12] [12]

Effectively unbiased FID and inception score and where to find them

Min Jin Chong and David Forsyth. Effectively unbiased FID and inception score and where to find them. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

2020

[13] [13]

Questionable answers in question answering research: Reproducibility and vari- ability of published results.Transactions of the Association for Computational Linguistics, 2018

Matt Crane. Questionable answers in question answering research: Reproducibility and vari- ability of published results.Transactions of the Association for Computational Linguistics, 2018

2018

[14] [14]

Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Nataraj...

2022

[15] [15]

How far can we go with ImageNet for text-to-image generation? InAdvances in Neural Information Processing Systems 38 (NeurIPS), 2025

Lucas Degeorge, Arijit Ghosh, Nicolas Dufour, David Picard, and Vicky Kalogeiton. How far can we go with ImageNet for text-to-image generation? InAdvances in Neural Information Processing Systems 38 (NeurIPS), 2025

2025

[16] [16]

Statistical comparisons of classifiers over multiple data sets.Journal of Machine Learning Research, 2006

Janez Demšar. Statistical comparisons of classifiers over multiple data sets.Journal of Machine Learning Research, 2006. 12

2006

[17] [17]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009

2009

[18] [18]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. InAdvances in Neural Information Processing Systems 34 (NeurIPS), 2021

2021

[19] [19]

DiCiccio and Bradley Efron

Thomas J. DiCiccio and Bradley Efron. Bootstrap confidence intervals.Statistical Science, 1996

1996

[20] [20]

Dietterich

Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms.Neural Computation, 1998

1998

[21] [21]

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. InProceedings of EMNLP-IJCNLP, 2019

2019

[22] [22]

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping.arXiv preprint arXiv:2002.06305, 2020

work page arXiv 2002

[23] [23]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

2021

[24] [24]

The hitchhiker’s guide to testing statistical significance in natural language processing

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. The hitchhiker’s guide to testing statistical significance in natural language processing. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, 2018

2018

[25] [25]

Multiple comparisons among means.Journal of the American Statistical Association, 1961

Olive Jean Dunn. Multiple comparisons among means.Journal of the American Statistical Association, 1961

1961

[26] [26]

B. Efron. Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 1979

1979

[27] [27]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024

[28] [28]

Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective.International Journal of Computer Vision, 2014

2014

[29] [29]

Deep ensembles: A loss landscape perspective.arXiv preprint arXiv:1912.02757, 2019

Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective.arXiv preprint arXiv:1912.02757, 2019

work page arXiv 1912

[30] [30]

The lottery ticket hypothesis: Finding sparse, trainable neural networks

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInternational Conference on Learning Representations (ICLR), 2019

2019

[31] [31]

Roy, and Michael Carbin

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. InProceedings of the 37th International Conference on Machine Learning (ICML), 2020

2020

[32] [32]

Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010

2010

[33] [33]

State of the art: Reproducibility in artificial intelligence

Odd Erik Gundersen and Sigbjørn Kjensmo. State of the art: Reproducibility in artificial intelligence. InProceedings of the AAAI Conference on Artificial Intelligence, 2018

2018

[34] [34]

Defeating nondeterminism in LLM inference

Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. https: //thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/ ,

[35] [35]

Thinking Machines Lab blog; accessed 2026. 13

2026

[36] [36]

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2015

2015

[37] [37]

Deep residual learning for im- age recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2016

[38] [38]

Deep reinforcement learning that matters

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. InProceedings of the AAAI Conference on Artificial Intelligence, 2018

2018

[39] [39]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010....

work page internal anchor Pith review Pith/arXiv arXiv 2010

[40] [40]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

GANs trained by a two time-scale update rule converge to a local Nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. InAdvances in Neural Information Processing Systems 30 (NIPS 2017), 2017

2017

[42] [42]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[43] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020

[44] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre... 2022

[45] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[46] Keller Jordan. Calibrated chaos: Variance between runs of neural network training is harmless and inevitable. arXiv preprint arXiv:2304.01910, 2023

[47] Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations (ICLR), 2024

[48] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[49] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

[50] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35 (NeurIPS), 2022

[51] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[52] M. G. Kendall and B. Babington Smith. The problem of m rankings. The Annals of Mathematical Statistics, 1939

[53] J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 1953

[54] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015

[55] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems 34 (NeurIPS), 2021

[56] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019

[57] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet inception distance. In The Eleventh International Conference on Learning Representations (ICLR), 2023

[58] Doudou LaLoudouana, Mambobo Bonouliqui Tarare, Lupano Tecallonou Center, and GUANA Selacie. Data set selection. Journal of Machine Learning Gossip, 2003

[59] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

[60] Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024

[61] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023

[62] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations (ICLR), 2023

[63] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018

[64] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In Computer Vision – ECCV 2024, 2024

[65] Pranava Madhyastha and Rishabh Jain. On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 2019

[66] Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations (ICLR), 2018

[67] Stanislav Morozov, Andrey Voynov, and Artem Babenko. On self-supervised image representations for GAN evaluation. In 9th International Conference on Learning Representations (ICLR), 2021

[68] Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations (ICLR), 2021

[69] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In Proceedings of the 37th International Conference on Machine Learning, 2020

[70] Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019

[71] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021

[72] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patric... 2024

[73] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

[74] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

[75] Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. Problems and opportunities in training deep learning software systems: An analysis of variance. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020

[76] David Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. arXiv preprint arXiv:2109.08203, 2021

[77] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 2021

[78] Edward Raff. A step toward quantifying independently reproducible machine learning research. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019

[79] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019

[80] Nils Reimers and Iryna Gurevych. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017