Audits as Evidence: Experiments, Ensembles, and Enforcement

Christopher Walters; Patrick Kline

arxiv: 1907.06622 · v2 · pith:NZ2ON32Pnew · submitted 2019-07-15 · 💰 econ.EM · stat.AP · stat.ME · stat.ML

Pith reviewed 2026-05-24 21:01 UTC · model grok-4.3

classification 💰 econ.EM · stat.AP · stat.ME · stat.ML

keywords: discrimination, correspondence experiments, partial identification, employer heterogeneity, audit studies, racial discrimination, decision rules, callback rates

The pith

Correspondence experiments can bound the share of employers illegally discriminating and evaluate rules for investigating them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops statistical tools that turn data from sending multiple fictitious job applications with varying protected traits into bounds on the fraction of individual employers who discriminate illegally. These tools recover higher moments of job-specific callback effects conditional on how many applications each employer receives, then use the moments to partially identify the share of discriminating jobs without strong parametric assumptions. Applied to three existing audit datasets, the approach reveals substantial heterogeneity across employers, with the standard deviation of group-specific callback gaps roughly twice the average gap. In one racial name experiment the method concludes that at least 85 percent of employers who call both white-named applicants and neither black-named applicant are discriminating. The same framework is used to compare the error rates of simple decision rules that flag suspicious callback patterns for further investigation.

Core claim

Higher moments of the causal effect of protected characteristics on callback rates are identified as functions of the number of applications sent; these moments bound the fraction of jobs engaged in illegal discrimination. In data from a recent experiment, at least 85 percent of jobs that contact both white applications and neither black application discriminate. Under a two-type model consistent with the data, an experiment sending ten applications per job can detect 7-10 percent of discriminators while falsely accusing fewer than 0.2 percent of non-discriminators.

What carries the argument

Identification of higher moments of job-specific callback effects conditional on the number of applications sent, used to partially identify the joint distribution of callback rates across protected groups and thereby bound illegal discrimination.
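
A minimal sketch of why the number of applications matters, assuming callbacks within a job are independent Bernoulli draws given that job's callback probabilities (p_w, p_b). Under that assumption, the standard factorial-moment identity for binomial counts recovers E[p_w^j p_b^k] for any j and k no larger than the number of applications sent per group. The two job types, their shares, and every function name below are hypothetical and chosen only for illustration; they are not the paper's estimates or code.

```python
import math
import random


def falling(x, k):
    """Falling factorial x * (x - 1) * ... * (x - k + 1); equals 1 when k == 0."""
    out = 1
    for i in range(k):
        out *= x - i
    return out


def simulate_jobs(n_jobs, apps_per_group, types, rng):
    """Simulate (white callbacks, black callbacks) per job.

    `types` is a list of (share, p_white, p_black) tuples (hypothetical values).
    """
    shares = [share for share, _, _ in types]
    jobs = []
    for _ in range(n_jobs):
        _, p_white, p_black = rng.choices(types, weights=shares)[0]
        c_white = sum(rng.random() < p_white for _ in range(apps_per_group))
        c_black = sum(rng.random() < p_black for _ in range(apps_per_group))
        jobs.append((c_white, c_black))
    return jobs


def moment(jobs, apps_per_group, j, k):
    """Estimate E[p_w^j * p_b^k] from callback counts via factorial moments."""
    if j > apps_per_group or k > apps_per_group:
        raise ValueError("moment order exceeds the number of applications sent")
    scale = falling(apps_per_group, j) * falling(apps_per_group, k)
    total = sum(falling(cw, j) * falling(cb, k) for cw, cb in jobs)
    return total / (len(jobs) * scale)


# Hypothetical population: 80% of jobs treat groups equally, 20% favor white names.
TYPES = [(0.8, 0.10, 0.10), (0.2, 0.30, 0.10)]
APPS = 2  # applications per group, so moments up to order 2 are recoverable

rng = random.Random(0)
jobs = simulate_jobs(200_000, APPS, TYPES, rng)

mean_gap = moment(jobs, APPS, 1, 0) - moment(jobs, APPS, 0, 1)
second_moment_gap = (
    moment(jobs, APPS, 2, 0) - 2 * moment(jobs, APPS, 1, 1) + moment(jobs, APPS, 0, 2)
)
sd_gap = math.sqrt(max(second_moment_gap - mean_gap**2, 0.0))

# For these made-up parameters the true values are mean 0.04 and sd 0.08.
print(f"estimated mean gap: {mean_gap:.4f}, estimated sd of gap: {sd_gap:.4f}")
```

With only one application per group, `moment(jobs, 1, 2, 0)` raises an error, which mirrors the point that second and higher moments require more than one application per group.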

If this is right

Employer heterogeneity in discrimination is large: the standard deviation of group-specific callback gaps is about twice the mean gap.
At least 85 percent of jobs showing the pattern of contacting both white applications and neither black application are discriminating.
Under the two-type model, sending ten applications per job detects 7-10 percent of discriminators while keeping false accusations below 0.2 percent.
A minimax rule that acknowledges partial identification produces more investigations but higher error rates than the baseline two-type rule.
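
The error-rate tradeoff in a two-type model can be sketched with exact binomial arithmetic. The example below assumes ten applications split as five white-named and five black-named, a threshold rule that flags a job when white callbacks exceed black callbacks by at least `min_gap`, and made-up callback probabilities. The split, the rule, and the parameter values are all assumptions for illustration; they are not taken from the paper.

```python
from math import comb


def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def flag_probability(n_white, n_black, p_white, p_black, min_gap):
    """P(white callbacks - black callbacks >= min_gap) for a single job."""
    total = 0.0
    for cw in range(n_white + 1):
        for cb in range(n_black + 1):
            if cw - cb >= min_gap:
                total += binom_pmf(n_white, cw, p_white) * binom_pmf(n_black, cb, p_black)
    return total


# Hypothetical two-type parameters (NOT the paper's estimates).
P_EQUAL = 0.10                            # non-discriminators: same rate for both groups
P_WHITE_DISC, P_BLACK_DISC = 0.30, 0.10   # discriminators: higher rate for white names


def error_rates(apps_per_group, min_gap):
    """Return (share of discriminators flagged, share of non-discriminators flagged)."""
    detect = flag_probability(
        apps_per_group, apps_per_group, P_WHITE_DISC, P_BLACK_DISC, min_gap
    )
    false_accuse = flag_probability(
        apps_per_group, apps_per_group, P_EQUAL, P_EQUAL, min_gap
    )
    return detect, false_accuse


rates = {gap: error_rates(5, gap) for gap in range(1, 6)}
for gap, (detect, false_accuse) in rates.items():
    print(f"min_gap={gap}: detect={detect:.4f}, false accusation={false_accuse:.6f}")
```

Raising `min_gap` lowers both rates, so a stricter rule trades missed discriminators for fewer false accusations.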

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Audit designs could be adjusted to send more applications per job to improve detection accuracy without raising false-positive rates much.
The same bounding approach might be applied to other protected characteristics such as sex or age if similar multi-application experiments are run.
Enforcement agencies could use the derived decision rules to prioritize which employers to investigate based on observed callback sequences.

Load-bearing premise

The joint distribution of callback rates across protected groups can be bounded from the observed moments conditional on the number of applications sent, without parametric restrictions; the two-type model is invoked only for the decision-rule error rates.

What would settle it

An experiment that sends a varying number of applications and produces callback patterns whose implied bounds on the discriminating share are violated by the observed frequency of all-white or all-black callbacks.

Original abstract

We develop tools for utilizing correspondence experiments to detect illegal discrimination by individual employers. Employers violate US employment law if their propensity to contact applicants depends on protected characteristics such as race or sex. We establish identification of higher moments of the causal effects of protected characteristics on callback rates as a function of the number of fictitious applications sent to each job ad. These moments are used to bound the fraction of jobs that illegally discriminate. Applying our results to three experimental datasets, we find evidence of significant employer heterogeneity in discriminatory behavior, with the standard deviation of gaps in job-specific callback probabilities across protected groups averaging roughly twice the mean gap. In a recent experiment manipulating racially distinctive names, we estimate that at least 85% of jobs that contact both of two white applications and neither of two black applications are engaged in illegal discrimination. To assess the tradeoff between type I and II errors presented by these patterns, we consider the performance of a series of decision rules for investigating suspicious callback behavior under a simple two-type model that rationalizes the experimental data. Though, in our preferred specification, only 17% of employers are estimated to discriminate on the basis of race, we find that an experiment sending 10 applications to each job would enable accurate detection of 7-10% of discriminators while falsely accusing fewer than 0.2% of non-discriminators. A minimax decision rule acknowledging partial identification of the joint distribution of callback rates yields higher error rates but more investigations than our baseline two-type model. Our results suggest illegal labor market discrimination can be reliably monitored with relatively small modifications to existing audit designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a way to bound the share of discriminating employers from audit data by identifying higher moments of callback effects from the number of applications sent per job, then uses those bounds to evaluate detection rules; the 85% figure is the headline empirical claim.

The new piece is identifying higher moments of job-specific callback gaps as a function of how many applications are sent, then using those to partially identify the share of employers with different callback rates by race. They apply it to three datasets and report that at least 85% of the jobs showing the 2-white/0-black pattern are discriminating. They also run a two-type model to check error rates for simple investigation rules and compare it to a minimax rule that respects the partial identification. That last step is useful because it shows how the bounds affect enforcement tradeoffs. The heterogeneity result (standard deviation of gaps roughly twice the mean) is a clean descriptive finding from the same moments. The math on the moments looks like it could be checked directly from the identification argument, and the empirical application is straightforward once the moments are in hand. The soft spot is whether the 85% lower bound on the 2/0 jobs actually follows from the observed moments without extra restrictions on the joint distribution of callback rates; if mass near the diagonal can still match the moments, the bound loosens. The two-type model is only for the decision-rule part, so it does not prop up the bound itself. The paper is aimed at people working on audit methods and labor-market enforcement. It is worth sending to referees because the identification step is new relative to the audit literature and the policy angle is direct, even though the bound needs a close look at the full derivation.

Referee Report

2 major / 2 minor

Summary. The paper develops identification results for higher moments of the job-specific causal effects of protected characteristics on callback rates in correspondence experiments, as a function of the number of applications sent. These moments are used to partially identify and bound the fraction of employers engaged in illegal discrimination. In an application to three experimental datasets, including one with racially distinctive names, the authors report substantial heterogeneity in discriminatory behavior and estimate that at least 85% of jobs exhibiting callbacks to both of two white applications and none to two black applications are discriminating. They then evaluate the type I/II error tradeoffs of various decision rules for targeting investigations under a two-type model that rationalizes the data, concluding that modest increases in applications per job (to 10) can detect 7-10% of discriminators while keeping false accusations below 0.2%.

Significance. If the partial identification arguments hold without hidden parametric restrictions, the paper offers a rigorous framework for turning audit-study data into evidence of individual-employer discrimination with quantifiable bounds and error rates. This would be a meaningful contribution to the econometrics of discrimination and to enforcement design, as it moves beyond average effects to heterogeneity and actionable decision rules. The empirical heterogeneity finding (standard deviation of gaps roughly twice the mean) and the low false-positive rates under the preferred specification are potentially policy-relevant if robust.

major comments (2)

[Abstract and identification results (likely §3)] The 85% lower bound on the share of discriminating jobs (p_w ≠ p_b) among those showing the 2/0 callback pattern is presented as following from higher moments of the job-specific effects identified conditional on the number of applications sent. However, it is not shown that the identified set of joint distributions (p_w, p_b) excludes distributions with substantial mass on the diagonal (p_w = p_b) that still reproduce the observed moments; if such distributions remain in the identified set, the 85% figure does not follow from the moments alone.
[Decision-rule section (likely §5 or §6)] The two-type model is introduced to rationalize the experimental data and to compute error rates for the 10-application rule, but the text does not explicitly confirm that the preceding 85% bound is obtained without reference to this model or any other parametric restriction on the joint distribution beyond the moment conditions.

minor comments (2)

[Abstract] Clarify in the abstract and introduction whether the 85% bound is obtained solely from the nonparametric moment identification or whether it incorporates any features of the two-type model used later.
[Identification section (likely §3)] Provide a brief statement of the exact moment conditions used for the partial identification argument and the resulting identified set for the 2/0 subpopulation.
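
The diagonal-mass concern raised in the major comments can be made concrete with a simple symmetry argument. This is a sketch under stated assumptions, not the paper's derivation: it assumes two applications per group and callbacks that are independent Bernoulli draws given a job's (p_w, p_b), with C_w and C_b denoting the white and black callback counts. The resulting bound is valid under those assumptions but need not be sharp, and it is not claimed to reproduce the 85% figure.

```latex
% Pattern probabilities for a job with callback rates (p_w, p_b):
\begin{align*}
\Pr(C_w = 2,\, C_b = 0 \mid p_w, p_b) &= p_w^2 (1 - p_b)^2, \\
\Pr(C_w = 0,\, C_b = 2 \mid p_w, p_b) &= (1 - p_w)^2 p_b^2 .
\end{align*}
% On the diagonal p_w = p_b = p both expressions equal p^2 (1 - p)^2,
% so non-discriminating jobs generate the 2/0 and 0/2 patterns equally often:
\begin{align*}
\Pr(C_w = 2,\, C_b = 0,\, p_w = p_b)
  &= \Pr(C_w = 0,\, C_b = 2,\, p_w = p_b)
  \le \Pr(C_w = 0,\, C_b = 2), \\
\Pr(p_w \ne p_b \mid C_w = 2,\, C_b = 0)
  &\ge 1 - \frac{\Pr(C_w = 0,\, C_b = 2)}{\Pr(C_w = 2,\, C_b = 0)} .
\end{align*}
```

Both probabilities on the right-hand side are observable callback frequencies, so diagonal mass is limited by how rarely the reverse 0/2 pattern appears relative to the 2/0 pattern.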

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below. The 85% bound is obtained from the nonparametric moment conditions alone; we will revise to make this separation and the exclusion of diagonal mass fully explicit.

Referee: Abstract and the identification results (likely §3): the 85% lower bound on the share of discriminating jobs (p_w ≠ p_b) among those showing the 2/0 callback pattern is presented as following from higher moments of the job-specific effects identified conditional on the number of applications sent. However, it is not shown that the identified set of joint distributions (p_w, p_b) excludes distributions with substantial mass on the diagonal (p_w = p_b) that still reproduce the observed moments; if such distributions remain in the identified set, the 85% figure does not follow from the moments alone.

Authors: Section 3 establishes sharp nonparametric bounds on the share of jobs with p_w ≠ p_b using only the identified higher moments of job-specific effects. These moments are inconsistent with substantial mass on the diagonal while matching the observed callback frequencies for the 2/0 pattern; the 85% figure is the lower bound on off-diagonal mass implied by the moment conditions. We will revise the text to include an explicit statement and brief argument confirming that the identified set rules out large diagonal mass without additional parametric restrictions. (Revision: yes)

Referee: Decision-rule section (likely §5 or §6): the two-type model is introduced to rationalize the experimental data and to compute error rates for the 10-application rule, but the text does not explicitly confirm that the preceding 85% bound is obtained without reference to this model or any other parametric restriction on the joint distribution beyond the moment conditions.

Authors: The 85% bound is presented in the application of the identification results (prior to the two-type model) and relies exclusively on the moment conditions from §3. The two-type model is used only for the subsequent error-tradeoff calculations. We will add an explicit clarifying sentence in the decision-rule section stating that the bound does not invoke the two-type model or other parametric restrictions. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: bounds derived from identified higher moments independent of two-type model

The paper first identifies higher moments of job-specific callback gaps as a function of the number of applications sent, then uses those moments to partially identify bounds on the share of jobs with p_w ≠ p_b among those exhibiting the 2/0 callback pattern. This step relies on the observed conditional moments alone and does not invoke the two-type model. The two-type model appears only later, solely to rationalize data for evaluating decision-rule error rates. No equation reduces the 85% bound to a fitted parameter by construction, no self-citation supplies a load-bearing uniqueness result, and the central identification argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract invokes standard econometric identification from experimental variation in application counts and a two-type behavioral model; no explicit free parameters or invented entities are named.

axioms (2)

domain assumption: Higher moments of employer-specific callback probabilities are identified as a function of the number of fictitious applications sent per job ad. (Stated as the basis for bounding the share of discriminating jobs.)
domain assumption: A two-type model rationalizes the experimental callback patterns for evaluating investigation rules. (Used to assess type I/II error tradeoffs.)

pith-pipeline@v0.9.0 · 5820 in / 1311 out tokens · 18446 ms · 2026-05-24T21:01:32.290694+00:00 · methodology
