Challenging Data Aggregation Practices: A MAIHDA Study of Asian Student Outcomes in Introductory Physics
Pith reviewed 2026-05-21 22:21 UTC · model grok-4.3
The pith
Aggregating all Asian students into one group in introductory physics courses hides a performance spread of about 15 percentage points across 19 subgroups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the aggregated Asian stratum conceals performance differences among 19 Asian subgroups on the Force Concept Inventory and the Force and Motion Conceptual Evaluation. Subgroup predicted means spanned 15.8 percentage points on the pretest and 15.4 percentage points on the posttest. The lowest-performing subgroup's posttest mean was roughly equal to the highest-performing subgroup's pretest mean. The mean absolute error between the Asian stratum estimate and the 19 subgroup estimates was 3.3 percentage points at pretest and 3.6 percentage points at posttest.
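The two summary statistics in this claim, the span and the aggregation error, can be computed as sketched here. The sketch assumes the mean absolute error is an unweighted average, over subgroups, of the absolute difference between each subgroup's predicted mean and the single Asian-stratum estimate; the paper may weight subgroups differently. The five means are hypothetical placeholders, not the study's 19 estimates.

```python
from statistics import mean


def span(subgroup_means):
    """Range (max minus min) of subgroup predicted means, in percentage points."""
    return max(subgroup_means) - min(subgroup_means)


def aggregation_mae(stratum_mean, subgroup_means):
    """Unweighted mean absolute error of reporting one aggregated stratum mean
    in place of each subgroup's own predicted mean (assumed definition)."""
    return mean(abs(m - stratum_mean) for m in subgroup_means)


# Hypothetical illustrative values (NOT the study's estimates), percent correct.
subgroup_pre = [30.0, 34.0, 38.0, 42.0, 46.0]
stratum_pre = 38.0

print(span(subgroup_pre))                          # 16.0
print(aggregation_mae(stratum_pre, subgroup_pre))  # 4.8
```

Note that the aggregated estimate can sit exactly at the center of the subgroups and still misstate a typical subgroup by several points, which is the sense in which aggregation "conceals" variation.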
What carries the argument
Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA), which quantifies both average performance for the aggregated group and the spread of outcomes across the 19 finer subgroups while accounting for course-level clustering.
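For orientation, a generic MAIHDA specification is sketched here as an assumption, not as the paper's exact model. Students i are nested in identity strata j (the single Asian stratum, or each of the 19 subgroups), and a course random intercept for course c is added as a cross-classified term to reflect course-level clustering. One such model would be fitted per outcome (pretest or posttest score), and a stratum's predicted mean is the fixed intercept plus that stratum's estimated random effect.

```latex
\begin{aligned}
y_{ijc} &= \beta_0 + u_j + v_c + \varepsilon_{ijc}, \\
u_j &\sim N(0, \sigma_u^2), \quad v_c \sim N(0, \sigma_v^2), \quad \varepsilon_{ijc} \sim N(0, \sigma_\varepsilon^2), \\
\hat{\mu}_j &= \hat{\beta}_0 + \hat{u}_j .
\end{aligned}
```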
If this is right
- Fine-grained identity data collection can reveal learning gaps that broad categories average away.
- Aggregation errors of 3 to 4 percentage points correspond to approximately 4 to 5 weeks of learning in a 16-week course.
- The single Asian category can produce misleading estimates of both overall performance and equity needs.
- Subgroup-level analysis supports more targeted identification of students who may need additional support.
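The weeks-of-instruction conversion can be made explicit under one simplifying assumption, which is not attributed to the paper here: scores rise linearly over a 16-week course. Run in reverse, the abstract's statement that a 3.3-point error equals roughly 4 to 5 weeks implies a full-semester gain of about 10.6 to 13.2 percentage points. The function names are illustrative.

```python
def weeks_equivalent(gap_pp, semester_gain_pp, course_weeks=16):
    """Convert a score gap (pp) into weeks of instruction, assuming learning
    accrues linearly across the course (a simplifying assumption)."""
    return gap_pp / semester_gain_pp * course_weeks


def implied_semester_gain(gap_pp, weeks, course_weeks=16):
    """Full-semester gain (pp) implied when `gap_pp` is said to equal `weeks`
    of instruction, under the same linear-learning assumption."""
    return gap_pp * course_weeks / weeks


# Source figures: a 3.3 pp error described as roughly 4-5 weeks of a 16-week course.
for w in (4, 5):
    print(w, round(implied_semester_gain(3.3, w), 2))
# 4 13.2
# 5 10.56
```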
Where Pith is reading between the lines
- The same aggregation problem may hide variation within other broad racial or ethnic categories used in education research.
- Departments could test whether adopting detailed demographic questions changes how they allocate resources or design interventions.
- Future analyses might check whether the observed subgroup gaps remain after accounting for differences in prior schooling or socioeconomic background.
Load-bearing premise
The analysis assumes that self-reported identities define 19 distinct Asian subgroups, each with a sample large enough for the multilevel model to detect real differences without major bias from reporting or clustering effects.
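The sample-size concern can be illustrated with the standard partial-pooling (empirical Bayes) formula for a random-intercept model. This is a simplification: it ignores course clustering and treats the variance components as known. Small strata are pulled strongly toward the grand mean, so their predicted means are less extreme than their raw means. All variances and means below are hypothetical.

```python
def shrinkage_factor(n, var_between, var_within):
    """Reliability weight for a stratum of n students in a random-intercept
    model: tau^2 / (tau^2 + sigma^2 / n). Approaches 1 as n grows."""
    return var_between / (var_between + var_within / n)


def pooled_mean(raw_mean, grand_mean, n, var_between, var_within):
    """Partially pooled stratum mean: the raw mean pulled toward the grand
    mean, more strongly when the stratum is small."""
    lam = shrinkage_factor(n, var_between, var_within)
    return grand_mean + lam * (raw_mean - grand_mean)


# Hypothetical variance components (pp^2), NOT estimates from the study.
tau2, sigma2 = 25.0, 400.0
for n in (10, 50, 500):
    # A stratum whose raw mean (50) sits 10 pp above a grand mean of 40.
    print(n, round(pooled_mean(50.0, 40.0, n, tau2, sigma2), 2))
# 10 43.85
# 50 47.58
# 500 49.69
```

Because shrinkage compresses small-sample strata toward the center, per-subgroup sample sizes and interval estimates are needed to judge how much of the reported spread is well determined.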
What would settle it
A new dataset from similar introductory physics courses that collects the same 19-subgroup identities, has adequate sample sizes in each, and shows no meaningful performance variation across those subgroups after applying the same MAIHDA method.
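One way a replication could operationalize "no meaningful performance variation" is the variance partition coefficient (VPC), commonly reported in MAIHDA studies as a measure of discriminatory accuracy: the share of outcome variance that lies between strata. This sketch assumes a two-level random-intercept model with known variance components; the 0.01 cutoff and the variance values are hypothetical, chosen only for illustration.

```python
def vpc(var_between_strata, var_within_strata):
    """Variance partition coefficient: fraction of total outcome variance
    lying between identity strata in a two-level random-intercept model."""
    return var_between_strata / (var_between_strata + var_within_strata)


def negligible_variation(vpc_value, threshold=0.01):
    """Hypothetical decision rule: treat strata as indistinguishable when the
    VPC falls below `threshold` (the cutoff is an illustrative assumption)."""
    return vpc_value < threshold


# Hypothetical variance components in pp^2, not estimates from the study.
print(round(vpc(25.0, 400.0), 3))             # 0.059
print(negligible_variation(vpc(1.0, 400.0)))  # True
```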
Original abstract
Aggregation of Asian student data can reinforce the model minority myth by obscuring educational disparities among Asian student subgroups. This study investigated variation in conceptual physics knowledge across Asian racial and ethnic subgroups using data from the LASSO platform, analyzing responses from 16,810 students enrolled in 493 introductory calculus-based physics courses across 64 U.S. institutions. We applied Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy to examine predicted pre- and posttest performance on the Force Concept Inventory and Force and Motion Conceptual Evaluation. The findings revealed performance differences among 19 Asian subgroups that the Asian stratum (the single aggregated Asian group) concealed. Subgroup predicted means spanned 15.8 percentage points on the pretest and 15.4 percentage points on the posttest. The lowest-performing subgroup's posttest mean was roughly equal to the highest-performing subgroup's pretest mean, indicating a performance gap of about a full semester of instruction. Mean absolute error between the Asian Stratum and the 19-subgroup estimates was 3.3 percentage points at pretest and 3.6 percentage points at posttest, equivalent to approximately 4-5 weeks of learning in a 16-week course. These findings demonstrate that fine-grained identity data collection can support identifying disparities that common aggregation practices conceal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA) to responses from 16,810 students in 493 introductory calculus-based physics courses across 64 institutions. It claims that the single aggregated 'Asian' category conceals substantial variation among 19 Asian subgroups, with predicted means on the Force Concept Inventory and Force and Motion Conceptual Evaluation spanning 15.8 percentage points on the pretest and 15.4 on the posttest; the lowest subgroup posttest mean roughly equals the highest subgroup pretest mean, and mean absolute errors versus the aggregate are 3.3–3.6 pp.
Significance. If the subgroup-specific predicted means are shown to be robust, the work provides concrete evidence that standard racial aggregation practices in physics education research can mask disparities equivalent to several weeks of instruction, supporting calls for finer-grained identity data collection to improve equity analyses.
major comments (2)
- [Methods] Methods section: The manuscript provides no table or text reporting the number of students per Asian subgroup (or per course within subgroups). With only a fraction of the 16,810 students identifying as Asian, several of the 19 subgroups are likely to have n < 50; in MAIHDA this produces partial pooling that shrinks estimates toward the grand mean, which directly threatens the reliability of the reported 15.8 pp and 15.4 pp spans and the claim that the lowest posttest mean roughly equals the highest pretest mean.
- [Results] Results and model-specification paragraphs: No details are given on the exact multilevel model (e.g., random intercepts for courses, fixed effects for pretest/posttest, handling of missing data, or convergence diagnostics). Without these, it is impossible to evaluate whether the predicted means accurately capture heterogeneity or are biased by course-level clustering or self-report measurement error in identity categories.
minor comments (1)
- [Abstract] Abstract: Adding one sentence on the range of subgroup sample sizes and the basic MAIHDA random-effects structure would allow readers to assess the strength of the central claim without consulting the full methods.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and transparency of our manuscript. We address each major comment in turn below.
Point-by-point responses
- Referee: Methods section: The manuscript provides no table or text reporting the number of students per Asian subgroup (or per course within subgroups). With only a fraction of the 16,810 students identifying as Asian, several of the 19 subgroups are likely to have n < 50; in MAIHDA this produces partial pooling that shrinks estimates toward the grand mean, which directly threatens the reliability of the reported 15.8 pp and 15.4 pp spans and the claim that the lowest posttest mean roughly equals the highest pretest mean.
Authors: We agree with the referee that providing the sample sizes per subgroup is crucial for readers to assess the robustness of our findings. In the revised manuscript, we have included a new table (Table 1) detailing the number of students in each Asian subgroup for the pretest and posttest analyses. We also note that while MAIHDA does involve partial pooling for smaller groups, this is by design to improve estimate stability, and the substantial variation we report (15.8 pp span) persists even after accounting for this. We have added a sentence in the discussion acknowledging that subgroups with smaller n have wider credible intervals, which we now report in the supplementary materials. Regarding per-course breakdowns within subgroups, we believe this would be overly granular and not add substantial value given the focus on subgroup heterogeneity, but we can provide aggregate course-level statistics if requested. revision: yes
- Referee: Results and model-specification paragraphs: No details are given on the exact multilevel model (e.g., random intercepts for courses, fixed effects for pretest/posttest, handling of missing data, or convergence diagnostics). Without these, it is impossible to evaluate whether the predicted means accurately capture heterogeneity or are biased by course-level clustering or self-report measurement error in identity categories.
Authors: We appreciate this feedback and have substantially expanded the Methods section in the revision to provide the full model specification. The model includes random intercepts for courses to account for clustering at the course level, fixed effects for the 19 Asian subgroups, and separate models for pretest and posttest. Missing data were handled through listwise deletion consistent with the LASSO database protocols. We have also added convergence diagnostics, including Gelman-Rubin R-hat statistics below 1.01 for all parameters, to the supplementary information. These additions should allow readers to better evaluate the model's handling of heterogeneity and potential biases. We do not believe self-report measurement error in identity categories introduces systematic bias in this context, as the subgroups are based on self-identification, but we have noted this as a limitation. revision: yes
Circularity Check
No significant circularity in empirical MAIHDA data analysis
This is a purely empirical study applying standard multilevel modeling (MAIHDA) to observed pretest and posttest scores from 16,810 students. The reported subgroup means and spans are direct statistical outputs of a model fitted to real data; no equation, prediction, or central claim reduces by construction to a fitted parameter, a self-citation chain, or a definitional tautology. The derivation chain consists of data collection, subgroup definition from self-reported identities, and model estimation; none of these steps is self-referential or rests solely on the authors' prior work. Because the central results can be checked against external data, the study receives the default non-finding for data-driven education research.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA) is an appropriate method for detecting subgroup heterogeneity in educational outcomes.