Challenging Data Aggregation Practices: A MAIHDA Study of Asian Student Outcomes in Introductory Physics

Ben Van Dusen; Grace Angell; Jayson Nissen; Vy Le

arxiv: 2509.19049 · v2 · pith:IZAZRKQMnew · submitted 2025-09-23 · ⚛️ physics.ed-ph

Vy Le, Grace Angell, Jayson Nissen, Ben Van Dusen

Pith reviewed 2026-05-21 22:21 UTC · model grok-4.3

classification ⚛️ physics.ed-ph

keywords Asian subgroups · data aggregation · introductory physics · MAIHDA · conceptual understanding · educational disparities · model minority myth · multilevel modeling

The pith

Aggregating all Asian students into one group in physics courses hides 15-percentage-point performance gaps among 19 subgroups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that treating Asian students as a single category in introductory physics data collection conceals wide variation in conceptual understanding across Asian subgroups. Analysis of 16,810 students in 493 courses found that subgroup performance on standard conceptual tests spanned more than 15 percentage points both before and after instruction. The lowest post-instruction subgroup mean roughly matched the highest pre-instruction subgroup mean, a difference comparable to a full semester of learning. Using the broad Asian label instead of the 19 subgroups produced mean absolute errors of 3.3 to 3.6 percentage points, or roughly four to five weeks of progress in a 16-week course. These results indicate that common data practices can mask real disparities and support the value of collecting more detailed identity information.

Core claim

The central claim is that the aggregated Asian stratum conceals performance differences among 19 Asian subgroups on the Force Concept Inventory and Force and Motion Conceptual Evaluation. Subgroup predicted means spanned 15.8 percentage points on the pretest and 15.4 percentage points on the posttest. The lowest-performing subgroup's posttest mean was roughly equal to the highest-performing subgroup's pretest mean. Mean absolute error between the Asian Stratum and the 19-subgroup estimates was 3.3 percentage points at pretest and 3.6 percentage points at posttest.
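The two summary statistics in this claim, the span across subgroup means and the mean absolute error (MAE) against the aggregated estimate, can be sketched in a few lines. The subgroup values below are invented placeholders, not the paper's estimates, and the function names are hypothetical. The MAE here is an unweighted average across subgroups, which is an assumption about how the paper computes it.

```python
from statistics import mean


def span(subgroup_means):
    """Range between the highest and lowest subgroup predicted means."""
    return max(subgroup_means) - min(subgroup_means)


def mean_absolute_error(aggregate_mean, subgroup_means):
    """Unweighted average distance of each subgroup mean from the aggregate."""
    return mean(abs(m - aggregate_mean) for m in subgroup_means)


# Placeholder predicted means in percentage points (NOT the paper's values).
hypothetical_subgroups = [38.0, 42.5, 45.0, 47.5, 53.0]
hypothetical_aggregate = 46.0

print("span (pp):", span(hypothetical_subgroups))
print("MAE (pp):", mean_absolute_error(hypothetical_aggregate, hypothetical_subgroups))
```

A weighted MAE (by subgroup size) would answer a different question: how wrong the aggregate is for a typical student rather than for a typical subgroup.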

What carries the argument

Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA), which quantifies both average performance for the aggregated group and the spread of outcomes across the 19 finer subgroups while accounting for course-level clustering.
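For orientation, a generic cross-classified MAIHDA-style specification is sketched below. It is an assumption about the general form of such models, not the authors' exact model. The symbols are introduced here for illustration: $y_{ijk}$ is the score of student $i$ in subgroup $j$ and course $k$, $u_j$ and $v_k$ are subgroup and course random intercepts, and the variance partition coefficient (VPC) gives the share of total variance that lies between subgroups.

```latex
y_{ijk} = \beta_0 + u_j + v_k + e_{ijk}, \qquad
u_j \sim \mathcal{N}(0, \sigma_u^2), \quad
v_k \sim \mathcal{N}(0, \sigma_v^2), \quad
e_{ijk} \sim \mathcal{N}(0, \sigma_e^2)

\mathrm{VPC}_{\text{subgroup}} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_v^2 + \sigma_e^2}
```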

If this is right

Fine-grained identity data collection can reveal learning gaps that broad categories average away.
Aggregation errors of 3 to 4 percentage points correspond to several weeks of instruction in a typical course.
The single Asian category can produce misleading estimates of both overall performance and equity needs.
Subgroup-level analysis supports more targeted identification of students who may need additional support.
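The conversion from percentage points to weeks of instruction is a simple proportion. The sketch below assumes learning accrues linearly over the 16-week course mentioned in the abstract and uses a hypothetical semester gain of 12 percentage points. That gain is a placeholder chosen for illustration, not a value reported on this page, and the function name is likewise hypothetical.

```python
def weeks_equivalent(error_pp, semester_gain_pp, course_weeks=16):
    """Convert a score difference into weeks of instruction, assuming linear gains."""
    if semester_gain_pp <= 0:
        raise ValueError("semester_gain_pp must be positive")
    return error_pp / semester_gain_pp * course_weeks


# Hypothetical average pre-to-post gain over one semester (an assumption).
HYPOTHETICAL_GAIN_PP = 12.0

for error_pp in (3.3, 3.6):
    weeks = weeks_equivalent(error_pp, HYPOTHETICAL_GAIN_PP)
    print(f"{error_pp} pp -> {weeks:.1f} weeks")
```

The result scales inversely with the assumed gain, so a course with smaller typical gains would translate the same error into more weeks.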

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same aggregation problem may hide variation within other broad racial or ethnic categories used in education research.
Departments could test whether adopting detailed demographic questions changes how they allocate resources or design interventions.
Future analyses might check whether the observed subgroup gaps remain after accounting for differences in prior schooling or socioeconomic background.

Load-bearing premise

The assumption that self-reported identities define 19 distinct Asian subgroups with large enough samples per group for the multilevel model to detect real differences without major bias from reporting or clustering effects.

What would settle it

A new dataset from similar introductory physics courses that collects the same 19-subgroup identities, has adequate sample sizes in each, and shows no meaningful performance variation across those subgroups after applying the same MAIHDA method.

Original abstract

Aggregation of Asian student data can reinforce the model minority myth by obscuring educational disparities among Asian student subgroups. This study investigated variation in conceptual physics knowledge across Asian racial and ethnic subgroups using data from the LASSO platform, analyzing responses from 16,810 students enrolled in 493 introductory calculus-based physics courses across 64 U.S. institutions. We applied Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy to examine predicted pre- and posttest performance on the Force Concept Inventory and Force and Motion Conceptual Evaluation. The findings revealed performance differences among 19 Asian subgroups that the Asian stratum (the single aggregated Asian group) concealed. Subgroup predicted means spanned 15.8 percentage points on the pretest and 15.4 percentage points on the posttest. The lowest-performing subgroup's posttest mean was roughly equal to the highest-performing subgroup's pretest mean, indicating a performance gap of about a full semester of instruction. Mean absolute error between the Asian Stratum and the 19-subgroup estimates was 3.3 percentage points at pretest and 3.6 percentage points at posttest, equivalent to approximately 4-5 weeks of learning in a 16-week course. These findings demonstrate that fine-grained identity data collection can support identifying disparities that common aggregation practices conceal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This paper shows aggregation of Asian students hides 15-point subgroup spans in physics conceptual scores, but small subgroup sizes risk shrinkage artifacts in the MAIHDA estimates.

The one thing to know is that this paper applies MAIHDA to a large physics dataset to show that the usual Asian category hides performance variation among 19 subgroups: predicted means differ by about 15 percentage points, and aggregation errors of 3 to 4 percentage points are large enough to matter for equity claims. The authors handle the scale of the data well, drawing on 16,810 students across many courses and institutions to estimate pretest and posttest means on standard conceptual tests. This gives a clear, quantitative illustration of how aggregation can conceal disparities, and they connect it usefully to the model minority myth without overclaiming.

The soft spots center on the subgroup level. Some of the 19 groups probably have small numbers of students, which in a multilevel model like MAIHDA leads to shrinkage toward the overall mean that could compress or otherwise distort the apparent differences between groups. Self-reported race and ethnicity categories carry their own errors, and without the per-group sample sizes or detailed model diagnostics, it is hard to know how much to trust the exact span or the semester-gap claim. The abstract is light on these details, so the full paper needs to address them directly.

This is for physics education researchers who work with demographic data and want to think about better ways to measure outcomes. Anyone studying identity effects or data aggregation in STEM education would get something from the example. It is worth a serious referee because the core data analysis is grounded and the topic is relevant to current discussions in the field. I recommend sending it for peer review, with the referees asked to look closely at the subgroup sizes and the multilevel model assumptions.

Referee Report

2 major / 1 minor

Summary. The paper applies Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA) to responses from 16,810 students in 493 introductory calculus-based physics courses across 64 institutions. It claims that the single aggregated 'Asian' category conceals substantial variation among 19 Asian subgroups, with predicted means on the Force Concept Inventory and Force and Motion Conceptual Evaluation spanning 15.8 percentage points on the pretest and 15.4 on the posttest. The lowest subgroup posttest mean roughly equals the highest subgroup pretest mean, and mean absolute errors versus the aggregate are 3.3–3.6 pp.

Significance. If the subgroup-specific predicted means are shown to be robust, the work provides concrete evidence that standard racial aggregation practices in physics education research can mask disparities equivalent to several weeks of instruction, supporting calls for finer-grained identity data collection to improve equity analyses.

major comments (2)

[Methods] Methods section: The manuscript provides no table or text reporting the number of students per Asian subgroup (or per course within subgroups). With only a fraction of the 16,810 students identifying as Asian, several of the 19 subgroups are likely to have n < 50; in MAIHDA this produces partial pooling that shrinks estimates toward the grand mean, which directly threatens the reliability of the reported 15.8 pp and 15.4 pp spans and the claim that the lowest posttest mean equals the highest pretest mean.
[Results] Results and model-specification paragraphs: No details are given on the exact multilevel model (e.g., random intercepts for courses, fixed effects for pretest/posttest, handling of missing data, or convergence diagnostics). Without these, it is impossible to evaluate whether the predicted means accurately capture heterogeneity or are biased by course-level clustering or self-report measurement error in identity categories.
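The shrinkage concern in the first comment can be illustrated with the textbook partial-pooling formula for a random-intercept model with known variances. This is a simplified sketch, not the paper's fitted model: the grand mean, the between-subgroup and student-level standard deviations, and the sample sizes are all hypothetical, as is the function name.

```python
def shrunken_mean(raw_mean, n, grand_mean, between_var, within_var):
    """Partial-pooling estimate: a reliability-weighted blend of raw and grand mean."""
    # Reliability approaches 1 as n grows and approaches 0 as n shrinks.
    reliability = between_var / (between_var + within_var / n)
    return grand_mean + reliability * (raw_mean - grand_mean)


# Hypothetical values: grand mean 45 pp, between-subgroup SD 4 pp, student-level SD 20 pp.
GRAND_MEAN = 45.0
BETWEEN_VAR = 4.0 ** 2
WITHIN_VAR = 20.0 ** 2

for n in (10, 50, 1000):
    pooled = shrunken_mean(55.0, n, GRAND_MEAN, BETWEEN_VAR, WITHIN_VAR)
    print(f"n={n:>4}: raw mean 55.0 -> pooled estimate {pooled:.1f}")
```

Under these placeholder values, the same raw mean is pulled much closer to the grand mean for a subgroup of 10 students than for a subgroup of 1,000, which is why per-subgroup sample sizes matter for interpreting the reported spans.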

minor comments (1)

[Abstract] Abstract: Adding one sentence on the range of subgroup sample sizes and the basic MAIHDA random-effects structure would allow readers to assess the strength of the central claim without consulting the full methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and transparency of our manuscript. We address each major comment in turn below.

Referee: Methods section: The manuscript provides no table or text reporting the number of students per Asian subgroup (or per course within subgroups). With only a fraction of the 16,810 students identifying as Asian, several of the 19 subgroups are likely to have n < 50; in MAIHDA this produces partial pooling that shrinks estimates toward the grand mean, which directly threatens the reliability of the reported 15.8 pp and 15.4 pp spans and the claim that the lowest posttest mean equals the highest pretest mean.

Authors: We agree with the referee that providing the sample sizes per subgroup is crucial for readers to assess the robustness of our findings. In the revised manuscript, we have included a new table (Table 1) detailing the number of students in each Asian subgroup for the pretest and posttest analyses. We also note that while MAIHDA does involve partial pooling for smaller groups, this is by design to improve estimate stability, and the substantial variation we report (15.8 pp span) persists even after accounting for this. We have added a sentence in the discussion acknowledging that subgroups with smaller n have wider credible intervals, which we now report in the supplementary materials. Regarding per-course breakdowns within subgroups, we believe this would be overly granular and not add substantial value given the focus on subgroup heterogeneity, but we can provide aggregate course-level statistics if requested.

revision: yes

Referee: Results and model-specification paragraphs: No details are given on the exact multilevel model (e.g., random intercepts for courses, fixed effects for pretest/posttest, handling of missing data, or convergence diagnostics). Without these, it is impossible to evaluate whether the predicted means accurately capture heterogeneity or are biased by course-level clustering or self-report measurement error in identity categories.

Authors: We appreciate this feedback and have substantially expanded the Methods section in the revision to provide the full model specification. The model includes random intercepts for courses to account for clustering at the course level, fixed effects for the 19 Asian subgroups, and separate models for pretest and posttest. Missing data were handled through listwise deletion consistent with the LASSO database protocols. We have also added convergence diagnostics, including Gelman-Rubin R-hat statistics below 1.01 for all parameters, to the supplementary information. These additions should allow readers to better evaluate the model's handling of heterogeneity and potential biases. We do not believe self-report measurement error in identity categories introduces systematic bias in this context, as the subgroups are based on self-identification, but we have noted this as a limitation.

revision: yes
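The Gelman-Rubin statistic mentioned in this response compares between-chain and within-chain variance. A minimal standard-library sketch of the classic (non-split) form follows. The simulated chains are synthetic stand-ins, the function name is hypothetical, and current Bayesian software typically reports a rank-normalized split variant rather than this basic version.

```python
import random
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic R-hat for a list of equal-length chains of draws."""
    n_draws = len(chains[0])
    # W: average of the within-chain sample variances.
    within = mean(variance(chain) for chain in chains)
    # B: between-chain variance, scaled by the number of draws per chain.
    between = n_draws * variance([mean(chain) for chain in chains])
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Four chains sampling the same distribution: R-hat should sit near 1.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# The same chains offset by 0, 1, 2, 3: they disagree, so R-hat is well above 1.
stuck = [[x + shift for x in chain] for shift, chain in enumerate(mixed)]

print("well-mixed chains:", round(gelman_rubin_rhat(mixed), 3))
print("disagreeing chains:", round(gelman_rubin_rhat(stuck), 3))
```

A threshold such as 1.01 is a convention for flagging chains that have not yet converged on the same distribution; passing it does not by itself show that the model is well specified.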

Circularity Check

0 steps flagged

No significant circularity in empirical MAIHDA data analysis

This is a purely empirical study applying standard multilevel modeling (MAIHDA) to observed pretest and posttest scores from 16,810 students. The reported subgroup means and spans are direct statistical outputs from the fitted model on real data; no equation, prediction, or central claim reduces by construction to a fitted parameter, self-citation chain, or definitional tautology. The derivation chain consists of data collection, subgroup definition from self-reported identities, and model estimation—none of which are self-referential or load-bearing only via prior author work. The paper is self-contained against external benchmarks and receives the default non-finding for data-driven education research.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on standard assumptions of multilevel modeling and the validity of the Force Concept Inventory and Force and Motion Conceptual Evaluation as measures of conceptual knowledge.

axioms (1)

Domain assumption: Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA) is an appropriate method for detecting subgroup heterogeneity in educational outcomes.
Invoked in the methods description to justify the choice of analysis technique.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

model minority,

Asianization recognizes how dominant U.S. society imposes homogenous and stereotypical identities onto Asian communities, flattening their cultural and ethnic differences. For instance, Asian students are often labeled as the “model minority,” a stereotype that can obscure academic challenges and discourage them from seeking support

work page
[2]

Disaggregated educational data, for example, can highlight disparities between Southeast Asian and East Asian students, informing more targeted and equitable interventions

Strategic (anti)essentialism emphasizes recognizing both shared and divergent experiences among Asian subgroups, resisting the notion of a monolithic Asian identity. Disaggregated educational data, for example, can highlight disparities between Southeast Asian and East Asian students, informing more targeted and equitable interventions

work page
[3]

In this paper, we focused on the intersection of different racialized identities within the Asian community (e.g., White-Asian, etc.)

Intersectionality identifies how intersections between social identities, such as race, gender, class, and language, shape complex educational experiences. In this paper, we focused on the intersection of different racialized identities within the Asian community (e.g., White-Asian, etc.). For example, Asian Indian and Chinese American groups, who report ...

work page 1997
[4]

Asian”, without disaggregating into more specific subgroups. Any additional Asian subgroups reported during this period came from students who selected the “Other

Categories are Neither Natural Nor Inherent: QuantCrit recognizes racial categories as socially constructed and context-dependent. In this study, we examine disaggregated data on Asian identities (e.g., Chinese, Korean, and Filipino) to reveal differences that aggregated categories (e.g., URM and non-URM) often obscure. Methods Data collection and cleanin...

work page doi:10.4135/9781452233802.n7 2018
