You See It, They Don't: An Exploratory Study of User-to-User Variation in Instagram Comments
Pith reviewed 2026-05-15 00:40 UTC · model grok-4.3
The pith
Instagram's AI comment ranking shows less user-to-user variation on news posts than on non-news posts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using four sock-puppet accounts that varied by gender and political leaning and collecting visible comments twice from separate VPN locations, the study finds that the visible comment sets on news posts vary less across users than the sets on non-news posts. Variation correlates more strongly with account-level metrics such as comment volume and follower count than with the simulated user attributes of gender, political leaning, or location.
What carries the argument
Sock-puppet accounts that simulate different user profiles to compare the ranked lists of visible comments returned by the platform on identical posts.
If this is right
- News posts may present a more uniform set of comments to all users than other content does.
- Account-level metrics such as follower count and comment volume explain variation in visible comments better than user attributes do.
- Personalization effects in comment ranking appear weaker for news than expected.
- Larger-scale audits are required to confirm whether these patterns hold across more accounts and topics.
Where Pith is reading between the lines
- More uniform comment sections on news could reduce the range of opinions users encounter even without strong personalization.
- If account metrics dominate ranking, then high-engagement accounts gain greater control over visible discourse.
- The same ranking logic may operate on other platforms and could be tested by repeating the sock-puppet method elsewhere.
- Direct comparison with real-user data would help check whether the four-account design captures the full range of platform behavior.
Load-bearing premise
Observed differences between accounts are produced by the AI ranking algorithm rather than by timing, random platform behavior, or other unmeasured factors.
What would settle it
Collecting comments from all four accounts at the exact same moment and finding identical top-comment lists on every news post would show that the ranking system does not create user-specific views.
Original abstract
In March 2025, Meta announced a new AI system to rank the order of the comments shown to Instagram users. With existing research showing how feed personalization systems can lead to increased polarization, the introduction of this new system raises similar questions. This paper presents a small-scale exploratory study examining whether the ranking system produces systematic differences in visible comments shown to different users, particularly for news-related content. Using four sock-puppet accounts varying in gender and political leaning, we collect visible comments on posts from ten news and ten non-news accounts. This collection is repeated twice from two VPN locations to assess location effects. We ask 1) how many visible comments vary across different users, 2) is this variation higher for news accounts than non-news accounts, and 3) can user-attributes like gender, political leaning, and location systematically explain the observed variation. Contrary to our expectations, we find that visible comments on news posts are less likely to vary across users than those on non-news posts. Variation is better explained by account metrics like comment and follower counts than by user attributes. These findings provide an initial glimpse into personalized comment ranking on Instagram and motivate larger, more systematic audits of how comment personalization may shape online discourse. To support further research, we provide the code to collect comments and the data upon request.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper reports a small-scale exploratory study of Instagram's March 2025 AI comment-ranking system. Using four sock-puppet accounts that vary by gender and political leaning, the authors collect visible comments on posts from ten news and ten non-news source accounts, repeating the collection twice from two VPN locations. They measure user-to-user variation in the visible comment sets and test whether variation differs by post type and whether it is better predicted by account metrics (comment and follower counts) than by user attributes. The central empirical claim is that visible comments vary less on news posts than on non-news posts and that account-level metrics explain more of the observed differences than user-level attributes.
Significance. If the patterns survive larger-scale replication, the work would supply early empirical evidence that Instagram's comment personalization may be content-dependent, with potential consequences for the diversity of discourse around news. The release of collection code and data is a concrete contribution that lowers the barrier for follow-up audits in this under-studied domain.
major comments (3)
- [Methods] Methods: The design uses only four accounts and twenty posts total, collected on two occasions. With no non-personalized baseline view, no logging of the full comment pool, and no controls for collection timing or platform A/B tests, observed differences cannot be confidently attributed to the AI ranking system rather than comment churn, account-age effects, or random ordering.
- [Results] Results: The claim that variation is lower for news posts and better explained by account metrics rests on a very small number of observations. The manuscript does not report the precise overlap metric used, the regression specification that compares user attributes versus account metrics, or any statistical significance tests that account for the paired collections.
- [Discussion] Discussion: The conclusion that user attributes (gender, leaning, location) do not systematically explain variation is under-powered given only four accounts; any regression or correlation analysis has too few degrees of freedom to separate these factors from account-level confounds.
minor comments (1)
- [Abstract] Abstract and Methods: Clarify the exact overlap metric (e.g., Jaccard, set difference) and the precise definition of 'account metrics' used in the predictive analysis.
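As an illustration of one metric named in the minor comment, the following is a minimal sketch of a pairwise Jaccard comparison between the comment sets that different accounts see on the same post. It is not the authors' released code. The function names, the account labels, and the comment IDs are all hypothetical, and reporting variation as one minus the Jaccard index is an assumption made here for illustration.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def pairwise_variation(views: dict[str, set[str]]) -> float:
    """Mean Jaccard distance (1 - index) over all unordered account pairs."""
    distances = [
        1.0 - jaccard(views[x], views[y])
        for x, y in combinations(sorted(views), 2)
    ]
    return mean(distances)


# Hypothetical data: comment IDs each sock-puppet account saw on one post.
# These values are made up for illustration and are not from the study.
views_for_post = {
    "account_a": {"c1", "c2", "c3", "c4"},
    "account_b": {"c1", "c2", "c3", "c5"},
    "account_c": {"c1", "c2", "c3", "c4"},
    "account_d": {"c1", "c2", "c6", "c7"},
}

print(f"mean pairwise variation: {pairwise_variation(views_for_post):.3f}")
```

A set-based metric like this ignores comment order. If rank position matters to the analysis, a rank-aware measure would be needed instead, which is one reason the exact definition should be stated in the manuscript.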
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our exploratory study. We agree that the small sample size constrains causal inference and statistical power, and we will revise the manuscript to clarify methods, report precise metrics and specifications, and moderate conclusions. We address each major comment below.
- Referee: [Methods] Methods: The design uses only four accounts and twenty posts total, collected on two occasions. With no non-personalized baseline view, no logging of the full comment pool, and no controls for collection timing or platform A/B tests, observed differences cannot be confidently attributed to the AI ranking system rather than comment churn, account-age effects, or random ordering.
  Authors: We agree that the small scale and absence of a baseline limit attribution to the AI system specifically. This study is framed as exploratory to identify patterns warranting larger follow-up. In revision we will expand the Methods to detail collection timing, VPN controls, and repetition protocol, and add a Limitations subsection explicitly discussing comment churn, account-age effects, and random ordering as alternative explanations. We cannot supply a non-personalized baseline or full comment pool because Instagram's interface does not expose them; we will state this platform constraint directly.
  Revision: partial
- Referee: [Results] Results: The claim that variation is lower for news posts and better explained by account metrics rests on a very small number of observations. The manuscript does not report the precise overlap metric used, the regression specification that compares user attributes versus account metrics, or any statistical significance tests that account for the paired collections.
  Authors: We will revise the Results section to define the overlap metric explicitly as the Jaccard index between visible comment sets for each pair of accounts. We will specify the regression as a linear model with variation as the outcome and predictors including account metrics (comment count, follower count) and user attributes (gender, leaning, location), and we will add paired tests (Wilcoxon signed-rank for news vs. non-news variation) along with effect sizes and confidence intervals, while noting the small n.
  Revision: yes
- Referee: [Discussion] Discussion: The conclusion that user attributes (gender, leaning, location) do not systematically explain variation is under-powered given only four accounts; any regression or correlation analysis has too few degrees of freedom to separate these factors from account-level confounds.
  Authors: We concur that four accounts yield insufficient power to isolate user attributes from account-level confounds. In the revised Discussion we will emphasize the exploratory character of the work, avoid strong claims that user attributes do not explain variation, and explicitly recommend larger-scale replication to disentangle these factors. We will also note the low degrees of freedom as a core limitation.
  Revision: yes
- The inability to obtain a non-personalized baseline view or log the complete comment pool, which are inherent platform constraints not addressable within the current study design.
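The paired comparison promised in the authors' response on Results (a Wilcoxon signed-rank test for news versus non-news variation) can be sketched with the standard library alone. This is a minimal sketch, not the authors' analysis. The rebuttal does not say what the paired unit is, so pairing by account pair is an assumption made here. The function names and all numbers are hypothetical. The exact two-sided p-value is obtained by enumerating every sign assignment, which is only feasible for small n.

```python
from itertools import product


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def wilcoxon_exact(diffs: list[float]) -> tuple[float, float]:
    """Return (W+, exact two-sided p) for paired differences; zeros are dropped."""
    nonzero = [d for d in diffs if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    center = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - center)
    extreme = 0
    total = 0
    # Under the null, each rank is equally likely to carry a + or - sign.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / total


# Hypothetical mean variation per account pair (six pairs from four accounts).
# Made-up numbers for illustration only; not data from the study.
news = [0.10, 0.14, 0.09, 0.20, 0.12, 0.16]
non_news = [0.25, 0.19, 0.21, 0.22, 0.30, 0.27]

diffs = [b - a for a, b in zip(news, non_news)]
w_plus, p_value = wilcoxon_exact(diffs)
print(f"W+ = {w_plus}, exact two-sided p = {p_value:.4f}")
```

With only six account pairs, the smallest two-sided p-value this test can produce is 2/64, which illustrates the referee's concern about statistical power.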
Circularity Check
No circularity: purely empirical data collection and observation
The paper performs an exploratory study by creating four sock-puppet accounts, collecting visible comments on 20 posts (10 news, 10 non-news) via two VPNs, and comparing variation across accounts. No equations, derivations, model fits, or predictions are present. Claims rest on direct counts of differing comments and simple statistical associations with account metrics; no self-citation chains, ansatzes, or uniqueness theorems are invoked to support core results. The analysis is self-contained against external benchmarks and does not reduce any finding to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sock-puppet accounts differing only in gender and political leaning can isolate the effects of user-attribute-based personalization.