You See It, They Don't: An Exploratory Study of User-to-User Variation in Instagram Comments

Brahmani Nutakki; Ingmar Weber; Manon Lilott Kempermann

arxiv: 2603.21953 · v2 · submitted 2026-03-23 · 💻 cs.CY


Pith reviewed 2026-05-15 00:40 UTC · model grok-4.3


keywords: instagram, comment ranking, personalization, news content, sock-puppet accounts, user variation, AI system, content visibility


The pith

Instagram's AI comment ranking shows less variation on news posts than on non-news posts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the new AI system that ranks Instagram comments creates different visible sets for different users, with special attention to news content where polarization risks are higher. Researchers created four sock-puppet accounts that differed in gender and political leaning, then viewed the same posts from two locations and recorded which comments appeared first. They expected more user-to-user differences on news posts, yet found the opposite: news posts produced more consistent comment lists across accounts. Differences tracked better with the post account's own metrics, such as follower count and total comments received, than with any user traits. The work therefore calls for larger audits to understand how comment ordering shapes public discourse.

Core claim

Using four sock-puppet accounts that varied by gender and political leaning and collecting visible comments twice from separate VPN locations, the study finds that the visible comment sets on news posts vary less across users than the sets on non-news posts. Variation correlates more strongly with account-level metrics such as comment volume and follower count than with the simulated user attributes of gender, political leaning, or location.

What carries the argument

Sock-puppet accounts that simulate different user profiles to compare the ranked lists of visible comments returned by the platform on identical posts.
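As a minimal sketch of this comparison, assuming each account's view of a post is stored as a set of comment IDs, the share of observed comments that are not shown to every account can be computed as follows. The function name `varying_share`, the account labels, and the comment IDs are hypothetical; the paper's own measure may differ.

```python
from typing import Dict, Set


def varying_share(visible: Dict[str, Set[str]]) -> float:
    """Share of observed comments on one post that are NOT shown to every account."""
    sets = list(visible.values())
    union = set().union(*sets)          # every comment seen by at least one account
    if not union:
        return 0.0                      # no comments observed, so nothing can vary
    common = set.intersection(*sets)    # comments seen by all accounts
    return len(union - common) / len(union)


# Hypothetical views of one post from four sock-puppet accounts.
post_views = {
    "account_a": {"c1", "c2", "c3"},
    "account_b": {"c1", "c2", "c4"},
    "account_c": {"c1", "c2", "c3"},
    "account_d": {"c1", "c2", "c5"},
}

print(varying_share(post_views))  # 3 of the 5 observed comments vary -> 0.6
```

Treating views as sets ignores rank order; a rank-aware comparison would need the ordered lists instead.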

If this is right

News posts may present a more uniform set of comments to all users than other content does.
Account-level metrics such as follower count and comment volume predict visibility better than user demographics.
Personalization effects in comment ranking appear weaker for news than expected.
Larger-scale audits are required to confirm whether these patterns hold across more accounts and topics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

More uniform comment sections on news could reduce the range of opinions users encounter even without strong personalization.
If account metrics dominate ranking, then high-engagement accounts gain greater control over visible discourse.
The same ranking logic may operate on other platforms and could be tested by repeating the sock-puppet method elsewhere.
Direct comparison with real-user data would help check whether the four-account design captures the full range of platform behavior.

Load-bearing premise

Observed differences between accounts are produced by the AI ranking algorithm rather than by timing, random platform behavior, or other unmeasured factors.

What would settle it

Collecting comments from all four accounts at the exact same moment and finding identical top-comment lists on every news post would show that the ranking system does not create user-specific views.

Original abstract

In March 2025, Meta announced a new AI system to rank the order of the comments shown to Instagram users. With existing research showing how feed personalization systems can lead to increased polarization, the introduction of this new system raises similar questions. This paper presents a small-scale exploratory study examining whether the ranking system produces systematic differences in visible comments shown to different users, particularly for news-related content. Using four sock-puppet accounts varying in gender and political leaning, we collect visible comments on posts from ten news and ten non-news accounts. This collection is repeated twice from two VPN locations to assess location effects. We ask 1) how many visible comments vary across different users, 2) is this variation higher for news accounts than non-news accounts, and 3) can user-attributes like gender, political leaning, and location systematically explain the observed variation. Contrary to our expectations, we find that visible comments on news posts are less likely to vary across users than those on non-news posts. Variation is better explained by account metrics like comment and follower counts than by user attributes. These findings provide an initial glimpse into personalized comment ranking on Instagram and motivate larger, more systematic audits of how comment personalization may shape online discourse. To support further research, we provide the code to collect comments and the data upon request.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Small sock-puppet study finds lower comment variation on news posts but cannot separate ranking effects from platform noise.


The paper's central observation is that visible comments on news posts vary less across different users than those on non-news posts, and that factors like the number of comments and followers on the source account explain more of the variation than the user's gender, political leaning, or location. With only four accounts involved, though, it's hard to say this difference comes from the AI ranking rather than other things going on in the platform.

They set up sock-puppet accounts that differ in gender and leaning, collected visible comments on posts from ten news and ten non-news accounts, and repeated the process from two different VPN locations. This gives a direct look at what different users see, which is new for this specific ranking change. The finding runs counter to expectations about personalization increasing divides on news content, and they make the collection code available along with the data on request. That practical step helps anyone who wants to check or build on the results.

The limitations stand out clearly. The sample is small enough that differences in visible comments could easily come from when the data was gathered, changes in comment order over time, or other unmeasured platform behaviors instead of systematic personalization. They lack a non-personalized reference point and do not record the entire set of comments available, so the ranking mechanism itself stays somewhat opaque. The comparisons rest on these limited observations without statistical controls for potential confounders.

Readers working on social media algorithms or content moderation would find this useful as an initial probe. It gives a concrete method and some numbers to think about, even if the results are only suggestive. The work shows honest engagement with the question and the literature on personalization effects. I would accept it for peer review. The idea is timely and the execution is straightforward, so referees can point out ways to make the evidence stronger.

Referee Report

3 major / 1 minor

Summary. This paper reports a small-scale exploratory study of Instagram's March 2025 AI comment-ranking system. Using four sock-puppet accounts that vary by gender and political leaning, the authors collect visible comments on posts from ten news and ten non-news source accounts, repeating the collection twice from two VPN locations. They measure user-to-user variation in the visible comment sets and test whether variation differs by post type and whether it is better predicted by account metrics (comment and follower counts) than by user attributes. The central empirical claim is that visible comments vary less on news posts than on non-news posts and that account-level metrics explain more of the observed differences than user-level attributes.

Significance. If the patterns survive larger-scale replication, the work would supply early empirical evidence that Instagram's comment personalization may be content-dependent, with potential consequences for the diversity of discourse around news. The release of collection code and data is a concrete contribution that lowers the barrier for follow-up audits in this under-studied domain.

major comments (3)

[Methods] Methods: The design uses only four accounts and twenty posts total, collected on two occasions. With no non-personalized baseline view, no logging of the full comment pool, and no controls for collection timing or platform A/B tests, observed differences cannot be confidently attributed to the AI ranking system rather than comment churn, account-age effects, or random ordering.
[Results] Results: The claim that variation is lower for news posts and better explained by account metrics rests on a very small number of observations. The manuscript does not report the precise overlap metric used, the regression specification that compares user attributes versus account metrics, or any statistical significance tests that account for the paired collections.
[Discussion] Discussion: The conclusion that user attributes (gender, leaning, location) do not systematically explain variation is under-powered given only four accounts; any regression or correlation analysis has too few degrees of freedom to separate these factors from account-level confounds.

minor comments (1)

[Abstract] Abstract and Methods: Clarify the exact overlap metric (e.g., Jaccard, set difference) and the precise definition of 'account metrics' used in the predictive analysis.
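To illustrate one of the candidate overlap metrics named in this comment, the sketch below computes the Jaccard index for every pair of accounts on a single post. It assumes visible comments are stored as sets of comment IDs; the helper names, account labels, and data are hypothetical and not taken from the manuscript.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def pairwise_jaccard(visible: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard index for every unordered pair of accounts viewing the same post."""
    return {
        (u, v): jaccard(visible[u], visible[v])
        for u, v in combinations(sorted(visible), 2)
    }


# Hypothetical views of one post from four accounts (6 unordered pairs).
post_views = {
    "account_a": {"c1", "c2", "c3"},
    "account_b": {"c1", "c2", "c4"},
    "account_c": {"c1", "c2", "c3"},
    "account_d": {"c1", "c2", "c5"},
}

scores = pairwise_jaccard(post_views)
# One possible per-post variation score: 1 minus the mean pairwise overlap.
variation = 1 - sum(scores.values()) / len(scores)
print(scores[("account_a", "account_b")], round(variation, 4))
```

With four accounts there are only six pairs per post, which is one concrete way to see how few observations any downstream test has to work with.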

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on our exploratory study. We agree that the small sample size constrains causal inference and statistical power, and we will revise the manuscript to clarify methods, report precise metrics and specifications, and moderate conclusions. We address each major comment below.


Referee: [Methods] Methods: The design uses only four accounts and twenty posts total, collected on two occasions. With no non-personalized baseline view, no logging of the full comment pool, and no controls for collection timing or platform A/B tests, observed differences cannot be confidently attributed to the AI ranking system rather than comment churn, account-age effects, or random ordering.

Authors: We agree that the small scale and absence of a baseline limit attribution to the AI system specifically. This study is framed as exploratory to identify patterns warranting larger follow-up. In revision we will expand the Methods to detail collection timing, VPN controls, and repetition protocol, and add a Limitations subsection explicitly discussing comment churn, account-age effects, and random ordering as alternative explanations. We cannot supply a non-personalized baseline or full comment pool because Instagram's interface does not expose them; we will state this platform constraint directly.

Revision: partial

Referee: [Results] Results: The claim that variation is lower for news posts and better explained by account metrics rests on a very small number of observations. The manuscript does not report the precise overlap metric used, the regression specification that compares user attributes versus account metrics, or any statistical significance tests that account for the paired collections.

Authors: We will revise the Results section to define the overlap metric explicitly as the Jaccard index between visible comment sets for each pair of accounts. We will specify the regression as a linear model with variation as the outcome and predictors including account metrics (comment count, follower count) and user attributes (gender, leaning, location), and we will add paired tests (Wilcoxon signed-rank for news vs. non-news variation) along with effect sizes and confidence intervals, while noting the small n.

Revision: yes

Referee: [Discussion] Discussion: The conclusion that user attributes (gender, leaning, location) do not systematically explain variation is under-powered given only four accounts; any regression or correlation analysis has too few degrees of freedom to separate these factors from account-level confounds.

Authors: We concur that four accounts yield insufficient power to isolate user attributes from account-level confounds. In the revised Discussion we will emphasize the exploratory character of the work, avoid strong claims that user attributes do not explain variation, and explicitly recommend larger-scale replication to disentangle these factors. We will also note the low degrees of freedom as a core limitation.

Revision: yes
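The paired test mentioned in the responses above can be sketched in plain Python as an exact Wilcoxon signed-rank test, which enumerates all sign patterns and is therefore only practical for small samples. The pairing unit (one news and one non-news value per account pair), the function names, and the numbers are assumptions made for illustration; they are not data or code from the study.

```python
from itertools import product
from typing import List, Sequence, Tuple


def average_ranks(diffs: Sequence[float]) -> List[Tuple[float, int]]:
    """Rank non-zero differences by absolute value (ties share the mean rank)."""
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    out: List[Tuple[float, int]] = []
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        out.extend((mean_rank, 1 if d > 0 else -1) for d in ordered[i:j + 1])
        i = j + 1
    return out


def wilcoxon_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, exact two-sided p-value) for paired samples x and y."""
    ranked = average_ranks([a - b for a, b in zip(x, y)])
    ranks = [r for r, _ in ranked]
    w_plus = sum(r for r, s in ranked if s > 0)
    centre = sum(ranks) / 2          # expected W+ under the null hypothesis
    observed = abs(w_plus - centre)
    extreme = 0
    # Under the null every sign pattern is equally likely: enumerate all 2^n.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Made-up counts of differing comments for the 6 account pairs.
news = [1, 3, 2, 0, 4, 2]
non_news = [4, 5, 7, 4, 5, 8]
w_plus, p_value = wilcoxon_exact(news, non_news)
print(w_plus, p_value)  # every difference is negative, so W+ = 0 and p = 2/64
```

With six pairs the smallest achievable two-sided p-value is 2/64 = 0.03125, which makes the low statistical power discussed in the report concrete.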

standing simulated objections not resolved

The inability to obtain a non-personalized baseline view or log the complete comment pool, which are inherent platform constraints not addressable within the current study design.

Circularity Check

0 steps flagged

No circularity: purely empirical data collection and observation


The paper performs an exploratory study by creating four sock-puppet accounts, collecting visible comments on posts from 20 source accounts (10 news, 10 non-news) from two VPN locations, and comparing variation across accounts. No equations, derivations, model fits, or predictions are present. Claims rest on direct counts of differing comments and simple statistical associations with account metrics; no self-citation chains, ansatzes, or uniqueness theorems are invoked to support core results. The analysis is self-contained against external benchmarks and does not reduce any finding to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The findings rest on the domain assumption that sock-puppet accounts with assigned attributes accurately simulate real-user personalization exposure and that observed comment differences are attributable to the ranking algorithm rather than other unmeasured factors.

axioms (1)

Domain assumption: Sock-puppet accounts differing only in gender and political leaning can isolate the effects of user-attribute-based personalization.
The study design assigns these traits to test whether they drive comment visibility differences.

