Just Another Hour on TikTok: ID sampling to obtain a complete slice of TikTok

Benjamin Steel; Derek Ruths; Juergen Pfeffer; Miriam Schirmer

arxiv: 2504.13279 · v5 · pith:BLP37TYBnew · submitted 2025-04-17 · 💻 cs.SI

Pith reviewed 2026-05-22 19:53 UTC · model grok-4.3

keywords: TikTok · social media sampling · data collection method · platform statistics · content analysis · post identifiers · AI-generated content

The pith

TikTok post IDs enable sampling that captures more than 99 percent of content from any chosen time range.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that uses the structure of TikTok post identifiers to gather a representative sample covering over 99 percent of posts within a specified time window. The authors apply the technique to collect every post from one full hour and every post from one minute in each hour across an entire day. This produces metadata, videos, and comments that let them calculate platform-wide figures such as 269 million posts on the sampled day, 18 percent of videos featuring children, and at least 0.5 percent containing AI-generated material. A sympathetic reader would care because earlier work lacked any reliable way to see the full distribution of content on a platform that shapes global events.

Core claim

The authors develop a method to extract a representative sample of more than 99 percent of posts from a given time range on TikTok by targeting identifiers, then use it to collect every post from a full hour on the platform along with every post from a single minute in each hour of a day. This yields post metadata, video media, and comments from a near-complete slice, from which they derive the critical statistics of the platform including an estimate of 269 million posts produced on the day examined, 18 percent of videos featuring children, and at least 0.5 percent of posts containing artificial intelligence-generated content.

What carries the argument

ID sampling that exploits the predictable generation of TikTok post identifiers to reach targeted time ranges with high coverage.
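
A minimal sketch of how a time window could map to an ID range, assuming the upper 32 bits of a 64-bit post ID hold the Unix creation time in seconds. That layout is not stated on this page, so the shift constant, the function names, and the example date are all illustrative assumptions rather than the authors' documented procedure.

```python
from datetime import datetime, timezone

# ASSUMPTION: the top 32 bits of a 64-bit post ID are a Unix timestamp in
# seconds. This layout is not stated on this page; treat it as hypothetical.
TIMESTAMP_SHIFT = 32


def id_range_for_window(start: datetime, end: datetime) -> tuple[int, int]:
    """Return the half-open ID range [low, high) covering start <= t < end."""
    low = int(start.timestamp()) << TIMESTAMP_SHIFT
    high = int(end.timestamp()) << TIMESTAMP_SHIFT
    return low, high


def timestamp_of(post_id: int) -> datetime:
    """Recover the creation time embedded in an ID under the same assumption."""
    return datetime.fromtimestamp(post_id >> TIMESTAMP_SHIFT, tz=timezone.utc)


# Arbitrary example window of one hour (not the day studied in the paper).
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
low, high = id_range_for_window(start, end)
```

Under this assumption a one-hour window spans 3600 × 2^32 candidate IDs, far too many to enumerate, so any practical sampler would also have to exploit structure in the lower bits, which this page does not describe.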

If this is right

Researchers can now obtain near-complete sets of posts, metadata, videos, and comments for any chosen hour or minute interval.
Platform-wide daily output is estimated at 269 million posts on the sampled day.
18 percent of videos on the platform feature children.
At least 0.5 percent of posts contain artificial intelligence-generated content.
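
One plausible way a daily total could be derived from the minute-per-hour samples is to let each sampled minute stand in for the 60 minutes of its hour. The sketch below shows only that arithmetic. The counts are made-up placeholders, not the paper's data, and both the function name and the assumption that a single minute represents its hour are ours.

```python
def estimate_daily_posts(minute_counts: list[int]) -> int:
    """Scale one sampled minute per hour up to a full-day total."""
    if len(minute_counts) != 24:
        raise ValueError("expected one sampled minute for each of 24 hours")
    # Each sampled minute represents the 60 minutes of its hour.
    return sum(count * 60 for count in minute_counts)


# Hypothetical per-minute counts, one per hour (NOT the paper's data).
example_counts = [150_000 + 5_000 * hour for hour in range(24)]
daily_estimate = estimate_daily_posts(example_counts)
```

Because posting volume can vary within an hour, an estimate built this way inherits whatever sampling error the chosen minutes carry.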

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same identifier-based approach could be tested on other platforms whose post IDs follow similar sequential patterns.
Repeated daily samples would allow tracking of changes in the share of child-featured or AI-generated content over time.
The collected media could support independent audits of moderation effectiveness on the reported content categories.

Load-bearing premise

Post identifiers on TikTok are generated in a sufficiently predictable or sequential manner that targeted sampling can achieve over 99 percent coverage without systematic bias or missing large clusters of content.

What would settle it

Collect a separate random sample of posts known to exist in the same time range and measure what fraction are absent from the ID-sampled collection; coverage below 99 percent would falsify the central claim.
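
The check described above can be sketched as a coverage estimate with an uncertainty interval. This is an illustration under our own assumptions: the function name is hypothetical, the reference sample is taken to be independent of the ID-sampled collection, and the Wilson score interval is our choice of interval, not something the paper is reported to use.

```python
import math


def coverage_with_interval(
    reference_ids: set[int], sampled_ids: set[int], z: float = 1.96
) -> tuple[float, float, float]:
    """Return (coverage, lower, upper) using a Wilson score interval."""
    n = len(reference_ids)
    if n == 0:
        raise ValueError("reference sample must not be empty")
    # Fraction of independently known posts that the ID sample also contains.
    found = len(reference_ids & sampled_ids)
    p = found / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half, centre + half


# Toy data: 1000 reference posts, 995 of which appear in the ID sample.
reference = set(range(1000))
sampled = set(range(995))
coverage, lower, upper = coverage_with_interval(reference, sampled)
```

A cautious reading would compare the whole interval against 0.99 rather than the point estimate alone, since a small reference sample can leave the interval wide.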

Original abstract

TikTok is now a massive platform, and has a deep impact on global events. Despite preliminary studies, issues remain in determining fundamental characteristics of the platform. We develop a method to extract a representative sample of >99% of posts from a given time range on TikTok, and use it to collect all posts from a full hour on the platform, alongside all posts from a single minute from each hour of a day. Through this, we obtain post metadata, video media, and comments from a close-to-complete slice of TikTok, and report the critical statistics of the platform. Notably, we estimate a total of 269 million posts produced on the day we looked at, that 18% of videos on the platform feature children, and that at least 0.5% of posts contain artificial intelligence-generated content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TikTok ID sampling claims near-complete hourly slices but the coverage rests on untested assumptions about post ID structure.

The main thing to know is that this paper describes an ID sampling method meant to pull more than 99% of TikTok posts from a fixed time window, then uses it to collect a full hour plus one minute per hour across a day and reports platform-wide numbers such as 269 million daily posts, 18% of videos featuring children, and at least 0.5% of posts containing AI-generated content. The specific combination of dense hourly sampling with distributed minute probes looks new compared with earlier TikTok studies that relied on smaller or less systematic crawls. The work does a reasonable job of converting the collection into concrete estimates that could serve as reference points for researchers tracking overall activity or content categories.

The soft spot is the coverage claim itself. It depends on TikTok post IDs being generated densely and monotonically enough within short windows that targeted sampling misses almost nothing. If IDs contain large gaps, batch assignments, or time-varying patterns, entire clusters could be omitted without the method detecting it, which would directly affect the representativeness of the 269 million figure and the content percentages. The abstract states the coverage rate and the resulting statistics but shows no ground-truth comparison, error bounds, or robustness checks, so the central technical result is hard to assess from what is presented.

This paper is for social-media researchers who need better tools for obtaining representative platform slices when API access is restricted. A reader working on measurement methods or large-scale content analysis would get some practical value from the sampling idea and the scale estimates, even while treating the exact coverage as provisional. It deserves a serious referee because the collection effort is substantial and the problem it targets is real; referees could check the ID mechanics and any validation that appears in the full methods section.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an ID sampling method claimed to extract a representative sample of >99% of TikTok posts from a given time range. The authors apply this to collect a complete hour of posts plus one minute per hour across a day, yielding metadata, video media, and comments, and report platform-wide statistics including an estimated 269 million posts per day, 18% of videos featuring children, and at least 0.5% AI-generated content.

Significance. If the sampling achieves unbiased near-complete coverage, the work would enable large-scale, representative analyses of TikTok content at a scale (hundreds of millions of posts) that is rare in the field and could inform studies of daily volume, child-related content, and AI generation. The approach is presented as a technical contribution independent of fitted parameters.

major comments (2)

[Abstract] Abstract: the claim of >99% coverage and the specific downstream statistics (269 M posts/day, 18% children, 0.5% AI-generated) are stated without validation data, error bars, ground-truth comparison, or any quantitative assessment of missed posts, making it impossible to evaluate whether the central coverage claim holds.
[Methods] Methods (ID sampling description): the completeness of the targeted sampling rests on the unverified assumption that post IDs are generated in a sufficiently monotonic, sequential, and gap-free manner within time windows; if batch assignments, large non-sequential gaps, or time-varying randomization exist, entire clusters of content could be systematically omitted, directly undermining the representativeness of all reported statistics.

minor comments (2)

[Abstract] Abstract: specify the exact calendar date and time zone of the sampled day to allow reproducibility and context for the 269 M posts/day estimate.
[Results] Results: clarify the annotation or detection method used to arrive at the 18% children and 0.5% AI-generated figures (e.g., sample size, inter-annotator agreement, or automated classifier details).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract] Abstract: the claim of >99% coverage and the specific downstream statistics (269 M posts/day, 18% children, 0.5% AI-generated) are stated without validation data, error bars, ground-truth comparison, or any quantitative assessment of missed posts, making it impossible to evaluate whether the central coverage claim holds.

Authors: We acknowledge that the abstract states the coverage claim and derived statistics without accompanying quantitative validation metrics. The full manuscript describes the ID sampling procedure and reports an estimated coverage derived from observed ID continuity across multiple collection runs. We agree that explicit error bars, missed-post counts, and ground-truth comparisons were not foregrounded in the abstract. In revision we will add a concise statement in the abstract directing readers to the validation analysis in Section 3 and will include a short quantitative summary of observed ID gaps and coverage estimates. Direct ground-truth comparison with TikTok’s internal logs remains impossible because such data are not released; we will therefore frame the coverage figure as an empirical lower bound rather than an absolute guarantee.

revision: partial

Referee: [Methods] Methods (ID sampling description): the completeness of the targeted sampling rests on the unverified assumption that post IDs are generated in a sufficiently monotonic, sequential, and gap-free manner within time windows; if batch assignments, large non-sequential gaps, or time-varying randomization exist, entire clusters of content could be systematically omitted, directly undermining the representativeness of all reported statistics.

Authors: The referee correctly notes that the method depends on post IDs behaving sufficiently monotonically within short time windows. Our data collection shows that, within each targeted hour or minute, the large majority of IDs are strictly increasing with only small, infrequent gaps; we have used these empirical gap statistics to compute the reported coverage figure. We will expand the Methods section with additional figures illustrating the distribution of ID increments and the size of any detected gaps across our samples. We will also add a dedicated limitations paragraph discussing the possibility of batch-assigned or randomized IDs and the steps taken (multiple overlapping passes) to reduce the chance of systematic omission. While we maintain that the observed near-complete capture of posts in the sampled slices supports the reported platform statistics, we accept that the assumption cannot be proven without platform internals and will therefore present the coverage claim with appropriate caveats.

revision: yes

Circularity Check

0 steps flagged

No circularity: sampling method is an independent technical contribution

The paper presents a method for ID-based sampling to achieve >99% coverage of TikTok posts within a time window, followed by collection of a full hour and minute-per-hour slices. This claim rests on platform-specific assumptions about post ID generation (monotonicity and density) rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are described that reduce the coverage result to the inputs by construction. The reported statistics (269M posts/day, 18% children, 0.5% AI content) are downstream outputs of the collected data, not circularly defined. The derivation chain is therefore not circular: its validity turns on external facts about TikTok's ID behavior rather than on the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on an unstated assumption that TikTok post IDs are distributed in a manner that permits high-coverage sampling; no free parameters, additional axioms, or invented entities are visible from the abstract alone.

axioms (1)

Domain assumption: TikTok post identifiers are generated in a sufficiently ordered or predictable sequence that targeted sampling can reach >99% coverage.
This premise is required for the sampling method to achieve the stated completeness; it is invoked implicitly when the authors claim representative samples from a time range.

pith-pipeline@v0.9.0 · 5676 in / 1315 out tokens · 40822 ms · 2026-05-22T19:53:14.560488+00:00 · methodology
