Just Another Hour on TikTok: ID sampling to obtain a complete slice of TikTok
Pith reviewed 2026-05-22 19:53 UTC · model grok-4.3
The pith
TikTok post IDs enable sampling that captures more than 99 percent of content from any chosen time range.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors develop a method to extract a representative sample of more than 99 percent of posts from a given time range on TikTok by targeting post identifiers. They use it to collect every post from a full hour on the platform, along with every post from a single minute in each hour of a day. This yields post metadata, video media, and comments from a near-complete slice of the platform. From that slice they derive key platform statistics, including an estimate of 269 million posts produced on the day examined, 18 percent of videos featuring children, and at least 0.5 percent of posts containing artificial intelligence-generated content.
What carries the argument
ID sampling that exploits the predictable generation of TikTok post identifiers to reach targeted time ranges with high coverage.
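To make the idea of reaching a targeted time range through identifiers concrete, here is a minimal sketch. It assumes, purely for illustration, that a post ID is a 64-bit integer whose upper 32 bits hold the creation time in Unix seconds; the summary above does not describe the ID layout, so `TIMESTAMP_SHIFT`, both function names, and the example date are hypothetical.

```python
from datetime import datetime, timezone

# Assumption (not stated in the source): the upper 32 bits of a 64-bit
# post ID hold the creation time in Unix seconds.
TIMESTAMP_SHIFT = 32


def id_range_for_window(start: datetime, end: datetime) -> tuple[int, int]:
    """Return half-open [low, high) bounds on IDs created in [start, end)."""
    low = int(start.timestamp()) << TIMESTAMP_SHIFT
    high = int(end.timestamp()) << TIMESTAMP_SHIFT
    return low, high


def id_to_time(post_id: int) -> datetime:
    """Recover the creation time encoded in an ID under the same assumption."""
    return datetime.fromtimestamp(post_id >> TIMESTAMP_SHIFT, tz=timezone.utc)


# Hypothetical one-hour window; the source does not give the sampled date.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
low, high = id_range_for_window(start, end)
```

Under this assumption, any candidate ID between `low` and `high` would belong to the chosen hour. How the remaining low-order bits are filled in, and therefore which candidates are worth probing, is exactly the part the paper's method would have to supply.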
If this is right
- Researchers can now obtain near-complete sets of posts, metadata, videos, and comments for any chosen hour or minute interval.
- Platform-wide daily output is estimated at 269 million posts on the sampled day.
- 18 percent of videos on the platform feature children.
- At least 0.5 percent of posts contain artificial intelligence-generated content.
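The daily total is described as an estimate built from one sampled minute in each hour. The summary does not say which estimator the authors used, so the following is only one plausible sketch: scale each sampled minute to its hour and sum over the day. The function name and the counts are made up for illustration.

```python
def estimate_daily_posts(minute_counts: list[int]) -> int:
    """Scale one sampled minute per hour up to a full-day total.

    Assumes each sampled minute is typical of its hour, which is an
    assumption of this sketch rather than a claim from the source.
    """
    if len(minute_counts) != 24:
        raise ValueError("expected one count per hour of the day (24 values)")
    return sum(count * 60 for count in minute_counts)


# Synthetic counts, not data from the paper.
example_counts = [1000] * 24
example_total = estimate_daily_posts(example_counts)
```

An estimator of this kind is sensitive to within-hour variation, which is why the full hour collected in the paper is a useful check on how much a single minute can differ from its neighbours.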
Where Pith is reading between the lines
- The same identifier-based approach could be tested on other platforms whose post IDs follow similar sequential patterns.
- Repeated daily samples would allow tracking of changes in the share of child-featured or AI-generated content over time.
- The collected media could support independent audits of moderation effectiveness on the reported content categories.
Load-bearing premise
Post identifiers on TikTok are generated in a sufficiently predictable or sequential manner that targeted sampling can achieve over 99 percent coverage without systematic bias or missing large clusters of content.
What would settle it
Collect a separate random sample of posts known to exist in the same time range and measure what fraction are absent from the ID-sampled collection; coverage below 99 percent would falsify the central claim.
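The check described above can be sketched as follows: given an independent reference sample of posts known to exist in the window, measure the fraction present in the ID-sampled collection and attach a Wilson score interval. The function name and inputs are hypothetical, and the interval choice is this sketch's, not the paper's.

```python
import math


def coverage_estimate(reference_ids, collected_ids, z: float = 1.96):
    """Return (coverage, lower, upper) for the reference sample.

    coverage is the share of reference posts found in the collection;
    lower and upper bound a Wilson score interval (z=1.96 is about 95%).
    """
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference sample must not be empty")
    n = len(reference)
    found = len(reference & set(collected_ids))
    p = found / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half, centre + half


# Synthetic example: 995 of 1000 reference posts were captured.
coverage, lower, upper = coverage_estimate(range(1000), range(995))
```

If the upper bound of the interval falls below 0.99, the reference sample is evidence against the coverage claim. A point estimate alone is not enough when the reference sample is small.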
Original abstract
TikTok is now a massive platform, and has a deep impact on global events. Despite preliminary studies, issues remain in determining fundamental characteristics of the platform. We develop a method to extract a representative sample of >99% of posts from a given time range on TikTok, and use it to collect all posts from a full hour on the platform, alongside all posts from a single minute from each hour of a day. Through this, we obtain post metadata, video media, and comments from a close-to-complete slice of TikTok, and report the critical statistics of the platform. Notably, we estimate a total of 269 million posts produced on the day we looked at, that 18% of videos on the platform feature children, and that at least 0.5% of posts contain artificial intelligence-generated content.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an ID sampling method claimed to extract a representative sample of >99% of TikTok posts from a given time range. The authors apply this to collect a complete hour of posts plus one minute per hour across a day, yielding metadata, video media, and comments, and report platform-wide statistics including an estimated 269 million posts per day, 18% of videos featuring children, and at least 0.5% AI-generated content.
Significance. If the sampling achieves unbiased near-complete coverage, the work would enable representative analyses of a platform that produces hundreds of millions of posts per day, which is rare in the field, and could inform studies of daily volume, child-related content, and AI generation. The approach is presented as a technical contribution independent of fitted parameters.
Major comments (2)
- [Abstract] Abstract: the claim of >99% coverage and the specific downstream statistics (269 M posts/day, 18% children, 0.5% AI-generated) are stated without validation data, error bars, ground-truth comparison, or any quantitative assessment of missed posts, making it impossible to evaluate whether the central coverage claim holds.
- [Methods] Methods (ID sampling description): the completeness of the targeted sampling rests on the unverified assumption that post IDs are generated in a sufficiently monotonic, sequential, and gap-free manner within time windows; if batch assignments, large non-sequential gaps, or time-varying randomization exist, entire clusters of content could be systematically omitted, directly undermining the representativeness of all reported statistics.
Minor comments (2)
- [Abstract] Abstract: specify the exact calendar date and time zone of the sampled day to allow reproducibility and context for the 269 M posts/day estimate.
- [Results] Results: clarify the annotation or detection method used to arrive at the 18% children and 0.5% AI-generated figures (e.g., sample size, inter-annotator agreement, or automated classifier details).
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: the claim of >99% coverage and the specific downstream statistics (269 M posts/day, 18% children, 0.5% AI-generated) are stated without validation data, error bars, ground-truth comparison, or any quantitative assessment of missed posts, making it impossible to evaluate whether the central coverage claim holds.
Authors: We acknowledge that the abstract states the coverage claim and derived statistics without accompanying quantitative validation metrics. The full manuscript describes the ID sampling procedure and reports an estimated coverage derived from observed ID continuity across multiple collection runs. We agree that explicit error bars, missed-post counts, and ground-truth comparisons were not foregrounded in the abstract. In revision we will add a concise statement in the abstract directing readers to the validation analysis in Section 3 and will include a short quantitative summary of observed ID gaps and coverage estimates. Direct ground-truth comparison with TikTok’s internal logs remains impossible because such data are not released; we will therefore frame the coverage figure as an empirical lower bound rather than an absolute guarantee.
Revision: partial
- Referee: [Methods] Methods (ID sampling description): the completeness of the targeted sampling rests on the unverified assumption that post IDs are generated in a sufficiently monotonic, sequential, and gap-free manner within time windows; if batch assignments, large non-sequential gaps, or time-varying randomization exist, entire clusters of content could be systematically omitted, directly undermining the representativeness of all reported statistics.
Authors: The referee correctly notes that the method depends on post IDs behaving sufficiently monotonically within short time windows. Our data collection shows that, within each targeted hour or minute, the large majority of IDs are strictly increasing with only small, infrequent gaps; we have used these empirical gap statistics to compute the reported coverage figure. We will expand the Methods section with additional figures illustrating the distribution of ID increments and the size of any detected gaps across our samples. We will also add a dedicated limitations paragraph discussing the possibility of batch-assigned or randomized IDs and the steps taken (multiple overlapping passes) to reduce the chance of systematic omission. While we maintain that the observed near-complete capture of posts in the sampled slices supports the reported platform statistics, we accept that the assumption cannot be proven without platform internals and will therefore present the coverage claim with appropriate caveats.
Revision: yes
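The gap statistics mentioned in the second response could be computed along these lines. This is a sketch under the assumption that collected post IDs are plain integers; `gap_stats` and the `large_gap` threshold are hypothetical names, and the source does not say what counts as a large gap.

```python
from statistics import median


def gap_stats(post_ids, large_gap: int) -> dict:
    """Summarise increments between consecutive distinct IDs.

    large_gap is a caller-chosen threshold; increments above it are
    counted as candidate holes in the collection.
    """
    ids = sorted(set(post_ids))
    if len(ids) < 2:
        raise ValueError("need at least two distinct IDs")
    gaps = [b - a for a, b in zip(ids, ids[1:])]
    return {
        "n_gaps": len(gaps),
        "median_gap": median(gaps),
        "max_gap": max(gaps),
        "n_large": sum(g > large_gap for g in gaps),
    }


# Synthetic IDs with one conspicuous hole between 3 and 10.
stats = gap_stats([1, 2, 3, 10, 11], large_gap=5)
```

A summary like this shows where IDs are sparse, but it cannot distinguish a missed post from an ID that was never issued, which is the limitation the authors concede.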
Circularity Check
No circularity: sampling method is an independent technical contribution
Full rationale
The paper presents a method for ID-based sampling to achieve >99% coverage of TikTok posts within a time window, followed by collection of a full hour and minute-per-hour slices. This claim rests on platform-specific assumptions about post ID generation (monotonicity and density) rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are described that reduce the coverage result to the inputs by construction. The reported statistics (269M posts/day, 18% children, 0.5% AI content) are downstream outputs of the collected data, not circularly defined. The derivation chain is therefore not circular: its validity depends on how TikTok actually generates IDs, which can be checked against external evidence, and not on the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: TikTok post identifiers are generated in a sufficiently ordered or predictable sequence that targeted sampling can reach >99% coverage.