arxiv: 2606.27274 · v2 · pith:ZLOJKVBUnew · submitted 2026-06-25 · 💻 cs.LG

BetXplain: An Explanation-Annotated Dataset for Detecting Manipulative Betting Advertisements on Social Media

Pith reviewed 2026-06-30 09:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords betting advertisements · manipulative advertising · social media · explainable AI · annotated dataset · Instagram · Reddit · deceptive practices

The pith

A new dataset of social media betting ads provides both manipulative labels and human explanations for each annotation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects betting advertisements from Instagram and Reddit and manually labels them for manipulative or deceptive practices. Each label comes with a written human explanation of the reasoning behind it. This structure is meant to support both standard classification and explainable detection methods. The authors further review common persuasive strategies in the ads and consider possible links to users' mental health. The dataset is positioned to enable tools such as browser warnings for users and automated monitoring for regulators.

Core claim

We introduce a dataset of betting-related advertisements from Instagram and Reddit that have been manually annotated for manipulative and deceptive advertising practices, together with human-provided explanations describing the reasoning for each annotation, to enable research into explainable detection approaches while also analyzing persuasive strategies and their potential mental health effects.

What carries the argument

The explanation-annotated dataset, which supplies both classification labels and the reasoning for those labels so that models can be trained to detect and justify findings about manipulative betting ads.
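
As a rough illustration of what one explanation-annotated record could look like, the sketch below pairs each advertisement with a label and a written rationale. The class name, field names, label values, and example text are all hypothetical; the paper's actual schema is not described in the text reviewed here.

```python
from dataclasses import dataclass


# Hypothetical record layout: field names and label values are illustrative
# assumptions, not the schema released with BetXplain.
@dataclass(frozen=True)
class AnnotatedAd:
    text: str         # advertisement text
    platform: str     # "instagram" or "reddit"
    label: str        # assumed label set, e.g. "manipulative" / "not_manipulative"
    explanation: str  # annotator's written reasoning for the label


def to_training_pair(ad: AnnotatedAd) -> tuple[str, str]:
    """Build an (input, target) pair so a model learns to emit label and rationale."""
    return ad.text, f"{ad.label}: {ad.explanation}"


# Invented example, not taken from the dataset.
example = AnnotatedAd(
    text="Deposit now and win back everything you lost!",
    platform="instagram",
    label="manipulative",
    explanation="Frames further deposits as a way to recover losses.",
)
source, target = to_training_pair(example)
```

Keeping the explanation next to the label in the target string is one simple way to train a model to justify its detection; other designs (separate heads, rationale extraction) are equally plausible.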

If this is right

  • Automated classifiers can be trained directly on the labeled advertisements to identify manipulative content.
  • Explainable AI methods can use the provided human explanations to generate justifications for detections.
  • Analysis of the ads reveals recurring persuasive tactics used in betting promotions.
  • Browser plugins can be built to warn users when they encounter flagged advertisements.
  • Web crawlers can be developed to help regulatory authorities monitor and detect such promotions at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same annotation approach with explanations could be applied to manipulative advertising in other product categories beyond betting.
  • Patterns in the explanations might be studied to identify which tactics most affect specific user groups.
  • The dataset could serve as a starting point for longitudinal tracking of how betting ad strategies evolve over time on the same platforms.

Load-bearing premise

The manual annotations and the explanations given for them are reliable and consistent enough to train and evaluate automated detection systems.

What would settle it

If a classifier trained on the dataset scored no better than random guessing on an independently collected set of new betting advertisements, that would indicate the annotations do not support reliable detection.
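
The test described above can be sketched as a comparison between held-out accuracy and two chance baselines. This is a minimal sketch, not the paper's evaluation: the function names are hypothetical, the labels are toy values, and the majority-class baseline is an added assumption that matters when labels are imbalanced.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of held-out items whose predicted label matches the gold label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def chance_baselines(y_true):
    """Return (uniform-guess accuracy, majority-class accuracy) for the gold labels."""
    counts = Counter(y_true)
    uniform = 1 / len(counts)                      # expected accuracy of uniform guessing
    majority = max(counts.values()) / len(y_true)  # always predict the most common label
    return uniform, majority


# Toy labels on a hypothetical independent test set ("m" = manipulative, "n" = not).
gold = ["m", "m", "n", "n"]
predicted = ["m", "m", "m", "n"]

acc = accuracy(gold, predicted)
uniform, majority = chance_baselines(gold)
beats_chance = acc > max(uniform, majority)
```

Comparing against the stronger of the two baselines avoids crediting a model that only learned the label prior.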

Figures

Figures reproduced from arXiv: 2606.27274 by Akrati Saxena, Mark Lee, MSVPJ Sathvik, Nishit Rane, Parmitha Vangapandu, Sathwik Narkedimilli.

Figure 1: Overview of the BetXplain framework, illustrating the pipeline from betting advertisement data collection ...
Figure 2: Psychological manipulation, architecture, and mental health impact: (a) Keyword frequency heatmap across ...
Figure 3: Influence cues and deceptive claims across categories: (a) Influence cues intensity by category. (b) ...
Figure 4: Clinical risk profile: (a) Risk severity distributions by disorder. (b) Psychological impact heatmap across ...
Figure 5: Sentiment, emotional tone, and the deceptive positivity effect: (a) Sentiment polarity by segment. (b) ...
Original abstract

The promotion of betting applications on social media platforms has increased significantly in recent years. Many of these advertisements use persuasive techniques that may mislead users, encourage risky behavior, and potentially influence users' mental well-being. However, research on the automated detection of manipulative and deceptive betting advertisements remains limited due to the lack of publicly available annotated datasets. In this work, we introduce a new dataset of betting-related advertisements collected from two widely used social media platforms, Instagram and Reddit. The advertisements were manually annotated for manipulative and deceptive advertising practices. In addition to classification labels, the dataset includes human-provided explanations that describe the reasoning behind each annotation, enabling research into explainable approaches to detecting manipulative advertising. Furthermore, we analyze the strategies commonly used in betting advertisements and examine how these persuasive tactics may impact users' mental health. The proposed framework can also enable practical applications such as browser plugins that warn users about manipulative betting advertisements and automated web crawlers that help regulatory authorities monitor and detect such promotions online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BetXplain, a new dataset of betting-related advertisements collected from Instagram and Reddit. Advertisements are manually annotated for manipulative and deceptive practices, with human-provided explanations for each label. The work also analyzes common persuasive strategies in such ads and their potential effects on users' mental health, with the goal of supporting explainable AI research and applications such as warning plugins or regulatory crawlers.

Significance. If the annotations are shown to be reliable, the dataset would address a documented gap in public resources for automated detection of manipulative betting content on social media. The addition of human explanations is a positive feature that directly supports research on explainable detection methods, and the mental-health analysis angle broadens potential impact beyond pure classification.

major comments (2)
  1. [Data collection and annotation description] The manuscript provides no information on the annotation protocol, number of annotators per item, inter-annotator agreement scores, adjudication procedure, or annotation guidelines. These details are required to evaluate whether the labels constitute stable ground truth suitable for training detection models or for benchmarking explanation methods.
  2. [Dataset presentation] No dataset statistics (total size, label distribution, platform breakdown, or example annotations with explanations) are reported. Without these, it is impossible to assess scale, balance, or representativeness, which are load-bearing for the claim that the resource enables reproducible research.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence summarizing dataset scale and any reliability metrics.
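
To make the requested agreement score concrete, here is a minimal sketch of Cohen's kappa for two annotators, written in plain Python. The function name and the toy labels are illustrative assumptions; the paper does not report which agreement statistic, if any, was computed.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations ("m" = manipulative, "n" = not manipulative), not real data.
annotator_1 = ["m", "m", "n", "n"]
annotator_2 = ["m", "n", "n", "n"]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.75, expected 0.5
```

With more than two annotators per item, Fleiss' kappa or Krippendorff's alpha would be the usual substitutes.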

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We agree that the manuscript requires additional information on the annotation methodology and dataset characteristics to support claims about reliability and reproducibility. We will revise the paper to address both major comments.

Point-by-point responses
  1. Referee: [Data collection and annotation description] The manuscript provides no information on the annotation protocol, number of annotators per item, inter-annotator agreement scores, adjudication procedure, or annotation guidelines. These details are required to evaluate whether the labels constitute stable ground truth suitable for training detection models or for benchmarking explanation methods.

    Authors: We acknowledge that these methodological details were omitted from the submitted manuscript. In the revised version, we will add a new section detailing the full annotation protocol. This will include the number of annotators assigned per item, inter-annotator agreement scores (e.g., Cohen's kappa or Fleiss' kappa), the adjudication procedure for resolving disagreements, and the complete annotation guidelines provided to annotators. These additions will allow evaluation of label stability.
    Revision: yes

  2. Referee: [Dataset presentation] No dataset statistics (total size, label distribution, platform breakdown, or example annotations with explanations) are reported. Without these, it is impossible to assess scale, balance, or representativeness, which are load-bearing for the claim that the resource enables reproducible research.

    Authors: We agree that quantitative dataset characteristics are essential. The revised manuscript will include a dedicated 'Dataset Statistics' section reporting total size, label distribution across manipulative/deceptive categories, platform breakdown (Instagram vs. Reddit), and several example annotations that include both the classification labels and the accompanying human explanations. These will be presented in a table or figure to demonstrate scale and representativeness.
    Revision: yes
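
The statistics requested in the second comment are cheap to produce once the records are loaded. The sketch below assumes each record is a dict with `platform` and `label` keys; those key names, the function name, and the toy rows are hypothetical and do not reflect the real dataset's size or balance.

```python
from collections import Counter


def dataset_statistics(records):
    """Summarize total size, label distribution, and platform breakdown."""
    records = list(records)
    return {
        "total": len(records),
        "by_label": dict(Counter(r["label"] for r in records)),
        "by_platform": dict(Counter(r["platform"] for r in records)),
        # Cross-tabulation exposes platform-specific label imbalance.
        "by_platform_and_label": dict(
            Counter((r["platform"], r["label"]) for r in records)
        ),
    }


# Toy rows for illustration only; counts say nothing about BetXplain itself.
toy_records = [
    {"platform": "instagram", "label": "manipulative"},
    {"platform": "instagram", "label": "not_manipulative"},
    {"platform": "reddit", "label": "manipulative"},
]
stats = dataset_statistics(toy_records)
```

Reporting the cross-tabulation alongside the marginals would let readers judge whether a classifier could exploit platform as a shortcut for the label.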

Circularity Check

0 steps flagged

Dataset release paper contains no derivation chain or fitted predictions

This is a dataset paper whose central contribution is the release of manually annotated betting ads from Instagram and Reddit, plus human explanations. No equations, models, or quantitative predictions are claimed. The abstract and described content contain no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations. Annotation reliability is not quantified in the provided text, but missing reliability metrics are a reproducibility concern, not a sign of circularity. The paper is self-contained as a resource contribution with no internal derivation that reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data-collection and annotation paper with no mathematical model or derivations.
