
arXiv: 2604.20856 · v1 · submitted 2026-02-25 · 💻 cs.IR · cs.CR · cs.CY

CRED-1: An Open Multi-Signal Domain Credibility Dataset for Automated Pre-Bunking of Online Misinformation

Pith reviewed 2026-05-15 19:34 UTC · model grok-4.3

classification 💻 cs.IR · cs.CR · cs.CY
keywords credibility dataset · misinformation pre-bunking · domain credibility · open dataset · browser extension · fact-checking · online misinformation

The pith

CRED-1 merges two source lists with four signals to assign composite credibility scores from 0.0 to 1.0 to 2672 domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRED-1, an open dataset that combines labels from OpenSources.co and Iffy.news with four computed signals: domain age, web popularity, fact-check frequency, and threat intelligence. Each domain receives a composite credibility score between 0.0 and 1.0. The dataset is designed for on-device pre-bunking of misinformation in privacy-preserving browser extensions. The pipeline is written in Python using only standard-library modules and is fully reproducible from public sources. A sympathetic reader would care because the dataset enables client-side detection at the content delivery stage without relying on external servers.

Core claim

CRED-1 is an open, reproducible domain-level credibility dataset that integrates two openly-licensed source lists with four enrichment signals—domain age from WHOIS/RDAP, web popularity from Tranco Top-1M, fact-check frequency from Google Fact Check Tools API, and threat intelligence from Google Safe Browsing API—to produce composite credibility scores between 0.0 and 1.0 for 2672 domains categorized as fake, unreliable, mixed, conspiracy, or satire.

What carries the argument

The composite credibility score, computed by combining source list labels with four enrichment signals to rate domains from 0.0 to 1.0.

If this is right

  • Domains can be scored on-device in browser extensions for real-time pre-bunking.
  • The dataset can be fully reproduced from public sources using only Python standard-library modules.
  • It covers categories including fake, unreliable, mixed, conspiracy, and satire.
  • Released under CC BY 4.0 and archived on Zenodo for open access.
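
A minimal sketch of what on-device scoring could look like, written in Python for consistency with the paper's pipeline (a real browser extension would use its own language). The table `SCORES`, its two `.test` domains and their values, and the function `lookup_score` are hypothetical; the actual file layout of CRED-1 is not described on this page.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory table mapping domain -> composite score.
# The domains and values are placeholders, not entries from CRED-1.
SCORES = {
    "example-fake.test": 0.05,
    "example-satire.test": 0.60,
}


def lookup_score(url, scores):
    """Return (matched_domain, score) for the URL's host or a parent domain.

    Returns None when neither the host nor any parent domain is listed.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # Try the full host first, then progressively shorter parent domains,
    # stopping before the bare top-level label.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in scores:
            return candidate, scores[candidate]
    return None
```

Because the lookup is a local dictionary access, no URL leaves the device, which is the privacy property the paper emphasizes. Matching on parent domains is a design assumption of this sketch, not something the paper specifies.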

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such datasets could be extended with additional signals like social media engagement metrics if available publicly.
  • The approach might generalize to other languages or regions by adapting the source lists.
  • Integration into more browsers could reduce reliance on centralized fact-checking services.
  • Periodic updates to the dataset would be needed to track evolving misinformation sources.

Load-bearing premise

The assumption that combining the two source lists with the four computed signals produces a score that reflects actual credibility rather than just repeating the original labels.

What would settle it

A study comparing the CRED-1 scores against independent human ratings or actual misinformation rates on the same domains. A low correlation would falsify the claim that the composite score meaningfully indicates credibility.
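
One way to run such a comparison is a rank correlation between the composite scores and an independent rating for the same domains. The sketch below computes Spearman's coefficient with the standard library only; the function names are hypothetical, and the choice of Spearman rather than another statistic is an assumption, since the paper reports no evaluation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(cred1_scores, human_ratings)` on paired lists would give a value in [-1, 1]; a value near zero would be the falsifying outcome described above.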

Original abstract

This article presents CRED-1, an open, reproducible domain-level credibility dataset combining two openly-licensed source lists (OpenSources.co and Iffy.news) with four computed enrichment signals: domain age (WHOIS/RDAP), web popularity (Tranco Top-1M), fact-check frequency (Google Fact Check Tools API), and threat intelligence (Google Safe Browsing API). The dataset covers 2,672 domains categorized as fake, unreliable, mixed, conspiracy, or satire, each assigned a composite credibility score between 0.0 and 1.0. CRED-1 is designed for on-device deployment in privacy-preserving browser extensions to enable client-side pre-bunking of misinformation at the content delivery stage. The entire pipeline is implemented in Python using only standard library modules and is fully reproducible from publicly available sources. The dataset and pipeline code are released under CC BY 4.0 and archived on Zenodo.
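
As an illustration of the domain-age signal, the sketch below derives an age in days from an already-fetched RDAP response. It relies on the standard RDAP JSON layout, in which an `events` array holds objects with `eventAction` and `eventDate`. The function names are hypothetical, and how the paper's pipeline actually queries or parses WHOIS/RDAP is not stated here. The sketch also assumes timestamps in a form that `datetime.fromisoformat` accepts once a trailing `Z` is replaced.

```python
from datetime import datetime, timezone


def registration_date(rdap):
    """Return the timezone-aware registration datetime from an RDAP dict, or None."""
    for event in rdap.get("events", []):
        if event.get("eventAction") == "registration":
            stamp = event["eventDate"]
            # Older Python versions reject a trailing "Z", so spell out UTC.
            if stamp.endswith(("Z", "z")):
                stamp = stamp[:-1] + "+00:00"
            parsed = datetime.fromisoformat(stamp)
            if parsed.tzinfo is None:
                # Assumption: timestamps without an offset are treated as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            return parsed
    return None


def domain_age_days(rdap, now):
    """Whole days between registration and `now`; None if no registration event."""
    registered = registration_date(rdap)
    if registered is None:
        return None
    return (now - registered).days
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the computation reproducible for a fixed snapshot date.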

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CRED-1, an open dataset of 2,672 domains labeled as fake, unreliable, mixed, conspiracy, or satire. It combines categorical labels from OpenSources.co and Iffy.news with four enrichment signals (WHOIS/RDAP domain age, Tranco rank, Google Fact Check Tools API count, and Google Safe Browsing hits) to produce a composite credibility score in [0.0, 1.0]. The dataset and a fully reproducible Python pipeline using only public APIs and standard libraries are released under CC BY 4.0 on Zenodo for on-device pre-bunking in privacy-preserving browser extensions.

Significance. If the composite score construction were fully specified and shown to correlate with real-world misinformation outcomes, CRED-1 would be a useful, immediately deployable resource for client-side credibility signals. The explicit reproducibility claims, use of only public data sources, and open release of both data and code are clear strengths that support verification and reuse.

major comments (2)
  1. [Abstract and §3 (Methodology)] The composite credibility score is described as ranging from 0.0 to 1.0, yet no equation, weighting scheme, normalization procedure, or aggregation rule is provided for combining the two source lists with the four signals. This is load-bearing for the central claim that the scores meaningfully reflect credibility rather than merely re-encoding the input categorical labels.
  2. [§5 (Evaluation) and §6 (Discussion)] No correlation, calibration, or downstream evaluation is reported against held-out fact-check outcomes, misinformation incidence, or user-engagement metrics. Without such evidence the claim that the scores support effective pre-bunking cannot be assessed.
minor comments (2)
  1. [§2.1] Clarify the exact rule used when OpenSources.co and Iffy.news assign conflicting categories to the same domain.
  2. [Figure 1] Add explicit labels or arrows showing how each of the four signals is normalized before combination with the source labels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript describing the CRED-1 dataset. We address each major comment point by point below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and §3 (Methodology)] The composite credibility score is described as ranging from 0.0 to 1.0, yet no equation, weighting scheme, normalization procedure, or aggregation rule is provided for combining the two source lists with the four signals. This is load-bearing for the central claim that the scores meaningfully reflect credibility rather than merely re-encoding the input categorical labels.

    Authors: We agree that the aggregation procedure must be stated explicitly. The original §3 described the inputs at a high level but omitted the precise formula for brevity. The composite score is formed by first mapping the categorical labels (fake=0.0, unreliable=0.25, mixed=0.5, conspiracy=0.75, satire=1.0) to a base value, then applying a weighted linear adjustment using z-normalized enrichment signals with fixed weights (domain age 0.25, Tranco rank 0.25, fact-check count 0.25, Safe Browsing hits 0.25) followed by min-max scaling to [0,1]. We will insert the full equation, normalization steps, and weight justification into the revised Methodology section.

    revision: yes

  2. Referee: [§5 (Evaluation) and §6 (Discussion)] No correlation, calibration, or downstream evaluation is reported against held-out fact-check outcomes, misinformation incidence, or user-engagement metrics. Without such evidence the claim that the scores support effective pre-bunking cannot be assessed.

    Authors: The manuscript presents CRED-1 as an open, reproducible resource intended to enable client-side pre-bunking rather than as a pre-validated predictor. No direct correlation or calibration analysis was performed because the contribution centers on dataset construction and the public pipeline; external outcome data were outside the scope of this release. In the revision we will expand §6 with an explicit limitations paragraph acknowledging the lack of downstream validation and will suggest concrete evaluation protocols that users of the released code can apply.

    revision: partial
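
The aggregation described in the first simulated response can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: the label-to-base mapping and the equal weights are taken from the simulated rebuttal, while the additive combination, the signal key names, and the requirement that each signal is already oriented so that larger means more credible are assumptions of this sketch.

```python
from statistics import mean, pstdev

# Base values and weights as stated in the simulated rebuttal.
BASE = {"fake": 0.0, "unreliable": 0.25, "mixed": 0.5, "conspiracy": 0.75, "satire": 1.0}
# Hypothetical key names for the four enrichment signals.
WEIGHTS = {"age": 0.25, "popularity": 0.25, "fact_checks": 0.25, "safe_browsing": 0.25}


def z_scores(values):
    """Z-normalize a list using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant signal carries no information; contribute nothing.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def composite_scores(rows):
    """Return one score in [0, 1] per row.

    Each row is a dict with a 'label' key and one numeric value per key in
    WEIGHTS, oriented so that larger means more credible (an assumption).
    """
    z = {name: z_scores([row[name] for row in rows]) for name in WEIGHTS}
    # Assumed additive form: base value plus the weighted sum of z-scores.
    raw = [
        BASE[row["label"]] + sum(WEIGHTS[name] * z[name][i] for name in WEIGHTS)
        for i, row in enumerate(rows)
    ]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # All rows tie; min-max scaling is undefined, so return the midpoint.
        return [0.5] * len(raw)
    return [(value - lo) / (hi - lo) for value in raw]
```

Under min-max scaling the lowest-scoring domain in a batch always maps to 0.0 and the highest to 1.0, so the scores are relative to the set of domains scored together.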

Circularity Check

0 steps flagged

No circularity: direct aggregation of external lists and public signals

Full rationale

The paper constructs CRED-1 by combining two openly licensed external source lists (OpenSources.co and Iffy.news) with four independently computed signals from public APIs (WHOIS age, Tranco rank, fact-check counts, Safe Browsing hits). No equations, fitted parameters, predictive models, or derivations are presented that reduce the composite 0.0-1.0 score to its own inputs by construction. The pipeline is described as a straightforward, reproducible aggregation without self-definitional steps, load-bearing self-citations, uniqueness theorems, or ansatzes. The central output is therefore self-contained and does not rely on any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the two source lists are authoritative and that the four public signals add independent credibility information; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: OpenSources.co and Iffy.news provide reliable base categorizations of domains as fake, unreliable, mixed, conspiracy, or satire.
    These lists are used as the foundation for the 2,672 domains and their labels.



