Quantifying Data Similarity Using Cross Learning

Hao Helen Zhang; Joseph C Watkins; Shudong Sun

arxiv: 2510.10866 · v3 · submitted 2025-10-13 · 📊 stat.ML · cs.LG

Shudong Sun, Hao Helen Zhang, Joseph C Watkins

Pith reviewed 2026-05-18 08:20 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords dataset similarity · cross learning · transfer learning · decision boundaries · cosine similarity · generalization · domain adaptation

The pith

Dataset similarity is captured by how well a decision rule from one dataset performs on the other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes the Cross-Learning Score, which quantifies the similarity between two supervised datasets by how well a model trained on one generalizes to the other, in both directions. Unlike prior methods that compare only feature distributions, it incorporates label information through this performance measure. The authors link the score to the cosine similarity of decision boundaries in linear models, which gives it a geometric interpretation. This matters for transfer learning because the score helps predict whether a source dataset will produce positive, ambiguous, or negative transfer when a model is adapted to a target dataset.

Core claim

The Cross-Learning Score measures dataset similarity through bidirectional generalization performance of decision rules. Under canonical linear models, this score is an explicit function of the cosine similarity between the decision boundaries of the two datasets.

What carries the argument

The Cross-Learning Score (CLS), which averages the out-of-sample performance of a decision rule trained on one dataset when evaluated on the other.
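
As a rough illustration of the averaging described above, the sketch below trains a least-squares linear classifier on each of two datasets and scores it on the other. The helper names (`fit_linear_rule`, `error_rate`, `cross_learning_score`, `make_dataset`), the least-squares learner, the use of misclassification error as the performance measure, and the synthetic data are all assumptions made for illustration; the paper's own ensemble-based estimator is not reproduced here.

```python
import numpy as np

def fit_linear_rule(X, y):
    """Least-squares linear classifier; labels y are in {-1, +1}."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def error_rate(w, X, y):
    """Fraction of points in (X, y) misclassified by the rule sign(X1 @ w)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    pred = np.where(X1 @ w >= 0, 1.0, -1.0)
    return float(np.mean(pred != y))

def cross_learning_score(Xa, ya, Xb, yb):
    """Average of the two cross errors: rule from A on B, rule from B on A."""
    err_ab = error_rate(fit_linear_rule(Xa, ya), Xb, yb)
    err_ba = error_rate(fit_linear_rule(Xb, yb), Xa, ya)
    return 0.5 * (err_ab + err_ba)

def make_dataset(beta, n, rng):
    """Hypothetical generator: Gaussian features, noise-free linear labels."""
    X = rng.standard_normal((n, beta.size))
    y = np.where(X @ beta >= 0, 1.0, -1.0)
    return X, y

rng = np.random.default_rng(0)
beta = np.array([1.0, 0.0])
Xa, ya = make_dataset(beta, 2000, rng)
Xb, yb = make_dataset(beta, 2000, rng)    # same decision boundary as A
Xc, yc = make_dataset(-beta, 2000, rng)   # same boundary, labels flipped

cls_same = cross_learning_score(Xa, ya, Xb, yb)
cls_flipped = cross_learning_score(Xa, ya, Xc, yc)
print(f"same boundary: {cls_same:.3f}, flipped labels: {cls_flipped:.3f}")
```

Under this error-based convention, values near 0 indicate that each dataset's rule transfers well to the other, and values near 1 indicate opposed labelings.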

If this is right

Introduces a transferable zones framework that categorizes source datasets based on their CLS values for use in transfer learning.
Develops an ensemble-based estimator that is easy to implement without needing high-dimensional density estimation.
Extends the approach to encoder-head architectures to work with deep learning models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners might use this score to screen large collections of available datasets for those most compatible with a given target task.
The geometric interpretation opens the door to developing analogous similarity measures for nonlinear classifiers.
CLS could inform data curation strategies by quantifying how adding or removing certain data affects overall transfer potential.

Load-bearing premise

Bidirectional generalization performance of decision rules reflects intrinsic dataset similarity without depending on the specific model class or training procedure.

What would settle it

Compare the CLS value to the cosine similarity of fitted linear decision boundaries on multiple dataset pairs; systematic mismatch would disprove the theoretical equivalence.
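
A simplified version of this check can be sketched in simulation. The example below assumes isotropic Gaussian features and noise-free linear labels, a narrower setting than the paper's probit model, and uses known rather than fitted boundary normals. In that setting the rate at which two linear rules disagree is arccos(ρ)/π, where ρ is the cosine similarity of their normals, and the two cross errors coincide. The variable names and the chosen angle are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
theta = np.pi / 3                      # angle between the two boundary normals

beta_t = np.array([1.0, 0.0])                        # "target" normal
beta_s = np.array([np.cos(theta), np.sin(theta)])    # "source" normal

# Cosine similarity between the two decision-boundary normals.
cos_sim = beta_t @ beta_s / (np.linalg.norm(beta_t) * np.linalg.norm(beta_s))

# Isotropic Gaussian features (assumption); labels are the signs of X @ beta.
X = rng.standard_normal((n, 2))
disagree = float(np.mean(np.sign(X @ beta_t) != np.sign(X @ beta_s)))

# Geometric prediction for the cross error in this simplified setting.
predicted = float(np.arccos(cos_sim) / np.pi)

print(f"empirical cross error: {disagree:.4f}")
print(f"arccos(cos_sim) / pi:  {predicted:.4f}")
```

A systematic gap between the two printed numbers on data that satisfies the model assumptions would be the kind of mismatch described above; on real data, the normals would first have to be estimated by fitting a linear model to each dataset.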

Original abstract

Measuring dataset similarity is fundamental in machine learning, particularly for transfer learning and domain adaptation. In the context of supervised learning, most existing approaches quantify similarity of two data sets based on their input feature distributions, neglecting label information and feature-response alignment. To address this, we propose the Cross-Learning Score (CLS), which measures dataset similarity through bidirectional generalization performance of decision rules. We establish its theoretical foundation by linking CLS to cosine similarity between decision boundaries under canonical linear models, providing a geometric interpretation. A robust ensemble-based estimator is developed that is easy to implement and bypasses high-dimensional density estimation entirely. For transfer learning applications, we introduce a "transferable zones" framework that categorizes source datasets into positive, ambiguous, and negative transfer regions. To accommodate deep learning, we extend CLS to encoder-head architectures, aligning with modern representation-based pipelines. Extensive experiments on synthetic and real-world datasets validate the effectiveness of CLS for similarity measurement and transfer assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLS defines a label-aware similarity via cross-generalization performance with a linear-model geometric link, but the deep-learning extension rests on unproven transfer of that link.

The main thing here is a Cross-Learning Score that measures dataset similarity by training models on one set and testing generalization on the other in both directions. They link this score to cosine similarity of decision boundaries for linear models and add a transferable-zones split into positive, ambiguous, and negative sources. An ensemble estimator makes it practical without density estimation, and they sketch an extension to encoder-head deep models plus some synthetic and real-data checks.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Cross-Learning Score (CLS) to quantify similarity between supervised datasets via bidirectional generalization performance of decision rules learned on each. It claims a theoretical foundation linking CLS to cosine similarity of decision boundaries under canonical linear models, introduces an ensemble estimator that avoids density estimation, defines a 'transferable zones' framework (positive, ambiguous, negative) for transfer learning, extends CLS to encoder-head deep architectures, and validates the approach on synthetic and real-world data.

Significance. If the geometric equivalence and its extension hold, CLS would provide a label-aware, model-based similarity measure that complements feature-distribution approaches and could improve transferability assessment without requiring explicit density estimation. The ensemble estimator's ease of implementation is a practical strength. However, the significance is tempered by the limited scope of the theoretical derivation.

major comments (2)

[Theoretical foundation / Abstract] Abstract and theoretical foundation section: The link between CLS and cosine similarity of decision boundaries is derived only for canonical linear models. No derivation, counter-example analysis, or transfer argument is supplied for non-linear models or the encoder-head extension, so the geometric interpretation does not support the broader claims about representation-based pipelines or the transferable-zones framework.
[Transferable zones framework] Transferable zones framework: The categorization of source datasets into positive, ambiguous, and negative transfer regions rests on the assumption that bidirectional generalization performance captures intrinsic similarity independent of model class and optimization procedure. The manuscript supplies no robustness checks or ablation across base learners to substantiate this load-bearing assumption.

minor comments (1)

[Abstract] The abstract refers to 'canonical linear models' without defining the term or the precise model class used in the derivation; a short clarifying sentence would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, acknowledging the precise scope of our theoretical derivation while committing to clarifications and additional experiments in the revision.

Referee: [Theoretical foundation / Abstract] Abstract and theoretical foundation section: The link between CLS and cosine similarity of decision boundaries is derived only for canonical linear models. No derivation, counter-example analysis, or transfer argument is supplied for non-linear models or the encoder-head extension, so the geometric interpretation does not support the broader claims about representation-based pipelines or the transferable-zones framework.

Authors: We agree that the geometric equivalence to cosine similarity of decision boundaries is derived rigorously only for canonical linear models, as presented in the theoretical foundation. The extension of CLS to encoder-head architectures is introduced as a practical adaptation for modern representation-based pipelines, motivated by the linear case and validated through experiments, rather than supported by an identical theoretical derivation. We will revise the abstract to state the scope of the theoretical link more precisely and add a clarifying discussion in the theoretical section noting the empirical nature of the deep-learning extension. This will better align the claims with the provided derivations.
revision: yes

Referee: [Transferable zones framework] Transferable zones framework: The categorization of source datasets into positive, ambiguous, and negative transfer regions rests on the assumption that bidirectional generalization performance captures intrinsic similarity independent of model class and optimization procedure. The manuscript supplies no robustness checks or ablation across base learners to substantiate this load-bearing assumption.

Authors: The transferable zones framework is constructed directly from the bidirectional generalization performance measured by the CLS estimator for a given model class. While this makes the zones dependent on the learner by design, which is appropriate for practical transfer scenarios, we acknowledge the value of demonstrating robustness. We will add an ablation study in the revised experiments section using multiple base learners (linear models, decision trees, and small neural networks) on the synthetic data to assess consistency of the zone categorizations.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; CLS definition and linear-model geometric link are independent.

The paper defines CLS directly via bidirectional generalization performance of decision rules obtained by fitting models on each dataset. It then separately derives a geometric interpretation linking this quantity to cosine similarity of decision boundaries, but only under canonical linear models. This link is presented as a mathematical result for a restricted case rather than a redefinition or tautology forced by the fitting procedure itself. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from prior author work is invoked to carry the central claim. The extension to encoder-head architectures is described as an empirical adaptation, not a model-independent equivalence. The overall derivation therefore retains independent content beyond its inputs and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the linear-model assumption for the geometric interpretation and on the premise that ensemble generalization error faithfully reflects dataset similarity without additional regularization effects.

free parameters (1)

ensemble size or base learner hyperparameters
Used in the robust estimator; value not specified in abstract but required for practical computation.

axioms (1)

domain assumption: Canonical linear models suffice to establish the cosine-similarity link
Invoked to provide geometric interpretation of CLS.

invented entities (1)

transferable zones (positive, ambiguous, negative) · no independent evidence
purpose: Categorize source datasets for transfer decisions
New framework introduced to operationalize the score for transfer learning.

pith-pipeline@v0.9.0 · 5687 in / 1240 out tokens · 32286 ms · 2026-05-18T08:20:49.675617+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

CLS = (arccos(ρ1) + arccos(ρ2)) / (2π), where ρ1 and ρ2 are normalized inner products of β(t) and β(s) (Theorem 1, probit model)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

transferable zones defined by CLS thresholds relative to baseline error e0 + γ·SE(e0)
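
The passage above gives only the threshold form e0 + γ·SE(e0), not the full decision rule. The sketch below shows one hypothetical way such a threshold could sort a source dataset into the three zones. The symmetric lower bound e0 - γ·SE(e0), the function name `transfer_zone`, the default γ = 1, and the use of an error-scale score (lower is better) are all assumptions, not the paper's definition.

```python
def transfer_zone(cross_error, e0, se_e0, gamma=1.0):
    """Assign a transfer zone from a cross error and a baseline error.

    cross_error: error-scale score for the source/target pair (lower is better)
    e0:          baseline error of a target-only model
    se_e0:       standard error of e0
    gamma:       width of the ambiguous band in standard errors (assumed default)
    """
    upper = e0 + gamma * se_e0   # threshold form quoted in the passage above
    lower = e0 - gamma * se_e0   # symmetric counterpart (assumption)
    if cross_error < lower:
        return "positive"        # source clearly beats the baseline
    if cross_error > upper:
        return "negative"        # source is clearly worse than the baseline
    return "ambiguous"           # within noise of the baseline

# Hypothetical numbers, for illustration only.
zones = [transfer_zone(err, e0=0.20, se_e0=0.02) for err in (0.10, 0.21, 0.35)]
print(zones)
```

Widening γ enlarges the ambiguous band, which trades fewer confident calls for fewer wrong ones.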

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.