Quantifying Data Similarity Using Cross Learning
Pith reviewed 2026-05-18 08:20 UTC · model grok-4.3
The pith
Dataset similarity is captured by how well a decision rule from one dataset performs on the other.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Cross-Learning Score measures dataset similarity through bidirectional generalization performance of decision rules. Under canonical linear models, this score is equivalent to the cosine similarity between the decision boundaries of the two datasets.
What carries the argument
The Cross-Learning Score (CLS), which averages, over both directions, the out-of-sample performance of a decision rule trained on one dataset and evaluated on the other.
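As a rough illustration of this definition, the sketch below trains a simple linear rule on each dataset and averages the two cross-dataset error rates. The nearest-centroid learner, the use of misclassification error as the performance measure (so a lower score means more similar), the synthetic data, and all function names are assumptions made for illustration; the paper's own learner and estimator may differ.

```python
import random

def fit_centroid_rule(X, y):
    """Linear rule: weights = class-1 mean minus class-0 mean, boundary through the midpoint."""
    d = len(X[0])
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    mu1 = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    mu0 = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    w = [a - b for a, b in zip(mu1, mu0)]
    bias = -sum(wj * (a + b) / 2 for wj, a, b in zip(w, mu1, mu0))
    return w, bias

def error_rate(rule, X, y):
    """Fraction of points in (X, y) that the rule misclassifies."""
    w, bias = rule
    wrong = 0
    for x, lab in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, x)) + bias > 0 else 0
        wrong += pred != lab
    return wrong / len(y)

def cross_learning_score(Xs, ys, Xt, yt, fit=fit_centroid_rule):
    """Average of the two cross-dataset errors (assumed form of the score)."""
    rule_s, rule_t = fit(Xs, ys), fit(Xt, yt)
    return 0.5 * (error_rate(rule_s, Xt, yt) + error_rate(rule_t, Xs, ys))

def make_dataset(beta, n, rng):
    """Hypothetical data: Gaussian features, noiseless labels 1{x . beta > 0}."""
    X = [[rng.gauss(0, 1) for _ in beta] for _ in range(n)]
    y = [1 if sum(b * xj for b, xj in zip(beta, x)) > 0 else 0 for x in X]
    return X, y

rng = random.Random(0)
Xa, ya = make_dataset([1.0, 0.0], 500, rng)
Xb, yb = make_dataset([1.0, 0.0], 500, rng)    # same boundary as dataset a
Xc, yc = make_dataset([-1.0, 0.0], 500, rng)   # opposite boundary
score_same = cross_learning_score(Xa, ya, Xb, yb)
score_opposite = cross_learning_score(Xa, ya, Xc, yc)
print(score_same, score_opposite)
```

With error as the performance measure, the pair sharing a decision boundary scores near 0 and the pair with opposite boundaries scores near 1.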
If this is right
- Introduces a transferable zones framework that categorizes source datasets based on their CLS values for use in transfer learning.
- Develops an ensemble-based estimator that is easy to implement without needing high-dimensional density estimation.
- Extends the approach to encoder-head architectures to work with deep learning models.
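One plausible reading of the ensemble-based estimator is a bagging scheme: fit the rule on many bootstrap resamples of one dataset, score each fit on the other dataset, and average over resamples and over both directions. The sketch below follows that reading. The bootstrap scheme, the one-dimensional threshold learner, the default of 50 resamples, and every function name are assumptions for illustration, not details taken from the paper.

```python
import random

def fit_threshold(data):
    """1-D rule: threshold at the midpoint of the class means, oriented toward class 1."""
    ones = [x for x, y in data if y == 1]
    zeros = [x for x, y in data if y == 0]
    if not ones or not zeros:
        return None  # degenerate single-class resample; skipped by the caller
    m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    return 0.5 * (m1 + m0), (1 if m1 >= m0 else -1)

def error_rate(rule, data):
    """Fraction of (x, y) pairs the rule misclassifies."""
    threshold, sign = rule
    wrong = sum((1 if sign * (x - threshold) > 0 else 0) != y for x, y in data)
    return wrong / len(data)

def one_way_error(train, test, n_boot, rng):
    """Mean test error of rules fitted on bootstrap resamples of train."""
    errors = []
    for _ in range(n_boot):
        resample = [rng.choice(train) for _ in train]
        rule = fit_threshold(resample)
        if rule is not None:
            errors.append(error_rate(rule, test))
    return sum(errors) / len(errors)

def ensemble_cls(data_s, data_t, n_boot=50, seed=0):
    """Average the two one-way ensemble errors (assumed form of the estimator)."""
    rng = random.Random(seed)
    return 0.5 * (one_way_error(data_s, data_t, n_boot, rng)
                  + one_way_error(data_t, data_s, n_boot, rng))

def make_data(class1_mean, n, rng):
    """Hypothetical data: class 1 centred at class1_mean, class 0 at -class1_mean."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(class1_mean if y == 1 else -class1_mean, 1.0)
        data.append((x, y))
    return data

rng = random.Random(42)
source = make_data(1.5, 300, rng)
aligned = make_data(1.5, 300, rng)    # same feature-label relationship
flipped = make_data(-1.5, 300, rng)   # labels reversed relative to source
cls_aligned = ensemble_cls(source, aligned)
cls_flipped = ensemble_cls(source, flipped)
print(cls_aligned, cls_flipped)
```

Nothing in this scheme estimates a density; it only needs repeated fitting and scoring, which is consistent with the claim that the estimator bypasses density estimation.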
Where Pith is reading between the lines
- Practitioners might use this score to screen large collections of available datasets for those most compatible with a given target task.
- The geometric interpretation opens the door to developing analogous similarity measures for nonlinear classifiers.
- CLS could inform data curation strategies by quantifying how adding or removing certain data affects overall transfer potential.
Load-bearing premise
Bidirectional generalization performance of decision rules reflects intrinsic dataset similarity without depending on the specific model class or training procedure.
What would settle it
Compare the CLS value to the cosine similarity of fitted linear decision boundaries on multiple dataset pairs; systematic mismatch would disprove the theoretical equivalence.
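A minimal version of this check can be simulated. For noiseless linear labels and isotropic Gaussian features, two linear rules disagree with probability θ/π, where θ is the angle between their weight vectors, so the cross-dataset error should track arccos(cosine similarity)/π. The sketch below compares the two quantities at a few angles. The isotropic features, the noiseless labels, and the use of the true weight vectors in place of fitted ones are simplifying assumptions, not the paper's experimental protocol.

```python
import math
import random

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_error(beta_rule, beta_labels, n, rng):
    """Share of Gaussian points where the rule's sign differs from the label's sign."""
    wrong = 0
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in beta_rule]
        pred = sum(b * xj for b, xj in zip(beta_rule, x)) > 0
        label = sum(b * xj for b, xj in zip(beta_labels, x)) > 0
        wrong += pred != label
    return wrong / n

rng = random.Random(1)
beta_t = [1.0, 0.0]
results = []
for degrees in (0, 30, 90, 150):
    theta = math.radians(degrees)
    beta_s = [math.cos(theta), math.sin(theta)]
    cos_sim = max(-1.0, min(1.0, cosine_similarity(beta_s, beta_t)))
    predicted = math.acos(cos_sim) / math.pi
    # With isotropic features both directions have the same error,
    # so one direction stands in for the bidirectional average.
    observed = cross_error(beta_s, beta_t, 20000, rng)
    results.append((degrees, predicted, observed))
    print(degrees, round(predicted, 3), round(observed, 3))
```

A systematic gap between the predicted and observed columns on real dataset pairs, beyond sampling noise, would be the kind of mismatch described above.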
Original abstract
Measuring dataset similarity is fundamental in machine learning, particularly for transfer learning and domain adaptation. In the context of supervised learning, most existing approaches quantify similarity of two data sets based on their input feature distributions, neglecting label information and feature-response alignment. To address this, we propose the Cross-Learning Score (CLS), which measures dataset similarity through bidirectional generalization performance of decision rules. We establish its theoretical foundation by linking CLS to cosine similarity between decision boundaries under canonical linear models, providing a geometric interpretation. A robust ensemble-based estimator is developed that is easy to implement and bypasses high-dimensional density estimation entirely. For transfer learning applications, we introduce a "transferable zones" framework that categorizes source datasets into positive, ambiguous, and negative transfer regions. To accommodate deep learning, we extend CLS to encoder-head architectures, aligning with modern representation-based pipelines. Extensive experiments on synthetic and real-world datasets validate the effectiveness of CLS for similarity measurement and transfer assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Cross-Learning Score (CLS) to quantify similarity between supervised datasets via bidirectional generalization performance of decision rules learned on each. It claims a theoretical foundation linking CLS to cosine similarity of decision boundaries under canonical linear models, introduces an ensemble estimator that avoids density estimation, defines a 'transferable zones' framework (positive, ambiguous, negative) for transfer learning, extends CLS to encoder-head deep architectures, and validates the approach on synthetic and real-world data.
Significance. If the geometric equivalence and its extension hold, CLS would provide a label-aware, model-based similarity measure that complements feature-distribution approaches and could improve transferability assessment without requiring explicit density estimation. The ensemble estimator's ease of implementation is a practical strength. However, the significance is tempered by the limited scope of the theoretical derivation.
major comments (2)
- [Theoretical foundation / Abstract] Abstract and theoretical foundation section: The link between CLS and cosine similarity of decision boundaries is derived only for canonical linear models. No derivation, counter-example analysis, or transfer argument is supplied for non-linear models or the encoder-head extension, so the geometric interpretation does not support the broader claims about representation-based pipelines or the transferable-zones framework.
- [Transferable zones framework] Transferable zones framework: The categorization of source datasets into positive, ambiguous, and negative transfer regions rests on the assumption that bidirectional generalization performance captures intrinsic similarity independent of model class and optimization procedure. The manuscript supplies no robustness checks or ablation across base learners to substantiate this load-bearing assumption.
minor comments (1)
- [Abstract] The abstract refers to 'canonical linear models' without defining the term or the precise model class used in the derivation; a short clarifying sentence would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below, acknowledging the precise scope of our theoretical derivation while committing to clarifications and additional experiments in the revision.
Point-by-point responses
- Referee: [Theoretical foundation / Abstract] Abstract and theoretical foundation section: The link between CLS and cosine similarity of decision boundaries is derived only for canonical linear models. No derivation, counter-example analysis, or transfer argument is supplied for non-linear models or the encoder-head extension, so the geometric interpretation does not support the broader claims about representation-based pipelines or the transferable-zones framework.
Authors: We agree that the geometric equivalence to cosine similarity of decision boundaries is derived rigorously only for canonical linear models, as presented in the theoretical foundation. The extension of CLS to encoder-head architectures is introduced as a practical adaptation for modern representation-based pipelines, motivated by the linear case and validated through experiments, rather than supported by an identical theoretical derivation. We will revise the abstract to state the scope of the theoretical link more precisely and add a clarifying discussion in the theoretical section noting the empirical nature of the deep-learning extension. This will better align the claims with the provided derivations. revision: yes
- Referee: [Transferable zones framework] Transferable zones framework: The categorization of source datasets into positive, ambiguous, and negative transfer regions rests on the assumption that bidirectional generalization performance captures intrinsic similarity independent of model class and optimization procedure. The manuscript supplies no robustness checks or ablation across base learners to substantiate this load-bearing assumption.
Authors: The transferable zones framework is constructed directly from the bidirectional generalization performance measured by the CLS estimator for a given model class. While this makes the zones dependent on the learner by design, which is appropriate for practical transfer scenarios, we acknowledge the value of demonstrating robustness. We will add an ablation study in the revised experiments section using multiple base learners (linear models, decision trees, and small neural networks) on the synthetic data to assess consistency of the zone categorizations. revision: yes
Circularity Check
No significant circularity; CLS definition and linear-model geometric link are independent.
Full rationale
The paper defines CLS directly via bidirectional generalization performance of decision rules obtained by fitting models on each dataset. It then separately derives a geometric interpretation linking this quantity to cosine similarity of decision boundaries, but only under canonical linear models. This link is presented as a mathematical result for a restricted case rather than a redefinition or tautology forced by the fitting procedure itself. No self-citation chain, ansatz smuggling, or uniqueness theorem imported from prior author work is invoked to carry the central claim. The extension to encoder-head architectures is described as an empirical adaptation, not a model-independent equivalence. The overall derivation therefore retains independent content beyond its inputs and is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- ensemble size or base learner hyperparameters
axioms (1)
- domain assumption: Canonical linear models suffice to establish the cosine-similarity link
invented entities (1)
- transferable zones (positive, ambiguous, negative): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  CLS = (arccos(ρ1) + arccos(ρ2)) / (2π), where ρ1, ρ2 are normalized inner products of β(t), β(s) (Theorem 1, probit model)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration
  Relation between the paper passage and the cited Recognition theorem: unclear
  transferable zones defined by CLS thresholds relative to baseline error e0 + γ·SE(e0)
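The review gives only the threshold quantity e0 + γ·SE(e0), not the full zone rule. As a hypothetical illustration, the sketch below labels a source as positive when its CLS error is at or below the baseline e0, ambiguous when it lies between e0 and e0 + γ·SE(e0), and negative above that. The three-way cut, the binomial standard error, the default γ = 1, and the function names are all assumptions and may not match the paper's definition.

```python
import math

def baseline_standard_error(e0, n_test):
    """Binomial standard error of an error rate e0 measured on n_test points (assumed form)."""
    return math.sqrt(e0 * (1.0 - e0) / n_test)

def transfer_zone(cls_error, e0, n_test, gamma=1.0):
    """Assign a hypothetical transfer zone from a CLS error and a baseline error e0."""
    upper = e0 + gamma * baseline_standard_error(e0, n_test)
    if cls_error <= e0:
        return "positive"
    if cls_error <= upper:
        return "ambiguous"
    return "negative"

# Example: baseline error 0.20 on 400 test points gives SE = 0.02,
# so with gamma = 2 the upper threshold is 0.24.
zones = [transfer_zone(c, e0=0.20, n_test=400, gamma=2.0) for c in (0.10, 0.22, 0.30)]
print(zones)
```

Under this reading, γ controls the width of the ambiguous band: a larger γ makes the rule slower to declare negative transfer.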
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.