The Futility of Bias-Free Learning and Search

Akshay Trikha; Dominique Macias; George D. Montanez; Jonathan Hayase; Julia Vendemiatti; Julius Lauw

arxiv: 1907.06010 · v1 · pith:GNDH4S3Bnew · submitted 2019-07-13 · 💻 cs.LG · stat.ML

The Futility of Bias-Free Learning and Search

George D. Montanez , Jonathan Hayase , Julius Lauw , Dominique Macias , Akshay Trikha , Julia Vendemiatti This is my paper

Pith reviewed 2026-05-24 22:10 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords bias in machine learninginformation resourcesconservation of biassearch view of learningno-free-lunch implicationsgeometric success measure

0 comments

The pith

Machine learning requires bias toward a target, and that bias cannot favor many targets at once without trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats learning as a search process over possible datasets or information resources. It proves that for any fixed amount of bias favoring one particular target, only a limited share of those resources can actually help the search succeed. Bias behaves as a conserved quantity, so an algorithm cannot spread favorable bias across many different targets simultaneously. Success probability equals the geometric angle between the task's true structure and the assumptions encoded in the algorithm's bias. Finding a distribution over resources that supplies useful bias is itself provably hard unless the resources already align with the task.

Core claim

For a given degree of bias toward a fixed target, the proportion of favorable information resources is strictly bounded from above. Bias is a conserved quantity such that no algorithm can be favorably biased toward many distinct targets simultaneously. The probability of success for a task equals the angle of agreement between what holds for the actual task and what the algorithm assumes through its bias. Finding a favorably biasing distribution over a fixed set of information resources is provably difficult unless the set of resources itself is already favorable with respect to the given task and algorithm.

What carries the argument

Bias defined as a measure relative to a collection of possible information resources, which bounds the fraction of helpful resources and enforces conservation across targets.

If this is right

Success on one task necessarily reduces the resources available for success on other tasks.
The geometric angle between task and bias directly limits achievable success probability.
Any attempt to select a biasing distribution over resources faces a hardness barrier unless the resources already match the task.
Bias encodes unavoidable trade-offs between different learning targets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The conservation result suggests that multi-task learning must involve explicit sharing mechanisms rather than hoping one bias serves all tasks.
The bounded proportion of favorable resources offers a quantitative way to compare how much prior structure different algorithms inject.
If the view of learning as search over information resources holds, then claims of completely general learners without task-specific assumptions become impossible to sustain.

Load-bearing premise

Machine learning can be viewed productively as search over a collection of possible datasets or information resources, with bias defined relative to that collection.

What would settle it

An explicit algorithm that achieves higher-than-random success on multiple unrelated targets while maintaining the same bias level toward each, or a bias-free learner whose success rate exceeds the uniform random baseline on a non-trivial task.

Figures

Figures reproduced from arXiv: 1907.06010 by Akshay Trikha, Dominique Macias, George D. Montanez, Jonathan Hayase, Julia Vendemiatti, Julius Lauw.

**Figure 2.** Figure 2: These examples visualize the target divergence for various p [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Building on the view of machine learning as search, we demonstrate the necessity of bias in learning, quantifying the role of bias (measured relative to a collection of possible datasets, or more generally, information resources) in increasing the probability of success. For a given degree of bias towards a fixed target, we show that the proportion of favorable information resources is strictly bounded from above. Furthermore, we demonstrate that bias is a conserved quantity, such that no algorithm can be favorably biased towards many distinct targets simultaneously. Thus bias encodes trade-offs. The probability of success for a task can also be measured geometrically, as the angle of agreement between what holds for the actual task and what is assumed by the algorithm, represented in its bias. Lastly, finding a favorably biasing distribution over a fixed set of information resources is provably difficult, unless the set of resources itself is already favorable with respect to the given task and algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The results on bias bounds and conservation follow directly from the search model definitions, with the geometric angle view as the main rephrasing.

read the letter

The colleague should know upfront that this paper quantifies how bias is required to beat uniform chance in a search-over-resources model of learning, and that bias cannot favor many targets at once because total excess probability is conserved. The bound on the share of favorable resources for a fixed bias level is presented as strict, and success probability is tied to an angle of agreement between assumed and actual task properties. The difficulty of locating a good biasing distribution over resources is also stated. These points are framed as trade-offs that any learner must face. The work does a clean job of spelling out the conservation statement and the geometric angle explicitly, which organizes the no-free-lunch style argument in one place. It builds on the search view the authors cite rather than starting from scratch. The central claims hold up inside the chosen model because the upper bound follows from averaging excess probabilities and conservation follows from linearity when summing across targets; there is no internal contradiction or hidden fitting step. The soft spot is that once the model is fixed—learning as search over a static collection of information resources, bias defined as excess success probability—the quantitative statements are largely immediate consequences rather than independent discoveries. The abstract supplies no derivations or checks, so the full paper needs to show whether the angle interpretation adds anything beyond an inner-product restatement. The modeling choice itself is the load-bearing assumption; results outside that framing are not addressed. This is for readers already working on formal limits to generalization or bias in learning theory. A serious referee should see it because the argument is coherent on its own terms and the formalization is precise enough to discuss, even if the novelty is modest. I would send it for review with a request to clarify how the search model maps to standard supervised settings.

Referee Report

2 major / 3 minor

Summary. The paper models machine learning as search over a fixed collection of information resources (e.g., datasets) and argues that bias—defined as excess success probability relative to uniform search—is necessary. It claims that for any fixed degree of bias toward one target, the fraction of favorable resources is strictly upper-bounded; that bias is conserved across targets (no algorithm can be favorably biased toward many distinct targets simultaneously); that success probability equals the angle of agreement between the algorithm's bias and the actual task; and that finding a favorably biasing distribution over resources is hard unless the resources are already favorable for the task.

Significance. If the modeling assumptions hold, the work supplies a parameter-free derivation of bias necessity, an explicit conservation law, and a geometric view of agreement, all obtained directly from averaging and linearity over the resource collection. These results formalize well-known intuitions about inductive bias and multi-task trade-offs without fitted parameters or invented entities, providing a clean theoretical lens that could inform algorithm analysis and the design of bias-aware search procedures.

major comments (2)

[Abstract and main text (no numbered equations or definitions supplied)] The central claims rest on the definition of bias as excess success probability relative to the collection; the upper bound follows by averaging and conservation by linearity. However, the manuscript states these results in the abstract and text without exhibiting the explicit definitions, the averaging step, or the linearity argument, so the derivations cannot be inspected for edge cases (e.g., when success probabilities are zero or when the collection is finite but small).
[Modeling premise (weakest assumption noted in reader's report)] The modeling choice that the collection of information resources is fixed and exhaustive is load-bearing for both the bound and the conservation claim. The paper does not examine what happens if the collection is misspecified, incomplete, or task-dependent, which would alter the averaging argument.

minor comments (3)

Add a short formal section or appendix with the bias definition, the averaging argument for the proportion bound, and the linearity argument for conservation; this would raise soundness from its current low level without changing the central claim.
The geometric angle interpretation is described only verbally; a one-paragraph vector-space example or a simple diagram would clarify the inner-product view of agreement between assumed and actual task properties.
The final claim that finding a favorably biasing distribution is 'provably difficult' is stated without reference to a complexity result or reduction; either supply the argument or qualify the statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will incorporate revisions to improve the clarity and scope of the manuscript.

read point-by-point responses

Referee: [Abstract and main text (no numbered equations or definitions supplied)] The central claims rest on the definition of bias as excess success probability relative to the collection; the upper bound follows by averaging and conservation by linearity. However, the manuscript states these results in the abstract and text without exhibiting the explicit definitions, the averaging step, or the linearity argument, so the derivations cannot be inspected for edge cases (e.g., when success probabilities are zero or when the collection is finite but small).

Authors: We agree that the derivations should be presented explicitly. In the revised version we will add formal definitions of bias (as excess success probability relative to the resource collection), the averaging argument establishing the upper bound on the fraction of favorable resources, and the linearity argument for bias conservation. We will also include a dedicated subsection examining edge cases, including zero success probabilities and small finite collections, to confirm the results hold under these conditions. revision: yes
Referee: [Modeling premise (weakest assumption noted in reader's report)] The modeling choice that the collection of information resources is fixed and exhaustive is load-bearing for both the bound and the conservation claim. The paper does not examine what happens if the collection is misspecified, incomplete, or task-dependent, which would alter the averaging argument.

Authors: The fixed exhaustive collection is indeed the modeling premise that permits the direct averaging and linearity arguments. We will add a new discussion section in the revision that explicitly addresses the consequences of relaxing this assumption, including how misspecification, incompleteness, or task-dependence would modify the bounds and conservation law. This addition will delineate the scope of the current results without altering the core model. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper models learning as search over a fixed collection of information resources and defines bias as excess success probability relative to that collection. The upper bound on favorable resources follows directly from averaging over the collection (a mathematical consequence of the definition), and bias conservation follows from linearity of the excess-probability sum across targets. These are standard implications of the chosen formalization rather than reductions of outputs to inputs by construction. No equations equate a derived quantity to a fitted parameter, no self-citation chain supplies a uniqueness theorem, and no ansatz is smuggled via prior work. The argument is self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the modeling choice that learning equals search over information resources and on the relative definition of bias; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Machine learning can be viewed as search over a collection of possible datasets or information resources
Explicitly invoked in the first sentence of the abstract as the foundation for all subsequent claims.

pith-pipeline@v0.9.0 · 5699 in / 1119 out tokens · 21702 ms · 2026-05-24T22:10:34.326561+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Theorem 2 (Conservation of Bias). ... ∑_{t∈τ_k} Bias(D,t)=0
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Corollary 2 (Geometric Divergence). Pr(...) ≤ ∥t∥ cos(θ)/q_min
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Theorem 4 (Futility of Bias-Free Search). If Bias(D,t)=0 then Pr= p

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Addison-Wesley Longman Publishing Company (1999)

Goldberg, D.: Genetic algorithms in search optimization and machine learning. Addison-Wesley Longman Publishing Company (1999)

work page 1999
[2]

Journal of Machine Learning Research 17(8), 1–32 (2016)

G¨ ul¸ cehre, C ¸ ., Bengio, Y.: Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research 17(8), 1–32 (2016)

work page 2016
[3]

no free lunch

McDermott, J.: When and why metaheuristics re- searchers can ignore “no free lunch” theorems. Metaheuris- tics (Mar 2019). https://doi.org/10.1007/s42257-019-00 002-6, https://doi.org/10.1007/s42257-019-00002-6

work page doi:10.1007/s42257-019-00 2019
[4]

In: Rutgers Univer- sity: CBM-TR-117 (1980)

Mitchell, T.D.: The need for biases in learning generaliz ations. In: Rutgers Univer- sity: CBM-TR-117 (1980)

work page 1980
[5]

In: 2017 IEEE International Conference on Systems, M an, and Cybernetics (SMC)

Monta ˜ nez, G.D.: The famine of forte: Few search problems greatly favor your algo- rithm. In: 2017 IEEE International Conference on Systems, M an, and Cybernetics (SMC). pp. 477–482. IEEE (2017)

work page 2017
[6]

In: Dissert ation

Monta ˜ nez, G.D.: Why machine learning works. In: Dissert ation. pp. 52–59. Carnegie Mellon University (2017)

work page 2017
[7]

In: Proc eedings of the 13th International Conference on Neural Information Processin g Systems

Rasmussen, C.E., Ghahramani, Z.: Occam’s razor. In: Proc eedings of the 13th International Conference on Neural Information Processin g Systems. pp. 276–282. NIPS’00, MIT Press, Cambridge, MA, USA (2000)

work page 2000
[8]

Reeves, C., Rowe, J.E.: Genetic algorithms: principles a nd perspectives: a guide to GA theory, vol. 20. Springer Science & Business Media (2002)

work page 2002
[9]

Systems, Man, and Cybernetics, Part C: Applications and Rev iews, IEEE Trans- actions on 35, 233 – 243 (06 2005)

Runarsson, T., Yao, X.: Search biases in constrained evol utionary optimization. Systems, Man, and Cybernetics, Part C: Applications and Rev iews, IEEE Trans- actions on 35, 233 – 243 (06 2005). https://doi.org/10.1109/TSMCC.2004 .841906

work page doi:10.1109/tsmcc.2004 2005
[10]

In: Machine Learn- ing Proceedings 1994, pp

Schaﬀer, C.: A conservation law for generalization perf ormance. In: Machine Learn- ing Proceedings 1994, pp. 259–265. Elsevier (1994)

work page 1994
[11]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp

Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prio r. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp. 9446–9454 (2018)

work page 2018
[12]

Wolpert, D.H., Macready, W.G.: No free lunch theorems fo r optimization. Trans. Evol. Comp 1(1), 67–82 (Apr 1997). https://doi.org/10.1109/4235.585 893 The Futility of Bias-Free Learning and Search 13 7 Appendix: Proofs Lemma 1 (Expected Per Query Performance F rom Expected Distr i- bution). This lemma has been proven by Monta˜ nez [5] and is directly dr ...

work page doi:10.1109/4235.585 1997
[13]

Theorem 3 (F amine of F avorable Information Resources)

So, Pr(q(t, F ) ≥ qmin) ≤ ∥t∥ qmin ( t⊤ ED[P F ] ∥t∥∥ED[P F ]∥ ) = ∥t∥ qmin cos ( arccos ( t⊤ ED[P F ] ∥t∥∥ED[P F ]∥ )) By the deﬁnition of target divergence, we have Pr(q(t, F ) ≥ qmin) ≤ ∥t∥ cos(θ) qmin . Theorem 3 (F amine of F avorable Information Resources). Let B be a ﬁnite set of information resources and let t ⊆ Ω be an arbitrary ﬁxed k-size targe...

work page

[1] [1]

Addison-Wesley Longman Publishing Company (1999)

Goldberg, D.: Genetic algorithms in search optimization and machine learning. Addison-Wesley Longman Publishing Company (1999)

work page 1999

[2] [2]

Journal of Machine Learning Research 17(8), 1–32 (2016)

G¨ ul¸ cehre, C ¸ ., Bengio, Y.: Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research 17(8), 1–32 (2016)

work page 2016

[3] [3]

no free lunch

McDermott, J.: When and why metaheuristics re- searchers can ignore “no free lunch” theorems. Metaheuris- tics (Mar 2019). https://doi.org/10.1007/s42257-019-00 002-6, https://doi.org/10.1007/s42257-019-00002-6

work page doi:10.1007/s42257-019-00 2019

[4] [4]

In: Rutgers Univer- sity: CBM-TR-117 (1980)

Mitchell, T.D.: The need for biases in learning generaliz ations. In: Rutgers Univer- sity: CBM-TR-117 (1980)

work page 1980

[5] [5]

In: 2017 IEEE International Conference on Systems, M an, and Cybernetics (SMC)

Monta ˜ nez, G.D.: The famine of forte: Few search problems greatly favor your algo- rithm. In: 2017 IEEE International Conference on Systems, M an, and Cybernetics (SMC). pp. 477–482. IEEE (2017)

work page 2017

[6] [6]

In: Dissert ation

Monta ˜ nez, G.D.: Why machine learning works. In: Dissert ation. pp. 52–59. Carnegie Mellon University (2017)

work page 2017

[7] [7]

In: Proc eedings of the 13th International Conference on Neural Information Processin g Systems

Rasmussen, C.E., Ghahramani, Z.: Occam’s razor. In: Proc eedings of the 13th International Conference on Neural Information Processin g Systems. pp. 276–282. NIPS’00, MIT Press, Cambridge, MA, USA (2000)

work page 2000

[8] [8]

Reeves, C., Rowe, J.E.: Genetic algorithms: principles a nd perspectives: a guide to GA theory, vol. 20. Springer Science & Business Media (2002)

work page 2002

[9] [9]

Systems, Man, and Cybernetics, Part C: Applications and Rev iews, IEEE Trans- actions on 35, 233 – 243 (06 2005)

Runarsson, T., Yao, X.: Search biases in constrained evol utionary optimization. Systems, Man, and Cybernetics, Part C: Applications and Rev iews, IEEE Trans- actions on 35, 233 – 243 (06 2005). https://doi.org/10.1109/TSMCC.2004 .841906

work page doi:10.1109/tsmcc.2004 2005

[10] [10]

In: Machine Learn- ing Proceedings 1994, pp

Schaﬀer, C.: A conservation law for generalization perf ormance. In: Machine Learn- ing Proceedings 1994, pp. 259–265. Elsevier (1994)

work page 1994

[11] [11]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp

Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prio r. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp. 9446–9454 (2018)

work page 2018

[12] [12]

Wolpert, D.H., Macready, W.G.: No free lunch theorems fo r optimization. Trans. Evol. Comp 1(1), 67–82 (Apr 1997). https://doi.org/10.1109/4235.585 893 The Futility of Bias-Free Learning and Search 13 7 Appendix: Proofs Lemma 1 (Expected Per Query Performance F rom Expected Distr i- bution). This lemma has been proven by Monta˜ nez [5] and is directly dr ...

work page doi:10.1109/4235.585 1997

[13] [13]

Theorem 3 (F amine of F avorable Information Resources)

So, Pr(q(t, F ) ≥ qmin) ≤ ∥t∥ qmin ( t⊤ ED[P F ] ∥t∥∥ED[P F ]∥ ) = ∥t∥ qmin cos ( arccos ( t⊤ ED[P F ] ∥t∥∥ED[P F ]∥ )) By the deﬁnition of target divergence, we have Pr(q(t, F ) ≥ qmin) ≤ ∥t∥ cos(θ) qmin . Theorem 3 (F amine of F avorable Information Resources). Let B be a ﬁnite set of information resources and let t ⊆ Ω be an arbitrary ﬁxed k-size targe...

work page