pith. machine review for the scientific record.

arxiv: 2605.11272 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI · cs.IR

Recognition: 2 theorem links


Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 02:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IR
keywords learning-to-rank · exposure bias · localization · vision-language models · multi-objective learning · cross-locale bias · relevance labels

The pith

Multi-objective learning-to-rank with vision-language labels and locale boosting reduces US-centric exposure bias in global templates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Learning-to-rank models trained on behavioral clicks inherit heavy US data dominance, over-serving American templates and hiding locally relevant content in growth markets. Click-only training also mutes semantically useful localization signals. Adding graded relevance labels from a vision-language model improves semantic fit but still fails to restore local visibility. The authors combine clicks, VLM signals, and an explicit locale-aware boosting term in a single multi-objective loss; the resulting model lifts relevance while stabilizing local content exposure across five tested locales.

Core claim

A multi-objective framework that jointly optimizes behavioral supervision, VLM-derived relevance grades, and locale-aware boosting improves semantic alignment and restores stable local content visibility in non-US locales, whereas either clicks alone or clicks plus VLM labels leave the exposure imbalance intact.

What carries the argument

A locale-aware boosting term counteracts cross-locale exposure imbalance inside the ranking loss, while auxiliary VLM relevance labels supply semantic supervision.
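The paper does not publish the form of its objective, so here is a minimal sketch of the three-term structure described above: a behavioral click loss, a semantic loss against VLM grades, and a locale-aware exposure penalty. The term forms, the weights (`w_click`, `w_vlm`, `w_locale`), and the target exposure share are all illustrative assumptions, not the authors' method.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multi_objective_loss(scores, clicks, vlm_grades, is_local,
                         w_click=1.0, w_vlm=0.5, w_locale=0.3,
                         target_local_share=0.5):
    """Hypothetical three-term objective; every term form and weight
    here is an assumption made for illustration."""
    n = len(scores)
    # Behavioral term: logistic loss of scores against binary click labels.
    click_loss = sum(
        math.log(1.0 + math.exp(-(2 * c - 1) * s))
        for s, c in zip(scores, clicks)
    ) / n
    # Semantic term: squared error between sigmoid(score) and the
    # VLM-graded relevance label in [0, 1].
    vlm_loss = sum((sigmoid(s) - g) ** 2
                   for s, g in zip(scores, vlm_grades)) / n
    # Locale-aware boosting term: hinge penalty when the softmax exposure
    # share of local items drops below a target share.
    exp_scores = [math.exp(s) for s in scores]
    local_share = sum(e for e, loc in zip(exp_scores, is_local) if loc) / sum(exp_scores)
    locale_loss = max(0.0, target_local_share - local_share) ** 2
    return w_click * click_loss + w_vlm * vlm_loss + w_locale * locale_loss
```

Setting `w_locale=0.0` recovers the clicks-plus-VLM configuration the paper says leaves the exposure imbalance intact.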

If this is right

  • Relevance metrics rise in the five evaluated growth locales without sacrificing US performance.
  • Local templates receive measurably higher and more stable exposure once exposure is disentangled from semantic signals.
  • The same separation of exposure bias from semantic supervision applies to any ranking system whose training data is geographically skewed.
  • Pure auxiliary supervision (VLM labels) is insufficient by itself to correct visibility suppression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar disentangling layers may be needed in other recommendation domains where one region dominates interaction data.
  • Dynamic versions of the boosting term could be driven by ongoing per-locale performance monitoring rather than fixed weights.
  • The approach implies that future LTR pipelines should treat exposure correction as a first-class modeling objective rather than an afterthought.
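The dynamic-boosting extension above can be made concrete as a hypothetical proportional controller: nothing like this appears in the paper, and the update rule, learning rate, and per-locale targets are invented for illustration.

```python
def update_boost_weights(weights, exposure_share, target_share, lr=0.1):
    """Hypothetical controller (not from the paper): raise a locale's
    boost weight when its monitored local-content exposure share falls
    below target, lower it when above, clamped at zero."""
    new = {}
    for locale, w in weights.items():
        gap = target_share[locale] - exposure_share[locale]
        new[locale] = max(0.0, w + lr * gap)
    return new
```

Run per monitoring window, this would replace the paper's fixed boosting weights with a feedback loop on observed exposure.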

Load-bearing premise

Vision-language model relevance labels are accurate and unbiased across locales, and the added boosting term will not degrade overall ranking quality or create new biases.

What would settle it

A controlled ablation that removes only the locale-aware boosting component and measures whether local content visibility falls back to the click-only baseline despite the presence of VLM labels.
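Such an ablation needs a concrete exposure metric to compare the ablated ranking against the click-only baseline. The paper does not specify its metric; a common stand-in is rank-discounted exposure (1/log2(rank+1)), sketched here as an assumption.

```python
import math

def local_exposure_share(ranking_is_local):
    """Share of rank-discounted exposure that goes to local items.
    The 1/log2(rank+1) position model is a standard assumption, not
    the paper's stated metric."""
    total = 0.0
    local = 0.0
    for rank, is_local in enumerate(ranking_is_local, start=1):
        e = 1.0 / math.log2(rank + 1)
        total += e
        if is_local:
            local += e
    return local / total
```

If removing the boosting term pushes local items down the list, this share should fall back toward the click-only baseline even with VLM labels present.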

read the original abstract

Adobe Express is expanding internationally, but the US has a disproportionately large content supply and interaction volume. Learning-to-rank (LTR) models trained primarily on behavioral feedback inherit this imbalance: templates popular in the US are over-served in non-US locales. This cross-locale exposure bias suppresses local content discoverability and degrades ranking quality in growth locales. We show that click-only training suppresses semantically informative localization features. Adding vision-language model (VLM) graded relevance labels as auxiliary supervision alongside clicks improves semantic alignment but does not preserve local content visibility. We propose a multi-objective framework combining behavioral supervision, VLM-derived relevance signals, and locale-aware boosting. Across five locales, the resulting model improves relevance while restoring stable localization, demonstrating the importance of disentangling exposure from semantic supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper addresses cross-locale exposure bias in learning-to-rank models for Adobe Express, where US-dominant behavioral data leads to over-serving US templates in growth locales. It proposes combining click supervision with VLM-derived graded relevance labels as auxiliary signals and a locale-aware boosting component in a multi-objective framework. The central claim is that this disentangles exposure bias from semantic supervision, yielding improved relevance and restored stable localization across five locales.

Significance. If the results hold, the work offers a concrete approach to mitigating locale imbalance in production LTR systems without sacrificing semantic quality. The explicit separation of behavioral, semantic (VLM), and locale-boosting objectives is a useful framing for growth-market ranking problems. No machine-checked proofs or parameter-free derivations are present, but the multi-objective formulation itself is a clear methodological contribution if supported by rigorous experiments.

major comments (2)
  1. [Abstract] The manuscript asserts that the multi-objective model 'improves relevance while restoring stable localization' across five locales, yet supplies no quantitative metrics, baselines, offline/online evaluation protocols, statistical significance tests, or ablation results. Without these, the central claim that the framework successfully disentangles exposure from semantic supervision cannot be assessed.
  2. [Abstract] Proposed multi-objective framework (as described in the abstract): The approach treats VLM graded relevance labels as clean auxiliary supervision that can be safely combined with clicks and locale boosting. No inter-locale human correlation, calibration curves, or error analysis for the VLM outputs is provided. If the VLM exhibits systematic locale-specific biases (cultural, linguistic, or training-data skew), the reported restoration of localization cannot be attributed to the proposed disentangling mechanism.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the specific VLM, the five locales, and the precise form of the locale-aware boosting term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript. We value your comments on strengthening the abstract and validating the VLM supervision. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: The manuscript asserts that the multi-objective model 'improves relevance while restoring stable localization' across five locales, yet supplies no quantitative metrics, baselines, offline/online evaluation protocols, statistical significance tests, or ablation results. Without these, the central claim that the framework successfully disentangles exposure from semantic supervision cannot be assessed.

    Authors: We agree that the abstract, as currently written, is high-level and does not include specific quantitative evidence. The body of the manuscript details the experimental setup with offline and online evaluations, baselines, ablations, and statistical tests across the five locales. To address this, we will revise the abstract to concisely report key quantitative outcomes, such as relative improvements in relevance metrics and localization stability, while directing readers to the full evaluation protocols in the paper. revision: yes

  2. Referee: The approach treats VLM graded relevance labels as clean auxiliary supervision that can be safely combined with clicks and locale boosting. No inter-locale human correlation, calibration curves, or error analysis for the VLM outputs is provided. If the VLM exhibits systematic locale-specific biases (cultural, linguistic, or training-data skew), the reported restoration of localization cannot be attributed to the proposed disentangling mechanism.

    Authors: This is a fair concern. The current version relies on the VLM as a general-purpose semantic signal without dedicated validation for locale biases. Our experiments show that adding the VLM signal improves semantic alignment but requires the locale boosting to restore visibility, supporting the disentangling claim. However, to rigorously rule out VLM biases as a confounding factor, we will include in the revision an analysis of VLM label agreement with human judgments across locales, along with calibration and error analysis. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical multi-objective framework is self-contained

full rationale

The paper proposes combining click-based behavioral supervision with VLM-graded relevance labels and locale-aware boosting in a multi-objective LTR setup. No equations, derivations, or self-citations are presented that reduce any claimed prediction or result to the inputs by construction. The reported gains across five locales are framed as experimental outcomes from disentangling exposure bias, not tautological redefinitions or fitted parameters renamed as predictions. The central claim rests on external VLM signals and boosting rather than internal self-reference, making the derivation chain independent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5447 in / 990 out tokens · 63791 ms · 2026-05-13T02:47:21.177745+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 3 internal anchors

  1. [1]

    Qingyao Ai, Tao Yang, Huazheng Wang, and Jiaxin Mao. 2021. Unbiased Learning to Rank: Online or Offline? ACM Transactions on Information Systems 39, 2, Article 21 (2021). doi:10.1145/3439861

  2. [2]

    Krisztian Balog, Donald Metzler, and Zhen Qin. 2025. Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3865–3875. doi:10.1145/3726302.3730348

  3. [3]

    Asia J. Biega, Krishna P. Gummadi, and Gerhard Weikum. 2018. Equity of Attention: Amortizing Individual Fairness in Rankings. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, New York, NY, USA

  4. [4]

    Luiz Henrique Bonifacio, Andres Abeliuk, Pablo Castellanos, and Arman Cohan

  5. [5]

    InPars: Data Augmentation for Information Retrieval using Large Language Models. arXiv:2212.05144 [cs.IR]

  6. [6]

    Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to Rank using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML). ACM, New York, NY, USA

  7. [7]

    Olivier Chapelle and Ya Zhang. 2009. A Dynamic Bayesian Network Click Model for Web Search Ranking. In Proceedings of the 18th International Conference on World Wide Web (WWW). ACM, New York, NY, USA

  8. [8]

    Mouxiang Chen, Chenghao Liu, Jianling Sun, and Steven C. H. Hoi. 2021. Adapting Interactional Observation Embedding for Counterfactual Learning to Rank. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). ACM, 285–294. doi:10.1145/3404835.3462901

  9. [9]

    Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An Experimental Comparison of Click Position-Bias Models. In Proceedings of the 1st ACM International Conference on Web Search and Data Mining (WSDM). ACM, New York, NY, USA

  10. [10]

    Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. 2022. Promptagator: Few-shot Dense Retrieval From 8 Examples. arXiv:2209.11755 [cs.CL] https://arxiv.org/abs/2209.11755

  11. [11]

    Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, Mel...

  12. [12]

    Gizem Gezici. 2022. Case Study: The Impact of Location on Bias in Search Results. arXiv:2206.11869 [cs.IR] https://arxiv.org/abs/2206.11869

  13. [13]

    Shashank Gupta, Yiming Liao, and Maarten de Rijke. 2026. Towards Two-Stage Counterfactual Learning to Rank. In Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’26). ACM. doi:10.1145/3731120.3744583

  14. [14]

    Maria Heuss, Fatemeh Sarvi, and Maarten de Rijke. 2022. Fairness of Exposure in Light of Incomplete Exposure Estimation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22). ACM. doi:10.1145/3477495.3531977

  15. [15]

    Thorsten Joachims. 2005. Accurately Interpreting Clickthrough Data as Implicit Feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, New York, NY, USA

  16. [16]

    Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM). ACM, New York, NY, USA

  17. [17]

    Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, and Goran Glavaš. 2022. On cross-lingual retrieval with multilingual text encoders. Information Retrieval Journal 25 (2022), 149–183. doi:10.1007/s10791-022-09406-x

  18. [18]

    Xiao Liu, Juan Hu, Qi Shen, and Huan Chen. 2021. Geo-BERT Pre-training Model for Query Rewriting in POI Search. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, 2209–2214. doi:10.18653/v1/2021.findings-emnlp.190

  19. [19]

    Yang Liu, Dan Iter, Bryan Lee, Jialu Xu, Hanyuan Zhao, Douwe Kiela, et al

  20. [20]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634 [cs.CL]

  21. [21]

    Zechun Niu, Zhilin Zhang, Jiaxin Mao, Qingyao Ai, and Ji-Rong Wen. 2025. Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). ACM. doi:10.1145/3726302.3730310

  22. [22]

    Harrie Oosterhuis and Maarten de Rijke. 2018. Differentiable Unbiased Online Learning to Rank. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM). ACM, New York, NY, USA

  23. [23]

    OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

  24. [24]

    OpenAI. 2024. GPT-4o System Card. arXiv:2410.21276 [cs.CL] https://arxiv.org/ abs/2410.21276

  25. [25]

    Ashudeep Singh and Thorsten Joachims. 2018. Fairness of Exposure in Rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). ACM, New York, NY, USA

  26. [26]

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2024. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. arXiv:2304.09542 [cs.CL] https://arxiv.org/abs/2304.09542

  27. [27]

    Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. In Proceedings of the 32nd International Conference on Machine Learning (ICML). PMLR, Lille, France

  28. [28]

    Frank Wilcoxon. 1945. Individual Comparisons by Ranking Methods. Biometrics Bulletin 1, 6 (1945), 80–83. http://www.jstor.org/stable/3001968

  29. [29]

    Le Yan, Zhen Qin, Honglei Zhuang, Xuanhui Wang, Michael Bendersky, and Marc Najork. 2022. Revisiting Two-tower Models for Unbiased Learning to Rank. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22). ACM, 2410–2414. doi:10.1145/3477495.3531837

  30. [30]

    Tao Yang, Zhichao Xu, Zhenduo Wang, and Qingyao Ai. 2023. FARA: Future-aware Ranking Algorithm for Fairness Optimization. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23). ACM. doi:10.1145/3583780.3614877

  31. [31]

    Meike Zehlike, Francesco Bonchi, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. 2017. FA*IR: A Fair Top-𝑘 Ranking Algorithm. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM). ACM, New York, NY, USA

  32. [32]

    Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A Multilingual Benchmark for Dense Retrieval. arXiv:2108.08787 [cs.CL] https://arxiv.org/abs/2108.08787

  33. [33]

    Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin

  34. [34]

    Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages. arXiv:2210.09984 [cs.IR] https://arxiv.org/abs/2210.09984

  35. [35]

    Yiqian Zhang, Yinfu Feng, Wen-Ji Zhou, Yunan Ye, Min Tan, Rong Xiao, Haihong Tang, Jiajun Ding, and Jun Yu. 2024. Multi-Domain Deep Learning from a Multi-View Perspective for Cross-Border E-commerce Search. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24). 9387–9395