pith. sign in

arxiv: 2605.21812 · v1 · pith:OLMII6QKnew · submitted 2026-05-20 · 💻 cs.IR

Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb

Pith reviewed 2026-05-22 07:59 UTC · model grok-4.3

classification 💻 cs.IR
keywords synthetic data generationLLM query generationcold-start problemnatural language searchrelevance labelingcontrastive generationAirbnb search
0
0 comments X

The pith

Seed-guided LLM synthesis matches real user query lengths and attributes closely enough to train and evaluate natural language search models from the first day.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that uses large language models to create synthetic queries and relevance labels when no real user data exists for a new natural language search system. It guides query generation by mixing contrastive pairs of listings drawn from actual booking sessions with seed queries collected from user research. This produces queries whose length distribution has a KL divergence of 0.66 from real queries and whose attribute-type distribution has a KL divergence of 0.04. Label generation relies on contrastive construction that defines topicality by design together with virtual-judge labeling for broader coverage. The resulting evaluation sets are harder than those from a no-seed baseline, showing 79 percent pairwise accuracy instead of 97 percent, and the method is already running in daily production pipelines.

Core claim

A seed-guided contrastive approach that feeds booking-session pairs and user-research queries into LLMs generates synthetic queries and labels whose length and attribute distributions are far closer to real users than either a no-seed contrastive baseline or an InPars-style baseline, while also creating more discriminative evaluation examples for ranking models.

What carries the argument

Seed-guided contrastive query generation that blends booking-session pairs with user-research seeds, paired with contrastive label generation and virtual-judge labeling.

If this is right

  • Production pipelines can generate fresh synthetic examples every day to support embedding retrieval and ranking evaluation.
  • The same seeded data enables a gradual shift from cold-start training to warm-start training as real queries arrive.
  • Harder synthetic evaluation sets give clearer signals for iterative model improvement than easier no-seed sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same seeding strategy could be tested on other search or recommendation platforms that face similar initial-data shortages.
  • Reducing the influence of the seeds over time as real data accumulates might further tighten the match to live user behavior.
  • The method could shorten the time between feature launch and usable model performance in additional domains that rely on natural language input.

Load-bearing premise

That queries and labels produced by the LLM, even when guided by booking sessions and research seeds, will match the behavior of real users once the live system receives actual queries.

What would settle it

A side-by-side comparison of length and attribute distributions between the synthetic queries and the first large batch of real user queries collected after deployment, showing substantially higher KL divergence than the reported 0.66 and 0.04 values.

Figures

Figures reproduced from arXiv: 2605.21812 by Dillon Davis, Hao Li, Huiji Gao, Kedar Bellare, Malay Haldar, Sanjeev Katariya, Soumyadip Banerjee, Stephanie Moyerman, Weiwei Guo, Wendy Ran Wei, Xiaowei Liu, Xueyin Chen.

Figure 1
Figure 1. Figure 1: Contrastive query generation pipeline. Left: input data sources include search sessions (with booking logs) and listing [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Baseline vs. our approach: without seed guidance, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Query length distribution comparison across datasets. Real users show right-skewed distribution with peak at 2 words; [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
read the original abstract

Deploying natural language search systems presents a critical cold-start challenge: no real user queries to learn linguistic patterns, and no relevance labels to train ranking models. We present a framework for generating synthetic queries and labels using large language models (LLMs), powering model training and evaluation for Airbnb's natural language search. For query generation, we combine contrastive listing pairs from booking sessions with seed queries from user research to balance realism and diversity, enabling a cold-to-warm start transition as real user data becomes available. For label generation, we introduce contrastive generation that produces topicality labels by construction, and Virtual Judge (VJ) labeling for broader coverage. We compare our approach against a no-seed contrastive baseline and an InPars-style baseline. For query length, the InPars baseline produces verbose queries with KL divergence of 12.03 vs. real users; our seed-guided approach achieves 0.66, a 7.5x improvement. For attribute type distributions, our approach achieves the lowest KL divergence (0.04), outperforming even seed queries (0.09). Experiments show our approach produces harder evaluation examples than the no-seed baseline (79% vs. 97% pairwise accuracy), providing discriminative signal for model improvement. We deploy production pipelines generating synthetic examples daily for embedding-based retrieval and ranking evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes an LLM-powered framework to generate synthetic queries and relevance labels for natural language search systems facing cold-start conditions, as at Airbnb. Query generation combines contrastive listing pairs from booking sessions with user-research seed queries to improve realism and diversity. Label generation uses contrastive methods to produce topicality labels by construction and Virtual Judge (VJ) labeling for broader coverage. The authors compare their seed-guided approach against a no-seed contrastive baseline and an InPars-style baseline, reporting a KL divergence of 0.66 on query length (7.5x better than InPars' 12.03), 0.04 on attribute distributions, and harder evaluation examples (79% pairwise accuracy vs. 97% for the no-seed baseline). They also describe deploying production pipelines that generate synthetic examples daily for embedding-based retrieval and ranking evaluation.

Significance. If the synthetic data generalizes to live user queries, the work offers a practical, deployable solution for cold-start challenges in industry IR systems, with clear quantitative gains in distribution matching over strong baselines. The production deployment and focus on both query generation and labeling are practical strengths that could influence synthetic-data practices in search.

major comments (1)
  1. [Experiments / Evaluation] The central claim that the seed-guided synthetic data bridges the cold-start gap for retrieval and ranking rests on the assumption that distribution-matched synthetic queries and labels will improve models on real user queries. However, the manuscript reports no downstream experiment that trains an embedding retriever or ranker on the generated data and measures recall, NDCG, or click-through metrics on a held-out set of real post-deployment queries (or any A/B test in production). The reported KL divergences and pairwise accuracies are computed only against real-user reference distributions on the synthetic set itself.
minor comments (2)
  1. [Abstract] The abstract introduces 'Virtual Judge (VJ) labeling' without a one-sentence definition; a brief parenthetical explanation on first use would aid readers.
  2. [Methods] Clarify the exact difference in prompt construction or sampling between the 'seed-guided' method and the 'no-seed contrastive baseline' so that the 79% vs. 97% pairwise accuracy gap can be reproduced from the description alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the practical strengths of our work and for the constructive feedback on evaluation. We address the major comment below, clarifying the scope of our contributions while committing to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments / Evaluation] The central claim that the seed-guided synthetic data bridges the cold-start gap for retrieval and ranking rests on the assumption that distribution-matched synthetic queries and labels will improve models on real user queries. However, the manuscript reports no downstream experiment that trains an embedding retriever or ranker on the generated data and measures recall, NDCG, or click-through metrics on a held-out set of real post-deployment queries (or any A/B test in production). The reported KL divergences and pairwise accuracies are computed only against real-user reference distributions on the synthetic set itself.

    Authors: We appreciate this observation and agree that direct downstream experiments—training an embedding retriever or ranker on the synthetic data and measuring recall, NDCG, or click-through rate on held-out real post-deployment queries—would provide stronger evidence for the framework's impact. Our current evaluation deliberately focuses on the fidelity of the generated data through distribution-matching metrics (KL divergence on query length and attribute distributions) and the discriminative power of the resulting evaluation sets (pairwise accuracy of 79% vs. 97% for the no-seed baseline). These proxies are standard in synthetic-data-for-IR literature and demonstrate that the seed-guided approach produces more realistic and harder examples than the InPars-style and no-seed baselines. The manuscript also reports the production deployment of daily synthetic-data pipelines for embedding-based retrieval and ranking evaluation, which supports iterative model improvement as real queries accumulate during the cold-to-warm transition. We will revise the manuscript to (1) explicitly state the scope of our claims, (2) add a dedicated limitations paragraph acknowledging the absence of end-to-end retrieval/ranking results on real queries, and (3) outline concrete plans for future A/B testing and offline evaluation once sufficient post-deployment real query data is available. This partial revision will address the referee's concern without changing the core technical contributions. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical distribution matching against external references

full rationale

The manuscript reports direct empirical comparisons of generated query statistics (KL divergence on length and attribute distributions) to real-user reference distributions and to external baselines (InPars, no-seed contrastive). These measurements are computed on the outputs of the LLM generation process versus independent real data; they do not reduce to any parameter fitted inside the paper, nor to a self-defined quantity. No equations, uniqueness theorems, self-citations, or ansatzes are invoked to support the core claims. The 'by construction' phrasing for topicality labels is a definitional statement about the generation method itself and is not presented as a derived prediction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the assumption that LLMs can faithfully simulate user query distributions when given booking-session pairs and short seeds; no explicit free parameters are fitted in the abstract, but the choice of which seed queries to include functions as an implicit modeling decision.

axioms (1)
  • domain assumption LLMs prompted with booking-session contrastive pairs and user-research seeds will produce query distributions close to real users.
    Invoked in the query-generation section to justify the cold-to-warm transition claim.
invented entities (1)
  • Virtual Judge (VJ) labeling no independent evidence
    purpose: Generate broader topicality labels beyond strict contrastive pairs.
    New labeling procedure introduced to increase coverage; no independent falsifiable handle provided in the abstract.

pith-pipeline@v0.9.0 · 5819 in / 1395 out tokens · 48764 ms · 2026-05-22T07:59:23.375589+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Rose, and Benjamin Stewart

    Omar Alonso, Daniel E. Rose, and Benjamin Stewart. 2008. Crowdsourcing for Relevance Evaluation. InACM SIGIR Forum, Vol. 42. 9–15

  2. [2]

    Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Data Augmentation for Information Retrieval using Large Language Models.arXiv preprint arXiv:2202.05144(2022)

  3. [3]

    Zhuyun Dai, Vincent Y Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Baber, Kelvin Lee, and Yi Lu. 2023. Promptagator: Few-shot Dense Retrieval From 8 Examples. InInternational Conference on Learning Representations

  4. [4]

    Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy

    Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 968–988

  5. [5]

    Albert Gatt and Emiel Krahmer. 2018. Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation.Journal of Artificial Intelligence Research61 (2018), 65–170

  6. [6]

    Mihajlo Grbovic and Haibin Cheng. 2018. Real-time Personalization using Em- beddings for Search Ranking at Airbnb. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 311–320

  7. [7]

    Turnbull, Bren- dan M

    Malay Haldar, Mustafa Abdool, Prashant Ramanathan, Tao Xu, Shulin Yang, Huizhong Duan, Qing Zhang, Nick Barrow-Williams, Bradley C. Turnbull, Bren- dan M. Collins, and Thomas Legrand. 2019. Applying Deep Learning to Airbnb Search. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1927–1935

  8. [8]

    Malay Haldar, Daochen Zha, Huiji Gao, Liwei He, and Sanjeev Katariya. 2025. Beyond Pairwise Learning-To-Rank At Airbnb. InProceedings of the 34th ACM International Conference on Information and Knowledge Management. 5707–5714

  9. [9]

    Malay Haldar, Hongwei Zhang, Kedar Bellare, Sherry Chen, Soumyadip Banerjee, Xiaotang Wang, Mustafa Abdool, Huiji Gao, Pavan Tapadia, Liwei He, and Sanjeev Katariya. 2024. Learning to Rank for Maps at Airbnb. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5061–5069

  10. [10]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Rber, Edouard Grave, Armand Joulin, and Gabriel Synnaeve. 2022. Unsupervised Dense Information Retrieval with Contrastive Learning. InTransactions on Machine Learning Re- search

  11. [11]

    Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, and Rodrigo Nogueira. 2023. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2299–2303

  12. [12]

    Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. 2020. Hard Negative Mixing for Contrastive Learning. InAdvances in Neural Information Processing Systems, Vol. 33. 21798–21809

  13. [13]

    Mihir Kale and Abhinav Rastogi. 2020. Text-to-Text Pre-Training for Data-to-Text Tasks. InProceedings of the 13th International Conference on Natural Language Generation. 97–102

  14. [14]

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open- Domain Question Answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 6769–6781

  15. [15]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2511–2522

  16. [16]

    Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2021. Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation. 8 Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb Preprint, 2026, InProceedings of the 16th Conference of the European Chapter of the Associat...

  17. [17]

    Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction.arXiv preprint arXiv:1904.08375(2019)

  18. [18]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.Journal of Machine Learning Research21, 140 (2020), 1–67

  19. [19]

    Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. 2021. Contrastive Learning with Hard Negative Samples. InInternational Conference on Learning Representations

  20. [20]

    Timo Schick and Hinrich Schütze. 2021. Generating Datasets with Pretrained Language Models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 6943–6951

  21. [21]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA Model.GitHub repository(2023)

  22. [22]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  23. [23]

    Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want To Reduce Labeling Cost? GPT-3 Can Help.Findings of the Association for Computational Linguistics: EMNLP 2021(2021), 4195–4205

  24. [24]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions.arXiv preprint arXiv:2212.10560(2023)

  25. [25]

    Jason Wei and Kai Zou. 2019. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks.Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing(2019), 6382– 6388

  26. [26]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate Nearest Neighbor Neg- ative Contrastive Learning for Dense Text Retrieval. InInternational Conference on Learning Representations

  27. [27]

    entire home

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al . 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems, Vol. 36. A Query Length Distribution Figure 3 shows word count distributions across all four datasets...