pith. machine review for the scientific record.

arxiv: 2605.07677 · v1 · submitted 2026-05-08 · 💻 cs.IR · cs.AI · cs.CL

Recognition: no theorem link

TRACE: Tourism Recommendation with Accountable Citation Evidence

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:23 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords tourism recommendation · conversational recommender systems · citation evidence · LLM evaluation · benchmark dataset · grounding · rejection recovery · verifiable evidence

The pith

Tourism recommenders split along a Three-Competency Gap: LLMs that are accurate but sparsely cited, retrievers that are grounded but inaccurate, and synthesis models that fail to recover from rejections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TRACE as a benchmark dataset of 10,000 multi-turn dialogues over 2,400 Yelp points of interest, each including review-span citations and explicit rejection turns to test recommendation systems on accuracy, verifiable evidence, and adaptation. Current methods divide along a Three-Competency Gap where LLM zero-shot approaches achieve strong closed-set recall and rejection recovery yet cite less densely than retrievers, non-LLM retrievers deliver surface-verbatim grounding but low accuracy, and multi-review synthesis fails to recover from rejections. This setup shows that single-metric evaluations like Recall@k miss the joint requirements for trustworthy tourism advice that travelers can act on without trial and error.

Core claim

TRACE consists of 10,000 multi-turn tourism recommendation dialogues across eight U.S. cities, each tied to review-span citations from 34,208 Yelp reviews and explicit rejection turns. When 14 retrieval, planning, and LLM baselines are scored on 25 metrics grouped under Accuracy, Grounding, and Recovery, LLM Zero-Shot leads in closed-set Recall@1 and rejection recovery but cites less densely than retrievers; non-LLM retrievers achieve surface-verbatim grounding but with low accuracy; and Multi-Review Synthesis fails at recovery. The Grounding Score matches human citation precision with Spearman rho of +0.80, and paired t-tests reproduce the per-baseline ranking.
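
As a concrete illustration of what one benchmark item might look like given this description, here is a minimal sketch of the implied data structure. The field names are assumptions for illustration, not the paper's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical schema sketch; field names are assumed, not taken from the released dataset.

@dataclass
class Citation:
    review_id: str   # one of the 34,208 Yelp reviews
    span: str        # verbatim review span quoted as evidence

@dataclass
class Turn:
    speaker: str                           # "user" or "system"
    text: str
    is_rejection: bool = False             # marks an explicit rejection turn
    recommended_poi: Optional[str] = None  # POI id on system recommendation turns
    citations: List[Citation] = field(default_factory=list)

@dataclass
class Dialogue:
    dialogue_id: str
    city: str        # one of the eight U.S. cities
    gold_poi: str    # target POI among the 2,400 Yelp POIs
    turns: List[Turn] = field(default_factory=list)
```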

What carries the argument

The TRACE dataset of multi-turn dialogues with review-span citations and rejection turns, scored by 25 metrics organized under Accuracy, Grounding, and Recovery to expose the Three-Competency Gap.

If this is right

  • Accountable tourism recommendation requires systems that jointly optimize the right POI, verifiable review evidence, and adaptive repair rather than single-axis metrics.
  • LLM approaches need denser citation mechanisms to match retriever grounding without losing recall advantages.
  • Retrievable evidence must be paired with higher-accuracy planning to close the accuracy gap.
  • Recovery mechanisms in synthesis models require redesign to handle mid-dialogue rejections effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid LLM-retriever architectures could be directly evaluated on TRACE to test whether they reduce the observed competency split.
  • The same three-axis structure of accuracy, grounding, and recovery may apply to other evidence-dependent recommendation domains such as product advice or medical options.
  • TRACE-style rejection turns could be extended to test long-horizon adaptation beyond single rejections.

Load-bearing premise

The 10,000 constructed dialogues, review-span citations, and 25 metrics accurately capture what trustworthy, verifiable, and adaptive tourism recommendation requires in real high-stakes settings.

What would settle it

A live study in which real travelers complete multi-turn dialogues with the baseline systems and rate recommendation usefulness and trustworthiness; if those ratings showed no correlation with the TRACE metric rankings, the benchmark's load-bearing premise would fail.
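
A minimal sketch of how that comparison could be scored, assuming one aggregate TRACE score and one mean human rating per baseline. All values and variable names below are illustrative, not taken from the paper.

```python
from scipy.stats import spearmanr

# Illustrative values only: one entry per baseline, in the same order.
trace_scores  = [0.72, 0.55, 0.61, 0.48, 0.66]  # e.g., a composite TRACE score per baseline
human_ratings = [4.1, 3.2, 3.8, 2.9, 3.9]       # e.g., mean traveler usefulness ratings

rho, p_value = spearmanr(trace_scores, human_ratings)
print(f"Spearman rho={rho:+.2f}, p={p_value:.3g}")
# A rho near zero (with adequate sample size) would undercut the premise that
# TRACE's metric rankings track what real travelers find useful.
```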

Figures

Figures reproduced from arXiv: 2605.07677 by Sijin Wang, Wenjie Zhang, Won-Yong Shin, Xike Xie, Xin Cao, Yuanyuan Xu, Yufan Sheng, Yu Hou, Zixu Zhao.

Figure 1
Figure 1. Figure 1: Left: four gaps in current tourism benchmarks (Single-Metric Ranking, No Evidence Audit, Tourism-Blind, No Conversational Repair). Right: TRACE merges the three missing audits into a multi-turn dialogue with review-span citations, scoring Accuracy, Grounding, Recovery. ReDial [Li et al., 2018], TG-ReDial [Zhou et al., 2020], INSPIRED [Hayati et al., 2020], and DuRecDial 2.0 [Liu et al., 2021], largely rank… view at source ↗
Figure 2
Figure 2. Figure 2: Three-Competency Gap (open-set). Non-LLM retrievers (grounding-led, CGS>0.65) and LLM-based systems (accuracy-led) occupy disjoint zones; no baseline reaches both. Finding 2a: Non-LLM retrievers own surface-verbatim grounding. All 9 non-LLM baselines exceed CGS=0.68, led by Dense 0.864, Popularity 0.802, and TF-IDF 0.776. RAG-Citation is the strongest LLM at 0.658, still below Persona-Grounded at 0.706; L… view at source ↗
Figure 3
Figure 3. Figure 3: Five-stage dialogue generation pipeline. Stages feed sequentially, with a quality-validation… view at source ↗
Figure 4
Figure 4. Figure 4: Task 1: baseline response rating page. The annotator sees the full dialogue context (top), the response to evaluate (orange box), the cited reviews (collapsible), the per-quote Citation Check (Supported / Partial / Not supported), the 8-way error checklist, and two 5-point Likert scales (Informativeness, Naturalness). The system label (System A/B/C) is randomized per item; annotators do not know which base… view at source ↗
Figure 5
Figure 5. Figure 5: Task 2: expert recommendation page with all candidate POIs expanded to show their reviews. The annotator sees the conversation so far (with the user’s current request highlighted in blue), 8 candidate POIs (3 shown here for the illustrative example) each with reviews, and a free-text response box. Annotators select review spans and click the orange ‘+ Add quote’ button, which inserts a verbatim citation in… view at source ↗
Figure 6
Figure 6. Figure 6: Error classification across baselines. LLM baselines predominantly produce grounding… view at source ↗
read the original abstract

Tourism is a high-stakes setting for conversational recommender systems (CRS): a plausible-sounding suggestion can waste real money and trip time once a traveler acts on it. Existing CRS benchmarks primarily evaluate systems with a single Recall@k score over entity mentions, and tourism-specific resources add spatial or knowledge-graph context, yet none of them couple multi-turn recommendation with verbatim review-span evidence and rejection recovery. This leaves an evaluation gap for tourism recommendation that is simultaneously trustworthy, verifiable, and adaptive: recommend the right point of interest (POI) for multi-aspect preferences (such as cuisine, price, atmosphere, walking distance), justify each suggestion with verifiable evidence from prior visitors so the traveler can act without trial and error, and recover when the first recommendation is rejected mid-dialogue. We introduce TRACE, where each item is a multi-turn tourism recommendation dialogue with review-span citations and explicit rejection turns: 10,000 dialogues over 2,400 Yelp POIs and 34,208 reviews across eight U.S. cities, paired with 14 retrieval, planning, and LLM baselines, along with 25 metrics organized under Accuracy, Grounding, and Recovery. Across these baselines, TRACE reveals the Three-Competency Gap: LLM Zero-Shot leads in closed-set Recall@1 and rejection recovery but cites less densely than retrievers; non-LLM retrievers achieve surface-verbatim grounding but with low accuracy; Multi-Review Synthesis fails at recovery. The Grounding Score agrees with human citation precision (Spearman rho=+0.80, p<10^-20), and paired t-tests reproduce the per-baseline ranking (p<0.01 on the dominant contrasts). TRACE reframes accountable tourism recommendation as a joint target (right POI, verifiable evidence, adaptive repair) rather than a single-axis leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TRACE, a benchmark of 10,000 multi-turn tourism recommendation dialogues over 2,400 Yelp POIs and 34,208 reviews, each with explicit rejection turns and review-span citations. It evaluates 14 retrieval, planning, and LLM baselines using 25 metrics under Accuracy, Grounding, and Recovery, and reports a Three-Competency Gap: LLM Zero-Shot leads in closed-set Recall@1 and rejection recovery but cites less densely; non-LLM retrievers achieve surface-verbatim grounding but low accuracy; Multi-Review Synthesis fails at recovery. Human correlation on Grounding Score is Spearman rho=+0.80 and paired t-tests confirm per-baseline rankings (p<0.01 on dominant contrasts).

Significance. If the benchmark faithfully represents high-stakes tourism requirements, the Three-Competency Gap finding usefully reframes evaluation away from single-axis Recall@k toward joint targets of correct POI, verifiable evidence, and adaptive repair. The human-validated Grounding metric and statistical tests are strengths that support the empirical claims.

major comments (3)
  1. [Methods/Evaluation] Dataset construction (Methods/Evaluation sections): The 10,000 dialogues, review-span citation rules, preference aspect coverage, rejection triggers, and data exclusion criteria are described only at a high level in the abstract and lack sufficient detail on the generation process, filtering heuristics, or prompt templates. This is load-bearing for the central claim, as the observed gap could be an artifact of synthetic construction rather than a stable property of the competencies.
  2. [Introduction/Evaluation] External validity (Introduction/Evaluation): No grounding of the dialogue distribution, aspect coverage, or rejection frequency against real tourist logs, surveys, or deployment A/B tests is provided. Without this, the claim that TRACE captures requirements for trustworthy, verifiable, and adaptive recommendation rests on an unverified assumption.
  3. [Experiments] Baseline reproducibility (Experiments): Exact implementations of the 14 baselines (e.g., prompting for LLM Zero-Shot and Multi-Review Synthesis, citation density heuristics for retrievers) and the full 25-metric definitions are needed to verify the reported rankings and t-test results.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'paired t-tests reproduce the per-baseline ranking (p<0.01 on the dominant contrasts)' would benefit from specifying the exact contrasts and providing a supplementary table of p-values; a minimal sketch of one such contrast follows this list.
  2. [Evaluation] Notation: Ensure consistent use of 'closed-set Recall@1' vs. open-set variants across text and tables.
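
As a sketch of what one such spelled-out contrast could look like, assume per-dialogue scores for two systems on the same metric; the system names and numbers here are illustrative, not the paper's.

```python
from scipy.stats import ttest_rel

# Illustrative per-dialogue scores on a single metric, aligned so that
# index i refers to the same dialogue for both systems.
llm_zero_shot   = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51]
dense_retriever = [0.41, 0.38, 0.52, 0.33, 0.47, 0.40, 0.55, 0.36]

t_stat, p_value = ttest_rel(llm_zero_shot, dense_retriever)
print(f"paired t={t_stat:.2f}, p={p_value:.4f}")
# Reporting each headline contrast this way, with its p-value (and ideally
# dialogue-clustered bootstrap CIs, since turns within a dialogue are
# correlated), is what "p<0.01 on the dominant contrasts" would unpack into.
```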

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below with point-by-point responses. Where the manuscript required expansion or clarification, we have revised it accordingly.

read point-by-point responses
  1. Referee: [Methods/Evaluation] Dataset construction (Methods/Evaluation sections): The 10,000 dialogues, review-span citation rules, preference aspect coverage, rejection triggers, and data exclusion criteria are described at high level in the abstract but lack sufficient detail on generation process, filtering heuristics, or prompt templates. This is load-bearing for the central claim, as the observed gap could be an artifact of synthetic construction rather than a stable property of the competencies.

    Authors: We agree that the original Methods section provided insufficient detail on the synthetic construction pipeline, which is critical for interpreting the Three-Competency Gap. In the revised manuscript, we have substantially expanded this section to describe the full end-to-end process: Yelp data filtering (minimum reviews per POI, city selection, POI deduplication), aspect taxonomy and sampling for multi-aspect preferences, the hybrid dialogue generation procedure (rule-based turn templates combined with LLM prompting), exact rejection insertion rules (aspect mismatch, simulated user feedback), citation span selection criteria, and exclusion heuristics (e.g., dialogues shorter than three turns or with citation validation failures). The revised version also includes the complete prompt templates in a new appendix and reports additional statistics on aspect coverage and rejection frequency. These additions allow readers to evaluate whether the observed gaps are robust properties of the competencies. revision: yes

  2. Referee: [Introduction/Evaluation] External validity (Introduction/Evaluation): No grounding of the dialogue distribution, aspect coverage, or rejection frequency against real tourist logs, surveys, or deployment A/B tests is provided. Without this, the claim that TRACE captures requirements for trustworthy, verifiable, and adaptive recommendation rests on an unverified assumption.

    Authors: We acknowledge that direct external validation against real tourist interaction logs would further strengthen the claims. No public datasets currently exist that combine multi-turn dialogues, explicit rejections, and review-span citations at this scale, so such grounding would require a separate data collection effort outside the scope of this benchmark paper. TRACE is deliberately constructed as a controlled synthetic benchmark to isolate the three competencies, consistent with standard practice in IR and NLP evaluation. In the revised manuscript, we have added a dedicated Limitations and Future Work section that explicitly states this assumption, provides indirect support by aligning TRACE's aspect distributions and rejection rates with statistics from the underlying Yelp corpus, and outlines plans for future real-world A/B testing. This clarifies the scope without overstating generalizability. revision: partial

  3. Referee: [Experiments] Baseline reproducibility (Experiments): Exact implementations of the 14 baselines (e.g., prompting for LLM Zero-Shot and Multi-Review Synthesis, citation density heuristics for retrievers) and the full 25-metric definitions are needed to verify the reported rankings and t-test results.

    Authors: We agree that complete reproducibility details are necessary to verify the rankings and statistical results. The revised manuscript now contains an expanded Experiments section with precise specifications for all 14 baselines, including the full prompt templates and decoding parameters for LLM Zero-Shot, the synthesis procedure and citation aggregation rules for Multi-Review Synthesis, and the exact span selection and density heuristics for the retriever baselines. We have also added a comprehensive appendix with formal definitions, computation procedures, and pseudocode for all 25 metrics, plus the exact setup for the Spearman correlation and paired t-tests. We will release the full codebase, evaluation scripts, and dataset upon acceptance to enable independent reproduction of every reported number. revision: yes
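
As one illustration of the flavor of such a metric, here is a minimal sketch of a surface-verbatim citation-precision check. This is an assumed simplification for illustration, not the paper's actual Grounding Score or CGS definition.

```python
from typing import Dict, List, Tuple

def citation_precision(cited_spans: List[Tuple[str, str]],
                       source_reviews: Dict[str, str]) -> float:
    """Fraction of cited spans that appear verbatim in their claimed source review.

    cited_spans: (review_id, span_text) pairs emitted by a system response.
    source_reviews: review_id -> full review text.
    Assumed simplification: the paper's metrics may normalize text, score
    entailment, or weight by citation density rather than exact substring match.
    """
    if not cited_spans:
        return 0.0
    supported = sum(
        1 for review_id, span in cited_spans
        if span in source_reviews.get(review_id, "")
    )
    return supported / len(cited_spans)

# Toy usage with made-up data.
reviews = {"r1": "Great ramen, cozy atmosphere, short walk from the station."}
spans = [("r1", "cozy atmosphere"), ("r1", "cheap cocktails")]
print(citation_precision(spans, reviews))  # 0.5
```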

Circularity Check

0 steps flagged

No circularity: empirical benchmark with observed comparisons only

full rationale

The paper introduces TRACE as a new dataset of 10,000 multi-turn dialogues with review-span citations over Yelp POIs and evaluates 14 baselines (LLM zero-shot, retrievers, multi-review synthesis) using 25 metrics grouped under Accuracy, Grounding, and Recovery. The central claim of a 'Three-Competency Gap' is presented as an observed pattern from these direct comparisons and human correlation checks (Spearman rho +0.80), with no equations, parameter fitting, derivations, or self-citations that reduce any result to its inputs by construction. Dataset construction and metric definitions are explicit inputs, not outputs renamed as predictions. The work is self-contained empirical evaluation against external baselines and human judgments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and dataset paper with no mathematical derivations, free parameters, or postulated entities; the central contribution is the new resource and observed performance patterns.

pith-pipeline@v0.9.0 · 5658 in / 1307 out tokens · 47010 ms · 2026-05-11T02:23:53.168815+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    Spatially-enhanced retrieval-augmented generation for walkability and urban discovery. arXiv preprint arXiv:2512.04790,

    Maddalena Amendola, Chiara Pugliese, Raffaele Perego, and Chiara Renso. Spatially-enhanced retrieval-augmented generation for walkability and urban discovery. arXiv preprint arXiv:2512.04790,

  2. [2]

    Reasoning-guided collaborative filtering with language models for explainable recommendation. arXiv preprint arXiv:2602.05544,

    Fahad Anwaar, Adil Mehmood Khan, Muhammad Khalid, Usman Zia, and Kezhi Wang. Reasoning-guided collaborative filtering with language models for explainable recommendation. arXiv preprint arXiv:2602.05544,

  3. [3]

    Collab-rec: An llm-based agentic framework for balancing recommendations in tourism. arXiv preprint arXiv:2508.15030, 2025a

    Ashmi Banerjee, Adithi Satish, Fitri Nur Aisyah, Wolfgang Wörndl, and Yashar Deldjoo. Collab-rec: An llm-based agentic framework for balancing recommendations in tourism. arXiv preprint arXiv:2508.15030, 2025a. Ashmi Banerjee, Adithi Satish, Fitri Nur Aisyah, Wolfgang Wörndl, and Yashar Deldjoo. Synthtrips: A knowledge-grounded framework for benchmark que...

  4. [4]

    Pers: A personalized and explainable poi recommender system. arXiv preprint arXiv:1712.07727,

    Ramesh Baral and Tao Li. Pers: A personalized and explainable poi recommender system. arXiv preprint arXiv:1712.07727,

  5. [5]

    Bridging conversational and collaborative signals for conversational recommendation

    Ahmad Bin Rabiah, Nafis Sadeq, and Julian McAuley. Bridging conversational and collaborative signals for conversational recommendation. In Companion Proceedings of the ACM on Web Conference 2025, pages 878–882,

  6. [6]

    Reasoningrec: Bridging personalized recommendations and human-interpretable explanations through llm reasoning

    Millennium Bismay, Xiangjue Dong, and James Caverlee. Reasoningrec: Bridging personalized recommendations and human-interpretable explanations through llm reasoning. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8132–8148,

  7. [7]

    Retail: Towards real-world travel planning for large language models

    Bin Deng, Yizhe Feng, Zeming Liu, Qing Wei, Xiangrong Zhu, Shuai Chen, Yuanfang Guo, and Yunhong Wang. Retail: Towards real-world travel planning for large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14881–14913,

  8. [8]

    Trace the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation

    Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. Trace the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8472–8494,

  9. [9]

    Enabling large language models to generate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488,

  10. [10]

    VOGUE: A multimodal dataset for conversational recommendation in fashion, 2025

    David Guo, Minqi Sun, Yilun Jiang, Jiazhou Liang, and Scott Sanner. VOGUE: A multimodal dataset for conversational recommendation in fashion. arXiv preprint arXiv:2510.21151,

  11. [11]

    Inspired: Toward sociable recommendation dialog systems

    Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu. Inspired: Toward sociable recommendation dialog systems. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 8142–8152,

  12. [12]

    Citation: A key to building responsible and accountable large language models

    Jie Huang and Kevin Chang. Citation: A key to building responsible and accountable large language models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 464–473,

  13. [13]

    The facts grounding leaderboard: Benchmarking llms’ ability to ground responses to long-form input. arXiv preprint arXiv:2501.03200, 2025

    Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, et al. The facts grounding leaderboard: Benchmarking llms’ ability to ground responses to long-form input. arXiv preprint arXiv:2501.03200,

  14. [14]

    Fin-rate: A real-world financial analytics and tracking evaluation benchmark for llms on sec filings. arXiv preprint arXiv:2602.07294,

    Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, and Rex Ying. Fin-rate: A real-world financial analytics and tracking evaluation benchmark for llms on sec filings. arXiv preprint arXiv:2602.07294,

  15. [15]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714,

  16. [16]

    Pearl: A review-driven persona-knowledge grounded conversational recommendation dataset. Findings of the Association for Computational Linguistics: ACL 2024, pages 1105–1120,

    Minjin Kim, Minju Kim, Hana Kim, Beong-woo Kwak, SeongKu Kang, Youngjae Yu, Jinyoung Yeo, and Dongha Lee. Pearl: A review-driven persona-knowledge grounded conversational recommendation dataset. Findings of the Association for Computational Linguistics: ACL 2024, pages 1105–1120,

  17. [17]

    Durecdial 2.0: A bilingual parallel corpus for conversational recommendation

    Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, and Wanxiang Che. Durecdial 2.0: A bilingual parallel corpus for conversational recommendation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4335–4347,

  18. [18]

    Recommendation-as-experience: A framework for context-sensitive adaptation in conversational recommender systems

    Raj Mahmud, Shlomo Berkovsky, Mukesh Prasad, and A Baki Kocaballi. Recommendation-as-experience: A framework for context-sensitive adaptation in conversational recommender systems. arXiv preprint arXiv:2601.07401,

  19. [19]

    Recommendation systems for tourism based on social networks: A survey. arXiv preprint arXiv:1903.12099,

    Alan Menk, Laura Sebastia, and Rebeca Ferreira. Recommendation systems for tourism based on social networks: A survey. arXiv preprint arXiv:1903.12099,

  20. [20]

    Tp-rag: benchmarking retrieval-augmented large language model agents for spatiotemporal-aware travel planning

    Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, and Hao Liu. Tp-rag: benchmarking retrieval-augmented large language model agents for spatiotemporal-aware travel planning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12403–12429,

  21. [21]

    Asqa: Factoid questions meet long-form answers

    Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. Asqa: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288,

  22. [22]

    Retrieval-augmented recommendation explanation generation with hierarchical aggregation. arXiv preprint arXiv:2507.09188,

    Bangcheng Sun, Yazhe Chen, Jilin Yang, Xiaodong Li, and Hui Li. Retrieval-augmented recommendation explanation generation with hierarchical aggregation. arXiv preprint arXiv:2507.09188,

  23. [23]

    Itinera: Integrating spatial optimization with large language models for open-domain urban itinerary planning

    Yihong Tang, Zhaokai Wang, Ao Qu, Yihao Yan, Zhaofeng Wu, Dingyi Zhuang, Jushi Kai, Kebing Hou, Xiaotong Guo, Jinhua Zhao, et al. Itinera: Integrating spatial optimization with large language models for open-domain urban itinerary planning. In Proceedings of the 2024 conference on empirical methods in natural language processing: Industry track, pages 1...

  24. [24]

    Towards unified conversational recommender systems via knowledge-enhanced prompt learning

    Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. Towards unified conversational recommender systems via knowledge-enhanced prompt learning. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 1929–1937,

  25. [25]

    Toward faithful retrieval-augmented generation with sparse autoencoders. arXiv preprint arXiv:2512.08892, 2025

    Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, and Aidong Zhang. Toward faithful retrieval-augmented generation with sparse autoencoders. arXiv preprint arXiv:2512.08892,

  26. [26]

    Improving conversational recommendation systems’ quality with context-aware item meta-information

    Bowen Yang, Cong Han, Yu Li, Lei Zuo, and Zhou Yu. Improving conversational recommendation systems’ quality with context-aware item meta-information. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 38–48,

  27. [27]

    Item-language model for conversational recommendation. arXiv preprint arXiv:2406.02844,

    Li Yang, Anushya Subbiah, Hardik Patel, Judith Yue Li, Yanwei Song, Reza Mirghaderi, Vikram Aggarwal, and Qifan Wang. Item-language model for conversational recommendation. arXiv preprint arXiv:2406.02844,

  28. [28]

    Spatial-rag: Spatial retrieval augmented generation for real-world spatial reasoning questions,

    Dazhou Yu, Riyang Bao, Ruiyu Ning, Jinghong Peng, Gengchen Mai, and Liang Zhao. Spatial-rag: Spatial retrieval augmented generation for real-world geospatial reasoning questions. arXiv preprint arXiv:2502.18470,

  29. [29]

    Cite before you speak: Enhancing context-response grounding in e-commerce conversational llm-agents,

    Jingying Zeng, Hui Liu, Zhenwei Dai, Xianfeng Tang, Chen Luo, Samarth Varshney, Zhen Li, and Qi He. Cite before you speak: Enhancing context-response grounding in e-commerce conversational llm-agents. arXiv preprint arXiv:2503.04830,

  30. [30]

    Halluguard: Demystifying data-driven and reasoning-driven hallucinations in llms,

    Xinyue Zeng, Junhong Lin, Yujun Yan, Feng Guo, Liang Shi, Jun Wu, and Dawei Zhou. Halluguard: Demystifying data-driven and reasoning-driven hallucinations in llms. arXiv preprint arXiv:2601.18753,


    and the inferential paired cluster- bootstrap 95% CIs (resampled by dialogue, nboot = 10,000, seed=42) for headline pairwise comparisons (Table 21). Per-turnσvalues are descriptive only; turns within a dialogue are correlated, so dialogue-clustered bootstrap CIs are the appropriate inferential unit. Sub- 0.01 gaps without matching cluster-bootstrap CIs sh...