The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A

Satyajit Movidi; Stephen Russell

arxiv: 2512.04343 · v1 · pith:TAOSUPOTnew · submitted 2025-12-04 · 💻 cs.IR

Pith reviewed 2026-05-21 17:33 UTC · model grok-4.3

classification 💻 cs.IR

keywords personalization · agentic AI · LLM evaluation · semantic similarity · reasoning quality · RAGAS grounding · metric limitations · student advising

The pith

Personalization in agentic AI Q&A improves reasoning quality and grounding but lowers semantic similarity scores because current metrics penalize useful deviations from generic reference texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests an agentic retrieval-augmented LLM called AIVisor on twelve real student advising questions designed to stress lexical precision. It runs ten configurations, some personalized and some not, scores the outputs with lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics, and analyzes those scores with a Linear Mixed-Effects Model. Personalization consistently raises reasoning quality and grounding scores, yet it produces a clear drop in semantic similarity scores. The authors argue that this drop occurs because the metrics are built around generic reference answers and therefore punish meaningful, user-specific content; in their reading it does not indicate worse answers. The finding matters because it shows that standard evaluation practices can hide real gains from personalization, and it calls for multidimensional assessment instead of reliance on single-metric benchmarks.

Core claim

Using AIVisor, an agentic retrieval-augmented LLM for student advising, the study compared ten personalized and non-personalized system configurations on twelve authentic advising questions. Results from a Linear Mixed-Effects Model showed that personalization reliably improved reasoning quality and grounding while creating a significant negative interaction on semantic similarity metrics. The authors conclude that this interaction stems not from poorer answers but from the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts, exposing a structural flaw in prevailing LLM evaluation methods that are ill-suited for user-specific responses.
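The mechanism behind the claimed penalty can be illustrated with a toy reference-overlap score. The sketch below is not METEOR or BERTScore and is not taken from the paper; it uses a plain unigram-overlap F1 as a stand-in, and the function name `unigram_f1` and all three texts are invented for illustration. It only shows how an answer that adds user-specific detail can score lower against a generic reference while still containing the reference's facts.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate answer and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical texts, invented for illustration only.
reference = "students must complete 120 credits to graduate"
generic = "students must complete 120 credits to graduate"
personalized = (
    "you have completed 90 credits so you must complete 30 more credits "
    "to reach the 120 credits needed to graduate"
)

generic_score = unigram_f1(generic, reference)
personalized_score = unigram_f1(personalized, reference)
print(f"generic: {generic_score:.3f}  personalized: {personalized_score:.3f}")
```

In this toy case the personalized answer recovers most of the reference tokens, but its extra user-specific tokens lower precision and therefore the overall score. Whether the same effect explains the paper's METEOR and BERTScore results is exactly the interpretation the referee report questions.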

What carries the argument

The comparison of ten personalized versus non-personalized configurations of the AIVisor agentic RAG system, analyzed via Linear Mixed-Effects Model across lexical, semantic, and RAGAS grounding metrics on twelve stress-test advising questions.

If this is right

Personalization produces metric-dependent shifts rather than uniform improvements across evaluation dimensions.
The fully integrated personalized configuration yields the highest overall gains when measured with appropriate multidimensional metrics.
Prevailing LLM evaluation methods contain a structural flaw for assessing user-specific responses.
Multidimensional evaluation is required to capture the benefits of personalization in agentic AI systems.
The study supplies a methodological foundation for more transparent and robust personalization experiments.
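As a rough illustration of why a multidimensional view can change the ranking, the sketch below standardizes each metric across systems and averages the z-scores with equal weights. The equal-weight composite, the system names, and every number are assumptions made for this example; the paper's own composite normalization is not specified in the material above.

```python
from statistics import mean, pstdev

# Hypothetical per-system mean scores, invented for illustration only.
scores = {
    "baseline":     {"bertscore": 0.90, "faithfulness": 0.60, "relevance": 0.65},
    "partial":      {"bertscore": 0.87, "faithfulness": 0.75, "relevance": 0.75},
    "personalized": {"bertscore": 0.84, "faithfulness": 0.90, "relevance": 0.85},
}


def zscores(values):
    """Standardize a list of values to mean 0 and population std 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def composite(table):
    """Equal-weight average of per-metric z-scores for each system."""
    systems = list(table)
    metrics = list(next(iter(table.values())))
    totals = {s: 0.0 for s in systems}
    for m in metrics:
        column = [table[s][m] for s in systems]
        for s, z in zip(systems, zscores(column)):
            totals[s] += z / len(metrics)
    return totals


ranking = composite(scores)
best_single = max(scores, key=lambda s: scores[s]["bertscore"])
best_composite = max(ranking, key=ranking.get)
print(best_single, best_composite)
```

With these invented numbers, the system that leads on the single semantic metric is not the one that leads on the composite, which is the kind of metric-dependent shift the bullets above describe.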

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New semantic metrics could be designed that reward rather than penalize contextually appropriate personalization while still measuring faithfulness.
The same trade-off pattern may appear in other high-stakes domains such as medical or legal advising where user context changes what counts as a good answer.
Human preference studies could test whether users actually prefer the personalized outputs even when automatic semantic scores are lower.
Replicating the experiment with tighter controls on prompt length, retrieval depth, and model temperature would isolate whether the metric effect persists.
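The last suggestion, a replication with matched controls, can be made concrete with a small check that a personalized and a non-personalized run differ only in the treatment flag. This is a hypothetical sketch: `RunConfig`, its field names, and the values are invented here and do not describe how AIVisor is actually configured.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings; field names are illustrative assumptions."""
    personalized: bool
    max_prompt_tokens: int
    retrieval_top_k: int
    temperature: float


def uncontrolled_fields(a: RunConfig, b: RunConfig,
                        treatment: str = "personalized") -> list[str]:
    """Return the names of settings, other than the treatment, that differ."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if name != treatment and da[name] != db[name]]


baseline = RunConfig(personalized=False, max_prompt_tokens=2048,
                     retrieval_top_k=5, temperature=0.0)
matched = RunConfig(personalized=True, max_prompt_tokens=2048,
                    retrieval_top_k=5, temperature=0.0)
confounded = RunConfig(personalized=True, max_prompt_tokens=2048,
                       retrieval_top_k=10, temperature=0.0)

print(uncontrolled_fields(baseline, matched))     # no differing controls
print(uncontrolled_fields(baseline, confounded))  # retrieval depth differs
```

A comparison is only attributable to personalization when the returned list is empty; any non-empty result names a candidate confound.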

Load-bearing premise

The observed drop in semantic similarity scores stems from limitations in the metrics themselves rather than any actual reduction in answer quality or unmeasured differences in how the personalized and non-personalized systems were built and run.

What would settle it

A direct comparison in which human raters judge the factual accuracy, helpfulness, and reasoning depth of the personalized answers as equal to or lower than the non-personalized answers on the same questions, or a re-run in which every implementation detail is strictly matched and the semantic-score gap disappears.
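One simple way to read such paired human ratings is an exact sign test on per-question differences. The sketch below is an assumption about how the comparison could be analyzed, not a procedure from the paper; the ratings are invented, and the helper name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(diffs):
    """Exact two-sided sign test on paired differences (ties are dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of k or fewer minority signs under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical mean human ratings per question (1-5 scale), twelve questions.
personalized = [4.5, 4.0, 4.2, 3.8, 4.6, 4.1, 4.4, 3.9, 4.3, 4.0, 4.7, 4.2]
generic = [4.0, 4.1, 3.9, 3.5, 4.0, 3.8, 4.0, 3.9, 3.7, 3.6, 4.1, 3.8]

diffs = [p - g for p, g in zip(personalized, generic)]
p_value = sign_test_p(diffs)
print(f"p = {p_value:.4f}")
```

With only twelve questions the test has limited power, so a non-significant result would not by itself show that the personalized answers are no better. A real study would also need to report rater agreement.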

Figures

Figures reproduced from arXiv: 2512.04343 by Satyajit Movidi, Stephen Russell.

**Figure 2.** System-level means illustrating correlation strength across all eight metrics.

**Figure 3.** Standardized (z-score) system performance by metric. Positive values indicate above-average performance for that …

**Figure 4.** Distribution of BLEU, ROUGE-L, METEOR, and BERTScore across systems. Box plots illustrate within-metric vari…

**Figure 5.** Overall system (composite) performance. … showing a statistically significant negative shift for most personalized configurations (such as D-G and E-I). This plot confirms the semantic penalty that the LMM analysis more precisely isolates and explains. Furthermore, Figure 9 supports Hypothesis H1 by illustrating a clear personalization-by-question effect, where the performance of different system configurat…

**Figure 6.** Pareto-front trade-off between lexical quality (METEOR) and grounding fidelity (Faithfulness) at the system level.

**Figure 7.** Multi-metric radar profile for the top-5 systems by composite normalized score.

**Figure 8.** Forest plot of personalization effect sizes ( …

**Figure 9.** Interaction plots by metric. Each panel shows system-level traces (A-K) across non-personalized and personal…

Original abstract

AIVisor, an agentic retrieval-augmented LLM for student advising, was used to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten personalized and non-personalized system configurations and analyzed outcomes with a Linear Mixed-Effects Model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity, driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. This reveals a structural flaw in prevailing LLM evaluation methods, which are ill-suited for assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements and provides a methodological foundation for more transparent and robust personalization in agentic AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Personalization boosts reasoning and grounding in this agentic advising system but lowers semantic similarity scores, which the authors attribute to metric limits rather than weaker answers.

The main observation is that personalization in the AIVisor agentic RAG setup for student advising raised reasoning quality and RAGAS grounding scores while cutting METEOR and BERTScore. The authors link the semantic drop to the metrics penalizing valid deviations from generic reference texts instead of any real loss in answer quality. The fully personalized configuration came out ahead when all dimensions were considered together.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical study of AIVisor, an agentic retrieval-augmented LLM for student advising. It compares ten personalized and non-personalized system configurations on twelve authentic advising questions using a Linear Mixed-Effects Model to evaluate outcomes across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. The central claim is that personalization consistently improves reasoning quality and grounding scores but produces a significant negative interaction on semantic similarity, which the authors attribute to current metrics penalizing valid personalized deviations from generic reference texts rather than any reduction in answer quality; this is interpreted as revealing a structural flaw in prevailing LLM evaluation methods for user-specific responses. The fully integrated personalized configuration is reported to yield the highest overall gains when multidimensional metrics are considered.

Significance. If the interpretation of the semantic similarity drop holds after ruling out confounds, the work would usefully highlight limitations of standard semantic metrics when applied to personalized agentic systems and demonstrate the value of mixed-effects modeling for detecting metric-dependent trade-offs. The controlled comparison across multiple configurations and use of authentic questions provide a practical foundation for rethinking evaluation in IR and agentic AI. The paper credits the multidimensional approach and LMM analysis as strengths for transparency.

major comments (2)

[Abstract] Abstract: The claim that the negative interaction on METEOR/BERTScore is 'driven not by poorer answers but by the limits of current metrics' is load-bearing for the personalization paradox conclusion. This interpretation assumes that RAGAS grounding fully isolates reasoning quality and that no unmeasured differences exist between personalized and non-personalized configurations (e.g., in retrieval, prompting, or user-data integration). The manuscript should provide explicit controls or additional metrics to exclude actual quality reductions or implementation confounds before attributing the drop to metric flaws.
[Results] Results section (LMM analysis): The Linear Mixed-Effects Model supports detection of interactions, but without reported details on how reasoning quality was isolated from semantic scores or on the precise differences among the ten configurations, it remains possible that the observed trade-off reflects systematic implementation variations rather than personalization itself. Adding a table or subsection comparing factual precision or human-rated quality across conditions would strengthen the causal attribution.

minor comments (2)

[Abstract] Abstract: The sentence describing the trade-off is long and combines multiple results; splitting it would improve readability while preserving the multidimensional framing.
[Discussion] The manuscript would benefit from an explicit limitations subsection addressing potential metric dependencies beyond the reported ones.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we indicate revisions to be made in the next version of the manuscript to address concerns about potential confounds and implementation details.

Referee: [Abstract] The claim that the negative interaction on METEOR/BERTScore is 'driven not by poorer answers but by the limits of current metrics' is load-bearing for the personalization paradox conclusion. This interpretation assumes that RAGAS grounding fully isolates reasoning quality and that no unmeasured differences exist between personalized and non-personalized configurations (e.g., in retrieval, prompting, or user-data integration). The manuscript should provide explicit controls or additional metrics to exclude actual quality reductions or implementation confounds before attributing the drop to metric flaws.

Authors: We thank the referee for highlighting this important point. Our study systematically varies personalization across ten configurations while holding the core retrieval-augmented generation pipeline constant to the extent possible. The observed improvements in RAGAS grounding and reasoning-related aspects suggest that the semantic metric declines are not indicative of reduced answer quality. To further address potential confounds, we will include in the revised manuscript a new table that details the specific differences in user data integration, prompting strategies, and retrieval parameters for each configuration. We will also expand the discussion to explicitly address the assumptions underlying our interpretation of the metric trade-off. revision: partial
Referee: [Results] The Linear Mixed-Effects Model supports detection of interactions, but without reported details on how reasoning quality was isolated from semantic scores or on the precise differences among the ten configurations, it remains possible that the observed trade-off reflects systematic implementation variations rather than personalization itself. Adding a table or subsection comparing factual precision or human-rated quality across conditions would strengthen the causal attribution.

Authors: We agree that additional details on the configurations and isolation of reasoning quality would enhance clarity. Reasoning quality is primarily captured through the RAGAS metrics (faithfulness, answer relevance, and context relevance), which are modeled separately from the semantic similarity metrics in our Linear Mixed-Effects Model. In the revised version, we will add a subsection in the Results section describing the precise differences among the ten configurations and include a table summarizing these variations. While we did not include human-rated quality assessments in this work, as our focus was on automated multidimensional metrics, we will add a limitations paragraph noting this and the potential value of such evaluations in future studies. revision: yes

standing simulated objections not resolved

Conducting human evaluations or additional factual precision assessments would require a new experimental setup and participant recruitment, which is beyond the scope of the current study focused on automated metrics and LMM analysis.

Circularity Check

0 steps flagged

No circularity: empirical metric comparison with independent statistical modeling

The paper reports results from a Linear Mixed-Effects Model applied to outcomes from ten system configurations evaluated on external metrics (BLEU, ROUGE-L, METEOR, BERTScore, RAGAS). No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-referential definitions. The interpretation that semantic-score drops reflect metric limitations rather than quality loss is an inference drawn from the observed pattern of improved grounding/reasoning scores; it does not constitute a definitional equivalence or a load-bearing self-citation chain. The study is self-contained against the reported experimental data and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the appropriateness of the Linear Mixed-Effects Model for the metric data and on the validity of the chosen metrics as proxies for reasoning quality and grounding. No free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: The Linear Mixed-Effects Model is an appropriate statistical tool for analyzing metric scores across the twelve questions and ten system configurations.
Invoked when the abstract states that outcomes were analyzed with a Linear Mixed-Effects Model.
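
The abstract names the model family but not its specification. One plausible form, offered only as a sketch under stated assumptions (the symbols, the random intercept per question, and the pooled metric-family indicator are illustrative and not taken from the paper), is:

```latex
y_{sqm} = \beta_0 + \beta_1 P_s + \beta_2 S_m + \beta_3 \,(P_s \times S_m) + u_q + \varepsilon_{sqm},
\qquad u_q \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{sqm} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{sqm}$ is the score of system configuration $s$ on question $q$ under metric $m$, $P_s = 1$ when the configuration is personalized, $S_m = 1$ when the metric is a semantic similarity metric, and $u_q$ is a random intercept shared by all scores on the same question. Under this reading, the reported "significant negative interaction on semantic similarity" would correspond to $\beta_3 < 0$. The axiom above then amounts to assuming that the random-effects structure and the normality of $u_q$ and $\varepsilon_{sqm}$ are adequate for twelve questions and ten configurations.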
