CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors
Pith reviewed 2026-05-10 11:42 UTC · model grok-4.3
The pith
CoPA benchmark evaluates personalized question answering by scoring model alignment with six cognitive factors derived from user preference divergences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The benchmark is built by distilling six key personalization factors from Community-Individual Preference Divergence.
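If factor-level alignment is to yield a single benchmark number, per-factor scores on a 0–2 rubric must be collapsed somehow. A minimal sketch under assumptions: the factor names follow the paper, but the equal-weight aggregation scheme and function name are illustrative, not the paper's published formula.

```python
# Illustrative only: the paper does not publish its aggregation formula.
# Factor names follow the paper; the equal-weight scheme is an assumption.
FACTORS = [
    "Cognitive Trust", "Situational Anchoring", "Schema Consistency",
    "Cognitive Load Management", "Metacognitive Scaffolding",
    "Affective and Motivational Resonance",
]

def alignment_score(factor_scores, weights=None):
    """Collapse per-factor rubric scores (0, 1, or 2) into one score in [0, 1]."""
    weights = weights or {f: 1.0 for f in FACTORS}
    total = sum(weights[f] for f in FACTORS)
    # Normalize each 0-2 rubric score to [0, 1] before weighting.
    return sum(weights[f] * factor_scores[f] / 2 for f in FACTORS) / total
```

Non-uniform weights would let the benchmark privilege some factors, which is exactly the weighting question a reader would want stated explicitly.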
What carries the argument
Community-Individual Preference Divergence (CIPD): the process of locating individual choices that override community consensus, used to distill six evaluative dimensions for the CoPA benchmark.
Load-bearing premise
The six factors distilled from Community-Individual Preference Divergence accurately and comprehensively capture user-specific cognitive preferences that can be reliably inferred from interaction patterns.
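The premise can be made concrete. A minimal sketch of CIPD-style mining, assuming a simple majority-vote notion of community consensus; the function name and the 0.6 threshold are assumptions, not the paper's actual procedure.

```python
from collections import Counter

# Illustrative sketch: the paper's actual CIPD mining procedure is more
# involved; the 0.6 consensus threshold here is an assumption.
def is_cipd_case(community_choices, individual_choice, consensus_threshold=0.6):
    """Flag a case where an individual's choice overrides community consensus."""
    counts = Counter(community_choices)
    majority, majority_count = counts.most_common(1)[0]
    has_consensus = majority_count / len(community_choices) >= consensus_threshold
    return has_consensus and individual_choice != majority
```

Cases flagged this way are the raw material from which per-user cognitive preferences would then be inferred; the premise is that such divergences are stable rather than noise.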
What would settle it
A controlled user study: if models with high CoPA scores fail to produce answers that real users rate as more personally fitting than answers from lower-scoring models, the benchmark does not measure useful personalization.
Figures
Original abstract
While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoPA, a benchmark for evaluating personalized question answering in LLMs. It mines Community-Individual Preference Divergence (CIPD) from interaction data to distill six personalization factors, constructs 1,985 user profiles, and quantifies factor-level alignment between model outputs and inferred user-specific cognitive preferences. The central claim is that this yields a more comprehensive and discriminative evaluation standard than generic lexical or heuristic metrics, with accompanying public code.
Significance. If the factor distillation and alignment quantification hold under scrutiny, CoPA offers a data-driven advance over existing personalization metrics by grounding evaluation in observable preference divergences. The public code release at the cited GitHub repository is a clear strength that supports reproducibility and community extension of the benchmark.
minor comments (3)
- [§3.2] Provide a concrete example of how one of the six factors is operationalized from raw interaction logs to a profile attribute, including the exact inference rule or threshold used.
- [Table 2] Report the inter-annotator agreement or validation metric (e.g., Cohen's kappa) for the factor labels assigned to the 1,985 profiles.
- [§5.1] Clarify whether the reported alignment scores are computed per-factor or aggregated; if aggregated, state the weighting scheme explicitly.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of CoPA and the recommendation for minor revision. The report correctly identifies the core contribution as a data-driven benchmark grounded in Community-Individual Preference Divergence (CIPD) and the release of public code. We will incorporate all minor suggestions in the revised version.
Circularity Check
No significant circularity detected
full rationale
The paper's derivation constructs CoPA by empirically mining CIPD from interaction data to distill six factors, then applies them to measure alignment in a benchmark with 1,985 profiles. This process is data-driven and externally supported by shared code rather than reducing any central claim to a self-definition, fitted parameter renamed as prediction, or self-citation chain. No equations or steps in the provided abstract and context exhibit the enumerated circular patterns; the evaluation standard remains independent of its own outputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Individual choices that diverge from community consensus reflect stable, measurable cognitive preferences that can be distilled into six general factors.
invented entities (1)
- Community-Individual Preference Divergence (CIPD): no independent evidence
Reference graph
Works this paper leans on
- [1] Knowledge-augmented large language models for personalized contextual query suggestion. In Proceedings of the ACM Web Conference 2024, pages 3355–3366.
- [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.
- [3] Coral: Collaborative retrieval-augmented large language models improve long-tail recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3391–3401.
From the paper's appendix: factor mining, labeling, and evaluation

Factor mining. Rationales for user choices are first extracted with an educationalist-persona prompt (Figure A1): each reason must be concise (at most five words), grounded in an educational or psychological framework with appropriate academic terminology, and paired with a theory that embodies the educational principle it reflects; there is no limit on the number of reasons. A second prompt (Figure A2) mines factors from the pooled reasons, requiring at least five pairs of contrasting examples (one supporting a choice, one opposing it) and placing no limit on the number of factors. Candidate factors are checked for whether they constitute dimensions that influence an individual's decision-making process, whether they are generalizable across populations, cultures, and contexts, and whether they are subject to cognitive biases. Tables A1, A2, and A3 give three examples of factors associated with reasons.

Causal error taxonomy for audited rationales:
- Hallucination or Contradiction: the inferred motive is irrelevant to, or contradicts, the user's provided conversation history.
- Theoretical Error: the model fabricates a non-existent theory or concept.
- Inappropriate Application of Theory: a valid theory is cited but completely mismatched with the inferred motive (e.g., applying "Economic Utility Theory" to explain a rationale for "emotional counseling").

The six personalization factors, as operationalized in the labeling and evaluation prompts (Figures A3 and A4):
- Cognitive Trust: the user's epistemic requirements regarding the credibility, reliability, and verifiability of the information.
- Situational Anchoring: the extent to which the user requires the response to be contextually aligned, practically applicable, or specific to a given scenario.
- Schema Consistency: the nature of the user's existing prior knowledge and mental models, and how new information should align with them.
- Cognitive Load Management: the user's constraints on information-processing capacity and tolerance for complexity.
- Metacognitive Scaffolding: the user's requirements for structural guidance to facilitate higher-order understanding and self-regulated learning.
- Affective and Motivational Resonance: the user's expectations regarding emotional engagement and motivational alignment within the response.

Factor labels must be tailored to the user's specific circumstances, provide a specific and accurate summary of the user's personal profile with no word limit, and carry an explanation serving as the rationale for each description.

Evaluation. An LLM judge scores each response against the user profile on a 3-point Likert scale: 0 (Mismatch: the response actively violates the user's preference or completely ignores a high-priority requirement defined in the profile), 1 (Partial Match), or 2. Output is strict JSON in which the score field must be a raw integer (e.g., output 1, not "1") and every factor must receive a result. Figures A5, A6, and A7 show the prompts for the Direct, CoT, and Random baselines; the CoT variant makes the judgment explicit in four steps: profile analysis, response analysis, alignment assessment (noting specific matches or mismatches), and score decision.

Table A4: effectiveness of factors (higher is better).

Method                    A&E     L&P     S&C     Avg.
Qwen3 8B    No-Pers.      0.3422  0.4824  0.4824  0.4357
Qwen3 8B    RAG-Pers.     0.3634  0.4700  0.4983  0.4439
Qwen3 8B    w/ Factors    0.3743  0.4892  0.5131  0.4589
GPT-4o-mini No-Pers.      0.3941  0.5162  0.5282  0.4795
GPT-4o-mini RAG-Pers.     0.4182  0.4861  0.5343  0.4795
GPT-4o-mini w/ Factors    0.4378  0.5234  0.5720  0.5111

Profile construction chains further prompts, each returning strict JSON: domain extraction (Figure A13, a single-word domain plus free-length reasoning), per-domain profile extraction (Figure A14, a summary starting with "This user"), profile synthesis from historical and current profiles (Figure A15), global profile generation across domains (Figure A16), and personalized answer generation from the profile (Figure A17). Answer-generation prompts cover Factors, No-Personalization, Time-Personalization, and RAG-Personalization conditions (Figures A8–A11). Figure A12 plots the impact of user history context length (K) on model performance across four domains: Engineering & Tools, Science & Theory, Lifestyle & Society, and Leisure & Fandom.
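The evaluation prompts quoted above require the judge to return strict JSON with a raw integer score per factor. A minimal sketch of the validation and aggregation a harness might apply; the per-factor {"score": int} shape follows the quoted prompts, but the helpers themselves are illustrative, not the paper's code.

```python
import json

# Sketch of a harness-side check; the per-factor {"score": int} shape follows
# the quoted evaluation prompts, but this helper is not from the paper's code.
def parse_judge_output(raw, factors):
    """Validate the judge's JSON: every factor present, score a raw int in {0, 1, 2}."""
    data = json.loads(raw)
    scores = {}
    for factor in factors:
        score = data[factor]["score"]
        if type(score) is not int or score not in (0, 1, 2):
            raise ValueError(f"{factor}: score must be an int in {{0, 1, 2}}, got {score!r}")
        scores[factor] = score
    return scores

def mean_score(scores):
    """Aggregate per-factor rubric scores by simple averaging."""
    return sum(scores.values()) / len(scores)
```

Rejecting string-typed scores like "1" mirrors the prompts' explicit instruction that the score field be a raw integer rather than a quoted one.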
discussion (0)