arxiv: 2604.14773 · v1 · submitted 2026-04-16 · 💻 cs.CL

Recognition: unknown

CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors

Hang Su , Zequn Liu , Chen Hu , Xuesong Lu , Yingce Xia , Zhen Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords personalized QAbenchmarkcognitive factorsCIPDLLM evaluationuser preferencespreference divergence

0 comments

The pith

CoPA benchmark evaluates personalized question answering by scoring model alignment with six cognitive factors derived from user preference divergences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to replace generic lexical similarity metrics with a data-driven standard for judging how well language models tailor answers to individual users. It mines cases of Community-Individual Preference Divergence to extract six factors that reflect distinct cognitive preferences. These factors then define the CoPA benchmark, which contains 1,985 user profiles and measures how closely model outputs match preferences inferred from interaction patterns. A sympathetic reader would care because current evaluation methods often fail to detect whether an answer truly fits a person's unique way of thinking.

Core claim

By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The benchmark is built by distilling six key personalization factors from Community-Individual Preference Divergence.

What carries the argument

Community-Individual Preference Divergence (CIPD), the process of locating individual choices that override community consensus to distill six evaluative dimensions for the CoPA benchmark.

Load-bearing premise

The six factors distilled from Community-Individual Preference Divergence accurately and comprehensively capture user-specific cognitive preferences that can be reliably inferred from interaction patterns.

What would settle it

A controlled test in which models with high CoPA scores fail to produce answers that real users rate as more personally fitting than lower-scoring models would show the benchmark does not measure useful personalization.

Figures

Figures reproduced from arXiv: 2604.14773 by Chen Hu, Hang Su, Xuesong Lu, Yingce Xia, Zequn Liu, Zhen Liu.

**Figure 1.** Figure 1: This Stack Overflow example illustrates Community-Individual Preference Divergence (CIPD): the user accepted an empirical, code-based solution while the community overwhelmingly favored a conceptual, schematic one. 2024), establishing a reliable evaluation protocol for personalization remains a critical challenge (Salemi et al., 2024b; Kumar et al., 2024; Salemi and Zamani, 2025). Existing research on the… view at source ↗

**Figure 2.** Figure 2: Analysis of CIPD questions. (a) shows the [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The pipeline of the Factor Distillation process. We first collect and filter raw CIPD data (a), then employ [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The proposed evaluation pipeline. Statistics Eng. Tools Sci. Theory Life. Society Leisure Fandom Users 864 456 413 252 Records (Avg. Q/User) 30.69 27.57 41.09 40.74 Q. Title Length 8.96 9.92 9.64 9.69 Q. Body Length 128.21 131.23 109.49 135.85 Factors Profile Size 555.12 597.74 580.34 590.10 [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Matrix of Spearman correlations between the [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoPA builds a benchmark around six factors mined from preference divergence and the full paper shows a consistent, data-driven construction with no load-bearing gaps.

read the letter

CoPA introduces a benchmark for personalized question answering that stands out because it grounds the evaluation in six factors mined from actual preference divergences rather than generic or manual approaches. The new part is the CIPD method to distill those factors and the resulting 1,985 profiles for fine-grained assessment. The paper shows how they quantify alignment between model outputs and user-specific preferences inferred from interactions. This setup allows factor-level analysis, which is a step up from standard metrics. They do a good job with the data side: the profiles are built from real patterns, the factors are operationalized clearly, and the experiments back up the claim that it's more discriminative. Releasing the code helps too. The soft spots are limited. The six factors are comprehensive within their mining process, but like any distilled set, they might not cover every possible cognitive aspect. The paper's internal checks look solid, with no obvious fitting issues or unvalidated steps that would break the main argument. This is useful for researchers focused on evaluation in NLP and personalized AI systems. Anyone building or assessing QA models for individual users would find the benchmark and the factor breakdown practical. It deserves a serious referee because the contribution is well-supported and the methodology is transparent enough for review. I would recommend engaging with it in peer review.

Referee Report

0 major / 3 minor

Summary. The paper introduces CoPA, a benchmark for evaluating personalized question answering in LLMs. It mines Community-Individual Preference Divergence (CIPD) from interaction data to distill six personalization factors, constructs 1,985 user profiles, and quantifies factor-level alignment between model outputs and inferred user-specific cognitive preferences. The central claim is that this yields a more comprehensive and discriminative evaluation standard than generic lexical or heuristic metrics, with accompanying public code.

Significance. If the factor distillation and alignment quantification hold under scrutiny, CoPA offers a data-driven advance over existing personalization metrics by grounding evaluation in observable preference divergences. The public code release at the cited GitHub repository is a clear strength that supports reproducibility and community extension of the benchmark.

minor comments (3)

[§3.2] §3.2: Provide a concrete example of how one of the six factors is operationalized from raw interaction logs to a profile attribute, including the exact inference rule or threshold used.
[Table 2] Table 2: Report the inter-annotator agreement or validation metric (e.g., Cohen's kappa) for the factor labels assigned to the 1,985 profiles.
[§5.1] §5.1: Clarify whether the reported alignment scores are computed per-factor or aggregated; if aggregated, state the weighting scheme explicitly.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of CoPA and the recommendation for minor revision. The report correctly identifies the core contribution as a data-driven benchmark grounded in Community-Individual Preference Divergence (CIPD) and the release of public code. We will incorporate all minor suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation constructs CoPA by empirically mining CIPD from interaction data to distill six factors, then applies them to measure alignment in a benchmark with 1,985 profiles. This process is data-driven and externally supported by shared code rather than reducing any central claim to a self-definition, fitted parameter renamed as prediction, or self-citation chain. No equations or steps in the provided abstract and context exhibit the enumerated circular patterns; the evaluation standard remains independent of its own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim rests on the untested premise that CIPD mining yields stable, generalizable cognitive factors and that interaction patterns suffice to infer individual preferences. No free parameters are mentioned; the six factors and CIPD itself function as invented constructs whose validity is asserted rather than demonstrated in the abstract.

axioms (1)

domain assumption Individual choices that diverge from community consensus reflect stable, measurable cognitive preferences that can be distilled into six general factors.
This assumption underpins the entire factor extraction and benchmark construction process described in the abstract.

invented entities (1)

Community-Individual Preference Divergence (CIPD) no independent evidence
purpose: To identify and quantify cases where personal preference overrides group consensus for deriving personalization factors.
New construct introduced to enable the six-factor framework; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5430 in / 1308 out tokens · 28809 ms · 2026-05-10T11:42:57.668536+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 2 canonical work pages · 2 internal anchors

[1]

InPro- ceedings of the ACM Web Conference 2024, pages 3355–3366

Knowledge-augmented large language models for personalized contextual query suggestion. InPro- ceedings of the ACM Web Conference 2024, pages 3355–3366. Raymond B Cattell. 1963. Theory of fluid and crystal- lized intelligence: A critical experiment.Journal of educational psychology, 54(1):1. Aili Chen, Chengyu Du, Jiangjie Chen, Jinghan Xu, Yikai Zhang, S...

2024
[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Can LLM be a personalized judge? InFind- ings of the Association for Computational Linguis- tics: EMNLP 2024, pages 10126–10141, Miami, Florida, USA. Association for Computational Lin- guistics. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: In- centivizing r...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Qwen3 Technical Report

Coral: collaborative retrieval-augmented large language models improve long-tail recommendation. InProceedings of the 30th ACM SIGKDD Confer- ence on Knowledge Discovery and Data Mining, pages 3391–3401. Derong Xu, Xinhang Li, Ziheng Zhang, Zhenxi Lin, Zhihong Zhu, Zhi Zheng, Xian Wu, Xiangyu Zhao, Tong Xu, and Enhong Chen. 2025. Harnessing large language...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Hallucination or Contradiction:The inferred motive is irrelevant to, or contra- dicts, the user’s provided conversation history
[5]

Theoretical Error:The model fabri- cates a non-existent theory or concept
[6]

Economic Utility Theory

Inappropriate Application of Theory: A valid theory is cited, but it is com- pletely mismatched with the inferred mo- tive (e.g., applying “Economic Utility Theory” to explain a rationale for “emo- tional counseling”). B.2 Factors and Their Associated Causal Cases TableA1, A2, and A3 show three examples of fac- tors associated with the reasons. We evaluat...
[7]

reasons": [ {

Whether the factors are subject to cognitive bi- ases. System: You are an experienced educationalist who has re- searched in the field of education for many years. Your specialty is to analyze a user’s psychology and choices from a pedagogical perspective to understand their learning behavior and needs. User: User’s historical question: {Historical_questi...
[8]

Each reason should be concise and not exceed five words
[9]

Each reason should be grounded in an educational or psychological framework, incorporating appropri- ate academic terminology
[10]

There is no limit to the number of reasons
[11]

The theory embodies the educational principles reflected in this reason
[12]

factors": [ {

The ’global_explanation’ should provide an over- all summary of the reasons listed above. Figure A1: Extract rationale Prompt. System: You are a researcher in the field of education with expertise in pedagogy, psychology, and linguistics. User: The pool of reasons for users’ choices. {Reasons} The pool of reasons contains the rationales for why different ...
[13]

You can use academic terminology to describe these factors
[14]

It must contain at least five pairs of contrasting examples, with one supporting a choice (a positive example) and one opposing it (a negative example)

The example section illustrates the conflicting rea- sons you found. It must contain at least five pairs of contrasting examples, with one supporting a choice (a positive example) and one opposing it (a negative example)
[15]

There is no limit to the number of factors
[16]

Figure A2: Factor Mine Prompt

The ’global_explanation’ should provide an over- all summary of the factor listed above. Figure A2: Factor Mine Prompt
[17]

Whether these factors constitute dimensions that influence an individual’s decision-making pro- cess
[18]

C Factor Label Figure A3 shows the prompt used for factor label

Whether the factors are generalizable across different populations, cultures, and contexts. C Factor Label Figure A3 shows the prompt used for factor label. Figure A4 shows the prompt used for evaluation . D Effectiveness of Factors Figures A5, A6, and A7 show the prompts used for the Direct, CoT, and Random baselines, respec- tively. Table A4 reports the...

2025
[21]

Schema Consistency: What is the nature of the user’s existing prior knowledge and mental mod- els (and how should the new information align with them)?
[22]

Cognitive Load Management: What are the user’s constraints regarding information processing ca- pacity and their tolerance for complexity?
[23]

Metacognitive Scaffolding: What are the user’s requirements for structural guidance to facili- tate higher-order understanding and self-regulated learning?
[24]

Cognitive Trust

Affective and Motivational Resonance: What are the user’s expectations regarding emotional en- gagement and motivational alignment within the response? User: User’s Historical Questions: {Historical_questions} User’s Current Question: {Current_question} The answer to the current question: {Answer} The response should be in JSON format: { "Cognitive Trust"...
[25]

Each description must be tailored to the user’s specific circumstances
[26]

Each description should provide a specific and accurate summary of the user’s personal profile, with no word limit
[27]

The explanation serves as the rationale for the description provided above
[28]

w/ Fac- tors

Ensure the output adheres to the specified format. Figure A3: Factor Label Prompt. Method A&E L&P S&C Avg. Qwen3 8B No-Pers. 0.3422 0.4824 0.4824 0.4357 RAG-Pers. 0.3634 0.4700 0.4983 0.4439 w/ Factors0.3743 0.4892 0.5131 0.4589 GPT-4o-mini No-Pers. 0.3941 0.5162 0.5282 0.4795 RAG-Pers. 0.4182 0.4861 0.5343 0.4795 w/ Factors0.4378 0.5234 0.5720 0.5111 Tab...
[29]

Cognitive Trust: What are the user’s epistemic requirements regarding the credibility, reliability, and verifiability of the information?
[30]

Situational Anchoring: To what extent does the user require the response to be contextually aligned, practically applicable, or specific to a given scenario?
[31]

Schema Consistency: What is the nature of the user’s existing prior knowledge and mental models (and how should the new information align with them)?
[32]

Cognitive Load Management: What are the user’s constraints regarding information processing capacity and their tolerance for complexity?
[33]

Metacognitive Scaffolding: What are the user’s requirements for structural guidance to facilitate higher-order understanding and self-regulated learning?
[34]

Cognitive Trust

Affective and Motivational Resonance: What are the user’s expectations regarding emotional engagement and motivational alignment within the response? Scoring Rubric (3-point Likert scale): • 0 (Mismatch): The response actively violates the user’s preference or completely ignores a high-priority requirement defined in the profile. • 1 (Partial Match): The ...
[36]

(e.g., output 1, not "1")

Assign a score of 0, 1, or 2.Important: The ’score’ field must be a raw INTEGER type (int), do not use strings. (e.g., output 1, not "1")
[38]

score": 0,

Output the result in strict JSON format. Figure A4: Evaluation Prompt. System: Role: You are a fair and insightful judge with exceptional reasoning and analytical abilities. Your task is to evaluate whether the response to the user’s question aligns with the user’s profile. Scoring Rubric (3-point Likert scale): • 0 (Mismatch): The response actively viola...
[39]

Analyze the match between the <user_profile> and <response_to_evaluate>
[42]

Figure A5: Direct Evaluation Prompt

Output the result in strict JSON format. Figure A5: Direct Evaluation Prompt. System: Role: You are a fair and insightful judge with exceptional reasoning and analytical abilities. Your task is to evaluate whether the response to the user’s question aligns with the user’s profile. Scoring Rubric (3-point Likert scale): • 0 (Mismatch): The response activel...
[43]

Step 1 – Profile Analysis: Identify the key preferences and requirements revealed by the user’s historical questions
[44]

Step 2 – Response Analysis: Examine what the response actually provides (e.g., depth, style, format, domain focus)
[45]

Note specific matches or mismatches

Step 3 – Alignment Assessment: Compare the user’s profile against the response. Note specific matches or mismatches
[46]

chain_of_thought

Step 4 – Score Decision: Based on the above analysis, assign a score of 0, 1, or 2. Output Format: Return a single JSON object with the following fields: { "chain_of_thought": "Your step-by-step reasoning following Steps 1–4", "score": 0, "reasoning": "One-sentence summary of the final judgement" } IMPORTANT: The ‘score‘ field must be a raw INTEGER type (...
[47]

Please ensure that every factor has a result

Each factor in the response corresponds to a factor in the user_factor_profile. Please ensure that every factor has a result
[48]

Analyze the match between the <user_factor_profile> and <response_to_evaluate> for EACH factor
[49]

Assign a score of 0, 1, or 2.Important: The ‘score‘ field must be a raw INTEGER type (int), do not use strings
[50]

Provide a brief reasoning for your score
[51]

Figure A7: Random-Factor Evaluation Prompt

Output the result in strict JSON format. Figure A7: Random-Factor Evaluation Prompt. Your response must incorporate personalization by addressing the following six dimensions:
[52]

Cognitive Trust:Does the response align with the user’s threshold for trust and credibility within this specific domain?
[53]

Situational Anchoring:Is the response pre- cisely calibrated to the user’s immediate con- text and specific problem?
[54]

Schema Consistency:Does the response in- tegrate coherently with the user’s prior knowl- edge and existing mental models?
[55]

Cognitive Load Management:Is the com- plexity of the response tailored to match the user’s cognitive capacity?
[56]

Metacognitive Scaffolding:Does the re- sponse provide structural support that fosters the user’s critical thinking skills?
[57]

answer":

Affective and Motivational Resonance:Does the response resonate with the user’s current emotional state and motivational orientation? Figure A8: Factors Prompt. System: Role: You are an intelligent assistant skilled in teach- ing. Your task is to generate personalized answers to user questions. User: User’s Question: {question} The response should be in J...
[60]

answer":

Ensure the output adheres to the specified format. Figure A9: No-Personalization Prompt. System: Role: You are an intelligent assistant skilled in teach- ing. Your task is to generate responses tailored to the user’s individual understanding, based on their question history. User: User’s Historical Questions: {Historical_questions} User’s Current Question...
[63]

answer":

Ensure the output adheres to the specified format. Figure A10: Time-Personalization Prompt. System: Role: You are an intelligent assistant skilled in teach- ing. Your task is to generate responses tailored to the user’s individual understanding, based on their question history. User: User’s Historical Questions: {Historical_questions} User’s Current Quest...
[66]

Figure A11: RAG-Personalization Prompt

Ensure the output adheres to the specified format. Figure A11: RAG-Personalization Prompt. 0 10 20 30 40 50 all 0.6 0.7 0.8 0.9Score Engineering&Tools Science&Theory Lifestyle&Society Leisure&Fandom Figure A12: Impact of User History Context Length (K) on Model Performance System: Role: You are an intelligent assistant. User: question: {question} Please t...
[67]

domain":

Please provide your decision in JSON format, fol- lowing this structure: { "domain": "A summarization of which domain this question is related to" (if you are unable to sum- marize it, please set this value to "None"), "reasoning": "briefly explain your reasoning for the summarization" }
[68]

Please ensure ’domain’ is a single word
[69]

reasoning

The "reasoning" has no word limits
[70]

Figure A13: Domain Extract Prompt

Do not provide any other text outside the JSON string. Figure A13: Domain Extract Prompt. System: You are an expert in {domain}. Please tell me the profile the user who asked the question in {domain}. User: question:{question} Requirements:
[71]

profile":

Please provide your summary in JSON format, following this structure: { "profile": "A Summary of the user’s profile in this domain"(starting with "This user"), "reasoning": "briefly explain your reasoning for the summarization" }
[73]

Figure A14: Domain Profile Extract Prompt

Do not provide any other text outside the JSON string. Figure A14: Domain Profile Extract Prompt. System: You are an expert in {domain}. Please generate a new user profile based on the user’s historical profile and the current profile. User: history user profile:{history} current user profile:{current} Requirements:
[74]

profile":

Please provide your output in JSON format, fol- lowing this structure: { "profile": "A Summary of the user’s profile in this domain"(starting with "This user"), "reasoning": "briefly explain your reasoning for the summarization" }
[75]

Please ensure that the "profile" is accurate
[76]

Figure A15: Domain Profile Synthesize Prompt

Do not provide any other text outside the JSON string. Figure A15: Domain Profile Synthesize Prompt. System: You are an intelligent assistant. Please summarize the user profile based on the information about the user in various domains. User: user global profile: {global_profile} user domain profile: {domain_profile} Requirements:
[77]

profile":

Please provide your output in JSON format, fol- lowing this structure: { "profile": "A Summary of user profile" (starting with "This user"), "reasoning": "briefly explain your reasoning for the summarization" }
[78]

Please ensure that the "profile" is comprehensive and accurate
[79]

answer":

Do not provide any other text outside the JSON string. Figure A16: Global Profile Generate Prompt. System: Role: You are an intelligent assistant skilled in teach- ing. Your task is to generate personalized answers to user questions. User: User’s Profile: {user_profile} User’s Current Question: {Current_question} The response should be in JSON format: { "...
[80]

The answer needs to be accurate
[81]

Reasoning is a brief explanation of how you ar- rived at the answer above
[82]

Figure A17: Answer Generation Prompt

Ensure the output adheres to the specified format. Figure A17: Answer Generation Prompt. 103 104 105 Number of Questions (Log Scale) unix tex salesforce electronics apple gis wordpress dba sharepoint magento drupal codereview softwareengineering security gamedev android ethereum blender meta webmasters ux webapps bitcoin raspberrypi emacs dsp arduino netw...

2067