SenseJudge: Human-Centric Preference-Driven Judgment Framework

Junfeng Liu; Linhai Xu; Rui Li; Xiangwen Kong; Zhifang Sui

arxiv: 2606.03189 · v2 · pith:J6B7B6T5new · submitted 2026-06-02 · 💻 cs.CL

Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui

Pith reviewed 2026-06-28 10:51 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM as judge · personalized judgment · human preferences · SenseBench · model ranking · multi-turn interactions · instruction following

The pith

SenseJudge is a customizable framework that drives LLM judgments from diverse human preferences instead of fixed training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SenseJudge, a judgment framework that extracts and applies human preferences to let LLMs evaluate responses in a personalized way. It pairs this with SenseBench, a benchmark built from real multi-turn human-AI interactions. Experiments on two tasks show SenseJudge beats prior judgment methods for personalized evaluation and produces model rankings that match human judgments. The work targets the gap where fixed-preference judgers ignore individual tastes and fail to handle dynamic dialogues. If correct, this would let evaluation adapt to personal sense rather than one-size-fits-all preferences.

Core claim

SenseJudge is a human-preference-driven customizable judgment framework, together with SenseBench derived from real-world multi-turn interactions; when applied to LLMs-as-personalized-judges and model-ranking tasks it surpasses other methods and aligns rankings with real human sense.

What carries the argument

The SenseJudge framework, which customizes judgments by incorporating user-specific preferences extracted from interactions.
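
As a rough illustration of what injecting user preferences into a pairwise judgment could look like, the following Python sketch builds a judging prompt from a list of preference statements and parses the verdict from the judge's reply. The function names (`build_judge_prompt`, `parse_decision`), the prompt layout, and the score-then-decide wording are assumptions made for this example; they are not the paper's implementation, and no model call is included.

```python
import re
from typing import List, Optional


def build_judge_prompt(preferences: List[str], dialogue: str,
                       response_a: str, response_b: str) -> str:
    """Assemble a pairwise judging prompt conditioned on user preferences.

    The layout is hypothetical: preferences first, then the dialogue,
    then both candidate responses, then a two-step instruction.
    """
    pref_block = "\n".join(f"- {p}" for p in preferences)
    return (
        "User preferences:\n" + pref_block + "\n\n"
        "Dialogue:\n" + dialogue + "\n\n"
        "Response A:\n" + response_a + "\n\n"
        "Response B:\n" + response_b + "\n\n"
        "Step 1: Score each of the two responses based on the user preferences.\n"
        "Step 2: Based on the scores, output 'The final decision is Response A.' "
        "or 'The final decision is Response B.'"
    )


# Matches the verdict sentence requested in step 2 of the prompt.
_DECISION = re.compile(r"The final decision is Response ([AB])")


def parse_decision(judge_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the judge's reply, or None if no verdict is found.

    The last match wins, so a reply that restates the instruction before
    giving its verdict is still parsed correctly.
    """
    matches = _DECISION.findall(judge_output)
    return matches[-1] if matches else None
```

Keeping the preference list as a plain argument means the same judge model can be reused for different users by swapping the list, which is the kind of customization the summary describes. Returning `None` for unparseable replies lets the caller decide whether to retry or discard the sample.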

Load-bearing premise

Existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios.

What would settle it

Run a fresh set of multi-turn dialogues with new human raters providing explicit preferences; if SenseJudge rankings and judgments no longer match the humans better than baselines or fixed-preference models, the central claim fails.
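
One way to make this test concrete is to compare each method's model ranking against the human ranking with a rank-correlation statistic. The sketch below uses Kendall's tau (without tie handling) as that statistic; the choice of metric, the function name `kendall_tau`, and the model labels are assumptions for illustration, not details reported in the paper.

```python
from itertools import combinations
from typing import List


def kendall_tau(rank_a: List[str], rank_b: List[str]) -> float:
    """Kendall's tau between two rankings of the same items (best first, no ties)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    if len(pos_a) != len(rank_a) or len(pos_b) != len(rank_b):
        raise ValueError("rankings must not contain duplicates")
    if set(pos_a) != set(pos_b) or len(rank_a) < 2:
        raise ValueError("rankings must cover the same two or more items")

    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Same sign means both rankings order the pair the same way.
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings, best model first.
human = ["m1", "m2", "m3", "m4"]
candidate = ["m1", "m3", "m2", "m4"]   # one adjacent swap relative to humans
baseline = ["m4", "m3", "m2", "m1"]    # fully reversed

tau_candidate = kendall_tau(human, candidate)
tau_baseline = kendall_tau(human, baseline)
```

Under this reading, the central claim would be in trouble if, on fresh dialogues and raters, the tau for SenseJudge were no higher than the tau for the baselines.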

Figures

Figures reproduced from arXiv: 2606.03189 by Junfeng Liu, Linhai Xu, Rui Li, Xiangwen Kong, Zhifang Sui.

**Figure 2.** An overview of the data construction pipeline of SenseBench. The pipeline involves 1) Quality-Based …

**Figure 3.** The bar represents the absolute values of selecting response A (in the first place) and response B (in the second place) using the original model, while the bar represents the absolute values after applying SenseJudge.

**Figure 4.** Pairwise judgments by Qwen3-14B-Instruct and Llama3.1-8B-Instruct with SenseJudge across advanced …
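
The position-bias comparison behind Figure 3 can be approximated with a swap test: judge every pair twice, once in each order, then count how often the first slot wins and how often the same underlying response wins both times. The sketch below is a minimal version of that idea under stated assumptions; the `Judge` callable signature, the function `position_bias_report`, and the toy judges are hypothetical and stand in for a real LLM call.

```python
from typing import Callable, Dict, List, Tuple

# A judge takes (first_slot_text, second_slot_text) and returns "A" if it
# picks the first slot or "B" if it picks the second slot.
Judge = Callable[[str, str], str]


def position_bias_report(judge: Judge,
                         pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run each pair in both orders and summarize order sensitivity."""
    if not pairs:
        raise ValueError("need at least one response pair")

    first_slot_picks = 0
    consistent = 0
    for x, y in pairs:
        verdict_xy = judge(x, y)  # x shown in slot A
        verdict_yx = judge(y, x)  # y shown in slot A
        first_slot_picks += (verdict_xy == "A") + (verdict_yx == "A")
        winner_xy = x if verdict_xy == "A" else y
        winner_yx = y if verdict_yx == "A" else x
        consistent += winner_xy == winner_yx

    n = len(pairs)
    return {
        "first_slot_rate": first_slot_picks / (2 * n),  # 0.5 means no slot preference
        "consistency": consistent / n,                  # 1.0 means order never matters
    }


# Toy judges for illustration only.
def always_first(a: str, b: str) -> str:
    return "A"


def longer_wins(a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"


pairs = [("short", "a much longer answer"), ("detailed reply here", "ok")]
biased = position_bias_report(always_first, pairs)
content_based = position_bias_report(longer_wins, pairs)
```

A judge that decides purely on content should land near a first-slot rate of 0.5 with high consistency, while a judge that always favors one slot shows a rate near 0 or 1 and low consistency.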

Original abstract

Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios. To address these limitations, we propose SenseJudge, a customizable judgment framework driven by human preferences and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. We applied the automatic judgment framework and benchmark to two tasks: (1) LLMs as personalized judges, and (2) model ranking. We conducted extensive experiments, and the results demonstrate that the SenseJudge framework surpasses other judgment methods and models in the LLMs-as-personalized-judges task and achieves model ranking that aligns with real human sense. Additionally, we conducted analyses on position bias and consistency, alongside ablation studies, which affirmed the robustness of SenseJudge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SenseJudge adds a human-preference layer to LLM judging and a new multi-turn benchmark, but the superiority claims rest on experiments whose details and effect sizes are not visible in the summary.

The main thing to know is that this paper introduces SenseJudge as a customizable framework for LLM-as-judge work that incorporates human preferences on the fly, plus SenseBench built from real multi-turn interactions. They apply both to personalized judgment and model ranking tasks.

The work does a reasonable job identifying that fixed-preference trained judges miss user diversity and real dialogue variation. Running position-bias checks, consistency tests, and ablations is a positive step that shows some attention to reliability. If the experiments are solid, the benchmark could be a practical addition for people who need evaluation that reflects varied human senses rather than averaged preferences.

The soft spots are more noticeable. The abstract states that SenseJudge surpasses other methods and aligns with human rankings, yet supplies no concrete numbers, baselines, variance, or description of how preferences are actually injected into the judgment process. Without those, it is hard to tell whether the gains are meaningful or whether the new benchmark is harder in ways that matter. The full paper would need to make the experimental design and results transparent enough to judge.

This is aimed at researchers working on LLM evaluation, alignment, and dialogue systems who already use LLM judges and want something more adaptable. A reader focused on practical improvements in personalized settings could extract value from the benchmark and the framing, even if the performance edge needs verification.

I would send it to peer review. The core limitation it targets is real, and the proposal is concrete enough that referees can assess the execution and evidence directly.

Referee Report

1 major / 0 minor

Summary. The paper proposes SenseJudge, a customizable judgment framework driven by human preferences, along with SenseBench, a benchmark derived from real-world multi-turn interactions. It evaluates the framework on two tasks—LLMs as personalized judges and model ranking—claiming via extensive experiments that SenseJudge outperforms other judgment methods and produces model rankings aligned with human sense, with additional analyses on position bias, consistency, and ablations supporting robustness.

Significance. If the empirical claims hold, the work would address a genuine limitation in current LLM-as-judge paradigms by moving beyond fixed preference data toward preference-driven customization, potentially improving adaptability in human-AI dialogue settings. The reported alignment between automated rankings and human sense would constitute a concrete, falsifiable contribution to evaluation methodology.

major comments (1)

[Abstract] The central performance claims (surpassing other methods in the personalized-judges task and achieving human-aligned model ranking) are stated without any accompanying methods details, dataset statistics, error bars, experimental design, or baseline descriptions, rendering the claims unverifiable from the provided manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback. We address the single major comment below.

Referee: [Abstract] The central performance claims (surpassing other methods in the personalized-judges task and achieving human-aligned model ranking) are stated without any accompanying methods details, dataset statistics, error bars, experimental design, or baseline descriptions, rendering the claims unverifiable from the provided manuscript.

Authors: Abstracts are intentionally concise summaries and do not contain the full experimental details; this is standard practice. The complete manuscript supplies the requested information: the SenseJudge framework and human-preference mechanism are described in Section 3, SenseBench construction and dataset statistics appear in Section 4, the two evaluation tasks, baselines, experimental design, and results (with error bars where computed) are reported in Section 5 together with position-bias, consistency, and ablation analyses. The performance claims in the abstract are therefore directly supported by the body of the paper. If only the abstract was supplied to the referee, we are happy to provide the full manuscript.

Revision: no

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes an empirical framework (SenseJudge) and benchmark (SenseBench) for LLM judgment tasks, then reports experimental results on personalization and model ranking. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. Central claims rest on direct experimental comparisons rather than any chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5698 in / 1000 out tokens · 34149 ms · 2026-06-28T10:51:33.698670+00:00 · methodology

