arxiv: 2606.03189 · v2 · pith:J6B7B6T5new · submitted 2026-06-02 · 💻 cs.CL

SenseJudge: Human-Centric Preference-Driven Judgment Framework

Pith reviewed 2026-06-28 10:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM as judge · personalized judgment · human preferences · SenseBench · model ranking · multi-turn interactions · instruction following

The pith

SenseJudge is a customizable framework that drives LLM judgments from diverse human preferences instead of fixed training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SenseJudge, a judgment framework that extracts and applies human preferences to let LLMs evaluate responses in a personalized way. It pairs this with SenseBench, a benchmark built from real multi-turn human-AI interactions. Experiments on two tasks show SenseJudge beats prior judgment methods for personalized evaluation and produces model rankings that match human judgments. The work targets the gap where fixed-preference judgers ignore individual tastes and fail to handle dynamic dialogues. If correct, this would let evaluation adapt to personal sense rather than one-size-fits-all preferences.

Core claim

SenseJudge is a human-preference-driven customizable judgment framework, together with SenseBench derived from real-world multi-turn interactions; when applied to LLMs-as-personalized-judges and model-ranking tasks it surpasses other methods and aligns rankings with real human sense.

What carries the argument

SenseJudge framework that customizes judgments by incorporating user-specific preferences extracted from interactions.

Load-bearing premise

Existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios.

What would settle it

Run a fresh set of multi-turn dialogues with new human raters providing explicit preferences; if SenseJudge rankings and judgments no longer match the humans better than baselines or fixed-preference models, the central claim fails.
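The check described above could be scored along the following lines. This is a minimal sketch, not the paper's evaluation code: the function names `agreement_rate` and `spearman_rho` and all label lists are hypothetical, the toy values are placeholders rather than results, and the rank correlation assumes rankings without ties.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where the judge picks the human-preferred response."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman rank correlation between two tie-free rankings of the same models."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy placeholder labels (hypothetical, not taken from the paper).
human_choice = ["A", "B", "A", "A"]
preference_judge = ["A", "B", "A", "B"]  # stands in for a preference-driven judge
fixed_judge = ["B", "B", "A", "B"]       # stands in for a fixed-preference baseline

print(agreement_rate(preference_judge, human_choice))  # 0.75
print(agreement_rate(fixed_judge, human_choice))       # 0.5

# Rank positions of four models under human raters vs. the automated judge.
print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))
```

Under this reading, the central claim would be in trouble if the preference-driven judge's agreement and rank correlation on fresh data were no higher than the baseline's.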

Figures

Figures reproduced from arXiv: 2606.03189 by Junfeng Liu, Linhai Xu, Rui Li, Xiangwen Kong, Zhifang Sui.

Figure 1: The model judges responses on behalf of hu…
Figure 2: An overview of the data construction pipeline of SenseBench. The pipeline involves 1) Quality-Based…
Figure 3: The bar represents the absolute values of selecting response A (in the first place) and response B (in the second place) using the original model, while the bar represents the absolute values after applying SenseJudge.
… user preferences. Positional Bias: Previous studies (Wang et al., 2023a) have demonstrated that the relative position of two responses is an element that theoretically should be irrelevant …
Figure 4: Pairwise judgments by Qwen3-14B-Instruct and Llama3.1-8B-Instruct with SenseJudge across advanced…
Original abstract

Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios. To address these limitations, we propose SenseJudge, a customizable judgment framework driven by human preferences and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. We applied the automatic judgment framework and benchmark to two tasks: (1) LLMs as personalized judges, and (2) model ranking. We conducted extensive experiments, and the results demonstrate that the SenseJudge framework surpasses other judgment methods and models in the LLMs-as-personalized-judges task and achieves model ranking that aligns with real human sense. Additionally, we conducted analyses on position bias and consistency, alongside ablation studies, which affirmed the robustness of SenseJudge.
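The abstract mentions position-bias and consistency analyses without describing the protocol. One common way to probe both is to judge every pair twice with the slots swapped; the sketch below illustrates that idea and is not necessarily the paper's procedure. The `Judge` callable, the function `position_consistency`, and the two toy judges are hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def position_consistency(
    judge: Judge, pairs: Iterable[Tuple[str, str, str]]
) -> Tuple[float, float]:
    """Return (consistency, first_slot_rate) over position-swapped comparisons.

    consistency: share of pairs where the same response wins in both orders.
    first_slot_rate: share of all verdicts that favour slot A (0.5 means no slot preference).
    """
    consistent = 0
    first_slot_wins = 0
    total = 0
    for prompt, r1, r2 in pairs:
        original = judge(prompt, r1, r2)  # r1 shown in slot A
        swapped = judge(prompt, r2, r1)   # r1 shown in slot B
        # The same response wins both times exactly when the slot label flips.
        consistent += original != swapped
        first_slot_wins += (original == "A") + (swapped == "A")
        total += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return consistent / total, first_slot_wins / (2 * total)


# Two toy judges: one fully position-biased, one that ignores position.
always_first: Judge = lambda prompt, a, b: "A"
longer_wins: Judge = lambda prompt, a, b: "A" if len(a) >= len(b) else "B"

pairs = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
print(position_consistency(always_first, pairs))  # (0.0, 1.0)
print(position_consistency(longer_wins, pairs))   # (1.0, 0.5)
```

A judge that always prefers the first slot scores zero consistency, while a judge whose verdict depends only on content stays consistent under swapping.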

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes SenseJudge, a customizable judgment framework driven by human preferences, along with SenseBench, a benchmark derived from real-world multi-turn interactions. It evaluates the framework on two tasks (LLMs as personalized judges and model ranking) and claims, on the basis of extensive experiments, that SenseJudge outperforms other judgment methods and produces model rankings aligned with human sense. Additional analyses of position bias and consistency, together with ablations, are offered as evidence of robustness.

Significance. If the empirical claims hold, the work would address a genuine limitation in current LLM-as-judge paradigms by moving beyond fixed preference data toward preference-driven customization, potentially improving adaptability in human-AI dialogue settings. The reported alignment between automated rankings and human sense would constitute a concrete, falsifiable contribution to evaluation methodology.

major comments (1)
  1. [Abstract] The central performance claims (surpassing other methods in the personalized-judges task and achieving human-aligned model ranking) are stated without any accompanying methods details, dataset statistics, error bars, experimental design, or baseline descriptions, rendering the claims unverifiable from the provided manuscript.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback. We address the single major comment below.

  1. Referee: [Abstract] The central performance claims (surpassing other methods in the personalized-judges task and achieving human-aligned model ranking) are stated without any accompanying methods details, dataset statistics, error bars, experimental design, or baseline descriptions, rendering the claims unverifiable from the provided manuscript.

    Authors: Abstracts are intentionally concise summaries and do not contain the full experimental details; this is standard practice. The complete manuscript supplies the requested information: the SenseJudge framework and human-preference mechanism are described in Section 3, SenseBench construction and dataset statistics appear in Section 4, and the two evaluation tasks, baselines, experimental design, and results (with error bars where computed) are reported in Section 5 together with position-bias, consistency, and ablation analyses. The performance claims in the abstract are therefore directly supported by the body of the paper. If only the abstract was supplied to the referee, we are happy to provide the full manuscript.

    revision: no

Circularity Check

0 steps flagged

No significant circularity identified

The paper proposes an empirical framework (SenseJudge) and benchmark (SenseBench) for LLM judgment tasks, then reports experimental results on personalization and model ranking. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. Central claims rest on direct experimental comparisons rather than any chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

17 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1]

    Xianzhe Fan, Qing Xiao, Xuhui Zhou, Jiaxin Pei, Maarten Sap, Zhicong Lu, and Hong Shen

    Can LLM be a personalized judge? Preprint, arXiv:2406.11657. Xianzhe Fan, Qing Xiao, Xuhui Zhou, Jiaxin Pei, Maarten Sap, Zhicong Lu, and Hong Shen. 2025. User-driven value alignment: Understanding users’ perceptions and strategies for addressing biased and discriminatory statements in AI companions. Preprint, arXiv:2409.00862. Aaron Grattafiori, Abhimany...

  2. [2]

    DeepSeek-V3 Technical Report

    DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian

  3. [3]

    A Comprehensive Overview of Large Language Models

    A comprehensive overview of large language models. Preprint, arXiv:2307.06435. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, an...

  4. [4]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Direct preference optimization: Your language model is secretly a reward model. Preprint, arXiv:2305.18290. Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. 2025. Democratizing large language models via personalized parameter-efficient fine-tuning. Preprint, arXiv:2402.04401. Kimi Team, Angang Du, Bofei Gao, Bowei Xing...

  5. [5]

    Personalization of Large Language Models: A Survey

    Personalization of large language models: A survey. Preprint, arXiv:2411.00027. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat: 1M ChatGPT interaction logs in the wild. Preprint, arXiv:2405.01470. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan...

  6. [6]

    JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631, 2023

    JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631. Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. 2025. Personality alignment of large language models. Preprint, arXiv:2408.11779. A Appendix A.1 Discussion A.1.1 Details of SenseBench We provide the statistics of SenseBench in Table 5. We provide the ...

  7. [7]

    Score each of the two responses based on the user preferences

  8. [8]

    The final decision is Response A

    Based on the scores obtained in the first step, determine which response is better. If Response A is better, output “The final decision is Response A.” If Response B is better, output “The final decision is Response B.” A.3 Preference Case Preference for math tasks extracted from the development set "Based on the comparison, the user’s persona demonstrates t...

  9. [9]

    They strongly value methodical reasoning that transparently explores multiple approaches and validates failures, prioritizing thorough cognitive processes over conventional solutions

  10. [10]

    They prefer responses that explicitly build verification frameworks and test edge cases, rejecting shortcuts that lack demonstrated iterative refinement

  11. [11]

    They seek pedagogical clarity through structured decomposition of assumptions, showing aversion to answers that prioritize memorized conclusions over original analytical scaffolding. This persona prefers the Chosen Response for its stepwise validation of failed strategies and truth-table proofs, while rejecting the alternative for its faster-to-conclusio...

  12. [12]

    **Comprehensive Logical Reasoning:** They prefer answers that break down the scenario step-by-step, exploring potential starting points and logical implications, rather than stating a direct conclusion without thorough justification

  13. [13]

    **Acknowledgement of Edge Cases & Nuance:** They appreciate responses that explicitly consider edge cases (like being the first place initially) and contextual factors, showing awareness that real-world questions often have layers beyond the surface

  14. [14]

    **Structured and Explicit Answer Presentation:** They favor responses that clearly summarize the primary conclusion after presenting the reasoning, making the final answer distinct and easy to identify, rather than leaving it embedded within the explanation. They reject responses perceived as overly simplistic or lacking in explanatory depth.", "Based on...

  15. [15]

    They value analytical rigor and systematic problem-solving approaches, seeking responses that methodically break down constraints and explore multiple strategies rather than presenting isolated solutions without justification

  16. [16]

    They prefer responses that optimize for efficiency by testing different scenarios and validating the optimal solution, rejecting approaches that overlook practical time-saving tactics or introduce unnecessary steps

  17. [17]

    Filtered preference set obtained after selection

    Their learning style prioritizes conceptual clarity over fragmented execution, favoring explanations that emphasize logical reasoning patterns applicable to similar challenges rather than ad-hoc step sequences.", "Based on the preferred response, the user values thorough, methodical explanations that explicitly outline academic reasoning processes, in...
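
The judging instructions captured in entries 7 and 8 above ask the judge to score both responses and then emit a fixed sentence, "The final decision is Response A." or "The final decision is Response B." A downstream script could recover the verdict from free-form judge output roughly as follows. This is a sketch under assumptions: the helper name `parse_final_decision`, the choice to take the last match, and the sample outputs are hypothetical and not part of the paper.

```python
import re
from typing import Optional

# Matches the fixed verdict sentence quoted in the extracted prompt text.
_DECISION = re.compile(r"The final decision is Response ([AB])")


def parse_final_decision(judge_output: str) -> Optional[str]:
    """Return "A" or "B" from the judge's output, or None if no verdict sentence is found.

    The last match is used (an assumption), so a verdict sentence that the judge
    merely echoes from the prompt earlier in its output does not override its answer.
    """
    matches = _DECISION.findall(judge_output)
    return matches[-1] if matches else None


# Hypothetical judge outputs for illustration.
sample = "Response A: 6/10. Response B: 8/10. The final decision is Response B."
print(parse_final_decision(sample))             # B
print(parse_final_decision("No verdict here"))  # None
```

Returning `None` instead of guessing keeps malformed outputs visible, so they can be counted or re-queried rather than silently scored.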