SCAN: Structured Capability Assessment and Navigation for LLMs

Chen Gong; Siqi Bao; Tianle Gu; Xin Tian; Yujiu Yang; Zongqi Wang

arxiv: 2505.06698 · v4 · submitted 2025-05-10 · 💻 cs.CL

Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang

Pith reviewed 2026-05-22 16:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluation · fine-grained assessment · hierarchical taxonomy · capability analysis · LLM-as-a-Judge · query synthesis · model comparison

The pith

SCAN builds an automatic hierarchical taxonomy to support fine-grained evaluation of LLM capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SCAN as a framework that moves past overall model rankings to deliver detailed maps of specific LLM strengths and weaknesses. It does this by automatically extracting capability tags into a hierarchy, synthesizing enough test queries for each tag, supplying visualization tools, and using a pre-comparison-derived criteria method that improves LLM-as-a-Judge accuracy. When the framework is run on 21 mainstream models, analysis of the GPT-OSS family shows clear performance differences even among sub-capabilities that fall inside the same broad category. This result indicates that coarse evaluations can hide important distinctions in model behavior.

Core claim

SCAN incorporates TaxBuilder to extract capability-indicating tags from large query sets and build a hierarchical taxonomy, RealMix to synthesize and filter queries ensuring adequate coverage per tag, navigation and visualization tools, and a PC²-based LLM-as-a-Judge that reaches higher accuracy than standard LLM judges; evaluation of 21 LLMs with this system reveals substantial performance variations within sub-capabilities of the same category in families such as GPT-OSS.
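
To make the PC² step more concrete, here is a minimal sketch of how a criteria list could be parsed and applied. It assumes the format used in the paper's criteria-extraction prompt (one `description | weight` line per secondary metric, integer weights summing to 100). The function names `parse_criteria` and `weighted_score`, and the 0 to 10 per-criterion scale, are illustrative assumptions rather than the authors' implementation.

```python
def parse_criteria(block: str) -> list[tuple[str, int]]:
    """Parse 'description | weight' lines into (description, weight) pairs."""
    criteria = []
    for line in block.splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # skips wrapper tags and blank lines
        description, _, weight = line.rpartition("|")
        criteria.append((description.strip(), int(weight.strip())))
    total = sum(weight for _, weight in criteria)
    if total != 100:
        raise ValueError(f"weights must sum to 100, got {total}")
    return criteria


def weighted_score(criteria: list[tuple[str, int]], scores: list[float]) -> float:
    """Combine per-criterion scores (assumed 0 to 10) using the weights."""
    if len(criteria) != len(scores):
        raise ValueError("need exactly one score per criterion")
    return sum(weight * score for (_, weight), score in zip(criteria, scores)) / 100


# Hypothetical judge output for one query.
raw = """<Evaluation_Framework>
Correctness of the code | 60
Clarity of explanation | 40
<Evaluation_Framework>"""

criteria = parse_criteria(raw)
overall = weighted_score(criteria, [10, 5])
```

Rejecting any criteria list whose weights do not sum to 100 catches malformed judge output before it reaches the scoring step.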

What carries the argument

The hierarchical taxonomy of capability tags automatically constructed by TaxBuilder from extensive queries, which organizes abilities into categories and sub-capabilities to enable targeted assessment and navigation.
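
According to the paper's description of node insertion, a new tag is placed by a decision at each level of the tree: it either joins the current node's children or is passed recursively into one specific child. The sketch below illustrates that control flow only. The `Node` class, the `Decider` signature, and the lookup-based `toy_decider` are hypothetical stand-ins for the paper's LLM-as-Decision-Maker, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One capability tag in the taxonomy tree."""
    name: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical decision function standing in for the LLM call.
# Returns the child that should contain the tag, or None when the
# tag should become a new direct child of `parent`.
Decider = Callable[[Node, str], Optional[Node]]


def insert_tag(parent: Node, tag: str, decide: Decider) -> Node:
    """Insert `tag` below `parent`, descending one level per decision."""
    target = decide(parent, tag)
    if target is None:
        new_node = Node(tag)
        parent.children.append(new_node)
        return new_node
    if all(target is not child for child in parent.children):
        raise ValueError("decider must return a direct child of parent")
    return insert_tag(target, tag, decide)


# Toy decider: a fixed lookup replaces the LLM's judgment.
HINTS = {"Python": "Programming Languages", "Rust": "Programming Languages"}


def toy_decider(parent: Node, tag: str) -> Optional[Node]:
    for child in parent.children:
        if HINTS.get(tag) == child.name:
            return child
    return None


root = Node("coding", [Node("Programming Languages"), Node("Algorithms")])
insert_tag(root, "Python", toy_decider)     # descends into "Programming Languages"
insert_tag(root, "Debugging", toy_decider)  # becomes a new child of "coding"
```

Each decision only sees one node and its direct children, which is the point of the design: the decider never needs the whole taxonomy in context.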

If this is right

Developers can locate precise strengths and weaknesses rather than relying on aggregate scores.
Model comparisons gain precision when conducted at the sub-capability level instead of broad categories.
Evaluation sets maintain sufficient data coverage for every identified tag through automated query synthesis and filtering.
LLM-based judging improves in accuracy by deriving comparison criteria in advance.
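
The strength and weakness localization can be sketched as a rank comparison. The paper flags a node when a model's ranking at that node differs from its overall ranking by more than a threshold (set to 5 in the authors' experiments). The function name `flag_nodes`, the dotted node names, and the sign convention (rank 1 is best, so a smaller local rank counts as a strength) are assumptions made for illustration.

```python
def flag_nodes(global_rank: int, local_ranks: dict[str, int], threshold: int = 5) -> dict[str, str]:
    """Label nodes whose local rank deviates from the global rank by more than `threshold`."""
    flags = {}
    for node, local_rank in local_ranks.items():
        # Assumed convention: positive deviation means the model ranks
        # better (numerically lower) at this node than it does overall.
        deviation = global_rank - local_rank
        if deviation > threshold:
            flags[node] = "strength"
        elif deviation < -threshold:
            flags[node] = "weakness"
    return flags


# Hypothetical ranks for one model that is 10th overall.
flags = flag_nodes(
    global_rank=10,
    local_ranks={"coding.Python": 3, "coding.Rust": 17, "coding.SQL": 12},
)
```

Nodes within the threshold, such as `coding.SQL` here, are left unflagged, so the output lists only the pronounced deviations.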

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Running SCAN on additional model families or new domains could expose capability patterns invisible to existing benchmarks.
The taxonomy could be iteratively improved by feeding human corrections back into TaxBuilder to reduce extraction biases.
Visualization outputs might be used to guide targeted data collection or fine-tuning for specific weak sub-capabilities.
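
One way to decide where targeted data collection would pay off is the stability check the paper describes: for a parent node, compute the standard deviation of a model's rankings across its child nodes, and treat parents whose deviation falls in the top 20% as unstable. The sketch below assumes a population standard deviation and a simple ceiling-based cutoff. Both choices, and the name `unstable_parents`, are illustrative assumptions.

```python
import math
import statistics


def unstable_parents(child_ranks: dict[str, list[int]], top_fraction: float = 0.2) -> list[str]:
    """Return parent nodes whose child-rank spread is in the top `top_fraction`."""
    # Parents with fewer than two children have no meaningful spread.
    spreads = {
        parent: statistics.pstdev(ranks)
        for parent, ranks in child_ranks.items()
        if len(ranks) >= 2
    }
    cutoff = math.ceil(top_fraction * len(spreads))
    ordered = sorted(spreads, key=spreads.get, reverse=True)
    return ordered[:cutoff]


# Hypothetical per-child rankings of one model under five parent nodes.
ranks = {
    "A": [1, 20],
    "B": [5, 6],
    "C": [10, 10],
    "D": [3, 4],
    "E": [7, 9],
}
unstable = unstable_parents(ranks)
```

A parent flagged this way suggests that a low aggregate score may come from a few sub-capabilities rather than from uniform weakness.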

Load-bearing premise

The tags automatically extracted by TaxBuilder form a valid, non-redundant hierarchical taxonomy that captures distinct LLM capabilities without significant bias or omission.

What would settle it

Re-running TaxBuilder on the same query collection yields a substantially different hierarchy, or manual review finds many overlapping or missing capabilities in the resulting taxonomy.
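
A re-run comparison of this kind needs some measure of how different two hierarchies are. One simple option, which is an editorial suggestion and not something taken from the paper, is the Jaccard similarity of the parent-child edge sets. The dictionary representation and the function names below are assumptions.

```python
def edges(tree: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Flatten a {parent: [children]} mapping into a set of (parent, child) edges."""
    return {(parent, child) for parent, children in tree.items() for child in children}


def edge_jaccard(a: dict[str, list[str]], b: dict[str, list[str]]) -> float:
    """Jaccard similarity of two taxonomies' edge sets (1.0 means identical)."""
    edges_a, edges_b = edges(a), edges(b)
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


# Two hypothetical TaxBuilder runs over the same query collection.
run_1 = {
    "coding": ["Programming Languages", "Algorithms"],
    "Programming Languages": ["Python"],
}
run_2 = {
    "coding": ["Programming Languages", "Debugging"],
    "Programming Languages": ["Python"],
}
similarity = edge_jaccard(run_1, run_2)
```

Edge overlap treats renamed tags as different nodes, so a low score may reflect inconsistent naming as well as structural disagreement. Manual review would still be needed to tell the two apart.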

Figures

Figures reproduced from arXiv: 2505.06698 by Chen Gong, Siqi Bao, Tianle Gu, Xin Tian, Yujiu Yang, Zongqi Wang.

**Figure 1.** An overview of the SCAN framework. Excerpt from the adjacent paper text: "… Further details regarding this step are elaborated in § C.1.2. Node Insertion. The most core design of TaxBuilder is its node insertion mechanism. A naive approach would be to input the entire taxonomy as context to a powerful LLM and ask it to identify the position for the new node. However, this task involves both long-context and complex reasoning, which is ch…"

**Figure 2.** A subset of SCAN-T-V0. Please refer to the project page for the complete taxonomy. Excerpt from the adjacent paper text: "… $T^{c}_{cur}$ is updated to include $t_g$, i.e., $T^{c}_{cur} = \{T^{c_1}_{cur}, T^{c_2}_{cur}, \ldots, T^{c_m}_{cur}\} \cup \{t_g\}$. • $T^{c_i}_{cur}$ (Child Node): This implies that $t_g$ should be a child of the specific node $T^{c_i}_{cur}$. In this scenario, a recursive insertion is performed. After recursively traversing the entire tree, we successfully insert the node $t_g$ and obta…"

**Figure 3.** An overview of TaxBuilder, a tree-based automatic taxonomy generation.

**Figure 4.** Comparison of RealMix and real user queries, judged by five human evaluators. Excerpt from the adjacent paper text: "… and its corresponding real user query (i.e., reference query), annotators evaluate (1) which query is higher in quality and (2) which is more likely to occur in real-world scenarios. Results in …"

**Figure 5.** Fine-grained performance comparison on coding.Programming Languages.General-purpose Languages. Excerpt from the adjacent paper text: "… effectiveness of our approach in uncovering fine-grained capability profiles and identifying performance that holistic scores alone cannot capture. Fine-grained Coding Performance Analysis. Our SCAN framework enables decomposition of the aggregate coding score by programming language, allowing evaluation of model per…"

**Figure 6.** The top 4 parent nodes exhibiting the largest performance variance across child nodes, automatically …

**Figure 7.** Prompt for pre-comparison-derived criteria extraction.

**Figure 8.** Prompt for evaluation execution. An adjacent table from the paper is captured alongside the caption (truncated in the source):

| Model Combination | ACC |
| --- | --- |
| single model (gpt-4o) | 0.5974 |
| gpt-4o + doubao-pro-1.5-32k + Deepseek-V3 | 0.6962 |
| gpt-4o + doubao-pro-1.5-32k + Qwen2.5-7B-Instruct | 0.6904 |
| gpt-4o + doubao-pro-1.5-32k + Meta-Llama-3.1-8B-Instruct | 0.6747 |
| gpt-4o + doubao-pro-1.5-32k + Phi-4-mini-instruct | 0.7077 |
| gpt-4o + Qwen2.5-7B-Instruct + Meta-Llama-3.1-8B-Instruct | 0.6825 |
| gpt-4o + Qwen2.5-7B-Instruct + Ph… | … |

**Figure 9.** Prompt for evaluation execution with baseline answer.

**Figure 10.** Prompt for naive criteria decomposition.

**Figure 11.** Initial taxonomy for coding domain. Excerpt from the adjacent paper text: "…ing at a specific evaluation node with its overall ranking. If the local ranking differs from the global ranking by more than a predefined threshold (set to 5 in our experiments), the node is flagged as a weakness (negative deviation) or as a strength (positive deviation). This enables rapid localization of domains where the model exhibits pronounced strengths or weak…"

**Figure 12.** Prompt for initial annotation of tags. Excerpt from the adjacent paper text: "… underperformance in a hierarchical category is comprehensive or localized. For any underperforming parent node, it calculates the standard deviation (STD) of rankings across its child nodes. If the STD is within the top 20% of all nodes, the performance is considered unstable, meaning that low performance may be caused by a few subcapabilities. If the STD is in t…"

**Figure 13.** Prompt of LLM-as-Decision-Maker (node insertion).

**Figure 14.** Prompt of LLM-as-Decision-Maker (node refinement and pruning).

**Figure 15.** Prompt of domain annotation. Excerpt from the adjacent prompt, Prompt of Query Tags Annotation (writing): "You are tasked with categorizing a given query into multiple tags based on the following hierarchical classification system. ### Classification System: {taxonomy} ### Rules for Tagging: 1. **Hierarchy Rule**: If a query matches both a parent node and its child nodes, include both in the tags. 2. **Multiple Matches**: If a query matches…"

**Figure 16.** Prompt of query tags annotation (writing).

**Figure 17.** Prompt of query tags annotation (roleplay).

**Figure 18.** Prompt of query tags annotation (knowledge).

**Figure 19.** Prompt of query tags annotation (coding).

**Figure 20.** Prompt of query tags annotation (mathematics).

**Figure 21.** Prompt of query tags annotation (reasoning).

**Figure 22.** Prompt of quality annotation (writing).

**Figure 23.** Prompt of quality annotation (roleplay).

**Figure 24.** Prompt of quality annotation (knowledge).

**Figure 25.** Prompt of quality annotation (coding).

**Figure 26.** Prompt of quality annotation (mathematics).

**Figure 27.** Prompt of quality annotation (reasoning).

**Figure 28.** Prompt of RealMix.

**Figure 29.** Questionnaire protocol for comparative evaluation.

**Figure 30.** The annotation interface used for labeling the human preference dataset.

**Figure 31.** Domain distribution of SCAN-HPD. Excerpt from the adjacent example, Example (1) of Pairwise Evaluation (prompt and thinking process): "... [[A]] **Explanation:** - **Correctness:** Assistant A’s answer directly addresses the user’s requirement to track the *number of open tabs* on the same domain using `localStorage`. It provides a straightforward `TabTracker` class with methods to increment/decrement the count and store it in `localStorage`…"

**Figure 32.** Example (1) of Pairwise Evaluation.

**Figure 33.** Example (2) of Pairwise Evaluation.

Original abstract

Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model rankings, such benchmarks fail to provide users and developers with a comprehensive and fine-grained understanding of a specific model's capabilities. To fill this gap, we propose SCAN (Structured Capability Assessment and Navigation), a practical framework that enables detailed characterization of LLM capabilities through comprehensive and fine-grained evaluation. SCAN incorporates four key components: (1) TaxBuilder, which extracts capability-indicating tags from extensive queries to construct a hierarchical taxonomy automatically; (2) RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; (3) a suite of visualization and analysis tools that facilitate efficient navigation and analysis of model capabilities; and (4) a PC²-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves significantly higher accuracy compared to classic LLM-as-a-Judge method. Using SCAN, we conduct a comprehensive evaluation of 21 mainstream LLMs. Our detailed analysis of the GPT-OSS family reveals substantial performance variations, even within sub-capabilities belonging to the same category of capability. This finding highlights the importance of fine-grained evaluation in accurately understanding LLM behavior. Project homepage and resources are available at https://github.com/liudan193/SCAN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SCAN, a framework for fine-grained LLM capability evaluation consisting of TaxBuilder (automatic hierarchical taxonomy extraction from queries), RealMix (query synthesis/filtering), visualization tools, and a PC²-based LLM-as-a-Judge claimed to outperform classic methods. Using SCAN, the authors evaluate 21 mainstream LLMs and report substantial performance variations within sub-capabilities of the same category in the GPT-OSS family, arguing this demonstrates the value of detailed rather than coarse-grained assessment.

Significance. If the taxonomy is shown to be valid and non-redundant and the PC² judge accuracy gains are quantified and reproducible, SCAN could move the field beyond aggregate rankings toward actionable, navigable capability profiles that help developers and users identify specific strengths and gaps. The intra-family variation finding, if robust, would reinforce the need for fine-grained benchmarks.

major comments (3)

[TaxBuilder] TaxBuilder section: the automatic extraction of capability-indicating tags and construction of the hierarchical taxonomy is presented without any validation against human experts, inter-annotator agreement metrics, or overlap with established frameworks (e.g., BIG-bench or HELM categories). Because the central claim of substantial intra-category variations in GPT-OSS rests on this taxonomy forming a valid, non-redundant partition of distinct capabilities, the absence of such checks is load-bearing.
[PC² Judge / Results] PC² Judge and results sections: the abstract asserts significantly higher accuracy for the PC² judge, yet no quantitative metrics, validation procedure, error analysis, or comparison tables are referenced in the provided text. Without these, it is impossible to determine whether the claimed improvement supports the framework's practical utility or is an artifact of the evaluation setup.
[GPT-OSS family analysis] GPT-OSS analysis: the reported substantial performance variations across sub-capabilities lack accompanying details on measurement protocol, statistical significance testing, or controls for query-distribution effects. If these differences are driven by taxonomy artifacts rather than genuine capability distinctions, the highlighted importance of fine-grained evaluation would not follow.

minor comments (2)

[Abstract] Abstract: the phrase 'significantly higher accuracy' would benefit from a brief parenthetical summary of the actual accuracy delta or F1 improvement to orient readers immediately.
[Methods] Notation: the PC² acronym is introduced as 'Pre-Comparison-derived Criteria' but subsequent usage would be clearer if the expansion were repeated on first use in the methods section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our work. We respond to each major comment point by point below, indicating the changes we will make to the manuscript.

Referee: [TaxBuilder] TaxBuilder section: the automatic extraction of capability-indicating tags and construction of the hierarchical taxonomy is presented without any validation against human experts, inter-annotator agreement metrics, or overlap with established frameworks (e.g., BIG-bench or HELM categories). Because the central claim of substantial intra-category variations in GPT-OSS rests on this taxonomy forming a valid, non-redundant partition of distinct capabilities, the absence of such checks is load-bearing.

Authors: We acknowledge the importance of validating the taxonomy to support our central claims. Although the current version focuses on the automatic construction process, we agree that additional checks are needed. In the revised manuscript, we will incorporate a human validation study, including inter-annotator agreement metrics, and compare our taxonomy with established frameworks such as those in BIG-bench and HELM. This will demonstrate the validity and non-redundancy of the capability partition.
Revision: yes

Referee: [PC² Judge / Results] PC² Judge and results sections: the abstract asserts significantly higher accuracy for the PC² judge, yet no quantitative metrics, validation procedure, error analysis, or comparison tables are referenced in the provided text. Without these, it is impossible to determine whether the claimed improvement supports the framework's practical utility or is an artifact of the evaluation setup.

Authors: We apologize for any lack of clarity in referencing the quantitative results. The manuscript does present quantitative metrics for the PC² judge's accuracy, including comparisons to classic methods, along with the validation procedure and error analysis in the relevant results section. To address this, we will add explicit references, a summary table, and expanded discussion in the revised version to make these elements more prominent.
Revision: partial

Referee: [GPT-OSS family analysis] GPT-OSS analysis: the reported substantial performance variations across sub-capabilities lack accompanying details on measurement protocol, statistical significance testing, or controls for query-distribution effects. If these differences are driven by taxonomy artifacts rather than genuine capability distinctions, the highlighted importance of fine-grained evaluation would not follow.

Authors: We note that the measurement protocol for the GPT-OSS analysis is outlined in the evaluation section. However, we agree that additional statistical details would strengthen the findings. In the revision, we will include statistical significance testing for the performance variations and discuss controls for query-distribution effects to rule out artifacts and confirm the value of fine-grained assessment.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in SCAN framework components or claims

The paper introduces SCAN as an applied evaluation framework rather than a mathematical derivation. TaxBuilder automatically extracts tags to form a hierarchy, RealMix synthesizes queries, and PC² is presented as an LLM judge with claimed accuracy gains over baselines. The central empirical finding (intra-category variations in GPT-OSS models) is an observation obtained by applying the framework to 21 LLMs; it does not reduce to a fitted parameter or self-referential definition by any equation in the paper. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is evident in the provided text. The taxonomy construction is described as an input tool whose validity is assumed rather than derived from the evaluation results themselves. This is a self-contained systems paper whose claims rest on external data collection and comparison, not on internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Framework rests on domain assumptions about query-derived tags forming meaningful hierarchies and synthesized data being unbiased; no explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: Queries contain extractable capability-indicating tags that can be organized into a non-arbitrary hierarchical taxonomy.
Invoked by the TaxBuilder component as the basis for automatic taxonomy construction.
domain assumption: Queries synthesized and filtered by RealMix provide sufficient and representative evaluation data for each tag.
Central to ensuring coverage in the evaluation suite.

pith-pipeline@v0.9.0 · 5797 in / 1280 out tokens · 63094 ms · 2026-05-22T16:07:15.928552+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: TaxBuilder... extracts capability-indicating tags... recursive node insertion... Node Refinement and Pruning... Layer Pruning... SCAN-T-V0

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: PC²-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge... pre-comparison phase... extract relevant evaluation criteria... assign weights

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: fine-grained analysis... substantial performance variations, even within sub-capabilities belonging to the same category

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

[1]

Youquan Li, Miao Zheng, Fan Yang, Guosheng Dong, Bin Cui, Weipeng Chen, Zenan Zhou, and Wentao Zhang

Cleva: Chinese language models evaluation platform.arXiv preprint arXiv:2308.04813. Youquan Li, Miao Zheng, Fan Yang, Guosheng Dong, Bin Cui, Weipeng Chen, Zenan Zhou, and Wentao Zhang. 2024g. Fb-bench: A fine-grained multi-task benchmark for evaluating llms’ responsiveness to human feedback.arXiv preprint arXiv:2410.09412. Paul Pu Liang, Yiwei Lyu, Xiang...

work page arXiv 2021
[2]

Saumya Malik, Valentina Pyatkin, Sander Land, Ja- cob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert

Halludial: A large-scale benchmark for auto- matic dialogue-level hallucination evaluation.arXiv preprint arXiv:2406.07070. Saumya Malik, Valentina Pyatkin, Sander Land, Ja- cob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert. 2025. Rewardbench 2: Ad- vancing reward model evaluation.arXiv preprint arXiv:2506.01937. Jinjie Ni, Fuzhao Xue, X...

work page arXiv 2025
[3]

Mixeval: Deriving wisdom of the crowd from llm benchmark mixtures.arXiv e-prints, pages arXiv– 2406. OpenAI. 2024a. Hello gpt-4o. https://openai.com/ index/hello-gpt-4o/. Accessed: 2025-04-30. OpenAI. 2024b. Openai o1 mini: Advancing cost- efficient reasoning. https://platform.openai. com/docs/models/o1-mini. Accessed: 2025-04- 30. OpenAI. 2025. gpt-oss-1...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Flo- rian Langer, Vyas Raina, and 1 others

Safetywashing: Do ai safety benchmarks ac- tually measure safety progress?Advances in Neural Information Processing Systems, 37:68559–68594. Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Flo- rian Langer, Vyas Raina, and 1 others. 2025. Ze- robench: An impossible visual ...

work page arXiv 2025
[5]

Livebench: A challenging, contamination-free llm benchmark.arXiv preprint arXiv:2406.19314. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jian- hong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2024a. Qwen2.5...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Pro- cessing Systems, 36:46595–46623. Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. 2025. Rmb: Com- prehensively benchmarking reward models in ll...

work page 2025
[7]

Real.” denotes real-world queries, “Cont.-free

to ensure the diversity of the responses. For judge model, we adopt Deepseek-R1 (DeepSeek- AI, 2025) due to its superior reasoning performance. Unless otherwise specified, we use the officially recommended decoding parameters. Dataset.All our experiments are based on our pairwise human preference datasets SCAN- HPD, RewardBench-v2 (Malik et al., 2025) and...

work page 2025
[8]

The objective of this comparison is to pinpoint distinguishing factors that significantly influence the quality of the responses

**Analyze Responses**: You must first compare several provided [answers] and identify their differences. The objective of this comparison is to pinpoint distinguishing factors that significantly influence the quality of the responses

work page
[14]

Description of Secondary Metric 3 | Weight 3 ... <Evaluation_Framework> [User Question] {question} [The Start of Assistant 1’s Answer] {answer_1} [The End of Assistant 1’s Answer] [The Start of Assistant 2’s Answer] {answer_2} [The End of Assistant 2’s Answer] [The Start of Assistant 3’s Answer] {answer_3} [The End of Assistant 3’s Answer] Figure 7: Promp...

work page 2025
[15]

The objective of this comparison is to find im- portant factors that significantly influence the quality of the responses

**Analyze Question**: You must first analyze the question. The objective of this comparison is to find im- portant factors that significantly influence the quality of the responses

work page
[16]

There should be 3 to 9 primary metrics

**Develop Metrics**: Establish a hierarchical set of evaluation metrics. There should be 3 to 9 primary metrics. Each primary metric should have several detailed sub-metrics to provide specific, measurable criteria for evaluating the responses

work page
[17]

The weights should be integers, and the sum of all weights should equal 100

**Assign Weights**: Allocate appropriate weights to each metric based on its relative importance in distinguishing the quality of the responses. The weights should be integers, and the sum of all weights should equal 100

work page
[18]

You do not need to include the primary metrics; only the secondary metrics are required, in the following format: <Evaluation_Framework>

**Output Format**: Present the final evaluation framework in a structured list format. You do not need to include the primary metrics; only the secondary metrics are required, in the following format: <Evaluation_Framework>

work page
[19]

Description of Secondary Metric 1 | Weight 1

work page
[20]

Description of Secondary Metric 2 | Weight 2

work page
[21]

<E>"then 7:returnT 8:else ifd=

Description of Secondary Metric 3 | Weight 3 ... <Evaluation_Framework> [User Question] {question} Figure 10: Prompt for naive criteria decomposition. 18 Order of Models ACC gpt-4o→doubao-pro-1.5-32k→Deepseek-V3 0.6668 gpt-4o→Deepseek-V3→doubao-pro-1.5-32k0.6962 Deepseek-V3→gpt-4o→doubao-pro-1.5-32k 0.6699 Deepseek-V3→doubao-pro-1.5-32k→gpt-4o 0.6920 doub...

work page 2024
[22]

- Label:"writing"- If not matched, proceed to the next domain

Writing - Condition: If the query relates to writing, purposeful professional writing, literature, storytelling, grammar, or language-related topics. - Label:"writing"- If not matched, proceed to the next domain

work page
[23]

roleplay

Roleplay - Condition: If the query involves roleplay, character interactions, storytelling, or immersive scenario-based dialogue. - Label: "roleplay" - If not matched, proceed to the next domain

work page
[24]

- Label:"coding"- If not matched, proceed to the next domain

Coding - Condition: If the query relates to programming, algorithms, or code-related topics. - Label:"coding"- If not matched, proceed to the next domain

work page
[25]

mathematics

Mathematics - Condition: If the query pertains to mathematical concepts, computations, proofs, or problem-solving. - Label: "mathematics" - If not matched, proceed to the next domain

work page
[26]

reasoning

Reasoning - Condition: If the query pertains to reasoning, logic, critical thinking, or problem-solving without direct reference to specific knowledge or programming. - Label: "reasoning"- If not matched, proceed to the next domain

work page
[27]

knowledge

Knowledge - Condition: If the query pertains to factual knowledge or general subject areas such as science, history, literature, philosophy, or current affairs. - Label: "knowledge" - If not matched, label as"other"with a custom"domain_name"and stop. #### Step 2: Domain-Specific Categorization Once the domain is determined, apply the corresponding sub-rul...

work page 2024
[28]

How do I write a compelling essay introduction?

**other** For each query provided, determine the most appropriate category and output the result in lowcase enclosed within <domain> and </domain> tags. **Example**: Query: "How do I write a compelling essay introduction?" Output: <domain>roleplay</domain> **Now, analyze the following query**: <|begin_of_query|> {query} <|end_of_query|> Figure 15: Prompt ...

work page
[32]

I need help brainstorming ideas for a fantasy story involving a hidden kingdom and magical creatures

**Output Format**: The final output must be enclosed within ‘<tags>‘ and ‘</tags>‘ tags, and the tags should be provided as a JSON object where the keys are the basis for classification and the values are lists of matching tags. ### Example: Query: "I need help brainstorming ideas for a fantasy story involving a hidden kingdom and magical creatures." Outp...

work page 2023
[36]

Write a script for a medieval fantasy story involving a knight’s adventure and a magical encounter, with a humorous twist

**Output Format**: The final output must be enclosed within ‘<tags>‘ and ‘</tags>‘ tags, and the tags should be provided as a JSON object where the keys are the basis for classification and the values are lists of matching tags. ### Example: Query: "Write a script for a medieval fantasy story involving a knight’s adventure and a magical encounter, with a ...

work page
[40]

How can I analyze data from a physics experiment on thermodynamics and present it effectively?

**Output Format**: The final output must be enclosed within ‘<tags>‘ and ‘</tags>‘ tags, and the tags should be provided as a JSON object where the keys are the basis for classification and the values are lists of matching tags. ### Example: Query: "How can I analyze data from a physics experiment on thermodynamics and present it effectively?" Output: <ta...

work page
[44]

**Output Format**: The final output must be enclosed within ‘<tags>‘ and ‘</tags>‘ tags, and the tags should be provided as a JSON object where the keys are the basis for classification and the values are lists of matching tags. ### Example: Query: "How can I optimize a Python script that processes large datasets and visualizes the results?" Output: <tags...

[48]

**Output Format**: The final output must be enclosed within ‘<tags>‘ and ‘</tags>‘ tags, and the tags should be provided as a JSON object where the keys are the basis for classification and the values are lists of matching tags. ### Example: Query: "How do I use calculus to model the growth of a population over time and graph the results?" Output: <tags>{...

[49]

**Hierarchy Rule**: If a query matches both a parent node and its child nodes, include both in the tags

[50]

**Multiple Matches**: If a query matches multiple nodes at the same level, include all matching nodes

[51]

**No Match**: If a query does not match any nodes under the second-level main categories, assign the tag "Other" to that category

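The three tagging rules above (hierarchy, multiple matches, no match) can be sketched as a small function. This is an illustration under stated assumptions: the taxonomy slice, its node names, and the function `assemble_tags` are invented for the example, and the set of matched nodes is assumed to come from some upstream matcher.

```python
from typing import Dict, List, Set

# Hypothetical slice of a taxonomy: main category -> parent node -> child nodes.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Genre": {"Fiction": ["Fantasy", "Science Fiction"], "Non-fiction": ["Essay"]},
    "Task": {"Ideation": ["Brainstorming"], "Drafting": ["Script"]},
}


def assemble_tags(matched: Set[str]) -> Dict[str, List[str]]:
    """Build the per-category tag lists from the set of matched node names."""
    result: Dict[str, List[str]] = {}
    for category, parents in TAXONOMY.items():
        tags: List[str] = []
        for parent, children in parents.items():
            # Hierarchy rule: a matched parent is kept alongside its matched children.
            if parent in matched:
                tags.append(parent)
            # Multiple matches: every matching node at the same level is kept.
            tags.extend(child for child in children if child in matched)
        # No match: fall back to "Other" for this category.
        result[category] = tags or ["Other"]
    return result
```

For example, `assemble_tags({"Fiction", "Fantasy"})` keeps both the parent and the child under `"Genre"` and assigns `["Other"]` to `"Task"`.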
[52]

**Output Format**: The final output must be enclosed within ‘<tags>‘ and ‘</tags>‘ tags, and the tags should be provided as a JSON object where the keys are the basis for classification and the values are lists of matching tags. ### Example: Query: "How can I determine the cause of an event based on multiple contributing factors?" Output: <tags>{{"Reasoni...

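The output format used by the tagging prompts wraps a JSON object in `<tags>` and `</tags>`, with classification bases as keys and lists of tags as values. A minimal parsing sketch follows, assuming the reply is a plain string; the doubled braces in the prompt text are presumably template escaping, so the reply itself is expected to contain ordinary JSON. The name `extract_tags` is hypothetical.

```python
import json
import re
from typing import Dict, List

TAGS_RE = re.compile(r"<tags>(.*?)</tags>", re.DOTALL)


def extract_tags(response: str) -> Dict[str, List[str]]:
    """Parse the JSON object inside <tags>...</tags> and check its shape."""
    match = TAGS_RE.search(response)
    if match is None:
        raise ValueError("no <tags>...</tags> block found")
    payload = json.loads(match.group(1))
    if not isinstance(payload, dict):
        raise ValueError("tags payload must be a JSON object")
    cleaned: Dict[str, List[str]] = {}
    for basis, tags in payload.items():
        # Each value must be a list of tag strings, as the prompt specifies.
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            raise ValueError(f"value for {basis!r} must be a list of strings")
        cleaned[basis] = tags
    return cleaned
```

Raising on malformed replies, rather than returning an empty result, makes it easier to count and retry failed generations.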
[56]

**Real-World Application**: Is this writing task something that would be proposed in the real world?

[57]

**Professionalism**: Does it require professional capabilities or professional knowledge?

[58]

**Originality:** Does the question encourage or require originality?

[59]

**User’s Requirements**: Does the user have clear, detailed, or unique requests that need to be considered in the response? For example, if the question meets Clarity, Completeness, and Real-World Application, return ‘[1, 2, 4]‘. ## Final Output For each question provided, return a Python dictionary in the following format: “‘python Final Labels: {{"quest...

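The quality-filter prompts ask for a dictionary of the form `Final Labels: {"question_quality": [1, 3, 4]}`, where each integer is the index of a criterion the question satisfies. A minimal sketch of reading that reply, assuming it arrives as a string; `parse_question_quality` and its `num_criteria` parameter are hypothetical, since the number of criteria differs between domains.

```python
import json
import re
from typing import List

# Captures the first {...} object that follows "Final Labels:".
LABELS_RE = re.compile(r"Final Labels:\s*(\{.*?\})", re.DOTALL)


def parse_question_quality(response: str, num_criteria: int) -> List[int]:
    """Return the sorted, de-duplicated criterion indices from a judge reply."""
    match = LABELS_RE.search(response)
    if match is None:
        raise ValueError("no 'Final Labels:' dictionary found")
    labels = json.loads(match.group(1))["question_quality"]
    for index in labels:
        # Criteria are numbered from 1 in the prompt.
        if not isinstance(index, int) or not 1 <= index <= num_criteria:
            raise ValueError(f"criterion index out of range: {index!r}")
    return sorted(set(labels))
```

A downstream filter could then keep a question only when the returned list covers a chosen minimum number of criteria; that threshold is not stated in this section.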
[62]

**Complexity**: Does it involve in-depth understanding of any role-playing content, such as the psychology, characterization, and world-building of characters?

[63]

**Real-World Application**: Is this role-playing task something that would be proposed in the real world?

[64]

**Interactivity**: Does the question encourage meaningful interactions between characters, rather than involving a single character?
[65]

**Engagement**: Does the task motivate active participation and emotional involvement from the audience or participants?

[66]

**Creativity:** Does it have creativity and novelty, or does solving it require creativity? For example, if the question meets Clarity, Completeness, and Real-World Application, return ‘[1, 2, 4]‘. ## Final Output For each question provided, return a Python dictionary in the following format: “‘python Final Labels: {{"question_quality": [1, 3, 4]}} “‘ ## ...

[69]

**Complexity**: Does the question have enough depth and challenge beyond simple fact recall?

[71]

**Depth of Knowledge**: Does the question require deep expertise in the subject instead of just memory?

[72]

**Cross-Disciplinary**: Does the question involve cross-disciplinary aspects?

[73]

**Open-Endedness.**: Does the question encourage open-ended responses rather than simple “yes” or “no” answers, promoting deeper thinking? For example, if the question meets Clarity, Completeness, and Real-World Application, return ‘[1, 2, 4]‘. ## Final Output For each question provided, return a Python dictionary in the following format: “‘python Final L...

[76]

**Complexity**: Does it involve multiple components, layers, or nuance?

[77]

**Real-World Application**: Is the question something that would be encountered in real-world development?

[78]

**Problem-Solving**: Does it require active problem-solving beyond simple and superficial script or fact recall?

[79]

**Domain-Specific Expertise**: Does the question require in-depth knowledge of at least one specific area of programming?

[80]

‘python Final Labels: {{

**Specified Requirements**: Does it specify particular requirements, such as execution time, space constraints, specific programming language, tools, packages, etc.? For example, if the question meets Clarity, Completeness, and Real-World Application, return ‘[1, 2, 4]‘. ## Final Output For each question provided, return a Python dictionary in the followi...

[83]

**Complexity**: Does it involve multiple steps, analysis, or reasoning instead of simple concept memorization and numerical calculation?

[85]

**Problem-Solving**: Does it test the ability to apply math in some scenarios?

[86]

**Rigorous Logic**: Does it involve content such as theorem derivation and formula understanding, which require rigorous logical abilities?

[87]

**Creativity:** Does it have creativity and novelty, or does solving it require creativity? For example, if the question meets Clarity, Completeness, and Real-World Application, return ‘[1, 2, 4]‘. ## Final Output For each question provided, return a Python dictionary in the following format: “‘python Final Labels: {{"question_quality": [1, 3, 4]}} “‘ ## ...

[88]

**Clarity**: Is the question clear and well-defined?

[89]

**Completeness**: Does the question provide enough information for the LLM to answer the question?

[90]

**Complexity**: Does it involve multiple steps, analysis, or reasoning instead of simple concept memorization?

[91]

**Real-World Application**: Is the question something that would be encountered in the real world?
[92]

**Problem-Solving**: Does it require devising a solution or strategy?

[93]

**Deep Thinking**: Does it require deep reasoning and consideration of multiple factors?

[94]

**Novelty:** Does the question present a unique or unusual scenario that the LLM is unlikely to have encountered before? For example, if the question meets Clarity, Completeness, and Real-World Application, return ‘[1, 2, 4]‘. ## Final Output For each question provided, return a Python dictionary in the following format: “‘python Final Labels: {{"question...

[95]

Although you should use details from real user questions, you must not mention the real user question in the new question

[96]

The new question should be complex and challenging, requiring deep understanding and analysis of the subject. The length of the question should be at least as long as the reference question but should not be overly simplistic or repetitive. The question should be singular, not a multi-task question

[97]

The new question must be **completely self-contained**, so that others can answer it without any additional information

[98]

Analyze how to create the new question with chosen real-world details and provided tags. While multiple tags are available, the newly generated question only needs to align with some of them, not all. Even if the original question already fits, generate a different version. ### Output Format: [Analysis]: You should first complete the analysis of task1 and...
[99]

Assess the quality of both questions (poor quality indicators include: oversimplification, incompleteness, or unclear phrasing)
[100]

A is better

Evaluate whether each question appears to be human-authored.

Rating Scheme:
• For quality/authenticity:
  – 1 = Question 1 superior
  – 2 = Question 2 superior
  – 3 = Both inadequate
  – 4 = Both excellent

Figure 29: Questionnaire protocol for comparative evaluation.

Domain | Sample
Writing | Translate and adapt the following mission dialogue into French, ensuring it ...
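The four-way rating scheme in the questionnaire protocol above can be tallied with a small helper. This is a sketch for illustration only: the enum member names and the function `rating_shares` are hypothetical, and the section does not say how the annotators' ratings were actually aggregated.

```python
from collections import Counter
from enum import IntEnum
from typing import Dict, Iterable


class Rating(IntEnum):
    """The four options of the rating scheme, keyed by their numeric codes."""
    QUESTION_1_SUPERIOR = 1
    QUESTION_2_SUPERIOR = 2
    BOTH_INADEQUATE = 3
    BOTH_EXCELLENT = 4


def rating_shares(ratings: Iterable[int]) -> Dict[str, float]:
    """Return the fraction of annotations that fall under each rating option."""
    counts = Counter(Rating(r) for r in ratings)  # Rating(r) rejects codes outside 1..4
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings supplied")
    return {rating.name: counts[rating] / total for rating in Rating}
```

Reporting shares for all four options, including the two tie categories, avoids forcing every comparison into a win or a loss.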
[101]

**Self-Contained Diagram:** - **Assistant B** uses an embedded SVG diagram directly in the HTML, fulfilling the user’s requirement for an "all in one document." The SVG illustrates the connection process without relying on an external image file. - **Assistant A** references an external image (‘car_battery_jumpstart.png‘), which requires the user to provi...

[102]

**Step-by-Step Clarity:** - **Assistant B** breaks down the steps into a more granular 8-step process, explicitly separating each cable connection into distinct steps (e.g., connecting each end of the red and black cables separately). This enhances clarity for users unfamiliar with the process. - **Assistant A** combines some steps (e.g., all cable connec...

[103]

**Safety and Correctness:** - Both answers include correct procedural steps (e.g., grounding the negative cable on the dead car’s metal surface instead of the terminal). - **Assistant B** adds a final step (step 8) explicitly detailing the order to remove cables, which is important for safety

[104]

**HTML Structure:** - **Assistant B’s** use of SVG ensures the diagram is part of the HTML document, whereas **Assistant A’s** reliance on an external image breaks the "all in one" requirement. **Verdict:** While both answers are correct, **Assistant B’s** self-contained implementation of the diagram and clearer step breakdown make it the superior choice....


[1] [1]

Cleva: Chinese language models evaluation platform. arXiv preprint arXiv:2308.04813.

Youquan Li, Miao Zheng, Fan Yang, Guosheng Dong, Bin Cui, Weipeng Chen, Zenan Zhou, and Wentao Zhang. 2024g. Fb-bench: A fine-grained multi-task benchmark for evaluating llms’ responsiveness to human feedback. arXiv preprint arXiv:2410.09412.

Paul Pu Liang, Yiwei Lyu, Xiang...

[2] [2]

Halludial: A large-scale benchmark for automatic dialogue-level hallucination evaluation. arXiv preprint arXiv:2406.07070.

Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A Smith, Hannaneh Hajishirzi, and Nathan Lambert. 2025. Rewardbench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937.

Jinjie Ni, Fuzhao Xue, X...

[3] [3]

Mixeval: Deriving wisdom of the crowd from llm benchmark mixtures. arXiv e-prints, pages arXiv–2406.

OpenAI. 2024a. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2025-04-30.

OpenAI. 2024b. Openai o1 mini: Advancing cost-efficient reasoning. https://platform.openai.com/docs/models/o1-mini. Accessed: 2025-04-30.

OpenAI. 2025. gpt-oss-1...

[4] [4]

Safetywashing: Do ai safety benchmarks actually measure safety progress? Advances in Neural Information Processing Systems, 37:68559–68594.

Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, and 1 others. 2025. Zerobench: An impossible visual ...

[5] [5]

Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2024a. Qwen2.5...

[6] [6]

Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623.

Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. 2025. Rmb: Comprehensively benchmarking reward models in ll...

[7] [7]

Real.” denotes real-world queries, “Cont.-free

to ensure the diversity of the responses. For the judge model, we adopt Deepseek-R1 (DeepSeek-AI, 2025) due to its superior reasoning performance. Unless otherwise specified, we use the officially recommended decoding parameters. Dataset. All our experiments are based on our pairwise human preference datasets SCAN-HPD, RewardBench-v2 (Malik et al., 2025) and...

[8] [8]

**Analyze Responses**: You must first compare several provided [answers] and identify their differences. The objective of this comparison is to pinpoint distinguishing factors that significantly influence the quality of the responses


[9] [14]

Description of Secondary Metric 3 | Weight 3 ... <Evaluation_Framework> [User Question] {question} [The Start of Assistant 1’s Answer] {answer_1} [The End of Assistant 1’s Answer] [The Start of Assistant 2’s Answer] {answer_2} [The End of Assistant 2’s Answer] [The Start of Assistant 3’s Answer] {answer_3} [The End of Assistant 3’s Answer] Figure 7: Promp...


[10] [15]

**Analyze Question**: You must first analyze the question. The objective of this comparison is to find important factors that significantly influence the quality of the responses

[11] [16]

**Develop Metrics**: Establish a hierarchical set of evaluation metrics. There should be 3 to 9 primary metrics. Each primary metric should have several detailed sub-metrics to provide specific, measurable criteria for evaluating the responses


[12] [17]

**Assign Weights**: Allocate appropriate weights to each metric based on its relative importance in distinguishing the quality of the responses. The weights should be integers, and the sum of all weights should equal 100


[13] [18]

**Output Format**: Present the final evaluation framework in a structured list format. You do not need to include the primary metrics; only the secondary metrics are required, in the following format: <Evaluation_Framework>


[14] [19]

Description of Secondary Metric 1 | Weight 1


[15] [20]

Description of Secondary Metric 2 | Weight 2


[16] [21]

<E>"then 7:returnT 8:else ifd=

Description of Secondary Metric 3 | Weight 3 ... <Evaluation_Framework> [User Question] {question}

Figure 10: Prompt for naive criteria decomposition.

Order of Models | ACC
gpt-4o→doubao-pro-1.5-32k→Deepseek-V3 | 0.6668
gpt-4o→Deepseek-V3→doubao-pro-1.5-32k | 0.6962
Deepseek-V3→gpt-4o→doubao-pro-1.5-32k | 0.6699
Deepseek-V3→doubao-pro-1.5-32k→gpt-4o | 0.6920
doub...
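The criteria prompts above request secondary metrics as `Description | Weight` lines inside an `<Evaluation_Framework>` block, with integer weights that sum to 100. A minimal sketch of parsing and checking such a reply follows. The function names are hypothetical, and because the closing delimiter appears without a slash in the extracted prompt, the sketch accepts either form as an assumption.

```python
import re
from typing import List, Tuple

# Accept "<Evaluation_Framework>" or "</Evaluation_Framework>" as the closing delimiter.
FRAMEWORK_RE = re.compile(
    r"<Evaluation_Framework>(.*?)</?Evaluation_Framework>", re.DOTALL
)


def parse_framework(response: str) -> List[Tuple[str, int]]:
    """Parse 'Description | Weight' lines into (description, weight) pairs."""
    match = FRAMEWORK_RE.search(response)
    if match is None:
        raise ValueError("no <Evaluation_Framework> block found")
    metrics: List[Tuple[str, int]] = []
    for raw_line in match.group(1).splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on the last '|' so a description may itself contain '|'.
        description, separator, weight = line.rpartition("|")
        if not separator:
            raise ValueError(f"missing '|' separator in line: {line!r}")
        # int() raises ValueError for non-integer weights such as "12.5".
        metrics.append((description.strip(), int(weight.strip())))
    return metrics


def check_weights(metrics: List[Tuple[str, int]], total: int = 100) -> None:
    """Enforce the prompt's rule that the integer weights sum to 100."""
    actual = sum(weight for _, weight in metrics)
    if actual != total:
        raise ValueError(f"weights sum to {actual}, expected {total}")
```

Validating the sum before scoring catches replies where the model dropped or duplicated a metric, which would otherwise silently rescale the final score.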

[17] [22]

Writing - Condition: If the query relates to writing, purposeful professional writing, literature, storytelling, grammar, or language-related topics. - Label: "writing" - If not matched, proceed to the next domain

[18] [23]

Roleplay - Condition: If the query involves roleplay, character interactions, storytelling, or immersive scenario-based dialogue. - Label: "roleplay" - If not matched, proceed to the next domain


[19] [24]

Coding - Condition: If the query relates to programming, algorithms, or code-related topics. - Label: "coding" - If not matched, proceed to the next domain

[20] [25]

Mathematics - Condition: If the query pertains to mathematical concepts, computations, proofs, or problem-solving. - Label: "mathematics" - If not matched, proceed to the next domain


[21] [26]

Reasoning - Condition: If the query pertains to reasoning, logic, critical thinking, or problem-solving without direct reference to specific knowledge or programming. - Label: "reasoning" - If not matched, proceed to the next domain

[22] [27]

Knowledge - Condition: If the query pertains to factual knowledge or general subject areas such as science, history, literature, philosophy, or current affairs. - Label: "knowledge" - If not matched, label as "other" with a custom "domain_name" and stop. #### Step 2: Domain-Specific Categorization Once the domain is determined, apply the corresponding sub-rul...
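The domain rules above form an ordered cascade: each domain is checked in turn and the first match wins, with "other" as the fallback. In the paper this judgment is made by an LLM following the prompt; the sketch below only illustrates the control flow. The keyword lists and the helper `keyword_matcher` are invented stand-ins, and the custom "domain_name" for the fallback is omitted.

```python
from typing import Callable, Dict

# Order in which the prompt checks domains; the first match wins.
DOMAIN_ORDER = ["writing", "roleplay", "coding", "mathematics", "reasoning", "knowledge"]


def keyword_matcher(*keywords: str) -> Callable[[str], bool]:
    """Hypothetical stand-in for the LLM's judgment: a simple keyword test."""
    def matches(query: str) -> bool:
        lowered = query.lower()
        return any(keyword in lowered for keyword in keywords)
    return matches


# Toy keyword lists, invented for this sketch only.
TOY_MATCHERS: Dict[str, Callable[[str], bool]] = {
    "writing": keyword_matcher("essay", "grammar"),
    "roleplay": keyword_matcher("roleplay", "pretend"),
    "coding": keyword_matcher("python", "algorithm"),
    "mathematics": keyword_matcher("calculus", "proof"),
    "reasoning": keyword_matcher("logic", "puzzle"),
    "knowledge": keyword_matcher("history", "science"),
}


def classify_domain(
    query: str, matchers: Dict[str, Callable[[str], bool]] = TOY_MATCHERS
) -> str:
    """Walk the domains in prompt order and return the first matching label."""
    for label in DOMAIN_ORDER:
        if matchers[label](query):
            return label
    return "other"
```

The ordering matters: a query that mentions both an essay and history is labeled "writing" because that check comes first, which mirrors the "if not matched, proceed to the next domain" wording.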
