pith. the verified trust layer for science.

arxiv: 2604.12995 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.CY

PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords policy comprehension · large language models · benchmark · mixture of experts · public policy · Bloom's taxonomy · US-China comparison

The pith

A specialized mixture-of-experts model with experts tied to different cognitive levels outperforms general LLMs on policy application tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to measure and improve how well large language models grasp public policy content across real governance scenarios. It builds PolicyBench, a 21,000-case dataset drawn from US and Chinese policy documents, then divides the cases into three levels taken from Bloom's taxonomy: memorization of facts, conceptual understanding, and practical application. On this benchmark the authors train PolicyMoE, a mixture-of-experts architecture that assigns a separate expert module to each cognitive level. The resulting models show their strongest results on structured application problems rather than on pure recall or abstract reasoning. If correct, this indicates that current LLMs can be made more useful for policy work by routing different kinds of thinking to dedicated sub-models instead of relying on uniform training.

Core claim

PolicyBench supplies the first large-scale, cross-system evaluation of LLM policy comprehension with 21K cases spanning US and Chinese governance. PolicyMoE, built by aligning separate expert modules to the three Bloom levels of memorization, understanding, and application, produces higher accuracy on structured reasoning tasks than on memorization or conceptual understanding and outperforms general models on application-oriented policy scenarios.

What carries the argument

PolicyMoE, a domain-specialized Mixture-of-Experts model whose expert modules are each aligned to one level of Bloom's cognitive taxonomy (memorization, understanding, application).
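
As a rough illustration of what routing over three level-aligned experts can look like, the sketch below turns raw gate scores into a routing distribution and blends per-expert outputs by those weights. It is a generic soft mixture-of-experts toy, not the authors' implementation: the softmax gate, the function names, the gate scores, and the expert output vectors are all assumptions made for this example.

```python
import math
from typing import Dict, List

# The three cognitive levels named in the paper; everything else here is hypothetical.
LEVELS = ["memorization", "understanding", "application"]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores: List[float]) -> Dict[str, float]:
    """Map raw gate scores to a routing distribution over the three experts."""
    return dict(zip(LEVELS, softmax(gate_scores)))


def mixture_output(gate_scores: List[float],
                   expert_outputs: List[List[float]]) -> List[float]:
    """Blend per-expert output vectors using the routing weights."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, expert_outputs))
            for i in range(dim)]


# Made-up gate scores that favor the application expert.
routing = route([0.1, 0.4, 2.0])
top_expert = max(routing, key=routing.get)

# Equal gate scores give each expert a weight of one third.
combined = mixture_output([0.0, 0.0, 0.0],
                          [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]])
```

A hard-routing variant would send each input only to `top_expert`; whether PolicyMoE uses soft or hard routing is not stated on this page.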

If this is right

  • LLMs exhibit uneven policy capabilities: they handle real-world application scenarios more reliably than they handle fact recall or abstract concept use.
  • Routing different cognitive demands to separate experts raises accuracy on structured policy reasoning tasks.
  • Current general LLMs leave measurable gaps in policy understanding that can be narrowed by cognitive-level specialization.
  • Further development of policy-focused models should prioritize application-oriented evaluation over memorization tests.
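
One way to make "uneven policy capabilities" concrete is to score a model separately on each cognitive level and compare the results. The sketch below does that for a list of (level, correct) records. The record format, the function name, and the toy outcomes are assumptions for illustration and do not come from PolicyBench.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_level(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of correct answers for each cognitive level."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, is_correct in results:
        total[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / total[level] for level in total}


# Synthetic placeholder outcomes, not data from the paper.
toy_results = [
    ("memorization", True), ("memorization", False),
    ("understanding", True), ("understanding", False),
    ("application", True), ("application", True),
]

per_level = accuracy_by_level(toy_results)
# Gap between the application level and the memorization level.
gap = per_level["application"] - per_level["memorization"]
```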

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same expert-routing idea could be tested on other high-stakes domains such as legal reasoning or medical guidelines to see whether cognitive-level specialization transfers.
  • PolicyBench could serve as a continuing yardstick for tracking whether later models close the gap between application strength and conceptual weakness.
  • If the pattern holds, policy-advisory systems might reduce errors by first routing fact questions to one expert and scenario questions to another instead of using a single model for both.

Load-bearing premise

The 21K benchmark cases and the way they were chosen truly capture the range and difficulty of actual public-policy work without systematic gaps or selection effects.

What would settle it

A general-purpose LLM that receives no level-specific expert routing yet scores as high as or higher than PolicyMoE on the full PolicyBench set would show the claimed gains come from something other than the cognitive alignment.
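
The criterion above can be written as a simple check over full-benchmark scores: does any model without level-specific routing match or exceed the routed model? The sketch below is a minimal version of that check. The model names and accuracies are placeholders invented for this example, not results from the paper.

```python
from typing import Dict, List


def models_matching_or_beating(scores: Dict[str, float],
                               target: str,
                               eligible: List[str]) -> List[str]:
    """List eligible models whose score is at least the target model's score."""
    threshold = scores[target]
    return [name for name in eligible if scores[name] >= threshold]


# Placeholder accuracies for illustration only; not results from the paper.
toy_scores = {"policy_moe": 0.5, "general_llm_a": 0.4, "general_llm_b": 0.6}

challengers = models_matching_or_beating(
    toy_scores, "policy_moe", ["general_llm_a", "general_llm_b"]
)
# A non-empty list would count against attributing the gains to cognitive alignment.
claim_undermined = bool(challengers)
```

A real comparison would also need uncertainty estimates, since a small difference between two accuracies may not be meaningful.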

Figures

Figures reproduced from arXiv: 2604.12995 by Han Bao, Kehan Guo, Nitesh V Chawla, Penghao Zhang, Rui Su, Xiangliang Zhang, Xiangqi Wang, Yanchi Ru, Yanfang Ye, Yue Huang, Yujun Zhou, Zhengqing Yuan.

Figure 1: Three levels of evaluating LLM in PolicyBench.
  Adjacent paper text: For US policies: As there is no centralized repository for federal policies in the US, we collected policy documents from the official websites of 12 US federal departments (Details in …
Figure 2: Selected examples from PolicyBench spanning three levels and two languages.
Figure 3: Model performance in 10 subtasks (ID and the specific task are shown in …
Figure 4: Routing distributions over three experts for …
Figure 5: Human accuracy on three levels (%).
  Adjacent paper text: in US Level 1 tasks, where accuracy rises from 23.35% to 35.43%, a relative improvement of over 50%. China also records a 13.51% gain at the same level, highlighting the benefit of injecting structured domain knowledge. Improvements on Level 2, which emphasizes policy comprehension, are more modest (2.93% for China and 1.11% for the US), suggesting that higher-level reas…
Figure 7: Screenshot of human evaluation interface.
Figure 8: Representative error cases across three cognitive levels in …
Figure 9: Collected Policies (Part)
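
The "relative improvement of over 50%" quoted in the text captured with the Figure 5 caption can be checked directly from the two accuracies given there (23.35% and 35.43% on US Level 1 tasks). The helper name below is an assumption; the two numbers are the ones quoted on this page.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


# US Level 1 accuracies quoted in the captured text (in percent).
us_level1_before = 23.35
us_level1_after = 35.43

absolute_gain = us_level1_after - us_level1_before  # percentage points
relative_gain = relative_improvement(us_level1_before, us_level1_after)

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

The absolute gain is about 12.08 percentage points and the relative gain about 51.7%, consistent with the "over 50%" wording.
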
original abstract

Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PolicyBench, the first large-scale cross-system (US-China) benchmark for LLM policy comprehension comprising 21K cases spanning multiple policy areas and structured by Bloom's taxonomy into memorization, understanding, and application capabilities. It proposes PolicyMoE, a Mixture-of-Experts architecture with expert modules aligned to these cognitive levels, and claims that the resulting models exhibit stronger performance on application-oriented and structured reasoning tasks than on memorization or conceptual understanding, while revealing limitations in current LLMs.

Significance. If the benchmark construction proves unbiased and the reported performance advantages hold under rigorous validation, the work would be significant for the field by providing a much-needed evaluation framework for policy-aware LLMs and demonstrating a cognitively motivated specialization approach. The scale of 21K cases and cross-system focus represent a clear strength in addressing an underexplored real-world application domain.

major comments (3)
  1. [§2] §2 (PolicyBench construction): The central claim that the 21K cases 'capture the diversity and complexity of real-world governance' lacks any description of the sampling frame, policy-area coverage statistics, inter-annotator agreement for cognitive-level labeling, or anti-bias controls (e.g., against LLM-augmented curation favoring certain domains or Bloom levels). This is load-bearing because all downstream performance claims rest on the benchmark's representativeness.
  2. [§4] §4 (Experiments and results): No quantitative metrics, baseline comparisons (e.g., against standard LLMs or non-MoE variants), error bars, statistical tests, or ablation studies on expert routing are referenced in support of the claims that models are 'stronger on application-oriented policy tasks' and 'yield the highest accuracy on structured reasoning tasks.' Without these, the superiority assertions cannot be evaluated.
  3. [§3] §3 (PolicyMoE architecture): Aligning expert modules directly to Bloom's taxonomy risks circularity or overfitting to label-specific patterns rather than transferable policy reasoning; no independent validation of the taxonomy application or comparison to alternative routing mechanisms is provided, undermining the claim that this design produces genuine capability gains.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy figures or relative improvements) to substantiate the performance claims.
  2. [§2] Notation for the three capabilities (Memorization, Understanding, Application) should be consistently defined with explicit examples from the benchmark in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the manuscript. We address each major comment below and commit to revisions that enhance transparency, rigor, and justification without altering the core contributions.

point-by-point responses
  1. Referee: [§2] §2 (PolicyBench construction): The central claim that the 21K cases 'capture the diversity and complexity of real-world governance' lacks any description of the sampling frame, policy-area coverage statistics, inter-annotator agreement for cognitive-level labeling, or anti-bias controls (e.g., against LLM-augmented curation favoring certain domains or Bloom levels). This is load-bearing because all downstream performance claims rest on the benchmark's representativeness.

    Authors: We agree that explicit details on benchmark construction are required to substantiate representativeness. In the revised §2, we will add a full description of the sampling frame, quantitative policy-area coverage statistics, inter-annotator agreement scores for Bloom's taxonomy labeling, and anti-bias protocols (including safeguards against LLM-assisted curation biases). These additions will directly support the claim of capturing real-world governance diversity. revision: yes

  2. Referee: [§4] §4 (Experiments and results): No quantitative metrics, baseline comparisons (e.g., against standard LLMs or non-MoE variants), error bars, statistical tests, or ablation studies on expert routing are referenced in support of the claims that models are 'stronger on application-oriented policy tasks' and 'yield the highest accuracy on structured reasoning tasks.' Without these, the superiority assertions cannot be evaluated.

    Authors: We acknowledge the absence of these quantitative elements in the current experimental reporting. The revised §4 will incorporate comprehensive metrics, baseline comparisons to standard LLMs and non-MoE variants, error bars, statistical significance tests, and ablation studies on expert routing. These will provide rigorous evidence for the performance advantages on application-oriented and structured reasoning tasks. revision: yes

  3. Referee: [§3] §3 (PolicyMoE architecture): Aligning expert modules directly to Bloom's taxonomy risks circularity or overfitting to label-specific patterns rather than transferable policy reasoning; no independent validation of the taxonomy application or comparison to alternative routing mechanisms is provided, undermining the claim that this design produces genuine capability gains.

    Authors: We recognize the potential for circularity in the taxonomy-aligned routing. The revised §3 will include independent validation of the Bloom's taxonomy application (e.g., via external expert review or correlation with other reasoning measures) and direct comparisons to alternative routing mechanisms. These will demonstrate that the design yields genuine, transferable gains rather than label-specific overfitting. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark and model evaluation with no derivations or self-referential predictions

full rationale

The paper presents PolicyBench (21K cases) and PolicyMoE (Bloom-aligned MoE experts) as new artifacts, with performance claims resting on direct empirical measurements of accuracy across memorization, understanding, and application tasks. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The reported ordering (application > memorization) is a measured outcome on the constructed benchmark rather than a reduction to inputs by construction. The central claims are therefore self-contained as an independent evaluation exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the work relies on one domain assumption and introduces no free parameters or invented physical entities.

axioms (1)
  • domain assumption: Bloom's taxonomy provides a valid and useful framework for categorizing cognitive capabilities relevant to policy comprehension.
    The benchmark explicitly structures its 21K cases around memorization, understanding, and application following Bloom's taxonomy.

pith-pipeline@v0.9.0 · 5556 in / 1322 out tokens · 59724 ms · 2026-05-10T15:55:34.388611+00:00
