PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models
Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3
The pith
A specialized mixture-of-experts model with experts tied to different cognitive levels outperforms general LLMs on policy application tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PolicyBench supplies the first large-scale, cross-system evaluation of LLM policy comprehension with 21K cases spanning US and Chinese governance. PolicyMoE, built by aligning separate expert modules to the three Bloom levels of memorization, understanding, and application, performs better on application-oriented policy tasks than on memorization or conceptual understanding, reaches its highest accuracy on structured reasoning tasks, and outperforms general models on application-oriented policy scenarios.
What carries the argument
PolicyMoE, a domain-specialized Mixture-of-Experts model whose expert modules are each aligned to one level of Bloom's cognitive taxonomy (memorization, understanding, application).
If this is right
- LLMs exhibit uneven policy capabilities: they handle real-world application scenarios more reliably than they handle fact recall or abstract concept use.
- Routing different cognitive demands to separate experts raises accuracy on structured policy reasoning tasks.
- Current general LLMs leave measurable gaps in policy understanding that can be narrowed by cognitive-level specialization.
- Further development of policy-focused models should prioritize application-oriented evaluation over memorization tests.
Where Pith is reading between the lines
- The same expert-routing idea could be tested on other high-stakes domains such as legal reasoning or medical guidelines to see whether cognitive-level specialization transfers.
- PolicyBench could serve as a continuing yardstick for tracking whether later models close the gap between application strength and conceptual weakness.
- If the pattern holds, policy-advisory systems might reduce errors by first routing fact questions to one expert and scenario questions to another instead of using a single model for both.
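The routing idea in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PolicyMoE's architecture: the cue-word heuristic in `classify_level`, the `LEVEL_CUES` lists, and the `demo_experts` callables are hypothetical stand-ins for whatever learned router and expert modules a real system would use.

```python
import re
from typing import Callable, Dict

# Hypothetical cue words per Bloom level; a real router would be learned.
LEVEL_CUES: Dict[str, frozenset] = {
    "application": frozenset({"scenario", "should", "applicant", "eligible"}),
    "understanding": frozenset({"why", "explain", "compare", "purpose"}),
}


def classify_level(question: str) -> str:
    """Guess the Bloom level of a question from whole-word cues (assumption)."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for level, cues in LEVEL_CUES.items():
        if words & cues:
            return level
    return "memorization"  # no cue matched: treat the question as fact recall


def route(question: str, experts: Dict[str, Callable[[str], str]]) -> str:
    """Send the question to the expert registered for its guessed level."""
    return experts[classify_level(question)](question)


# Placeholder experts that only tag the question they receive.
demo_experts: Dict[str, Callable[[str], str]] = {
    "memorization": lambda q: "[recall expert] " + q,
    "understanding": lambda q: "[concept expert] " + q,
    "application": lambda q: "[scenario expert] " + q,
}
```

Calling `route(question, demo_experts)` dispatches a scenario question and a fact question to different callables, which is the separation the bullet proposes testing.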
Load-bearing premise
The 21K benchmark cases and the way they were chosen truly capture the range and difficulty of actual public-policy work without systematic gaps or selection effects.
What would settle it
A general-purpose LLM that receives no level-specific expert routing yet scores as high as or higher than PolicyMoE on the full PolicyBench set would show the claimed gains come from something other than the cognitive alignment.
Original abstract
Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PolicyBench, the first large-scale cross-system (US-China) benchmark for LLM policy comprehension comprising 21K cases spanning multiple policy areas and structured by Bloom's taxonomy into memorization, understanding, and application capabilities. It proposes PolicyMoE, a Mixture-of-Experts architecture with expert modules aligned to these cognitive levels, and claims that the resulting models exhibit stronger performance on application-oriented and structured reasoning tasks than on memorization or conceptual understanding, while revealing limitations in current LLMs.
Significance. If the benchmark construction proves unbiased and the reported performance advantages hold under rigorous validation, the work would be significant for the field by providing a much-needed evaluation framework for policy-aware LLMs and demonstrating a cognitively motivated specialization approach. The scale of 21K cases and cross-system focus represent a clear strength in addressing an underexplored real-world application domain.
major comments (3)
- [§2] §2 (PolicyBench construction): The central claim that the 21K cases 'capture the diversity and complexity of real-world governance' lacks any description of the sampling frame, policy-area coverage statistics, inter-annotator agreement for cognitive-level labeling, or anti-bias controls (e.g., against LLM-augmented curation favoring certain domains or Bloom levels). This is load-bearing because all downstream performance claims rest on the benchmark's representativeness.
- [§4] §4 (Experiments and results): No quantitative metrics, baseline comparisons (e.g., against standard LLMs or non-MoE variants), error bars, statistical tests, or ablation studies on expert routing are referenced in support of the claims that models are 'stronger on application-oriented policy tasks' and 'yield the highest accuracy on structured reasoning tasks.' Without these, the superiority assertions cannot be evaluated.
- [§3] §3 (PolicyMoE architecture): Aligning expert modules directly to Bloom's taxonomy risks circularity or overfitting to label-specific patterns rather than transferable policy reasoning; no independent validation of the taxonomy application or comparison to alternative routing mechanisms is provided, undermining the claim that this design produces genuine capability gains.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy figures or relative improvements) to substantiate the performance claims.
- [§2] Notation for the three capabilities (Memorization, Understanding, Application) should be consistently defined with explicit examples from the benchmark in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the manuscript. We address each major comment below and commit to revisions that enhance transparency, rigor, and justification without altering the core contributions.
-
Referee: [§2] §2 (PolicyBench construction): The central claim that the 21K cases 'capture the diversity and complexity of real-world governance' lacks any description of the sampling frame, policy-area coverage statistics, inter-annotator agreement for cognitive-level labeling, or anti-bias controls (e.g., against LLM-augmented curation favoring certain domains or Bloom levels). This is load-bearing because all downstream performance claims rest on the benchmark's representativeness.
Authors: We agree that explicit details on benchmark construction are required to substantiate representativeness. In the revised §2, we will add a full description of the sampling frame, quantitative policy-area coverage statistics, inter-annotator agreement scores for Bloom's taxonomy labeling, and anti-bias protocols (including safeguards against LLM-assisted curation biases). These additions will directly support the claim of capturing real-world governance diversity. revision: yes
-
Referee: [§4] §4 (Experiments and results): No quantitative metrics, baseline comparisons (e.g., against standard LLMs or non-MoE variants), error bars, statistical tests, or ablation studies on expert routing are referenced in support of the claims that models are 'stronger on application-oriented policy tasks' and 'yield the highest accuracy on structured reasoning tasks.' Without these, the superiority assertions cannot be evaluated.
Authors: We acknowledge the absence of these quantitative elements in the current experimental reporting. The revised §4 will incorporate comprehensive metrics, baseline comparisons to standard LLMs and non-MoE variants, error bars, statistical significance tests, and ablation studies on expert routing. These will provide rigorous evidence for the performance advantages on application-oriented and structured reasoning tasks. revision: yes
-
Referee: [§3] §3 (PolicyMoE architecture): Aligning expert modules directly to Bloom's taxonomy risks circularity or overfitting to label-specific patterns rather than transferable policy reasoning; no independent validation of the taxonomy application or comparison to alternative routing mechanisms is provided, undermining the claim that this design produces genuine capability gains.
Authors: We recognize the potential for circularity in the taxonomy-aligned routing. The revised §3 will include independent validation of the Bloom's taxonomy application (e.g., via external expert review or correlation with other reasoning measures) and direct comparisons to alternative routing mechanisms. These will demonstrate that the design yields genuine, transferable gains rather than label-specific overfitting. revision: yes
Circularity Check
No circularity: empirical benchmark and model evaluation with no derivations or self-referential predictions
The paper presents PolicyBench (21K cases) and PolicyMoE (Bloom-aligned MoE experts) as new artifacts, with performance claims resting on direct empirical measurements of accuracy across memorization, understanding, and application tasks. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The reported ordering (application > memorization) is a measured outcome on the constructed benchmark rather than a reduction to inputs by construction. The central claims are therefore self-contained as an independent evaluation exercise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Bloom's taxonomy provides a valid and useful framework for categorizing cognitive capabilities relevant to policy comprehension.
Reference graph
Works this paper leans on
-
[8]
Duplicate Removal: The first step was to eliminate redundant files. We identified and removed duplicates by checking for high similarity in document titles and textual content. Initially, documents with identical or near-identical titles were flagged, after which their content overlap was assessed. Documents with a high degree of textual similarity were...
-
[9]
Substantive Content Filtering: Next, we filtered out documents that were not substantive policy texts. A document was classified as "non-substantive" and excluded if it met any of the following criteria:
• It was purely administrative or procedural (e.g., public meeting announcements, personnel appointment notices, holiday schedules).
• It was a table ...
-
[10]
Temporal and Relevancy Filtering: Finally, we applied a filter to remove documents that were considered "outdated" or irrelevant to the contemporary policy landscape. A policy was flagged and removed if:
• It was explicitly superseded by a more recent version or subsequent legislation from the same issuing authority.
• It was promulgated before the year 20...
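The three filtering steps excerpted above (duplicate removal, substantive-content filtering, temporal filtering) can be sketched as a small pipeline. This is a hedged illustration, not the authors' code: the `PolicyDoc` fields, the `SequenceMatcher`-based similarity measure, and both 0.9 thresholds are assumptions, and because the cutoff year is truncated in the excerpt it is left as a caller-supplied `cutoff_year` parameter.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class PolicyDoc:
    # Hypothetical record layout; the excerpt does not specify one.
    title: str
    text: str
    year: int
    superseded: bool = False      # replaced by a newer version
    administrative: bool = False  # e.g. meeting notice, holiday schedule


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def deduplicate(docs: List[PolicyDoc],
                title_thr: float = 0.9,
                text_thr: float = 0.9) -> List[PolicyDoc]:
    """Flag near-identical titles first, then confirm via content overlap."""
    kept: List[PolicyDoc] = []
    for doc in docs:
        is_dup = any(
            similarity(doc.title, k.title) >= title_thr
            and similarity(doc.text, k.text) >= text_thr
            for k in kept
        )
        if not is_dup:
            kept.append(doc)
    return kept


def filter_corpus(docs: List[PolicyDoc], cutoff_year: int) -> List[PolicyDoc]:
    """Apply the three steps in order; cutoff_year is supplied by the caller."""
    docs = deduplicate(docs)
    docs = [d for d in docs if not d.administrative]
    return [d for d in docs if not d.superseded and d.year >= cutoff_year]
```

Checking titles before full text keeps the expensive content comparison to the pairs that were already flagged, which mirrors the order described in the excerpt.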
-
[11]
and TriviaQA (Joshi et al., 2017), test models on a wide range of subjects, from STEM fields to humanities, shedding light on the recalling and reasoning ability of LLMs to handle complex queries in real-world scenarios. Domain-specific benchmarks now probe specialized knowledge: BioASQ (Tsatsaronis et al., 2015) tests biomedical QA, MedQA (Jin et al....
-
[12]
Data Validity: Experts assessed the correctness of the Gold Answers and the appropriateness of the content. Agreement for Bloom-level labeling was calculated using Krippendorff's α, which is robust for multiple raters:
$$\alpha = 1 - \frac{D_o}{D_e} \tag{2}$$
where $D_o$ is the observed disagreement and $D_e$ is the expected disagreement by chance. The results in Table 10 confirm the d...
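As a worked illustration of Equation (2), the sketch below computes Krippendorff's α for nominal labels such as Bloom levels, using the standard coincidence-matrix formulation. The function name and the input layout (one list of ratings per question, with `None` for a missing rating) are assumptions made for this example; the excerpt does not say how the authors computed the statistic.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, List, Optional, Sequence


def krippendorff_alpha_nominal(
    units: Sequence[Sequence[Optional[Hashable]]],
) -> float:
    """alpha = 1 - D_o / D_e for nominal data.

    units[i] holds the labels all raters gave to item i (None = missing).
    """
    coincidences: Counter = Counter()
    for ratings in units:
        present: List[Hashable] = [r for r in ratings if r is not None]
        m = len(present)
        if m < 2:
            continue  # an item rated once carries no agreement information
        # Every ordered pair of ratings from different raters counts 1/(m-1).
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: share of coincidences with mismatched labels.
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    # Expected disagreement if labels were paired by chance.
    d_e = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - d_o / d_e
```

With two raters labeling four items as `[("a","a"), ("a","b"), ("b","b"), ("b","b")]`, the observed disagreement is 0.25, the expected disagreement is 30/56, and α works out to 8/15.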
-
[13]
Expert Performance Baseline: To establish a human ceiling, the experts answered the questions under an open-book setting, simulating realistic policy analysis workflows. As shown in Table 11, experts significantly outperform models in deeper understanding tasks (L2/L3), validating the benchmark's difficulty gradient.
F Formal Definition of the Policy Task...
-
[14]
The Examiner Pool: We utilized three distinct top-tier models to generate questions and distractors, ensuring coverage across different model families:
• GPT Family: GPT-4o (OpenAI)
• Claude Family: Claude-4-Sonnet (Anthropic)
• Qwen Family: Qwen-3 (Alibaba Cloud)
-
[15]
The Examinee Pool: We evaluated 7 models, including the three generator families and external open-weights models, to observe cross-family behaviors:
• Closed-Source: GPT-4o, GPT-4o-mini, Claude-4-Sonnet, Claude-4-Haiku.
• Open-Weights: Qwen-3, Qwen-2.5, and Llama-4
-
[16]
Evaluation Conditions: We tested these models across three distinct settings:
• Baseline (Consensus): Questions generated by the full pool. This is the standard PolicyBench setting.
• Single-Examiner: Questions generated exclusively by one examiner (e.g., GPT-Only).
• LOEO (Leave-One-Examiner-Out): Questions generated by the remaining two examiners (e.g., Wo-...
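The three evaluation conditions can be pictured as subsets of one question pool. The sketch below assumes, hypothetically, that every question record carries a single `examiner` tag naming the model that generated it; the excerpt does not say how provenance is stored, and the condition names used as dictionary keys are invented for this example.

```python
from typing import Dict, List


def build_conditions(questions: List[dict],
                     examiners: List[str]) -> Dict[str, List[dict]]:
    """Split one question pool into baseline, single-examiner and LOEO sets."""
    conditions: Dict[str, List[dict]] = {"baseline": list(questions)}
    for examiner in examiners:
        # Single-Examiner: only questions written by this examiner.
        conditions[f"{examiner}-only"] = [
            q for q in questions if q["examiner"] == examiner
        ]
        # LOEO: questions written by every examiner except this one.
        conditions[f"without-{examiner}"] = [
            q for q in questions if q["examiner"] != examiner
        ]
    return conditions
```

Scoring each examinee on every subset returned here would give one leaderboard per condition, which is what the stability analysis then compares.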
-
[17]
Self-Scoring Bias (Familiarity). Models consistently perform differently on questions they generated themselves, creating a "Familiarity Bonus" or "Penalty". As shown in Table 13, relying on a single generator creates severe distortions:
Insight: The data reveals divergent biases. GPT-4o benefits from "Familiarity Bias" (+7%), likely exploiting its own...
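One plausible way to quantify the familiarity effect is the gap between a model's accuracy on questions from its own family and its accuracy on everyone else's questions. The sketch below assumes that definition and a hypothetical input of `(examiner_family, is_correct)` pairs; the excerpt does not state how the reported +7% was computed.

```python
from typing import List, Tuple


def familiarity_delta(results: List[Tuple[str, bool]], model_family: str) -> float:
    """Accuracy on own-family questions minus accuracy on other families'.

    A positive value is a familiarity bonus, a negative value a penalty.
    """
    own = [ok for family, ok in results if family == model_family]
    other = [ok for family, ok in results if family != model_family]
    if not own or not other:
        raise ValueError("need questions from both own and other families")
    return sum(own) / len(own) - sum(other) / len(other)
```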
-
[18]
Mitigating Model-Speak: Leaderboard Stability. To determine if "Model-Speak" (stylistic tells) compromises the validity of the rankings, we analyzed the Spearman Rank Correlation (ρ) (Zar, 2005) between the Baseline leaderboard and other conditions.
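For readers who want to reproduce this kind of leaderboard comparison, here is a dependency-free sketch of Spearman's ρ: rank both score lists (ties get the average rank) and take the Pearson correlation of the ranks. The function names are chosen for this example, and the inputs are assumed to be per-model scores listed in the same model order for the two conditions.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5
```

A value near 1 means a condition orders the models the same way the Baseline does; a low value means the ranking shifts under that condition.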
-
[19]
External Model Robustness. For models outside the generator pool, like Llama-4, the Baseline provides the most stable evaluation. Llama-4's score varies from 82.0% to 89.0% across single examiners. The Baseline (82.0%) successfully anchors it to a consensus difficulty, filtering out examiner-specific noise.
Conclusion: The low correlation of the GPT- ...
- [20]
-
[21]
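China Meteorological Administration Announcement on Issuing 12 Meteorological Industry Standards, Including "Grades of Meteorological Conditions for Air Pollution Diffusion" (中国气象局关于发布《空气污染扩散气象条件等级》等12项气象行业标准的通告)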
-
[22]
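Carrying Forward the Five Principles of Peaceful Coexistence and Jointly Building a Community with a Shared Future for Mankind: Address at the Conference Marking the 70th Anniversary of the Five Principles of Peaceful Coexistence (弘扬和平共处五项原则 携手构建人类命运共同体——在和平共处五项原则发表70周年纪念大会上的讲话)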
-
[23]
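State Council Notice on Issuing the Plan for Shanghai to Systematically Advance Comprehensive Innovation Reform Trials and Accelerate the Building of a Globally Influential Science and Technology Innovation Center (国务院关于印发上海系统推进全面创新改革试验加快建设具有全球影响力科技创新中心方案的通知)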
- [24]
-
[25]
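State Council Notice on Issuing the Catalogue of Investment Projects Subject to Government Approval (2014 Edition) (国务院关于发布政府核准的投资项目目录(2014年本)的通知)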
-
[26]
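Notice of the Ministry of Housing and Urban-Rural Development, the Ministry of Agriculture and Rural Affairs, the National Development and Reform Commission, the Ministry of Ecology and Environment, the National Rural Revitalization Administration, and the All-China Federation of Supply and Marketing Cooperatives on Further Strengthening the Construction and Management of the Rural Household Waste Collection, Transport, and Disposal System (住房城乡建设部 农业农村部 发展改革委 生态环境部 乡村振兴局 供销合作总社关于进一步加强农村生活垃圾收运处置体系建设管理的通知)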
-
[27]
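Interim Measures for the Operation and Management of the National Online Approval and Supervision Platform for Investment Projects (全国投资项目在线审批监管平台运行管理暂行办法)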
-
[28]
U.S. Department of the Treasury Highlights the Benefits of Public-Private Partnerships for Main Street and Underserved Rural and Urban Communities
- [29]
-
[30]
VA Directive 5979 - Harassment Prevention Policy
-
[31]
U.S. Department of the Treasury, IRS, and Department of Energy Announce Next Steps for 2024 Program Year of Inflation Reduction Act Program for Solar and Wind Energy in Low-Income Communities
-
[32]
HHS Finalizes Rule to Strengthen Medicare, Improve Access to Affordable Prescription Drug Coverage, and Hold Private Insurance Companies Accountable to Delivering Quality Health Care for America’s Seniors and People with Disab
-
[33]
Interior Department Announces Landsat 2030 International Partnership Initiative
-
[34]
Strategy of the Month Location Efficiency and Housing Type
-
[35]
Prohibiting Imports of Uranium Products from the Russian Federation
-
[36]
U.S. Department of the Treasury, IRS Release Proposed Guidance to Continue Investment Boom in Clean Energy Production
-
[37]
February 26, 2024 - Letter from Secretary Cardona regarding support for prioritizing early school success
-
[38]
April 9, 2024 - Letter from Secretary Cardona to schools regarding the Better FAFSA Rollout
-
[39]
VA Directive 6401 - VA Standard Desktop Configurations
-
[40]
U.S. Department of the Treasury Issues Proposed Rules Supporting Expanded Tribal General Welfare for Tribal Communities
-
[41]
U.S. Department of the Treasury, Consumer Financial Protection Bureau, and Federal Trade Commission Announce Steps to Protect Residential Solar Consumers, Ensure Access to Credits
-
[42]
Breaking the Silos A New U.S. Strategy Linking Gender Equality & Climate Action
-
[43]
Biden-Harris Administration Announces CHIPS Incentives Awards with GlobalWafers to Support Domestic Production of Silicon Wafers
-
[44]
U.S. Department of the Treasury Issues Proposed Guidance to Clarify Wholly-Owned Tribally Chartered Entities Are Not Subject to Income Tax and Expand Tribal Access to Clean Energy Tax Credits
-
[45]
Biden-Harris Administration Invests in Rural Communities to Lower Energy Costs and Create Jobs in 39 States and Guam as part of Investing in America Agenda
-
[46]
June 18, 2024 - Joint Letter from Secretary Cardona and Secretary Blinken regarding World Refugee Day
-
[47]
The United States’ International Cyberspace and Digital Policy Strategy
Figure 9: Collected Policies (Part). Panels: China, U.S.
Table 17: Performance comparison and correlation analysis across benchmarks.
Model | MMLU-Pro | LegalBench | PolicyBench (Avg) | PolicyBench (L2)
DeepSeek-V3 | 81.9% | 80.1% | 59.10% | 57.68%
GPT-4o | 80.3% | 79.8% | 59.47% | 56.08%
Claude-3.7-Sonnet | 80.3% | 78.1% | 64.13% | ...
-
[48]
For calculation or factual questions where the result must be precise (e.g., math, unit conversion, logical problems), if the final answer is incorrect, the score should be 0, regardless of the explanation
-
[49]
For general questions (e.g., reasoning, explanation, analysis), the reference answer includes multiple key points.
- Compare the given answer with the reference key points.
- For each matched key point, assign partial credit proportionally.
- If the answer includes correct but unlisted points (beyond the reference answer), you may award partial credit wit...
-
[50]
Provide a score from 0 to 5. Generally:
- 5 = Completely correct and well explained
- 4 = Mostly correct, with minor issues
- 3 = Partially correct, some key points missing or wrong
- 2 = Mostly incorrect but with small redeeming aspects
- 1 = Barely relevant or correct
- 0 = Completely wrong or irrelevant
-
[51]
In your reasoning, clearly list:
- Which points in the reference answer are matched
- Any extra correct points beyond the reference
- Justify any deductions
-
[52]
Be strict but fair. Do not be lenient.
---
Question: {question}
Reference Answer: {reference_answer_with_point_marks}
User Answer: {user_answer}
---
Now output:
Score: X
Reasoning:
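Since the prompt asks the judge to answer in a fixed "Score: X / Reasoning:" shape, the reply has to be parsed before scores can be aggregated. The sketch below shows one way to do that; the function name, the regular expression, and the assumption that the score is a single integer from 0 to 5 are illustrative choices, not the authors' harness.

```python
import re
from typing import Tuple

# Matches "Score: <0-5>" followed by "Reasoning: <free text>" (assumed format).
JUDGE_OUTPUT = re.compile(r"Score:\s*([0-5])\b\s*Reasoning:\s*(.*)", re.DOTALL)


def parse_judge_output(raw: str) -> Tuple[int, str]:
    """Extract the integer score and the reasoning text from a judge reply."""
    match = JUDGE_OUTPUT.search(raw)
    if match is None:
        raise ValueError("reply does not follow 'Score: X Reasoning: ...'")
    return int(match.group(1)), match.group(2).strip()
```

Raising on a malformed reply, rather than defaulting to 0, keeps formatting failures of the judge from being counted as wrong answers by the examinee.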