Recognition: no theorem link
PPU-Bench: Real-World Benchmark for Personalized Partial Unlearning in Vision-Language Models
Pith reviewed 2026-05-12 01:20 UTC · model grok-4.3
The pith
A benchmark of 24,000 samples from public figures shows that complete unlearning in vision-language models erases visual identities rather than targeted facts, while boundary-aware optimization can enforce precise intra-subject distinctions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present PPU-Bench, a real-world fine-tuning-free benchmark with 24K samples under complete, selective, and personalized unlearning settings. Extensive experiments reveal that complete unlearning suppresses visual identity rather than factual knowledge, selective and personalized unlearning expose significant forget-retain trade-offs and intra-subject boundary challenges, and Boundary-Aware Optimization effectively enforces those boundaries on two representative methods.
What carries the argument
Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries to guide the unlearning process.
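Neither the pith nor the abstract gives BAO's objective. As a minimal sketch of one plausible reading, assuming a gradient-difference-style loss in which a subject's forget and retain samples are contrasted directly and separated by a margin (the function and parameter names here are hypothetical, not the authors'):

```python
import torch.nn.functional as F

def boundary_aware_loss(model, forget_batch, retain_batch,
                        margin=1.0, lam=0.5):
    """Hypothetical sketch: raise the loss on a subject's forget facts,
    hold down the loss on the same subject's retain facts, and penalize
    any case where the two are not separated by a margin (the
    'intra-subject boundary')."""
    loss_f = model(**forget_batch).loss   # should be HIGH after unlearning
    loss_r = model(**retain_batch).loss   # should stay LOW (utility)

    # Margin term: active whenever the forget loss does not exceed the
    # retain loss by at least `margin` for this subject.
    boundary = F.relu(margin - (loss_f - loss_r))

    # Ascend on forget, descend on retain, enforce the boundary.
    return -loss_f + loss_r + lam * boundary
```

Whether BAO actually pairs forget and retain items per subject, or models the boundary some other way, is exactly the implementation detail the referee report below flags as missing.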
If this is right
- Complete unlearning often removes visual identity information rather than specific factual knowledge.
- Selective and personalized unlearning settings reveal significant trade-offs between forgetting target information and retaining non-target facts.
- Intra-subject factual boundaries create persistent challenges that standard unlearning approaches do not address.
- Boundary-Aware Optimization improves enforcement of forget-retain distinctions when applied to existing unlearning methods.
Where Pith is reading between the lines
- Unlearning techniques for large models may benefit from treating boundary detection as a core design requirement rather than an afterthought.
- The same boundary-modeling approach could be tested on deletion requests involving private individuals or non-visual modalities to check generality.
- If the benchmark patterns hold, model providers might need to expose internal representations of subject-specific facts to support fine-grained user requests.
- Extending the evaluation to measure long-term stability of the enforced boundaries after continued model use would provide a practical next test.
Load-bearing premise
That the 24K samples constructed from public figures accurately represent realistic personalized deletion requests, and that forget-retain boundaries can be reliably identified and enforced without fine-tuning.
What would settle it
A new test set of deletion requests drawn from non-public figures or actual user data where Boundary-Aware Optimization fails to preserve non-target facts while removing targets would falsify the method's effectiveness.
Original abstract
Multimodal Large Language Models (MLLMs) may memorize sensitive cross-modal information during pretraining. However, existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion, which fail to capture realistic, personalized deletion requests that require fine-grained factual control. In this paper, we introduce PPU-Bench, a real-world and fine-tuning-free benchmark for personalized partial unlearning in MLLMs. PPU-Bench contains 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures under three progressively challenging settings: Complete, Selective, and Personalized unlearning. The benchmark evaluates whether methods can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. Extensive experiments show that Complete Unlearning often suppresses visual identity rather than factual knowledge, while Selective and Personalized Unlearning expose significant forget-retain trade-offs and challenges in intra-subject factual boundaries. Robustness analysis under cross-image and prompt-based attacks reveals distinct vulnerabilities across different unlearning settings. Motivated by these findings, we propose Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries. Experimental results on two representative methods demonstrate that BAO can effectively enforce intra-subject factual boundaries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PPU-Bench, a benchmark of 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures, for evaluating personalized partial unlearning in vision-language models. It defines three settings (Complete, Selective, Personalized) and reports that complete unlearning tends to suppress visual identity rather than factual knowledge, while selective/personalized settings reveal forget-retain trade-offs and intra-subject boundary challenges; robustness to attacks is analyzed, and Boundary-Aware Optimization (BAO) is proposed to enforce those boundaries, with experiments on two representative methods.
Significance. If the ground-truth forget/retain partitions prove reliable, PPU-Bench would be a valuable contribution by moving beyond synthetic knowledge injection to real-world public-figure data, exposing practically relevant trade-offs in utility, cross-modal consistency, and fine-grained factual control. The explicit modeling of intra-subject boundaries via BAO and the attack robustness analysis are constructive; the scale (24K samples) and progressive difficulty settings add utility for the community.
major comments (2)
- [Abstract / PPU-Bench construction] Abstract and benchmark description: the central claims (Complete Unlearning suppresses visual identity rather than facts; Selective/Personalized expose trade-offs; BAO enforces boundaries) rest on the assumption that the 24K samples provide verified ground-truth partitions. No independent verification is described that the base VLM has memorized the designated 'forget' items, that 'retain' items are non-overlapping, or that cross-modal consistency holds pre-unlearning; because facts come from uncontrolled public-figure data rather than injected knowledge, measured effects may be artifacts of label construction.
- [Experiments / BAO proposal] Experiments and BAO sections: no details are supplied on sample construction procedure, exact forget/retain labeling process, evaluation metrics, statistical significance testing, or the precise implementation and hyper-parameters of Boundary-Aware Optimization (BAO). These omissions are load-bearing because the reported forget-retain trade-offs and BAO improvements cannot be reproduced or assessed without them.
minor comments (2)
- [Abstract] The abstract states that BAO is motivated by the findings but does not preview the concrete mechanism (e.g., how boundaries are modeled or optimized) that would help readers anticipate the method's contribution.
- Figure captions and tables should explicitly indicate which unlearning setting (Complete/Selective/Personalized) and which attack type each result corresponds to, to improve readability of the robustness analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of verification and reproducibility that we will address through targeted revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract / PPU-Bench construction] Abstract and benchmark description: the central claims (Complete Unlearning suppresses visual identity rather than facts; Selective/Personalized expose trade-offs; BAO enforces boundaries) rest on the assumption that the 24K samples provide verified ground-truth partitions. No independent verification is described that the base VLM has memorized the designated 'forget' items, that 'retain' items are non-overlapping, or that cross-modal consistency holds pre-unlearning; because facts come from uncontrolled public-figure data rather than injected knowledge, measured effects may be artifacts of label construction.
Authors: We agree that explicit verification details are needed to support the reliability of the ground-truth partitions. The current manuscript does not include a dedicated description of independent verification steps. In the revision we will add a new subsection under PPU-Bench construction that reports: (i) pre-unlearning accuracy of the base VLM on both forget and retain queries to confirm differential memorization, (ii) the multi-source curation process used to ensure non-overlapping facts and cross-modal consistency, and (iii) quantitative checks (e.g., overlap statistics and consistency scores) performed during labeling. While we recognize that public-figure data introduces some label-construction uncertainty compared with synthetic injection, we view this as a deliberate design choice for ecological validity; the added pre-unlearning baselines will allow readers to evaluate whether observed effects are artifacts. We will also note this as a limitation and discuss how future work could further validate partitions. revision: yes
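The check promised in (i) is easy to make concrete. A minimal sketch, assuming per-query exact-match correctness of the base model is recorded before any unlearning (the record format and the 0.5 threshold are illustrative assumptions, not the authors' protocol):

```python
from collections import defaultdict

def verify_partitions(results, threshold=0.5):
    """Flag subjects whose forget/retain labels may be construction
    artifacts. `results` is an iterable of records like
    {"subject": "...", "split": "forget" or "retain", "correct": bool},
    measured on the base model BEFORE unlearning."""
    by_subject = defaultdict(lambda: {"forget": [], "retain": []})
    for r in results:
        by_subject[r["subject"]][r["split"]].append(r["correct"])

    flagged = []
    for subject, splits in by_subject.items():
        acc = {s: sum(v) / max(len(v), 1) for s, v in splits.items()}
        # A 'forget' fact the model never knew cannot meaningfully be
        # unlearned; a weak 'retain' side makes utility unmeasurable.
        if min(acc.values()) < threshold:
            flagged.append((subject, acc["forget"], acc["retain"]))
    return flagged
```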
Referee: [Experiments / BAO proposal] Experiments and BAO sections: no details are supplied on sample construction procedure, exact forget/retain labeling process, evaluation metrics, statistical significance testing, or the precise implementation and hyper-parameters of Boundary-Aware Optimization (BAO). These omissions are load-bearing because the reported forget-retain trade-offs and BAO improvements cannot be reproduced or assessed without them.
Authors: We fully concur that the manuscript currently lacks the implementation-level details required for reproducibility. In the revised version we will expand the Experiments and BAO sections with: (1) a complete description of the sample construction pipeline and the exact forget/retain labeling rules with concrete examples for each of the three settings; (2) formal definitions of all evaluation metrics together with the formulas used; (3) results of statistical significance tests (paired t-tests and p-values) for the reported trade-offs and BAO gains; and (4) the full BAO algorithm, including pseudocode, the boundary-aware loss formulation, optimizer choice, learning-rate schedule, batch size, and all other hyperparameters employed for the two representative methods. These additions will make the benchmark and the proposed optimization fully reproducible. revision: yes
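The significance testing promised in (3) has a standard form. A sketch using per-subject retain accuracy aligned across a baseline and its BAO-augmented variant; the numbers below are placeholders, and only `scipy.stats.ttest_rel` (the paired t-test) is a real API:

```python
from scipy import stats

# Per-subject retain accuracy, aligned so that index i refers to the
# same subject under both methods. Placeholder values for illustration.
baseline = [0.62, 0.71, 0.55, 0.68, 0.74, 0.59]
with_bao = [0.70, 0.73, 0.61, 0.69, 0.80, 0.66]

t_stat, p_value = stats.ttest_rel(with_bao, baseline)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```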
Circularity Check
No significant circularity: benchmark from external data, BAO motivated by but not reduced to observations
Full rationale
The paper constructs PPU-Bench from pre-existing public-figure knowledge (24K samples across Complete/Selective/Personalized settings) and evaluates unlearning methods empirically. BAO is introduced as a new optimization approach explicitly motivated by observed forget-retain trade-offs and intra-subject boundary challenges, without any equations, parameters, or claims that reduce the benchmark labels, experimental outcomes, or BAO formulation to fitted inputs or self-citations by construction. The derivation chain consists of external data sourcing, standard evaluation metrics, and a proposed method with independent implementation details; no self-definitional loops, renamed predictions, or load-bearing self-citations appear in the provided abstract or described structure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Unlearning target knowledge while preserving non-target facts and cross-modal consistency is achievable without model fine-tuning.
invented entities (1)
- Boundary-Aware Optimization (BAO): no independent evidence
Reference graph
Works this paper leans on
- [1] Atilla Akkus, Masoud Poorghaffar Aghdam, Mingjie Li, Junjie Chu, Michael Backes, Yang Zhang, and Sinem Sav. Generated data with fake privacy: hidden dangers of fine-tuning large language models on generated data. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-52-6.
- [2] Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, Jun Zhao, et al. RWKU: Benchmarking real-world knowledge unlearning for large language models. Advances in Neural Information Processing Systems, 37:98213–98263, 2024.
- [3] Alexey Dontsov, Dmitrii Korzh, Alexey Zhavoronkin, Boris Mikheev, Denis Bobkov, Aibek Alanov, Oleg Rogov, Ivan Oseledets, and Elena Tutubalina. CLEAR: Character unlearning in textual and visual modalities. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL ...
- [4] Zhe Hu, Jing Li, Zhongzhu Pu, Hou Pong Chan, and Yu Yin. Praxis-VLM: Vision-grounded decision making via text-driven reinforcement learning. arXiv preprint arXiv:2503.16965, 2025.
- [5] Jiahao Huo, Yibo Yan, Xu Zheng, Yuanhuiyi Lyu, Xin Zou, Zhihua Wei, and Xuming Hu. MMUnlearner: Reformulating multimodal machine unlearning in the era of multimodal large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7190–7206, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8...
- [6] Jaeik Kim, Woojin Kim, Woohyeon Park, and Jaeyoung Do. MMPB: It’s time for multi-modal personalization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- [7] Jiaqi Li, Qianshan Wei, Chuanyi Zhang, Guilin Qi, Miaozeng Du, Yongrui Chen, Sheng Bi, and Fan Liu. Single image unlearning: Efficient machine unlearning in multimodal large language models. Advances in Neural Information Processing Systems, 37:35414–35453, 2024.
- [8] Jiaqi Li, Chuanyi Zhang, Miaozeng Du, Hui Zhang, Yongrui Chen, Qianshan Wei, Junfeng Fang, Ruipeng Wang, Sheng Bi, and Guilin Qi. Forget the token and pixel: Rethinking gradient ascent for concept unlearning in multimodal generative models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12179–12200, Vienna, Austria, July 2025...
- [9] Kemou Li, Qizhou Wang, Yue Wang, Fengpeng Li, Jun Liu, Bo Han, and Jiantao Zhou. LLM unlearning with LLM beliefs, 2025.
- [10] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [11] Bo Liu, Qiang Liu, and Peter Stone. Continual learning and private unlearning. In Conference on Lifelong Learning Agents, pages 243–254. PMLR, 2022.
- [12] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
- [13] Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Machine unlearning in generative AI: A survey. arXiv preprint arXiv:2407.20516, 2024.
- [14] Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards safer large language models through machine unlearning. arXiv preprint arXiv:2402.10058, 2024.
- [15] Zheyuan Liu, Guangyao Dou, Mengzhao Jia, Zhaoxuan Tan, Qingkai Zeng, Yongle Yuan, and Meng Jiang. Protecting privacy in multimodal large language models with MLLMU-Bench. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Lang...
- [16] Zheyuan Liu, Guangyao Dou, Xiangchi Yuan, Chunhui Zhang, Zhaoxuan Tan, and Meng Jiang. Modality-aware neuron pruning for unlearning in multimodal large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:...
- [17] Association for Computational Linguistics, 2025.
- [18] Yingzi Ma, Jiongxiao Wang, Fei Wang, Siyuan Ma, Jiazhao Li, Jinsheng Pan, Xiujun Li, Furong Huang, Lichao Sun, Bo Li, Yejin Choi, Muhao Chen, and Chaowei Xiao. Benchmarking vision language model unlearning via fictitious facial identity dataset. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,
- [19] OpenReview.net, 2025.
- [20] Yingzi Ma, Jiongxiao Wang, Fei Wang, Siyuan Ma, Jiazhao Li, Jinsheng Pan, Xiujun Li, Furong Huang, Lichao Sun, Bo Li, Yejin Choi, Muhao Chen, and Chaowei Xiao. Benchmarking vision language model unlearning via fictitious facial identity dataset. In The Thirteenth International Conference on Learning Representations, 2025.
- [21] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121, 2024.
- [22] Alessandro Mantelero. The EU proposal for a general data protection regulation and the roots of the ‘right to be forgotten’. Computer Law & Security Review, 29(3):229–235, 2013. ISSN 2212-473X.
- [23] Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling SGD: Understanding factors influencing machine unlearning. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pages 303–319. IEEE, 2022.
- [24] Chengye Wang, Yuyuan Li, XiaoHua Feng, Chaochao Chen, Xiaolin Zheng, and Jianwei Yin. UMU-Bench: Closing the modality gap in multimodal unlearning evaluation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- [25] Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Xiaojiang Peng, Kai Wang, Yang You, Wenqi Shao, Hongxun Yao, and Kaipeng Zhang. PEBench: A fictitious dataset to benchmark machine unlearning for multimodal large language models. CoRR, abs/2503.12545, 2025.
- [26] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868, 2024.
- [27] Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, and Min Zhang. Evaluating and steering modality preferences in multimodal large language model. arXiv preprint arXiv:2505.20977, 2025.
- [28] Hao Zheng, Zirui Pang, Ling Li, Zhijie Deng, Yuhan Pu, Zhaowei Zhu, Xiaobo Xia, and Jiaheng Wei. OFFSIDE: Benchmarking unlearning misinformation in multimodal large language models. CoRR, abs/2510.22535, 2025.
- [29] Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, and Min Zhang. Benchmarking and improving large vision-language models for fundamental visual graph understanding and reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30678–30701, Vienna, Austria, July 2025. Assoc...
- [30] A person file describing a public figure.
- [31] A fixed set of candidate QA items about that person. Your task is to simulate a first-person deletion request from this person. Goal: Select which candidate items this person would most reasonably want removed from an AI model, while keeping the facts that are most essential to how they would want to be publicly represented. Important principles: - Only ch...
- [32] Generate exactly one cloze probe.
- [33] Use "___" as the blank.
- [34] The output query must be a declarative cloze statement, not an interrogative question.
- [35] Do NOT keep the original wh-question form.
- [36] The blank must replace the answer span itself, not be appended to the end.
- [37] The query must sound natural and grammatical.
- [38] The probe must test exactly one atomic factual slot.
- [39] The answer must be unique and be a word or short phrase, not a full sentence.
- [40] Do not introduce any new facts.
- [41] Avoid vague or open-ended predicates such as "associated with", "known for", "linked to", or other prompts with many valid completions.
- [42] Avoid blanks that can be filled by multiple answer types, such as both a year and a place.
- [43] Avoid list-valued answers, long enumerations, and multi-fact answers.
- [44]
- [45] Do NOT turn a statement into a yes/no-style cloze.
- [46] The blank must stand for a concrete fact such as a person, place, year, date, school, title, work, co-author, nationality, religion, or organization.
- [47] Prefer canonical and concrete factual slots.
- [48] Return valid JSON only. Output format: { "cloze_probes": [ { "id": "<original qa id>", "query": "... ___ ...", "answer": "...", "type": "cloze" } ] } example: # question: What is Stephen King’s nationality? # answer: Stephen King’s nationality is American. Output: { "cloze_probes": [ { "id": "Stephen_King_05", "query": "Stephen King’s nationality is ___",...
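The format rules above are mechanical enough to check automatically. A small validator sketch: the JSON shape and the example come from the prompt itself, while the six-word cap on answers is one illustrative reading of "a word or short phrase":

```python
import json

def validate_cloze_probes(raw: str) -> dict:
    """Check generated probes against the prompt's format rules."""
    data = json.loads(raw)  # "Return valid JSON only"
    for probe in data["cloze_probes"]:
        query, answer = probe["query"], probe["answer"]
        assert probe.get("type") == "cloze"
        assert query.count("___") == 1, "exactly one blank"
        assert "?" not in query, "declarative statement, not a question"
        assert len(answer.split()) <= 6, "word or short phrase"
    return data

# The prompt's own example passes:
validate_cloze_probes(json.dumps({
    "cloze_probes": [{
        "id": "Stephen_King_05",
        "query": "Stephen King's nationality is ___",
        "answer": "American",
        "type": "cloze",
    }]
}))
```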