pith. machine review for the scientific record.

arxiv: 2605.08800 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI

Recognition: no theorem link

PPU-Bench: Real-World Benchmark for Personalized Partial Unlearning in Vision-Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords personalized unlearning · partial unlearning · vision-language models · multimodal benchmarks · boundary-aware optimization · forget-retain trade-offs · intra-subject boundaries

The pith

A benchmark of 24,000 samples from public figures shows that complete unlearning in vision-language models erases visual identities rather than targeted facts, while boundary-aware optimization can enforce precise intra-subject distinctions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PPU-Bench to evaluate unlearning methods on realistic personalized deletion requests in multimodal large language models. The fine-tuning-free benchmark draws 24K multimodal and unimodal samples from 500 public figures across three settings of increasing difficulty: complete subject deletion, selective fact deletion, and personalized deletion that keeps some facts while removing others. Experiments demonstrate that complete unlearning tends to suppress visual identity rather than factual knowledge, while the selective and personalized settings expose clear trade-offs between forgetting and retaining information, along with difficulties in maintaining boundaries within the same subject. The authors propose Boundary-Aware Optimization (BAO) to explicitly model and enforce those intra-subject boundaries.
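
The three settings can be read as forget/retain partitions over one subject's facts. A minimal sketch of that reading (the function name, fact labels, and category scheme are illustrative assumptions, not the benchmark's actual schema):

```python
# Illustrative only: PPU-Bench's real data format is not shown in this review.
# Each setting partitions one subject's facts into (forget, retain) sets.
def partition(facts, setting, targets=()):
    facts = set(facts)
    if setting == "complete":        # delete the whole subject
        return facts, set()
    if setting == "selective":       # delete one designated fact category
        forget = {f for f in facts if f.startswith(targets[0])}
        return forget, facts - forget
    if setting == "personalized":    # user-chosen subset removed, rest kept
        forget = facts & set(targets)
        return forget, facts - forget
    raise ValueError(f"unknown setting: {setting}")

facts = ["career:debut", "career:award", "family:spouse", "bio:birthplace"]
forget, retain = partition(facts, "personalized",
                           targets=["family:spouse", "career:award"])
```

The difficulty ordering in the paper follows from how finely the boundary cuts: complete deletion draws no intra-subject boundary at all, while the personalized setting requires one inside a single subject's fact set.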

Core claim

We present PPU-Bench, a real-world fine-tuning-free benchmark with 24K samples under complete, selective, and personalized unlearning settings. Extensive experiments reveal that complete unlearning suppresses visual identity rather than factual knowledge, selective and personalized unlearning expose significant forget-retain trade-offs and intra-subject boundary challenges, and Boundary-Aware Optimization effectively enforces those boundaries on two representative methods.

What carries the argument

Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries to guide the unlearning process.
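
The review names BAO but does not reproduce its objective. One plausible shape for "explicitly modeling intra-subject forget-retain boundaries", assuming per-sample losses are available, is a hinge term on the gap between forget-side and retain-side loss; this is a sketch of the family of objective, not the paper's actual formulation:

```python
# Sketch, not the paper's BAO: raise the loss on forget items, keep it low
# on retain items, and penalize any subject whose forget/retain loss gap
# falls below a margin delta: an explicit intra-subject boundary term.
def bao_objective(loss_forget, loss_retain, delta=1.0, lam=0.5):
    boundary = max(0.0, delta - (loss_forget - loss_retain))  # hinge on the gap
    return -loss_forget + loss_retain + lam * boundary  # quantity to minimize
```

The hinge is what makes the boundary explicit: plain gradient-difference objectives only push the two losses in opposite directions, while the margin term activates precisely for subjects whose forget and retain facts have not yet separated.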

If this is right

  • Complete unlearning often removes visual identity information rather than specific factual knowledge.
  • Selective and personalized unlearning settings reveal significant trade-offs between forgetting target information and retaining non-target facts.
  • Intra-subject factual boundaries create persistent challenges that standard unlearning approaches do not address.
  • Boundary-Aware Optimization improves enforcement of forget-retain distinctions when applied to existing unlearning methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Unlearning techniques for large models may benefit from treating boundary detection as a core design requirement rather than an afterthought.
  • The same boundary-modeling approach could be tested on deletion requests involving private individuals or non-visual modalities to check generality.
  • If the benchmark patterns hold, model providers might need to expose internal representations of subject-specific facts to support fine-grained user requests.
  • Extending the evaluation to measure long-term stability of the enforced boundaries after continued model use would provide a practical next test.

Load-bearing premise

That the 24K samples constructed from public figures accurately represent realistic personalized deletion requests, and that forget-retain boundaries can be reliably identified and enforced without fine-tuning.

What would settle it

A new test set of deletion requests drawn from non-public figures or actual user data would settle it: if Boundary-Aware Optimization failed there to preserve non-target facts while removing targets, the method's claimed effectiveness would be falsified.

Figures

Figures reproduced from arXiv: 2605.08800 by Cuiyun Gao, Haiyan Wang, Jiahui Guang, Jing Li, Yanchun Zhang, Zexun Zhan, Zhaoquan Gu, Zhenlin Xu.

Figure 1: Overview of the construction pipeline for PPU-Bench.
Figure 2: Trade-offs between forget efficiency and utility preservation under complete, selective, and personalized unlearning settings.
Figure 3: Personalized unlearning performance of different methods across public figure categories.
Figure 4: Attack robustness analysis across three unlearning settings of Qwen3-VL-8B. A larger …
Figure 5: Results of memorization quantification.
Figure 6: Attack robustness analysis across three unlearning settings of Gemma-3-12b. A larger ASR …
Figure 7: Case study of unlearning methods under complete, selective, and personalized unlearning settings.
read the original abstract

Multimodal Large Language Models (MLLMs) may memorize sensitive cross-modal information during pretraining. However, existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion, which fail to capture realistic, personalized deletion requests that require fine-grained factual control. In this paper, we introduce PPU-Bench, a real-world and fine-tuning-free benchmark for personalized partial unlearning in MLLMs. PPU-Bench contains 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures under three progressively challenging settings: Complete, Selective, and Personalized unlearning. The benchmark evaluates whether methods can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. Extensive experiments show that Complete Unlearning often suppresses visual identity rather than factual knowledge, while Selective and Personalized Unlearning expose significant forget-retain trade-offs and challenges in intra-subject factual boundaries. Robustness analysis under cross-image and prompt-based attacks reveals distinct vulnerabilities across different unlearning settings. Motivated by these findings, we propose Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries. Experimental results on two representative methods demonstrate that BAO can effectively enforce intra-subject factual boundaries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PPU-Bench, a benchmark of 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures, for evaluating personalized partial unlearning in vision-language models. It defines three settings (Complete, Selective, Personalized) and reports that complete unlearning tends to suppress visual identity rather than factual knowledge, while selective/personalized settings reveal forget-retain trade-offs and intra-subject boundary challenges; robustness to attacks is analyzed, and Boundary-Aware Optimization (BAO) is proposed to enforce those boundaries, with experiments on two representative methods.

Significance. If the ground-truth forget/retain partitions prove reliable, PPU-Bench would be a valuable contribution by moving beyond synthetic knowledge injection to real-world public-figure data, exposing practically relevant trade-offs in utility, cross-modal consistency, and fine-grained factual control. The explicit modeling of intra-subject boundaries via BAO and the attack robustness analysis are constructive; the scale (24K samples) and progressive difficulty settings add utility for the community.

major comments (2)
  1. [Abstract / PPU-Bench construction] Abstract and benchmark description: the central claims (Complete Unlearning suppresses visual identity rather than facts; Selective/Personalized expose trade-offs; BAO enforces boundaries) rest on the assumption that the 24K samples provide verified ground-truth partitions. No independent verification is described that the base VLM has memorized the designated 'forget' items, that 'retain' items are non-overlapping, or that cross-modal consistency holds pre-unlearning; because facts come from uncontrolled public-figure data rather than injected knowledge, measured effects may be artifacts of label construction.
  2. [Experiments / BAO proposal] Experiments and BAO sections: no details are supplied on sample construction procedure, exact forget/retain labeling process, evaluation metrics, statistical significance testing, or the precise implementation and hyper-parameters of Boundary-Aware Optimization (BAO). These omissions are load-bearing because the reported forget-retain trade-offs and BAO improvements cannot be reproduced or assessed without them.
minor comments (2)
  1. [Abstract] The abstract states that BAO is motivated by the findings but does not preview the concrete mechanism (e.g., how boundaries are modeled or optimized) that would help readers anticipate the method's contribution.
  2. Figure captions and tables should explicitly indicate which unlearning setting (Complete/Selective/Personalized) and which attack type each result corresponds to, to improve readability of the robustness analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of verification and reproducibility that we will address through targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / PPU-Bench construction] Abstract and benchmark description: the central claims (Complete Unlearning suppresses visual identity rather than facts; Selective/Personalized expose trade-offs; BAO enforces boundaries) rest on the assumption that the 24K samples provide verified ground-truth partitions. No independent verification is described that the base VLM has memorized the designated 'forget' items, that 'retain' items are non-overlapping, or that cross-modal consistency holds pre-unlearning; because facts come from uncontrolled public-figure data rather than injected knowledge, measured effects may be artifacts of label construction.

    Authors: We agree that explicit verification details are needed to support the reliability of the ground-truth partitions. The current manuscript does not include a dedicated description of independent verification steps. In the revision we will add a new subsection under PPU-Bench construction that reports: (i) pre-unlearning accuracy of the base VLM on both forget and retain queries to confirm differential memorization, (ii) the multi-source curation process used to ensure non-overlapping facts and cross-modal consistency, and (iii) quantitative checks (e.g., overlap statistics and consistency scores) performed during labeling. While we recognize that public-figure data introduces some label-construction uncertainty compared with synthetic injection, we view this as a deliberate design choice for ecological validity; the added pre-unlearning baselines will allow readers to evaluate whether observed effects are artifacts. We will also note this as a limitation and discuss how future work could further validate partitions. revision: yes

  2. Referee: [Experiments / BAO proposal] Experiments and BAO sections: no details are supplied on sample construction procedure, exact forget/retain labeling process, evaluation metrics, statistical significance testing, or the precise implementation and hyper-parameters of Boundary-Aware Optimization (BAO). These omissions are load-bearing because the reported forget-retain trade-offs and BAO improvements cannot be reproduced or assessed without them.

    Authors: We fully concur that the manuscript currently lacks the implementation-level details required for reproducibility. In the revised version we will expand the Experiments and BAO sections with: (1) a complete description of the sample construction pipeline and the exact forget/retain labeling rules with concrete examples for each of the three settings; (2) formal definitions of all evaluation metrics together with the formulas used; (3) results of statistical significance tests (paired t-tests and p-values) for the reported trade-offs and BAO gains; and (4) the full BAO algorithm, including pseudocode, the boundary-aware loss formulation, optimizer choice, learning-rate schedule, batch size, and all other hyperparameters employed for the two representative methods. These additions will make the benchmark and the proposed optimization fully reproducible. revision: yes
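
On point (3) of the rebuttal, the matched per-subject score is the natural unit for a paired test. A self-contained sketch of the paired t-statistic the authors promise to report (the numbers below are invented for illustration, not the paper's results):

```python
import math

def paired_t(after, before):
    """Paired t-statistic over matched per-subject scores."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# e.g. a per-subject metric, after vs. before applying BAO (made-up values)
t = paired_t([2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 4.0, 7.0])
```

In practice one would use `scipy.stats.ttest_rel` to also obtain the p-value; the point of the sketch is only that pairing by subject, rather than pooling scores, is what matches the benchmark's per-subject design.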

Circularity Check

0 steps flagged

No significant circularity: the benchmark is built from external data, and BAO is motivated by, but not reduced to, the observations.

full rationale

The paper constructs PPU-Bench from pre-existing public-figure knowledge (24K samples across Complete/Selective/Personalized settings) and evaluates unlearning methods empirically. BAO is introduced as a new optimization approach explicitly motivated by observed forget-retain trade-offs and intra-subject boundary challenges, without any equations, parameters, or claims that reduce the benchmark labels, experimental outcomes, or BAO formulation to fitted inputs or self-citations by construction. The derivation chain consists of external data sourcing, standard evaluation metrics, and a proposed method with independent implementation details; no self-definitional loops, renamed predictions, or load-bearing self-citations appear in the provided abstract or described structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that fine-tuning-free unlearning is feasible and that intra-subject factual boundaries exist and can be modeled explicitly. No free parameters are introduced, and the single invented entity is the proposed BAO procedure itself rather than any new physical mechanism.

axioms (1)
  • domain assumption Unlearning target knowledge while preserving non-target facts and cross-modal consistency is achievable without model fine-tuning
    Benchmark is explicitly described as fine-tuning-free and evaluates preservation of utility and consistency.
invented entities (1)
  • Boundary-Aware Optimization (BAO) · no independent evidence
    purpose: Explicitly model and enforce intra-subject forget-retain boundaries during unlearning
    New technique proposed to address observed trade-offs in selective and personalized settings

pith-pipeline@v0.9.0 · 5546 in / 1258 out tokens · 50976 ms · 2026-05-12T01:20:49.021839+00:00 · methodology

discussion (0)


    Return valid JSON only. Output format: { "cloze_probes": [ { "id": "<original qa id>", "query": "... ___ ...", "answer": "...", "type": "cloze" } ] } example: # question: What is Stephen King’s nationality? # answer: Stephen King’s nationality is American. Output: { "cloze_probes": [ { "id": "Stephen_King_05", "query": "Stephen King’s nationality is ___",...