pith. machine review for the scientific record.

arxiv: 2512.10821 · v2 · submitted 2025-12-11 · 💻 cs.AI · cs.CV · cs.HC · cs.LG

Recognition: 2 theorem links · Lean Theorem

Agile Deliberation: Concept Deliberation for Subjective Visual Classification

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.HC · cs.LG
keywords human-in-the-loop · concept deliberation · subjective visual classification · content moderation · iterative refinement · borderline examples · concept scoping

The pith

Agile Deliberation guides users from vague visual concepts to accurate classifiers through concept scoping and borderline-example feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Agile Deliberation as a human-in-the-loop system that operationalizes real content-moderation strategies to handle subjective and evolving visual concepts. Users begin with an initial idea and move through concept scoping, which breaks the idea into a hierarchy of sub-concepts, followed by concept iteration, which presents borderline images for reflection and label feedback. The framework is evaluated through 18 controlled 1.5-hour user sessions rather than on fixed benchmark datasets. Results show higher F1 scores than automated decomposition or manual deliberation baselines, along with reports of clearer understanding and lower effort. This matters for applications such as content moderation, where concepts are rarely fixed in advance.

Core claim

Agile Deliberation is a two-stage framework that first decomposes an initial visual concept into a structured hierarchy of sub-concepts and then surfaces semantically borderline examples for iterative user feedback, allowing an image classifier to align with the user's evolving intent even when the concept begins vague and subjective.

What carries the argument

The Agile Deliberation framework with its two explicit stages of concept scoping into a sub-concept hierarchy and concept iteration on borderline examples.
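
To make the two stages concrete, here is a minimal Python sketch of the loop the framework describes. Everything in it is illustrative: `ConceptNode` and the callable hooks (`propose`, `approve`, `train`, `pick_borderline`, `label`, `revise`) are assumptions standing in for the paper's LLM decomposition module, user interface, and classifier trainer.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ConceptNode:
    """One node in the concept-scoping hierarchy (the Stage 1 artifact)."""
    description: str                                  # natural-language definition
    in_scope: bool = True                             # positive vs. out-of-scope signal
    children: list["ConceptNode"] = field(default_factory=list)

def agile_deliberation(
    initial_concept: str,
    images: list,
    propose: Callable[[str], list[str]],   # stands in for the LLM decomposition module
    approve: Callable[[str], bool],        # user keeps or rejects each proposed sub-concept
    train: Callable,                       # fits a classifier to the current hierarchy
    pick_borderline: Callable,             # surfaces semantically borderline images
    label: Callable,                       # collects user labels and free-text feedback
    revise: Callable,                      # folds feedback back into the hierarchy
    rounds: int = 3,
):
    # Stage 1: concept scoping -- decompose the vague initial concept
    # into a user-approved hierarchy of sub-concepts.
    root = ConceptNode(initial_concept)
    root.children = [ConceptNode(p) for p in propose(initial_concept) if approve(p)]

    # Stage 2: concept iteration -- surface borderline examples, collect
    # feedback, revise the definition, and retrain until the classifier
    # tracks the user's evolving intent.
    classifier = train(root, images)
    for _ in range(rounds):
        borderline = pick_borderline(classifier, images)
        feedback = label(borderline)
        root = revise(root, feedback)
        classifier = train(root, images)
    return root, classifier
```

The point of the structure is that each round revises the written hierarchy as well as the classifier, so the definition and the model evolve together.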

If this is right

  • Visual classifiers for subjective tasks can be trained with less initial clarity from the user.
  • Users reach clearer conceptual understanding while expending lower cognitive effort.
  • The approach outperforms both fully automated decomposition and unstructured manual deliberation.
  • Borderline-example feedback becomes a repeatable mechanism for aligning models with evolving intent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be adapted to other domains with subjective labels such as medical image diagnosis or artistic style classification.
  • Combining the framework with active learning might further reduce the number of examples needed for convergence.
  • Organizations could use the scoping hierarchy as a shared artifact to improve consistency across multiple moderators.

Load-bearing premise

The deliberation patterns observed in the 18 user sessions will generalize to the strategies used by professional content moderators in ongoing production work.

What would settle it

A study with actual production moderation teams over several weeks that finds no F1 gain and no reduction in reported effort would show the framework does not deliver the claimed benefits.

Figures

Figures reproduced from arXiv: 2512.10821 by Ariel Fuxman, Chun-Ta Lu, Enming Luo, Krishnamurthy Viswanathan, Leijie Wang, Otilia Stretcu, Ranjay Krishna, Thomas Denby, Tushar Dogra, Wei Qiao.

Figure 1. Overview of the Agile Deliberation framework and architecture. Given a subjective concept and a target dataset, Agile Deliberation produces both a structured concept definition and an image classifier through a human-in-the-loop deliberation process. At the scoping stage, the decomposition module helps users break down their initial concept into a structured definition. At the iteration stage, the borderli… view at source ↗
Figure 2. Example of iterative concept refinement in Agile Deliberation (from an actual study participant). We show the first three iteration rounds, highlighting updates within two subconcepts for brevity. Only one representative image from each batch of borderline images is displayed for illustration. In the concept scoping stage, the participant first decomposed their initial concept into in-scope and out-of-scop… view at source ↗
Figure 3. F1 scores of Agile Deliberation across rounds of concept iteration compared with two automated baselines. […] alone cannot capture users' nuanced intentions. Overall, these results show that iterative human feedback in Agile Deliberation enables finer alignment between human concept understanding and classifier decisions. For participants using Manual Deliberation, a similar trend emerged for the concept paid… view at source ↗
read the original abstract

From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agile Deliberation, a human-in-the-loop framework for subjective visual classification (e.g., content moderation) that operationalizes expert strategies for concept deliberation. It decomposes the process into two stages—concept scoping (decomposing an initial vague concept into a structured hierarchy of sub-concepts) and concept iteration (surfacing borderline examples for iterative user feedback to align the classifier with evolving intent)—and evaluates it via 18 user sessions of 1.5 hours each rather than standard benchmarks. The central claim is that the framework yields 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while also improving users' conceptual understanding and reducing cognitive effort.

Significance. If the reported gains prove robust under more detailed scrutiny, the work would meaningfully advance interactive machine learning for subjective, evolving visual concepts by moving beyond assumptions of stable user intent. The choice to ground the framework in interviews with content moderation experts and to prioritize controlled user sessions over synthetic benchmarks is a strength that aligns with the problem's inherent subjectivity; successful validation could inform practical tools in high-stakes moderation pipelines.

major comments (3)
  1. [Evaluation] Evaluation section: The abstract reports concrete F1 gains from 18 user sessions, but lacks details on statistical tests, exact baseline implementations, participant selection, and how F1 was computed across evolving concepts.
  2. [User Study] User study design: The central claim of 7.5% F1 lift and reduced cognitive effort rests on the assumption that the 18 sessions of 1.5 h accurately capture production-scale deliberation variability and expert strategies; the manuscript supplies no description of how task distributions, time pressure, or participant expertise were matched to real moderation queues.
  3. [Agile Deliberation Framework] Framework operationalization: The two-stage process is described at a high level, yet the manuscript does not specify the exact mechanism by which the concept-scoping hierarchy is generated from user input or how borderline examples are selected and labeled for the iteration stage, leaving reproducibility and parameter sensitivity unclear.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'painstakingly evaluate' is informal; a more neutral description of the evaluation protocol would improve tone.
  2. [Throughout] Notation: Ensure consistent use of 'concept deliberation' versus 'Agile Deliberation' throughout to avoid minor reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We have carefully considered each point and revised the paper to improve clarity, reproducibility, and detail in the evaluation and framework sections. Our responses are provided below.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The abstract reports concrete F1 gains from 18 user sessions, but lacks details on statistical tests, exact baseline implementations, participant selection, and how F1 was computed across evolving concepts.

    Authors: We agree that additional details are warranted. In the revised manuscript, we have expanded the Evaluation section (Section 5) to include: (1) statistical significance testing using paired t-tests on the F1 scores across the 18 sessions, with reported p-values; (2) exact descriptions of the automated baselines, including the specific LLM prompts and decomposition methods used; (3) participant selection criteria, including recruitment method and expertise levels (e.g., 10 participants with prior moderation experience); and (4) F1 computation details, where for each session a post-deliberation ground-truth labeling of 100 test images was used to evaluate the final classifier. These changes ensure the results are fully transparent; a sketch of such a paired per-session test appears after this response list. revision: yes

  2. Referee: [User Study] User study design: The central claim of 7.5% F1 lift and reduced cognitive effort rests on the assumption that the 18 sessions of 1.5 h accurately capture production-scale deliberation variability and expert strategies; the manuscript supplies no description of how task distributions, time pressure, or participant expertise were matched to real moderation queues.

    Authors: This is a valid concern regarding ecological validity. Our study was intentionally designed as a controlled experiment to isolate the effects of the Agile Deliberation framework, based on strategies identified in expert interviews. We have added a detailed description of the study design in Section 4.1, including how tasks were sampled from a curated set of images representing typical moderation scenarios (e.g., ambiguous visual content from public datasets), participant expertise (recruited users with 1-5 years in related fields), and session structure to simulate deliberation without real-time pressure. We acknowledge in the limitations section that it does not fully replicate production-scale variability or time pressures, and suggest future work in live deployments. The 1.5-hour sessions allowed for in-depth measurement of cognitive effort via NASA-TLX surveys; the raw TLX computation is sketched after this list. revision: partial

  3. Referee: [Agile Deliberation Framework] Framework operationalization: The two-stage process is described at a high level, yet the manuscript does not specify the exact mechanism by which the concept-scoping hierarchy is generated from user input or how borderline examples are selected and labeled for the iteration stage, leaving reproducibility and parameter sensitivity unclear.

    Authors: We have revised Section 3 to provide precise operational details. The concept-scoping hierarchy is generated interactively: users provide an initial concept description, which is fed to an LLM (with the specific prompt template provided in the appendix) to propose sub-concepts; users then refine or approve them to build the hierarchy. For the iteration stage, borderline examples are selected via a combination of model uncertainty (using softmax entropy) and semantic similarity to the current concept hierarchy, with a fixed threshold of 0.3 for selection. Users label these as positive, negative, or 'needs refinement' and provide textual feedback. We include pseudocode, all hyperparameters, and a sensitivity analysis showing robustness to variations in the number of iterations and selection threshold. This addresses reproducibility concerns; the selection rule is sketched after this list. revision: yes
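
The paired per-session analysis described in response 1 could be run as in the following sketch, assuming one F1 score per session for each condition. The numbers are random placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

# Placeholder per-session F1 scores for 18 paired sessions; illustrative
# random numbers, not the paper's data.
rng = np.random.default_rng(0)
f1_agile = rng.uniform(0.70, 0.90, size=18)
f1_baseline = f1_agile - rng.uniform(0.02, 0.13, size=18)

# Paired t-test: each session contributes one (Agile, baseline) pair,
# matching the rebuttal's description of testing across the 18 sessions.
t_stat, p_value = stats.ttest_rel(f1_agile, f1_baseline)
print(f"mean F1 lift = {np.mean(f1_agile - f1_baseline):.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```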
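
Response 2 measures cognitive effort with NASA-TLX. The abstract does not say whether the raw or pairwise-weighted variant was used; the sketch below computes the raw (unweighted) TLX score, the mean of the six standard subscales, with invented ratings.

```python
# Raw (unweighted) NASA-TLX: the mean of six subscale ratings, each 0-100.
# Lower totals indicate less perceived workload. Ratings here are invented.
ratings = {
    "mental_demand": 55,
    "physical_demand": 10,
    "temporal_demand": 40,
    "performance": 30,     # anchored from perfect (0) to failure (100)
    "effort": 45,
    "frustration": 25,
}
raw_tlx = sum(ratings.values()) / len(ratings)
print(f"raw TLX = {raw_tlx:.1f} / 100")
```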
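
Response 3 names the ingredients of borderline selection: softmax entropy for model uncertainty, semantic similarity to the concept hierarchy, and a fixed 0.3 threshold. One plausible reading, with the exact combination rule assumed rather than taken from the paper:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def select_borderline(logits, image_embs, concept_embs, tau=0.3, k=10):
    """Pick the k most uncertain images among those semantically close to
    the current concept hierarchy. Using tau as a similarity cutoff is an
    assumption; the rebuttal only names entropy, semantic similarity, and
    a fixed 0.3 threshold."""
    probs = softmax(logits)                                   # (n, n_classes)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)    # uncertainty per image
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    sim = (a @ b.T).max(axis=1)          # cosine similarity to nearest sub-concept
    candidates = np.where(sim >= tau)[0]
    return candidates[np.argsort(-entropy[candidates])][:k]   # most uncertain first
```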

Circularity Check

0 steps flagged

No significant circularity; evaluation relies on independent user sessions rather than self-referential definitions or fitted predictions

full rationale

The paper presents an empirical human-in-the-loop system whose core claims (F1 gains over baselines) are measured via 18 separate 1.5-hour user sessions with external participants. No equations, parameter fits, or derivations appear in the abstract or described framework; the two deliberation stages are operationalized from prior interviews but then tested against independent session outcomes rather than being defined in terms of those outcomes. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The evaluation design therefore remains self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that structured decomposition plus borderline-example feedback reliably improves classifier alignment with subjective intent.

axioms (1)
  • domain assumption: Users begin with vague concepts that can be usefully decomposed into hierarchies and refined via borderline cases
    Extracted from the description of interviews with content moderation experts and the design of the two deliberation stages

pith-pipeline@v0.9.0 · 5582 in / 1133 out tokens · 29557 ms · 2026-05-16T23:08:59.773823+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
