pith. machine review for the scientific record.

arXiv: 2603.08639 · v2 · submitted 2026-03-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

UNBOX: Unveiling Black-box visual models with Natural-language

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords black-box interpretability · natural language explanations · activation maximization · vision models · model auditing · diffusion models · large language models

The pith

Black-box vision models can be interpreted by finding natural-language concepts that maximize their class probabilities using only output scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UNBOX as a way to dissect proprietary visual recognition systems that expose only class probabilities. It recasts activation maximization as a semantic search: large language models propose candidate text descriptors while text-to-image diffusion models generate proxy visuals whose scores from the target model steer the search. The process requires no gradients, parameters, architecture details, or training data. If the approach holds, it makes auditing, bias detection, and failure analysis feasible for real-world API-deployed models where white-box methods cannot apply.

Core claim

UNBOX performs class-wise model dissection under fully data-free, gradient-free constraints by using large language models to generate text descriptors and text-to-image diffusion models to create visual proxies, with output probabilities serving as the sole optimization signal; the resulting descriptors reveal the concepts each class has implicitly learned, the training distribution reflected, and potential bias sources.

What carries the argument

Semantic search that couples LLM-generated descriptors with diffusion-model visual proxies, scored directly against black-box class probabilities to perform activation maximization.
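Read as an algorithm, this machinery is an iterative propose-render-score loop. The sketch below is a minimal illustrative reading, not the paper's implementation: `propose_descriptors`, `render_images`, and `score` are hypothetical stand-ins for the LLM, the text-to-image diffusion model, and the black-box API, and the selection scheme is assumed.

```python
def unbox_search(class_id, propose_descriptors, render_images, score,
                 rounds=3, keep=5):
    """Gradient-free semantic activation maximization, sketched.

    propose_descriptors(history) -> list[str]    # LLM candidate descriptors
    render_images(descriptor)    -> list[object] # diffusion proxy images
    score(image, class_id)       -> float        # black-box class probability
    """
    history = []  # (descriptor, mean score) pairs that steer later proposals
    for _ in range(rounds):
        for desc in propose_descriptors(history):
            images = render_images(desc)
            mean_p = sum(score(im, class_id) for im in images) / len(images)
            history.append((desc, mean_p))
        # keep only the highest-scoring descriptors to condition the next round
        history = sorted(history, key=lambda t: t[1], reverse=True)[:keep]
    return history  # best descriptors found for this class
```

Only scalar output probabilities cross the API boundary, which is what makes the search data-free and gradient-free.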

If this is right

  • The descriptors expose the specific concepts the model has learned for each class.
  • Bias sources and training-distribution artifacts become visible through the recovered concepts.
  • Auditing and failure analysis become possible for models available only as black-box APIs.
  • Performance matches state-of-the-art white-box methods on ImageNet-1K, Waterbirds, and CelebA despite stricter constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulators could apply the technique to inspect commercial vision services without requiring internal access.
  • The same probability-driven semantic search could extend to auditing black-box models in other domains such as audio or tabular data.
  • Interactive querying of deployed models becomes feasible by testing user-provided natural-language concepts against output scores.

Load-bearing premise

Pre-trained language and diffusion models can reliably translate black-box output probabilities into the actual visual concepts that drive the model's decisions.

What would settle it

A controlled test in which images that visually match the generated descriptors fail to elicit high scores from the target model, while clearly mismatched images score no lower, would undercut the claim; the opposite pattern would support it.
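That settling test can be operationalized as a simple score-gap check. The function name, `score`, and the image sets below are hypothetical stand-ins, since the abstract does not specify the fidelity protocol:

```python
def fidelity_gap(score, class_id, matched, mismatched):
    """Difference between the black-box model's mean score on images that
    visually match a descriptor and its mean score on mismatched images.
    A gap near zero would count against the descriptor; a large positive
    gap would support it."""
    p_match = [score(im, class_id) for im in matched]
    p_mis = [score(im, class_id) for im in mismatched]
    return sum(p_match) / len(p_match) - sum(p_mis) / len(p_mis)
```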

read the original abstract

Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces UNBOX, a framework that recasts activation maximization for black-box visual classifiers as a semantic search over LLM-generated text descriptors and diffusion-synthesized images, using only scalar output probabilities. It claims to produce human-interpretable concepts for ImageNet-1K, Waterbirds, and CelebA classes, revealing learned concepts, training biases, and distribution shifts, while performing competitively with white-box interpretability methods under fully data-free, gradient-free constraints.

Significance. If the quantitative results hold, UNBOX would represent a meaningful advance in auditing proprietary vision APIs without internal access, enabling bias detection and failure analysis in real-world deployments. However, the absence of metrics, baselines, or ablation details in the abstract undermines confidence in the central claim of competitive performance.

major comments (2)
  1. [Abstract] The claim that UNBOX 'performs competitively with state-of-the-art white-box interpretability methods' on ImageNet-1K, Waterbirds, and CelebA is unsupported by any quantitative metrics, baselines, ablation studies, or statistical comparisons; the claim is load-bearing for the paper's central argument, and the missing evidence prevents assessment of whether the semantic search actually recovers causal features.
  2. [Abstract, methods description] The assumption that LLM-generated descriptors and diffusion images reliably map to the visual features driving the target model's decisions lacks direct validation against white-box methods on the same models and images; this risks surfacing correlated but non-causal concepts (e.g., texture or color statistics that carry no natural-language label), as noted in the skeptic analysis.
minor comments (1)
  1. [Abstract] The abstract references 'semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing' but provides no details on methodology or results; these should be expanded with concrete evaluation protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important issues in how our claims are presented. We agree that the abstract should better support its assertions with quantitative details and that additional direct validation would strengthen the causal mapping argument. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Abstract] The claim that UNBOX 'performs competitively with state-of-the-art white-box interpretability methods' on ImageNet-1K, Waterbirds, and CelebA is unsupported by any quantitative metrics, baselines, ablation studies, or statistical comparisons; the claim is load-bearing for the paper's central argument, and the missing evidence prevents assessment of whether the semantic search actually recovers causal features.

    Authors: We acknowledge that the abstract does not include specific quantitative metrics or explicit baseline comparisons, making the competitive performance claim difficult to evaluate directly from the abstract alone. The full manuscript reports results from semantic fidelity tests, visual-feature correlation analyses, and slice-discovery auditing across the three datasets, which demonstrate alignment with white-box methods. To address this, we will revise the abstract to include key quantitative indicators (such as fidelity scores and correlation values) and a brief reference to the baselines used, ensuring the claim is supported within the abstract itself. revision: yes

  2. Referee: [Abstract, methods description] The assumption that LLM-generated descriptors and diffusion images reliably map to the visual features driving the target model's decisions lacks direct validation against white-box methods on the same models and images; this risks surfacing correlated but non-causal concepts (e.g., texture or color statistics that carry no natural-language label), as noted in the skeptic analysis.

    Authors: We agree this is a substantive concern regarding potential non-causal correlations. The manuscript already includes visual-feature correlation analyses and slice-discovery auditing to link recovered concepts to model decisions, but we recognize that a more explicit side-by-side comparison with white-box methods on identical images would provide stronger evidence against non-causal artifacts. We will add a dedicated validation subsection and accompanying figure that directly compares UNBOX concepts with white-box saliency outputs on the same inputs, quantifying regional overlap to better confirm that the semantic descriptors capture causally relevant features. revision: partial
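The promised regional-overlap quantification could be as simple as an intersection-over-union between binarized masks. This is an illustrative reading of "quantifying regional overlap", not the authors' stated protocol; the function and its inputs are assumptions:

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks (nested lists of 0/1),
    one way to quantify regional overlap between the image support of a
    recovered concept and a white-box saliency map."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0
```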

Circularity Check

0 steps flagged

No circularity: method relies on external pre-trained models and empirical evaluation

full rationale

The UNBOX framework recasts activation maximization as semantic search over LLM-generated descriptors and diffusion-synthesized images, driven solely by scalar output probabilities from the target black-box model. No equations, parameter fittings, or derivations appear in the provided text that reduce the recovered concepts or performance claims to quantities defined by the target model itself. Evaluations on ImageNet-1K, Waterbirds, and CelebA are presented as independent semantic fidelity and correlation tests against white-box baselines, without self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The derivation chain is therefore self-contained against external benchmarks and does not collapse by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes that LLM and diffusion model outputs can serve as faithful proxies for visual concepts without further justification.

pith-pipeline@v0.9.0 · 5558 in / 995 out tokens · 36521 ms · 2026-05-15T14:23:46.948493+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
