Recognition: 2 theorem links
· Lean Theorem
UNBOX: Unveiling Black-box visual models with Natural-language
Pith reviewed 2026-05-15 14:23 UTC · model grok-4.3
The pith
Black-box vision models can be interpreted by finding natural-language concepts that maximize their class probabilities using only output scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UNBOX performs class-wise model dissection under fully data-free, gradient-free constraints by using large language models to generate text descriptors and text-to-image diffusion models to create visual proxies, with output probabilities serving as the sole optimization signal; the resulting descriptors reveal the concepts each class has implicitly learned, the training distribution reflected, and potential bias sources.
What carries the argument
Semantic search that couples LLM-generated descriptors with diffusion-model visual proxies, scored directly against black-box class probabilities to perform activation maximization.
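To make the mechanism concrete, here is a minimal sketch of that search loop, assuming a fully black-box scoring interface. The callables llm_propose, render_image, and score_class are hypothetical stand-ins for the three components the paper names (an LLM, a text-to-image diffusion model, and the target API's class probability); their signatures and the rounds/keep loop structure are assumptions for illustration, not the paper's algorithm.

```python
# Minimal sketch of probability-driven semantic search (hypothetical interfaces).
from typing import Any, Callable, List, Tuple

def semantic_activation_maximization(
    class_name: str,
    llm_propose: Callable[[str, List[str]], List[str]],  # candidate descriptors, given class + best so far
    render_image: Callable[[str], Any],                  # diffusion proxy image for one descriptor
    score_class: Callable[[Any, str], float],            # black-box P(class | image), the only signal
    rounds: int = 5,
    keep: int = 10,
) -> List[Tuple[str, float]]:
    """Rank text descriptors by the black-box class probability their visual proxies induce."""
    scored: List[Tuple[str, float]] = []
    for _ in range(rounds):
        # Condition the LLM on the highest-scoring descriptors found so far.
        best = [d for d, _ in sorted(scored, key=lambda p: -p[1])[:keep]]
        for descriptor in llm_propose(class_name, best):
            image = render_image(descriptor)             # no gradients, no training data
            scored.append((descriptor, score_class(image, class_name)))
    return sorted(scored, key=lambda p: -p[1])[:keep]
```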
If this is right
- The descriptors expose the specific concepts the model has learned for each class.
- Bias sources and training-distribution artifacts become visible through the recovered concepts.
- Auditing and failure analysis become possible for models available only as black-box APIs.
- Performance is competitive with state-of-the-art white-box methods on ImageNet-1K, Waterbirds, and CelebA despite stricter constraints.
Where Pith is reading between the lines
- Regulators could apply the technique to inspect commercial vision services without requiring internal access.
- The same probability-driven semantic search could extend to auditing black-box models in other domains such as audio or tabular data.
- Interactive querying of deployed models becomes feasible by testing user-provided natural-language concepts against output scores.
Load-bearing premise
Pre-trained language and diffusion models can reliably translate black-box output probabilities into the actual visual concepts that drive the model's decisions.
What would settle it
In controlled tests, the generated text descriptors fail to produce high model scores on images that visually match them, or produce comparably high scores on mismatched images.
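A minimal sketch of such a controlled test, under assumptions: matched_images and mismatched_images are curated image sets for one descriptor's class, and score_class is the same black-box probability interface as above; none of these names come from the paper.

```python
# Sketch of the matched/mismatched fidelity test (assumed inputs, not the paper's protocol).
from statistics import mean
from typing import Any, Callable, Iterable

def fidelity_gap(
    class_name: str,
    matched_images: Iterable[Any],     # images that visually match the recovered descriptor
    mismatched_images: Iterable[Any],  # control images that do not
    score_class: Callable[[Any, str], float],
) -> float:
    """Mean black-box score on matched images minus mean score on mismatched ones."""
    hi = mean(score_class(img, class_name) for img in matched_images)
    lo = mean(score_class(img, class_name) for img in mismatched_images)
    return hi - lo  # clearly positive supports the premise; near zero or negative undercuts it
```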
Original abstract
Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UNBOX, a framework that recasts activation maximization for black-box visual classifiers as a semantic search over LLM-generated text descriptors and diffusion-synthesized images, using only scalar output probabilities. It claims to produce human-interpretable concepts for ImageNet-1K, Waterbirds, and CelebA classes, revealing learned concepts, training biases, and distribution shifts, while performing competitively with white-box interpretability methods under fully data-free, gradient-free constraints.
Significance. If the quantitative results hold, UNBOX would represent a meaningful advance in auditing proprietary vision APIs without internal access, enabling bias detection and failure analysis in real-world deployments. However, the absence of metrics, baselines, or ablation details in the abstract undermines confidence in the central claim of competitive performance.
major comments (2)
- [Abstract] The claim that UNBOX 'performs competitively with state-of-the-art white-box interpretability methods' on ImageNet-1K, Waterbirds, and CelebA is unsupported by quantitative metrics, baselines, ablation studies, or statistical comparisons. The claim is load-bearing, and its lack of support prevents assessment of whether the semantic search actually recovers causal features.
- [Abstract, methods description] The assumption that LLM-generated descriptors and diffusion images reliably map to the visual features driving the target model's decisions lacks direct validation against white-box methods on the same models and images. This risks surfacing correlated but non-causal concepts (e.g., texture or color statistics without natural-language labels), as noted in the skeptic analysis.
minor comments (1)
- [Abstract] The abstract references 'semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing' but provides no details on methodology or results; these should be expanded with concrete evaluation protocols.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important issues in how our claims are presented. We agree that the abstract should better support its assertions with quantitative details and that additional direct validation would strengthen the causal mapping argument. We address each major comment below and will incorporate revisions accordingly.
Point-by-point responses
- Referee: [Abstract] The claim that UNBOX 'performs competitively with state-of-the-art white-box interpretability methods' on ImageNet-1K, Waterbirds, and CelebA is unsupported by quantitative metrics, baselines, ablation studies, or statistical comparisons; the claim is load-bearing, and its lack of support prevents assessment of whether the semantic search actually recovers causal features.
Authors: We acknowledge that the abstract does not include specific quantitative metrics or explicit baseline comparisons, making the competitive-performance claim difficult to evaluate from the abstract alone. The full manuscript reports results from semantic fidelity tests, visual-feature correlation analyses, and slice-discovery auditing across the three datasets, which demonstrate alignment with white-box methods. To address this, we will revise the abstract to include key quantitative indicators (such as fidelity scores and correlation values) and a brief reference to the baselines used, ensuring the claim is supported within the abstract itself.
Revision: yes
- Referee: [Abstract, methods description] The assumption that LLM-generated descriptors and diffusion images reliably map to the visual features driving the target model's decisions lacks direct validation against white-box methods on the same models and images; this risks surfacing correlated but non-causal concepts (e.g., texture or color statistics without natural-language labels), as noted in the skeptic analysis.
Authors: We agree this is a substantive concern about potential non-causal correlations. The manuscript already includes visual-feature correlation analyses and slice-discovery auditing that link recovered concepts to model decisions, but we recognize that a more explicit side-by-side comparison with white-box methods on identical images would provide stronger evidence against non-causal artifacts. We will add a dedicated validation subsection and an accompanying figure that directly compares UNBOX concepts with white-box saliency outputs on the same inputs, quantifying regional overlap to confirm that the semantic descriptors capture causally relevant features.
Revision: partial
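The proposed overlap quantification could look like the following sketch. The binary concept mask, the 0.5 saliency threshold, and intersection-over-union as the metric are illustrative assumptions, not details taken from the paper or the rebuttal.

```python
# Sketch of the regional-overlap comparison the rebuttal proposes (illustrative choices).
import numpy as np

def region_iou(concept_mask: np.ndarray, saliency_map: np.ndarray, thresh: float = 0.5) -> float:
    """IoU between a binary UNBOX concept region and a thresholded white-box saliency map."""
    white_box = saliency_map >= thresh                     # binarize the saliency map
    inter = np.logical_and(concept_mask, white_box).sum()  # pixels both methods highlight
    union = np.logical_or(concept_mask, white_box).sum()   # pixels either method highlights
    return float(inter) / float(union) if union else 0.0
```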
Circularity Check
No circularity: method relies on external pre-trained models and empirical evaluation
Full rationale
The UNBOX framework recasts activation maximization as semantic search over LLM-generated descriptors and diffusion-synthesized images, driven solely by scalar output probabilities from the target black-box model. No equations, parameter fittings, or derivations appear in the provided text that reduce the recovered concepts or performance claims to quantities defined by the target model itself. Evaluations on ImageNet-1K, Waterbirds, and CelebA are presented as independent semantic fidelity and correlation tests against white-box baselines, without self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The derivation chain is therefore self-contained against external benchmarks and does not collapse by construction to its inputs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "The method produces human-interpretable text descriptors that maximally activate each class."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Park, D.H., Hendricks, L.A., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., Rohrbach, M.: Multimodal explanations: Justifying decisions and pointing to the evidence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8779–8788 (2018)
- [2] Sammani, F., Deligiannis, N.: Uni-NLX: Unifying textual explanations for vision and vision-language tasks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4634–4639 (2023)
- [3] Sammani, F., Mukherjee, T., Deligiannis, N.: NLX-GPT: A model for natural language explanations in vision and vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8322–8332 (2022)
- [4] Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explanations. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 264–279 (2018)
- [5]
- [6] Pennisi, M., Bellitto, G., Palazzo, S., Kavasidis, I., Shah, M., Spampinato, C.: Diffexplainer: Towards cross-modal global explanations with diffusion models. Computer Vision and Image Understanding 262, 104559 (2025). https://doi.org/10.1016/j.cviu.2025.104559
- [7] Carnemolla, S., Pennisi, M., Samarasinghe, S., Bellitto, G., Palazzo, S., Giordano, D., Shah, M., Spampinato, C.: Dexter: Diffusion-guided explanations with textual reasoning for vision models. Advances in Neural Information Processing Systems (2025)
- [8] Oikarinen, T., Weng, T.-W.: CLIP-dissect: Automatic description of neuron representations in deep vision networks. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=iPWiwWHc1V
- [9] Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., Zou, J.: TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496 (2024)
- [10] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
- [11] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks. In: International Conference on Learning Representations (2020)
- [12] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild (2015). https://arxiv.org/abs/1411.7766
- [13] Elton, D.C.: Self-explaining AI as an alternative to interpretable AI. In: International Conference on Artificial General Intelligence, pp. 95–106 (2020). Springer
- [14] Yang, Y., Panagopoulou, A., Zhou, S., Jin, D., Callison-Burch, C., Yatskar, M.: Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19187–19197 (2023)
- [15] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
- [16] Ahn, Y.H., Kim, H.B., Kim, S.T.: WWW: A unified framework for explaining what, where and why of neural networks by interpretation of neuron concepts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10968–10977 (2024)
- [17] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
- [18] Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: International Conference on Machine Learning, pp. 3319–3328 (2017). PMLR
- [19] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833 (2014). Springer
- [20] Srinivas, S., Fleuret, F.: Full-gradient representation for neural network visualization. Advances in Neural Information Processing Systems 32 (2019)
- [21] Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437 (2017)
- [22] Wagner, J., Kohler, J.M., Gindele, T., Hetzel, L., Wiedemer, J.T., Behnke, S.: Interpretable and fine-grained visual explanations for convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9097–9107 (2019)
- [23] Kim, Y., Mo, S., Kim, M., Lee, K., Lee, J., Shin, J.: Discovering and mitigating visual biases through keyword explanation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11082–11092 (2024)
- [24] Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
- [25] Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (2017)
- [26] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., Sayres, R.: Interpretability beyond feature attribution: Quantitative testing with concept activation vectors. arXiv preprint arXiv:1711.11279 (2017)
- [27] Crabbé, J., Schaar, M.: Concept activation regions: A generalized framework for concept-based explanations. Advances in Neural Information Processing Systems 35, 2590–2607 (2022)
- [28] Zhou, B., Sun, Y., Bau, D., Torralba, A.: Interpretable basis decomposition for visual explanation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–134 (2018)
- [29] Fel, T., Picard, A., Bethune, L., Boissin, T., Vigouroux, D., Colin, J., Cadène, R., Serre, T.: CRAFT: Concept recursive activation factorization for explainability. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2711–2721 (2023)
- [30] Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549 (2017)
- [31] Fong, R., Vedaldi, A.: Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8730–8738 (2018)
- [32] Gurkan, M.K., Arica, N., Yarman Vural, F.T.: A concept-aware explainability method for convolutional neural networks. Machine Vision and Applications 36(2), 33 (2025)
- [33] Wang, A., Lee, W.-N., Qi, X.: HINT: Hierarchical neuron concept explainer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10254–10264 (2022)
- [34] Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., Andreas, J.: Natural language descriptions of deep visual features. In: International Conference on Learning Representations (2021)
- [35] Kim, S., Oh, J., Lee, S., Yu, S., Do, J., Taghavi, T.: Grounding counterfactual explanation of image classifiers to textual concept space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10942–10950 (2023)
- [36] Asgari, S., Khani, A., Khasahmadi, A.H., Sanghi, A., Willis, K.D., Amiri, A.M.: Texplain: Post-hoc textual explanation of image classifiers with pre-trained language models. In: ICLR 2024 Workshop on Reliable and Responsible Foundation Models (2024)
- [37] Zablocki, É., Gerard, V., Cardiel, A., Gaussier, E., Cord, M., Valle, E., et al.: GIFT: A framework for global interpretable faithful textual explanations of vision classifiers. arXiv preprint arXiv:2411.15605 (2024)
- [38] Ghosh, S., Syed, R., Wang, C., Choudhary, V., Li, B., Poynton, C.B., Visweswaran, S., Batmanghelich, K.: Ladder: Language-driven slice discovery and error rectification in vision classifiers. In: Findings of the Association for Computational Linguistics: ACL 2025, pp. 22935–22970 (2025)
- [39] Nam, J., Cha, H., Ahn, S., Lee, J., Shin, J.: Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems 33, 20673–20684 (2020)
- [40] Sohoni, N., Dunnmon, J., Angus, G., Gu, A., Ré, C.: No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Advances in Neural Information Processing Systems 33, 19339–19352 (2020)
- [41] Liu, E.Z., Haghgoo, B., Chen, A.S., Raghunathan, A., Koh, P.W., Sagawa, S., Liang, P., Finn, C.: Just train twice: Improving group robustness without training group information. In: International Conference on Machine Learning, pp. 6781–6792 (2021). PMLR
- [42] Zhang, M., Sohoni, N.S., Zhang, H.R., Finn, C., Ré, C.: Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. arXiv preprint arXiv:2203.01517 (2022)
- [43]
- [44] Yu, H., Liu, J., Zou, H., Xu, R., He, Y., Zhang, X., Cui, P.: Error slice discovery via manifold compactness. arXiv preprint arXiv:2501.19032 (2025)
- [45] Labs, B.F.: FLUX. https://github.com/black-forest-labs/flux (2024)
- [46] OpenAI: gpt-oss-120b & gpt-oss-20b Model Card (2025). https://arxiv.org/abs/2508.10925
- [47] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology (2011)
discussion (0)