How can embedding models bind concepts?

Arnas Uselis; Darina Koishigarina; Seong Joon Oh

arxiv: 2605.31503 · v1 · pith:SCMG4RMJnew · submitted 2026-05-29 · 💻 cs.CV · cs.LG

Pith reviewed 2026-06-28 23:03 UTC · model grok-4.3

keywords: concept binding, vision-language models, CLIP, binding function, generalization, additive decomposition, multiplicative interactions, embedding models

The pith

CLIP's binding function is high-complexity, which likely keeps its image and text encoders from learning a shared binding mechanism that generalizes to new concept combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why vision-language models like CLIP recognize individual concepts yet fail to bind them correctly in scenes with multiple objects. It shows that scene embeddings decompose additively into separate object representations, which accounts for why object information remains recoverable from image or text embeddings alone. CLIP's binding function itself has high complexity, which likely prevents the two encoders from converging on a common mechanism that works for unseen combinations. Controlled transformer models trained from scratch on synthetic data instead develop low-complexity binding built from multiplicative interactions between concepts once data coverage is high enough, and this supports systematic generalization.

Core claim

Scene embeddings decompose additively into object representations. CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage through low-complexity binding functions characterized by multiplicative interactions between concepts.
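One way to write the two regimes contrasted above is sketched below. This is an illustration, not the paper's own notation: the symbols $f$, $u$, $a_k$, $W$ and the elementwise product $\odot$ are assumptions introduced here.

```latex
% Additive decomposition of a two-object scene embedding
f(s) \approx u(o_1) + u(o_2), \qquad s = (o_1, o_2)

% An illustrative low-complexity, multiplicative binding of one object's
% concept values (assumed form, not taken from the paper)
u(o_i) = W \bigl( a_1(c_{i1}) \odot a_2(c_{i2}) \bigr), \qquad o_i = (c_{i1}, c_{i2})
```

The first line constrains only how objects combine into a scene. It leaves open whether $u$ is an arbitrary lookup table over objects or a structured function of concept values such as the second line, which is the distinction the complexity analysis targets.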

What carries the argument

The binding function that maps input concepts to scene embeddings, analyzed via its additive decomposition and its computational complexity.

Load-bearing premise

The controlled transformer experiments trained from scratch on synthetic scenes are representative of the mechanisms that would appear in large-scale pre-trained models like CLIP when data coverage is increased.

What would settle it

Train a CLIP-scale model on a dataset that systematically increases coverage of concept combinations and check whether its binding function complexity drops and cross-modal binding generalization improves.

Figures

Figures reproduced from arXiv: 2605.31503 by Arnas Uselis, Darina Koishigarina, Seong Joon Oh.

**Figure 1.** Schematic of the binding setup. (a) Example concept space C (e.g., color × shape), where each object is a tuple of concept values. (b) Example scene space S, where a scene s is a tuple of objects (here, two objects (o1, o2)). (c) Example of the two recognition criteria: concept recognition ranks present concept values above absent ones (Def. 3.3), while object recognition ranks present objects above absent…
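The two recognition criteria in this caption can be sketched with a toy scorer. The code below is a minimal illustration under stated assumptions: `ranks_present_above_absent` and the bag-of-concepts scorer are hypothetical helpers written for this example, and the strict "every present item outranks every absent item" reading is an assumption about the truncated definitions.

```python
def ranks_present_above_absent(scores, present):
    """True if every present item scores strictly higher than every absent item."""
    present_scores = [v for k, v in scores.items() if k in present]
    absent_scores = [v for k, v in scores.items() if k not in present]
    if not present_scores or not absent_scores:
        return True  # nothing to compare
    return min(present_scores) > max(absent_scores)

# Scene: a red cube and a blue sphere.
scene = [("red", "cube"), ("blue", "sphere")]
colors = ["red", "blue", "green"]
shapes = ["cube", "sphere"]

# Toy bag-of-concepts scorer: a query earns one point per concept value
# that appears anywhere in the scene, regardless of which object carries it.
scene_values = {value for obj in scene for value in obj}
concept_scores = {v: float(v in scene_values) for v in colors + shapes}
object_scores = {
    (c, s): float(c in scene_values) + float(s in scene_values)
    for c in colors
    for s in shapes
}

concept_ok = ranks_present_above_absent(concept_scores, scene_values)
object_ok = ranks_present_above_absent(object_scores, set(scene))
print(concept_ok, object_ok)
```

In this toy case the scorer passes concept recognition but fails object recognition, because the absent object "red sphere" ties with the present object "red cube".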

**Figure 2.** Additive structure in two-object scene embeddings. MDS projection of CLIP embeddings for single- and two-object scenes that vary in color and shape (distances in the plot approximate embedding distances). Labels use R/B for red/blue and C/S for cube/sphere; points correspond to embeddings of the associated single concepts, single-object scenes, and their two-object combinations. Two-object scene embeddin…
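A simple way to test for additive structure numerically is a least-squares fit of one vector per object. The sketch below swaps in that technique for illustration; it is not the MDS projection shown in the figure, and the embeddings are synthetic stand-ins generated here, not CLIP outputs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_objects, dim = 6, 16
scenes = list(itertools.combinations(range(n_objects), 2))  # unordered object pairs

# Synthetic stand-in for scene embeddings: additive ground truth plus small noise.
u_true = rng.normal(size=(n_objects, dim))
E = np.stack([u_true[a] + u_true[b] for a, b in scenes])
E += 0.01 * rng.normal(size=E.shape)

# Indicator matrix: A[row, o] = 1 if object o appears in that scene.
A = np.zeros((len(scenes), n_objects))
for row, (a, b) in enumerate(scenes):
    A[row, a] = 1.0
    A[row, b] = 1.0

# Solve E ~= A @ u_hat for the per-object vectors.
u_hat, *_ = np.linalg.lstsq(A, E, rcond=None)
residual = E - A @ u_hat
r2 = 1.0 - (residual ** 2).sum() / ((E - E.mean(axis=0)) ** 2).sum()
print(f"R^2 of additive fit: {r2:.4f}")
```

On real embeddings, an R² close to 1 would indicate that an additive model over objects explains most of the variance; a low value would indicate that it does not.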

**Figure 3.** Scene embeddings support object-level editing via linear operations in embedding space. Subtracting one object embedding and adding another produces a counterfactual embedding corresponding to the edited scene. Setup. Let s = (o1, o2) be a two-object scene, and let o1′ be a counterfactual object obtained by changing one concept value of o1 (e.g., color). We construct an edited embedding by substituting t…
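The subtract-and-add edit described in this caption can be sketched on toy data. Everything below is an assumption made for illustration: the object embeddings are random vectors, the scene embedding is exactly additive, and the strength parameter `k` is one plausible way to interpolate the edit, not necessarily how the paper defines it.

```python
import random

random.seed(0)
DIM = 16
objects = [(c, s) for c in ("red", "blue") for s in ("cube", "sphere")]
# Hypothetical object embeddings; in practice these would be estimated from data.
u = {o: [random.gauss(0.0, 1.0) for _ in range(DIM)] for o in objects}

def scene_embedding(o1, o2):
    """Toy scene embedding that is exactly additive in its objects (an assumption)."""
    return [a + b for a, b in zip(u[o1], u[o2])]

def edit(e, old, new, k=1.0):
    """Move e along (u[new] - u[old]) with strength k; k=1 is a full swap."""
    return [x + k * (n - o) for x, n, o in zip(e, u[new], u[old])]

def nearest_scene(e):
    """Return the two-object scene whose toy embedding is closest to e."""
    candidates = [(a, b) for i, a in enumerate(objects) for b in objects[i:]]

    def sq_dist(s):
        return sum((x - y) ** 2 for x, y in zip(e, scene_embedding(*s)))

    return min(candidates, key=sq_dist)

original = scene_embedding(("red", "cube"), ("blue", "sphere"))
edited = edit(original, old=("red", "cube"), new=("blue", "cube"))
decoded = nearest_scene(edited)
print(decoded)
```

Because the toy embedding is exactly additive, the edited vector decodes to the counterfactual scene; with real embeddings the match would only be approximate.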

**Figure 4.** CLIP’s binding function is high-complexity. (a) We train a binding approximator g(o1, o2) (a single-layer MLP) to predict CLIP scene embeddings f(x) from concept indices describing the objects present, minimizing (6). (b) Maximum accuracy achieved across all MLP capacities and training coverages for concept vs. object recognition on scenes composed of held-out objects. Predicting concepts is high, confirmi…
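A binding approximator of this kind can be sketched as a small regression network. The code below is a minimal illustration under several assumptions: "single-layer MLP" is read as one hidden ReLU layer, the objective is taken to be mean squared error (the paper's Eq. (6) is not shown here), the targets are random stand-ins rather than CLIP embeddings, and `encode_scene` and `train_mlp` are hypothetical names.

```python
import itertools
import numpy as np

def encode_scene(scene, n_values):
    """One-hot encode every concept index of a scene and concatenate the codes."""
    idx = np.asarray(scene).ravel()              # (c11, c12, c21, c22)
    code = np.zeros((idx.size, n_values))
    code[np.arange(idx.size), idx] = 1.0
    return code.ravel()

def train_mlp(X, Y, hidden=32, lr=0.2, steps=300, seed=0):
    """Full-batch gradient descent on MSE for a one-hidden-layer ReLU MLP."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0 / np.sqrt(X.shape[1]), (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, Y.shape[1]))
    b2 = np.zeros(Y.shape[1])
    losses = []
    for _ in range(steps):
        H = np.maximum(X @ W1 + b1, 0.0)         # hidden activations
        err = H @ W2 + b2 - Y                    # prediction error
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / err.size             # d(loss)/d(prediction)
        g_hid = (g_out @ W2.T) * (H > 0)         # backprop through ReLU
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

V, DIM = 3, 8
objects = list(itertools.product(range(V), repeat=2))   # (color, shape) index pairs
scenes = list(itertools.product(objects, repeat=2))     # ordered two-object scenes
rng = np.random.default_rng(1)
table = {o: rng.normal(size=DIM) for o in objects}      # stand-in targets, not CLIP
X = np.stack([encode_scene(s, V) for s in scenes])
Y = np.stack([table[o1] + table[o2] for o1, o2 in scenes])
params, losses = train_mlp(X, Y)
print(losses[0], losses[-1])
```

Sweeping `hidden` and the fraction of objects seen in training, then scoring on scenes built from held-out objects, would mirror the capacity-versus-coverage comparison the caption describes.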

**Figure 5.** Controlled setup for studying generalizable binding. We train transformer-based embedding models on synthetic multi-object data to test whether binding can generalize to entirely unseen objects. (a) Data design: We vary the training coverage ρ_train from 0.1 to 0.9, controlling what fraction of the object space the model observes during training. (b) Scene construction: Training scenes are composed of objec…
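The coverage-controlled split can be sketched as follows. This is a minimal illustration that assumes training scenes draw only from training objects and test scenes only from held-out objects (the caption is truncated at that point); `split_objects` and `sample_scene` are hypothetical helpers.

```python
import itertools
import random

def split_objects(n_concepts, n_values, rho_train, seed=0):
    """Split the full object space into train and held-out objects by coverage."""
    objects = list(itertools.product(range(n_values), repeat=n_concepts))
    rng = random.Random(seed)
    rng.shuffle(objects)
    n_train = round(rho_train * len(objects))
    return objects[:n_train], objects[n_train:]

def sample_scene(pool, rng, n_objects=2):
    """Draw a scene as a tuple of objects sampled from the given pool."""
    return tuple(rng.choice(pool) for _ in range(n_objects))

train_objs, test_objs = split_objects(n_concepts=2, n_values=10, rho_train=0.3)
rng = random.Random(1)
train_scenes = [sample_scene(train_objs, rng) for _ in range(5)]
test_scenes = [sample_scene(test_objs, rng) for _ in range(5)]
print(len(train_objs), len(test_objs))
```

With C = 2 concepts and V = 10 values the object space has V^C = 100 objects, so a coverage of 0.3 leaves 70 objects that the model never sees during training.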

**Figure 6.** Binding generalization emerges with scale. Test accuracy on held-out objects as a function of training coverage. Each panel varies object complexity (C concepts, V values, |O| = V^C objects). Concept recognition (orange) generalizes readily; object recognition (blue; binding) requires more coverage but reaches high accuracy, surpassing the bag-of-concepts baseline (dashed).

**Figure 7.** Binding functions can be approximated by low-capacity models. Even single-layer MLPs with small hidden dimensions achieve high accuracy, suggesting that the learned binding operation has low computational complexity. Each panel varies the number of concepts C and concept values V.

**Figure 8.** Multiplicative structure best explains binding. Accuracy of the Additive, Per-obj. products, and Global product probes for predicting scene embeddings from concept indices, evaluated on held-out objects. Dashed: concept recognition; solid: object recognition. … objects, s = (o1, o2), where each object is described by two concept types (e.g., color and shape), so oi = (ci1, ci2) with cik ∈ Ck th…
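Why a multiplicative form can carry binding information where an additive one cannot is easy to see on toy vectors. The sketch below is an illustration only: the probe parameterizations in the paper are not given in this summary, so the two functions, the shared per-concept tables `color` and `shape`, and their random values are all assumptions. The Global product probe is not sketched because its exact form is not recoverable from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, V = 8, 3
# Hypothetical tables: one vector per value of each concept type.
color = rng.normal(size=(V, DIM))
shape = rng.normal(size=(V, DIM))

def additive_probe(scene):
    """Sum one vector per concept value, ignoring which object it belongs to."""
    return sum(color[c] + shape[s] for c, s in scene)

def per_object_product_probe(scene):
    """Multiply an object's concept vectors elementwise, then sum over objects."""
    return sum(color[c] * shape[s] for c, s in scene)

scene = [(0, 0), (1, 1)]      # e.g. (red, cube) and (blue, sphere)
swapped = [(0, 1), (1, 0)]    # same concept values, bindings exchanged

additive_gap = float(np.linalg.norm(additive_probe(scene) - additive_probe(swapped)))
product_gap = float(
    np.linalg.norm(per_object_product_probe(scene) - per_object_product_probe(swapped))
)
print(additive_gap, product_gap)
```

The additive form maps both scenes to the same vector, so it behaves like a bag of concepts, while the per-object product separates them.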

**Figure 9.** Generalization correlates with multiplicative structure. Each point is a trained model; x-axis shows object recognition accuracy on scenes composed of held-out objects, y-axis shows how well the Global product probe (trained on 50% of objects) approximates scene embeddings. Colors indicate object space size. Models that generalize also admit simpler (multiplicative) binding. Takeaway §5.4: Generalizable b…

**Figure 10.** PUG:SPARE dataset samples. Photorealistic scenes with two objects varying in color and species (12 colors × 8 animals, 7392 scenes total).

**Figure 11.** CLEVR dataset samples. 3D-rendered scenes with two objects varying in color and shape (8 colors × 3 shapes, 576 scenes). Sample captions: a purple square and a gray circle; a brown triangle and a red square; a yellow square and a blue triangle; a yellow square and a red circle; a cyan square and a gray circle; a green square and a yellow triangle; a brown circle and a yellow triangle; a green circle and a gray square.

**Figure 12.** CLEVR-2D dataset samples. Our 2D adaptation of CLEVR, replacing 3D shapes with flat 2D equivalents. Same concept structure as CLEVR (8 colors × 3 shapes, 576 scenes). C.2. CLIP geometry experiments: This section specifies how we compute the retrieval and probing metrics reported for the Level-I/II decomposition experiments (Tab. 2) and for the intervention experiments (§4.1). To evaluate whether binding-r…

**Figure 13.** Effect of intervention strength on object replacement. We vary intervention strength k with position-independent (AVG) and position-dependent (AVG+POS) object embeddings for CLEVR and PUG:SPARE. We distinguish between samples in which two objects have different concepts (solid line) and those with a shared concept (dashed line).

**Figure 14.** Object editing with SINGLE-OBJ embeddings. Intervention strength k is varied for CLEVR image embeddings using object embeddings estimated from single-object scenes.

**Figure 16.** Example CLEVR scenes with 3 objects. Used for the 3-object decomposition evaluation in Tab. 13.

**Figure 17.** Example images spanning 5 objects and 5 patterns. Images were generated with Gemini Nano Banana 2 (gemini-3.1-flash-image-preview) at 512×512 resolution using the prompt template: “A square studio product photo of exactly two objects: a {obj1} with a {pattern1} print on its surface, next to a {obj2} with a {pattern2} print on its surface. Exactly two objects only. The pattern must appear only as the surfa…

**Figure 18.** MDS shows an approximate additive structure using averaged object embeddings. MDS projections of CLIP and DINOv2 embeddings for scenes from CLEVR, CLEVR-2D, and PUG (distances approximate embedding distances). Object embeddings (e.g., RC for ‘red cube’) and concept embeddings (e.g., R for ‘red’) are estimated as averages of embeddings of scenes containing the corresponding component. Two-object scenes are d…

**Figure 19.** Multiplicative structure does not hold for CLIP and DINOv2. We fit the Global product probe on CLIP and DINOv2 embeddings. Concept recognition (dashed) recovers with more training data, but object recognition (solid) stays near zero for all encoders. Unlike the from-scratch models trained on pixels (…

**Figure 20.** Synthetic pixel samples used to train vision encoders from scratch. Three levels of pixel-space complexity: (top) noise-free, non-overlapping objects; (middle) speckled noise, non-overlapping objects; (bottom) noisy and overlapping objects. Each object is defined by two concepts (square color and border color). Samples shown correspond to the C = 2, V = 50 setting, giving up to 6.5×10^6 object combinations…

**Figure 21.** Vision-domain results corresponding to …

Original abstract

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines a binding function and shows that CLIP's scene embeddings decompose additively while its binding function is high-complexity, whereas controlled transformers learn low-complexity multiplicative binding that generalizes with data coverage. The link between the two regimes is untested.

The main point is that CLIP embeddings decompose additively into object parts, which explains why separate probes recover objects, yet the binding function itself stays high-complexity and likely blocks generalization to new combinations. In contrast, small transformers trained from scratch on synthetic scenes shift to low-complexity multiplicative binding once the data covers enough combinations, and that regime supports systematic generalization.

The explicit focus on the binding function and the additive decomposition are the clearest new pieces. The complexity distinction between additive and multiplicative regimes also looks fresh relative to earlier CLIP analyses. The controlled experiments are straightforward, the synthetic data lets them measure coverage directly, and the code release is useful.

The main limitation is that the controlled models are trained from scratch on synthetic scenes with explicit concept labels. Nothing in the work tests whether simply increasing data coverage inside a CLIP-style contrastive objective on real images would produce the same low-complexity regime or a shared binding mechanism across encoders. The additive decomposition result is compatible with high complexity and does not rule out that real-data statistics keep forcing it.

This is for researchers working on binding failures and systematic generalization in vision-language models. The experiments are concrete enough and the hypothesis is testable, so the paper deserves a serious referee even though the central claim about CLIP would need direct scaling evidence to land cleanly.

Referee Report

2 major / 1 minor

Summary. The paper examines concept binding in vision-language embedding models such as CLIP. It reports that CLIP scene embeddings decompose additively into object representations (allowing uni-modal recovery of objects) yet the binding function itself is high-complexity, which the authors argue prevents the image and text encoders from learning a shared, generalizable binding mechanism for unseen concept combinations. In contrast, controlled transformer models trained from scratch on synthetic scenes learn low-complexity binding functions based on multiplicative interactions; these generalize systematically once data coverage is sufficient. The authors conclude that high-complexity binding is not fundamental and release code at https://github.com/oshapio/binding-concepts-complexity.

Significance. If the controlled-experiment results hold and the representativeness assumption is validated, the work supplies a concrete complexity-based explanation for CLIP's binding failures and demonstrates that low-complexity multiplicative binding is achievable, with direct implications for training objectives that promote systematic generalization. The public code release and use of controlled synthetic data to isolate binding mechanisms are clear strengths that enable reproducibility and targeted follow-up.

major comments (2)

[Abstract] Abstract, paragraph on controlled models: the claim that high-complexity binding in CLIP 'is not fundamental' is load-bearing for the central thesis yet rests on an untested representativeness assumption; no experiment shows that increasing data coverage under a CLIP-style contrastive objective on real images induces the same low-complexity multiplicative regime or shared encoder binding mechanism.
[Abstract] Abstract: the additive decomposition result for CLIP embeddings is presented as compatible with high-complexity binding, but the manuscript does not quantify how this decomposition interacts with the reported complexity measures or rule out that real-data statistics force high complexity even under higher coverage.

minor comments (1)

[Abstract] The abstract states that object information is 'recoverable' from separate encoders but does not specify the probe architecture, training regime, or quantitative recovery metrics; these details are needed for readers to assess the strength of the uni-modal recovery claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below.

Referee: [Abstract] Abstract, paragraph on controlled models: the claim that high-complexity binding in CLIP 'is not fundamental' is load-bearing for the central thesis yet rests on an untested representativeness assumption; no experiment shows that increasing data coverage under a CLIP-style contrastive objective on real images induces the same low-complexity multiplicative regime or shared encoder binding mechanism.

Authors: We agree that our experiments do not directly demonstrate the emergence of low-complexity binding under a CLIP-style contrastive objective on real-world images with increased data coverage. The controlled setting with synthetic data and transformers trained from scratch is designed to isolate the binding function and show that low-complexity multiplicative interactions are learnable with sufficient coverage. This supports our claim that high-complexity binding is not fundamental to the problem of concept binding. We will revise the abstract to state that our results indicate high-complexity binding is not fundamental, rather than asserting it definitively based on the controlled models alone.
Revision: partial

Referee: [Abstract] Abstract: the additive decomposition result for CLIP embeddings is presented as compatible with high-complexity binding, but the manuscript does not quantify how this decomposition interacts with the reported complexity measures or rule out that real-data statistics force high complexity even under higher coverage.

Authors: The additive decomposition of scene embeddings into object representations is a linear property that explains why object information can be recovered uni-modally, but it is orthogonal to the complexity of the binding function, which involves how concepts are combined (e.g., via multiplicative interactions). The complexity measures are applied specifically to the binding function. We will update the manuscript to include a more explicit discussion of this distinction, including any relevant quantification where possible, and note that the role of real-data statistics in forcing high complexity is an important direction for future investigation.
Revision: yes
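The claimed orthogonality between additivity across objects and the structure of the per-object map can be illustrated with a toy comparison. The sketch below is an assumption-laden illustration, not the paper's complexity measure: it contrasts a random lookup table with an elementwise-product table, and uses a rank-1 completion rule (the hypothetical `complete` helper) as a stand-in for "predicting an unseen combination from seen ones".

```python
import numpy as np

rng = np.random.default_rng(0)
V, DIM = 4, 8
a = rng.uniform(0.5, 1.5, size=(V, DIM))   # hypothetical vectors for concept type 1
b = rng.uniform(0.5, 1.5, size=(V, DIM))   # hypothetical vectors for concept type 2

u_mult = a[:, None, :] * b[None, :, :]               # structured: u[c1, c2] = a[c1] * b[c2]
u_lookup = rng.uniform(0.5, 1.5, size=(V, V, DIM))   # unstructured lookup table

def scene(u, o1, o2):
    """Scene embedding that is additive across objects for either table."""
    return u[o1] + u[o2]

def complete(u, c1, c2, r1, r2):
    """Predict u[c1, c2] from three other objects, assuming multiplicative structure."""
    return u[c1, r2] * u[r1, c2] / u[r1, r2]

# Both tables yield scene embeddings that decompose additively by construction.
s_mult = scene(u_mult, (0, 0), (1, 1))
s_lookup = scene(u_lookup, (0, 0), (1, 1))

# Only the structured table lets an unseen combination be inferred from seen ones.
err_mult = float(np.abs(complete(u_mult, 0, 0, 1, 1) - u_mult[0, 0]).max())
err_lookup = float(np.abs(complete(u_lookup, 0, 0, 1, 1) - u_lookup[0, 0]).max())
print(err_mult, err_lookup)
```

Both tables satisfy the additive decomposition equally well, yet only the multiplicative one supports inferring a held-out object, which is the sense in which the two properties can vary independently.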

standing simulated objections not resolved

Direct experiments training CLIP-style models on real images with varying data coverage to observe the binding complexity regime.

Circularity Check

0 steps flagged

No circularity; claims rest on independent empirical observations

The paper derives its central claims from direct analysis of CLIP embeddings (additive decomposition into object representations) and from training controlled transformers from scratch on synthetic data. No equations or results are shown to reduce by construction to fitted parameters, self-definitions, or prior self-citations. The distinction between high-complexity binding in CLIP and low-complexity multiplicative binding in the controlled models is presented as an observed outcome under varying data coverage, not as a renaming or forced equivalence. The representativeness assumption is stated explicitly but does not create a definitional loop within the reported derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the introduced binding function construct and the observed additive decomposition of scene embeddings; these are not derived from prior literature but defined for the study.

axioms (1)

domain assumption: Scene embeddings decompose additively into object representations.
Stated as an empirical finding that explains recoverability of object information from separate embeddings.

invented entities (1)

binding function (no independent evidence)
purpose: Maps individual concepts to overall scene embeddings to study the tension between bag-of-concepts behavior and recoverable object information.
New analytic tool introduced in the abstract to formalize the binding problem.

