TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

Ben Chen; Chenyi Lei; Huangyu Dai; Lingtao Mao; Wenwu Ou; Xinyu Sun; Zexin Zheng; Zihan Liang

arxiv: 2605.18434 · v1 · pith:Q56OF553new · submitted 2026-05-18 · 💻 cs.IR · cs.CV

Xinyu Sun, Huangyu Dai, Lingtao Mao, Zexin Zheng, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

Pith reviewed 2026-05-19 23:51 UTC · model grok-4.3

keywords: e-commerce retrieval · fine-grained grounding · text-guided matching · implicit localization · multimodal retrieval · dual distillation · cropped image query · image-to-multimodal

The pith

Item text can guide the creation of target-focused representations from full images to match cropped visual queries without object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that structured product text supplies enough semantic direction to implicitly localize and emphasize the relevant object within full item images. This addresses two problems at once: the mismatch between a cropped visual query and complete multimodal items, and the extra background or distractors that standard encoders struggle with. A sympathetic reader would care because the method avoids the extra cost and error accumulation of explicit detection pipelines while still delivering higher accuracy on realistic e-commerce data that contains clutter and varied layouts.

Core claim

TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection for retrieval. Dual distillation objectives preserve target-region spatial consistency and query-item similarity structure, yielding more stable and discriminative multimodal representations. The approach is evaluated on a new 10M-pair training set and two benchmarks covering standard and cluttered item layouts, plus public e-commerce sets for noisy and one-to-many cases.
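The core claim names an objective that preserves query–item similarity structure, but this summary does not give its form. As a minimal sketch under that gap, one common way to distil a similarity structure is to match the row-wise softmax of a teacher's query–item similarity matrix with a KL divergence. The function names, the temperature `tau`, and the choice of KL are illustrative assumptions, not the authors' stated loss.

```python
import numpy as np


def log_softmax(x):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def similarity_distillation_loss(teacher_sim, student_sim, tau=0.05):
    """Mean KL(teacher || student) over queries.

    teacher_sim, student_sim: (num_queries, num_items) similarity matrices.
    tau: hypothetical softmax temperature (an assumption, not from the paper).
    """
    log_p = log_softmax(np.asarray(teacher_sim, dtype=np.float64) / tau)
    log_q = log_softmax(np.asarray(student_sim, dtype=np.float64) / tau)
    kl_per_query = (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
    return float(kl_per_query.mean())
```

The loss is zero when the student reproduces the teacher's ranking distribution for every query and grows as the two distributions diverge, which is the sense in which such a term "preserves" similarity structure.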

What carries the argument

Text-guided implicit fine-grained grounding that turns item text descriptions into semantic signals for focusing full-image representations on target regions.
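To make the idea of text acting as a semantic signal over image regions concrete, here is a minimal sketch of text-conditioned attention pooling over patch tokens. It assumes patch and text tokens already live in a shared embedding space; the mean-pooled text query, the scaled dot-product scoring, and the function name `text_guided_pool` are all illustrative assumptions and should not be read as the TIGER-FG architecture.

```python
import numpy as np


def text_guided_pool(patch_tokens, text_tokens):
    """Pool image patch tokens with weights driven by the item text.

    patch_tokens: (num_patches, dim) visual tokens for the full item image.
    text_tokens:  (num_text_tokens, dim) tokens for the item text.
    Returns a unit-norm (dim,) embedding and the (num_patches,) weights.
    """
    dim = patch_tokens.shape[1]
    text_query = text_tokens.mean(axis=0)              # one query vector from the text
    scores = patch_tokens @ text_query / np.sqrt(dim)  # scaled dot-product scores
    scores = scores - scores.max()                     # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = weights @ patch_tokens                    # text-weighted average of patches
    pooled = pooled / (np.linalg.norm(pooled) + 1e-12)
    return pooled, weights
```

Patches that agree with the text receive most of the weight, so background and distractor patches contribute little to the pooled embedding. The weights can also be reshaped to the patch grid and viewed as a heatmap.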

Load-bearing premise

Item text descriptions provide reliable semantic guidance sufficient to produce target-focused representations implicitly without explicit localization or detection.

What would settle it

Replace the real item texts with generic or mismatched descriptions and measure whether the reported recall gains over baselines disappear on the cluttered benchmark.
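The control described above can be sketched as a small harness: pair every item image with another item's text, re-encode the items, and compare Recall@1 against the run with real text. The encoders are passed in as callables because the model itself is not specified here; `encode_query`, `encode_item`, and `text_mismatch_gap` are hypothetical names, and the one-to-one ground truth (query i matches item i) is an assumption of this sketch.

```python
import numpy as np


def derange(n, rng):
    """Random permutation of range(n) with no fixed points (needs n >= 2)."""
    if n < 2:
        raise ValueError("need at least two items to mismatch texts")
    while True:  # rejection sampling; succeeds with probability about 1/e per try
        perm = rng.permutation(n)
        if not np.any(perm == np.arange(n)):
            return perm


def recall_at_1(query_emb, item_emb):
    """Fraction of queries whose top-1 item (cosine similarity) is item i for query i."""
    q = query_emb / (np.linalg.norm(query_emb, axis=1, keepdims=True) + 1e-12)
    it = item_emb / (np.linalg.norm(item_emb, axis=1, keepdims=True) + 1e-12)
    top1 = (q @ it.T).argmax(axis=1)
    return float((top1 == np.arange(len(q))).mean())


def text_mismatch_gap(queries, images, texts, encode_query, encode_item, seed=0):
    """Return (Recall@1 with real texts, Recall@1 with mismatched texts)."""
    rng = np.random.default_rng(seed)
    perm = derange(len(texts), rng)
    q = encode_query(queries)
    real = recall_at_1(q, encode_item(images, texts))
    mismatched = recall_at_1(q, encode_item(images, [texts[j] for j in perm]))
    return real, mismatched
```

If the gain over baselines survives the mismatched run, the text is not doing the grounding; if it collapses, the premise that text guides the representation is supported.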

Figures

Figures reproduced from arXiv: 2605.18434 by Ben Chen, Chenyi Lei, Huangyu Dai, Lingtao Mao, Wenwu Ou, Xinyu Sun, Zexin Zheng, Zihan Liang.

**Figure 1.** a, uses explicit object detection before item encoding. Given a full item image, such methods first detect candidate item boxes, crop the corresponding regions, encode each cropped region, and then compare region embeddings with structured item text to retain the most text-compatible representation for indexing [Cheng et al., 2024, Nan et al., 2025]. A related grounding-based variant, not shown in the figu…

**Figure 2.** Overview of the proposed TIGER-FG framework. (a) Dual-encoder retrieval architecture. (b) Text-guided item representation learning. (c) Joint training objectives for item representation and query–item alignment. Two linear layers project them into a shared $C_u$-dimensional space, giving $V' \in \mathbb{R}^{N_v \times C_u}$ and $T' \in \mathbb{R}^{N_t \times C_u}$. For notational simplicity, let $c' = V'[\mathrm{CLS}] \in \mathbb{R}^{C_u}$ denote the visual class token, and let…

**Figure 3.** Heatmap and embedding visualizations. (a) Text-conditioned heatmaps show that TIGER-FG focuses on query-relevant regions. (b) Compared with Qwen3-VL-Emb, TIGER-FG forms tighter category clusters and closer query–item alignment. … Its Recall@1 decreases from 75.2 to 47.8, showing that clutter-aware training is essential for handling multi-item ambiguity. Replacing DINOv3 with a CLIP-based backbone als…

**Figure 4.** Paired examples from ECom-RF-IMMR-Normal and ECom-RF-IMMR-Mosaic. Each cell shows, for the same item text, the original Normal item image (left) and its Mosaic re-synthesis (right). The Mosaic image keeps the Normal target crop pixel-verbatim and pastes it, together with cross-category distractors, onto a random background at random scale and location. We construct the ECom-RF-IMMR suite, the training set E…

**Figure 5.** Category distribution of ECom-RF-IMMR-Normal. (a) Top-level (L1) distribution showing a balanced mix of verticals, with no single L1 exceeding ∼5% of samples. (b) Within-L1 diversity, measured by the number of distinct L2 and leaf categories under each of the top L1 verticals. (c) Long-tail cumulative coverage: reaching 80% of samples requires 40 L1, 235 L2, 1,933 L3, and 5,647 leaf categories, highlightin…

**Figure 6.** Top-6 retrieval results on eSSPR for the query “2 Pc Bodysuits Shorts Set …”. TIGER-FG hits the ground truth at rank 1; BLIPFF hits at rank 2, with rank 1 being an image-matching but title-mismatching candidate (biker shorts); Qwen3-VL-Embedding returns semantically related but incorrect items (yoga tops, bodycon dresses, lingerie) throughout the top-6. … drifting into adjacent categories such as yoga a…

**Figure 7.** Top-6 retrieval results on ECom-RF-IMMR-Normal for the query “LEISEWIE exquisite scarves”. TIGER-FG ranks the ground truth first; BLIPFF reads the worn scarf as headwear and returns hats throughout the top-6; Qwen3-VL-Embedding hits the target only at rank 2, with an unrelated maternity-blanket item at rank 1.

**Figure 8.** Qualitative visualization of the additive ablation (Case 1/3: dress and shoes queries). Each panel corresponds one-to-one to a row in…

**Figure 9.** Qualitative visualization of the additive ablation (Case 2/3: storage basket and cosmetic brushes queries). Panels follow the same additive layout as…

**Figure 10.** Qualitative visualization of the additive ablation (Case 3/3: knitwear and dress queries). Panels follow the same additive layout as…
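The Mosaic re-synthesis described in the Figure 4 caption (target crop kept pixel-verbatim, pasted with distractors onto a random background at a random location) can be sketched as follows. This is a simplified illustration: the noise background, the choice to paste the target last so it is never occluded, and the omission of random rescaling are assumptions of the sketch, and `make_mosaic` is a hypothetical name rather than the authors' tooling.

```python
import numpy as np


def _random_paste(canvas, crop, rng):
    """Paste crop at a uniformly random location that fits; return (top, left)."""
    canvas_h, canvas_w = canvas.shape[:2]
    h, w = crop.shape[:2]
    if h > canvas_h or w > canvas_w:
        raise ValueError("crop does not fit on the canvas")
    top = int(rng.integers(0, canvas_h - h + 1))
    left = int(rng.integers(0, canvas_w - w + 1))
    canvas[top:top + h, left:left + w] = crop
    return top, left


def make_mosaic(target_crop, distractor_crops, canvas_hw=(256, 256), seed=0):
    """Build a cluttered image; return it with the target box (top, left, h, w)."""
    rng = np.random.default_rng(seed)
    height, width = canvas_hw
    # Stand-in random background: uniform noise (an assumption of this sketch).
    canvas = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
    for crop in distractor_crops:
        _random_paste(canvas, crop, rng)
    # Target goes last so its pixels stay verbatim and unoccluded.
    top, left = _random_paste(canvas, target_crop, rng)
    h, w = target_crop.shape[:2]
    return canvas, (top, left, h, w)
```

Returning the target box is useful for evaluation only, for example to check whether a text-conditioned heatmap peaks inside it; a detection-free model would not consume it during training.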

Original abstract

E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity (a visual query must match image–text items) and a granularity disparity (a cropped query must be compared with full images containing background context and possible distractors). Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection, but are vulnerable to backgrounds or irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection for retrieval. We further introduce dual distillation objectives that preserve target-region spatial consistency and query–item similarity structure, yielding more stable and discriminative multimodal representations. In addition, we construct ECom-RF-IMMR, a realistic benchmark suite with a 10M-pair training set and two evaluation benchmarks covering standard and cluttered item layouts. TIGER-FG improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks, respectively, with only 85.7M query-side parameters and 256-dim embeddings. Results on public e-commerce benchmarks further demonstrate its generalization to noisy and one-to-many retrieval scenarios. Code and data will be released.
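Since the abstract reports Recall@1 and also mentions one-to-many retrieval, a short sketch of Recall@K with a set of correct items per query may help fix the metric. The exact evaluation protocol is not given here, so the cosine scoring and the "hit if any relevant item appears in the top K" convention are assumptions, and `recall_at_k` is a hypothetical helper.

```python
import numpy as np


def recall_at_k(query_emb, item_emb, relevant, k=1):
    """Fraction of queries with at least one relevant item in the top k.

    query_emb: (num_queries, dim); item_emb: (num_items, dim).
    relevant:  one collection of correct item indices per query (one-to-many).
    """
    q = query_emb / (np.linalg.norm(query_emb, axis=1, keepdims=True) + 1e-12)
    it = item_emb / (np.linalg.norm(item_emb, axis=1, keepdims=True) + 1e-12)
    sim = q @ it.T                              # cosine similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]      # indices of the k best items
    hits = [bool(set(row.tolist()) & set(rel)) for row, rel in zip(topk, relevant)]
    return float(np.mean(hits))
```

A gain of 6.1 percentage points in this metric means 6.1 more queries per hundred retrieve a correct item at rank 1, which is an absolute difference rather than a relative one.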

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TIGER-FG gives a workable detection-free route to handle granularity gaps in e-commerce retrieval via text-guided implicit grounding, but the large reported gains rest on an untested assumption about text reliability.

The core point of this paper is that TIGER-FG uses item text to steer implicit fine-grained grounding in image-to-multimodal e-commerce retrieval, avoiding explicit detection while claiming large Recall@1 gains on both clean and cluttered benchmarks with a compact 85.7M-parameter query-side model and 256-dim embeddings. The authors also construct a new benchmark suite, ECom-RF-IMMR, with a 10M-pair training set and two evaluation benchmarks covering standard and cluttered layouts, and they report results on public e-commerce benchmarks. The combination of text guidance and dual distillation for spatial consistency and similarity structure looks like the genuinely new piece relative to CLIP-style or detection baselines.

The setup directly targets the modality and granularity asymmetries that matter for real shopping search, and staying detection-free is a practical plus for deployment cost. The benchmark construction itself is a useful contribution for the vertical.

The soft spot is the central assumption that item text will reliably supply the semantic signal needed to focus on the target region without any localization supervision. E-commerce titles and attributes are frequently generic or only partially aligned with the photo, so any mismatch would flow straight into the embeddings and could explain or inflate the 34.4 pp jump on the cluttered set. The abstract gives concrete deltas and parameter counts but no ablations, error breakdowns, or sensitivity checks on text quality, which leaves the evidence for soundness thin. Without those, it is hard to judge whether the dual distillation actually stabilizes the representations across varied layouts or just fits the particular data.

This paper is aimed at retrieval engineers and researchers working on multimodal e-commerce systems. Readers who care about practical grounding without detectors, or who need realistic benchmarks for noisy shopping data, will find usable ideas. It is coherent enough, and grounded enough in a real problem, to deserve a serious referee, though the review should press for method details, ablations on text noise, and checks on whether the gains hold when descriptions are deliberately mismatched. I would send it to peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. It addresses modality and granularity asymmetries between cropped visual queries and full item images with text by using item text as semantic guidance to produce target-focused representations without object detection. Dual distillation objectives are proposed to preserve target-region spatial consistency and query-item similarity structure. The authors construct the ECom-RF-IMMR benchmark with a 10M-pair training set and two evaluation sets (standard and cluttered layouts). Experiments report Recall@1 gains of 6.1 and 34.4 percentage points over the strongest baseline on the two benchmarks, using 85.7M query-side parameters and 256-dim embeddings, with additional results on public e-commerce benchmarks.

Significance. If the central performance claims hold under rigorous validation, this work would be significant for e-commerce retrieval by offering a detection-free, parameter-efficient alternative that handles cluttered item layouts better than CLIP-style encoders. The release of code, data, and the new ECom-RF-IMMR benchmark suite represents a concrete contribution to reproducible research in multimodal IR. The dual-distillation design and implicit grounding approach could influence future work on granularity handling in retrieval without explicit localization.

major comments (3)

[§3] (Method): The dual distillation objectives are asserted to maintain spatial consistency and similarity structure, yet the manuscript provides no equations, architectural diagrams, or pseudocode detailing how text embeddings are aligned to implicit target regions in the absence of any localization or region-level supervision. This mechanism is load-bearing for the claim that text guidance alone suffices to overcome the granularity disparity.
[§4] (Experiments): The main results table reports large Recall@1 deltas (especially the 34.4 pp gain on the cluttered benchmark), but no ablation tables isolate the contribution of the text-guidance component versus the dual distillation losses, nor is there error analysis on queries where item titles are generic or partially mismatched with visual content. Without these, the attribution of gains to the proposed implicit grounding remains under-supported.
[§4.1] (Benchmark construction): The ECom-RF-IMMR suite is described as post-hoc with standard vs. cluttered splits, but the manuscript lacks details on how clutter is quantified, how distractors are selected, or potential selection biases in the 10M training pairs. This directly affects the reliability of the cross-benchmark generalization claims.

minor comments (2)

[Abstract] The abstract states that code and data will be released, but the manuscript does not include a repository link or data access instructions.
[§3] Notation for the 256-dim embeddings and query-side parameter count (85.7M) should be consistently defined in the method section to allow direct comparison with baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important areas for improving clarity and supporting the central claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Referee: [§3] (Method): The dual distillation objectives are asserted to maintain spatial consistency and similarity structure, yet the manuscript provides no equations, architectural diagrams, or pseudocode detailing how text embeddings are aligned to implicit target regions in the absence of any localization or region-level supervision. This mechanism is load-bearing for the claim that text guidance alone suffices to overcome the granularity disparity.

Authors: We agree that the current description of the dual distillation objectives in §3 is insufficiently detailed. The revised manuscript will add the explicit loss equations for both the spatial consistency and similarity structure distillation terms, an architectural diagram illustrating the text-to-implicit-region alignment, and pseudocode for the forward pass. These additions will make explicit how item text embeddings serve as semantic guidance to focus representations on target regions without any localization supervision.

revision: yes

Referee: [§4] (Experiments): The main results table reports large Recall@1 deltas (especially the 34.4 pp gain on the cluttered benchmark), but no ablation tables isolate the contribution of the text-guidance component versus the dual distillation losses, nor is there error analysis on queries where item titles are generic or partially mismatched with visual content. Without these, the attribution of gains to the proposed implicit grounding remains under-supported.

Authors: We accept that the experimental section would benefit from additional ablations and error analysis. In the revision we will insert a dedicated ablation table that separately removes the text-guidance pathway and each distillation loss, and we will add a qualitative error analysis subsection that examines failure cases involving generic or mismatched item titles. These changes will provide stronger evidence for attributing the observed gains to the implicit grounding mechanism.

revision: yes

Referee: [§4.1] (Benchmark construction): The ECom-RF-IMMR suite is described as post-hoc with standard vs. cluttered splits, but the manuscript lacks details on how clutter is quantified, how distractors are selected, or potential selection biases in the 10M training pairs. This directly affects the reliability of the cross-benchmark generalization claims.

Authors: We acknowledge the need for greater transparency in benchmark construction. The revised §4.1 will include quantitative clutter metrics (e.g., average number of salient objects and background entropy), explicit criteria used to select distractors for the cluttered split, and a short analysis of potential selection biases in the 10M training pairs. These details will support the generalization claims across the two evaluation settings.

revision: yes
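The response above names "background entropy" as a clutter metric without defining it. One plausible reading, offered only as a sketch, is the Shannon entropy of the intensity histogram over background pixels. The bin count and the function name `intensity_entropy` are assumptions, and the caller is assumed to pass in only the pixels considered background.

```python
import numpy as np


def intensity_entropy(gray_pixels, bins=32):
    """Shannon entropy in bits of a uint8 intensity histogram.

    gray_pixels: array of grayscale values in [0, 255], any shape.
    bins: number of histogram bins (a hypothetical default).
    """
    hist, _ = np.histogram(gray_pixels, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute nothing to the sum
    return float(-(p * np.log2(p)).sum())
```

A flat studio backdrop scores near 0 bits, while a busy scene spreads mass over many bins and scores closer to the maximum of log2(bins), so the value gives a rough, detector-free ordering of layouts by clutter.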

Circularity Check

0 steps flagged

No circularity: TIGER-FG introduces independent framework components and benchmark without reduction to fitted inputs or self-referential derivations

The paper proposes TIGER-FG as a new text-guided implicit fine-grained grounding framework that addresses modality and granularity asymmetries in e-commerce retrieval by using item text for semantic guidance and adding dual distillation objectives for spatial consistency and similarity structure. It also constructs the ECom-RF-IMMR benchmark suite with training and evaluation sets. These elements are presented as novel contributions evaluated empirically on Recall@1 improvements, with no equations or steps shown that reduce predictions to prior fits by construction, no load-bearing self-citations defining uniqueness, and no renaming of known results as new derivations. The central claims rest on the proposed architecture and objectives rather than tautological re-derivations of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes text descriptions are sufficiently descriptive and aligned with visual targets, but these are not formalized.

pith-pipeline@v0.9.0 · 5828 in / 1181 out tokens · 31170 ms · 2026-05-19T23:51:00.345273+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection... dual distillation objectives that preserve target-region spatial consistency and query–item similarity structure"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"We further introduce dual distillation objectives... Spatial-relational distillation aligns target-region spatial consistency, while similarity-distribution distillation preserves the global query–item similarity structure"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 7 internal anchors

[1]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors,Computer Vision – ECCV 2020, pages 213–229, Cham,

[2]

Ben Chen, Linbo Jin, Xinxin Wang, Dehong Gao, Wen Jiang, and Wei Ning

Springer International Publishing. Ben Chen, Linbo Jin, Xinxin Wang, Dehong Gao, Wen Jiang, and Wei Ning. Unified vision-language representation modeling for e-commerce same-style products retrieval. InCompanion Proceedings of the ACM Web Conference 2023, pages 381–385, 2023a. Weijing Chen, Linli Yao, and Jin Qin. Rethinking benchmarks for cross-modal ima...

[3]

Category-oriented representation learning for image to multi-modal retrieval.arXiv preprint arXiv:2305.03972,

Zida Cheng, Chen Ju, Shuai Xiao, Xu Chen, Zhonghua Zhai, Xiaoyi Zeng, Weilin Huang, and Junchi Yan. Category-oriented representation learning for image to multi-modal retrieval.arXiv preprint arXiv:2305.03972,

[4]

Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588,

[6]

The Faiss library

URLhttps://arxiv.org/abs/2401.08281. Chao Gao, Siqiao Xue, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou, et al. Lookbench: A live and holistic open benchmark for fashion image retrieval.arXiv preprint arXiv:2601.14706,

[7]

jina-embeddings- v4: Universal embeddings for multimodal multilingual retrieval

Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al. jina-embeddings- v4: Universal embeddings for multimodal multilingual retrieval. InProceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 531–550,

[8]

Mdetr–modulated detection for end-to-end multi-modal understanding.arXiv preprint arXiv:2104.12763, 2021

URL https://arxiv.org/ abs/2104.12763. Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems,

[9]

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022b. URL https://openaccess.thecvf.com/c...

[10]

arXiv preprint arXiv:2411.02571 , year=

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571,

[11]

Unidgf: A unified detection-to-generation framework for hierarchical object visual recognition.arXiv preprint arXiv:2511.15984,

Xinyu Nan, Lingtao Mao, Huangyu Dai, Zexin Zheng, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, et al. Unidgf: A unified detection-to-generation framework for hierarchical object visual recognition.arXiv preprint arXiv:2511.15984,

[12]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

[13]

DINOv3

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104,

[14]

Chinese clip: Contrastive vision-language pretraining in chinese.arXiv preprint arXiv:2211.01335,

An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese clip: Contrastive vision-language pretraining in chinese.arXiv preprint arXiv:2211.01335,

[15]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605,

[16]

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: improving universal multimodal retrieval by multimodal llms.arXiv preprint arXiv:2412.16855,

[17]

Megapairs: Massive data synthesis for universal multimodal retrieval.arXiv preprint arXiv:2412.14475,

Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, and Yongping Xiong. Megapairs: Massive data synthesis for universal multimodal retrieval.arXiv preprint arXiv:2412.14475,

[18]

In this work, we instantiate and evaluate it in the e-commerce setting, where each candidate is typically represented by full item images paired with structured item text

12 A Limitations TIGER-FG is designed for image-to-multimodal item retrieval. In this work, we instantiate and evaluate it in the e-commerce setting, where each candidate is typically represented by full item images paired with structured item text. This setting provides a practical testbed for studying cross- modal and granularity disparities in fine-gra...

[19]

2 Pc Bodysuits Shorts Set

and one from ourECom-RF-IMMR-Normal(Figure 7). For each method we show the retrieved item title and image at ranks 1–6, with the ground-truth candidate marked by a green check and incorrect candidates by a red cross. eSSPR (Figure 6).The query is a cropped image of a beige one-piece bodysuit, and the ground truth is a two-piece bodysuit–shorts set whose i...

[20]

Block(b)yields sharper and more localized responses, but title-guided routing remains unstable under raw-data training

Block(a)mainly improves entity localization. Block(b)yields sharper and more localized responses, but title-guided routing remains unstable under raw-data training. After introducing Mosaic-augmented training in block(c), the model better identifies which entity in the item image corresponds to the title. AddingDandTfurther strengthens this title-guided r...

