Pith Number

pith:Z4X4O7C5

pith:2026:Z4X4O7C57UKXQAI3WDKA4EDKYY
not attested · not anchored · not stored · refs resolved

Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

Jing Ma, Jungong Han, Pinghua Gong, Shijie Yang, Tianlu Zhang, Weijie Ding, Yanlong Zang, Yufei Guo

TGQ-Former uses metadata as text guidance to extract robust visual tokens from cluttered product images for e-commerce retrieval.

arxiv:2605.17366 v1 · 2026-05-17 · cs.IR

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{Z4X4O7C57UKXQAI3WDKA4EDKYY}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
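
The property described above (any mirror recomputes the same state from the same set of events) can be illustrated with a generic order-independent merge. This is only a sketch of the idea, not Pith's actual merge algorithm, which this page does not specify. The event fields `id`, `ts`, `field`, and `value`, and the last-writer-wins rule, are all hypothetical.

```python
from typing import Any, Dict, Iterable

Event = Dict[str, Any]  # hypothetical event shape: id, ts, field, value


def merge_events(events: Iterable[Event]) -> Dict[str, Any]:
    """Fold events into a state that does not depend on arrival order."""
    # Deduplicate by event id (duplicates are assumed to be identical copies).
    unique = {e["id"]: e for e in events}
    # Sort by a total order (timestamp, then id) so every mirror agrees.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state: Dict[str, Any] = {}
    for e in ordered:
        state[e["field"]] = e["value"]  # last writer wins (assumed rule)
    return state


# Toy events, invented purely for illustration.
events = [
    {"id": "e1", "ts": 1, "field": "label", "value": "draft"},
    {"id": "e2", "ts": 2, "field": "owner", "value": "alice"},
    {"id": "e3", "ts": 3, "field": "label", "value": "final"},
]

state_a = merge_events(events)
# A mirror that received the events reversed, with one duplicate.
state_b = merge_events(list(reversed(events)) + [events[0]])
print(state_a == state_b)
```

Sorting on a total order before folding is what makes the result independent of how the bundle was transported or concatenated.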

Claims

C1 · strongest claim

TGQ-Former consistently outperforms strong connector baselines and end-to-end MLLMs on large-scale real-world e-commerce datasets with full-pool retrieval, improving Hit Rate@100 by 6.04% on average.

C2 · weakest assumption

Structured metadata is assumed to be accurate and sufficient to serve as reliable semantic guidance that allows the hybrid-query connector to disentangle metadata-anchored and exploratory visual streams without discarding useful visual evidence.

C3 · one line summary

TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
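
For readers unfamiliar with the metric in the claims above, here is a minimal sketch of Hit Rate@K as it is commonly defined: the fraction of queries whose target item appears among the top K ranked candidates. The paper's exact evaluation protocol is not given on this page, so the function name `hit_rate_at_k` and the toy data are assumptions for illustration only.

```python
from typing import Hashable, Sequence


def hit_rate_at_k(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    k: int = 100,
) -> float:
    """Fraction of queries whose target is within the top-k ranked items."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranking, target in zip(rankings, targets) if target in ranking[:k]
    )
    return hits / len(targets)


# Toy example: three queries, each with a ranked candidate list.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "f", "z"]

hr_at_2 = hit_rate_at_k(rankings, targets, k=2)  # only "b" is in a top-2 list
hr_at_3 = hit_rate_at_k(rankings, targets, k=3)  # "b" and "f" are in top-3 lists
print(hr_at_2, hr_at_3)
```

In a full-pool setting, each ranking would cover the entire item catalog rather than a sampled subset of negatives.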

References

39 extracted · 39 resolved · 10 Pith anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al
[2] Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35 (2022), 23716–23736.
[3] Qwen3-VL Technical Report 2025 · arXiv:2511.21631
[4] Qwen2.5-VL Technical Report 2025 · arXiv:2502.13923
[5] Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: T 2019 · doi:10.1145/3331184.3331254

Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-20T00:03:54.706133Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

cf2fc77c5dfd1578011bb0d40e106ac605a8a1cf179c26e84830bc10f71e6244

Aliases

arxiv: 2605.17366 · arxiv_version: 2605.17366v1 · doi: 10.48550/arxiv.2605.17366 · pith_short_12: Z4X4O7C57UKX · pith_short_16: Z4X4O7C57UKXQAI3 · pith_short_8: Z4X4O7C5
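
The values on this page suggest how the identifiers relate: the 26-character Pith Number matches the unpadded Base32 encoding of the first 16 bytes of the canonical hash, and each `pith_short_N` alias is its first N characters. This is an observation from the values shown here, not a documented specification. The helper name `pith_from_hash` and the 16-byte truncation are assumptions.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "cf2fc77c5dfd1578011bb0d40e106ac605a8a1cf179c26e84830bc10f71e6244"
PITH_NUMBER = "Z4X4O7C57UKXQAI3WDKA4EDKYY"


def pith_from_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest, without '=' padding."""
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = pith_from_hash(CANONICAL_HASH)

# Short aliases appear to be plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(aliases)
```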
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/Z4X4O7C57UKXQAI3WDKA4EDKYY \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: cf2fc77c5dfd1578011bb0d40e106ac605a8a1cf179c26e84830bc10f71e6244
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ced29fbc25e10a66a68bc4b20cc98ce5020e53d0926cfa1fbe8cbaf0296fefcf",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.IR",
    "submitted_at": "2026-05-17T10:20:23Z",
    "title_canon_sha256": "af055bd0a527e76ae5e05797ca30bc7cbcef82b3d2350800a43ee05dcf39a1a7"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.17366",
    "kind": "arxiv",
    "version": 1
  }
}
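
The shell one-liner above can also be run offline against the record shown on this page. The sketch below applies the same canonicalization (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and hashes the result with SHA-256. Whether the JSON rendered here is identical to what the API returns is an assumption, so the script prints the comparison instead of asserting it. The function names are illustrative.

```python
import hashlib
import json

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "ced29fbc25e10a66a68bc4b20cc98ce5020e53d0926cfa1fbe8cbaf0296fefcf",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.IR",
        "submitted_at": "2026-05-17T10:20:23Z",
        "title_canon_sha256": "af055bd0a527e76ae5e05797ca30bc7cbcef82b3d2350800a43ee05dcf39a1a7",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.17366", "kind": "arxiv", "version": 1},
}

EXPECTED = "cf2fc77c5dfd1578011bb0d40e106ac605a8a1cf179c26e84830bc10f71e6244"


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and no whitespace, as in the one-liner."""
    text = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(obj) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches hash shown on page:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a client happens to store the record's fields.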