pith. sign in

arxiv: 2605.17365 · v1 · pith:AL27M544new · submitted 2026-05-17 · 💻 cs.CV

Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

Pith reviewed 2026-05-20 13:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords chat-based image retrievalquery intent understandingmemory-augmented modelmulti-round dialogueefficient retrievalintent evolutionvisual guidance
0
0 comments X

The pith

A lightweight memory module tracks evolving user intent across chat rounds for image retrieval without reprocessing full history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MAQIU, a framework that maintains a compact, evolving representation of what the user is seeking during multi-turn conversations aimed at finding specific images. Rather than concatenating every past query into an ever-longer text string or invoking a large model to rewrite the current request, the system uses a small memorization component that updates the intent summary step by step. A recall step guards against dropping earlier details, and previous retrieval results are fed back as visual hints to tighten the match. A reader would care because this style of interactive search can produce far more precise results than one-shot queries, yet existing solutions become expensive and prone to losing track of the original goal as the conversation lengthens.

Core claim

MAQIU introduces a memory-based user intent updating framework consisting of a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues. A memory recall mechanism prevents intent forgetting and strengthens long-term semantic integrity, while historical image retrieval results are integrated as visual guidance to refine cross-round correlations.

What carries the argument

The lightweight memorization module that aggregates and evolves semantic query intent representations round by round, paired with a memory recall mechanism to avoid forgetting earlier details.

If this is right

  • Dialogue encoding computation drops sharply because only the compact memory state is updated instead of the full history.
  • Retrieval quality improves by preserving long-term intent and by using prior image results as visual context.
  • The same memory structure can be applied to any multi-turn clarification task without requiring larger language models for query rewriting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may scale to longer conversations where full-history approaches become impractical.
  • Similar memory modules could be tested in conversational recommendation or visual question answering to reduce repeated context processing.

Load-bearing premise

A lightweight memorization module can reliably aggregate and evolve semantic intent representations across rounds without introducing inconsistencies or forgetting key details that would degrade retrieval quality.

What would settle it

A test in which the memory module drops a critical constraint stated in the first round and later returns images that violate that constraint while the full-history baseline still satisfies it.

Figures

Figures reproduced from arXiv: 2605.17365 by Daizong Liu, Jianfeng Dong, Shuhui Wang, Xianke Chen, Xin Tan, Xun Wang, Xun Yang, Yushuo Lou.

Figure 1
Figure 1. Figure 1: The workflows of traditional text-to-image retrieval and chat-based [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of our motivation. Unlike previous methods that simply integrate dialogue contexts via either multi-round concatenation or LLM-based [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed MAQIU framework for chat-based image retrieval. Given the initial query and multi-round dialogue interactions, the [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of Hit@10 and Recall@10 across dialogue rounds on VisDial under Original (a) and Reconstructed (b) dialogue settings. While Hit@10 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance comparison on dialogue datasets ChatGPT-BLIP2 and [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation on historical dialogue repository construction in terms of (a) [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Evaluation of memory recall designs. (a) similarity-based recall [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative multi-round retrieval comparison with different baselines. Each example shows the evolving dialogue on the left, baseline retrieval at rounds [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
read the original abstract

Different from traditional text-to-image retrieval tasks, chat-based image retrieval allows the human-interactive system to iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user's query intent across dialogue rounds. Although existing works have achieved great performance on this new task, they simply handle history query information either by directly concatenating all previous queries into a long textual sequence or by relying on large language models to reconstruct the current query from history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for the chat-based image retrieval task, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues, while a memory recall mechanism is further employed to prevent intent forgetting and enhance long-term semantic integrity. In addition, MAQIU also integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4\% compared with the prior baseline ChatIR. Source code is available at https://github.com/HuiGuanLab/MAQIU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes MAQIU, a memory-augmented framework for chat-based image retrieval. It introduces a lightweight memorization module to dynamically aggregate and evolve semantic query intent representations across multi-round dialogues, employs a memory recall mechanism to prevent forgetting, and integrates historical image retrieval results as visual guidance. Experiments claim substantial performance gains over baselines such as ChatIR while reducing dialogue encoding FLOPs by 86.4%.

Significance. If the reported efficiency and performance results hold under full verification, the work addresses a practical bottleneck in interactive retrieval by replacing redundant history concatenation or LLM-based reconstruction with a lightweight memory module. The 86.4% FLOP reduction and source-code release are concrete strengths that could influence design of efficient dialogue-driven vision systems.

major comments (2)
  1. [§4] §4 (Experimental Setup): The central efficiency claim of 86.4% FLOP reduction is load-bearing, yet the manuscript provides no explicit breakdown of how the memorization module's forward pass cost is measured relative to full history concatenation; without this accounting or an ablation isolating the recall mechanism's overhead, the reported savings cannot be independently verified.
  2. [Table 2] Table 2 (Ablation Studies): The contribution of visual guidance integration to both retrieval metrics and FLOP savings is not isolated; if removing this component degrades performance without proportionally affecting efficiency, the claim that MAQIU maintains high efficiency while strengthening cross-round correlations requires re-examination.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'substantial performance gains' should be replaced with concrete deltas (e.g., +X% Recall@10) to allow readers to assess magnitude without consulting the full results section.
  2. [§3.1] §3.1: Notation for the memory state update equation is introduced without an accompanying diagram label; adding an explicit equation number would improve traceability to the recall mechanism.
  3. [Figure 3] Figure 3: The efficiency plot axes lack units for FLOPs per dialogue turn; clarifying whether measurements include or exclude the visual guidance branch would prevent misinterpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive feedback. We appreciate the recognition of the practical value of the efficiency gains and source-code release. We address each major comment below with clarifications and commitments to revisions.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): The central efficiency claim of 86.4% FLOP reduction is load-bearing, yet the manuscript provides no explicit breakdown of how the memorization module's forward pass cost is measured relative to full history concatenation; without this accounting or an ablation isolating the recall mechanism's overhead, the reported savings cannot be independently verified.

    Authors: We agree that an explicit breakdown is needed for independent verification. The reported savings stem from replacing variable-length history concatenation with a fixed-size memory state updated by the lightweight memorization module. In the revised manuscript, we will add a detailed accounting in §4, including the operations considered in the FLOP count and a new ablation isolating the recall mechanism's overhead. revision: yes

  2. Referee: [Table 2] Table 2 (Ablation Studies): The contribution of visual guidance integration to both retrieval metrics and FLOP savings is not isolated; if removing this component degrades performance without proportionally affecting efficiency, the claim that MAQIU maintains high efficiency while strengthening cross-round correlations requires re-examination.

    Authors: We concur that isolating the visual guidance component would strengthen the claims. Table 2 currently focuses on the memorization and recall modules; we will expand it with an additional ablation variant that removes visual guidance integration. The revised table will report effects on both retrieval metrics and FLOP counts to demonstrate that this component improves cross-round correlations with negligible impact on efficiency. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is an independent architectural proposal

full rationale

The paper presents MAQIU as a new system-level addition: a lightweight memorization module that aggregates query intent representations, a recall mechanism to avoid forgetting, and integration of prior retrieval results as visual guidance. No equations, derivations, or parameter-fitting steps are described that would reduce the claimed 86.4% FLOP reduction or retrieval gains to a self-referential definition or fitted input. The efficiency and performance claims rest on experimental measurements and ablations rather than on any self-citation chain or renamed known result. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the effectiveness of a newly introduced memorization module whose internal design and training details are not visible in the abstract; no explicit free parameters or axioms are stated.

invented entities (1)
  • Lightweight memorization module no independent evidence
    purpose: Dynamically aggregate and evolve query intent semantic representations across dialogue rounds
    Presented as the core novel component that replaces concatenation or LLM rewriting

pith-pipeline@v0.9.0 · 5807 in / 1141 out tokens · 41010 ms · 2026-05-20T13:28:25.497573+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

  1. [1]

    Learning semantic structure-preserved embeddings for cross-modal retrieval,

    Y . Wu, S. Wang, and Q. Huang, “Learning semantic structure-preserved embeddings for cross-modal retrieval,” inProceedings of the 26th ACM international conference on Multimedia, 2018, pp. 825–833. 11

  2. [2]

    Fine-grained visual textual alignment for cross- modal retrieval using transformer encoders,

    N. Messina, G. Amato, A. Esuli, F. Falchi, C. Gennaro, and S. Marchand-Maillet, “Fine-grained visual textual alignment for cross- modal retrieval using transformer encoders,”ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 17, no. 4, pp. 1–23, 2021

  3. [3]

    Fine-grained image-text matching by cross-modal hard aligning network,

    Z. Pan, F. Wu, and B. Zhang, “Fine-grained image-text matching by cross-modal hard aligning network,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 19 275–19 284

  4. [4]

    An end-to-end graph attention network hashing for cross-modal retrieval,

    H. Jin, Y . Zhang, L. Shi, S. Zhang, F. Kou, J. Yang, C. Zhu, and J. Luo, “An end-to-end graph attention network hashing for cross-modal retrieval,”Advances in Neural Information Processing Systems, vol. 37, pp. 2106–2126, 2024

  5. [5]

    Achieving ensemble-like performance in a single model: A feature diversification framework for image-text matching,

    Z. Zhou, Y . Wang, W. Zhang, Y . Zheng, X. Du, and C. Jin, “Achieving ensemble-like performance in a single model: A feature diversification framework for image-text matching,” inProceedings of the AAAI Confer- ence on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 879–10 886

  6. [6]

    Efficient token-guided image-text retrieval with consistent multimodal contrastive training,

    C. Liu, Y . Zhang, H. Wang, W. Chen, F. Wang, Y . Huang, Y .-D. Shen, and L. Wang, “Efficient token-guided image-text retrieval with consistent multimodal contrastive training,”IEEE Transactions on Image Processing, 2023

  7. [7]

    Ucpm: Uncertainty-guided cross-modal retrieval with partially mis- matched pairs,

    Q. Zha, X. Liu, Y .-M. Cheung, S.-J. Peng, X. Xu, and N. Wang, “Ucpm: Uncertainty-guided cross-modal retrieval with partially mis- matched pairs,”IEEE Transactions on Image Processing, 2025

  8. [8]

    Dual uncertainty-aware correspondence adapting and retaining for continual composed image retrieval,

    H. Zhou, F. Zhang, and C. Xu, “Dual uncertainty-aware correspondence adapting and retaining for continual composed image retrieval,”IEEE Transactions on Image Processing, vol. 34, pp. 7627–7641, 2025

  9. [9]

    Rebalanced vision-language retrieval considering structure-aware distillation,

    Y . Yang, W. Xi, L. Zhou, and J. Tang, “Rebalanced vision-language retrieval considering structure-aware distillation,”IEEE Transactions on Image Processing, vol. 33, pp. 6881–6892, 2024

  10. [10]

    Chatting makes perfect: Chat-based image retrieval,

    M. Levy, R. Ben-Ari, N. Darshan, and D. Lischinski, “Chatting makes perfect: Chat-based image retrieval,”Advances in Neural Information Processing Systems, vol. 36, pp. 61 437–61 449, 2023

  11. [11]

    Interactive text-to-image retrieval with large language models: A plug-and-play approach,

    S. Lee, S. Yu, J. Park, J. Yi, and S. Yoon, “Interactive text-to-image retrieval with large language models: A plug-and-play approach,” in Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), 2024, pp. 791–809

  12. [12]

    Enhancing inter- active image retrieval with query rewriting using large language models and vision language models,

    H. Zhu, J.-H. Huang, S. Rudinac, and E. Kanoulas, “Enhancing inter- active image retrieval with query rewriting using large language models and vision language models,” inProceedings of the 2024 International Conference on Multimedia Retrieval, 2024, pp. 978–987

  13. [13]

    Diffusion augmented retrieval: A training-free approach to inter- active text-to-image retrieval,

    Z. Long, K. Liang, G. Aragon Camarasa, R. Mccreadie, and P. Hender- son, “Diffusion augmented retrieval: A training-free approach to inter- active text-to-image retrieval,” inProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 823–832

  14. [14]

    Chat-based person retrieval via dialogue-refined cross-modal alignment,

    Y . Bai, Y . Ji, M. Cao, J. Wang, and M. Ye, “Chat-based person retrieval via dialogue-refined cross-modal alignment,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3952– 3962

  15. [15]

    Mai: A multi-turn aggregation- iteration model for composed image retrieval,

    Y . Chen, Z. Yang, J. Xu, and Y . Peng, “Mai: A multi-turn aggregation- iteration model for composed image retrieval,” inThe Thirteenth Inter- national Conference on Learning Representations, 2025

  16. [16]

    Imagescope: Unifying language-guided image retrieval via large multimodal model collective reasoning,

    P. Luo, J. Zhou, T. Xu, Y . Xia, L. Xu, and E. Chen, “Imagescope: Unifying language-guided image retrieval via large multimodal model collective reasoning,” inProceedings of the ACM on Web Conference 2025, 2025, pp. 1666–1682

  17. [17]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  18. [18]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

  19. [19]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  20. [20]

    Pairwise rela- tionship guided deep hashing for cross-modal retrieval,

    E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise rela- tionship guided deep hashing for cross-modal retrieval,” inproceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017

  21. [21]

    Context-aware attention network for image-text retrieval,

    Q. Zhang, Z. Lei, Z. Zhang, and S. Z. Li, “Context-aware attention network for image-text retrieval,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3536– 3545

  22. [22]

    Align before fuse: Vision and language representation learning with momentum distillation,

    J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,”Advances in neural information processing systems, vol. 34, pp. 9694–9705, 2021

  23. [23]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

  24. [24]

    Gssf: Generalized structural sparse function for deep cross-modal metric learning,

    H. Diao, Y . Zhang, S. Gao, J. Zhu, L. Chen, and H. Lu, “Gssf: Generalized structural sparse function for deep cross-modal metric learning,”IEEE Transactions on Image Processing, vol. 33, pp. 6241– 6252, 2024

  25. [25]

    Multi-relational deep hash- ing for cross-modal search,

    X. Liang, E. Yang, Y . Yang, and C. Deng, “Multi-relational deep hash- ing for cross-modal search,”IEEE Transactions on Image Processing, vol. 33, pp. 3009–3020, 2024

  26. [26]

    Composed image retrieval via cross relation network with hierarchical aggregation transformer,

    Q. Yang, M. Ye, Z. Cai, K. Su, and B. Du, “Composed image retrieval via cross relation network with hierarchical aggregation transformer,” IEEE Transactions on Image Processing, vol. 32, pp. 4543–4554, 2023

  27. [27]

    Transformer-xl: Attentive language models beyond a fixed-length context,

    Z. Dai, Z. Yang, Y . Yang, J. G. Carbonell, Q. Le, and R. Salakhutdi- nov, “Transformer-xl: Attentive language models beyond a fixed-length context,” inProceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 2978–2988

  28. [28]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  29. [29]

    Augmenting language models with long-term memory,

    W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,”Advances in Neural Information Processing Systems, vol. 36, pp. 74 530–74 543, 2023

  30. [30]

    Prompting video-language foundation models with domain-specific fine-grained heuristics for video question answering,

    T. Yu, K. Fu, S. Wang, Q. Huang, and J. Yu, “Prompting video-language foundation models with domain-specific fine-grained heuristics for video question answering,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1615–1630, 2024

  31. [31]

    Hmt: Hierarchical memory transformer for efficient long context language processing,

    Z. He, Y . Cao, Z. Qin, N. Prakriya, Y . Sun, and J. Cong, “Hmt: Hierarchical memory transformer for efficient long context language processing,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 8068–8089

  32. [32]

    Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

    H. He, Z. Geng, and Y . Peng, “Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning,”arXiv preprint arXiv:2602.07605, 2026

  33. [33]

    Mitigating hallucinations in large vision-language models via reasoning uncertainty- guided refinement,

    S. Li, X. Xu, W. Meng, J. Song, C. Peng, and H. T. Shen, “Mitigating hallucinations in large vision-language models via reasoning uncertainty- guided refinement,”IEEE Transactions on Multimedia, 2025

  34. [34]

    Prompt learning with knowledge regularization for pre-trained vision-language models,

    B. Guo, L. Li, J. Zhang, Y . Sun, C. Yan, and X. Sheng, “Prompt learning with knowledge regularization for pre-trained vision-language models,” IEEE Transactions on Multimedia, 2025

  35. [35]

    Star: Sensitive trajectory regulation for unlearning in large reasoning models,

    J. Zhou, G. Cong, L. Su, and L. Li, “Star: Sensitive trajectory regulation for unlearning in large reasoning models,”arXiv preprint arXiv:2601.09281, 2026

  36. [36]

    Improving language models by retrieving from trillions of tokens,

    S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Milli- can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clarket al., “Improving language models by retrieving from trillions of tokens,” in International conference on machine learning. PMLR, 2022, pp. 2206– 2240

  37. [37]

    Atlas: Few-shot learning with retrieval augmented language models,

    G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Atlas: Few-shot learning with retrieval augmented language models,”Journal of Machine Learning Research, vol. 24, no. 251, pp. 1–43, 2023

  38. [38]

    Compressive Transformers for Long-Range Sequence Modelling

    J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Com- pressive transformers for long-range sequence modelling,”arXiv preprint arXiv:1911.05507, 2019

  39. [39]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  40. [40]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” inInternational conference on machine learning. PMLR, 2022, pp. 12 888–12 900

  41. [41]

    VSE++: Improving visual-semantic embeddings with hard negatives,

    F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improving visual-semantic embeddings with hard negatives,” inProceedings of the British Machine Vision Conference, 2018, pp. 935–943

  42. [42]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763. 12

  43. [43]

    Visual dialog,

    A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra, “Visual dialog,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 326–335

  44. [44]

    Large-scale pretraining for visual dialog: A simple state-of-the-art baseline,

    V . Murahari, D. Batra, D. Parikh, and A. Das, “Large-scale pretraining for visual dialog: A simple state-of-the-art baseline,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 336–352. Xianke Chenreceived the B.E. degree in network engineering from Zhejiang Gongshang University, Hangzhou, China, in 2020, and the M.E. degree from the College ...

  45. [45]

    His research interests include multimedia understanding, retrieval, and recommendation

    He is currently a Research Professor with the College of Computer Science and Technology, Zhe- jiang Gongshang University, Hangzhou, China. His research interests include multimedia understanding, retrieval, and recommendation. He was awarded the ACM Multimedia Grand Challenge Award and was selected into the Young Elite Scientists Sponsorship Program by t...