Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

Daizong Liu; Jianfeng Dong; Shuhui Wang; Xianke Chen; Xin Tan; Xun Wang; Xun Yang; Yushuo Lou

arxiv: 2605.17365 · v1 · pith:AL27M544new · submitted 2026-05-17 · 💻 cs.CV

Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

Xianke Chen , Daizong Liu , Yushuo Lou , Xin Tan , Xun Yang , Shuhui Wang , Xun Wang , Jianfeng Dong This is my paper

Pith reviewed 2026-05-20 13:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords chat-based image retrievalquery intent understandingmemory-augmented modelmulti-round dialogueefficient retrievalintent evolutionvisual guidance

0 comments

The pith

A lightweight memory module tracks evolving user intent across chat rounds for image retrieval without reprocessing full history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MAQIU, a framework that maintains a compact, evolving representation of what the user is seeking during multi-turn conversations aimed at finding specific images. Rather than concatenating every past query into an ever-longer text string or invoking a large model to rewrite the current request, the system uses a small memorization component that updates the intent summary step by step. A recall step guards against dropping earlier details, and previous retrieval results are fed back as visual hints to tighten the match. A reader would care because this style of interactive search can produce far more precise results than one-shot queries, yet existing solutions become expensive and prone to losing track of the original goal as the conversation lengthens.

Core claim

MAQIU introduces a memory-based user intent updating framework consisting of a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues. A memory recall mechanism prevents intent forgetting and strengthens long-term semantic integrity, while historical image retrieval results are integrated as visual guidance to refine cross-round correlations.

What carries the argument

The lightweight memorization module that aggregates and evolves semantic query intent representations round by round, paired with a memory recall mechanism to avoid forgetting earlier details.

If this is right

Dialogue encoding computation drops sharply because only the compact memory state is updated instead of the full history.
Retrieval quality improves by preserving long-term intent and by using prior image results as visual context.
The same memory structure can be applied to any multi-turn clarification task without requiring larger language models for query rewriting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may scale to longer conversations where full-history approaches become impractical.
Similar memory modules could be tested in conversational recommendation or visual question answering to reduce repeated context processing.

Load-bearing premise

A lightweight memorization module can reliably aggregate and evolve semantic intent representations across rounds without introducing inconsistencies or forgetting key details that would degrade retrieval quality.

What would settle it

A test in which the memory module drops a critical constraint stated in the first round and later returns images that violate that constraint while the full-history baseline still satisfies it.

Figures

Figures reproduced from arXiv: 2605.17365 by Daizong Liu, Jianfeng Dong, Shuhui Wang, Xianke Chen, Xin Tan, Xun Wang, Xun Yang, Yushuo Lou.

**Figure 2.** Figure 2: Illustration of our motivation. Unlike previous methods that simply integrate dialogue contexts via either multi-round concatenation or LLM-based [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the proposed MAQIU framework for chat-based image retrieval. Given the initial query and multi-round dialogue interactions, the [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of Hit@10 and Recall@10 across dialogue rounds on VisDial under Original (a) and Reconstructed (b) dialogue settings. While Hit@10 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Performance comparison on dialogue datasets ChatGPT-BLIP2 and [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation on historical dialogue repository construction in terms of (a) [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Evaluation of memory recall designs. (a) similarity-based recall [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 9.** Figure 9: Qualitative multi-round retrieval comparison with different baselines. Each example shows the evolving dialogue on the left, baseline retrieval at rounds [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

read the original abstract

Different from traditional text-to-image retrieval tasks, chat-based image retrieval allows the human-interactive system to iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user's query intent across dialogue rounds. Although existing works have achieved great performance on this new task, they simply handle history query information either by directly concatenating all previous queries into a long textual sequence or by relying on large language models to reconstruct the current query from history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for the chat-based image retrieval task, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues, while a memory recall mechanism is further employed to prevent intent forgetting and enhance long-term semantic integrity. In addition, MAQIU also integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4\% compared with the prior baseline ChatIR. Source code is available at https://github.com/HuiGuanLab/MAQIU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAQIU adds a lightweight memory module with recall and visual history to track evolving intent in chat-based image retrieval, cutting FLOPs substantially while improving results.

read the letter

The main point is that this paper builds a memory-augmented system for chat-based image retrieval that keeps user intent consistent across turns without concatenating full history or leaning on large language models. The lightweight memorization module aggregates semantic representations on the fly, a recall step guards against forgetting, and past retrieval results supply visual guidance to tighten cross-round links. That combination is the concrete addition over prior work like ChatIR.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes MAQIU, a memory-augmented framework for chat-based image retrieval. It introduces a lightweight memorization module to dynamically aggregate and evolve semantic query intent representations across multi-round dialogues, employs a memory recall mechanism to prevent forgetting, and integrates historical image retrieval results as visual guidance. Experiments claim substantial performance gains over baselines such as ChatIR while reducing dialogue encoding FLOPs by 86.4%.

Significance. If the reported efficiency and performance results hold under full verification, the work addresses a practical bottleneck in interactive retrieval by replacing redundant history concatenation or LLM-based reconstruction with a lightweight memory module. The 86.4% FLOP reduction and source-code release are concrete strengths that could influence design of efficient dialogue-driven vision systems.

major comments (2)

[§4] §4 (Experimental Setup): The central efficiency claim of 86.4% FLOP reduction is load-bearing, yet the manuscript provides no explicit breakdown of how the memorization module's forward pass cost is measured relative to full history concatenation; without this accounting or an ablation isolating the recall mechanism's overhead, the reported savings cannot be independently verified.
[Table 2] Table 2 (Ablation Studies): The contribution of visual guidance integration to both retrieval metrics and FLOP savings is not isolated; if removing this component degrades performance without proportionally affecting efficiency, the claim that MAQIU maintains high efficiency while strengthening cross-round correlations requires re-examination.

minor comments (3)

[Abstract] Abstract: The phrase 'substantial performance gains' should be replaced with concrete deltas (e.g., +X% Recall@10) to allow readers to assess magnitude without consulting the full results section.
[§3.1] §3.1: Notation for the memory state update equation is introduced without an accompanying diagram label; adding an explicit equation number would improve traceability to the recall mechanism.
[Figure 3] Figure 3: The efficiency plot axes lack units for FLOPs per dialogue turn; clarifying whether measurements include or exclude the visual guidance branch would prevent misinterpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive feedback. We appreciate the recognition of the practical value of the efficiency gains and source-code release. We address each major comment below with clarifications and commitments to revisions.

read point-by-point responses

Referee: [§4] §4 (Experimental Setup): The central efficiency claim of 86.4% FLOP reduction is load-bearing, yet the manuscript provides no explicit breakdown of how the memorization module's forward pass cost is measured relative to full history concatenation; without this accounting or an ablation isolating the recall mechanism's overhead, the reported savings cannot be independently verified.

Authors: We agree that an explicit breakdown is needed for independent verification. The reported savings stem from replacing variable-length history concatenation with a fixed-size memory state updated by the lightweight memorization module. In the revised manuscript, we will add a detailed accounting in §4, including the operations considered in the FLOP count and a new ablation isolating the recall mechanism's overhead. revision: yes
Referee: [Table 2] Table 2 (Ablation Studies): The contribution of visual guidance integration to both retrieval metrics and FLOP savings is not isolated; if removing this component degrades performance without proportionally affecting efficiency, the claim that MAQIU maintains high efficiency while strengthening cross-round correlations requires re-examination.

Authors: We concur that isolating the visual guidance component would strengthen the claims. Table 2 currently focuses on the memorization and recall modules; we will expand it with an additional ablation variant that removes visual guidance integration. The revised table will report effects on both retrieval metrics and FLOP counts to demonstrate that this component improves cross-round correlations with negligible impact on efficiency. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is an independent architectural proposal

full rationale

The paper presents MAQIU as a new system-level addition: a lightweight memorization module that aggregates query intent representations, a recall mechanism to avoid forgetting, and integration of prior retrieval results as visual guidance. No equations, derivations, or parameter-fitting steps are described that would reduce the claimed 86.4% FLOP reduction or retrieval gains to a self-referential definition or fitted input. The efficiency and performance claims rest on experimental measurements and ablations rather than on any self-citation chain or renamed known result. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the effectiveness of a newly introduced memorization module whose internal design and training details are not visible in the abstract; no explicit free parameters or axioms are stated.

invented entities (1)

Lightweight memorization module no independent evidence
purpose: Dynamically aggregate and evolve query intent semantic representations across dialogue rounds
Presented as the core novel component that replaces concatenation or LLM rewriting

pith-pipeline@v0.9.0 · 5807 in / 1141 out tokens · 41010 ms · 2026-05-20T13:28:25.497573+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MAQIU encodes only the current-round query while accumulating dialogue semantics through a fixed set of memory tokens, keeping both the token length and FLOPs nearly constant throughout the dialogue.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

progressive dialogue-semantic memorization mechanism, which represents the evolving query intent with a fixed set of memory tokens and updates them round by round

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

[1]

Learning semantic structure-preserved embeddings for cross-modal retrieval,

Y . Wu, S. Wang, and Q. Huang, “Learning semantic structure-preserved embeddings for cross-modal retrieval,” inProceedings of the 26th ACM international conference on Multimedia, 2018, pp. 825–833. 11

work page 2018
[2]

Fine-grained visual textual alignment for cross- modal retrieval using transformer encoders,

N. Messina, G. Amato, A. Esuli, F. Falchi, C. Gennaro, and S. Marchand-Maillet, “Fine-grained visual textual alignment for cross- modal retrieval using transformer encoders,”ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 17, no. 4, pp. 1–23, 2021

work page 2021
[3]

Fine-grained image-text matching by cross-modal hard aligning network,

Z. Pan, F. Wu, and B. Zhang, “Fine-grained image-text matching by cross-modal hard aligning network,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 19 275–19 284

work page 2023
[4]

An end-to-end graph attention network hashing for cross-modal retrieval,

H. Jin, Y . Zhang, L. Shi, S. Zhang, F. Kou, J. Yang, C. Zhu, and J. Luo, “An end-to-end graph attention network hashing for cross-modal retrieval,”Advances in Neural Information Processing Systems, vol. 37, pp. 2106–2126, 2024

work page 2024
[5]

Achieving ensemble-like performance in a single model: A feature diversification framework for image-text matching,

Z. Zhou, Y . Wang, W. Zhang, Y . Zheng, X. Du, and C. Jin, “Achieving ensemble-like performance in a single model: A feature diversification framework for image-text matching,” inProceedings of the AAAI Confer- ence on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 879–10 886

work page 2025
[6]

Efficient token-guided image-text retrieval with consistent multimodal contrastive training,

C. Liu, Y . Zhang, H. Wang, W. Chen, F. Wang, Y . Huang, Y .-D. Shen, and L. Wang, “Efficient token-guided image-text retrieval with consistent multimodal contrastive training,”IEEE Transactions on Image Processing, 2023

work page 2023
[7]

Ucpm: Uncertainty-guided cross-modal retrieval with partially mis- matched pairs,

Q. Zha, X. Liu, Y .-M. Cheung, S.-J. Peng, X. Xu, and N. Wang, “Ucpm: Uncertainty-guided cross-modal retrieval with partially mis- matched pairs,”IEEE Transactions on Image Processing, 2025

work page 2025
[8]

Dual uncertainty-aware correspondence adapting and retaining for continual composed image retrieval,

H. Zhou, F. Zhang, and C. Xu, “Dual uncertainty-aware correspondence adapting and retaining for continual composed image retrieval,”IEEE Transactions on Image Processing, vol. 34, pp. 7627–7641, 2025

work page 2025
[9]

Rebalanced vision-language retrieval considering structure-aware distillation,

Y . Yang, W. Xi, L. Zhou, and J. Tang, “Rebalanced vision-language retrieval considering structure-aware distillation,”IEEE Transactions on Image Processing, vol. 33, pp. 6881–6892, 2024

work page 2024
[10]

Chatting makes perfect: Chat-based image retrieval,

M. Levy, R. Ben-Ari, N. Darshan, and D. Lischinski, “Chatting makes perfect: Chat-based image retrieval,”Advances in Neural Information Processing Systems, vol. 36, pp. 61 437–61 449, 2023

work page 2023
[11]

Interactive text-to-image retrieval with large language models: A plug-and-play approach,

S. Lee, S. Yu, J. Park, J. Yi, and S. Yoon, “Interactive text-to-image retrieval with large language models: A plug-and-play approach,” in Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), 2024, pp. 791–809

work page 2024
[12]

Enhancing inter- active image retrieval with query rewriting using large language models and vision language models,

H. Zhu, J.-H. Huang, S. Rudinac, and E. Kanoulas, “Enhancing inter- active image retrieval with query rewriting using large language models and vision language models,” inProceedings of the 2024 International Conference on Multimedia Retrieval, 2024, pp. 978–987

work page 2024
[13]

Diffusion augmented retrieval: A training-free approach to inter- active text-to-image retrieval,

Z. Long, K. Liang, G. Aragon Camarasa, R. Mccreadie, and P. Hender- son, “Diffusion augmented retrieval: A training-free approach to inter- active text-to-image retrieval,” inProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 823–832

work page 2025
[14]

Chat-based person retrieval via dialogue-refined cross-modal alignment,

Y . Bai, Y . Ji, M. Cao, J. Wang, and M. Ye, “Chat-based person retrieval via dialogue-refined cross-modal alignment,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3952– 3962

work page 2025
[15]

Mai: A multi-turn aggregation- iteration model for composed image retrieval,

Y . Chen, Z. Yang, J. Xu, and Y . Peng, “Mai: A multi-turn aggregation- iteration model for composed image retrieval,” inThe Thirteenth Inter- national Conference on Learning Representations, 2025

work page 2025
[16]

Imagescope: Unifying language-guided image retrieval via large multimodal model collective reasoning,

P. Luo, J. Zhou, T. Xu, Y . Xia, L. Xu, and E. Chen, “Imagescope: Unifying language-guided image retrieval via large multimodal model collective reasoning,” inProceedings of the ACM on Web Conference 2025, 2025, pp. 1666–1682

work page 2025
[17]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

work page 2022
[19]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Pairwise rela- tionship guided deep hashing for cross-modal retrieval,

E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise rela- tionship guided deep hashing for cross-modal retrieval,” inproceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017

work page 2017
[21]

Context-aware attention network for image-text retrieval,

Q. Zhang, Z. Lei, Z. Zhang, and S. Z. Li, “Context-aware attention network for image-text retrieval,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3536– 3545

work page 2020
[22]

Align before fuse: Vision and language representation learning with momentum distillation,

J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,”Advances in neural information processing systems, vol. 34, pp. 9694–9705, 2021

work page 2021
[23]

Scaling up visual and vision-language representation learning with noisy text supervision,

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

work page 2021
[24]

Gssf: Generalized structural sparse function for deep cross-modal metric learning,

H. Diao, Y . Zhang, S. Gao, J. Zhu, L. Chen, and H. Lu, “Gssf: Generalized structural sparse function for deep cross-modal metric learning,”IEEE Transactions on Image Processing, vol. 33, pp. 6241– 6252, 2024

work page 2024
[25]

Multi-relational deep hash- ing for cross-modal search,

X. Liang, E. Yang, Y . Yang, and C. Deng, “Multi-relational deep hash- ing for cross-modal search,”IEEE Transactions on Image Processing, vol. 33, pp. 3009–3020, 2024

work page 2024
[26]

Composed image retrieval via cross relation network with hierarchical aggregation transformer,

Q. Yang, M. Ye, Z. Cai, K. Su, and B. Du, “Composed image retrieval via cross relation network with hierarchical aggregation transformer,” IEEE Transactions on Image Processing, vol. 32, pp. 4543–4554, 2023

work page 2023
[27]

Transformer-xl: Attentive language models beyond a fixed-length context,

Z. Dai, Z. Yang, Y . Yang, J. G. Carbonell, Q. Le, and R. Salakhutdi- nov, “Transformer-xl: Attentive language models beyond a fixed-length context,” inProceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 2978–2988

work page 2019
[28]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

work page 2020
[29]

Augmenting language models with long-term memory,

W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,”Advances in Neural Information Processing Systems, vol. 36, pp. 74 530–74 543, 2023

work page 2023
[30]

Prompting video-language foundation models with domain-specific fine-grained heuristics for video question answering,

T. Yu, K. Fu, S. Wang, Q. Huang, and J. Yu, “Prompting video-language foundation models with domain-specific fine-grained heuristics for video question answering,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1615–1630, 2024

work page 2024
[31]

Hmt: Hierarchical memory transformer for efficient long context language processing,

Z. He, Y . Cao, Z. Qin, N. Prakriya, Y . Sun, and J. Cong, “Hmt: Hierarchical memory transformer for efficient long context language processing,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 8068–8089

work page 2025
[32]

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

H. He, Z. Geng, and Y . Peng, “Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning,”arXiv preprint arXiv:2602.07605, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Mitigating hallucinations in large vision-language models via reasoning uncertainty- guided refinement,

S. Li, X. Xu, W. Meng, J. Song, C. Peng, and H. T. Shen, “Mitigating hallucinations in large vision-language models via reasoning uncertainty- guided refinement,”IEEE Transactions on Multimedia, 2025

work page 2025
[34]

Prompt learning with knowledge regularization for pre-trained vision-language models,

B. Guo, L. Li, J. Zhang, Y . Sun, C. Yan, and X. Sheng, “Prompt learning with knowledge regularization for pre-trained vision-language models,” IEEE Transactions on Multimedia, 2025

work page 2025
[35]

Star: Sensitive trajectory regulation for unlearning in large reasoning models,

J. Zhou, G. Cong, L. Su, and L. Li, “Star: Sensitive trajectory regulation for unlearning in large reasoning models,”arXiv preprint arXiv:2601.09281, 2026

work page arXiv 2026
[36]

Improving language models by retrieving from trillions of tokens,

S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Milli- can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clarket al., “Improving language models by retrieving from trillions of tokens,” in International conference on machine learning. PMLR, 2022, pp. 2206– 2240

work page 2022
[37]

Atlas: Few-shot learning with retrieval augmented language models,

G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Atlas: Few-shot learning with retrieval augmented language models,”Journal of Machine Learning Research, vol. 24, no. 251, pp. 1–43, 2023

work page 2023
[38]

Compressive Transformers for Long-Range Sequence Modelling

J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Com- pressive transformers for long-range sequence modelling,”arXiv preprint arXiv:1911.05507, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[39]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[40]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” inInternational conference on machine learning. PMLR, 2022, pp. 12 888–12 900

work page 2022
[41]

VSE++: Improving visual-semantic embeddings with hard negatives,

F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improving visual-semantic embeddings with hard negatives,” inProceedings of the British Machine Vision Conference, 2018, pp. 935–943

work page 2018
[42]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763. 12

work page 2021
[43]

Visual dialog,

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra, “Visual dialog,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 326–335

work page 2017
[44]

Large-scale pretraining for visual dialog: A simple state-of-the-art baseline,

V . Murahari, D. Batra, D. Parikh, and A. Das, “Large-scale pretraining for visual dialog: A simple state-of-the-art baseline,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 336–352. Xianke Chenreceived the B.E. degree in network engineering from Zhejiang Gongshang University, Hangzhou, China, in 2020, and the M.E. degree from the College ...

work page 2020
[45]

His research interests include multimedia understanding, retrieval, and recommendation

He is currently a Research Professor with the College of Computer Science and Technology, Zhe- jiang Gongshang University, Hangzhou, China. His research interests include multimedia understanding, retrieval, and recommendation. He was awarded the ACM Multimedia Grand Challenge Award and was selected into the Young Elite Scientists Sponsorship Program by t...

work page

[1] [1]

Learning semantic structure-preserved embeddings for cross-modal retrieval,

Y . Wu, S. Wang, and Q. Huang, “Learning semantic structure-preserved embeddings for cross-modal retrieval,” inProceedings of the 26th ACM international conference on Multimedia, 2018, pp. 825–833. 11

work page 2018

[2] [2]

Fine-grained visual textual alignment for cross- modal retrieval using transformer encoders,

N. Messina, G. Amato, A. Esuli, F. Falchi, C. Gennaro, and S. Marchand-Maillet, “Fine-grained visual textual alignment for cross- modal retrieval using transformer encoders,”ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 17, no. 4, pp. 1–23, 2021

work page 2021

[3] [3]

Fine-grained image-text matching by cross-modal hard aligning network,

Z. Pan, F. Wu, and B. Zhang, “Fine-grained image-text matching by cross-modal hard aligning network,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 19 275–19 284

work page 2023

[4] [4]

An end-to-end graph attention network hashing for cross-modal retrieval,

H. Jin, Y . Zhang, L. Shi, S. Zhang, F. Kou, J. Yang, C. Zhu, and J. Luo, “An end-to-end graph attention network hashing for cross-modal retrieval,”Advances in Neural Information Processing Systems, vol. 37, pp. 2106–2126, 2024

work page 2024

[5] [5]

Achieving ensemble-like performance in a single model: A feature diversification framework for image-text matching,

Z. Zhou, Y . Wang, W. Zhang, Y . Zheng, X. Du, and C. Jin, “Achieving ensemble-like performance in a single model: A feature diversification framework for image-text matching,” inProceedings of the AAAI Confer- ence on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 879–10 886

work page 2025

[6] [6]

Efficient token-guided image-text retrieval with consistent multimodal contrastive training,

C. Liu, Y . Zhang, H. Wang, W. Chen, F. Wang, Y . Huang, Y .-D. Shen, and L. Wang, “Efficient token-guided image-text retrieval with consistent multimodal contrastive training,”IEEE Transactions on Image Processing, 2023

work page 2023

[7] [7]

Ucpm: Uncertainty-guided cross-modal retrieval with partially mis- matched pairs,

Q. Zha, X. Liu, Y .-M. Cheung, S.-J. Peng, X. Xu, and N. Wang, “Ucpm: Uncertainty-guided cross-modal retrieval with partially mis- matched pairs,”IEEE Transactions on Image Processing, 2025

work page 2025

[8] [8]

Dual uncertainty-aware correspondence adapting and retaining for continual composed image retrieval,

H. Zhou, F. Zhang, and C. Xu, “Dual uncertainty-aware correspondence adapting and retaining for continual composed image retrieval,”IEEE Transactions on Image Processing, vol. 34, pp. 7627–7641, 2025

work page 2025

[9] [9]

Rebalanced vision-language retrieval considering structure-aware distillation,

Y . Yang, W. Xi, L. Zhou, and J. Tang, “Rebalanced vision-language retrieval considering structure-aware distillation,”IEEE Transactions on Image Processing, vol. 33, pp. 6881–6892, 2024

work page 2024

[10] [10]

Chatting makes perfect: Chat-based image retrieval,

M. Levy, R. Ben-Ari, N. Darshan, and D. Lischinski, “Chatting makes perfect: Chat-based image retrieval,”Advances in Neural Information Processing Systems, vol. 36, pp. 61 437–61 449, 2023

work page 2023

[11] [11]

Interactive text-to-image retrieval with large language models: A plug-and-play approach,

S. Lee, S. Yu, J. Park, J. Yi, and S. Yoon, “Interactive text-to-image retrieval with large language models: A plug-and-play approach,” in Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), 2024, pp. 791–809

work page 2024

[12] [12]

Enhancing inter- active image retrieval with query rewriting using large language models and vision language models,

H. Zhu, J.-H. Huang, S. Rudinac, and E. Kanoulas, “Enhancing inter- active image retrieval with query rewriting using large language models and vision language models,” inProceedings of the 2024 International Conference on Multimedia Retrieval, 2024, pp. 978–987

work page 2024

[13] [13]

Diffusion augmented retrieval: A training-free approach to inter- active text-to-image retrieval,

Z. Long, K. Liang, G. Aragon Camarasa, R. Mccreadie, and P. Hender- son, “Diffusion augmented retrieval: A training-free approach to inter- active text-to-image retrieval,” inProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 823–832

work page 2025

[14] [14]

Chat-based person retrieval via dialogue-refined cross-modal alignment,

Y . Bai, Y . Ji, M. Cao, J. Wang, and M. Ye, “Chat-based person retrieval via dialogue-refined cross-modal alignment,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3952– 3962

work page 2025

[15] [15]

Mai: A multi-turn aggregation- iteration model for composed image retrieval,

Y . Chen, Z. Yang, J. Xu, and Y . Peng, “Mai: A multi-turn aggregation- iteration model for composed image retrieval,” inThe Thirteenth Inter- national Conference on Learning Representations, 2025

work page 2025

[16] [16]

Imagescope: Unifying language-guided image retrieval via large multimodal model collective reasoning,

P. Luo, J. Zhou, T. Xu, Y . Xia, L. Xu, and E. Chen, “Imagescope: Unifying language-guided image retrieval via large multimodal model collective reasoning,” inProceedings of the ACM on Web Conference 2025, 2025, pp. 1666–1682

work page 2025

[17] [17]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

work page 2022

[19] [19]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Pairwise rela- tionship guided deep hashing for cross-modal retrieval,

E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise rela- tionship guided deep hashing for cross-modal retrieval,” inproceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017

work page 2017

[21] [21]

Context-aware attention network for image-text retrieval,

Q. Zhang, Z. Lei, Z. Zhang, and S. Z. Li, “Context-aware attention network for image-text retrieval,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3536– 3545

work page 2020

[22] [22]

Align before fuse: Vision and language representation learning with momentum distillation,

J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,”Advances in neural information processing systems, vol. 34, pp. 9694–9705, 2021

work page 2021

[23] [23]

Scaling up visual and vision-language representation learning with noisy text supervision,

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

work page 2021

[24] [24]

Gssf: Generalized structural sparse function for deep cross-modal metric learning,

H. Diao, Y . Zhang, S. Gao, J. Zhu, L. Chen, and H. Lu, “Gssf: Generalized structural sparse function for deep cross-modal metric learning,”IEEE Transactions on Image Processing, vol. 33, pp. 6241– 6252, 2024

work page 2024

[25] [25]

Multi-relational deep hash- ing for cross-modal search,

X. Liang, E. Yang, Y . Yang, and C. Deng, “Multi-relational deep hash- ing for cross-modal search,”IEEE Transactions on Image Processing, vol. 33, pp. 3009–3020, 2024

work page 2024

[26] [26]

Composed image retrieval via cross relation network with hierarchical aggregation transformer,

Q. Yang, M. Ye, Z. Cai, K. Su, and B. Du, “Composed image retrieval via cross relation network with hierarchical aggregation transformer,” IEEE Transactions on Image Processing, vol. 32, pp. 4543–4554, 2023

work page 2023

[27] [27]

Transformer-xl: Attentive language models beyond a fixed-length context,

Z. Dai, Z. Yang, Y . Yang, J. G. Carbonell, Q. Le, and R. Salakhutdi- nov, “Transformer-xl: Attentive language models beyond a fixed-length context,” inProceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 2978–2988

work page 2019

[28] [28]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

work page 2020

[29] [29]

Augmenting language models with long-term memory,

W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,”Advances in Neural Information Processing Systems, vol. 36, pp. 74 530–74 543, 2023

work page 2023

[30] [30]

Prompting video-language foundation models with domain-specific fine-grained heuristics for video question answering,

T. Yu, K. Fu, S. Wang, Q. Huang, and J. Yu, “Prompting video-language foundation models with domain-specific fine-grained heuristics for video question answering,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 2, pp. 1615–1630, 2024

work page 2024

[31] [31]

Hmt: Hierarchical memory transformer for efficient long context language processing,

Z. He, Y . Cao, Z. Qin, N. Prakriya, Y . Sun, and J. Cong, “Hmt: Hierarchical memory transformer for efficient long context language processing,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 8068–8089

work page 2025

[32] [32]

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

H. He, Z. Geng, and Y . Peng, “Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning,”arXiv preprint arXiv:2602.07605, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [33]

Mitigating hallucinations in large vision-language models via reasoning uncertainty- guided refinement,

S. Li, X. Xu, W. Meng, J. Song, C. Peng, and H. T. Shen, “Mitigating hallucinations in large vision-language models via reasoning uncertainty- guided refinement,”IEEE Transactions on Multimedia, 2025

work page 2025

[34] [34]

Prompt learning with knowledge regularization for pre-trained vision-language models,

B. Guo, L. Li, J. Zhang, Y . Sun, C. Yan, and X. Sheng, “Prompt learning with knowledge regularization for pre-trained vision-language models,” IEEE Transactions on Multimedia, 2025

work page 2025

[35] [35]

Star: Sensitive trajectory regulation for unlearning in large reasoning models,

J. Zhou, G. Cong, L. Su, and L. Li, “Star: Sensitive trajectory regulation for unlearning in large reasoning models,”arXiv preprint arXiv:2601.09281, 2026

work page arXiv 2026

[36] [36]

Improving language models by retrieving from trillions of tokens,

S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Milli- can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clarket al., “Improving language models by retrieving from trillions of tokens,” in International conference on machine learning. PMLR, 2022, pp. 2206– 2240

work page 2022

[37] [37]

Atlas: Few-shot learning with retrieval augmented language models,

G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Atlas: Few-shot learning with retrieval augmented language models,”Journal of Machine Learning Research, vol. 24, no. 251, pp. 1–43, 2023

work page 2023

[38] [38]

Compressive Transformers for Long-Range Sequence Modelling

J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Com- pressive transformers for long-range sequence modelling,”arXiv preprint arXiv:1911.05507, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[39] [39]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017

[40] [40]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” inInternational conference on machine learning. PMLR, 2022, pp. 12 888–12 900

work page 2022

[41] [41]

VSE++: Improving visual-semantic embeddings with hard negatives,

F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improving visual-semantic embeddings with hard negatives,” inProceedings of the British Machine Vision Conference, 2018, pp. 935–943

work page 2018

[42] [42]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763. 12

work page 2021

[43] [43]

Visual dialog,

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra, “Visual dialog,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 326–335

work page 2017

[44] [44]

Large-scale pretraining for visual dialog: A simple state-of-the-art baseline,

V . Murahari, D. Batra, D. Parikh, and A. Das, “Large-scale pretraining for visual dialog: A simple state-of-the-art baseline,” inEuropean Conference on Computer Vision. Springer, 2020, pp. 336–352. Xianke Chenreceived the B.E. degree in network engineering from Zhejiang Gongshang University, Hangzhou, China, in 2020, and the M.E. degree from the College ...

work page 2020

[45] [45]

His research interests include multimedia understanding, retrieval, and recommendation

He is currently a Research Professor with the College of Computer Science and Technology, Zhe- jiang Gongshang University, Hangzhou, China. His research interests include multimedia understanding, retrieval, and recommendation. He was awarded the ACM Multimedia Grand Challenge Award and was selected into the Young Elite Scientists Sponsorship Program by t...

work page