pith. sign in

arxiv: 2606.04591 · v1 · pith:QAI4RB2Znew · submitted 2026-06-03 · 💻 cs.CL · cs.CV

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

Pith reviewed 2026-06-28 06:00 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords fine-grained fragment retrievalmulti-modal long-form dialoguesreinforcement learning retrievalfragment embedding modelMLDR datasetF2RVLMFFRS
0
0 comments X

The pith

A generation-based model trained via reinforcement learning with multi-objective rewards retrieves coherent multi-utterance, multi-image fragments from long dialogues more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Fine-grained Fragment Retrieval (FFR) as locating semantically relevant groups of utterances and images on a topic within multi-modal long-form dialogues rather than isolated lines. It introduces F2RVLM, a model trained with reinforcement learning that applies multi-objective rewards and difficulty-aware curriculum sampling to promote coherence across multiple turns and images. For large corpora, FFRS first decomposes dialogues into minimal semantic fragments, indexes them with a Fragment Embedding Model, and then applies F2RVLM for fine-grained selection. The authors release the MLDR dataset, the longest multi-modal dialogue retrieval collection to date, along with a real-world WeChat test set. Experiments show F2RVLM and FFRS outperform existing approaches on both single-dialogue and corpus-level FFR tasks.

Core claim

The authors establish that a generation-based retrieval model trained with reinforcement learning, multi-objective rewards, and difficulty-aware curriculum sampling can locate semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues, and that a two-stage system combining offline fragment indexing with this model yields superior performance on both single-dialogue and corpus-level retrieval benchmarks including the new MLDR dataset.

What carries the argument

F2RVLM, a generation-based retrieval model trained with reinforcement learning using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence.

If this is right

  • FFR within a single dialogue improves when F2RVLM directly reasons over the full conversation history.
  • Corpus-level FFR becomes practical when dialogues are first decomposed into minimal semantic fragments and indexed offline.
  • The MLDR dataset and WeChat test set provide benchmarks that support further development of fragment retrieval systems.
  • Both single-dialogue and corpus-level settings show consistent gains when reinforcement learning is combined with fragment-level indexing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fragment decomposition step could be adapted to retrieve coherent segments from other sequential multi-modal data such as video transcripts with images.
  • Better fragment retrieval may improve downstream applications like topic-focused summarization or context-aware question answering over dialogue histories.
  • The two-stage indexing plus fine-grained reasoning pattern might lower latency in real-time dialogue search systems compared to end-to-end generation over entire corpora.

Load-bearing premise

The multi-objective rewards and difficulty-aware curriculum sampling in F2RVLM produce genuinely more coherent fragments rather than merely optimizing the chosen automatic metrics.

What would settle it

Human raters scoring the semantic coherence and topical relevance of fragments returned by F2RVLM versus baseline retrievers on held-out dialogues, with no measurable preference for the proposed model.

read the original abstract

With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Fine-grained Fragment Retrieval (FFR) for locating semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. It defines two settings (single-dialogue and corpus-level), proposes F2RVLM (a generation-based RL model using multi-objective rewards and difficulty-aware curriculum sampling) for the first setting, and FFRS (a two-stage offline indexing + online retrieval system with a Fragment Embedding Model) for the second. A new dataset MLDR is constructed along with a real-world WeChat test set, and the abstract states that experiments demonstrate superior performance for both F2RVLM and FFRS.

Significance. If the empirical claims hold after proper validation, the work addresses a practical gap in moving beyond utterance-level retrieval to coherent fragment retrieval in multi-modal dialogues. The construction of the longest multi-modal dialogue retrieval dataset to date and the explicit handling of both single-dialogue and open-domain corpus settings constitute clear contributions. The RL-based approach with curriculum sampling is a reasonable direction for the single-dialogue case.

major comments (1)
  1. [Abstract] Abstract: The central claim that F2RVLM produces more coherent fragments rests on the use of multi-objective rewards and difficulty-aware curriculum sampling, yet the abstract provides no definition of the reward functions, no human correlation analysis for coherence, and no ablation isolating these components from automatic-metric overfitting. This assumption is load-bearing for both the single-dialogue superiority claim and the downstream FFRS pipeline.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the contributions. We address the single major comment below and will revise the abstract accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that F2RVLM produces more coherent fragments rests on the use of multi-objective rewards and difficulty-aware curriculum sampling, yet the abstract provides no definition of the reward functions, no human correlation analysis for coherence, and no ablation isolating these components from automatic-metric overfitting. This assumption is load-bearing for both the single-dialogue superiority claim and the downstream FFRS pipeline.

    Authors: We agree the abstract can be strengthened for self-containment. In the revision we will add concise definitions of the multi-objective rewards and difficulty-aware curriculum sampling drawn directly from Sections 3.2 and 3.3. We will also insert a reference to the component ablations already reported in Section 5.3, which isolate the contribution of each reward term and the curriculum strategy. The manuscript does not contain a dedicated human correlation study for the coherence reward; we will therefore not claim one in the revised abstract but can note that the automatic metrics follow conventions validated in prior dialogue work. These changes address the load-bearing concern while remaining faithful to the existing experiments. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical performance claims rest on external benchmarks and constructed datasets

full rationale

The paper introduces FFR task, F2RVLM model (RL-trained with multi-objective rewards and curriculum sampling), and FFRS pipeline, then reports superior results on MLDR and WeChat test sets. No equations, derivations, or parameter-fitting steps appear in the provided text. Central claims are experimental comparisons against baselines; they do not reduce by construction to author-defined inputs, self-citations, or renamed patterns. The coherence assumption is an empirical premise tested via benchmarks rather than a definitional loop. This is the common case of a self-contained empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities beyond the named models and dataset can be extracted.

pith-pipeline@v0.9.1-grok · 5812 in / 1214 out tokens · 32253 ms · 2026-06-28T06:00:32.398864+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

148 extracted references · 35 canonical work pages · 24 internal anchors

  1. [1]

    Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

    Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation , author=. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

  2. [2]

    arXiv:2012.15015 , year=

    Openvidial: A large-scale, open-domain dialogue dataset with visual contexts , author=. arXiv:2012.15015 , year=

  3. [3]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

    Image-Chat: Engaging Grounded Conversations , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

  4. [4]

    VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

    Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents , author=. arXiv preprint arXiv:2507.04590 , year=

  5. [5]

    arXiv preprint arXiv:2404.05961 , year=

    Llm2vec: Large language models are secretly powerful text encoders , author=. arXiv preprint arXiv:2404.05961 , year=

  6. [6]

    Last accessed: Nov 24th , year=

    SFR-Embedding-2: Advanced text embedding with multi-stage training , author=. Last accessed: Nov 24th , year=

  7. [7]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Improving text embeddings with large language models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  8. [8]

    Findings of the Association for Computational Linguistics: ACL 2023 , pages=

    One embedder, any task: Instruction-finetuned text embeddings , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

  9. [9]

    Proceedings of the 31st International Conference on Computational Linguistics: Industry Track , pages=

    Seeing beyond: Enhancing visual question answering with multi-modal retrieval , author=. Proceedings of the 31st International Conference on Computational Linguistics: Industry Track , pages=

  10. [10]

    GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    GME: Improving Universal Multimodal Retrieval by Multimodal LLMs , author=. arXiv preprint arXiv:2412.16855 , year=

  11. [11]

    Introducing Claude Sonnet 4.5 , year =

  12. [12]

    Qwen2.5-VL Technical Report

    Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

  13. [13]

    Information retrieval , volume=

    A comparison of extrinsic clustering evaluation metrics based on formal constraints , author=. Information retrieval , volume=. 2009 , publisher=

  14. [14]

    Cognitive Computation and Systems , volume=

    Research on intelligent service of customer service system , author=. Cognitive Computation and Systems , volume=. 2021 , publisher=

  15. [15]

    The Thirteenth International Conference on Learning Representations , year=

    MM-EMBED: UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS , author=. The Thirteenth International Conference on Learning Representations , year=

  16. [16]

    arXiv preprint arXiv:2508.17714 , year=

    F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model , author=. arXiv preprint arXiv:2508.17714 , year=

  17. [17]

    ACM Transactions on Multimedia Computing, Communications and Applications , volume=

    Domain-aware multimodal dialog systems with distribution-based user characteristic modeling , author=. ACM Transactions on Multimedia Computing, Communications and Applications , volume=. 2024 , publisher=

  18. [18]

    arXiv preprint arXiv:2507.18515 , year=

    A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat , author=. arXiv preprint arXiv:2507.18515 , year=

  19. [19]

    Proceedings

    Dependable multimodal communication and interaction with robotic assistants , author=. Proceedings. 11th IEEE International Workshop on Robot and Human Interactive Communication , pages=. 2002 , organization=

  20. [20]

    Digital Investigation , volume=

    Network and device forensic analysis of android social-messaging applications , author=. Digital Investigation , volume=. 2015 , publisher=

  21. [21]

    Proceedings of 3rd International Conference on Reliability, Infocom Technologies and Optimization , pages=

    Maturity model for features of social messaging applications , author=. Proceedings of 3rd International Conference on Reliability, Infocom Technologies and Optimization , pages=. 2014 , organization=

  22. [22]

    Journal of Organizational and End User Computing (JOEUC) , volume=

    Intelligent customer service system optimization based on artificial intelligence , author=. Journal of Organizational and End User Computing (JOEUC) , volume=. 2024 , publisher=

  23. [23]

    IEEE Transactions on Information Forensics and Security , volume=

    Face clustering: representation and pairwise constraints , author=. IEEE Transactions on Information Forensics and Security , volume=. 2018 , publisher=

  24. [24]

    Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025) , pages=

    jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval , author=. Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025) , pages=

  25. [25]

    MonoQwen: Visual Document Reranking , author=

  26. [26]

    The Eleventh International Conference on Learning Representations , year=

    Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval , author=. The Eleventh International Conference on Learning Representations , year=

  27. [27]

    Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

    MMChat: Multi-Modal Chat Dataset on Social Media , author=. Proceedings of the Thirteenth Language Resources and Evaluation Conference , pages=

  28. [28]

    Proceedings of the 31st ACM International Conference on Multimedia , pages=

    TikTalk: a video-based dialogue dataset for multi-modal chitchat in real world , author=. Proceedings of the 31st ACM International Conference on Multimedia , pages=

  29. [29]

    PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

  30. [30]

    Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) , pages=

  31. [31]

    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  32. [32]

    Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

    DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  33. [33]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  34. [34]

    International conference on machine learning , pages=

    Scaling up visual and vision-language representation learning with noisy text supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  35. [35]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Sigmoid loss for language image pre-training , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  36. [36]

    arXiv:2211.01335 , year=

    Chinese clip: Contrastive vision-language pretraining in chinese , author=. arXiv:2211.01335 , year=

  37. [37]

    International conference on machine learning , pages=

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International conference on machine learning , pages=. 2022 , organization=

  38. [38]

    arXiv:2209.00179 , year=

    Universal vision-language dense retrieval: Learning a unified representation space for multi-modal retrieval , author=. arXiv:2209.00179 , year=

  39. [39]

    European Conference on Computer Vision , pages=

    Uniir: Training and benchmarking universal multimodal information retrievers , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  40. [40]

    Advances in neural information processing systems , volume=

    Visual instruction tuning , author=. Advances in neural information processing systems , volume=

  41. [41]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution , author=. arXiv:2409.12191 , year=

  42. [42]

    VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

    VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks , author=. arXiv:2410.05160 , year=

  43. [43]

    arXiv:2310.14804 , year=

    Large Language Models can Share Images, Too! , author=. arXiv:2310.14804 , year=

  44. [44]

    Pacific-Asia Conference on Knowledge Discovery and Data Mining , pages=

    Multimodal Contrastive Learning for Dialogue Embeddings with Global and Local Views , author=. Pacific-Asia Conference on Knowledge Discovery and Data Mining , pages=. 2025 , organization=

  45. [45]

    EMNLP 2024-2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024 , pages=

    Balancing Visual Context Understanding in Dialogue for Image Retrieval , author=. EMNLP 2024-2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024 , pages=. 2024 , organization=

  46. [46]

    ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Dialclip: Empowering clip as multi-modal dialog retriever , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

  47. [47]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    A new formula for sticker retrieval: Reply with stickers in multi-modal and multi-session conversation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  48. [48]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv:2303.08774 , year=

  49. [49]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Lamra: Large multimodal model as your advanced retrieval assistant , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  50. [50]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv:2501.12948 , year=

  51. [51]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv:2503.14476 , year=

  52. [52]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Visual-rft: Visual reinforcement fine-tuning , author=. arXiv:2503.01785 , year=

  53. [53]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Video-r1: Reinforcing video reasoning in mllms , author=. arXiv:2503.21776 , year=

  54. [54]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl , author=. arXiv:2503.07536 , year=

  55. [55]

    arXiv:2503.20752 , year=

    Reason-rft: Reinforcement fine-tuning for visual reasoning , author=. arXiv:2503.20752 , year=

  56. [56]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning , author=. arXiv:2504.08837 , year=

  57. [57]

    OpenAI o1 System Card

    Openai o1 system card , author=. arXiv:2412.16720 , year=

  58. [58]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks , author=. arXiv:2504.05118 , year=

  59. [59]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Vision-r1: Incentivizing reasoning capability in multimodal large language models , author=. arXiv:2503.06749 , year=

  60. [60]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv:2402.03300 , year=

  61. [61]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Vlm-r1: A stable and generalizable r1-style large vision-language model , author=. arXiv:2504.07615 , year=

  62. [62]

    2022 , url=

    Yirong Chen and Weiquan Fan and Xiaofen Xing and Jianxin Pang and Minlie Huang and Wenjing Han and Qianfeng Tie and Xiangmin Xu , journal=. 2022 , url=

  63. [63]

    Advances in Neural Information Processing Systems , volume=

    CMMA: benchmarking multi-affection detection in chinese multi-modal conversations , author=. Advances in Neural Information Processing Systems , volume=

  64. [64]

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  65. [65]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv:2505.09388 , year=

  66. [66]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv:2507.06261 , year=

  67. [67]

    International conference on machine learning , pages=

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

  68. [68]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  69. [69]

    Proximal Policy Optimization Algorithms

    Proximal policy optimization algorithms , author=. arXiv:1707.06347 , year=

  70. [70]

    2024 , eprint=

    SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning , author=. 2024 , eprint=

  71. [71]

    , author=

    Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=

  72. [72]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding , author=. arXiv:2412.10302 , year=

  73. [73]

    2024 , journal=

    Ovis: Structural Embedding Alignment for Multimodal Large Language Model , author=. 2024 , journal=

  74. [74]

    2025 , eprint=

    MiMo-VL Technical Report , author=. 2025 , eprint=

  75. [75]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  76. [76]

    E5-V: Universal Embeddings with Multimodal Large Language Models

    E5-v: Universal embeddings with multimodal large language models , author=. arXiv:2407.12580 , year=

  77. [77]

    2024 , eprint=

    mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models , author=. 2024 , eprint=

  78. [78]

    NeurIPS , year =

    Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , title =. NeurIPS , year =

  79. [79]

    Seed1.5-VL Technical Report

    Seed1. 5-vl technical report , author=. arXiv:2505.07062 , year=

  80. [80]

    (2023) Gpt-4 technical report

    Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, et al. (2023) Gpt-4 technical report. arXiv:230308774

Showing first 80 references.