Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao · 2024 · arXiv 2405.20204

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · baseline: 2

citation-polarity summary

background: 2 · baseline: 2

representative citing papers

TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.

MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

cs.IR · 2026-04-08 · unverdicted · novelty 7.0

MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text performance unchanged.

One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

Multimodal embedding distance predicts typographic attack success rate across VLMs, and optimizing it under bounded perturbations on surrogates exposes two co-occurring failure modes of lost readability and reduced safety refusals.

Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

cs.CV · 2026-04-14 · unverdicted · novelty 5.0

Text-image embedding distance negatively correlates with typographic attack success rates (r = -0.71 to -0.93) on VLMs, with font size and image degradations strongly modulating effectiveness.

BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

cs.IR · 2026-04-08 · unverdicted · novelty 5.0

BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retriever at 33.3.

citing papers explorer

Showing 6 of 6 citing papers.

TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations · cs.LG · 2026-05-04 · unverdicted · none · ref 17
MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL · cs.IR · 2026-04-08 · unverdicted · none · ref 18
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers · cs.CL · 2026-05-08 · unverdicted · none · ref 18 · 2 links
One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations · cs.CV · 2026-04-28 · unverdicted · none · ref 7
Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models · cs.CV · 2026-04-14 · unverdicted · none · ref 1
BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment · cs.IR · 2026-04-08 · unverdicted · none · ref 18
