hub

mgte: Generalized long-context text representation and reranking models for multilingual text retrieval

Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, 1 others · 2024 · arXiv 2407.19669

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited exploration in unconstrained systems.

ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.

REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.

Human-Inspired Context-Selective Multimodal Memory for Social Robots

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.

OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models

cs.AI · 2025-10-01 · unverdicted · novelty 5.0

OntoLogX is a system that applies LLMs with ontology guidance, RAG, and iterative fixes to build valid knowledge graphs from cybersecurity logs and predict ATT&CK tactics from aggregated sessions.

Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

cs.IR · 2025-07-11 · unverdicted · novelty 5.0

Language composition in training data creates opposing effects on CLIR and mono-IR performance for Korean-English retrieval, which model merging can partially resolve.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

cs.CL · 2025-02-04 · unverdicted · novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

Granite Embedding Multilingual R2 Models

cs.IR · 2026-05-13 · unverdicted · novelty 4.0

Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.

From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

cs.CL · 2025-07-10 · unverdicted · novelty 4.0

Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

cs.CL · 2026-05-13 · unverdicted · novelty 3.0 · 2 refs

A system using XLM-RoBERTa, GPT-4 back-translation augmentation, undersampling, and language-specific threshold tuning reports 2-5% F1 gains on multilingual slur reclamation detection.

citing papers explorer

Showing 10 of 10 citing papers.

Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild cs.HC · 2026-05-07 · unverdicted · none · ref 115
Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited exploration in unconstrained systems.
ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation cs.CL · 2026-04-22 · unverdicted · none · ref 8
ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.
REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning cs.CL · 2026-04-19 · unverdicted · none · ref 68
REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
Human-Inspired Context-Selective Multimodal Memory for Social Robots cs.AI · 2026-04-13 · unverdicted · none · ref 61
A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.
OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models cs.AI · 2025-10-01 · unverdicted · none · ref 56
OntoLogX is a system that applies LLMs with ontology guidance, RAG, and iterative fixes to build valid knowledge graphs from cybersecurity logs and predict ATT&CK tactics from aggregated sessions.
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging cs.IR · 2025-07-11 · unverdicted · none · ref 44
Language composition in training data creates opposing effects on CLIR and mono-IR performance for Korean-English retrieval, which model merging can partially resolve.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 252
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Granite Embedding Multilingual R2 Models cs.IR · 2026-05-13 · unverdicted · none · ref 22
Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems cs.CL · 2025-07-10 · unverdicted · none · ref 23
Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.
KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model cs.CL · 2026-05-13 · unverdicted · none · ref 12 · 2 links
A system using XLM-RoBERTa, GPT-4 back-translation augmentation, undersampling, and language-specific threshold tuning reports 2-5% F1 gains on multilingual slur reclamation detection.

mgte: Generalized long-context text representation and reranking models for multilingual text retrieval

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer