arxiv: 1908.10084 · v1 · submitted 2019-08-27 · 💻 cs.CL

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers, Iryna Gurevych

Pith reviewed 2026-05-10 14:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords sentence embeddings · BERT · siamese networks · triplet networks · semantic textual similarity · cosine similarity · transfer learning

The pith

Sentence-BERT uses siamese and triplet training on BERT to create fixed-size sentence embeddings that support fast cosine-similarity comparisons while maintaining BERT's accuracy on semantic textual similarity tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sentence-BERT by adapting pretrained BERT with siamese and triplet network structures. This produces standalone sentence embeddings that capture semantic meaning and can be compared directly with cosine similarity. The change removes the need to run both sentences through the model together for each comparison. As a result, finding the most similar pair among 10,000 sentences drops from roughly 65 hours of BERT inference to about 5 seconds. A sympathetic reader cares because the approach makes semantic search and clustering practical at scale while preserving BERT-level performance on standard sentence-pair tasks.

Core claim

Sentence-BERT modifies the pretrained BERT network by applying siamese and triplet network structures to derive semantically meaningful sentence embeddings. These embeddings can be compared using cosine similarity. The method reduces the cost of finding the most similar pair in a collection of 10,000 sentences from approximately 50 million inference computations (about 65 hours) with BERT to about 5 seconds with SBERT, while maintaining the accuracy achieved by the original BERT model on semantic textual similarity tasks.
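The "approximately 50 million" figure follows from counting unordered sentence pairs. The short sketch below reproduces that count and derives the per-pair latency implied by the quoted 65 hours. The variable names are illustrative, and the latency is a back-of-the-envelope number derived here, not a figure reported in the abstract.

```python
from math import comb

n_sentences = 10_000

# A cross-encoder such as BERT must score every unordered pair once.
n_pairs = comb(n_sentences, 2)

# 65 hours is the figure quoted in the abstract; the latency below is derived from it.
bert_seconds = 65 * 3600
ms_per_pair = 1000 * bert_seconds / n_pairs

# An independent encoder needs only one forward pass per sentence.
encoder_passes = n_sentences

print(n_pairs)                   # 49995000
print(round(ms_per_pair, 2))     # 4.68
print(n_pairs / encoder_passes)  # 4999.5 times fewer forward passes
```

The remaining work for the embedding approach is the cosine comparison of precomputed vectors, which is cheap relative to a transformer forward pass.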

What carries the argument

Siamese and triplet network structures applied to BERT for producing standalone sentence embeddings.

If this is right

Semantic similarity search over large sentence collections becomes feasible in seconds rather than hours.
Unsupervised tasks such as clustering become practical with BERT-derived embeddings.
SBERT and SRoBERTa outperform prior state-of-the-art sentence embedding methods on standard STS benchmarks and transfer learning tasks.
The same accuracy as full BERT pairwise inference is retained on sentence-pair regression tasks.
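A minimal sketch of the similarity search these points describe, assuming embeddings have already been computed once per sentence. Random vectors stand in for SBERT embeddings here, and the function name `most_similar_pair`, the 768 dimensions, and the collection size of 1,000 are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def most_similar_pair(embeddings: np.ndarray) -> tuple[int, int, float]:
    """Return (i, j, cosine) for the most similar pair of distinct rows."""
    # L2-normalise so that a plain dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Exclude each sentence's similarity with itself.
    np.fill_diagonal(sims, -np.inf)
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return int(min(i, j)), int(max(i, j)), float(sims[i, j])

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 768))  # stand-in for precomputed sentence embeddings
# Plant a near-duplicate so the expected answer is known in advance.
emb[42] = emb[7] + 0.01 * rng.normal(size=768)

pair = most_similar_pair(emb)
print(pair[:2])  # (7, 42)
```

The full similarity matrix grows quadratically in memory, so for much larger collections one would likely process it in blocks or use an approximate nearest-neighbour index; that choice is outside what this page describes.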

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same siamese training approach could be applied to other pretrained transformer models to generate efficient embeddings.
Independent sentence embeddings may serve as a practical approximation for many semantic comparison tasks that originally required joint inference.
Combining SBERT-style embeddings with domain-specific fine-tuning could further improve performance on specialized corpora without reintroducing pairwise computation costs.

Load-bearing premise

Fine-tuning BERT with siamese and triplet networks produces sentence embeddings whose cosine similarities accurately reflect semantic similarity at the level of the original pairwise BERT inference.

What would settle it

A held-out semantic textual similarity dataset where the ranking of sentence pairs by SBERT cosine similarity differs substantially from the ranking obtained by direct BERT pairwise inference on the same pairs.
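One way to quantify how far two rankings differ is a Spearman rank correlation between the two sets of scores. The sketch below is a hedged illustration: the scores are made-up placeholders, not results from the paper, and the helper assumes there are no tied values.

```python
import numpy as np

def spearman(a, b) -> float:
    """Spearman rank correlation, assuming no tied scores."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # argsort of argsort converts scores to 0-based ranks.
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical scores for five sentence pairs (placeholders only).
pairwise_bert_scores = [4.8, 3.1, 0.5, 2.2, 4.0]
sbert_cosines = [0.93, 0.35, 0.12, 0.71, 0.88]

rho = spearman(pairwise_bert_scores, sbert_cosines)
print(round(rho, 3))  # 0.9
```

A value near 1 would mean the two methods order the pairs almost identically; a substantially lower value on held-out data is the kind of evidence the test above asks for.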

Original abstract

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, it requires that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embeddings methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SBERT turns BERT into a practical sentence embedder via siamese fine-tuning, delivering big speedups on semantic search while holding accuracy on STS tasks.

Hi colleague,

The main point from this paper is that Sentence-BERT turns the powerful but slow BERT into something practical for sentence embeddings by using siamese and triplet training. This lets you compute embeddings once and compare them quickly with cosine similarity, slashing the time for finding similar sentences in a 10,000-sentence set from 65 hours to 5 seconds, all while matching BERT's accuracy on STS tasks.

They do a good job showing the practical benefits. The experiments cover standard STS benchmarks and some transfer tasks, outperforming other sentence embedding methods. They test different ways to pool the BERT outputs and use NLI data for fine-tuning, which seems to produce embeddings that capture semantic similarity effectively. The speedup is real and the accuracy holds in their reported results.

A softer area is the reliance on the fine-tuned embeddings to make up for the missing cross-attention between sentences. Since each sentence is encoded separately, phenomena that need token-level interactions might suffer, even if the overall STS scores are strong. The paper doesn't dive deep into failure cases for complex semantics, so that could be a limitation in some domains.

Overall, this is for NLP researchers and engineers dealing with semantic search, clustering, or retrieval on bigger datasets. It's a useful engineering advance that makes advanced models more deployable. I think it deserves peer review because the core idea is sound and the evaluations are comprehensive enough to be worth referee input.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sentence-BERT (SBERT), a modification of pre-trained BERT (and RoBERTa) that employs siamese and triplet network structures to produce fixed-length sentence embeddings. These embeddings can be compared efficiently via cosine similarity, reducing the cost of finding the most similar pair among 10,000 sentences from ~65 hours (pairwise BERT inference) to ~5 seconds while claiming to maintain BERT-level accuracy on semantic textual similarity (STS) tasks and to outperform prior sentence embedding methods on transfer learning tasks.

Significance. If the empirical claims hold, the work is significant because it makes contextualized transformer representations practical for large-scale semantic search, clustering, and retrieval pipelines that were previously infeasible due to quadratic inference costs. The approach has influenced subsequent efficient embedding research and provides a reproducible recipe for adapting pre-trained models to standalone sentence encoding.

major comments (2)

[§3] §3 (SBERT Architecture): The central claim that siamese/triplet fine-tuning on NLI data produces embeddings whose cosine similarities recover the semantic judgments of BERT's joint [CLS] encoding is load-bearing for the 'maintaining the accuracy' assertion, yet the manuscript provides no ablation or diagnostic test on phenomena that rely on cross-sentence attention (e.g., negation scope, coreference resolution, or subtle entailment). A controlled comparison on such cases would be required to substantiate that independent encoding plus learned pooling fully compensates for the removed token-level interactions.
[§4] §4 (Evaluation): The STS and transfer-task results are presented without reporting run-to-run variance, statistical significance tests, or direct side-by-side numbers for the original BERT/RoBERTa pairwise baseline on the identical splits and metrics; this weakens the quantitative support for the efficiency-accuracy tradeoff claim.

minor comments (2)

[Abstract] Abstract: two subject-verb agreement errors ('BERT and RoBERTa has set' and 'that use siamese').
[§3] Notation: the pooling operation (mean/max/[CLS]) and the exact form of the triplet loss are described but not given explicit equations; adding numbered equations would improve reproducibility.
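For readers who want the pooling and triplet objective in concrete form, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the Euclidean distance and the margin of 1.0 are one common formulation chosen here for the example.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

# One sentence with three token slots; the last slot is padding.
tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)  # [[2. 2.]]

a = np.array([[0.0, 0.0]])
p = np.array([[0.0, 1.0]])
loss_far = triplet_loss(a, p, np.array([[3.0, 4.0]]))   # negative is far away
loss_near = triplet_loss(a, p, np.array([[0.0, 1.5]]))  # negative is too close
print(loss_far, loss_near)  # 0.0 0.5
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is why only the second example is penalised.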

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below, clarifying our position and indicating changes to the manuscript where appropriate.

Referee: [§3] §3 (SBERT Architecture): The central claim that siamese/triplet fine-tuning on NLI data produces embeddings whose cosine similarities recover the semantic judgments of BERT's joint [CLS] encoding is load-bearing for the 'maintaining the accuracy' assertion, yet the manuscript provides no ablation or diagnostic test on phenomena that rely on cross-sentence attention (e.g., negation scope, coreference resolution, or subtle entailment). A controlled comparison on such cases would be required to substantiate that independent encoding plus learned pooling fully compensates for the removed token-level interactions.

Authors: We agree that phenomena relying on cross-sentence attention represent an important test case for the claim that SBERT embeddings recover BERT-level semantic judgments. The NLI training data used for fine-tuning explicitly requires modeling entailment and contradiction relations, which frequently involve negation, coreference, and subtle semantic distinctions. The strong results on STS benchmarks, which contain many such examples, provide supporting evidence that the learned pooling and siamese objective capture the necessary information in the fixed embeddings. Nevertheless, the original manuscript did not include targeted diagnostic ablations or controlled comparisons isolating these phenomena. In the revised version we have added a paragraph in §3 discussing this point, along with qualitative examples illustrating SBERT's handling of negation and coreference in similarity tasks. A full controlled study would require new experiments outside the scope of the current work focused on efficient sentence encoding.

revision: partial

Referee: [§4] §4 (Evaluation): The STS and transfer-task results are presented without reporting run-to-run variance, statistical significance tests, or direct side-by-side numbers for the original BERT/RoBERTa pairwise baseline on the identical splits and metrics; this weakens the quantitative support for the efficiency-accuracy tradeoff claim.

Authors: The BERT and RoBERTa pairwise numbers reported in the paper are taken directly from the same standard STS and transfer-task benchmarks and splits used in the original BERT/RoBERTa publications and subsequent leaderboard evaluations, enabling direct comparison on identical metrics. To strengthen the presentation, we have updated the evaluation section and tables to report run-to-run standard deviations (computed over five random seeds) for SBERT and SRoBERTa, and we have added paired statistical significance tests against the strongest baselines. The side-by-side BERT/RoBERTa figures already appear in Tables 1 and 2 using the same evaluation protocol.

revision: yes
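As a rough illustration of what reporting run-to-run variance over five seeds involves, the sketch below computes a mean and sample standard deviation. The scores are placeholders invented for the example and do not come from the paper or from this rebuttal.

```python
from statistics import mean, stdev

# Placeholder scores for five random seeds (illustrative only).
scores_by_seed = [84.6, 84.9, 85.1, 84.7, 85.2]

avg = mean(scores_by_seed)
sd = stdev(scores_by_seed)  # sample standard deviation (n - 1 denominator)

print(f"{avg:.2f} +/- {sd:.2f}")  # 84.90 +/- 0.25
```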

standing simulated objections not resolved

A dedicated controlled ablation isolating cross-sentence attention phenomena (negation scope, coreference, subtle entailment) was not performed in the original experiments.

Circularity Check

0 steps flagged

No circularity: SBERT is an empirical fine-tuning method with external validation

full rationale

The paper describes a practical modification of BERT using siamese and triplet networks to produce fixed sentence embeddings for cosine similarity, followed by direct evaluation on STS and transfer tasks. No derivation chain exists that reduces a claimed result to its own inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or imported uniqueness theorems appear. The core claim rests on standard transfer learning from an external pre-trained model (BERT) and is tested against independent benchmarks, making the procedure self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that siamese and triplet fine-tuning successfully transfers BERT's semantic capabilities to independent sentence embeddings; no free parameters or invented entities are specified in the abstract.

axioms (1)

domain assumption: BERT's pre-trained representations can be adapted via siamese and triplet training to produce standalone sentence embeddings that preserve semantic information.
This is the core unproven premise enabling the efficiency gain.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Unified Geometric Framework for Weighted Contrastive Learning
cs.LG 2026-05 unverdicted novelty 8.0

Weighted InfoNCE objectives realize specific target geometries in embedding space, with SupCon producing size-dependent inter-class similarities under imbalance while Soft SupCon and certain continuous variants preser...
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
cs.CR 2026-04 conditional novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users
cs.CL 2026-05 unverdicted novelty 7.0

Preference fine-tuning outperforms prompting for personalisation but amplifies sycophancy and relationship-seeking, while simulated users recover aggregate rankings yet show far lower self-consistency and different to...
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
cs.AI 2026-05 unverdicted novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
cs.HC 2026-05 unverdicted novelty 7.0

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs
cs.CR 2026-05 unverdicted novelty 7.0

PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
cs.CL 2026-05 unverdicted novelty 7.0

TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
Automated Large-scale CVRP Solver Design via LLM-assisted Flexible MCTS
cs.AI 2026-05 unverdicted novelty 7.0

LaF-MCTS uses LLM-assisted flexible MCTS with a three-tier hierarchy, semantic pruning, and branch regrowth to automatically compose decomposition-enhanced CVRP solvers that outperform state-of-the-art methods on CVRP...
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
cs.CL 2026-05 unverdicted novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
cs.SE 2026-04 unverdicted novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
Similar Users-Augmented Interest Network
cs.IR 2026-04 unverdicted novelty 7.0

SUIN improves CTR prediction by augmenting target user sequences with similar users' behaviors via embedding-based retrieval, user-specific position encoding, and user-aware target attention.
Prompt-Unknown Promotion Attacks against LLM-based Sequential Recommender Systems
cs.IR 2026-04 unverdicted novelty 7.0

PUDA enables effective promotion of unpopular target items in black-box LLM sequential recommenders by using evolutionary LLM refinement to infer hidden prompts, training a surrogate model, and combining adversarial t...
R2Code: A Self-Reflective LLM Framework for Requirements-to-Code Traceability
cs.SE 2026-04 unverdicted novelty 7.0

R2Code improves requirement-to-code traceability with a bidirectional alignment network, self-reflective consistency verification, and dynamic context-adaptive retrieval, yielding 7.4% average F1 gain and up to 41.7% ...
Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation
cs.IR 2026-04 unverdicted novelty 7.0

An LLM simulation framework generates multilingual tip-of-the-tongue queries, validated by rank correlation with real queries, producing the first large-scale ToT benchmarks for four languages.
Semantic Recall for Vector Search
cs.IR 2026-04 unverdicted novelty 7.0

Semantic Recall is a new evaluation metric for approximate nearest neighbor search that focuses only on semantically relevant results, with Tolerant Recall as a proxy when relevance labels are unavailable.
HumanScore: Benchmarking Human Motions in Generated Videos
cs.CV 2026-04 unverdicted novelty 7.0

HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps betwee...
LLM-Viterbi: Semantic-Aware Decoding for Convolutional Codes
cs.IT 2026-04 unverdicted novelty 7.0

An LLM-enhanced Viterbi decoder achieves roughly 1.5 dB extra coding gain in block error rate and over 50% better semantic similarity than conventional Viterbi for constraint-length-3 convolutional codes on AWGN channels.
DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion
cs.IR 2026-04 conditional novelty 7.0

Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.
Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval
cs.IR 2026-04 unverdicted novelty 7.0

BAGEL is a Bayesian active learning framework that uses Gaussian Processes to propagate LLM relevance signals across embedding space and guide global exploration, outperforming standard LLM reranking under identical b...
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
Efficient Personalization of Generative User Interfaces
cs.LG 2026-04 unverdicted novelty 7.0

A dataset revealing high inter-designer disagreement on UI preferences motivates a sample-efficient method that personalizes generative interfaces by embedding new users in the space of prior designers, outperforming ...
Skill-Conditioned Visual Geolocation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
Skill-Conditioned Visual Geolocation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization
cs.CR 2026-04 unverdicted novelty 7.0

HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
cs.SE 2026-03 accept novelty 7.0

LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain
cs.CL 2026-03 unverdicted novelty 7.0

WorkRB is the first open community-driven benchmark for AI in the work domain, organizing 13 tasks from 7 groups with dynamic multilingual ontology loading and modular design for proprietary task integration.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
cs.CL 2024-02 unverdicted novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
Steering Language Models With Activation Engineering
cs.CL 2023-08 unverdicted novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
cs.AI 2026-05 unverdicted novelty 6.0

SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI 2026-05 unverdicted novelty 6.0

ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
Sanity Checks for Long-Form Hallucination Detection
cs.CL 2026-05 unverdicted novelty 6.0

Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation
cs.CL 2026-05 unverdicted novelty 6.0

WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
Structural Rationale Distillation via Reasoning Space Compression
cs.CL 2026-05 unverdicted novelty 6.0

D-RPC compresses reasoning into a dynamic bank of reusable paths to produce consistent teacher rationales, outperforming standard distillation baselines on five reasoning benchmarks while using fewer tokens.
RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation
cs.IR 2026-05 unverdicted novelty 6.0

RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.
Query-efficient model evaluation using cached responses
cs.LG 2026-05 unverdicted novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
On the Role of Language Representations in Auto-Bidding: Findings and Implications
cs.AI 2026-05 unverdicted novelty 6.0

SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and ...
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
cs.HC 2026-05 unverdicted novelty 6.0

PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
cs.CR 2026-05 unverdicted novelty 6.0

NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
Anticipating Innovation Using Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

TechToken uses transformer embeddings of IPC codes to measure linguistic convergence in patents and predict future technological combinations.
Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding
cs.CL 2026-05 unverdicted novelty 6.0

GTokenLLMs do not fully understand graph tokens, exhibiting over-sensitivity or insensitivity to instruction changes and relying heavily on text for reasoning even when graph information is preserved.
RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions
cs.SE 2026-05 unverdicted novelty 6.0

RECAP captures, replays, and analyzes AI-assisted programming sessions by linking prompts, edits, and developer actions in a single timeline.
A Replicability Study of XTR
cs.IR 2026-05 accept novelty 6.0

XTR training does not improve retrieval effectiveness over ColBERT but enhances IVF engine efficiency by flattening token scores to produce more discriminative centroids.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
cs.AI 2026-04 unverdicted novelty 6.0

Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text
cs.IR 2026-04 unverdicted novelty 6.0

Methods for constructing Hypergraphs of Text are proposed with a new effort ratio metric where TF-IDF baselines match LLM methods in experiments.
LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images
cs.CV 2026-04 unverdicted novelty 6.0

LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than ca...
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining
cs.CL 2026-04 unverdicted novelty 6.0

MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance especially at extreme lo...
When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
cs.CV 2026-04 unverdicted novelty 6.0

Hallucinations in LVLMs largely arise from textual priors in prompts, and can be reduced by fine-tuning with preference optimization on grounded vs. hallucinated response pairs.
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
cs.LG 2026-04 unverdicted novelty 6.0

COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
Text Steganography with Dynamic Codebook and Multimodal Large Language Model
cs.CR 2026-04 unverdicted novelty 6.0

A black-box text steganography method using a dynamic codebook generated by multimodal LLMs and reject-sampling feedback achieves higher embedding capacity and text quality than prior white-box and fixed-codebook blac...
Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
cs.CL 2026-04 unverdicted novelty 6.0

LLMs show mixed results on authorship verification, post generation, and attribute inference from Twitter data, with new frameworks and user studies establishing benchmarks for these analytics tasks.
Reasoning Structure Matters for Safety Alignment of Reasoning Models
cs.AI 2026-04 unverdicted novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents
cs.CL 2026-04 unverdicted novelty 6.0

HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...
Identifying Ethical Biases in Action Recognition Models
cs.CV 2026-04 unverdicted novelty 6.0

The authors create a synthetic video auditing framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.
DuConTE: Dual-Granularity Text Encoder with Topology-Constrained Attention for Text-attributed Graphs
cs.CL 2026-04 unverdicted novelty 6.0

DuConTE is a dual-granularity text encoder that incorporates graph topology into language model attention for improved node representations in text-attributed graphs.
REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
cs.CL 2026-04 unverdicted novelty 6.0

REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
Lorentz Framework for Semantic Segmentation
cs.CV 2026-04 unverdicted novelty 6.0

A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3...

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 113 Pith papers · 3 internal anchors

[1]

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. http://www.aclweb.org/anthology/S15-2045 SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability . In Procee...

work page 2015
[2]

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. https://doi.org/10.3115/v1/S14-2010 SemEval-2014 Task 10: Multilingual Semantic Textual Similarity . In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages ...

work page doi:10.3115/v1/s14-2010 2014
[3]

Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. http://aclweb.org/anthology/S/S16/S16-1081.pdf SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation . In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@...

work page 2016
[4]

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. https://www.aclweb.org/anthology/S13-1004 *SEM 2013 shared task: Semantic Textual Similarity . In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 3...

work page 2013
[5]

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. http://dl.acm.org/citation.cfm?id=2387636.2387697 SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity . In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedin...

work page arXiv 2012
[6]

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1075 A large annotated corpus for learning natural language inference . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632--642, Lisbon, Portugal. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1075 2015
[7]

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. http://arxiv.org/abs/1708.00055 SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation . In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1--14, Vancouver, Canada

work page Pith review arXiv 2017
[8]

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. http://arxiv.org/abs/1803.11175 Universal Sentence Encoder . arXiv preprint arXiv:1803.11175

work page Pith review arXiv 2018
[9]

Alexis Conneau and Douwe Kiela. 2018. https://arxiv.org/abs/1803.05449 SentEval: An Evaluation Toolkit for Universal Sentence Representations . arXiv preprint arXiv:1803.05449

work page arXiv 2018
[10]

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. https://www.aclweb.org/anthology/D17-1070 Supervised Learning of Universal Sentence Representations from Natural Language Inference Data . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670--680, Copenhagen, Denmark. ...

work page 2017
[11]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. https://arxiv.org/abs/1810.04805 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . arXiv preprint arXiv:1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. https://doi.org/10.3115/1220355.1220406 Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources . In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA. Association for Computational Linguistics

work page doi:10.3115/1220355.1220406 2004
[13]

Liat Ein Dor, Yosi Mass, Alon Halfon, Elad Venezian, Ilya Shnayderman, Ranit Aharonov, and Noam Slonim. 2018. https://doi.org/10.18653/v1/P18-2009 Learning Thematic Similarity Metric from Article Sections Using Triplet Networks . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 49--...

work page doi:10.18653/v1/p18-2009 2018
[14]

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. https://doi.org/10.18653/v1/N16-1162 Learning Distributed Representations of Sentences from Unlabelled Data . In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367--1377, San Diego, California. Assoc...

work page doi:10.18653/v1/n16-1162 2016
[15]

Minqing Hu and Bing Liu. 2004. https://doi.org/10.1145/1014052.1014073 Mining and Summarizing Customer Reviews . In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168--177, New York, NY, USA. ACM

work page doi:10.1145/1014052.1014073 2004
[16]

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. http://arxiv.org/abs/1905.01969 Real-time Inference in Multi-sentence Tasks with Deep Pretrained Transformers . arXiv preprint arXiv:1905.01969

work page arXiv 2019
[17]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. https://arxiv.org/abs/1702.08734 Billion-scale similarity search with GPUs . arXiv preprint arXiv:1702.08734

work page Pith review arXiv 2017
[18]

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf Skip-Thought Vectors . In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294--3302. Curra...

work page 2015
[19]

Xin Li and Dan Roth. 2002. https://doi.org/10.3115/1072228.1072378 Learning Question Classifiers . In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1--7, Stroudsburg, PA, USA. Association for Computational Linguistics

work page doi:10.3115/1072228.1072378 2002
[20]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. http://arxiv.org/abs/1907.11692 RoBERTa: A Robustly Optimized BERT Pretraining Approach . arXiv preprint arXiv:1907.11692

work page internal anchor Pith review Pith/arXiv arXiv 2019
[21]

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf A SICK cure for the evaluation of compositional distributional semantic models . In Proceedings of the Ninth International Conference on Language Resources and Evaluation ( LREC ' ...

work page 2014
[22]

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. http://arxiv.org/abs/1903.10561 On Measuring Social Biases in Sentence Encoders . arXiv preprint arXiv:1903.10561

work page arXiv 2019
[23]

Amita Misra, Brian Ecker, and Marilyn A. Walker. 2016. http://aclweb.org/anthology/W/W16/W16-3636.pdf Measuring the Similarity of Sentential Arguments in Dialogue . In Proceedings of the SIGDIAL 2016 Conference, The 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 13-15 September 2016, Los Angeles, CA, USA, pages 276--287

work page 2016
[24]

Bo Pang and Lillian Lee. 2004. https://doi.org/10.3115/1218955.1218990 A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts . In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL '04), Main Volume, pages 271--278, Barcelona, Spain

work page doi:10.3115/1218955.1218990 2004
[25]

Bo Pang and Lillian Lee. 2005. https://doi.org/10.3115/1219840.1219855 Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales . In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 115--124, Ann Arbor, Michigan. Association for Computational Linguistics

work page doi:10.3115/1219840.1219855 2005
[26]

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. https://www.aclweb.org/anthology/D14-1162 GloVe: Global Vectors for Word Representation . In Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543

work page 2014
[27]

Yifan Qiao, Chenyan Xiong, Zheng-Hao Liu, and Zhiyuan Liu. 2019. http://arxiv.org/abs/1904.07531 Understanding the Behaviors of BERT in Ranking . arXiv preprint arXiv:1904.07531

work page arXiv 2019
[28]

Nils Reimers, Philip Beyer, and Iryna Gurevych. 2016. https://www.aclweb.org/anthology/C16-1009 Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity . In Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 87--96

work page 2016
[29]

Nils Reimers and Iryna Gurevych. 2018. http://arxiv.org/abs/1803.09578 Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches . arXiv preprint arXiv:1803.09578

work page arXiv 2018
[30]

Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. https://www.aclweb.org/anthology/P19-1054 Classification and Clustering of Arguments with Contextualized Word Embeddings . In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567--578, Florence, Italy....

work page 2019
[31]

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. http://arxiv.org/abs/1503.03832 FaceNet: A Unified Embedding for Face Recognition and Clustering . arXiv preprint arXiv:1503.03832

work page arXiv 2015
[32]

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. https://www.aclweb.org/anthology/D13-1170 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank . In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631--1642, Seattle...

work page 2013
[33]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf Attention is All you Need . In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information P...

work page 2017
[34]

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. https://doi.org/10.1007/s10579-005-7880-9 Annotating Expressions of Opinions and Emotions in Language . Language Resources and Evaluation, 39(2):165--210

work page doi:10.1007/s10579-005-7880-9 2005
[35]

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. http://aclweb.org/anthology/N18-1101 A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference . In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112--...

work page 2018
[36]

Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-Yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. https://www.aclweb.org/anthology/W18-3022 Learning Semantic Textual Similarity from Conversations . In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164--174, Melbourne, Australia. A...

work page 2018
[37]

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. http://arxiv.org/abs/1906.08237 XLNet: Generalized Autoregressive Pretraining for Language Understanding . arXiv preprint arXiv:1906.08237

work page arXiv 2019
[38]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. http://arxiv.org/abs/1904.09675 BERTScore: Evaluating Text Generation with BERT . arXiv preprint arXiv:1904.09675

work page internal anchor Pith review arXiv 2019