UniXcoder: Unified Cross-Modal Pre-training for Code Representation
12 Pith papers cite this work, alongside 459 external citations. Polarity classification is still indexing.
hub tools
citation-role summary: background (1)
citation-polarity summary: still indexing
citing papers explorer
-
Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
The two main benchmarks for instructed code editing with LLMs over-represent Python, miss common real-world domains and edit types, and have test-coverage issues that limit what they actually measure.
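A minimal sketch of the kind of composition audit the paper describes, assuming a hypothetical benchmark stored as JSONL with illustrative `language`, `domain`, and `tests` fields (not the paper's tooling):
```python
import json
from collections import Counter

def audit_benchmark(path):
    """Tally language/domain mix and test presence for an editing benchmark.

    Assumes one JSON object per line, e.g.
    {"language": "python", "domain": "cli", "tests": ["test_foo.py"]}.
    """
    languages, domains, with_tests, total = Counter(), Counter(), 0, 0
    with open(path) as f:
        for line in f:
            task = json.loads(line)
            total += 1
            languages[task.get("language", "unknown")] += 1
            domains[task.get("domain", "unknown")] += 1
            with_tests += bool(task.get("tests"))
    total = total or 1  # avoid division by zero on an empty file
    return {
        "language_share": {k: v / total for k, v in languages.items()},
        "domain_share": {k: v / total for k, v in domains.items()},
        "test_coverage": with_tests / total,
    }
```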
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
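A minimal sketch of one way to fuse graph-derived features into token embeddings at the token level (a gated residual fusion); shapes and the gating form are assumptions, not CGFuse's exact architecture:
```python
import numpy as np

def gated_token_fusion(token_emb, graph_emb, w_gate, b_gate):
    """Fuse per-token graph features into token embeddings with a learned gate.

    token_emb: (T, d) token embeddings from the language model.
    graph_emb: (T, d) graph-derived features aligned to the same tokens
               (e.g. pooled from AST/DFG nodes covering each token).
    w_gate:    (2d, d) gate projection; b_gate: (d,) bias.
    """
    x = np.concatenate([token_emb, graph_emb], axis=-1)        # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))        # sigmoid gate, (T, d)
    return token_emb + gate * graph_emb                        # gated residual fusion

# toy usage with random weights
T, d = 4, 8
rng = np.random.default_rng(0)
fused = gated_token_fusion(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                           rng.normal(size=(2 * d, d)) * 0.1, np.zeros(d))
```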
-
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
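A minimal sketch of the data-mixing idea, assuming parallel solutions to the same task keyed by language; the sampling scheme is illustrative, not the paper's exact recipe:
```python
import random

def mix_parallel_sft(parallel_tasks, per_task_langs=2, seed=0):
    """Build an SFT mixture that pairs each task with several language variants.

    parallel_tasks: list of dicts like
        {"prompt": "reverse a list",
         "solutions": {"python": "...", "java": "...", "rust": "..."}}
    Returns a flat, shuffled list of (prompt, language, solution) examples in
    which each task contributes solutions from multiple languages.
    """
    rng = random.Random(seed)
    examples = []
    for task in parallel_tasks:
        langs = list(task["solutions"])
        rng.shuffle(langs)
        for lang in langs[:per_task_langs]:
            examples.append((task["prompt"], lang, task["solutions"][lang]))
    rng.shuffle(examples)
    return examples
```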
-
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.
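A minimal sketch of the retrieve-then-complete pipeline setting, using a toy lexical retriever over snippets from other files; the retriever and prompt format are assumptions, not RepoBench's implementation:
```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two code snippets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def retrieve_then_complete(in_file_context, repo_snippets, complete_fn, k=3):
    """Pipeline task: pick the k most similar cross-file snippets, then complete.

    repo_snippets: code snippets drawn from other files in the repository.
    complete_fn:   any next-line completion function (e.g. an LLM call).
    """
    ranked = sorted(repo_snippets, key=lambda s: jaccard(in_file_context, s),
                    reverse=True)
    prompt = "\n\n".join(ranked[:k]) + "\n\n" + in_file_context
    return complete_fn(prompt)
```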
-
Do not copy and paste! Rewriting strategies for code retrieval
Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
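A minimal sketch of a token-entropy-difference signal, assuming Shannon entropy over whitespace tokens of the original code versus its natural-language rewrite; the tokenization and sign convention are assumptions:
```python
import math
from collections import Counter

def token_entropy(text):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(text.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_difference(code, rewrite):
    """Positive when the code's token distribution is more spread out than its
    natural-language rewrite; used here as a cheap predictor of rewriting gains."""
    return token_entropy(code) - token_entropy(rewrite)
```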
-
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
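A minimal sketch of a sparse autoencoder over LLM hidden states with an L1 sparsity penalty; SAGE's actual layer choice, training setup, and feature-amplification rule are not reproduced, and the shapes below are assumptions:
```python
import numpy as np

def sae_forward(h, w_enc, b_enc, w_dec, b_dec, l1=1e-3):
    """One forward pass of a sparse autoencoder on hidden states h: (N, d).

    w_enc: (d, m), w_dec: (m, d) with m > d (overcomplete dictionary).
    Features whose codes correlate with vulnerability labels can later be
    scaled up before decoding to amplify the signal.
    Returns sparse codes, reconstruction, and reconstruction + L1 loss.
    """
    z = np.maximum(h @ w_enc + b_enc, 0.0)          # ReLU codes, (N, m)
    h_hat = z @ w_dec + b_dec                        # reconstruction, (N, d)
    loss = np.mean((h - h_hat) ** 2) + l1 * np.mean(np.abs(z))
    return z, h_hat, loss
```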
-
MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection
MARGIN reduces geometric distortions in imbalanced vulnerability embeddings by dynamically regularizing margins with von Mises-Fisher concentration estimates and hyperspherical prototypes.
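A minimal sketch of estimating a per-class von Mises-Fisher concentration from L2-normalized embeddings (Banerjee et al.'s approximation) and using it to scale a class margin; the exact margin rule in MARGIN is not reproduced here:
```python
import numpy as np

def vmf_concentration(unit_vectors):
    """Approximate vMF concentration kappa for unit vectors of shape (N, d),
    using kappa ~= r_bar * (d - r_bar**2) / (1 - r_bar**2)."""
    d = unit_vectors.shape[1]
    r_bar = np.linalg.norm(unit_vectors.mean(axis=0))
    return r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2 + 1e-12)

def class_margin(unit_vectors, base_margin=0.3):
    """Illustrative rule: give loosely concentrated (often minority) classes a
    larger angular margin around their hyperspherical prototype."""
    kappa = vmf_concentration(unit_vectors)
    prototype = unit_vectors.mean(axis=0)
    prototype /= np.linalg.norm(prototype) + 1e-12
    return prototype, base_margin / (1.0 + np.log1p(kappa))
```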
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1, combined with response stabilization, improves the reliability and often the performance of compact models for cross-language code clone detection on pairs such as Python-Java and Rust-Java.
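A minimal sketch of one form of response stabilization before distillation: sample several teacher rationales per pair, keep only pairs where the teacher's verdict is consistent, and distill from those. This is an illustrative scheme, not necessarily the paper's exact procedure:
```python
from collections import Counter

def stabilized_targets(pairs, teacher_fn, n_samples=5, min_agreement=0.8):
    """Build distillation targets for cross-language clone detection.

    pairs:      list of (code_a, code_b) snippets in different languages.
    teacher_fn: callable returning ("clone" | "not_clone", rationale); assumed
                stochastic (e.g. sampled from a reasoning model).
    Keeps only pairs whose sampled verdicts agree at least `min_agreement`.
    """
    targets = []
    for code_a, code_b in pairs:
        samples = [teacher_fn(code_a, code_b) for _ in range(n_samples)]
        label, votes = Counter(lbl for lbl, _ in samples).most_common(1)[0]
        if votes / n_samples >= min_agreement:
            rationale = next(r for lbl, r in samples if lbl == label)
            targets.append({"pair": (code_a, code_b),
                            "label": label,
                            "rationale": rationale})
    return targets
```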
-
How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection
Text fine-tuning of 8B LLMs on C/C++ vulnerability data inflates cross-language false-positive rates through surface-cue memorization, which an AST inference probe can partially reverse while direct AST fine-tuning cannot.
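A minimal sketch of feeding a model an AST-serialized view of code at inference time instead of raw text, shown here with Python's built-in `ast` module for illustration (the paper's setting is C/C++, and how the probe is combined with the fine-tuned model is not specified here):
```python
import ast

def ast_view(source):
    """Serialize Python source into a structural, identifier-light AST string.

    Surface cues such as formatting and keyword patterns are less prominent in
    this view, which is the kind of signal an inference-time AST probe targets.
    """
    tree = ast.parse(source)
    return ast.dump(tree, annotate_fields=False)

print(ast_view("x = input(); eval(x)"))
# e.g. Module([Assign([Name('x', Store())], Call(Name('input', Load()), [], [])), ...])
```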
-
Towards General Text Embeddings with Multi-stage Contrastive Learning
GTE_base is a compact text embedding model trained with multi-stage contrastive learning on diverse data; it outperforms OpenAI's embedding API and models 10x its size on the Massive Text Embedding Benchmark, and it handles code by treating it as text.
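A minimal sketch of the in-batch contrastive objective that multi-stage contrastive training typically builds on (InfoNCE with in-batch negatives); GTE's exact staging, batching, and temperature are not reproduced:
```python
import numpy as np

def info_nce(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive loss: row i of query_emb should match row i of doc_emb.

    query_emb, doc_emb: (B, d) L2-normalized embeddings; every other document
    in the batch serves as a negative for each query.
    """
    logits = query_emb @ doc_emb.T / temperature          # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))                 # cross-entropy on the diagonal
```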
-
Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
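A minimal sketch of one stage such a compression pipeline can include, magnitude pruning of weight matrices; CTT's actual combination of pruning, quantization, and distillation stages is not reproduced here:
```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude entries of a 2-D weight matrix.

    sparsity: fraction of entries to remove. Pruned (and later quantized)
    weights are what drive the memory, latency, and energy reductions that
    compression pipelines report.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask
```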
-
LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification
LoRA-MME ensembles LoRA-adapted UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa with learned weights to reach 0.7906 weighted F1 and 0.6867 macro F1 on code comment classification.
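A minimal sketch of combining per-encoder class probabilities with learned, softmax-normalized weights; the encoders, their LoRA adapters, and how the weights are learned are outside this sketch:
```python
import numpy as np

def weighted_ensemble(prob_list, weight_logits):
    """Combine class probabilities from several encoders into one prediction.

    prob_list:     list of (N, C) probability arrays, one per encoder
                   (e.g. UniXcoder, CodeBERT, GraphCodeBERT, CodeBERTa).
    weight_logits: (M,) learned logits, softmax-normalized into ensemble weights.
    """
    w = np.exp(weight_logits - weight_logits.max())
    w /= w.sum()
    stacked = np.stack(prob_list, axis=0)                 # (M, N, C)
    return np.tensordot(w, stacked, axes=1)               # (N, C) weighted mixture
```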