hub
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
69 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture, DeBERTa (Decoding-enhanced BERT with disentangled attention), that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
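The disentangled attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the content and relative-position vectors have already been projected into query/key space, and the function and argument names (`relative_bucket`, `disentangled_attention`, `q_c`, `k_c`, `q_r`, `k_r`, `k`) are hypothetical. The clipping of relative distances to `[0, 2k - 1]` and the `1 / sqrt(3d)` scaling follow the formulation in the body of the DeBERTa paper as recalled here, not the abstract itself.

```python
import math


def relative_bucket(i, j, k):
    """Map the signed distance i - j to an index in [0, 2k - 1] (assumed clipping rule)."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def disentangled_attention(q_c, k_c, q_r, k_r, k):
    """Return an n x n matrix of attention weights.

    q_c, k_c: per-token content queries/keys (n vectors of size d).
    q_r, k_r: relative-position queries/keys (2k vectors of size d).
    Each score is the sum of three terms: content-to-content,
    content-to-position, and position-to-content.
    """
    n, d = len(q_c), len(q_c[0])
    scale = 1.0 / math.sqrt(3 * d)  # three summed terms, hence 3d
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            c2c = dot(q_c[i], k_c[j])
            c2p = dot(q_c[i], k_r[relative_bucket(i, j, k)])
            p2c = dot(k_c[j], q_r[relative_bucket(j, i, k)])
            row.append((c2c + c2p + p2c) * scale)
        weights.append(softmax(row))
    return weights
```

Calling `disentangled_attention` with all-zero `q_r` and `k_r` reduces the score to the content-to-content term alone, which is one way to see what the two position terms contribute.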
citing papers explorer
-
ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts
ViLegalNLI is the first 42k-pair Vietnamese legal NLI dataset built via semi-automatic LLM-assisted generation and validation.
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
-
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
-
Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity?
VSE perturbs images only to probe visual ambiguity in VLMs, clusters outputs into semantic prototypes, and computes mass-weighted dispersion, outperforming prior entropy methods on five VQA benchmarks across five models.
-
Anchors that Don't Lift: Understanding Supply Chain Driven Kernel Lock-In and Governance-Mediated Mitigation Strategies in SOHO Devices
Kernel lock-in from SoC SDKs creates inherited vulnerability debt in SOHO devices, with SoC vendor community engagement as the viable mitigation strategy.
-
Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees
Introduces object-level semantic uncertainty for VLM memory, the UQ-DAAAM refinement system, and probabilistic guarantees that selected high-quality views reduce uncertainty more effectively.
-
SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks
SlotGCG uses Vulnerable Slot Score (VSS) to identify and target the most vulnerable prompt positions for adversarial token insertion, delivering 14% higher ASR than standard GCG and 42% higher against defenses.
-
RWGBench: Evaluating Scholarly Positioning in Related Work Generation
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
-
Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies
Prompting LLMs with test-taking strategies for true/false factuality checks reduces tokens by over 80%, matches strong baselines on two benchmarks with SOTA on one, and enables fine-tuned SLMs to perform similarly at low cost with rationales.
-
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling
RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.
-
Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
-
Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media
DSR uses transformer models to detect sentiment targets in text and score them along three theory-motivated axes, with validation showing correlations to existing social science datasets.
-
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.
-
Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER
JPT enables bidirectional token classification in causal LLMs for zero-shot NER via input concatenation plus definition-guided embeddings, delivering +7.9 F1 gains and over 20x speedup on benchmarks.
-
The Indra Representation Hypothesis for Multimodal Alignment
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.
-
Group Representational Position Encoding
GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
-
When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA
QA-SNNE adds question-answer alignment via bilateral gating to semantic nearest neighbor entropy, yielding higher AUROC for uncertainty detection in surgical VQA models under both standard and rephrased questions.
-
Know Your Source: A Public Knowledge Store for Media Background Checks
MEDIAREF is a publicly available knowledge store of documents from 200 media sources that enables low-cost, reproducible evaluation of media background check generation for fact-checking systems.
-
DPPE: Rethinking Camera-Based Positional Encoding for Scaling Multi-View Transformers
DPPE decouples rotation and translation in camera positional encodings for multi-view transformers to resolve late-stage training stagnation and improve generalization in novel view synthesis.
-
Multi-Source Transfer Learning of Sparse Single-Index Models
Proposes a source-data-free transfer learning framework for sparse single-index models that transfers generalized Stein's lemma summaries and uses a guided MLP for nonlinear adaptation.
-
ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence
ITNet frames convolution, attention, and recurrence as special cases of one learnable integral transform with an MLP kernel and shows a single shared operator plus modality encoders matches specialized models on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2.
-
Selection Integrity for LLM Graph Memory: An Accumulability Criterion for Information-Flow-Blind Retrieval
Provenance checks in graph memory are blind to structural attacks that reallocate top-k membership; authselect prevents this by enforcing selection on the authenticated subgraph only.
-
Constrained Paraphrase Consistency for LLM Hallucination Detection
CCHD formulates hallucination detector training as constrained optimization with paraphrase-consistency and label-preservation rules solved via gradient descent-ascent, outperforming baselines on factuality benchmarks.
-
Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification
Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.
-
GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis
GHI introduces an incidence-based structural reasoning layer using Graphormer on conditioned hypergraphs for ABSA, reporting outperformance on SemEval benchmarks, near-parity with 11B models at 247M parameters, and robustness on ARTS.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
-
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
-
An Information-theoretic Propagation Denoising and Fusion Framework for Fake News Detection
InfoPDF uses mutual information to suppress noise in LLM-generated synthetic propagation graphs and adaptively fuse them with real data, yielding more discriminative representations for fake news detection.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while matching or exceeding it on two text-classification benchmarks and compressing the
-
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling private use.
-
Beyond Importance Sampling: Rejection-Gated Policy Optimization
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
-
Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking
Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.
-
Million Tutoring Moves (MTM): An Open Multimodal Dataset for the Science of Tutoring
MTM v1 releases 4,654 open math tutoring transcripts as the first step toward a large-scale multimodal repository for studying and improving tutoring.
-
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.
-
From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.
-
Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion
Intermediate decoder hidden states from frozen LVLMs fused with ID embeddings outperform caption representations and deliver state-of-the-art micro-video recommendation performance on two real-world benchmarks.
-
Interpretability from the Ground Up: Stakeholder-Centric Design of Automated Scoring in Educational Assessments
AnalyticScore applies new FGTI interpretability principles to text-based scoring and achieves accuracy within 0.06 QWK of uninterpretable state-of-the-art while matching human featurization on the ASAP-SAS dataset.
-
Positional Encoding via Token-Aware Phase Attention
TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.
-
TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings
TriagerX combines dual-transformer content rankings with developer interaction history to improve top-k accuracy for developer and component recommendations in bug triaging across five datasets.
-
LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
-
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
-
Demystifying CLIP Data
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
-
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
-
BamiBERT: A New BERT-based Language Model for Vietnamese
BamiBERT is a new base-sized Vietnamese BERT model trained on raw text that outperforms PhoBERT on 11 of 15 metrics across 8 benchmarks.
-
Mixture-of-Control: State-Aware Fine-Tuning for Transformer-based Models
Mixture-of-Control adaptively combines local and global control states in transformer fine-tuning by treating per-block states as experts in a sparse MoE setup to improve cross-block communication while keeping memory and compute costs comparable to prior state-based methods.
-
Cross Paraphrastic Invariance Learning for Hallucination Detection
CPIL is a contrastive two-stage method that enforces paraphrase invariance on limited labeled data to outperform baselines in hallucination detection across 11 tasks.
-
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.