Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

· 2024 · cs.LG · arXiv 2404.04475

Mixed citation behavior. Most common role is background (33%).

62 Pith papers citing it

abstract

LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98.
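The two steps the abstract describes (fit a GLM on the annotator's preferences, then predict with the length difference set to zero) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a logistic model with only an intercept and a linear length-difference term, fitted on synthetic data. The variable names, the `fit_logistic` helper, and the simulated bias strength are all hypothetical, and the paper's "other relevant features" are left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, l2=1e-6):
    """Fit a logistic regression (binomial GLM, logit link) with Newton's method."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + l2 * w
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + l2 * np.eye(X.shape[1])
        w -= np.linalg.solve(hess, grad)
    return w

# Hypothetical synthetic data: the model is exactly as good as the baseline
# (true_quality = 0), but it writes longer outputs and the judge likes length.
rng = np.random.default_rng(0)
n = 5000
len_diff = rng.normal(loc=1.0, scale=1.0, size=n)  # standardized length difference
true_quality = 0.0
length_bias = 0.8
prefs = rng.binomial(1, sigmoid(true_quality + length_bias * len_diff))

# Step 1: fit the GLM on the biased preferences.
X = np.column_stack([np.ones(n), len_diff])
w = fit_logistic(X, prefs.astype(float))

# Step 2: predict preferences with the length difference forced to zero.
X_zero = X.copy()
X_zero[:, 1] = 0.0
raw_win_rate = prefs.mean()
lc_win_rate = sigmoid(X_zero @ w).mean()

print(f"raw win rate:               {raw_win_rate:.3f}")
print(f"length-controlled win rate: {lc_win_rate:.3f}")
```

In this toy setup the raw win rate is inflated by the length preference, while the length-controlled estimate lands near the 50% that the simulated quality parity implies. Real preference data would need the additional features the abstract mentions.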

citation-role summary

background: 3 · dataset: 3 · baseline: 1 · method: 1 · other: 1

citation-polarity summary

background: 3 · use dataset: 3 · baseline: 1 · unclear: 1 · use method: 1

representative citing papers

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 7.0

New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

cs.LG · 2026-04-14 · unverdicted · novelty 7.0

Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves real LLM jailbreak robustness-utility tradeoff.

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

cs.CL · 2026-04-08 · conditional · novelty 7.0

SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

TiCo: Time-Controllable Spoken Dialogue Model

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

cs.CL · 2026-03-02 · unverdicted · novelty 7.0

CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.

Improving Sampling for Masked Diffusion Models via Information Gain

cs.CL · 2026-02-20 · unverdicted · novelty 7.0

Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.

Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

cs.CL · 2026-02-10 · unverdicted · novelty 7.0

Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.

VIDEOP2R: Video Understanding from Perception to Reasoning

cs.CV · 2025-11-14 · conditional · novelty 7.0

VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

cs.LG · 2025-05-19 · conditional · novelty 7.0

A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of model performance in embedding space.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

cs.CL · 2024-06-12 · unverdicted · novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Convex Optimization for Alignment and Preference Learning on a Single GPU

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.

General Preference Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 3 refs

GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.

Evaluating Multi-turn Human-AI Interaction

cs.HC · 2026-05-18 · unverdicted · novelty 6.0

Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.

Dynamic Model Merging Made Slim

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.

Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.

Leveraging RAG for Training-Free Alignment of LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.

G-Zero: Self-Play for Open-Ended Generation from Zero Data

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

Bias and Uncertainty in LLM-as-a-Judge Estimation

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.

Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.

citing papers explorer

Showing 50 of 62 citing papers.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 15 · internal anchor
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps cs.AI · 2026-05-17 · unverdicted · none · ref 5 · internal anchor
New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling cs.LG · 2026-05-14 · unverdicted · none · ref 251 · internal anchor
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought cs.CL · 2026-04-24 · unverdicted · none · ref 8 · internal anchor
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory cs.LG · 2026-04-14 · unverdicted · none · ref 6 · internal anchor
Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves real LLM jailbreak robustness-utility tradeoff.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill cs.CL · 2026-04-08 · conditional · none · ref 13 · internal anchor
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
TiCo: Time-Controllable Spoken Dialogue Model cs.CL · 2026-03-23 · unverdicted · none · ref 7 · internal anchor
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation cs.CL · 2026-03-02 · unverdicted · none · ref 1 · internal anchor
CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.
Improving Sampling for Masked Diffusion Models via Information Gain cs.CL · 2026-02-20 · unverdicted · none · ref 5 · internal anchor
Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.
Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models cs.CL · 2026-02-10 · unverdicted · none · ref 8 · internal anchor
Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.
VIDEOP2R: Video Understanding from Perception to Reasoning cs.CV · 2025-11-14 · conditional · none · ref 12 · internal anchor
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers cs.LG · 2025-05-19 · conditional · none · ref 27 · internal anchor
A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of model performance in embedding space.
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing cs.CL · 2024-06-12 · unverdicted · none · ref 107 · internal anchor
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 126 · internal anchor
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Convex Optimization for Alignment and Preference Learning on a Single GPU cs.LG · 2026-05-22 · unverdicted · none · ref 91 · internal anchor
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
General Preference Reinforcement Learning cs.LG · 2026-05-18 · unverdicted · none · ref 30 · 3 links · internal anchor
GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.
Evaluating Multi-turn Human-AI Interaction cs.HC · 2026-05-18 · unverdicted · none · ref 50 · internal anchor
Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.
Dynamic Model Merging Made Slim cs.LG · 2026-05-17 · unverdicted · none · ref 24 · internal anchor
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment cs.CL · 2026-05-17 · unverdicted · none · ref 71 · internal anchor
Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation cs.CL · 2026-05-14 · unverdicted · none · ref 14 · internal anchor
Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
Leveraging RAG for Training-Free Alignment of LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.
G-Zero: Self-Play for Open-Ended Generation from Zero Data cs.LG · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
Bias and Uncertainty in LLM-as-a-Judge Estimation cs.LG · 2026-05-07 · unverdicted · none · ref 5 · internal anchor
Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections cs.CL · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
Data-dependent Exploration for Online Reinforcement Learning from Human Feedback cs.LG · 2026-05-06 · unverdicted · none · ref 67 · internal anchor
DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training cs.CR · 2026-05-02 · unverdicted · none · ref 11 · internal anchor
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning cs.CR · 2026-04-30 · unverdicted · none · ref 6 · internal anchor
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria cs.HC · 2026-04-29 · unverdicted · none · ref 10 · internal anchor
MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human judgments become automated rules.
Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards cs.AI · 2026-04-23 · unverdicted · none · ref 12 · internal anchor
Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.
Hybrid Policy Distillation for LLMs cs.CL · 2026-04-22 · unverdicted · none · ref 45 · internal anchor
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 61 · internal anchor
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 7 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner cs.LG · 2026-04-20 · unverdicted · none · ref 4 · internal anchor
A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures cs.AI · 2026-04-09 · unverdicted · none · ref 12 · internal anchor
AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 percentage point drop in safety-critical action hit rates.
Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment cs.LG · 2026-04-06 · unverdicted · none · ref 5 · internal anchor
Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
Data Agent: Learning to Select Data via End-to-End Dynamic Optimization cs.LG · 2026-03-08 · unverdicted · none · ref 4 · internal anchor
Data Agent learns a co-evolving sample selection policy end-to-end that accelerates training by over 50% on ImageNet-1k and MMLU with no performance loss.
Factored Causal Representation Learning for Robust Reward Modeling in RLHF cs.LG · 2026-01-29 · unverdicted · none · ref 9 · internal anchor
A factored causal representation learning method improves robustness of reward models in RLHF by isolating causal factors from biases like length and sycophancy using adversarial gradient reversal.
Multiplayer Nash Preference Optimization cs.AI · 2025-09-27 · unverdicted · none · ref 8 · internal anchor
MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards cs.CL · 2025-09-25 · unverdicted · none · ref 9 · internal anchor
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 211 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration cs.CL · 2025-05-16 · conditional · none · ref 27 · internal anchor
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
The Differences Between Direct Alignment Algorithms are a Blur cs.LG · 2025-02-03 · unverdicted · none · ref 14 · internal anchor
A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 58 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence cs.SE · 2024-06-17 · unverdicted · none · ref 6 · internal anchor
An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.
Mixture-of-Agents Enhances Large Language Model Capabilities cs.CL · 2024-06-07 · unverdicted · none · ref 7 · internal anchor
A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control cs.CL · 2026-05-20 · unverdicted · none · ref 8 · internal anchor
LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.
Re-Triggering Safeguards within LLMs for Jailbreak Detection cs.CR · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts cs.CR · 2026-05-04 · accept · none · ref 34 · internal anchor
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding cs.SE · 2026-04-30 · unverdicted · none · ref 27 · internal anchor
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.