New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
hub
Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
38 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
CognitiveBench reveals LLMs suffer representation overlap on joint cognitive tasks due to hierarchical structure; HyCoLLM in hyperbolic space fixes the mismatch and outperforms GPT-4o with far fewer parameters.
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
An empirical red-teaming study measures political Overton Windows across more than 30 open-source LLMs from 10 families and finds left-leaning bias, inverse size correlation, regional variation, and variable jailbreak effectiveness.
SAGE reshapes the reverse-KL anchor via guide function q(x,y) for controllable empirical support expansion, yielding gains in both pass@1 and pass@k on math reasoning benchmarks.
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.