Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Title resolution pending
24 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3representative citing papers
FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.
FinTagging decomposes XBRL tagging into FinNI extraction and FinCL full-taxonomy linking, showing LLMs handle extraction but struggle with fine-grained concept alignment in zero-shot settings.
Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.
Introduces C²R cross-sample consistency regularization that mitigates feature splitting and absorption in SAEs while preserving reconstruction fidelity.
The paper constructs an SCPI dataset via LLM-based annotation and trains classifiers to detect sensitive personal information in Japanese pre-training corpora, claiming this is the first such exploration.
MOC formalizes a multi-order evidence stream and Semantic-Topological Merging algorithm that improves task performance while cutting communication costs on six datasets.
MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
LLMs produce lower-fidelity summaries of identical public comments when attributed to lower-status occupations like street vendors versus financial analysts, with inconsistent race effects and no gender effects.
TRACE is a three-stage optimization framework that realigns LLMs to new policies by categorizing preference conflicts, scoring impact via bi-level optimization, and applying hybrid losses without new human annotations.
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
MOSAIC uses an Integer Linear Program scheduler for expert placement and prompt assignment plus adaptive aggregation to achieve 1.7-2.3x end-to-end speedup on 4-GPU MoA workloads while keeping accuracy within 0.1pp.
HybridMoE with controlled hybridization and idiomatic property signals yields 5-6% gains in figurative language representation for multilingual vision-language models.
MAPLE uses meta-learning with prototypical networks to learn transferable representations and achieves state-of-the-art cross-prompt essay scoring on ELLIPSE, LAILA, and parts of ASAP datasets.
LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.
MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.
Systematic evaluation shows LLMs frequently give unsafe responses to eating disorder prompts when linguistic cues signal risk, as measured by varying prompt danger levels with clinician feedback.
citing papers explorer
No citing papers match the current filters.