Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
hub
thought” of LLM by finding the “circuit
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3representative citing papers
FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.
FinTagging decomposes XBRL tagging into FinNI extraction and FinCL full-taxonomy linking, showing LLMs handle extraction but struggle with fine-grained concept alignment in zero-shot settings.
Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.
MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
LLMs produce lower-fidelity summaries of identical public comments when attributed to lower-status occupations like street vendors versus financial analysts, with inconsistent race effects and no gender effects.
TRACE is a three-stage optimization framework that realigns LLMs to new policies by categorizing preference conflicts, scoring impact via bi-level optimization, and applying hybrid losses without new human annotations.
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
MAPLE uses meta-learning with prototypical networks to learn transferable representations and achieves state-of-the-art cross-prompt essay scoring on ELLIPSE, LAILA, and parts of ASAP datasets.
LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.
MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.
A DPO framework augmented with curriculum learning and two new loss parameters generates veracity explanations for Hindi news using LLMs and PLMs.
citing papers explorer
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.
-
FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
FinTagging decomposes XBRL tagging into FinNI extraction and FinCL full-taxonomy linking, showing LLMs handle extraction but struggle with fine-grained concept alignment in zero-shot settings.
-
Improving Dictionary Learning with Gated Sparse Autoencoders
Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.
-
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
-
Alignment Dynamics in LLM Fine-Tuning
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
-
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
-
Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
-
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
-
All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?
LLMs produce lower-fidelity summaries of identical public comments when attributed to lower-status occupations like street vendors versus financial analysts, with inconsistent race effects and no gender effects.
-
The Realignment Problem: When Right becomes Wrong in LLMs
TRACE is a three-stage optimization framework that realigns LLMs to new policies by categorizing preference conflicts, scoring impact via bi-level optimization, and applying hybrid losses without new human annotations.
-
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
-
A StrongREJECT for Empty Jailbreaks
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
-
MAPLE: A Meta-learning Framework for Cross-Prompt Essay Scoring
MAPLE uses meta-learning with prototypical networks to learn transferable representations and achieves state-of-the-art cross-prompt essay scoring on ELLIPSE, LAILA, and parts of ASAP datasets.
-
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings
LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.
-
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.
-
From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
A DPO framework augmented with curriculum learning and two new loss parameters generates veracity explanations for Hindi news using LLMs and PLMs.