thought” of LLM by finding the “circuit

Gemma Team · 2024 · DOI 10.34740/kaggle/m/3301

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 3

citation-polarity summary

background: 2 · unclear: 1

representative citing papers

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

cs.CL · 2025-10-10 · unverdicted · novelty 7.0

FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

cs.CL · 2025-05-27 · unverdicted · novelty 7.0

FinTagging decomposes XBRL tagging into FinNI extraction and FinCL full-taxonomy linking, showing LLMs handle extraction but struggle with fine-grained concept alignment in zero-shot settings.

Improving Dictionary Learning with Gated Sparse Autoencoders

cs.LG · 2024-04-24 · unverdicted · novelty 7.0

Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.

Alignment Dynamics in LLM Fine-Tuning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.

You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

cs.CR · 2026-05-06 · unverdicted · novelty 6.0

NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?

cs.CY · 2026-04-19 · unverdicted · novelty 6.0

LLMs produce lower-fidelity summaries of identical public comments when attributed to lower-status occupations like street vendors versus financial analysts, with inconsistent race effects and no gender effects.

The Realignment Problem: When Right becomes Wrong in LLMs

cs.CL · 2025-11-04 · unverdicted · novelty 6.0

TRACE is a three-stage optimization framework that realigns LLMs to new policies by categorizing preference conflicts, scoring impact via bi-level optimization, and applying hybrid losses without new human annotations.

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

cs.CL · 2025-09-19 · unverdicted · novelty 6.0

CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.

A StrongREJECT for Empty Jailbreaks

cs.LG · 2024-02-15 · conditional · novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

MAPLE: A Meta-learning Framework for Cross-Prompt Essay Scoring

cs.CL · 2026-04-19 · unverdicted · novelty 5.0

MAPLE uses meta-learning with prototypical networks to learn transferable representations and achieves state-of-the-art cross-prompt essay scoring on ELLIPSE, LAILA, and parts of ASAP datasets.

Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings

cs.SI · 2026-04-08 · unverdicted · novelty 5.0

LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.

From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

cs.CL · 2025-07-07 · unverdicted · novelty 5.0

A DPO framework augmented with curriculum learning and two new loss parameters generates veracity explanations for Hindi news using LLMs and PLMs.

citing papers explorer

Showing 18 of 18 citing papers.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control · cs.LG · 2026-04-21 · conditional · none · ref 54
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs · cs.CL · 2025-10-10 · unverdicted · none · ref 25
FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information · cs.CL · 2025-05-27 · unverdicted · none · ref 27
Improving Dictionary Learning with Gated Sparse Autoencoders · cs.LG · 2024-04-24 · unverdicted · none · ref 237
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation · cs.CL · 2026-05-19 · unverdicted · none · ref 81
Alignment Dynamics in LLM Fine-Tuning · cs.LG · 2026-05-18 · unverdicted · none · ref 26
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders · cs.LG · 2026-05-08 · unverdicted · none · ref 22 · 2 links
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation · cs.CR · 2026-05-06 · unverdicted · none · ref 126
Debiasing Reward Models via Causally Motivated Inference-Time Intervention · cs.CL · 2026-04-30 · unverdicted · none · ref 12
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning · cs.LG · 2026-04-22 · unverdicted · none · ref 16
All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs? · cs.CY · 2026-04-19 · unverdicted · none · ref 35
The Realignment Problem: When Right becomes Wrong in LLMs · cs.CL · 2025-11-04 · unverdicted · none · ref 19
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics · cs.CL · 2025-09-19 · unverdicted · none · ref 51
A StrongREJECT for Empty Jailbreaks · cs.LG · 2024-02-15 · conditional · none · ref 12
MAPLE: A Meta-learning Framework for Cross-Prompt Essay Scoring · cs.CL · 2026-04-19 · unverdicted · none · ref 47
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings · cs.SI · 2026-04-08 · unverdicted · none · ref 41
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts · cs.CL · 2026-04-07 · unverdicted · none · ref 5
From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations · cs.CL · 2025-07-07 · unverdicted · none · ref 43
