Title resolution pending

gpt-oss-120b & gpt-oss-20b Model Card · 2025

27 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

cs.CL · 2026-05-19 · conditional · novelty 7.0

A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

cs.CL · 2026-05-15 · conditional · novelty 7.0

Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that keeps vocabulary size fixed at 512.

SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate

cs.MA · 2026-05-21 · unverdicted · novelty 6.0

SVR-MAD treats pre-debate signals as priors and debate results as evidence to build a sparser communication graph, cutting token use by up to 61% while preserving or raising accuracy over prior MAD methods.

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

cs.CL · 2026-05-21 · accept · novelty 6.0

Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

cs.SE · 2026-05-21 · unverdicted · novelty 6.0

SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

An empirical red-teaming study measures political Overton Windows across more than 30 open-source LLMs from 10 families and finds left-leaning bias, inverse size correlation, regional variation, and variable jailbreak effectiveness.

The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Interventions in LLM-simulated user experiments induce distribution shifts in latent attributes that create confounding bias, diagnosable with negative control outcomes and partially mitigated by adding setting-relevant persona details.

Predictive Prefetching for Retrieval-Augmented Generation

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.

Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.

LLM Output Detectability and Task Performance Can be Jointly Optimized

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.

Convergent Evolution: How Different Language Models Learn Similar Number Representations

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.

Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

A parameter-free decomposition in MoE models separates routing control from content, showing that expert trajectories cluster tokens by semantic function across languages and forms, making paths rather than experts the natural unit of interpretability.

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education

cs.HC · 2026-05-20 · unverdicted · novelty 5.0

Compares LLMs against semantic similarity for binary classification of student self-explanations in programming education.

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

cs.CL · 2026-05-18 · unverdicted · novelty 5.0

ReacTOD introduces a bounded neuro-symbolic ReAct architecture with symbolic validation that delivers new zero-shot SOTA joint goal accuracy on MultiWOZ 2.1 and strong results on SGD.

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

cs.LG · 2026-05-06 · unverdicted · novelty 5.0

Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

citing papers explorer

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents cs.CL · 2026-05-21 · unverdicted · none · ref 25
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation cs.CL · 2026-05-19 · conditional · none · ref 7
A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.
Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation cs.LG · 2026-05-15 · unverdicted · none · ref 21
Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.
RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably cs.CL · 2026-05-15 · conditional · none · ref 53
Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 6
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt cs.AI · 2026-05-07 · unverdicted · none · ref 21
CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that keeps vocabulary size fixed at 512.
SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate cs.MA · 2026-05-21 · unverdicted · none · ref 13
SVR-MAD treats pre-debate signals as priors and debate results as evidence to build a sparser communication graph, cutting token use by up to 61% while preserving or raising accuracy over prior MAD methods.
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation cs.CL · 2026-05-21 · accept · none · ref 10
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering? cs.SE · 2026-05-21 · unverdicted · none · ref 52
SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.
How Far Will They Go? Red-Teaming Online Influence with Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 37
An empirical red-teaming study measures political Overton Windows across more than 30 open-source LLMs from 10 families and finds left-leaning bias, inverse size correlation, regional variation, and variable jailbreak effectiveness.
The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study cs.CL · 2026-05-20 · unverdicted · none · ref 48
Interventions in LLM-simulated user experiments induce distribution shifts in latent attributes that create confounding bias, diagnosable with negative control outcomes and partially mitigated by adding setting-relevant persona details.
Predictive Prefetching for Retrieval-Augmented Generation cs.CL · 2026-05-18 · unverdicted · none · ref 6
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility cs.LG · 2026-05-13 · unverdicted · none · ref 20
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 24
BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety cs.CL · 2026-05-03 · unverdicted · none · ref 58
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning cs.CL · 2026-05-02 · unverdicted · none · ref 14
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
LLM Output Detectability and Task Performance Can be Jointly Optimized cs.CL · 2026-05-02 · unverdicted · none · ref 17
PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.
Convergent Evolution: How Different Language Models Learn Similar Number Representations cs.CL · 2026-04-22 · unverdicted · none · ref 19
Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.
Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs cs.AI · 2026-04-20 · unverdicted · none · ref 24
A parameter-free decomposition in MoE models separates routing control from content, showing that expert trajectories cluster tokens by semantic function across languages and forms, making paths rather than experts the natural unit of interpretability.
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks cs.CL · 2026-04-20 · unverdicted · none · ref 55
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education cs.HC · 2026-05-20 · unverdicted · none · ref 78
Compares LLMs against semantic similarity for binary classification of student self-explanations in programming education.
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking cs.CL · 2026-05-18 · unverdicted · none · ref 28
ReacTOD introduces a bounded neuro-symbolic ReAct architecture with symbolic validation that delivers new zero-shot SOTA joint goal accuracy on MultiWOZ 2.1 and strong results on SGD.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning cs.LG · 2026-05-06 · unverdicted · none · ref 15
Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL · 2025-12-24 · unverdicted · none · ref 180
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Supplement Generation Training for Enhancing Agentic Task Performance cs.LG · 2026-04-22 · unverdicted · none · ref 49
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning cs.LG · 2026-05-18 · unreviewed · ref 23
RAG over Thinking Traces Can Improve Reasoning Tasks cs.IR · 2026-05-05 · unreviewed · ref 4
