super hub Mixed citations

gpt-oss-120b & gpt-oss-20b Model Card

Andy Applebaum, Edwin Arbus, Jason Ai, Lama Ahmad, OpenAI: Sandhini Agarwal, Sam Altman · 2025 · cs.CL · arXiv 2508.10925

Mixed citation behavior. Most common role is background (41%).

428 Pith papers citing it

Background 41% of classified citations

open full Pith review browse 428 citing papers more from Andy Applebaum arXiv PDF

abstract

We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 33 baseline 16 method 16 other 6 dataset 5

citation-polarity summary

background 31 baseline 16 use method 16 unclear 7 use dataset 5 support 1

claims ledger

abstract We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics,

authors

Andy Applebaum Edwin Arbus Jason Ai Lama Ahmad OpenAI: Sandhini Agarwal Sam Altman

co-cited works

representative citing papers

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

TW-LegalBench: Measuring Taiwanese Legal Understanding

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

MathAtlas: A Benchmark for Autoformalization in the Wild

cs.AI · 2026-05-13 · accept · novelty 8.0

MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.

Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.

Measuring the Gap Between Human and LLM Research Ideas

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Introduces GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas, applied to X post-2024 election showing amplification of toxic and right-leaning content varying by ideology.

SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

cs.IR · 2026-06-29 · unverdicted · novelty 7.0

SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performanc

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

citing papers explorer

Showing 50 of 428 citing papers.

Sumi: Open Uniform Diffusion Language Model from Scratch cs.CL · 2026-06-17 · unverdicted · none · ref 27 · internal anchor
Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.
TW-LegalBench: Measuring Taiwanese Legal Understanding cs.CL · 2026-06-17 · unverdicted · none · ref 17 · internal anchor
TW-LegalBench evaluates 13 LLMs on over 30,000 Taiwanese legal tasks from exams and judgments, showing top models pass lawyer thresholds but struggle with exact statute citations.
UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing cs.DC · 2026-06-02 · unverdicted · none · ref 46 · internal anchor
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
RobotValues: Evaluating Household Robots When Human Values Conflict cs.RO · 2026-06-02 · unverdicted · none · ref 49 · internal anchor
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs cs.AI · 2026-05-15 · unverdicted · none · ref 34 · 2 links · internal anchor
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
MathAtlas: A Benchmark for Autoformalization in the Wild cs.AI · 2026-05-13 · accept · none · ref 28 · internal anchor
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
Large Language Models Lack Temporal Awareness of Medical Knowledge cs.LG · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models cs.AR · 2026-05-11 · conditional · none · ref 2 · internal anchor
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 2 · 2 links · internal anchor
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs cs.LG · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
LLM Translation of Compiler Intermediate Representation cs.PL · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
Efficient Training on Multiple Consumer GPUs with RoundPipe cs.DC · 2026-04-29 · conditional · none · ref 36 · internal anchor
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis cs.CL · 2026-04-14 · unverdicted · none · ref 21 · internal anchor
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models cs.CL · 2026-04-13 · conditional · none · ref 29 · internal anchor
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation cs.DC · 2026-04-11 · unverdicted · none · ref 47 · internal anchor
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 50 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning cs.CL · 2026-07-02 · unverdicted · none · ref 2 · internal anchor
SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets cs.CL · 2026-07-02 · unverdicted · none · ref 6 · internal anchor
OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale cs.CL · 2026-07-01 · unverdicted · none · ref 27 · internal anchor
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
Measuring the Gap Between Human and LLM Research Ideas cs.CL · 2026-07-01 · unverdicted · none · ref 41 · internal anchor
LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.
ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving cs.DC · 2026-07-01 · unverdicted · none · ref 29 · 2 links · internal anchor
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale cs.CL · 2026-06-29 · unverdicted · none · ref 56 · internal anchor
Introduces GenAI agent framework for auditing personalization algorithms via synthetic accounts with fixed personas, applied to X post-2024 election showing amplification of toxic and right-leaning content varying by ideology.
SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics cs.IR · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
SABER-Math is an automated benchmark for mathematical IR that uses LLM summaries, topic similarities, and preference tournaments on 283K problems to create reranking tasks, showing embedding models outperform baselines but struggle in symbol-heavy areas and that MTEB does not predict math performanc
Agentic Abstention: Do Agents Know When to Stop Instead of Act? cs.AI · 2026-06-27 · unverdicted · none · ref 25 · internal anchor
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues cs.SE · 2026-06-24 · unverdicted · none · ref 35 · internal anchor
CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.
Do Thinking Tokens Help with Safety? cs.LG · 2026-06-23 · unverdicted · none · ref 32 · internal anchor
Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.
Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis cs.CL · 2026-06-22 · unverdicted · none · ref 42 · internal anchor
Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.
Black-Box Forensics for Conversational LLM Agents cs.CR · 2026-06-21 · unverdicted · none · ref 1 · internal anchor
Empirical study reporting 98% base-model attribution accuracy and cross-encoder fingerprinting of unseen system prompts (AUC 0.768 single-conversation, 0.943 with 50 conversations) in black-box LLM agents.
BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language cs.CL · 2026-06-20 · unverdicted · none · ref 6 · internal anchor
BioMatrix unifies sequences, structures, and language for molecules and proteins inside one decoder-only foundation model via shared discrete tokens and achieves SOTA or competitive results on 77 of 80 downstream tasks.
SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design cs.MA · 2026-06-18 · unverdicted · none · ref 37 · internal anchor
SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.
Hidden Anchors in Multi-Agent LLM Deliberation cs.AI · 2026-06-17 · unverdicted · none · ref 35 · internal anchor
Multi-agent LLM deliberation is modeled with recoverable hidden anchors that allow opinions to escape the convex hull of initial beliefs, unlike classical consensus models.
LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction cs.CL · 2026-06-11 · unverdicted · none · ref 20 · internal anchor
LEDGER provides a corpus of 4,999 annual reports with 31 labeled KPIs and three benchmarks for page-level retrieval, needle-in-haystack lookup, and full KPI extraction from long documents.
CodeAlchemy: Synthetic Code Rewriting at Scale cs.CL · 2026-06-08 · unverdicted · none · ref 1 · internal anchor
CodeAlchemy generates 850B+ tokens of synthetic code data across 15 languages via five strategies and enables 3B models to reach 83.5% HumanEval while beating 10x larger frontier models on new DevEval and TraceEval benchmarks.
BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts cs.CL · 2026-06-08 · unverdicted · none · ref 111 · internal anchor
BenSyc is the first benchmark for conversational sycophancy in Bengali, with top LLMs achieving only 61.8 Macro-F1 on binary detection and 61.7 on five-class classification while often generating overly validating responses.
Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics cs.CL · 2026-06-06 · unverdicted · none · ref 37 · internal anchor
SVR learns a bank of contrastive rubrics from preference data via max-margin boundaries and prompt-conditioned selection, narrowing the gap to human rubrics on RubricBench from 24.1 to 0.3 points.
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models cs.CL · 2026-06-06 · unverdicted · none · ref 78 · internal anchor
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing cs.LG · 2026-06-05 · unverdicted · none · ref 1 · internal anchor
WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.
SentinelRAG: Synthetic Sentinel Knowledge for RAG Database Copyright Protection cs.CR · 2026-06-04 · unverdicted · none · ref 3 · internal anchor
SentinelRAG embeds synthetic fictitious knowledge into RAG databases at 0.1% rate to enable reliable watermark detection with p < 10^{-5} and low false positives across tested datasets.
ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces cs.CL · 2026-06-03 · unverdicted · none · ref 10 · internal anchor
ReasoningFlow represents LLM reasoning traces as DAGs, finding structural similarity across models and that most erroneous steps are unused in final answers.
Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata? cs.CY · 2026-06-03 · unverdicted · none · ref 53 · internal anchor
LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.
ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving cs.DC · 2026-05-30 · unverdicted · none · ref 1 · internal anchor
ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.
Stateful Online Monitoring Catches Distributed Agent Attacks cs.CR · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
A clustering-based stateful online monitor detects distributed multi-agent cyberattacks that evade standard per-transcript monitors, catching them 30% earlier in large-scale simulated traffic with low overhead.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 18 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
Benchmarking Single-Factor Physical Video-to-Audio Generation cs.CV · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
How's it going? Reinforcement learning in language models recruits a functional welfare axis cs.LG · 2026-05-28 · unverdicted · none · ref 23 · internal anchor
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
Predicting Causal Effects from Natural Language Queries using Structured Representations cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Introduces the Query2Effect benchmark and a two-step structured-representation framework for predicting causal effect sizes from natural language queries, with reported gains from fine-tuning and better out-of-domain generalization.
Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation cs.AI · 2026-05-28 · unverdicted · none · ref 22 · internal anchor
Battery-Sim-Agent reframes inverse battery parameter estimation as an LLM reasoning task in closed loop with a simulator and outperforms Bayesian optimization baselines on diverse benchmarks.
Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models cs.CL · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
Kronecker Embeddings replace learned embedding tables with a deterministic byte-level character-position factorization and single projection, reducing parameters over 90% with reported gains in loss and robustness on language modeling tasks.
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning cs.AI · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
CORE distills contrasts between successful and unsuccessful reasoning traces into compact natural-language insights that enable faster model self-improvement on reasoning tasks with fewer rollouts than parametric or other non-parametric baselines.

gpt-oss-120b & gpt-oss-20b Model Card

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer