arxiv: 2407.10671 · v4 · submitted 2024-07-15 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

Qwen2 Technical Report

An Yang , Baosong Yang , Binyuan Hui , Bo Zheng , Bowen Yu , Chang Zhou , Chengpeng Li , Chengyuan Li

show 54 more authors

Dayiheng Liu Fei Huang Guanting Dong Haoran Wei Huan Lin Jialong Tang Jialin Wang Jian Yang Jianhong Tu Jianwei Zhang Jianxin Ma Jianxin Yang Jin Xu Jingren Zhou Jinze Bai Jinzheng He Junyang Lin Kai Dang Keming Lu Keqin Chen Kexin Yang Mei Li Mingfeng Xue Na Ni Pei Zhang Peng Wang Ru Peng Rui Men Ruize Gao Runji Lin Shijie Wang Shuai Bai Sinan Tan Tianhang Zhu Tianhao Li Tianyu Liu Wenbin Ge Xiaodong Deng Xiaohuan Zhou Xingzhang Ren Xinyu Zhang Xipin Wei Xuancheng Ren Xuejing Liu Yang Fan Yang Yao Yichang Zhang Yu Wan Yunfei Chu Yuqiong Liu Zeyu Cui Zhenru Zhang Zhifang Guo Zhihao Fan

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language modelsQwen2open-weight modelsmultilingual capabilitiesbenchmark evaluationcoding and reasoningmodel releaseinstruction tuning

0 comments

The pith

Qwen2 releases open models from 0.5B to 72B parameters that outperform most prior open-weight systems on language, coding, math, and reasoning benchmarks while supporting about 30 languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Qwen2 series of large language models, including dense and Mixture-of-Experts variants across a wide parameter range. It establishes that these models exceed the results of most earlier open-weight models, including the Qwen1.5 predecessor, on standard evaluations for understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. Specific scores are given for the 72B model, such as 84.2 on MMLU and 64.6 on HumanEval for the base version. The weights are made publicly available with supporting resources to enable community use and further development.

Core claim

The Qwen2 series consists of foundational and instruction-tuned language models from 0.5 to 72 billion parameters, featuring both dense models and a Mixture-of-Experts model. The 72B base model records 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH. The instruction-tuned 72B version scores 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Qwen2 also demonstrates strong capabilities across approximately 30 languages, and all model weights are released openly on Hugging Face and ModelScope along with tools for quantization, fine-tuning, and deployment.

What carries the argument

The Qwen2 model family, a set of scaled dense and Mixture-of-Experts language models whose training yields the reported benchmark gains and multilingual coverage.

If this is right

Developers can integrate competitive open models into applications for language understanding and generation tasks.
Researchers obtain new strong baselines for advancing work in multilingual systems, coding assistance, and mathematical reasoning.
The public release of weights and fine-tuning resources supports customization for domain-specific uses.
Global users benefit from built-in proficiency across roughly 30 languages for broader accessibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of these models could narrow the practical gap between open and closed AI systems in everyday applications.
The multilingual reach may accelerate development of tools suited to non-English markets and cross-language tasks.
Community fine-tuning on the released weights could produce specialized variants that extend performance in targeted areas like coding or reasoning.

Load-bearing premise

The chosen benchmarks accurately and fairly represent the models' overall capabilities without significant evaluation artifacts or selective reporting.

What would settle it

An independent run of the same benchmarks that produces substantially lower scores for Qwen2 models than reported, especially compared to Qwen1.5, would undermine the performance claims.

read the original abstract

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces the Qwen2 series of LLMs, including dense models (0.5B to 72B parameters) and a MoE model. It reports that the Qwen2-72B base model achieves 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH. The Qwen2-72B-Instruct scores 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. The models demonstrate strong multilingual capabilities across approximately 30 languages and are released openly along with resources for quantization, fine-tuning, and deployment.

Significance. This technical report contributes significantly to the open-source LLM ecosystem by providing a family of models that achieve competitive or superior performance to prior open models on a wide range of tasks including understanding, coding, math, and reasoning. The public release of the model weights enables direct use and further fine-tuning by the research community, potentially accelerating progress in multilingual and specialized applications.

major comments (1)

[Evaluation] Evaluation section: The specific scores for benchmarks like GPQA (37.9 for base) and LiveCodeBench (35.7 for instruct) are presented without detailing the prompting method, number of few-shot examples, or decoding parameters used. This omission makes it challenging to independently verify the claims of surpassing other models under identical conditions.

minor comments (3)

The abstract lists several languages but states 'approximately 30 languages'; a more precise count or complete list in the main text would improve clarity.
Consider including a comparison table that explicitly lists the scores of Qwen1.5 and other models alongside Qwen2 for direct visual comparison.
Ensure consistency in reporting whether scores are for base or instruct models across all mentioned benchmarks.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the Qwen2 Technical Report and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The specific scores for benchmarks like GPQA (37.9 for base) and LiveCodeBench (35.7 for instruct) are presented without detailing the prompting method, number of few-shot examples, or decoding parameters used. This omission makes it challenging to independently verify the claims of surpassing other models under identical conditions.

Authors: We agree that detailed evaluation protocols are necessary for reproducibility and fair comparison. The current manuscript provides limited information on these aspects for GPQA and LiveCodeBench. In the revised version, we will expand the Evaluation section (and add an appendix if needed) to specify the prompting methods, number of few-shot examples, and decoding parameters (including temperature, top-p, and max new tokens) used for these and other benchmarks. This addition will enable independent verification under identical conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The Qwen2 Technical Report is an empirical model release paper whose central claims consist of benchmark scores on external, independently defined datasets (MMLU, GPQA, HumanEval, GSM8K, BBH, MT-Bench, Arena-Hard, LiveCodeBench). No derivations, equations, fitted parameters, or predictions are presented that reduce to self-defined quantities or self-citation chains. Comparisons to prior models and proprietary systems rely on publicly reported numbers rather than internal redefinitions. The results are directly falsifiable by third-party evaluation of the released weights.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical model-release report focused on benchmark results rather than theoretical derivation; no free parameters, axioms, or invented entities underpin the central performance claims.

pith-pipeline@v0.9.0 · 5822 in / 1135 out tokens · 40235 ms · 2026-05-10T14:02:46.189609+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
cs.LG 2026-05 unverdicted novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG 2026-05 conditional novelty 8.0

HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
cs.AI 2026-04 unverdicted novelty 8.0

AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction parado...
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV 2024-09 accept novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
cs.DC 2026-05 conditional novelty 7.0

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
cs.LG 2026-05 unverdicted novelty 7.0

Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
cs.CL 2026-05 unverdicted novelty 7.0

TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
cs.LG 2026-05 unverdicted novelty 7.0

Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
cs.AI 2026-05 unverdicted novelty 7.0

Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG 2026-05 unverdicted novelty 7.0

Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 7.0

Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...
Privacy-preserving Chunk Scheduling in a BitTorrent Implementation of Federated Learning
cs.DC 2026-05 unverdicted novelty 7.0

FLTorrent achieves within-round source unlinkability in decentralized federated learning via a BitTorrent warm-up with pre-round obfuscation, randomized lags, and coordination-only non-owner-first scheduling, reaching...
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
cs.SI 2026-05 unverdicted novelty 7.0

GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven
cs.CL 2026-05 unverdicted novelty 7.0

SeCo performs semantic-driven context compression for LLMs by anchoring on query-relevant semantic centers and applying consistency-weighted token merging, yielding better downstream performance, lower latency, and st...
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
cs.LG 2026-05 unverdicted novelty 7.0

Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG 2026-05 unverdicted novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
cs.CL 2026-05 unverdicted novelty 7.0

Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA
cs.LG 2026-05 unverdicted novelty 7.0

GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
Logic-Regularized Verifier Elicits Reasoning from LLMs
cs.CL 2026-05 unverdicted novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF 2026-05 unverdicted novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
cs.CL 2026-05 unverdicted novelty 7.0

Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing
cs.CR 2026-05 unverdicted novelty 7.0

A multi-agent LLM system cuts false positives in static application security testing by 88.6% on the OWASP Benchmark while dropping recall by only 3.1%.
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
cs.LG 2026-05 unverdicted novelty 7.0

InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
cs.SE 2026-04 unverdicted novelty 7.0

PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.
R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models
cs.CR 2026-04 unverdicted novelty 7.0

R-CoT embeds watermarks into LLM reasoning paths via redundant CoT and GRPO-based dual optimization, maintaining over 95% true positive rate under fine-tuning and post-training changes.
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
cs.LG 2026-04 unverdicted novelty 7.0

Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
cs.AI 2026-04 unverdicted novelty 7.0

A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and li...
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
cs.LG 2026-04 unverdicted novelty 7.0

In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
cs.LG 2026-04 unverdicted novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...
AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI
cs.CV 2026-04 unverdicted novelty 7.0

AmaraSpatial-10K is a new dataset of over 10,000 metric-scaled and semantically anchored 3D assets that achieves 3.4 times higher text retrieval precision than Objaverse for embodied AI and spatial computing.
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
cs.LG 2026-04 unverdicted novelty 7.0

ProjRes achieves near-100% accuracy in membership inference on FedLLMs by measuring projection residuals of hidden embeddings on gradient subspaces, outperforming prior methods by up to 75.75% even under differential privacy.
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
cs.DC 2026-04 unverdicted novelty 7.0

FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
cs.LG 2026-04 unverdicted novelty 7.0

R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
eess.AS 2026-04 unverdicted novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
Super Apriel: One Checkpoint, Many Speeds
cs.LG 2026-04 unverdicted novelty 7.0

A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
SimDiff: Depth Pruning via Similarity and Difference
cs.AI 2026-04 unverdicted novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
cs.CL 2026-04 unverdicted novelty 7.0

Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
cs.PL 2026-04 unverdicted novelty 7.0

Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging
cs.CV 2026-04 unverdicted novelty 7.0

MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
cs.CL 2026-04 unverdicted novelty 7.0

Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
cs.CV 2026-04 unverdicted novelty 7.0

EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
cs.CL 2026-04 unverdicted novelty 7.0

TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
cs.SE 2026-04 unverdicted novelty 7.0

LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
On the Invariants of Softmax Attention
cs.LG 2026-04 unverdicted novelty 7.0

Softmax attention has algebraic invariants including zero-sum rows and head-dimension rank limits, plus consistent variance spread in language models attributed to key incoherence.
Large Language Models Align with the Human Brain during Creative Thinking
q-bio.NC 2026-04 unverdicted novelty 7.0

LLMs show scaling and training-dependent alignment with human brain responses in creativity-related networks during divergent thinking tasks, measured via RSA on fMRI data.
When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Attention sinks in LVLM create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.
PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models
cs.CL 2026-03 unverdicted novelty 7.0

PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.
Post-Selection Distributional Model Evaluation
stat.ML 2026-03 unverdicted novelty 7.0

PS-DME is a new framework that controls post-selection false coverage rate for distributional KPI estimates via e-values and is provably more sample-efficient than data splitting under explicit conditions.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL 2024-12 unverdicted novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
cs.CR 2024-10 unverdicted novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
KTO: Model Alignment as Prospect Theoretic Optimization
cs.LG 2024-02 conditional novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
cs.CL 2026-05 unverdicted novelty 6.0

Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.
Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning
stat.ML 2026-05 unverdicted novelty 6.0

A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% ...
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.