super hub Mixed citations

Qwen Technical Report

author=, Qwen technical report · 2023 · cs.CL · arXiv 2309.16609

Mixed citation behavior. Most common role is background (67%).

523 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 523 citing papers more from author= arXiv PDF

abstract

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 83 baseline 16 method 16 dataset 1 extension 1 other 1

citation-polarity summary

background 79 baseline 16 use method 16 unclear 4 extend 1 support 1 use dataset 1

claims ledger

abstract Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a mult

authors

author= Qwen technical report

co-cited works

representative citing papers

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

cs.CR · 2026-05-27 · unverdicted · novelty 8.0

SeedHijack is a blind, integrity-preserving PRNG hijacking attack that amplifies LLM watermark z-scores up to 2.42x while evading all tested content-side statistical detectors across three schemes and models.

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

cs.AI · 2026-05-14 · unverdicted · novelty 8.0 · 2 refs

LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0 · 2 refs

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

cs.CV · 2026-06-04 · unverdicted · novelty 7.0 · 2 refs

VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.

Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

Affordance2Action introduces A2A-Bench, a manipulation-oriented benchmark for scene-level task-conditioned affordance grounding covering single- and multi-region correspondences, plus an annotation pipeline, and reports gaps in existing segmentation and VLM baselines.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

RogueMerge: Robust and Unified Attacks against LLM Model Merging

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

RogueMerge is a unified attack method that jointly optimizes task vectors to succeed after merging, using stochastic min-max simulation for unknown merging settings and a Taylor-approximated DRO for prompt generalization on generative LLMs.

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

Introduces ChiSafe-PAS, a 1,897-prompt human-annotated Chinese adversarial benchmark for LLM safety with 3-class labels, 9-category obfuscation taxonomy, and domain coverage in self-harm, drugs, fraud, and satire.

EvoGM: Learning to Merge LLMs via Evolutionary Generative Optimization

cs.NE · 2026-05-28 · unverdicted · novelty 7.0

EvoGM uses a dual-generator architecture with cycle-consistent learning on winner-loser pairs from search history to optimize LLM merging coefficients inside a multi-round evolutionary pipeline and reports outperformance over baselines on seen and unseen tasks.

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

FTibSuite provides human-verified multimodal corpora, Tibetan-adapted benchmarks with quality controls, and a baseline VLM showing gains on tasks like MMBench while preserving Chinese capabilities.

citing papers explorer

Showing 50 of 108 citing papers after filters.

Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 4 · internal anchor
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning cs.CL · 2025-07-21 · unverdicted · none · ref 1 · internal anchor
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 3 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration cs.CL · 2025-05-16 · conditional · none · ref 45 · internal anchor
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL cs.CL · 2025-03-10 · unverdicted · none · ref 4 · internal anchor
A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
Hallucinations are inevitable but can be made statistically negligible cs.CL · 2025-02-15 · unverdicted · none · ref 27 · internal anchor
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
Qwen2.5-1M Technical Report cs.CL · 2025-01-26 · accept · none · ref 3 · internal anchor
Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 5 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition cs.CL · 2024-09-17 · unverdicted · none · ref 6 · internal anchor
DynamicNER is a dynamic-categorization multilingual NER dataset with 155 entity types paired with CascadeNER, a two-stage lightweight LLM method claiming higher fine-grained accuracy.
Mixture-of-Agents Enhances Large Language Model Capabilities cs.CL · 2024-06-07 · unverdicted · none · ref 1 · internal anchor
A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 5 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 5 · internal anchor
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive cs.CL · 2024-02-20 · conditional · none · ref 100 · internal anchor
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models cs.CL · 2024-02-05 · unverdicted · none · ref 4 · internal anchor
DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
IR3DE: A Linear Router for Large Language Models cs.CL · 2026-06-04 · unverdicted · none · ref 2 · internal anchor
IR3DE is a ridge regression router for domain-expert LLMs that matches or exceeds baselines in language modeling and reasoning tasks while allowing dynamic expert addition or removal without retraining.
The Role of Ambiguity in Error Prediction via Uncertainty Quantification cs.CL · 2026-06-01 · unverdicted · none · ref 17 · internal anchor
Disentangling input ambiguity from uncertainty quantification improves error prediction for LLMs on QA tasks, yielding over 10 PRR point gains across models and datasets.
Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design cs.CL · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
SkillPCF is a closed-loop agent framework with a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded evolution that improves design quality and efficiency for photonic crystal fiber inverse design under limited simulation budgets.
Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search cs.CL · 2026-05-26 · unverdicted · none · ref 15 · internal anchor
QDET deploys a 7B-parameter model fine-tuned with three auxiliary tasks and RL that matches a 671B model's F1 on query-driven timeline summarization while delivering measurable gains in production search metrics.
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media cs.CL · 2026-05-20 · unverdicted · none · ref 39 · internal anchor
Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants cs.CL · 2026-05-10 · unverdicted · none · ref 60 · internal anchor
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM cs.CL · 2026-05-07 · unverdicted · none · ref 63 · 2 links · internal anchor
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation cs.CL · 2026-05-02 · unverdicted · none · ref 70 · internal anchor
MTA is a distillation method that aligns teacher-student LLM representations along their transformation trajectories using layer-adaptive granularities and dynamic structural plus hidden representation alignment losses.
Understanding the Prompt Sensitivity cs.CL · 2026-04-20 · unverdicted · none · ref 47 · internal anchor
LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.
Qwen3.5-Omni Technical Report cs.CL · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali cs.CL · 2026-03-25 · unverdicted · none · ref 8 · internal anchor
Fine-tuning Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali data enables effective generation where zero-shot fails, with Qwen3-8B performing best overall and Llama-3.1-8B showing the largest gains.
HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization cs.CL · 2026-03-23 · unverdicted · none · ref 1 · internal anchor
HUOZIIME is an on-device LLM-powered input method with post-training on synthesized data and hierarchical memory that achieves efficient execution and memory-driven personalization.
Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization cs.CL · 2025-11-19 · unverdicted · none · ref 83 · internal anchor
MAGIC-HMO is a multi-agent framework that treats Chinese short-form creative NLG as heterogeneous multi-objective optimization over personalized constraints plus explanation reliability and outperforms baselines on a baby-naming benchmark.
Qwen2.5-Omni Technical Report cs.CL · 2025-03-26 · conditional · none · ref 3 · internal anchor
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 148 · internal anchor
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 117 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning cs.CL · 2024-09-10 · unverdicted · none · ref 15 · internal anchor
E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.
InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 164 · internal anchor
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring cs.CL · 2026-05-04 · unverdicted · none · ref 39
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
Qwen3 Technical Report cs.CL · 2025-05-14 · unverdicted · none · ref 4
Pith review generated a malformed one-line summary.
From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation cs.CL · 2026-06-04 · unverdicted · none · ref 1 · internal anchor
Persona-conditioned LLMs fail to consistently capture human-like inter-group disagreement, in-group sensitivity, and vicarious prediction in hate speech annotation, though Llama 3.1 with vicarious prompting performs best.
Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience cs.CL · 2026-05-24 · unverdicted · none · ref 2 · internal anchor
A textbook-derived neuroscience knowledge graph supplies synthetic multi-hop QA supervision and RL rewards to fine-tune a small LM claimed to exceed larger general models on expert reasoning.
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder cs.CL · 2026-05-19 · unverdicted · none · ref 1 · internal anchor
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
Toward LLMs Beyond English-Centric Development cs.CL · 2026-05-15 · unverdicted · none · ref 1 · internal anchor
Analysis of open-weight LLMs reveals strong English bias in generated sequences, with continual pre-training providing no cost benefit over from-scratch training for non-English adaptation.
Gyan: An Explainable Neuro-Symbolic Language Model cs.CL · 2026-05-06 · unverdicted · none · ref 9 · 2 links · internal anchor
Gyan is a novel explainable non-transformer language model that achieves SOTA results on multiple datasets by mimicking human-like compositional context and world models.
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures cs.CL · 2026-05-04 · unverdicted · none · ref 33 · internal anchor
SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs cs.CL · 2026-05-04 · unverdicted · none · ref 1 · 2 links · internal anchor
HalluScan benchmark evaluates hallucination detection in LLMs, reporting NLI Verification at AUROC 0.88 and introducing HalluScore (r=0.41 with humans) plus Adaptive Detection Routing for 2x cost savings.
Effects of Cross-lingual Evidence in Multilingual Medical Question Answering cs.CL · 2026-04-22 · unverdicted · none · ref 11 · internal anchor
Combining English and target-language web retrieval boosts medical QA for low-resource languages to match high-resource performance, while English web data benefits high-resource languages most and specialized sources like PubMed lack multilingual coverage.
WisdomInterrogatory (LuWen): An Open-Source Legal Large Language Model Technical Report cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
LuWen is a new open-source Chinese legal LLM that outperforms baselines on judgment prediction, judicial exams, summarization, article QA, and decision reasoning through legal-domain adaptation of a general base model.
Qwen2.5-Coder Technical Report cs.CL · 2024-09-18 · unverdicted · none · ref 4 · internal anchor
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 4 · internal anchor
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 136 · internal anchor
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance cs.CL · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
Analysis of 144 task-model pairs finds mathematical reasoning produces the highest attention entropy in all architectures while decoder models show significantly higher sparsity than encoders.
On The Landscape of Spoken Language Models: A Comprehensive Survey cs.CL · 2025-04-11 · unverdicted · none · ref 3 · internal anchor
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
Qwen2.5 Technical Report cs.CL · 2024-12-19 · unverdicted · none · ref 6 · internal anchor
Qwen2.5 LLMs scale pre-training data to 18 trillion tokens and apply multistage reinforcement learning, achieving competitive performance on benchmarks with models up to 5 times larger.

Qwen Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer