super hub Mixed citations

Qwen Technical Report

author=, Qwen technical report · 2023 · cs.CL · arXiv 2309.16609

Mixed citation behavior. Most common role is background (67%).

565 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 565 citing papers more from author= arXiv PDF

abstract

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 83 baseline 16 method 16 dataset 1 extension 1 other 1

citation-polarity summary

background 79 baseline 16 use method 16 unclear 4 extend 1 support 1 use dataset 1

claims ledger

abstract Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a mult

authors

author= Qwen technical report

co-cited works

representative citing papers

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

cs.CR · 2026-05-27 · unverdicted · novelty 8.0

SeedHijack is a blind, integrity-preserving PRNG hijacking attack that amplifies LLM watermark z-scores up to 2.42x while evading all tested content-side statistical detectors across three schemes and models.

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

cs.AI · 2026-05-14 · unverdicted · novelty 8.0 · 2 refs

LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.

SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

SSMNBench shows that MLLMs suffer distraction degradation on single-view-sufficient tasks and fail to integrate geometric evidence across views, instead relying on semantic averaging and view preference.

DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

DiT-Reward converts pretrained DiT models into reward predictors that outperform HPSv3 on four benchmarks while providing 1.65x inference speedup.

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

cs.CR · 2026-06-22 · unverdicted · novelty 7.0

FlexServe introduces recallable secure memory and NPU to enable cooperative secure LLM inference on mobile devices, reporting 10.05X TTFT speedup over a basic TrustZone strawman.

From Efficiency to Leakage -- Privacy Backdoor in Federated Language Model Fine-Tuning

cs.CR · 2026-06-18 · conditional · novelty 7.0

NeuroImprint attack assigns isolated memorization neurons to training samples in PEFT adapters, enabling closed-form reconstruction of 59-79% of samples across BERT, GPT-2, Qwen2, and Llama3.2 on multiple datasets.

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

SAGE is a source-agnostic post-hoc correction for LLM unlearning updates that suppresses components aligned with high-energy retained activation directions while preserving the forgetting carrier.

Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

cs.CR · 2026-06-10 · unverdicted · novelty 7.0

Introduces Neighbor Leakage Rate showing high trigger leakage in VLAS backdoors at 3% poisoning, caused by broad activation regions in fine-tuning that hard-negative samples can narrow.

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

Earth-OneVision is a unified 2B-parameter RS-MLLM supporting six modalities and nine tasks via FGVLA, SLIS, and PCMA mechanisms plus a 34M QA-pair dataset, reporting competitive or superior benchmark results versus larger models.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0 · 2 refs

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

cs.CV · 2026-06-04 · unverdicted · novelty 7.0 · 2 refs

VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.

Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

Affordance2Action introduces A2A-Bench, a manipulation-oriented benchmark for scene-level task-conditioned affordance grounding covering single- and multi-region correspondences, plus an annotation pipeline, and reports gaps in existing segmentation and VLM baselines.

citing papers explorer

Showing 48 of 48 citing papers after filters.

From Efficiency to Leakage -- Privacy Backdoor in Federated Language Model Fine-Tuning cs.CR · 2026-06-18 · conditional · none · ref 1 · internal anchor
NeuroImprint attack assigns isolated memorization neurons to training samples in PEFT adapters, enabling closed-form reconstruction of 59-79% of samples across BERT, GPT-2, Qwen2, and Llama3.2 on multiple datasets.
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing cs.DC · 2026-06-07 · conditional · none · ref 18 · 2 links · internal anchor
APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.
Provably Secure Steganography Based on List Decoding cs.CR · 2026-04-23 · conditional · none · ref 1 · internal anchor
List decoding enables a provably secure steganography scheme with higher embedding capacity for LLMs via candidate sets and suffix matching.
PlayCoder: Making LLM-Generated GUI Code Playable cs.SE · 2026-04-21 · conditional · none · ref 5 · internal anchor
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 6 · 2 links · internal anchor
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control cs.RO · 2026-03-18 · conditional · none · ref 2 · internal anchor
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips cs.DC · 2026-01-28 · conditional · none · ref 3 · internal anchor
SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.
Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 5 · internal anchor
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG cs.CL · 2025-11-12 · conditional · none · ref 1 · internal anchor
TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.
PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts cs.CL · 2025-06-06 · conditional · none · ref 4 · internal anchor
PuzzleWorld benchmark reveals state-of-the-art AI models solve only 18% of complex puzzlehunt problems with 40% stepwise accuracy, matching novices but trailing enthusiasts, while fine-tuning on traces yields modest gains.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 2 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
MLVU: Benchmarking Multi-task Long Video Understanding cs.CV · 2024-06-06 · conditional · none · ref 4 · internal anchor
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization cs.CR · 2026-05-06 · conditional · none · ref 3
TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.
An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation cs.CV · 2026-05-31 · conditional · none · ref 3 · internal anchor
Introduces MTRS task, MTRefSeg-21K benchmark of 21K image-text-mask triplets, and MTRefSeg-R1 LVLM baseline that outperforms standard models via two-stage change-aware training.
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models cs.CV · 2026-05-20 · conditional · none · ref 14 · internal anchor
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
DEL: Digit Entropy Loss for Numerical Learning of Large Language Models cs.CL · 2026-05-19 · conditional · none · ref 43 · internal anchor
DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE cs.AI · 2026-05-14 · conditional · none · ref 3 · internal anchor
BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models cs.CV · 2026-05-13 · conditional · none · ref 2 · 2 links · internal anchor
LiteLVLM is a training-free text-guided token pruning strategy that reverses CLIP similarity rankings to retain referent tokens and recover context for efficient pixel grounding while keeping 90% performance.
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems cs.LG · 2026-05-12 · conditional · none · ref 42 · internal anchor
ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation cs.CV · 2026-05-08 · conditional · none · ref 3 · internal anchor
LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation cs.LG · 2026-04-26 · conditional · none · ref 1 · 2 links · internal anchor
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages cs.CL · 2026-04-20 · conditional · none · ref 1 · internal anchor
Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
LWGR: Lagrangian-Constrained Personalized World Knowledge for Generative Recommendation cs.IR · 2026-04-16 · conditional · none · ref 3 · internal anchor
LWGR applies personalized soft instructions for LLM knowledge extraction and Lagrangian primal-dual optimization to selectively fuse beneficial world knowledge into generative recommendation while bounding degradation.
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs cs.CV · 2026-04-15 · conditional · none · ref 3 · 2 links · internal anchor
SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation cs.CV · 2026-04-11 · conditional · none · ref 1 · internal anchor
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors cs.RO · 2026-03-16 · conditional · none · ref 7 · internal anchor
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection cs.LG · 2026-02-08 · conditional · none · ref 3 · internal anchor
OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.
CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis cs.CL · 2025-10-13 · conditional · none · ref 1 · internal anchor
CNSocialDepress is a new benchmark dataset containing 44,178 Chinese social media posts annotated by experts with binary depression risk labels and multidimensional psychological attributes for fine-grained analysis.
Sequential Data Augmentation for Generative Recommendation cs.LG · 2025-09-17 · conditional · none · ref 2 · internal anchor
GenPAS unifies common data augmentation strategies for generative recommendation as special cases of a bias-controlled stochastic sampling process and demonstrates gains in accuracy, data efficiency, and parameter efficiency on benchmarks and industrial data.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation cs.RO · 2025-08-26 · conditional · none · ref 2 · internal anchor
MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.
Fine-Tuning Code Language Models to Detect Cross-Language Bugs cs.SE · 2025-07-29 · conditional · none · ref 4 · internal anchor
Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 217 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 3 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration cs.CL · 2025-05-16 · conditional · none · ref 45 · internal anchor
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 5 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs cs.LG · 2024-06-26 · conditional · none · ref 3 · internal anchor
Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 5 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 5 · internal anchor
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 1 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive cs.CL · 2024-02-20 · conditional · none · ref 100 · internal anchor
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models cs.CV · 2024-01-29 · conditional · none · ref 2 · internal anchor
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions cs.CV · 2023-11-21 · conditional · none · ref 2 · internal anchor
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models cs.LG · 2026-05-04 · conditional · none · ref 2
Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices cs.SE · 2026-04-24 · conditional · none · ref 9 · internal anchor
RealBench is a repo-level code generation benchmark pairing UML diagrams with natural language requirements, revealing that LLMs perform significantly worse on realistic repo-level tasks than existing benchmarks suggest.
SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs cs.IR · 2026-03-30 · conditional · none · ref 5 · internal anchor
SUMMIR is a multimetric ranking model that orders LLM-generated sports insights by importance while incorporating hallucination detection to improve factual reliability across cricket, soccer, basketball, and baseball articles.
SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding cs.CV · 2025-11-05 · conditional · none · ref 14 · internal anchor
SurgViVQA adds temporal video encoding to surgical VideoQA and reports 9-11% gains in keyword accuracy over image-only baselines on two datasets plus improved robustness to question rephrasing.
Semantic-Aware Logical Reasoning via a Semiotic Framework cs.AI · 2025-09-29 · conditional · none · ref 3 · internal anchor
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
Qwen2.5-Omni Technical Report cs.CL · 2025-03-26 · conditional · none · ref 3 · internal anchor
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.

Qwen Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer