
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

980 Pith papers citing it

Background 55% of classified citations


abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.


citation-role summary

background 122 · baseline 46 · method 28 · other 8 · dataset 3

citation-polarity summary

background 114 · baseline 47 · use method 28 · unclear 12 · support 3 · use dataset 3
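
As a quick arithmetic check on the two summaries above, the sketch below converts the listed counts into percentage shares. The dictionary names and the `shares` helper are illustrative and not part of the page. How the page derives its headline percentage is not stated, so any match with these numbers is an observation rather than a documented rule.

```python
# Counts copied from the citation-role and citation-polarity summaries.
# Variable names are hypothetical, chosen only for this illustration.
role_counts = {
    "background": 122, "baseline": 46, "method": 28, "other": 8, "dataset": 3,
}
polarity_counts = {
    "background": 114, "baseline": 47, "use method": 28,
    "unclear": 12, "support": 3, "use dataset": 3,
}


def shares(counts):
    """Return each label's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


if __name__ == "__main__":
    for name, counts in (("role", role_counts), ("polarity", polarity_counts)):
        print(name, "total:", sum(counts.values()), shares(counts))
```

Both summaries total 207 classified citations. Background accounts for about 58.9% of the role counts and about 55.1% of the polarity counts.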


authors

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

citing papers explorer

Showing 50 of 980 citing papers.

SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization
cs.LG · 2025-10-29 · unverdicted · none · ref 4 · internal anchor
SemanticOpt fine-tunes LLMs on structured Bayesian optimization trajectories augmented with natural-language context to jointly use numerical and semantic evidence for black-box optimization.

From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
cs.LG · 2025-10-09 · unverdicted · none · ref 4 · internal anchor
Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
cs.AI · 2025-10-09 · unverdicted · none · ref 6 · internal anchor
Introduces group matching score for better evaluation of compositional reasoning and Test-Time Matching (TTM) algorithm for unsupervised self-improvement in multimodal models, achieving SOTA gains including surpassing GPT-4.1 and estimated human performance.

Inferring Dynamic Physical Properties from Video Foundation Models
cs.CV · 2025-10-02 · unverdicted · none · ref 5 · internal anchor
Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
cs.CV · 2025-10-02 · conditional · none · ref 9 · internal anchor
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models
cs.AI · 2025-09-30 · unverdicted · none · ref 4 · internal anchor
AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.

Structured In-context Environment Scaling for Large Language Model Reasoning
cs.CL · 2025-09-27 · conditional · none · ref 4 · internal anchor
SIE framework automatically constructs scalable, verifiable reasoning environments from structured data, improving in-domain performance and enabling generalization to out-of-domain math and logic tasks.

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
cs.CL · 2025-09-26 · conditional · none · ref 2 · internal anchor
ChatInject exploits LLM chat template structures to boost indirect prompt injection success rates on agents from ~5-15% to 32-52% across benchmarks, with multi-turn persuasion variants performing best.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
cs.CV · 2025-09-26 · unverdicted · none · ref 7 · internal anchor
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts
cs.AI · 2025-09-26 · unverdicted · none · ref 4 · internal anchor
Retrieval-of-Thought organizes prior reasoning into a thought graph for retrieval and reward-guided recombination, reducing output tokens by up to 40% and latency by 82% while preserving accuracy on reasoning benchmarks.

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
cs.DC · 2025-09-25 · unverdicted · none · ref 3 · internal anchor
Nanomind decomposes LMMs into modular bricks mapped to heterogeneous accelerators with TABM zero-copy transfers, fused low-bit kernels, and a battery-aware scheduler, cutting energy 42.3% and enabling 18.8-hour runtime on a 2000 mAh battery for LLaVA-OneVision-Qwen2-0.5B.

Video models are zero-shot learners and reasoners
cs.LG · 2025-09-24 · unverdicted · none · ref 2 · internal anchor
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

Qwen3-Omni Technical Report
cs.CL · 2025-09-22 · unverdicted · none · ref 6 · internal anchor
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

VideoAgent: Personalized Synthesis of Scientific Videos
cs.AI · 2025-09-14 · unverdicted · none · ref 26 · internal anchor
VideoAgent is a modular framework that redefines scientific video synthesis as an intent-driven planning problem and introduces the SciVidEval benchmark for multimodal quality and pedagogical utility.

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
cs.SD · 2025-09-09 · unverdicted · none · ref 4 · internal anchor
AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.

Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering
cs.CV · 2025-08-31 · unverdicted · none · ref 10 · internal anchor
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV · 2025-08-25 · unverdicted · none · ref 19 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
cs.CV · 2025-08-13 · unverdicted · none · ref 4 · internal anchor
GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.

PrinciplismQA: A Philosophy-Grounded Approach to Assessing LLM-Human Clinical Medical Ethics Alignment
cs.CL · 2025-08-07 · unverdicted · none · ref 6 · internal anchor
PrinciplismQA benchmark reveals significant gaps in LLMs' clinical ethical reasoning despite high knowledge accuracy.

Can Large Language Models Really Recognize Your Name?
cs.CR · 2025-05-20 · unverdicted · none · ref 29 · internal anchor
LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.

Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
cs.CL · 2025-05-20 · unverdicted · none · ref 5 · internal anchor
Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
cs.AI · 2026-05-06 · unverdicted · none · ref 12
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
cs.CV · 2026-05-06 · unverdicted · none · ref 6
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.

The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
cs.AR · 2026-05-05 · unverdicted · none · ref 11
Large-scale GPU fault injection shows NaN/inf outcomes are only 1% of SDC, single-bit flips under 40%, and corruption addresses are periodic, supporting distribution-aware modeling.

AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
cs.CV · 2026-05-04 · unverdicted · none · ref 11
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.

Change-Robust Online Spatial-Semantic Topological Mapping
cs.RO · 2026-05-04 · unverdicted · none · ref 9
CROSS replaces globally consistent metric maps with a pose-aware topological graph of RGB-D keyframes and maintains a bounded Gaussian-mixture belief over poses via sequential hypothesis testing in SE(3) to achieve change-robust spatial-semantic mapping and navigation.

T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
cs.AI · 2026-05-04 · unverdicted · none · ref 4
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression
cs.CV · 2026-07-02 · unverdicted · none · ref 51 · internal anchor
LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.

Agent4cs: A Multi-agent System for Code Summarization in Large Hierarchical Codebases
cs.AI · 2026-07-01 · unverdicted · none · ref 7 · internal anchor
Agent4cs deploys summarization, keyword-extraction, and quality-assurance agents in a bottom-up pipeline that raises semantic consistency by 8% and normalized keyword coverage by up to 38% over structured prompting baselines on seven frontier models.

How Ethos and Pathos Appeals Resonate in Reader Interpretations of Social Media Messages
cs.CL · 2026-07-01 · unverdicted · none · ref 8 · 2 links · internal anchor
Analyses of labeled social media sentences and interpretations show 30% divergence in ethos and pathos, greater variability for charged content, and predictive power for audience attitudes toward the author.

Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising
cs.AI · 2026-07-01 · unverdicted · none · ref 6 · internal anchor
SPIRE approximates page-level slide personalization by training agents to denoise corrupted slide structures via collaborative RL, claiming a proof of consistency as a surrogate for inverse planning.

Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models
cs.SD · 2026-06-30 · unverdicted · none · ref 5 · internal anchor
Introduces an OpenMIC-derived multi-axis benchmark sequence showing that high binary instrument QA accuracy fails to predict robust grounding, with models showing position bias, confusable errors, and temporal bias.

On the Faithfulness of Post-Hoc Concept Bottleneck Models
cs.CV · 2026-06-29 · unverdicted · none · ref 12 · internal anchor
Post-hoc CBMs produce unfaithful concept projections due to covariate shifts and systematic label noise; new metrics are introduced to measure faithfulness separately from accuracy.

Experience Augmented Policy Optimization for LLM Reasoning
cs.LG · 2026-06-29 · unverdicted · none · ref 3 · internal anchor
EAPO reuses prior RL policy experience adaptively at decision points in LLM rollouts with adapted importance sampling and reports gains over prior RLVR methods on math benchmarks.

FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification
cs.LG · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
FlowAWR derives an advantage-weighted rectification for optimal velocity fields in flow models, claiming 2-5x faster convergence than DiffusionNFT on SD3.5-Medium.

REAR: Test-time Preference Realignment through Reward Decomposition
cs.CL · 2026-06-29 · unverdicted · none · ref 106 · internal anchor
REAR decomposes the reward into question and preference components, rescales their balance, and expresses the result as a linear combination of token log-probabilities for efficient integration with best-of-N and tree search.

Concept Removal Guidance: Evidence-Calibrated Negative Guidance for Safe Diffusion Sampling
cs.CV · 2026-06-29 · unverdicted · none · ref 63 · internal anchor
CRG adaptively calibrates negative guidance in diffusion models by estimating concept presence from noise predictions at each step to suppress unwanted content while preserving fidelity.

The Body as Status: Muscularity, Engagement, and Body Image Risk on #GymTok
cs.CY · 2026-06-29 · unverdicted · none · ref 11 · internal anchor
Content analysis of #GymTok videos finds positive associations between muscularity, perceived harm, and engagement metrics, implying TikTok algorithms may amplify muscular ideals and risky behaviors.

ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
cs.SD · 2026-06-27 · unverdicted · none · ref 13 · internal anchor
ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.

Mitigating Batch Effects in Histopathology via Language-Mediated Robust Embedding Generation
cs.CV · 2026-06-27 · unverdicted · none · ref 15 · internal anchor
GLMP generates robust pathology embeddings by routing histology images through an intermediate textual representation produced by general-purpose MLLMs to mitigate batch effects.

CogniRoute: Learning to Route Social Evidence in Omni-Modal Models
cs.CV · 2026-06-18 · unverdicted · none · ref 116 · internal anchor
CogniRoute adds a cognitive schema and route-aware RL to an omni-modal MoE, reaching 59.38% accuracy on a new 118K-example social video QA benchmark and beating prior baselines by 15-27 points.

PromptMark: A Prompt-Guided Iterative-Feedback Framework for Source Code Watermarking
cs.CR · 2026-06-18 · unverdicted · none · ref 23 · internal anchor
PromptMark is a black-box prompt-guided iterative-feedback framework that embeds statistically detectable watermarks in LLM-generated source code via naming patterns while preserving functional correctness.

Multi-View Decompilation for LLM-Based Malware Classification
cs.CR · 2026-06-18 · unverdicted · none · ref 17 · internal anchor
Multi-decompiler prompting improves LLM malware classification F1 by supplying complementary views of the same binary.

Qiskit Code Migration with LLMs
cs.SE · 2026-06-18 · unverdicted · none · ref 141 · internal anchor
A taxonomy-guided RAG system with LLMs reduces hallucinations and improves migration suggestions for Qiskit code compared to unconstrained retrieval.

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
cs.CL · 2026-06-17 · unverdicted · none · ref 12 · internal anchor
LLMs achieve maximum Spearman correlations of 0.152 (direct) and 0.241 (response-based) with human item discrimination values, showing non-random but unreliable signal for distinguishing student proficiency.

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning
cs.AI · 2026-06-16 · unverdicted · none · ref 14 · internal anchor
WEQA proposes a query-adaptive agent framework combining LLMs with wearable data tools, achieving 24% higher accuracy than baselines on a benchmark from four open datasets, with gains in expert-rated usefulness.

From Drift to Coherence: Stabilizing Beliefs in LLMs
cs.LG · 2026-06-16 · unverdicted · none · ref 23 · internal anchor
In multiple-choice QA, LLM beliefs drift early under repeated sampling but self-stabilize; seed-answer prompting and a self-consistency loss reduce drift while preserving accuracy.

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
cs.LG · 2026-06-16 · unverdicted · none · ref 4 · internal anchor
EnvRL incorporates environment dynamics learning via state prediction and inverse dynamics auxiliary objectives into agentic RL, reporting higher success rates than RL-only baselines on ALFWorld and WebShop.

Evaluating Pluralism in LLMs through Latent Perspectives
cs.CL · 2026-06-11 · unverdicted · none · ref 33 · internal anchor
A domain-agnostic framework extracts perspectives from book reviews showing LLMs underrepresent rarer viewpoints relative to human text.

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation
cs.CL · 2026-06-10 · unverdicted · none · ref 1 · internal anchor
Introduces PAND dataset for Persian proverbs and reports a persistent decompression gap in LLMs that explicit reasoning partially reduces.
