Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

1005 Pith papers citing it

Background 55% of classified citations

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

citation-role summary

background 122 · baseline 46 · method 28 · other 8 · dataset 3

citation-polarity summary

background 114 · baseline 47 · use method 28 · unclear 12 · support 3 · use dataset 3
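
The page does not say how the 55% background share is derived. As a minimal sketch, assuming each share is simply a label's count divided by the total of classified citations, the counts in the two summaries above can be turned into percentages as follows. The function name `shares` and the dictionary names are illustrative, not part of Pith.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {
    "background": 122, "baseline": 46, "method": 28, "other": 8, "dataset": 3,
}
polarity_counts = {
    "background": 114, "baseline": 47, "use method": 28,
    "unclear": 12, "support": 3, "use dataset": 3,
}


def shares(counts):
    """Return each label's share of the total, in percent (assumed formula)."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Both summaries cover the same number of classified citations.
print(sum(role_counts.values()), sum(polarity_counts.values()))
print(f"role background: {role_shares['background']:.1f}%")
print(f"polarity background: {polarity_shares['background']:.1f}%")
```

Under this assumption, both summaries total 207 classified citations, and the quoted 55% matches the polarity counts (114 of 207) rather than the role counts (122 of 207, about 59%). This is an inference from the numbers shown, not something the page states.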

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

citing papers explorer

Showing 50 of 1005 citing papers.

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification cs.CL · 2026-05-25 · unverdicted · none · ref 4 · internal anchor
Authors develop a human-LLM collaborative annotation framework and construct the WhoSaidIt multilingual dataset for nine speaker-attribute labels, revealing cross-lingual annotation differences and LLM limitations.
Extending Embodied Question Answering from Perception to Decision cs.RO · 2026-05-25 · unverdicted · none · ref 13 · internal anchor
Introduces EQA-Decision dataset with 4M+ QA pairs across four embodied reasoning dimensions and RoboDecision baseline for joint perception-reasoning-decision evaluation.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals cs.LG · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement cs.CV · 2026-05-21 · unverdicted · none · ref 5 · 2 links · internal anchor
The paper presents a case-aware multimodal knowledge graph approach for medical image classification that retrieves similar cases, propagates knowledge via graph attention, and refines predictions with reliability estimates.
One-Way Policy Optimization for Self-Evolving LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.
ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling cs.AI · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
ExComm adds cross-agent conflict detection and soft belief correction plus trajectory diversification to agentic test-time scaling, yielding 5-6% gains over baselines on AIME and GAIA benchmarks.
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support cs.AI · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.
The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation cs.LG · 2026-05-21 · unverdicted · none · ref 33 · internal anchor
ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.
Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models cs.CV · 2026-05-20 · unverdicted · none · ref 37 · internal anchor
Vision language models are used in zero-shot mode to infer vehicle make/model/generation and accurate 3D dimensions from image crops, improving label quality and reducing manual effort especially under occlusion.
CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety cs.CL · 2026-05-20 · unverdicted · none · ref 58 · internal anchor
CR4T is a model-agnostic framework using lightweight risk detection and domain-conditioned rewriting to convert unsafe or refusal-style LLM responses into developmentally appropriate guidance for adolescents.
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media cs.CL · 2026-05-20 · unverdicted · none · ref 59 · internal anchor
Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset cs.CV · 2026-05-20 · unverdicted · none · ref 15 · internal anchor
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs cs.CR · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
FRA-Attack uses high-pass DCT feature alignment and frequency-domain gradient regularization to boost adversarial transferability across 15 MLLMs from 7 vendors.
SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR cs.CL · 2026-05-20 · unverdicted · none · ref 31 · 2 links · internal anchor
New categorical error decomposition framework SCRIBE for Indic ASR evaluation, with released rich transcription models for Hindi, Malayalam, and Kannada.
QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs cs.CV · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
QwenSafe adapts Qwen3-VL-8B via SFT and DPO on a metadata2CRD synthesis pipeline to classify 12 Apple CRDs, reporting large gains in positive-class recall over Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash.
CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation cs.RO · 2026-05-19 · unverdicted · none · ref 25 · internal anchor
CLUE adaptively weights room-type and object-co-location cues from an LLM to construct a unified semantic value map that improves success rate and efficiency in zero-shot object-goal navigation.
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data cs.CV · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
StableVLA adds an Information Bottleneck Adapter to VLA models that improves robustness to visual corruptions by 30% on average with under 10M extra parameters and no extra data, even when using a much smaller backbone.
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency cs.CV · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.
TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning cs.AI · 2026-05-18 · unverdicted · none · ref 8 · internal anchor
TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation cs.CV · 2026-05-17 · unverdicted · none · ref 10 · internal anchor
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning cs.MA · 2026-05-16 · unverdicted · none · ref 29 · internal anchor
PyraVid is a hierarchical multimodal memory system that structures long videos into pyramids to improve long-horizon reasoning and evidence aggregation.
Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning cs.CL · 2026-05-16 · conditional · none · ref 86 · internal anchor
Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.
Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments cs.AI · 2026-05-15 · unverdicted · none · ref 5 · internal anchor
Empirical replication across three LLMs shows only 31 of 213 user-state metrics meet reliability criteria for individual scores, supporting a validation framework for responsible AI in adaptive environments.
DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer cs.GR · 2026-05-15 · unverdicted · none · ref 8 · internal anchor
DealMaTe proposes a simplified diffusion framework for material transfer that injects multi-dimensional 3D conditions via Multi-Dim 3D Shader LoRA and Shader Causal Mutual Attention with KV caching.
MeMo: Memory as a Model cs.CL · 2026-05-14 · unverdicted · none · ref 73 · 2 links · internal anchor
MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance cs.LG · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.
What Limits Vision-and-Language Navigation ? cs.RO · 2026-05-13 · unverdicted · none · ref 58 · internal anchor
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation cs.CL · 2026-05-12 · unverdicted · none · ref 13 · internal anchor
Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.
AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling cs.SD · 2026-05-12 · unverdicted · none · ref 10 · 2 links · internal anchor
AuDirector proposes a self-reflective closed-loop multi-agent framework with identity-aware pre-production, collaborative synthesis-correction, and human-guided refinement for coherent immersive audio storytelling.
Evaluating the False Trust Engendered by LLM Explanations cs.HC · 2026-05-11 · unverdicted · none · ref 47 · 2 links · internal anchor
LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents cs.CL · 2026-05-11 · unverdicted · none · ref 7 · 2 links · internal anchor
Proposes image-bank harness and ODE closed-loop data generation to boost multimodal deep search agents, reporting average score gains from 24.9% to 39.0% on 8 benchmarks for 8B model and 30.6% to 41.5% for 30B.
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence cs.CV · 2026-05-11 · unverdicted · none · ref 15 · internal anchor
Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CV · 2026-05-11 · unverdicted · none · ref 31 · 2 links · internal anchor
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution cs.AI · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to smaller models.
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors cs.AI · 2026-05-09 · unverdicted · none · ref 9 · internal anchor
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 60 · 2 links · internal anchor
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
Position: The Term "Machine Unlearning" Is Overused in LLMs cs.CL · 2026-05-08 · accept · none · ref 4 · internal anchor
Machine unlearning should be restricted to dataset-defined deletion achieving retraining equivalence, while other LLM tasks require separate terminology and evaluation baselines.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents cs.LG · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction cs.CL · 2026-05-07 · unverdicted · none · ref 69 · internal anchor
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 8 · 4 links · internal anchor
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
Attribution-Guided Continual Learning for Large Language Models cs.LG · 2026-05-06 · unverdicted · none · ref 4 · internal anchor
An attribution-based continual learning framework for LLMs modulates per-parameter gradients using task-specific importance scores to reduce forgetting of prior tasks.
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models cs.LG · 2026-05-06 · unverdicted · none · ref 5 · internal anchor
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs cs.CV · 2026-05-04 · unverdicted · none · ref 16 · internal anchor
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection cs.CV · 2026-05-02 · unverdicted · none · ref 18 · internal anchor
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation cs.LG · 2026-05-01 · unverdicted · none · ref 3 · internal anchor
Ψ-RAG improves cross-document multi-hop QA performance using an adaptive hierarchical abstract tree and agent-powered hybrid retrieval, outperforming RAPTOR by 25.9% and HippoRAG 2 by 7.4% in average F1.
Agentic AI for Trip Planning Optimization Application cs.AI · 2026-04-30 · unverdicted · none · ref 10 · internal anchor
An orchestrated multi-agent AI framework for trip planning optimization paired with a new ground-truth dataset achieves 77.4% accuracy on the TOP Benchmark, outperforming single-agent and workflow baselines.
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants cs.AI · 2026-04-30 · unverdicted · none · ref 14 · internal anchor
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
Heterogeneous Scientific Foundation Model Collaboration cs.AI · 2026-04-30 · unverdicted · none · ref 58 · internal anchor
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent cs.IR · 2026-04-29 · unverdicted · none · ref 5 · 2 links · internal anchor
HLTM structures textual data into a schema-aligned memory tree for scalable ingestion and low-latency retrieval in LinkedIn's Hiring Assistant, reporting over 5% higher answer correctness, over 10% higher retrieval F1, and a better latency tradeoff, with full production deployment.