Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Eric Bieber, Gheorghe Comanici, Ice Pasupat, Inderjit Dhillon, Mike Schaekermann, Noveen Sachdeva · 2025 · cs.CL · arXiv 2507.06261

Mixed citation behavior. Most common role is background (55%).

1027 Pith papers citing it

abstract

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

citation-role summary

background 122 · baseline 46 · method 28 · other 8 · dataset 3

citation-polarity summary

background 114 · baseline 47 · use method 28 · unclear 12 · support 3 · use dataset 3
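
The counts in the two summaries above can be turned into percentage shares with simple arithmetic. The sketch below is only an illustration: the `shares` helper and the dictionary names are hypothetical, and the page does not state which summary the headline "55%" figure is derived from. Under these counts, both summaries total 207 classified citations, and the background share works out to roughly 58.9% for the role summary and 55.1% for the polarity summary.

```python
def shares(counts):
    """Return each label's share of the total, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {
    "background": 122,
    "baseline": 46,
    "method": 28,
    "other": 8,
    "dataset": 3,
}
polarity_counts = {
    "background": 114,
    "baseline": 47,
    "use method": 28,
    "unclear": 12,
    "support": 3,
    "use dataset": 3,
}

role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(role_shares)
print(polarity_shares)
```

Note that these shares cover only the 207 classified citation contexts, not all 1027 citing papers.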

representative citing papers

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

cs.CL · 2026-06-14 · unverdicted · novelty 8.0

EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over longitudinal discharge summaries, containing 16,072 medical-expert-verified pairs across eight categories and revealing LLM weaknesses in evidence grounding and multi-turn consistency.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

RRP-Voice: A Longitudinal Dataset and Benchmark for Recurrent Respiratory Papillomatosis Detection

eess.AS · 2026-06-01 · unverdicted · novelty 8.0

Introduces the first longitudinal voice dataset for RRP with benchmarks across handcrafted features, deep networks, self-supervised models, and audio LLMs under patient-level validation.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

cs.CV · 2026-05-17 · unverdicted · novelty 8.0

EgoIntrospect provides the first egocentric dataset with self-annotations for internal state tasks and shows multimodal LLMs struggle to infer subjective states from combined signals.

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

cs.SD · 2026-05-09 · unverdicted · novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

cs.RO · 2026-04-03 · conditional · novelty 8.0

V2X-QA provides a view-decoupled benchmark showing infrastructure views aid macroscopic traffic understanding while cooperative reasoning requires explicit cross-view alignment, with V2X-MoE as a routing-based baseline that improves performance.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

cs.CV · 2025-12-09 · unverdicted · novelty 8.0

ConceptPose delivers state-of-the-art zero-shot relative pose estimation by matching open-vocabulary 3D concept vectors derived from VLM saliency maps, beating the strongest baseline by 62% in ADD(-S) without training.

citing papers explorer

Showing 50 of 1027 citing papers.

LVSum: A Benchmark for Timestamp-Aware Long Video Summarization cs.CV · 2026-04-11 · unverdicted · none · ref 6 · internal anchor
LVSum is a new benchmark for timestamp-aware long video summarization that exposes systematic temporal gaps in existing multimodal large language models.
FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer cs.CV · 2026-04-11 · unverdicted · none · ref 5 · internal anchor
FREE-Switch dynamically switches LoRA adapters using frequency importance per diffusion step and adds semantic alignment to reduce content drift when merging specialized image generators.
Demographic and Linguistic Bias Evaluation in Omnimodal Language Models cs.CV · 2026-04-11 · unverdicted · none · ref 8 · internal anchor
Omnimodal models show reduced demographic bias in image and video tasks compared to substantial biases and lower performance in audio tasks.
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering cs.CV · 2026-04-10 · unverdicted · none · ref 36 · internal anchor
MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five other benchmarks.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks cs.CV · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields cs.CV · 2026-04-09 · unverdicted · none · ref 46 · internal anchor
BLaDA grounds open-vocabulary language into functional dexterous manipulation via knowledge-guided parsing, triangular localization in 3DGS fields, and keypoint grasp execution.
A Decomposition Perspective to Long-context Reasoning for LLMs cs.CL · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
Decomposing long-context reasoning into atomic skills, synthesizing targeted pseudo-datasets, and applying RL improves LLM performance on long-context benchmarks by an average of 7.7%.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL · 2026-04-08 · unverdicted · none · ref 14 · internal anchor
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
AudioKV: KV Cache Eviction in Efficient Large Audio Language Models cs.SD · 2026-04-08 · unverdicted · none · ref 8 · internal anchor
AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models cs.CL · 2026-04-07 · unverdicted · none · ref 4 · internal anchor
Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.
SALLIE: Safeguarding Against Latent Language & Image Exploits cs.CR · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs cs.SE · 2026-04-01 · unverdicted · none · ref 2 · internal anchor
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil cs.CL · 2026-02-16 · unverdicted · none · ref 32 · internal anchor
LLMs handle basic arithmetic reliably in Sinhala and Tamil but show clear performance drops on complex math tasks compared to English.
Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning cs.CV · 2026-02-10 · unverdicted · none · ref 4 · internal anchor
Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.
Embodied Task Planning via Graph-Informed Action Generation with Large Language Models cs.CL · 2026-01-29 · unverdicted · none · ref 2 · internal anchor
GiG uses a Graph-in-Graph architecture with GNN-encoded states, experience memory retrieval, and bounded symbolic lookahead to improve LLM planning on embodied benchmarks with gains up to 37%.
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models cs.CV · 2026-01-29 · unverdicted · none · ref 10 · internal anchor
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving cs.AI · 2026-01-29 · unverdicted · none · ref 6 · internal anchor
An MLLM interpreter generates concise CDL descriptions from diagrams, enabling an off-the-shelf LLM to solve plane geometry problems competitively after training on only 5.5k examples.
Reward-Forcing: Autoregressive Video Generation with Reward Feedback cs.CV · 2026-01-23 · unverdicted · none · ref 4 · internal anchor
Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.
LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback cs.HC · 2026-01-21 · unverdicted · none · ref 14 · internal anchor
LLM-based multimodal feedback matches educator feedback in learning outcomes but exceeds it in student perceptions of quality, engagement, and reduced cognitive load.
Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving cs.CY · 2025-12-10 · conditional · none · ref 5 · internal anchor
Hybrid pipeline using YOLO vision and ngspice verification raises circuit analysis accuracy from Gemini's 79.52% baseline to 97.59%, with similar gains on hand-drawn diagrams.
OneThinker: All-in-one Reasoning Model for Image and Video cs.CV · 2025-12-02 · unverdicted · none · ref 53 · internal anchor
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture cs.AI · 2025-11-28 · unverdicted · none · ref 11 · internal anchor
AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.
DisCEdge: Distributed Context Management for Large Language Models at the Edge cs.DC · 2025-11-27 · unverdicted · none · ref 9 · internal anchor
DisCEdge manages LLM context in tokenized form replicated on edge nodes, delivering up to 14.46% faster median responses, 15% lower sync overhead, and 90% smaller client requests versus baselines while ensuring consistency.
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning cs.CV · 2025-11-19 · unverdicted · none · ref 3 · internal anchor
AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm cs.CV · 2025-11-06 · unverdicted · none · ref 8 · internal anchor
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training cs.DC · 2025-10-17 · unverdicted · none · ref 37 · internal anchor
PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.
SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision eess.AS · 2025-10-03 · unverdicted · none · ref 23 · internal anchor
SongFormer achieves state-of-the-art strict boundary detection and functional label accuracy in music structure analysis by fusing SSL representations and using learned source embeddings on a new 14k-song corpus and expert benchmark.
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models cs.SD · 2025-10-01 · unverdicted · none · ref 7 · internal anchor
Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
What Is The Political Content in LLMs' Pre- and Post-Training Data? cs.CL · 2025-09-26 · unverdicted · none · ref 8 · internal anchor
Training data for open LLMs is systematically left-leaning, with pre-training corpora containing more political material than post-training data and model stances aligning with data distributions.
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs cs.CV · 2025-09-09 · unverdicted · none · ref 4 · internal anchor
Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual evidence.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning cs.AI · 2025-09-02 · conditional · none · ref 15 · internal anchor
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
Self-Rewarding Vision-Language Model via Reasoning Decomposition cs.CV · 2025-08-27 · unverdicted · none · ref 3 · internal anchor
Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.
DiscussLLM: Teaching Large Language Models When to Speak cs.CL · 2025-08-25 · unverdicted · none · ref 2 · internal anchor
DiscussLLM introduces a two-stage synthetic data pipeline to annotate multi-turn discussions with five intervention types and trains LLMs to time contributions via a silent token or proactive responses.
ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge cs.CE · 2025-07-29 · unverdicted · none · ref 4 · internal anchor
ChemDFM-R is a chemical reasoning LLM trained via a four-stage pipeline on the ChemFG dataset of functional-group annotations for molecules and reactions, reaching performance comparable to or better than commercial models on chemical benchmarks.
CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning cs.RO · 2026-07-02 · unverdicted · none · ref 77 · internal anchor
CoRe combines VLM-designed formal rewards with VLM-labeled residual rewards to produce preference-aligned policies on robotic manipulation tasks.
SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search cs.CV · 2026-06-30 · unverdicted · none · ref 64 · internal anchor
SimpleSearch-VL improves Qwen3-VL multimodal agent baselines by 15.8-16 points on average using 7K total training examples and reaches parity with Gemini-3-Pro on the 30B variant.
Uncertainty-Aware Generation and Decision-Making Under Ambiguity cs.CL · 2026-06-29 · unverdicted · none · ref 2 · internal anchor
Uncertainty-aware algorithms based on Bayesian decision theory improve generation utility on tutoring and reviewing tasks while risk-averse methods can degrade performance under high ambiguity, with conformal prediction providing guarantees.
Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation cs.CV · 2026-06-29 · unverdicted · none · ref 11 · internal anchor
ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.
MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control cs.CV · 2026-06-28 · unverdicted · none · ref 11 · internal anchor
MAVIN proposes boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and the MAVINSet dataset as the first framework for multi-shot audio-visual generation with narrative control, claiming SOTA results.
Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation cs.SE · 2026-06-27 · unverdicted · none · ref 8 · internal anchor
Empirical study on five LLMs finds pretrained-to-aligned paths yield bigger gains over baseline than finetuned-to-aligned paths, though absolute accuracy remains lower for pretrained starts.
GROVE: Grounded Pedestrian Simulation via Natural Language for Interactive Social Robot Navigation cs.RO · 2026-06-24 · unverdicted · none · ref 16 · internal anchor
GROVE combines multiple state-of-the-art modules tuned by user prompts to produce realistic long-, medium-, and short-horizon pedestrian behaviors integrated into robot simulators.
S1-Omni-Image: A Unified Model for Scientific Image Understanding, Generation, and Editing cs.CV · 2026-06-23 · unverdicted · none · ref 5 · internal anchor
S1-Omni-Image unifies scientific image understanding, generation and editing via a think-before-generate paradigm on top of S1-VL-32B, trained on a 314K-sample SciGenEdit dataset, and reports SOTA results on multiple generation and editing benchmarks.
UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation cs.CV · 2026-06-23 · unverdicted · none · ref 9 · internal anchor
UniTranslator adds an Understand-Generation Alignment Module and Spatial Mask Decoder to a unified multimodal model to fix translation inconsistency and spatial misalignment in in-image machine translation, reporting SOTA results on multiple benchmarks.
Token-to-Token Alignment of Text Embeddings for Semantic Blending cs.CV · 2026-06-22 · unverdicted · none · ref 10 · internal anchor
Token-to-Token alignment rephrases prompts into shared structure then matches token embeddings by semantic similarity, making linear interpolation a meaningful operation for blending in text-to-image models.
Music Playlist Captioning at Scale with Large Language Models cs.IR · 2026-06-21 · unverdicted · none · ref 13 · internal anchor
Deezer deployed an LLM-driven playlist captioning system in 2025 for its Daily Mix recommendations, claiming significant gains in user engagement from the added natural-language descriptions.
Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages cs.AI · 2026-06-18 · unverdicted · none · ref 18 · internal anchor
Multi-LCB extends LiveCodeBench to 12 languages by translating Python tasks, revealing Python overfitting and performance disparities when evaluating 24 LLMs.
ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots cs.LG · 2026-06-16 · unverdicted · none · ref 20 · internal anchor
ASTRA automates simpilot roles in ATCO training with a fine-tuned ASR pipeline that cuts WER to 23.45% on Singaporean aviation speech and an AI evaluator scoring 86.9-91.7% on accuracy, brevity, and completeness.
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale cs.CL · 2026-06-13 · unverdicted · none · ref 37 · internal anchor
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning cs.CV · 2026-06-10 · unverdicted · none · ref 10 · internal anchor
InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.