SIRA mitigates hallucinations in LVLMs by internally contrasting full visual access against a masked late-layer branch that retains shared context but lacks fine-grained visual evidence.
hub
Qwen2.5-vl technical report
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 19representative citing papers
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
CoLVR uses latent contrastive objectives with angle-based perturbation and RL trajectory rewards to increase exploratory visual reasoning in MLLMs, delivering 5-8% gains on VSP, Jigsaw, and MMStar benchmarks.
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.
TCM-Serve applies modality-aware scheduling to reduce average TTFT by 54% and 78.5% for latency-critical requests in MLLM inference.
SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.
SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.
Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.
OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.
RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.
citing papers explorer
-
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
SIRA mitigates hallucinations in LVLMs by internally contrasting full visual access against a masked late-layer branch that retains shared context but lacks fine-grained visual evidence.
-
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
-
CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization
CoLVR uses latent contrastive objectives with angle-based perturbation and RL trajectory rewards to increase exploratory visual reasoning in MLLMs, delivering 5-8% gains on VSP, Jigsaw, and MMStar benchmarks.
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
-
RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
-
DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines
DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.
-
TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference
TCM-Serve applies modality-aware scheduling to reduce average TTFT by 54% and 78.5% for latency-critical requests in MLLM inference.
-
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models
SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.
-
SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning
SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.
-
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
-
HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.
-
Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
-
RAGEN-2: Reasoning Collapse in Agentic RL
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
-
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
-
Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible
An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.
-
OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.
-
Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning
RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.
- RISE: Reliable Improvement in Self-Evolving Vision-Language Models
- SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning