Flamingo: a visual language model for few-shot learning
7 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (7)
verdicts: UNVERDICTED (7)
roles: background (2)
polarities: background (2)
citing papers explorer
- RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
  RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
- $M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
  M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
- Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
  Instruction-free tuning of LVLMs on medical image-description pairs via momentum proxy instructions and response shuffling achieves SOTA accuracy on VQA tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR.
- Compressed Video Aggregator: Content-driven Module for Efficient Micro-Video Recommendation
  CVA aggregates frozen VFM embeddings via latent reasoning to create compact video embeddings for efficient micro-video recommendation, delivering consistent performance gains and orders-of-magnitude efficiency improvements.
- Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models
  A Meta AutoEncoder framework enables adaptive, progressive compression of visual features for low-latency edge-cloud VLM inference without model fine-tuning.
- Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation
  PCMECL improves speech-preserving facial expression manipulation by learning personalized prompts from individual visuals and using feature differencing to align visual and semantic changes from VLMs.
- Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
  A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.