FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via an automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.
FineCIR: Explicit parsing of fine-grained modification semantics for composed image retrieval. https://arxiv.org/abs/2503.21309
21 Pith papers cite this work. Polarity classification is still indexing.
Representative citing papers
ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via optimal transport, outperforming prior methods on FashionIQ and CIRR.
The first end-to-end RAG system on a mobile NPU delivers 18.1x faster prefilling and 4x lower latency and energy than the CPU on Snapdragon X Elite, with equivalent quality.
COMBINER proposes a new architecture for composed image retrieval using adaptive semantic disentanglement, unified prototype-based composition, and dual attribute-based relation modeling to address visually similar but attribute-unrelated samples.
TOPD improves on-policy distillation for LLM reasoning by using near-future guidance to identify divergent states, raising average accuracy from 47.8% to 52.2% on math benchmarks including AIME24 and AIME25.
Rock Tokens in on-policy distillation persist at high loss, account for up to 18% of outputs, and absorb large gradient norms, yet add negligible value to reasoning performance.
Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.
ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.
AgentIAD introduces an agentic VLM with Perceptive Zoomer, Web Searcher, and Comparative Retriever tools plus two-stage SFT-then-RL training, achieving 5.92% higher classification accuracy than prior SOTA on the MMAD benchmark.
R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.
MHMamba combines a U-Net with multi-head Mamba, channel calibration, and adaptive skip fusion to improve 3D brain tumor segmentation accuracy and small-lesion sensitivity on BraTS datasets while retaining linear complexity.
RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.
IMAGINE uses adaptive schema-imagery via dynamic multimodal prototypes to incorporate implicit semantics into composed video retrieval, claiming SOTA results on CVR and CIR benchmarks.
NaviEdit is a training-free inference-time controller that decouples edit progress from model scale traversal in diffusion-based image editing via self-consistency, reporting average gains across editors and backbones.
VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks with a parallelizable design.
Hermes is a multi-scale spatial-temporal hypergraph network that improves stock forecasting accuracy by capturing inter-industry lead-lag dependencies and fusing information across scales.
A study deriving mathematical formulations and bounds for diffusion editing objectives while empirically comparing methods on fidelity and control metrics and discussing ethical issues.
EgoAdapt improves VQA on the HD-EPIC egocentric benchmark via category-conditioned routing, calibrated option scoring, and test-time consistency adaptation.
EgoAction uses decoupled verb-noun temporal detectors on VideoMAE features and Dynamic Weighted Fusion of boundaries based on classification confidences for the EPIC-KITCHENS action detection challenge.
OmniEgo-R² is a competition system that combines domain-specific VL models with temporal normalization, capability routing, and answer calibration to reach 66.35-66.77% accuracy on the EgoCross challenge.
TempRet enhances a CLIP dual-encoder with temporal modeling and two-stage reranking to report 67.97% mAP and 82.92% nDCG on the EK-100 MIR benchmark.
AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation
AgentIAD introduces an agentic VLM with Perceptive Zoomer, Web Searcher, and Comparative Retriever tools plus two-stage SFT-then-RL training, achieving 5.92% higher classification accuracy than prior SOTA on the MMAD benchmark.
Hermes: A Multi-Scale Spatial-Temporal Hypergraph Network for Stock Time Series Forecasting
Hermes is a multi-scale spatial-temporal hypergraph network that improves stock forecasting accuracy by capturing inter-industry lead-lag dependencies and fusing information across scales.