Eva-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171,

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos

cs.CV · 2025-12-03 · conditional · novelty 7.0

ProcObject-10K is the first benchmark for object-centric procedural reasoning in videos that exposes a large gap where models answer questions plausibly but fail to ground their answers in the correct video segments.

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

HERA is a select-regularize-calibrate framework adapting frozen vision foundation models for cross-domain few-shot semantic segmentation via hierarchical layer selection with ETR, prior-guided regularization, and pixelwise adaptive calibration, reporting over 4.1 mIoU gains.

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

cs.LG · 2026-02-23 · unverdicted · novelty 6.0

MultiModalPFN extends TabPFN with modality projectors, a multi-head gated MLP, and a cross-attention pooler to unify tabular and non-tabular inputs, outperforming prior methods on medical and general multimodal datasets.

GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

cs.CV · 2025-12-02 · unverdicted · novelty 6.0

GeoBridge introduces a semantic-anchor mechanism using text to bridge multi-view image features for bidirectional cross-view and language-to-image geo-localization, supported by the new GeoLoc dataset of over 50,000 aligned pairs.

SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.

citing papers explorer

ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos

cs.CV · 2025-12-03 · conditional · none · ref 10

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

cs.CV · 2026-05-19 · unverdicted · none · ref 15

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

cs.CV · 2026-05-13 · unverdicted · none · ref 10

MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

cs.LG · 2026-02-23 · unverdicted · none · ref 17

GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

cs.CV · 2025-12-02 · unverdicted · none · ref 10

SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

cs.CV · 2026-04-20 · unverdicted · none · ref 10
