Recognize anything: A strong image tagging model

Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. · 2023 · arXiv 2306.03514

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 2

citation-polarity summary

use method 2

representative citing papers

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.

VACE: All-in-One Video Creation and Editing

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

SR-Ground: Image Quality Grounding for Super-Resolved Content

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

The paper releases SR-Ground, a crowdsourced dataset for pixel-level segmentation of six artifact types in super-resolved images, and shows its use for training grounded IQA models and artifact-reducing fine-tuning.

Vista4D: Video Reshooting with 4D Point Clouds

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

cs.CV · 2024-01-25 · unverdicted · novelty 6.0

Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.

Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images

cs.GR · 2026-04-21 · unverdicted · novelty 4.0

NPCs gain spatial awareness via panoramic images turned into JSON scene data for LLMs, enabling dynamic references to nearby objects and improving player preference in user studies.

Step1X-Edit: A Practical Framework for General Image Editing

cs.CV · 2025-04-24 · unverdicted · novelty 4.0

Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.

A Survey on Hallucination in Large Vision-Language Models

cs.CV · 2024-02-01 · unverdicted · novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

citing papers explorer

Showing 10 of 10 citing papers.

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding · cs.CV · 2026-05-08 · unverdicted · none · ref 21
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation · cs.CV · 2026-04-20 · unverdicted · none · ref 96
A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions · cs.CV · 2025-03-10 · unverdicted · none · ref 40
VACE: All-in-One Video Creation and Editing · cs.CV · 2025-03-10 · unverdicted · none · ref 78
SR-Ground: Image Quality Grounding for Super-Resolved Content · cs.CV · 2026-05-20 · unverdicted · none · ref 47
Vista4D: Video Reshooting with 4D Point Clouds · cs.CV · 2026-04-23 · unverdicted · none · ref 66
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks · cs.CV · 2024-01-25 · unverdicted · none · ref 83
Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images · cs.GR · 2026-04-21 · unverdicted · none · ref 18
Step1X-Edit: A Practical Framework for General Image Editing · cs.CV · 2025-04-24 · unverdicted · none · ref 69
A Survey on Hallucination in Large Vision-Language Models · cs.CV · 2024-02-01 · unverdicted · none · ref 51
