Scene text visual question answering

2019 · arXiv 1905.13648

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

ICDAR 2019 Competition on Scene Text Visual Question Answering

cs.CV · 2019-06-30 · accept · novelty 7.0

Introduces a new dataset and three-tier competition for visual question answering that requires reading scene text to answer questions about images.

citing papers explorer

Showing 2 of 2 citing papers.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
cs.CV · 2025-04-14 · unverdicted · none · ref 5
ICDAR 2019 Competition on Scene Text Visual Question Answering
cs.CV · 2019-06-30 · accept · none · ref 5
