E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning

Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang · 2021 · arXiv 2106.01804

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

cs.CV · 2026-05-20 · unverdicted · novelty 4.0

Introduces GRIT, LTMI, and a hierarchical attention framework claiming performance gains on image captioning, visual dialog, and ALFRED instruction following.

citing papers explorer

Showing 1 of 1 citing paper.

Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

cs.CV · 2026-05-20 · unverdicted · none · ref 50
