arxiv: 1903.02930 · v1 · pith:GJ4PHVIFnew · submitted 2019-03-07 · 💻 cs.CL

Neural Language Modeling with Visual Features

classification 💻 cs.CL
keywords language, features, model, visual, modeling, models, multimodal, neural
Multimodal language models attempt to incorporate non-linguistic features into the language modeling task. In this work, we extend a standard recurrent neural network (RNN) language model with features derived from videos. We train our models on data that is two orders of magnitude larger than the datasets used in prior work. We perform a thorough exploration of model architectures for combining visual and text features. Our experiments on two corpora (YouCookII and 20bn-something-something-v2) show that the best-performing architecture uses middle fusion of visual and text features, yielding over 25% relative improvement in perplexity. We report analysis that provides insight into why our multimodal language model improves upon a standard RNN language model.
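
To make the "relative improvement in perplexity" figure concrete, the following is a minimal sketch of how such a number is typically computed. It assumes the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood. The function names and the numeric values are hypothetical illustrations, not results or code from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_improvement(baseline_ppl, model_ppl):
    """Fraction by which model_ppl is lower than baseline_ppl (lower perplexity is better)."""
    return (baseline_ppl - model_ppl) / baseline_ppl


# Hypothetical per-token log-probabilities for the same text under two models.
text_only_log_probs = [math.log(0.010)] * 4    # text-only RNN language model
multimodal_log_probs = [math.log(0.014)] * 4   # model that also sees visual features

baseline = perplexity(text_only_log_probs)
multimodal = perplexity(multimodal_log_probs)
print(f"relative improvement: {relative_improvement(baseline, multimodal):.1%}")
```

Under this definition, a relative improvement above 25% means the multimodal model's perplexity is less than three quarters of the text-only baseline's perplexity on the same evaluation text.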

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

    cs.CV 2026-06 unverdicted novelty 6.0

    VaFM encodes constraint-specific VRP images via CNN into patch embeddings fused with graph nodes, using an auxiliary task to handle pixel imbalance, and reports better performance than prior methods on 16 VRP variants.