Enriching Language Models with Visually-Grounded Word Vectors and the Lancaster Sensorimotor Norms

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

8 Scopus citations

Abstract

Language models are trained only on text, even though humans learn their first language in a highly interactive and multimodal environment where the first words learned are largely concrete, denoting physical entities and embodied states. To enrich language models with some of this missing experience, we leverage two sources of information: (1) the Lancaster Sensorimotor norms, which provide ratings (means and standard deviations) for over 40,000 English words along several dimensions of embodiment and capture the extent to which something is experienced across 11 different sensory modalities, and (2) vectors derived from the coefficients of binary classifiers trained on images for the BERT vocabulary. We pre-trained the ELECTRA model and fine-tuned the RoBERTa model with these two sources of information, then evaluated them on the established GLUE benchmark and the Visual Dialog benchmark. We find that enriching language models with the Lancaster norms and image vectors improves results on both tasks, with some implications for robust language models that capture holistic linguistic meaning in a language learning context.

Original language: American English
Title of host publication: Proceedings of the 25th Conference on Computational Natural Language Learning
State: Published - 1 Jan 2021

EGS Disciplines

  • Computer Sciences
