TY - GEN
T1 - Tiny Language Models Enriched with Multimodal Knowledge from Multiplex Networks
AU - Fields, Clayton
AU - Natouf, Osama
AU - McMains, Andrew
AU - Henry, Catherine
AU - Kennington, Casey
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Large transformer language models trained exclusively on massive quantities of text are now the standard in NLP. In addition to the impractical amounts of data used to train them, they require enormous computational resources for training. Furthermore, they lack the rich array of sensory information available to humans, who learn language with far less linguistic exposure. In this study, performed for submission to the BabyLM challenge, we show that we can improve a small transformer model’s data efficiency by replacing its learned word embeddings with vectors extracted from a custom multiplex network that encodes visual and sensorimotor information. Further, we use a custom variation of the ELECTRA model that contains fewer than 7 million parameters and can be trained end-to-end on a single GPU. Our experiments show that, when pretrained on only the small BabyLM dataset of 10 million words of text, models using these embeddings outperform equivalent models on a variety of natural language understanding tasks from the GLUE and SuperGLUE benchmarks and on a variation of the BLiMP task.
UR - http://www.scopus.com/inward/record.url?scp=85185349114&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85185349114
T3 - CoNLL 2023 - BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Proceedings
SP - 47
EP - 57
BT - CoNLL 2023 - BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Proceedings
A2 - Warstadt, Alex
A2 - Mueller, Aaron
A2 - Choshen, Leshem
A2 - Wilcox, Ethan
A2 - Zhuang, Chengxu
A2 - Ciro, Juan
A2 - Mosquera, Rafael
A2 - Paranjape, Bhargavi
A2 - Williams, Adina
A2 - Linzen, Tal
A2 - Cotterell, Ryan
T2 - BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, CoNLL 2023
Y2 - 6 December 2023 through 7 December 2023
ER -