Paper presented at NAACL 2022 in Seattle

1 July 2022

In the field of visiolinguistic research ("Vision and Language"), a large number of transformer-based models have been published in recent years. Architectures have been improved and models have been pre-trained on increasingly large datasets. However, little research has attempted to understand which concepts these models actually learn.


Philipp J. Rösch from VIS and Dr. Jindřich Libovický from Charles University in Prague investigated the influence of positional information of objects and introduced new pre-training strategies. The models were evaluated on the GQA dataset, among others, in which a textual question about an image must be answered correctly. Their work was accepted at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) in Seattle, USA, and will be published in Findings of NAACL 2022.

Probing the Role of Positional Information in Vision-Language Models
Philipp J. Rösch, Jindřich Libovický
[ACL], [PDF], [Code]