Laitenberger, Filipe (2022) Multi-Modal Image and Text Processing Using a Transformer for Sub-Figure Sequence Classification with or without a convolutional pre-processor. Bachelor's Thesis, Artificial Intelligence.
Abstract
Recent findings in deep learning research indicate that the successful Transformer approach for text processing can also be used in the image domain, using Vision Transformers (ViTs). Compared to the common convolutional paradigm, the images are tokenized in a surprisingly primitive patch-wise fashion. This study introduces a multi-modal Transformer architecture for image and text processing with a single encoder. Two variations of a Transformer-based model were compared: a standard one that takes the raw, complete image as input, and a variant with a convolutional pre-processor. Both variants were tested on a task in which multiple objects are combined in one image in a 2x2 grid alongside a text sequence. The model must decide whether the text sequence describes the correct 'reading order' of the objects shown in the image. Results showed that a Transformer with a convolution-based pre-processing layer performed significantly better (about 96-98% accuracy) than the plain Transformer-based model (about 92% accuracy). Apparently, the Transformer is supported by the learned feature maps, which increase the reliability of the correlations needed for the sequentialization task. In turn, the results support the hypothesis that ViTs are inferior to CNN architectures regarding training time and data set size required for successful generalization. ViTs would need to overcome their inductive bias related to patch extraction, which, in its pure form, lacks appropriate feature extracti...
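As a rough illustration of two ideas in the abstract, the sketch below shows patch-wise tokenization of an image and the labelling rule of the 2x2 grid task. It is not the thesis implementation: the function names, the grayscale list-of-lists image type, the object labels, and the assumption that 'reading order' means left-to-right, top-to-bottom are all hypothetical choices made for this example.

```python
from typing import List, Sequence

# Hypothetical image type: a grayscale image as rows of pixel values.
Image = List[List[int]]


def extract_patches(image: Image, patch: int) -> List[List[int]]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Each tile is flattened into one vector (one "token"), and tiles are
    emitted row by row, which is the plain ViT-style tokenization.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def reading_order(grid: Sequence[Sequence[str]]) -> List[str]:
    """Object labels of a grid, read left-to-right, top-to-bottom (assumed)."""
    return [label for row in grid for label in row]


def is_correct_sequence(grid: Sequence[Sequence[str]],
                        text_tokens: Sequence[str]) -> bool:
    """Target label of the task: does the text match the reading order?"""
    return reading_order(grid) == list(text_tokens)


# A 4x4 toy image split into four 2x2 patches of four pixels each.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = extract_patches(image, 2)

# A 2x2 grid of hypothetical object labels with one matching and one
# non-matching text sequence.
grid = [["cat", "dog"], ["car", "tree"]]
positive = is_correct_sequence(grid, ["cat", "dog", "car", "tree"])
negative = is_correct_sequence(grid, ["dog", "cat", "car", "tree"])
```

In the convolutional variant described in the abstract, learned feature maps would take the place of the raw pixel patches before the encoder; that step is omitted here because the record does not give its details.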
| Field | Value |
|---|---|
| Item Type | Thesis (Bachelor's Thesis) |
| Supervisor name | Schomaker, L.R.B. |
| Degree programme | Artificial Intelligence |
| Thesis type | Bachelor's Thesis |
| Language | English |
| Date Deposited | 09 Aug 2022 06:32 |
| Last Modified | 09 Aug 2022 06:32 |
| URI | https://fse.studenttheses.ub.rug.nl/id/eprint/28293 |