Multi-Modal Image and Text Processing Using a Transformer for Sub-Figure Sequence Classification with or without a convolutional pre-processor.

Laitenberger, Filipe (2022) Multi-Modal Image and Text Processing Using a Transformer for Sub-Figure Sequence Classification with or without a convolutional pre-processor. Bachelor's Thesis, Artificial Intelligence.

Text: Bachelors_Report_Filipe_Laitenberger_s3894479.pdf — Download (1MB)
Text: Toestemming.pdf — Restricted to Registered users only — Download (121kB)

Abstract

Recent findings in deep learning research indicate that the successful Transformer approach for text processing can also be applied in the image domain, using Vision Transformers (ViTs). In contrast to the common convolutional paradigm, ViTs tokenize images in a surprisingly primitive patch-wise fashion. This study introduces a multi-modal Transformer architecture for image and text processing with a single encoder. Two variants of a Transformer-based model were compared: a standard one, taking the raw, complete image as input, and a variant with a convolutional pre-processor. Both variants were tested on a task in which multiple objects are combined in one image in a 2x2 grid alongside a text sequence. The model must decide whether the text sequence describes the correct 'reading order' of the objects shown in the image. Results showed that the Transformer with a convolution-based pre-processing layer performed significantly better (about 96-98% accuracy) than the plain Transformer-based model (about 92% accuracy). Evidently, the Transformer is supported by the learned feature maps, which increase the reliability of the correlations needed for the sequentialization task. In turn, the results support the hypothesis that ViTs are inferior to CNN architectures with respect to the training time and data set size required for successful generalization. ViTs would need to overcome their inductive bias related to patch extraction, which, in its pure form, lacks appropriate feature extraction ...
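The patch-wise tokenization the abstract contrasts with convolutional feature extraction can be illustrated with a minimal numpy sketch (an assumption-laden illustration of the standard ViT input step, not the thesis code): each image is cut into non-overlapping patches, and each patch is flattened into one token vector before entering the encoder.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch tokens, ViT-style.

    Returns an array of shape (num_patches, patch * patch * C),
    one row per token, in row-major ('reading order') sequence.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    tokens = (
        image
        .reshape(H // patch, patch, W // patch, patch, C)  # split both axes
        .transpose(0, 2, 1, 3, 4)                          # group patch grid
        .reshape(-1, patch * patch * C)                    # flatten each patch
    )
    return tokens

# Hypothetical sizes: a 2x2 grid of 16x16 RGB 'objects' forms a 32x32 image,
# tokenized into 8x8 patches.
img = np.random.rand(32, 32, 3)
tok = patchify(img, 8)
print(tok.shape)  # (16, 192): 4x4 patches, each 8*8*3 values
```

A convolutional pre-processor, by contrast, would replace the raw pixel flattening with learned feature maps before tokenization, which is the design difference the thesis evaluates.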

Item Type: Thesis (Bachelor's Thesis)
Supervisor name: Schomaker, L.R.B.
Degree programme: Artificial Intelligence
Thesis type: Bachelor's Thesis
Language: English
Date Deposited: 09 Aug 2022 06:32
Last Modified: 09 Aug 2022 06:32
URI: https://fse.studenttheses.ub.rug.nl/id/eprint/28293
