
Context is All You Need: Comparing Transformer-based to RNN-based Models in a Handwritten Text Recognition Task

Velich, Maximilian (2022) Context is All You Need: Comparing Transformer-based to RNN-based Models in a Handwritten Text Recognition Task. Master's Thesis / Essay, Artificial Intelligence.

Abstract

In this thesis, two Deep Learning architectures are compared on a Handwritten Text Recognition (HTR) task. The Transformer-based model represents a modern approach to sequence-to-sequence learning, particularly potent in Natural Language Processing. The RNN-based model represents the established state of the art; many impressive HTR systems have been built with RNNs at their core. The Transformer's complete absence of recurrence marks a paradigm shift from recurrence to attention mechanisms at the core of sequence-to-sequence models, so it is imperative to examine what role Transformers could play in HTR. The focus lies on maximizing the extraction of the available lexical contextual information. For that purpose, two comparable systems were built: parameter counts were kept low, no language models were used, and both systems share the same CNN-based visual frontend. A series of experiments was conducted, and the results are based on evaluation and test curves using Character Error Rates (CER) and Word Error Rates (WER), evaluated on the IAM data set. The Transformer-based model showed lower error rates (CER: 12.8%, WER: 31.6%) than the RNN-based model (CER: 18.6%, WER: 49.7%). The gap between observed and expected WER was larger for the Transformer (27.1% lower) than for the RNN-based model (13.9% lower); this provides direct evidence that the Transformer-based model exploits the context available in the line-strip images more than the RNN-based model does. Furthermore, it is shown that the Connectionist Temporal Classification (CTC) framework can be incorporated into a composite loss function to reduce error rates more quickly. Overall error rates can be lowered further by weighting this loss function according to a curriculum that shifts the training emphasis from the encoder to the decoder, which also yields smooth training curves; mean CER decreased to 9.1% and mean WER decreased to 25.9%. In summary, within the scope of the experiments, the Transformer-based model achieves lower error rates and shows a better ability to exploit the available context. Engaging with the Transformer to further the field of HTR is therefore imperative.
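As an illustration of the composite loss described in the abstract, a minimal sketch in PyTorch is given below. It assumes a CTC head on the encoder and a cross-entropy head on the autoregressive decoder, with a linear curriculum weight that shifts emphasis from the encoder to the decoder over training. The function names, tensor shapes, and the linear schedule are illustrative assumptions, not the exact formulation used in the thesis.

    import torch
    import torch.nn as nn

    # Encoder-side CTC loss and decoder-side cross-entropy loss.
    ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    ce_criterion = nn.CrossEntropyLoss(ignore_index=-100)

    def curriculum_weight(epoch: int, total_epochs: int) -> float:
        # Linear schedule from 0 to 1: early training emphasizes the encoder (CTC),
        # late training emphasizes the decoder (cross-entropy). Illustrative only.
        return min(1.0, epoch / max(1, total_epochs))

    def composite_loss(enc_log_probs, input_lengths, ctc_targets, target_lengths,
                       dec_logits, dec_targets, epoch, total_epochs):
        # enc_log_probs: (T, N, C) log-probabilities from the encoder output.
        # dec_logits:    (N, S, C) per-step class scores from the decoder.
        lam = curriculum_weight(epoch, total_epochs)
        loss_ctc = ctc_criterion(enc_log_probs, ctc_targets,
                                 input_lengths, target_lengths)
        loss_ce = ce_criterion(dec_logits.reshape(-1, dec_logits.size(-1)),
                               dec_targets.reshape(-1))
        # Weighted sum: emphasis moves from the encoder (1 - lam) to the decoder (lam).
        return (1.0 - lam) * loss_ctc + lam * loss_ce

Under this kind of schedule, the CTC term dominates early training and the decoder's cross-entropy term dominates later, which is one plausible way to realize the encoder-to-decoder emphasis shift the abstract describes.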

Item Type: Thesis (Master's Thesis / Essay)
Supervisor name: Schomaker, L.R.B. and Ameryan, M.
Degree programme: Artificial Intelligence
Thesis type: Master's Thesis / Essay
Language: English
Date Deposited: 28 Jul 2022 11:51
Last Modified: 09 Aug 2022 07:07
URI: https://fse.studenttheses.ub.rug.nl/id/eprint/27947
