Incorporating Automatically Generated Genre Labels into Neural Machine Translation Systems

Chichirau, Malina (2024) Incorporating Automatically Generated Genre Labels into Neural Machine Translation Systems. Master's Thesis / Essay, Computational Cognitive Science.

Abstract

State-of-the-art neural machine translation (NMT) systems are often highly specialized for a certain type of text, referred to as a domain. However, the definition of a domain remains ambiguous in the literature: many studies focus on the provenance of the training texts rather than on their properties, under the assumption that texts from a single source share similar characteristics. Yet reliable information about the provenance of texts is not always available, especially for web-crawled corpora. This study explores whether domains can instead be described in terms of text genres, defined by non-topical properties such as function, style, or register that can be automatically inferred from texts. We train genre-specific NMT systems for translating from English into Icelandic, Croatian, and Turkish. When tested on a held-out dataset, the genre-specific systems tend to outperform, on their target genre, both general NMT systems and NMT systems specialized in other genres. However, these results do not replicate on external datasets. Furthermore, we train general genre-aware NMT systems using special tokens that indicate the genres in the training data, but we find no significant difference compared to the equivalent genre-agnostic systems. We therefore conclude that genres are not sufficiently informative to define reliable translation domains that can be utilized across different corpora.
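
One common way to give an NMT system genre information through special tokens is to prepend a tag to each source sentence. The sketch below illustrates that idea only. The abstract does not say where the tokens are placed or which genre labels are used, so the prepending position, the genre names, the tag spelling, and the `tag_source` helper are all assumptions, not the thesis's implementation.

```python
# Hypothetical genre labels and tag format; the label set and token
# spelling used in the thesis are not given in the abstract.
GENRE_TOKENS = {
    "news": "<news>",
    "legal": "<legal>",
    "promotion": "<promotion>",
}


def tag_source(sentence: str, genre: str) -> str:
    """Prepend the genre token to a source sentence.

    Sentences with a genre outside GENRE_TOKENS are returned unchanged.
    """
    token = GENRE_TOKENS.get(genre)
    return f"{token} {sentence}" if token else sentence


# Toy (sentence, genre) pairs standing in for automatically labelled data.
examples = [
    ("The court dismissed the appeal.", "legal"),
    ("Buy two, get one free!", "promotion"),
    ("Hello there.", "unknown"),
]
tagged = [tag_source(text, genre) for text, genre in examples]
```

In a setup like this, the tags would typically also be registered with the subword tokenizer so that each one stays a single token instead of being split into pieces.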

Item Type: Thesis (Master's Thesis / Essay)
Supervisor name: Jones, S.M. and Noord, R.I.K. van and Toral Ruiz, A.
Degree programme: Computational Cognitive Science
Thesis type: Master's Thesis / Essay
Language: English
Date Deposited: 05 Apr 2024 14:42
Last Modified: 05 Apr 2024 14:49
URI: https://fse.studenttheses.ub.rug.nl/id/eprint/32205
