Using Confidential Data for Domain Adaptation of Neural Machine Translation

Kim, Sohyung (2021) Using Confidential Data for Domain Adaptation of Neural Machine Translation. Master's Thesis / Essay, Artificial Intelligence.

Abstract

Domain adaptation has led to remarkable achievements in Neural Machine Translation (NMT), so the availability of in-domain data remains essential to ensure translation quality, especially in technical domains. However, obtaining such data is often challenging, and in many real-world scenarios the difficulty is aggravated by data confidentiality or copyright concerns. We study the problem of domain adaptation in NMT when domain-specific data cannot be shared due to confidentiality issues. We propose to fragment the data into phrase pairs and to fine-tune a generic NMT model on a shuffled random sample of these pairs instead of on full sentences. Despite the loss of long segments, we find that NMT quality can benefit considerably from this adaptation and that further gains can be obtained with a simple tagging technique.
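
The fragmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes word alignments are already available from an external aligner, uses a simplified form of the alignment-consistent phrase extraction known from phrase-based statistical MT, and treats the function names, the `max_len` limit, and the `<frag>` tag token as hypothetical choices made only for illustration.

```python
import random


def extract_phrase_pairs(src_tokens, tgt_tokens, alignment, max_len=4):
    """Return phrase pairs that are consistent with a word alignment.

    `alignment` is a set of (source_index, target_index) links.
    Simplified on purpose: spans are not extended over unaligned words.
    """
    pairs = []
    for s_start in range(len(src_tokens)):
        last = min(s_start + max_len, len(src_tokens))
        for s_end in range(s_start, last):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # linked to a source word outside the source span.
            if any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            ):
                continue
            pairs.append(
                (
                    " ".join(src_tokens[s_start : s_end + 1]),
                    " ".join(tgt_tokens[t_start : t_end + 1]),
                )
            )
    return pairs


def build_fragment_sample(corpus, sample_size, tag=None, seed=0):
    """Fragment every sentence pair, shuffle the fragments, keep a sample.

    `corpus` is an iterable of (src_tokens, tgt_tokens, alignment) triples.
    If `tag` is given (a hypothetical marker token), it is prepended to the
    source side of every fragment.
    """
    rng = random.Random(seed)
    fragments = []
    for src_tokens, tgt_tokens, alignment in corpus:
        fragments.extend(extract_phrase_pairs(src_tokens, tgt_tokens, alignment))
    # Shuffling and truncating gives a random sample without replacement
    # and discards the original order of the fragments.
    rng.shuffle(fragments)
    sample = fragments[:sample_size]
    if tag is not None:
        sample = [(f"{tag} {src}", tgt) for src, tgt in sample]
    return sample
```

In such a setup, the resulting list of short, shuffled phrase pairs would be written out as fine-tuning data in place of the full sentence pairs. Whether the tag belongs on the source side, and how long fragments should be, are design choices this sketch does not settle.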

Item Type:	Thesis (Master's Thesis / Essay)
Supervisor name:	Bisazza, A. and Spenader, J.K. and Turkmen, F.
Degree programme:	Artificial Intelligence
Thesis type:	Master's Thesis / Essay
Language:	English
Date Deposited:	23 Aug 2021 09:59
Last Modified:	23 Aug 2021 09:59
URI:	https://fse.studenttheses.ub.rug.nl/id/eprint/25668
