An Evaluation of Parallel Text Extraction and Sentence Alignment for Low-Resource Polysynthetic Languages

Kelly, Kevin (2020) An Evaluation of Parallel Text Extraction and Sentence Alignment for Low-Resource Polysynthetic Languages. Bachelor's Thesis, Artificial Intelligence.

Abstract

For the development of robust NLP applications, large monolingual and parallel corpora are essential. Many polysynthetic languages, in which words are built up from several concatenated morphemes, lack such resources. Their high morpheme-to-word ratio and sometimes complex morphological structure make the creation of these resources problematic. This research explores whether high-quality alignment between a polysynthetic language and a non-polysynthetic language is possible using existing sentence alignment tools, and whether tools that automatically harvest bitexts from multilingual sites can produce large amounts of meaningful parallel data between these languages. Alignment between Inuktitut and English, and between Kalaallisut and Danish, was evaluated. In an intrinsic evaluation, a method was designed and evaluated for obtaining accuracy, recall and F1 scores in the absence of a gold standard through the use of tf-idf and co-occurrence statistics. Results show high F1 scores when a threshold of one co-occurrence “word pair” per aligned sentence is used on segmented polysynthetic data, but the scores degrade when higher thresholds are set. In an extrinsic evaluation, a neural machine translation (NMT) model showed improved CHRF scores for Inuktitut-to-English translation when 1134 aligned sentences were added to an existing training set. Translation quality between Danish and Kalaallisut was poor when an NMT system was trained on 14778 aligned sentences acquired through parallel text extraction.
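
For readers unfamiliar with the CHRF metric named in the abstract, the following is a minimal sketch of a character n-gram F-score. It is not the thesis's evaluation code. The function names `char_ngrams` and `chrf`, the choice to strip spaces, and the defaults `max_n=6` and `beta=2.0` are assumptions made for illustration; established implementations differ in details such as whitespace handling and smoothing.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring spaces (an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders for which either string is too short.
            continue
        # Clipped overlap: each n-gram counts at most as often as in both.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 and r == 0.0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(chrf("the cat sat", "the cat sat"))
    print(chrf("the cat sat", "the cat sits"))
```

Because the score is computed over characters rather than whole words, partial matches inside long, morphologically complex words still earn credit, which is one reason such a metric is often preferred over word-level metrics for polysynthetic languages.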

Item Type: Thesis (Bachelor's Thesis)
Supervisor name: Spenader, J.K.
Degree programme: Artificial Intelligence
Thesis type: Bachelor's Thesis
Language: English
Date Deposited: 06 Oct 2020 12:00
Last Modified: 06 Oct 2020 12:00
URI: https://fse.studenttheses.ub.rug.nl/id/eprint/23465
