Feature distribution-based intrinsic textual plagiarism detection using statistical hypothesis tests.

Veldhuis, D. (2015) Feature distribution-based intrinsic textual plagiarism detection using statistical hypothesis tests. Master's Thesis / Essay, Artificial Intelligence.

Abstract

As the amount of publicly available information grows rapidly, plagiarism is becoming an increasingly important problem, yet checking for it by hand is tedious. Automated plagiarism detectors were developed to support this task. Intrinsic plagiarism detectors compare passages of text within a single document: incorporating text from other authors may cause 'unnatural' deviations in writing style within the document, and the goal of intrinsic plagiarism detection is to detect these deviations. In this thesis, writing style is expressed by 19 feature distributions. The main innovation of the present study is that writing-style feature distributions are compared using statistical hypothesis tests, which is uncommon in intrinsic plagiarism detection but could give valuable insight into plagiarism. The feature distribution of a chunk of consecutive sentences is compared to the feature distribution of the rest of the document. The result is a vector of 19 p-values, each expressing how plausible it is that the feature distribution of the chunk and that of the rest of the document come from the same population. The underlying assumption is that most of the text in a document is not plagiarized, so that the feature distributions of non-plagiarized chunks resemble those of the rest of the document more closely than the feature distributions of text written by another author do. A naive Bayes classifier is trained to detect chunks in which more than 50% of the text is plagiarized.
For several features, chunks of plagiarized text showed more variation and a different average feature distribution than chunks of non-plagiarized text. The average similarity (p-value) was lower for chunks of plagiarized text than for chunks of non-plagiarized text, but the variation in p-value was high. Together with a highly imbalanced data set, this resulted in poor performance of the individual features. The full set of 19 features, however, performed better than a baseline detector that assigns classes at random with a specified probability. With a plagdet score of 0.21, the detector scored second highest when compared with the implementations submitted to the PAN'11 competition on intrinsic plagiarism detection. This study focused mainly on the methods rather than on feature creation, so better features might improve performance, especially if an author's writing style can be captured in smaller chunks. Furthermore, permutation tests performed better on a small data set than the regular statistical hypothesis tests. More experiments are needed to determine whether the suggested improvements increase performance.
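
The chunk-versus-rest comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes a single feature (sentence length in words, which may or may not be among the 19 features used), non-overlapping chunks, and a permutation test on the absolute difference in means as the test statistic. All function names and the toy data are hypothetical.

```python
import random
import re
from statistics import mean


def sentence_lengths(text):
    """Return the word count of each sentence (naive split on ., ! and ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def permutation_p_value(chunk, rest, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means.

    Assumption: the difference in means stands in for whatever statistic
    the thesis uses to compare two feature distributions.
    """
    rng = random.Random(seed)
    observed = abs(mean(chunk) - mean(rest))
    pooled = list(chunk) + list(rest)
    k = len(chunk)
    hits = 0
    for _ in range(n_permutations):
        # Reassign values to "chunk" and "rest" at random.
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def chunk_p_values(values, chunk_size, **kwargs):
    """Compare each non-overlapping chunk with the rest of the document."""
    p_values = []
    for start in range(0, len(values) - chunk_size + 1, chunk_size):
        chunk = values[start:start + chunk_size]
        rest = values[:start] + values[start + chunk_size:]
        p_values.append(permutation_p_value(chunk, rest, **kwargs))
    return p_values


# Toy data (hypothetical): sentence lengths for 18 sentences, where the
# middle six sentences are much longer than the others.
toy_lengths = [10, 11, 9, 10, 12, 10,
               30, 32, 29, 31, 30, 33,
               11, 10, 9, 12, 10, 11]
toy_p_values = chunk_p_values(toy_lengths, chunk_size=6)
print(toy_p_values)
```

In this toy document the middle chunk receives the smallest p-value, which is the kind of deviation the method looks for. In the approach described above, one such p-value per feature would form the 19-element vector that is passed to the classifier.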

Item Type:	Thesis (Master's Thesis / Essay)
Degree programme:	Artificial Intelligence
Thesis type:	Master's Thesis / Essay
Language:	English
Date Deposited:	15 Feb 2018 08:09
Last Modified:	15 Feb 2018 08:09
URI:	https://fse.studenttheses.ub.rug.nl/id/eprint/13379
