Geertsema F.S. (2013) Semantic relatedness using web data. Master's Thesis / Essay, Computing Science.
Text: Semantic_relatedness.pdf - Published Version (2MB)
Text: AkkoordLazovik.pdf - Other (Restricted to Registered users only, 40kB)
Abstract
Investigating the existence of relations between people is the starting point of this research. Previous scientific work focused on relations between general concepts in lexical databases, while web data remained at the periphery of that research. Because web data plays an important role in determining relations between people, further research into the relatedness of general concepts in web data is needed. To handle the different contexts in which general concepts appear in web data, three algorithms are used to calculate semantic relatedness. The Normalized Compression Distance searches for overlapping pieces of text in web pages. The Jaccard index on keywords uses text annotation to extract keywords from texts and calculates the overlap between the resulting keyword sets. The Normalized Web Distance uses the co-occurrence of concepts to calculate their relatedness.

These approaches are tested with the WordSimilarity-353 test collection, a dataset of 353 concept pairs with human-assigned relatedness scores. The concepts in this collection are the input for gathering web pages from Google, Wikipedia and IMDb. Variables that influence the results of the algorithms are the number of web pages, the type of content, and algorithm-specific variables such as the compressor used and the weight factors. The results are analysed on accuracy, robustness and performance.

The results show that the context of concepts can be used in different ways to calculate semantic relatedness. The Normalized Compression Distance achieves higher scores than the Jaccard index on the general web data from Google and Wikipedia, even though its score is influenced by the writing style of web pages. Its better performance and higher scores on general web data make the Normalized Compression Distance a good candidate for applications with automated semantic relatedness calculations. Further research into better compressors and into cleaning the input data should improve the accuracy of this algorithm and decrease its sensitivity to writing style. For applications that provide exploratory insight into semantic relatedness, the Jaccard index on keywords is advised.
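The abstract names three relatedness measures. As a rough, self-contained sketch using their standard textbook definitions (not the thesis's own implementation, which involves web-page retrieval, text annotation, compressor choices and weight factors), the Python snippet below shows how each score is computed; the compressor choice (bz2), the keyword sets and the page counts in the usage example are hypothetical.

```python
# Sketch of the three measures named in the abstract, using their standard
# definitions. Preprocessing, corpora and parameters from the thesis are not
# reproduced here.
import bz2
import math


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, here with bz2 as the compressor:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def jaccard(keywords_a: set, keywords_b: set) -> float:
    """Jaccard index on two keyword sets: |A ∩ B| / |A ∪ B|."""
    if not keywords_a and not keywords_b:
        return 0.0  # guard against division by zero for two empty sets
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def nwd(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Normalized Web Distance from co-occurrence counts:
    f_x, f_y - number of pages containing each concept,
    f_xy     - number of pages containing both concepts,
    n        - total number of indexed pages."""
    log_fx, log_fy, log_fxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


if __name__ == "__main__":
    # Hypothetical inputs, purely for illustration.
    print(ncd(b"tiger is a large wild cat", b"jaguar is a large wild cat"))
    print(jaccard({"tiger", "cat", "predator"}, {"jaguar", "cat", "predator"}))
    print(nwd(f_x=150_000, f_y=120_000, f_xy=30_000, n=50_000_000))
```

Note that the Jaccard index is a similarity (higher means more related), whereas NCD and NWD are distances (lower means more related); any comparison against the human-assigned WordSimilarity-353 scores has to account for that direction.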
Item Type: | Thesis (Master's Thesis / Essay) |
Degree programme: | Computing Science |
Thesis type: | Master's Thesis / Essay |
Language: | English |
Date Deposited: | 15 Feb 2018 07:52 |
Last Modified: | 15 Feb 2018 07:52 |
URI: | https://fse.studenttheses.ub.rug.nl/id/eprint/10867 |