Geertsema F.S. (2013) Semantic relatedness using web data. Master's Thesis / Essay, Computing Science.
Text: Semantic_relatedness.pdf - Published Version (2MB)
Text: AkkoordLazovik.pdf - Other (Restricted to Registered users only, 40kB)
Abstract
Investigating the existence of relations between people is the starting point of this research. Previous scientific work focused on relations between general concepts in lexical databases, while web data remained at the periphery of that research. Because web data plays an important role in determining relations between people, further research into the relatedness of general concepts in web data is needed. To handle the different contexts in which general concepts appear in web data, three algorithms are used to calculate semantic relatedness. The Normalized Compression Distance searches for overlapping pieces of text in web pages. The Jaccard index on keywords uses text annotation to extract keywords from texts and calculates the overlap between the resulting keyword sets. The Normalized Web Distance uses the co-occurrence of concepts to calculate their relatedness.

These approaches are tested with the WordSimilarity-353 test collection, a dataset of 353 concept pairs with human-assigned relatedness scores. The concepts in this collection are the input for gathering web pages from Google, Wikipedia and IMDb. Variables that influence the results of the algorithms are the number of web pages, the type of content, and algorithm-specific variables such as the compressor used and the weight factors. The results are analysed on accuracy, robustness and performance.

The results show that the context of concepts can be used in different ways to calculate semantic relatedness. The Normalized Compression Distance achieves higher scores than the Jaccard index on the general web data from Google and Wikipedia, even though its score is influenced by the writing style of web pages. Its better performance and higher scores on general web data make the Normalized Compression Distance a good candidate for applications with automated semantic relatedness calculations. Further research into better compressors and into cleaning the input data should improve the accuracy of this algorithm and decrease its sensitivity to writing style. For applications that provide exploratory insight into semantic relatedness, the Jaccard index on keywords is advised.
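The abstract names three relatedness measures. As a rough, self-contained sketch using their standard textbook definitions (not the thesis's own implementation, which involves web-page retrieval, text annotation, compressor choices and weight factors), the Python snippet below shows how each score is computed; the compressor choice (bz2), the keyword sets and the page counts in the usage example are hypothetical.

```python
# Sketch of the three measures named in the abstract, using their standard
# definitions. Preprocessing, corpora and parameters from the thesis are not
# reproduced here.
import bz2
import math


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, here with bz2 as the compressor:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def jaccard(keywords_a: set, keywords_b: set) -> float:
    """Jaccard index on two keyword sets: |A ∩ B| / |A ∪ B|."""
    if not keywords_a and not keywords_b:
        return 0.0  # guard against division by zero for two empty sets
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def nwd(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Normalized Web Distance from co-occurrence counts:
    f_x, f_y - number of pages containing each concept,
    f_xy     - number of pages containing both concepts,
    n        - total number of indexed pages."""
    log_fx, log_fy, log_fxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


if __name__ == "__main__":
    # Hypothetical inputs, purely for illustration.
    print(ncd(b"tiger is a large wild cat", b"jaguar is a large wild cat"))
    print(jaccard({"tiger", "cat", "predator"}, {"jaguar", "cat", "predator"}))
    print(nwd(f_x=150_000, f_y=120_000, f_xy=30_000, n=50_000_000))
```

Note that the Jaccard index is a similarity (higher means more related), whereas NCD and NWD are distances (lower means more related); any comparison against the human-assigned WordSimilarity-353 scores has to account for that direction.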
Item Type: | Thesis (Master's Thesis / Essay) |
Degree programme: | Computing Science |
Thesis type: | Master's Thesis / Essay |
Language: | English |
Date Deposited: | 15 Feb 2018 07:52 |
Last Modified: | 15 Feb 2018 07:52 |
URI: | https://fse.studenttheses.ub.rug.nl/id/eprint/10867 |