Efficient Classification and Unsupervised Keyphrase Extraction for Web Pages

Haarman, Tim (2019) Efficient Classification and Unsupervised Keyphrase Extraction for Web Pages. Master's Thesis / Essay, Artificial Intelligence.

Abstract

With the ever-increasing size of the World Wide Web, indexing and searching all Web pages is becoming increasingly difficult. Effective analysis and categorization of Web content therefore requires an automated approach. However, the open nature of the Web makes its content noisy and inconsistent, which limits the effectiveness of heuristic approaches. In this thesis we investigate how information can be extracted effectively from Web pages in two tasks: keyphrase extraction and Web page classification. Keyphrase extraction is a popular field in natural language processing (NLP), although it is rarely applied to Web pages. We therefore propose a novel unsupervised keyphrase extraction method called WebEmbedRank, which specializes in extracting keyphrases from hypertext documents. This method combines the extraction power of a state-of-the-art keyphrase extraction method with a weighting process based on the structural information in the HTML code of a Web page. To evaluate this new method we created a gold-standard dataset and used it to compare the results of WebEmbedRank with several state-of-the-art keyphrase extraction methods. This comparison showed that our newly proposed method was significantly better at extracting keyphrases from Web pages than all other methods. We furthermore compared several methods for automatically categorizing Web pages. Multiple baseline models and convolutional and recurrent neural networks were trained on a dataset of more than one million company Web pages with corresponding industrial category labels. Overall, the recurrent neural networks achieved the highest classification scores. The model with GRU cells in particular was found to be a good choice, as it achieved scores similar to the LSTM but required less time to make new predictions.
However, these higher scores came at the cost of much longer prediction times for new samples than those of the baselines and simpler feedforward neural networks. Finally, we also tested the impact of different pre-trained word embeddings on the classification. The choice of word embedding model did not seem to have a large effect on the results, indicating that the embeddings generalize well to the current task.
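
The idea of weighting keyphrase candidates by the HTML structure in which they appear can be illustrated with a minimal sketch. It is not the WebEmbedRank implementation: the tag weights, the multiplicative combination, and the names `TAG_WEIGHTS`, `TagWeightCollector`, and `reweight` are all hypothetical, and the base scores are assumed to come from some existing keyphrase extractor.

```python
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state any values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "strong": 1.5, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightCollector(HTMLParser):
    """Collect text fragments with the highest weight among their enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.fragments = []   # list of (lowercased text, weight)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, tolerating unclosed inner tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        text = data.strip().lower()
        if text:
            weight = max(
                (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
                default=DEFAULT_WEIGHT,
            )
            self.fragments.append((text, weight))


def reweight(base_scores, html):
    """Scale each candidate's base score by the best structural weight found.

    base_scores maps candidate phrases to scores from any extractor (assumed).
    Returns (phrase, score) pairs sorted from highest to lowest score.
    """
    collector = TagWeightCollector()
    collector.feed(html)
    collector.close()
    boosted = {}
    for phrase, score in base_scores.items():
        weights = [w for text, w in collector.fragments if phrase.lower() in text]
        boosted[phrase] = score * max(weights, default=DEFAULT_WEIGHT)
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)
```

For a page whose title is "Solar panels" and whose body text mentions both "solar panels" and "heat pumps", base scores of 0.5 and 0.6 would be reordered by this sketch so that "solar panels" ranks first, because it also occurs in the title.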

Item Type:	Thesis (Master's Thesis / Essay)
Supervisor name:	Wiering, M.A.
Degree programme:	Artificial Intelligence
Thesis type:	Master's Thesis / Essay
Language:	English
Date Deposited:	11 Jul 2019
Last Modified:	12 Jul 2019 11:23
URI:	https://fse.studenttheses.ub.rug.nl/id/eprint/20106
