Topic detection in microblogs using big data and neural networks

Musters, A.B. (2017) Topic detection in microblogs using big data and neural networks. Master's Thesis / Essay, Artificial Intelligence.

Abstract

Topic detection is a challenging problem in Artificial Intelligence and Natural Language Processing. It combines the classification, differentiation, and detection of content, such as text, in a single program. A common approach uses Latent Dirichlet Allocation (LDA) to model a topic as a combination of words. Non-negative Matrix Factorization (NMF) closely resembles LDA, and both algorithms are used in topic modeling. NMF also closely resembles the neural network model word2vec, but lacks the ability to make an informed decision. A downside of training these algorithms is that they require a large amount of annotated data to work accurately, and annotating data is a labor-intensive task. K-means is an algorithm that overcomes the lack of annotated data. The pitfalls of k-means are discussed, and k-means is compared to a neural network. This thesis investigates the use of active learning techniques to facilitate the annotation task. Active learning is a semi-supervised machine learning technique that interacts with the user to train an algorithm. It can significantly reduce the amount of annotation required by selecting which data a user is asked to label. The data consist of five years of textual messages collected from the social media platform Twitter. Architectures for analysing large amounts of data, the field of Big Data, are discussed and used. Topic detection using k-means is compared to the word2vec model with an extra neural network layer. We investigate how efficiently active learning can be used to annotate Twitter data. An objective experiment with real-life users is conducted to validate the word2vec model on Twitter data. The proposed method could increase the accuracy of topic detection. It could also help business and academia annotate (Twitter) data more efficiently.
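
To illustrate how active learning can choose what a user is asked to label, here is a minimal sketch of entropy-based uncertainty sampling, one common selection strategy. The abstract does not say which strategy the thesis uses, so this is an assumption. The function names and the example probabilities are hypothetical and are not taken from the thesis.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted topic distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(
    predictions: Sequence[Sequence[float]], batch_size: int
) -> List[int]:
    """Return the indices of the `batch_size` most uncertain unlabeled items.

    `predictions[i]` holds the model's predicted topic probabilities for
    unlabeled item i. Higher entropy means the model is less certain.
    """
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical topic probabilities for four unlabeled tweets over three topics.
pool_predictions = [
    [0.98, 0.01, 0.01],  # model is confident
    [0.34, 0.33, 0.33],  # nearly uniform: most uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.45, 0.05],
]

# Indices of the two tweets a user would be asked to label next.
to_label = select_for_annotation(pool_predictions, batch_size=2)
print(to_label)
```

In a full loop, the user's labels for the selected items would be added to the training set, the model retrained, and the selection repeated on the remaining pool.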

Item Type:	Thesis (Master's Thesis / Essay)
Degree programme:	Artificial Intelligence
Thesis type:	Master's Thesis / Essay
Language:	English
Date Deposited:	15 Feb 2018 08:28
Last Modified:	15 Feb 2018 08:28
URI:	https://fse.studenttheses.ub.rug.nl/id/eprint/15186
