Extracting social graphs from the Web

Visser, W.M. (2015) Extracting social graphs from the Web. Master's Thesis / Essay, Computing Science.

Abstract

The Web can be seen as a graph structure with documents as vertices, connected to each other via hyperlinks. From the content of these documents, we can extract another type of graph with semantically interrelated entities. Such graphs are more difficult to extract, because their relations are defined more implicitly and are spread out over multiple documents. We analyze the possibilities of combining the scattered information on the Web to extract social graphs with users as vertices and their relationships as edges. The developed end-to-end system can map HTML documents to a social graph and provides a visualization of the result. With a combination of a keyword-based and a configurable ad-hoc approach, we are able to extract usernames from web documents. To evaluate the system, we gather a dataset containing 5812 documents by injecting the Alexa Top 100 of the Netherlands as seeds into a crawler. For this dataset, the system extracts usernames with an average F1 score of 0.91 per document. Based on these usernames and their co-occurrences, our system can create a graph and store it in a Titan database. This process relies on MapReduce, making our solution capable of scaling out horizontally. Co-occurrence metrics are used to resolve relation strengths between users in the social graph. A high value indicates a stronger relationship (e.g., close friends) than a low value (e.g., acquaintances). We compare the Jaccard index, Sørensen-Dice index, overlap coefficient and a thresholded overlap coefficient to determine these strengths. In the queries to our graph, we use strength values to remove the weakest relations from query results. This allows us to visualize only the most relevant results and provides better insight into the data. By analyzing several often-occurring patterns in our dataset, we discover that the Jaccard index performs best.
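
As an illustration of the co-occurrence metrics named in the abstract, the sketch below computes each of them for two users, where every user is represented by the set of documents in which the username occurs. It is not the thesis implementation. The function names, the example document identifiers and the `min_size` parameter are hypothetical. The abstract does not define the thresholded overlap coefficient, so the variant shown here (returning 0 when the smaller set has fewer than `min_size` documents) is an assumption.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorensen_dice(a: set, b: set) -> float:
    """Sørensen-Dice index: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def thresholded_overlap(a: set, b: set, min_size: int = 2) -> float:
    """Assumed variant: overlap coefficient, but 0 when the smaller
    set has fewer than min_size documents (hypothetical threshold)."""
    if min(len(a), len(b)) < min_size:
        return 0.0
    return overlap(a, b)


# Hypothetical data: documents in which each username was found.
docs_alice = {"d1", "d2", "d3", "d4"}
docs_bob = {"d3", "d4", "d5"}
docs_carol = {"d4"}

for name, docs in [("bob", docs_bob), ("carol", docs_carol)]:
    print(
        name,
        round(jaccard(docs_alice, docs), 3),
        round(sorensen_dice(docs_alice, docs), 3),
        round(overlap(docs_alice, docs), 3),
        round(thresholded_overlap(docs_alice, docs), 3),
    )
```

In this toy data, a user seen in only one document (`docs_carol`) reaches the maximum overlap coefficient with `docs_alice` even though the evidence is a single shared page, while the Jaccard index stays low. This shows how the metrics can rank the same pair differently.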

Item Type:	Thesis (Master's Thesis / Essay)
Degree programme:	Computing Science
Thesis type:	Master's Thesis / Essay
Language:	English
Date Deposited:	15 Feb 2018 08:10
Last Modified:	15 Feb 2018 08:10
URI:	https://fse.studenttheses.ub.rug.nl/id/eprint/13566
