Document Understanding for Automatic Proceedings Generation

Groot, J. de (2013) Document Understanding for Automatic Proceedings Generation. Master's Thesis / Essay, Computing Science.

Abstract

Conference Management Tools (CMTs) support people involved in running and participating in conferences with process management. Several such tools are available on the World Wide Web, but none of them offers full generation of the proceedings. Because automation and digitization of data are becoming increasingly important, we introduce in this thesis a management tool that combines meta-data extraction with proceedings generation. Meta-data extraction from research papers is mainly used for indexing the papers in a digital library. In this thesis, we show that meta-data extraction is also suitable for obtaining correct meta-data for proper generation of the conference proceedings. When meta-data is extracted automatically, the user does not have to worry about spelling mistakes that might occur when the data is entered manually, because the extracted data is an exact copy of the data present in the paper. We also show that automatic extraction improves the usability of the CMT. For the extraction of the meta-data we applied two different extraction approaches. The title, abstract and index terms are extracted using a rule-based approach. For the extraction of the author data we used a machine learning algorithm, in particular a naive Bayes classifier. The results of these extraction methods are promising: we achieved 99%, 87%, 89% and 96% accuracy for the title, abstract, index terms and authors respectively. Combined with a low rate of missing results, this makes the data very usable for the generation of the proceedings. Once all the papers for the proceedings are collected and all the meta-data is collected and verified, the proceedings are generated using LaTeX.
Based on our findings we conclude that meta-data extraction is suitable for improving the usability of the CMT and for ensuring that the meta-data listed in the proceedings is free of spelling errors in at least 95% of cases. The extracted meta-data is also directly usable for indexing the papers, either to search through them or for distribution.

Item Type:	Thesis (Master's Thesis / Essay)
Degree programme:	Computing Science
Thesis type:	Master's Thesis / Essay
Language:	English
Date Deposited:	15 Feb 2018 07:55
Last Modified:	15 Feb 2018 07:55
URI:	https://fse.studenttheses.ub.rug.nl/id/eprint/11331
