PressAnalyzer Print Media & Text Mining

Wilhelm, M. (2001) PressAnalyzer Print Media & Text Mining. Master's Thesis / Essay, Computing Science.

Infor_Ma_2001_MWilhelm.CV.pdf - Published Version


Text mining, also known as knowledge discovery from text or document information mining, is the process of extracting interesting data from very large text corpora in order to discover knowledge. It is an interdisciplinary field involving information retrieval, text understanding, information extraction, clustering, categorization, database technology, machine learning and data mining.

This thesis presents PressAnalyzer, a hybrid approach to structure analysis and clustering of job advertisements in the press. It is hybrid in the sense that it uses both the layout and the textual features of an advertisement. We use these data to define clusters and to classify advertisement text into these categories, each representing a group of similar information. The pilot version of PressAnalyzer is concerned with the definition of one cluster and the implementation of a categorization algorithm that labels a block of text in an advertisement containing information on a specific concept. A keyword list represents the characteristic terms and words of this cluster, and a simple least-distance measurement algorithm is used to determine which part of the text contains a high density of keywords.

The concluding results show that a keyword-based clustering and categorization algorithm would be successful only if all clusters were considered and labeled in parallel, because the clusters overlap and only then can the borders of each text block be marked correctly. Another aspect of clustering methods at the textual level is the effect of misspellings and other errors arising from the quality of the text.
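
The idea of locating the part of an advertisement with a high density of keywords can be illustrated with a minimal sketch. This is not the thesis's least-distance measurement algorithm, whose details are not given in the abstract. It swaps in a plain sliding-window count as a stand-in, and the function names, the window size and the sample keyword list are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def densest_window(tokens, keywords, window=8):
    """Return (start, end, hits) for the first window of `window` tokens
    that contains the largest number of keyword occurrences.

    Illustrative stand-in only: a sliding-window count, not the
    least-distance measure used in the thesis.
    """
    if not tokens:
        return (0, 0, 0)
    window = min(window, len(tokens))
    # 1 where the token is a keyword, 0 otherwise.
    flags = [1 if t in keywords else 0 for t in tokens]
    hits = sum(flags[:window])
    best = (0, window, hits)
    # Slide the window one token at a time, updating the count incrementally.
    for start in range(1, len(tokens) - window + 1):
        hits += flags[start + window - 1] - flags[start - 1]
        if hits > best[2]:
            best = (start, start + window, hits)
    return best


# Hypothetical advertisement text and keyword list for one cluster.
ad = ("Our company is growing fast. We require a degree in computer "
      "science, experience with databases and good English skills. "
      "Please send your application to our office.")
keywords = {"require", "degree", "experience", "skills", "knowledge"}

tokens = tokenize(ad)
start, end, hits = densest_window(tokens, keywords)
print(hits, " ".join(tokens[start:end]))
```

A single fixed window like this also shows the limitation the abstract reports: with only one cluster defined, nothing marks where the block really ends, so block borders stay ambiguous until neighbouring clusters are labeled as well.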

Item Type: Thesis (Master's Thesis / Essay)
Degree programme: Computing Science
Thesis type: Master's Thesis / Essay
Language: English
Date Deposited: 15 Feb 2018 07:29
Last Modified: 15 Feb 2018 07:29
