A Context-based Approach to Reduce the Amount of Unknown Words in User Search Queries

Hielkema, R.M. (2010) A Context-based Approach to Reduce the Amount of Unknown Words in User Search Queries. Master's Thesis / Essay, Human-Machine Communication.

AI-MHMC-2010-R.M.HIELKEMA.pdf - Published Version

Abstract

Unknown words lead to bad answers in Question Answering. In addition to the traditional Question Answering challenges, web search applications face another challenge when dealing with unknown words: they have to deal with the language found in user queries. User queries are frequently ungrammatical, telegraphic in style and contain misspelled words, all of which make their automatic interpretation very difficult. Although little research has been done on finding unknown words in user search queries, we will show that this is a valuable goal. Q-go is a company specializing in natural language search, but like most search engines, it has difficulty interpreting unknown words when these appear in user search queries. We investigated how often unknown words occur and which lexical types are responsible for the unknown words Q-go encounters in three domains. This has provided a useful and unique insight into unknown words in a real-world application, since many projects do not take the time to analyze the data manually. The manual analysis showed that the largest share of unknown words in the data are named entities (28.4%), so further research concentrated on the identification and semantic classification of named entities in user search queries. Two context-based approaches to reduce the number of unknown named entities were compared: Paşca (2007) and Pennacchiotti and Pantel (2006). Starting from a small set of seed entities, we extract candidates for various classes of interest to web search users. We experimented with multiple parameters, similarity metrics and the three domains, among others. The most promising results were obtained with the approach based on Paşca (2007). The productivity of a class seems to be the factor most predictive of success in finding new named entities of the correct semantic class. Further, the approach of Paşca (2007) was able to deal successfully with user query language, which is also an important result.
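
The general idea of seed-based, context-based candidate extraction described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation, nor a faithful reproduction of Paşca (2007) or Pennacchiotti and Pantel (2006). It assumes single-token entities, whitespace tokenization, contexts made up of the immediately neighbouring words, and cosine similarity as the similarity metric. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def context_vector(queries, term):
    """Count the (left word, right word) contexts in which `term` occurs."""
    vec = Counter()
    for query in queries:
        # Pad with boundary markers so query-initial and query-final terms get a context.
        tokens = ["<s>"] + query.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            if tokens[i] == term:
                vec[(tokens[i - 1], tokens[i + 1])] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(queries, seeds):
    """Rank non-seed words by how closely their contexts match the seeds' contexts."""
    seeds = {s.lower() for s in seeds}
    # The class is represented by the summed context counts of all seeds.
    class_vec = Counter()
    for seed in seeds:
        class_vec.update(context_vector(queries, seed))
    vocabulary = {t for q in queries for t in q.lower().split()} - seeds
    scored = [(cosine(context_vector(queries, t), class_vec), t) for t in vocabulary]
    # Keep only candidates that share at least one context with the seeds.
    return sorted(((s, t) for s, t in scored if s > 0), key=lambda x: (-x[0], x[1]))
```

Given a query log and a few seed entities for a class (for example, city names), `rank_candidates` returns the remaining words ordered by contextual similarity to the seeds. A realistic system would additionally need multi-word entities, frequency thresholds and a much larger query log.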

Item Type: Thesis (Master's Thesis / Essay)
Supervisor: Spenader, J.
Degree programme: Human-Machine Communication
Thesis type: Master's Thesis / Essay
Language: English
Date Deposited: 15 Feb 2018 07:31
Last Modified: 02 May 2019 12:38
URI: http://fse.studenttheses.ub.rug.nl/id/eprint/9141
