Lexical semantic similarity and its application to business catalog retrieval

Jiang, Jian

dc.contributor.author	Jiang, Jian	en
dc.date.accessioned	2006-07-28T20:02:55Z
dc.date.available	2006-07-28T20:02:55Z
dc.date.issued	1998	en
dc.date.submitted	1998	en
dc.description.abstract	This thesis targets the problems of language variability (i.e., polysemy and synonymy) from the viewpoint of lexical semantic similarity - a measure of semantic/conceptual similarity between pairs of lexicalized concepts represented in words or terms. As is often the case for many tasks in information retrieval (IR) and natural language processing (NLP), a job is decomposed into the requirement of resolving the semantic relation between lower-level constituents such as words or concepts. One needs to develop a consistent, widely applicable computational model to assess this type of relation. We believe that a proper identification of similarity between concepts would contribute significantly to resolving semantic ambiguity in general. We start by looking at the fundamentals of the concept of similarity, its assumptions and characteristics. A new framework for universal object comparison and similarity determination is then constructed in set-theoretic terms. This is in response to the observation that there is generally a lack of systematic classification and definition of various similarity formulae. Typically, a similarity formula or metric is directly employed in a problem without much theoretical justification, and the assumptions behind it are not stated explicitly. In our study, rather than directly stipulating a similarity definition, we intend to derive it from a set of reasonable and intuitively justifiable assumptions. We then argue that this framework provides a general account for modeling object comparisons, and some of the specific comparison schemes can be further abstracted and quantified using information-theoretic notions so that a simple computational means of measuring universal object similarity can be achieved. To realize such a computational object comparison scheme, we propose a new model of measuring lexical semantic similarity given the context of a lexical taxonomy. 
This model enhances the graph distance approach by properly quantifying the weight of each edge along the shortest path that links two concept nodes in the taxonomic hierarchy. The lexical semantic information derived from taxonomy structure essentially secures a solution to determining the 'commonality' of two objects in information-theoretic fashion, which is crucial to a realization of a computational means of resolving universal lexical semantic similarity. This also allows us to develop a unified view of various similarity measures based on taxonomic knowledge, given the background of our general framework about object comparison schemes. Contrary to the common view that object 'commonality' dictates similarity, both our theoretical model and later empirical verifications have demonstrated that object 'difference' is perhaps a better approximation to the similarity measure when object content is measured by its information content. This core similarity model is then applied to several levels of applications in NLP and IR. First, a word-pair similarity ranking experiment was conducted. The results indicate that the proposed similarity measure ('difference' approach), compared with other related computational models, achieves a result that is closest to human performance when the same task was replicated. Second, a simple word sense disambiguation algorithm is developed based on the local contextual information obtained from words surrounding the target ambiguous word. The empirical evidence from the test of tagging all nouns in a running text verifies that the proposed similarity model can generate better performance than other related similarity models in this 'intermediate'-level NLP application. To raise the complexity of the issue of semantic similarity, we move from conducting a single, elemental concept pair similarity comparison to a multi-layered compound concept (phrase-like) similarity comparison. 
A final and more practical application of the proposed similarity model is in the areas of text retrieval and document classification. Originating from a project to develop an Electronic Industrial Directory (EID) system, a prototype business catalog retrieval system is designed and implemented. The main function is to locate the relevant catalog headings under which a product/service description would belong. In order to provide an appropriate context for lexical-level similarity comparisons, a shallow parsing algorithm is designed to capture both syntactic and semantic information in both catalog headings and queries. The weak technologies employed here require no pre-existing domain-specific knowledge structures. Hence the resultant model has an appeal to wider domains of application. We developed various methods both for decomposing complex linguistic constructs into single lexicalized items to which the developed lexical similarity methods can be applied, and for aggregating the calculated subcomponent similarities into an overall similarity determination. For the within-subcomponent parsing and similarity determinations, we designed several algorithms to tackle syntactic ambiguity problems such as the analysis of compound nouns and complex phrases. For the between-subcomponent similarity determinations, we constructed a framework and proposed two dynamic subcomponent weighting schemes in terms of subcomponent information content value. A prototype system was developed and evaluated against the benchmark of the classical vector space model. The analysis of results indicates that there is a significant improvement over the benchmark (both precision and recall are increased by about 10%).	en
dc.format	application/pdf	en
dc.format.extent	8824320 bytes
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10012/315
dc.language.iso	en	en
dc.pending	false	en
dc.publisher	University of Waterloo	en
dc.rights	Copyright: 1998, Jiang, Jian. All rights reserved.	en
dc.subject	Harvested from Collections Canada	en
dc.title	Lexical semantic similarity and its application to business catalog retrieval	en
dc.type	Doctoral Thesis	en
uws-etd.degree	Ph.D.	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en
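
The abstract describes weighting each edge on the shortest taxonomy path between two concept nodes and contrasts a 'difference' measure with a 'commonality' measure, both expressed through information content. The sketch below is one possible reading of that idea, not the thesis's actual formulation: it assumes that information content is the negative log of a concept's relative frequency and that an edge's weight is the information-content gap between child and parent. The taxonomy, the counts, and every function name are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from the thesis.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
    "apple": "fruit",
}

# Hypothetical corpus counts for each concept on its own (descendants excluded).
OWN_COUNT = {"entity": 0, "vehicle": 10, "car": 60,
             "bicycle": 30, "fruit": 20, "apple": 80}


def ancestors(concept):
    """Return the chain from `concept` up to the root, `concept` first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def cumulative_counts():
    """A concept's count includes the counts of everything beneath it."""
    totals = dict.fromkeys(PARENT, 0)
    for concept, count in OWN_COUNT.items():
        for node in ancestors(concept):
            totals[node] += count
    return totals


TOTALS = cumulative_counts()
ROOT_TOTAL = TOTALS["entity"]


def information_content(concept):
    """Assumed definition: IC(c) = -log p(c), with p(c) a relative frequency."""
    return -math.log(TOTALS[concept] / ROOT_TOTAL)


def lowest_common_subsumer(a, b):
    """Most specific concept that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("concepts do not share a root")


def edge_weight(child):
    """Assumed edge weight: IC gained when moving from the parent to the child."""
    return information_content(child) - information_content(PARENT[child])


def difference_distance(a, b):
    """'Difference' view: sum the edge weights along the path a -> LCS -> b."""
    lcs = lowest_common_subsumer(a, b)
    total = 0.0
    for node in (a, b):
        while node != lcs:
            total += edge_weight(node)
            node = PARENT[node]
    return total


def commonality(a, b):
    """'Commonality' view: IC of the lowest common subsumer alone."""
    return information_content(lowest_common_subsumer(a, b))
```

Under these assumptions the summed edge weights telescope, so `difference_distance(a, b)` equals `IC(a) + IC(b) - 2 * IC(lcs)`. With the toy counts, `car` ends up closer to `bicycle` than to `apple`, which is the ordering a word-pair ranking experiment of the kind mentioned in the abstract would compare against human judgments.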

Files

Original bundle

Name: NQ32834.pdf
Size: 6.59 MB
Format: Adobe Portable Document Format

Collections

Digitized University of Waterloo Theses