A Semantic Graph Model for Text Representation and Matching in Document Mining

Shaban, Khaled

Files

kshaban2006.pdf (1.39 MB)

Date

2006

Authors

Shaban, Khaled

Publisher

University of Waterloo

Abstract

The explosive growth in the number of documents produced daily necessitates effective ways to explore, analyze, and discover knowledge from documents. Document mining research has emerged to devise automated means of discovering and analyzing useful information in documents. This work has mainly been concerned with constructing text representation models, developing distance measures to estimate similarities between documents, and using these in mining processes such as document clustering, document classification, information retrieval, information filtering, and information extraction. Conventional text representation methodologies treat documents as bags of words and ignore the meanings and ideas their authors want to convey. This deficiency causes similarity measures either to miss the contextual similarity of text passages that use different words, or to perceive contextually dissimilar passages as similar because they share words. This thesis presents a new paradigm for mining documents by exploiting the semantic information of their texts. A formal semantic representation of linguistic inputs is introduced and used to build a semantic representation scheme for documents. The scheme is constructed by accumulating the outputs of syntactic and semantic analysis. A new distance measure is developed to determine the similarity between the contents of documents. The measure is based on inexact matching of attributed trees. It involves computing all distinct similarity common sub-trees, and it can be computed efficiently. The proposed representation scheme, together with the proposed similarity measure, is believed to enable more effective document mining processes. The proposed techniques were implemented as vital components of a mining system. A case study of semantic document clustering is presented to demonstrate the operation and efficacy of the framework. Experimental work is reported, and its results are presented and analyzed.
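
The bag-of-words deficiency described in the abstract can be illustrated with a minimal sketch. This is not code from the thesis: the function names `bag_of_words` and `cosine_similarity`, the tokenization rule, and the three example sentences are all assumptions chosen for illustration, and cosine similarity over word counts stands in for a conventional distance measure.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    # Hypothetical tokenizer: lowercase the text and count alphabetic tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    # Cosine of the angle between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative sentences (assumed, not taken from the thesis).
s1 = "The dog bit the man"
s2 = "The man bit the dog"           # same words, different meaning
s3 = "A canine attacked a person"    # similar meaning, different words

# Identical word counts: the bag-of-words view cannot tell s1 from s2.
sim_reordered = cosine_similarity(bag_of_words(s1), bag_of_words(s2))

# No shared words: the bag-of-words view sees s1 and s3 as unrelated.
sim_paraphrase = cosine_similarity(bag_of_words(s1), bag_of_words(s3))
```

In this sketch the reordered pair scores as essentially identical although the meanings differ, while the paraphrase pair scores zero although the meanings are close. These are the two failure modes that motivate a semantic representation.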

Keywords

Electrical & Computer Engineering, Document mining, semantic understanding, text representation, similarity measure, document clustering.

URI

http://hdl.handle.net/10012/2860

Collections

Theses
Electrical and Computer Engineering
