Application of Textual Feature Extraction to Corporate Bankruptcy Risk Assessment

WANG, ZHEXUAN

Files

Wang_Zhexuan.pdf (5.28 MB)

Date

2017-09-21

Authors

WANG, ZHEXUAN

Advisor

Wong, Andrew
Stashuk, Daniel

Publisher

University of Waterloo

Abstract

The inception of the Internet in the late twentieth century has made it possible to generate a huge volume of data from a multitude of sources in a very short period of time. However, most of this data is presented in an unstructured format. According to recent research, unstructured data contains more comprehensive, effective and practical information than structured data because of its descriptive character, especially in finance, healthcare, manufacturing and other domains. It is anticipated that data mining technology can be applied effectively to unstructured data to develop more accurate predictive models, decision-support platforms and man-machine interactive systems. This thesis focuses on applying a text mining system known as TP2K (Text Pattern to Knowledge System), developed by my supervisor Professor Andrew K.C. Wong, to the finance industry. More specifically, the text mining system I propose in this thesis is a concept-based textual feature extraction method, built on TP2K, for corporate bankruptcy risk assessment. Bankruptcy risk assessment evaluates the risk that a corporation will go bankrupt. It is linked to enterprise sustainability assessment, investment portfolio optimization and corporate management. Over the years, various models have been built using numerical and structured data (e.g. financial indicators and ratios). Yet no model has adequately leveraged textual data for quantitative analysis in corporate bankruptcy risk assessment. Note that certain critical information, such as the strategic future directions and corporate governance of an enterprise, can only be reflected through textual data (e.g. annual financial reports). Recently, it has been reported that combining textual and numeric features renders a more accurate assessment of corporate bankruptcy.
Nevertheless, extracting features from textual data remains difficult since it still requires considerable human effort. According to the existing literature, there are no obvious or simple criteria for textual feature mining and extraction in finance, owing to the diversity of objectives and interests. Thus, domain experts remain essential in the industry. Current textual feature extraction methods in finance fall into two distinct types. The first type is based on a comprehensive handcrafted dictionary of proper keywords that requires continuous manual updating. The second type is based on data mining technology (e.g. high-frequency words). The former is time-consuming, while the latter usually produces results that are ambiguous, irrelevant or hard for industry to interpret in practice. In this thesis, we (my supervisor and I) propose a method known as concept-based textual feature extraction, built on TP2K, for corporate bankruptcy risk assessment. Compared to existing methods, this method can extract and mine textual features from financial reports more accurately and succinctly, allowing industrial interpretation in practice with limited human participation. It is semi-automatic and interactive. Its algorithmic procedure is briefly described as follows: (1) apply the linear-time and language-independent TP2K system to discover the “Word, Term and Phrase” (WTP) patterns from text data without relying on explicit prior knowledge or training; (2) apply a WTP-directed search algorithm in TP2K to find appropriate financial attribute names and their attribute values from the text context, obtaining relevant attribute and attribute value pairs (AVPs) that build part of the Domain Knowledge Base (DKB) in support of predictive analysis of corporate bankruptcy risk.
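To make the notion of an attribute and attribute value pair concrete, the following is a minimal sketch of pairing attribute names with nearby values in report text. It is not the TP2K algorithm, whose details are not given here: the attribute list, the value pattern, the character window and the function name `extract_avps` are all assumptions chosen for illustration.

```python
import re

# Hypothetical attribute names a domain expert might seed the knowledge base with.
ATTRIBUTE_NAMES = ["net income", "total debt", "going concern"]

# Assumed value format: optional currency sign, a number, optional scale word or percent sign.
VALUE_PATTERN = re.compile(r"\$?\d[\d,]*(?:\.\d+)?(?:\s*(?:million|billion|%))?")


def extract_avps(text, attribute_names=ATTRIBUTE_NAMES, window=60):
    """Return (attribute, value) pairs found in text.

    For each occurrence of an attribute name, take the first numeric value
    within `window` characters after it. If no value is found, the attribute
    is recorded as mentioned only (value None).
    """
    pairs = []
    lowered = text.lower()
    for name in attribute_names:
        for match in re.finditer(re.escape(name), lowered):
            context = text[match.end():match.end() + window]
            value = VALUE_PATTERN.search(context)
            pairs.append((name, value.group(0).strip() if value else None))
    return pairs


sample = (
    "Net income decreased to $4.2 million in 2016. "
    "Total debt rose to $310 million. "
    "There is substantial doubt about our ability to continue as a going concern."
)
avps = extract_avps(sample)
```

A fixed keyword list like this is closer to the handcrafted-dictionary approach that the abstract contrasts with; it is used here only to show what the resulting pairs look like.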
At the outset, domain experts will still play a major role in building the DKB. As more user-inputted domain information is integrated into the DKB, the system will become more automated in extracting and validating related information for bankruptcy risk assessment, with limited involvement from domain experts. In this thesis, AVPs have been used in corporate risk assessment to render more robust and less biased textual features. This allows experts to acquire individual selection rules and assist with their organization in a comprehensive manner using traditional machine learning processing. To validate the proposed method, experiments on financial data were conducted. A collection of corporate annual reports containing textual and numeric information was used to evaluate the corporate risk assessment in a semi-automatic manner. First, the extracted AVP data was converted to binarized textual features in accordance with certain finance field criteria. It was then integrated with related numerical features (financial ratios) so that traditional machine learning techniques could construct a predictive model for corporate bankruptcy assessment. The experimental results demonstrated an effective two-year-ahead (T-2) prediction, outperforming prediction models based only on numeric features under 10-fold cross-validation. At the same time, we observed that all features discovered, numeric or textual, were consistent with the industry standard. Hence, we believe the proposed method has achieved an important milestone for assessing bankruptcy risk in practice, and is potentially useful for providing trading advice to investors in the future.
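The feature construction and evaluation setup described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the thesis's implementation: the binarization rules in `RULES`, the example ratios, and the helper names `binarize`, `build_row` and `k_fold_indices` are hypothetical, and the actual finance field criteria and learning algorithm are not specified here.

```python
import random

# Hypothetical binarization rules: attribute name -> predicate on the raw value.
RULES = {
    "going concern": lambda v: True,                   # any mention counts as a warning flag
    "net income": lambda v: v is not None and v < 0,   # flag a reported loss
}


def binarize(avps):
    """Map one firm's {attribute: value} dict to a fixed-order 0/1 vector."""
    return [int(name in avps and rule(avps[name])) for name, rule in RULES.items()]


def build_row(avps, ratios):
    """Concatenate binarized textual features with numeric financial ratios."""
    return binarize(avps) + list(ratios)


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint folds covering all samples
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# One firm: a going-concern mention, a loss, and two assumed financial ratios.
row = build_row({"going concern": None, "net income": -4.2}, ratios=[0.35, 1.8])
splits = list(k_fold_indices(20, k=10))
```

Each `(train_idx, test_idx)` pair would be used to fit a classifier on the training rows and score it on the held-out rows, once with the combined features and once with the ratios alone, so the two feature sets are compared on identical folds.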