Completeness of Fact Extractors and a New Approach to Extraction with Emphasis on the Refers-to Relation

Lin, Yuan

Completeness of Fact Extractors and a New Approach to Extraction with Emphasis on the Refers-to Relation

Files

thesis_yuan_lin.pdf (776.94 KB)

Date

2008-08-19T15:48:35Z

Authors

Lin, Yuan

Publisher

University of Waterloo

Abstract

This thesis deals with fact extraction, which analyzes source code (and sometimes related artifacts) to produce extracted facts about the code. These facts may, for example, record where in the code variables are declared and where they are used, as well as related information. These extracted facts are typically used in software reverse engineering to reconstruct the design of the program. This thesis has two main parts, each of which deals with a formal approach to fact extraction. Part 1 of the thesis deals with the question: How can we demonstrate that a fact extractor actually does its job? That is, does the extractor produce the facts that it is supposed to produce? This thesis builds on the concept of semantic completeness of a fact extractor, as defined by Tom Dean et al, and further defines source, syntax and compiler completeness. One of the contributions of this thesis is to show that in particular important cases (when the extractor is deterministic and its front end is idempotent), there is an efficient algorithm to determine if the extractor is compiler complete. This result is surprising, considering that in general it is undecidable if two programs are semantically equivalent, and it would seem that source code and its corresponding extracted facts are each essentially programs that are to be proved to be equivalent or at least sufficiently similar. The larger part of the thesis, Part 2, presents Algebraic Refers-to Analysis (ARA), a new approach to fact extraction with emphasis on the Refers-to relation. ARA provides a framework for specifying fact extraction, based on a three-step pipeline: (1) basic (lexical and syntactic) extraction, (2) a normalization step and (3) a binding step. For practical programming languages, these three steps are repeated, in stages and phases, until the Refers-to relation is computed. During the writing of this thesis, ARA pipelines for C, Java, C++, Fortran, Pascal and Ada have been designed. A prototype fact extractor for the C language has been created. Validating ARA means to demonstrate that ARA pipelines satisfy the programming language standards such as ISO C++ standard. In other words, we show that ARA phases (stages and formulas) are correctly transcribed from the rules in the language standard. Comparing with the existing approaches such as Attribute Grammar, ARA has the following advantages. First, ARA formulas are concise, elegant and more importantly, insightful. As a result, we have some interesting discovery about the programming languages. Second, ARA is validated based on set theory and relational algebra, which is more reliable than exhaustive testing. Finally, ARA formulas are supported by existing software tools such as database management systems and relational calculators. Overall, the contributions of this thesis include 1) the invention of the concept of hierarchy of completeness and the automatic testing of completeness, 2) the use of the relational data model in fact extraction, 3) the invention of Algebraic Refers-to Relation Analysis (ARA) and 4) the discovery of some interesting facts of programming languages.