Beyond Natural Language Processing: Advancing Software Engineering Tasks through Code Structure

Ding, Zishuo

Files

Ding_Zishuo.pdf (7.19 MB)

Date

2024-01-25

Authors

Ding, Zishuo

Advisor

Shang, Weiyi

Publisher

University of Waterloo

Abstract

Machine learning-based approaches have been widely used to address natural language processing (NLP) problems. Given the similarities between natural language text and source code, researchers have been applying NLP techniques to code-related tasks. However, source code and natural language differ by nature. For example, source code is highly structured and executable, whereas NLP techniques may not capture that structure. As a result, applying NLP techniques directly may not yield optimal results, and effectively adapting these techniques to software engineering tasks remains a significant challenge. To tackle this challenge, this thesis focuses on two important intersections between source code and natural language text: (1) learning and evaluating distributed code representations (i.e., code embeddings), which play a fundamental role in numerous software engineering tasks, especially in the era of deep learning, and (2) improving the textual information in logging statements (i.e., logging texts), which record useful information (i.e., logs) to support various software engineering activities.
For distributed code representations, we first conduct a comprehensive survey of existing code embedding techniques. This survey covers techniques borrowed from NLP as well as those specifically tailored for source code. We also identify six downstream software engineering tasks to evaluate the effectiveness of the learned code embeddings. Moreover, based on our analysis of existing code embedding techniques, we propose a novel approach to learn more generalizable code embeddings in a task-agnostic manner. This approach represents source code as graphs and leverages Graph Convolutional Networks to learn the embeddings.
For the textual information in logging statements, we propose to improve current logging practices in two ways: (1) proactively suggesting new logging texts: we propose automated deep learning-based approaches that generate logging texts by translating the related source code into short textual descriptions; and (2) retroactively analyzing existing logging texts: we make the first attempt to comprehensively study the temporal relations between logging and its corresponding source code, which we later use successfully to detect anti-patterns in existing logging statements. Based on the experimental results on the subject systems, we anticipate that our work can offer valuable suggestions and support to developers, helping them use NLP techniques effectively for software engineering tasks.

Keywords

code structure, software engineering, natural language processing, logging, code embedding

URI

http://hdl.handle.net/10012/20285

Collections

Theses
Electrical and Computer Engineering
