On the Automatic Coding of Text Answers to Open-ended Questions in Surveys

He, Zhoushanyue

Files

He_Zhoushanyue.pdf (1.45 MB)

Date

2021-01-13

Authors

He, Zhoushanyue

Advisor

Schonlau, Matthias

Publisher

University of Waterloo

Abstract

Open-ended questions allow participants to answer survey questions without any constraint. Responses to open-ended questions, however, are more difficult to analyze quantitatively than responses to closed-ended questions. In this thesis, I focus on analyzing text responses to open-ended questions in surveys. The thesis has three parts: double coding of open-ended questions, prediction of potential coding errors in manual coding, and a comparison between manual coding and automatic coding.

Double coding refers to two coders coding the same observations independently. It is often used to assess coders' reliability. I investigate the use of double coding to improve the performance of automatic coding. I find that, when the budget for manual coding is fixed, double coding that involves a more experienced expert coder yields a smaller but cleaner training set than single coding, and improves the predictions of statistical learning models when the coders' coding error rate exceeds a threshold. When data have already been double coded, double coding always outperforms single coding.

In many research projects, only a subset of the data can be double coded because of limited funding. My idea is that researchers can use the double-coded subset to improve the coding quality of the remaining single-coded observations. I therefore propose a model-assisted coding process that predicts the risk of coding errors; high-risk text answers are then double coded. The proposed coding process reduces coding error while retaining the ability to assess inter-coder reliability.

Manual coding and automatic coding are the two main approaches to coding responses to open-ended questions, yet how similar or different their coding errors are has not been well studied. I compare the coding errors of human coders and automated coders. I find that, despite different error rates, human coders and automated coders make similar mistakes.
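As a rough illustration of how double-coded answers can be used to assess inter-coder reliability, the sketch below computes Cohen's kappa for two coders and lists the answers on which they disagree. Cohen's kappa is a common agreement statistic chosen here as an assumption; the abstract does not say which reliability measure the thesis uses. The function names, the code labels, and the sample data are all hypothetical.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who coded the same answers independently."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of answers")
    n = len(codes_a)
    # Observed agreement: share of answers given the same code by both coders.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement, computed from each coder's marginal code frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used one identical code for every answer
    return (p_observed - p_chance) / (1.0 - p_chance)


def disagreements(codes_a, codes_b):
    """Indices of answers where the two coders assigned different codes."""
    return [i for i, (a, b) in enumerate(zip(codes_a, codes_b)) if a != b]


# Hypothetical double-coded sample (illustrative labels, not data from the thesis).
coder_1 = ["pos", "neg", "pos", "neu", "neg", "pos"]
coder_2 = ["pos", "neg", "neu", "neu", "neg", "neg"]

kappa = cohen_kappa(coder_1, coder_2)
to_review = disagreements(coder_1, coder_2)
print(f"kappa = {kappa:.2f}, answers to review: {to_review}")
```

In a workflow like the one the abstract describes, the answers on which the coders disagree would be natural candidates for review by an expert coder before the data are used to train an automatic coder.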