Machine learning to detect invalid text responses: Validation and comparison to existing detection methods

dc.contributor.author: Yeung, Ryan C.
dc.contributor.author: Fernandes, Myra A.
dc.date.accessioned: 2025-12-03T16:03:42Z
dc.date.available: 2025-12-03T16:03:42Z
dc.date.issued: 2022-02-17
dc.description: This is a post-peer-review, pre-copyedit version of an article published in Behavior Research Methods. The final authenticated version is available online at: https://doi.org/10.3758/s13428-022-01801-y
dc.description.abstract: A crucial step in analysing text data is the detection and removal of invalid texts (e.g., texts with meaningless or irrelevant content). To date, research topics that rely heavily on analysis of text data, such as autobiographical memory, have lacked methods of detecting invalid texts that are both effective and practical. Although researchers have suggested many data quality indicators that might identify invalid responses (e.g., response time, character/word count), few of these methods have been empirically validated with text responses. In the current study, we propose and implement a supervised machine learning approach that can mimic the accuracy of human coding, but without the need to hand-code entire text datasets. Our approach (a) trains, validates, and tests on a subset of texts manually labelled as valid or invalid, (b) calculates performance metrics to help select the best model, and (c) predicts whether unlabelled texts are valid or invalid based on the text alone. Model validation and evaluation using autobiographical memory texts indicated that machine learning accurately detected invalid texts with performance near human coding, significantly outperforming existing data quality indicators. Our openly available code and instructions enable new methods of improving data quality for researchers using text as data.
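The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the record does not say which model or features they used, so a small bag-of-words naive Bayes classifier is swapped in here, and the example texts, the `NaiveBayes` class, the `metrics` helper, and the 60/20/20 split are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.total_docs = len(labels)
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        def log_score(c):
            denom = sum(self.counts[c].values()) + self.alpha * len(self.vocab)
            score = math.log(self.prior[c] / self.total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.counts[c][word] + self.alpha) / denom)
            return score
        return max(self.classes, key=log_score)


def metrics(y_true, y_pred, positive="invalid"):
    """Precision, recall, and F1 for the 'invalid' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def evaluate(model, rows):
    y_true = [label for _, label in rows]
    y_pred = [model.predict(text) for text, _ in rows]
    return metrics(y_true, y_pred)


# Hypothetical hand-labelled subset (invented for this sketch, not study data).
labelled = [
    ("i remember my graduation day with my family", "valid"),
    ("my first day at school was exciting", "valid"),
    ("we went camping by the lake last summer", "valid"),
    ("i remember visiting my grandmother at her house", "valid"),
    ("the day my brother was born at the hospital", "valid"),
    ("my wedding day with family and friends", "valid"),
    ("asdf asdf asdf", "invalid"),
    ("idk idk", "invalid"),
    ("nothing nothing", "invalid"),
    ("asdf jkl asdf", "invalid"),
    ("none idk nothing", "invalid"),
    ("jkl jkl none", "invalid"),
]

# (a) Split the labelled subset into train / validation / test portions.
rng = random.Random(0)
rng.shuffle(labelled)
n = len(labelled)
train = labelled[: int(0.6 * n)]
val = labelled[int(0.6 * n): int(0.8 * n)]
test = labelled[int(0.8 * n):]

# (b) Fit candidate models and keep the one with the best validation F1.
candidates = [NaiveBayes(alpha=a).fit(*zip(*train)) for a in (0.1, 1.0)]
best = max(candidates, key=lambda m: evaluate(m, val)["f1"])
print("test metrics:", evaluate(best, test))

# (c) Predict labels for unlabelled texts from the text alone.
for text in ["asdf asdf", "i remember my graduation day with my family"]:
    print(best.predict(text), "<-", text)
```

With a dataset this small the reported metrics are not meaningful; the sketch only shows how the split, the metric-based model selection, and the final prediction step fit together.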
dc.description.sponsorship: NSERC
dc.identifier.uri: https://doi.org/10.3758/s13428-022-01801-y
dc.identifier.uri: https://hdl.handle.net/10012/22702
dc.language.iso: en
dc.publisher: Springer
dc.relation.ispartofseries: Behavior Research Methods; 54(6)
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: machine learning
dc.subject: autobiographical memory
dc.subject: text classification
dc.subject: text as data
dc.subject: careless responding
dc.title: Machine learning to detect invalid text responses: Validation and comparison to existing detection methods
dc.type: Article
dcterms.bibliographicCitation: Yeung, R. C., & Fernandes, M. A. (2022). Machine learning to detect invalid text responses: Validation and comparison to existing detection methods. Behavior Research Methods, 54(6), 3055–3070.
uws.contributor.affiliation1: Faculty of Arts
uws.contributor.affiliation2: Psychology
uws.peerReviewStatus: Reviewed
uws.scholarLevel: Faculty
uws.typeOfResource: Text

Files

Original bundle

Name: s13428-022-01801-y.pdf
Size: 1.02 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 4.47 KB
Format: Item-specific license agreed upon to submission