Data Quality Relevance in Linguistic Analysis: The Impact of Transcription Errors on Multiple Methods of Linguistic Analysis

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

An enormous amount of recorded speech is generated daily, and quickly transcribing and analyzing the text of this speech could have tremendous value to organizations and researchers. However, the speech transcription process has historically been laborious, expensive, and slow. Automatic speech recognition (ASR) tools have matured a great deal in the last decade and may be a suitable method for generating large-scale, high-quality transcriptions. These tools are fast and economical, but they generally produce errors at a much greater rate than human transcribers. It is unknown whether these errors matter when conducting psycholinguistic research. In this study, we will investigate the accuracy of earnings conference call transcripts produced by multiple tools and the impact of that transcription accuracy on the results of subsequent text mining analysis. While prior studies have focused on a single form of text mining, we will conduct three types of text analysis: bag-of-words-based classification, lexicon-based classification, and sentiment analysis. The results will show whether different types of text mining require different levels of transcription quality, and whether automated transcription services are feasible across a range of text mining applications.
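The abstract does not say how transcription accuracy will be measured. As an illustration only, the sketch below computes word error rate (WER), a metric commonly used to compare an ASR transcript against a human reference. The choice of WER, the function name `word_error_rate`, and the sample sentences are assumptions made for this example, not details taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercasing and whitespace tokenization; a real
    evaluation would also normalize punctuation and number formats.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences: one substituted word out of six reference words.
human = "revenue grew five percent this quarter"
asr = "revenue grew nine percent this quarter"
print(round(word_error_rate(human, asr), 3))
```

A score of 0 means the ASR output matches the reference word for word; higher values mean more substitutions, insertions, or deletions per reference word.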

Original language: American English
Title of host publication: Data Quality Relevance in Linguistic Analysis: The Impact of Transcription Errors on Multiple Methods of Linguistic Analysis
State: Published - 1800
Externally published: Yes
