Lunch at 12:30pm, talk at 1pm, in 148 Fitzpatrick

Title: Building an Automatic Speech Recognition Dataset and Model for an Extremely Low-Resource Language

Abstract: Modern NLP research and applications have centered on a handful of high-resource languages, while more than 99% of the world’s languages are seldom or never mentioned. Ideally, NLP as a scientific field of natural language should include all human languages equally. The biggest obstacle to this goal is the lack of readily available data, a condition commonly called “low-resource” in the NLP literature, which stems from small numbers of speakers, absence of orthography, technological underdevelopment, sociopolitical marginalization, and language endangerment and extinction. In this presentation, we introduce our effort to develop an Automatic Speech Recognition (ASR) model for Kichwa, an endangered language spoken in Ecuador, in collaboration with the speaker community. We created the first Kichwa ASR dataset by collecting spoken Kichwa data online and annotating it with transcriptions. Our experiments show that a model built on a pretrained multilingual model (Wav2Vec2-XLSR-53) and trained on only four hours of audio data performs well, with a 3.25% character error rate. This grassroots project exemplifies the inclusion of marginalized languages in NLP for accelerated language documentation and revitalization.

Bio: Chihiro Taguchi is a second-year Ph.D. student in the NLP group, advised by Dr. David Chiang. His research interests span the language sciences, in particular text-based and speech-based NLP and theoretical linguistics. He is currently working on the project “Language Documentation with an AI Helper”. He studied the Kichwa language at Notre Dame in Spring 2023 and visited Ecuador for five weeks to study it further.