Lunch at 12:30pm, talk at 1pm, in 148 Fitzpatrick

Title: Language Identification for Under-Resourced Languages

Abstract: Knowing the language of an input text or audio is a necessary first step for using almost any NLP tool, such as taggers, parsers, or translation systems. This task is called language identification (LangID); it is well studied and sometimes even considered solved. However, state-of-the-art systems still cannot accurately identify most of the world’s 7000 languages, which shows that there is room for improvement in LangID modeling and design. In this talk, we will present our misprediction-based hierarchical model (LIMIT), which can improve the predictions of LangID systems in post-processing. Our technique reduces error by 40-55% on various benchmarks and can help reduce the needle-in-a-haystack nature of low-resource data. We will also discuss our follow-up work on the heavy script-reliance of current LangID systems, which in our view is a flaw that hurts low-resource languages the most, and explain how adopting script-agnostic LangID might be beneficial, especially for Indian languages.

Bio: Milind Agarwal is a second-year Ph.D. student in the NLP group at George Mason University, advised by Dr. Antonios Anastasopoulos. His research interests center on foundational problems such as language identification and scalable resource creation using OCR. He is currently developing techniques to better identify extremely under-resourced languages in the wild and using them to better extract existing data online.