Lunch at 12:30pm, talk at 1pm, in 148 Fitzpatrick

Abstract: Pre-trained language models (PLMs) aim to learn universal language representations through self-supervised training tasks on large-scale unlabeled corpora. In PLMs, the quality of a word's representation depends heavily on its frequency in the corpus. It is well known that word frequencies in natural language corpora follow a heavy-tailed distribution: a large proportion of words appear only a few times, and the embeddings of these rare words are usually poorly optimized. Such embeddings carry inadequate semantic signal, which can complicate understanding of the input text and hurt pre-training of the entire model. My current research focuses on enhancing language model pre-training by leveraging rare-word definitions from a dictionary (e.g., Wiktionary). We propose two novel contrastive objectives, applied during the pre-training stage, for learning semantic meanings from the dictionary and improving language representations. In this talk, I will first review the literature on this topic, then present preliminary results of our proposed methods, and finally discuss several promising directions for using dictionaries to enhance NLP tasks.
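As a rough illustration only (not necessarily the objectives proposed in the talk), one common way to learn from dictionary definitions is an InfoNCE-style contrastive loss that pulls a rare word's contextual embedding toward an embedding of its definition, treating other definitions in the batch as negatives. The PyTorch sketch below assumes hypothetical tensor shapes, a temperature hyperparameter, and in-batch negatives; all names are placeholders for this toy example.

```python
# Illustrative sketch of a definition-alignment contrastive objective.
# Not the exact method from the talk; shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def definition_contrastive_loss(word_reprs: torch.Tensor,
                                def_reprs: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """word_reprs: (B, H) contextual embeddings of rare words from the PLM.
    def_reprs:  (B, H) pooled embeddings of the corresponding dictionary definitions.
    Returns an InfoNCE loss where the i-th definition is the positive for the
    i-th word and the remaining definitions in the batch act as negatives."""
    word_reprs = F.normalize(word_reprs, dim=-1)
    def_reprs = F.normalize(def_reprs, dim=-1)
    logits = word_reprs @ def_reprs.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(word_reprs.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    B, H = 8, 768
    words = torch.randn(B, H)   # e.g., hidden states at rare-word positions
    defs = torch.randn(B, H)    # e.g., pooled embeddings of Wiktionary definitions
    print(definition_contrastive_loss(words, defs).item())
```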

Bio: Wenhao Yu is a third-year Ph.D. student in the Department of Computer Science and Engineering at the University of Notre Dame. His research focuses on knowledge-driven natural language understanding and generation. He has published more than 10 papers at top-ranked NLP and data mining conferences, including ACL, EMNLP, NAACL, AAAI, KDD, and WWW. He has interned at Microsoft Research and IBM Research.