Dr. Marcos Zampieri

An NLP Perspective on Offensive Content in Social Media

Abstract: Offensive language is pervasive in social media. Individuals frequently take advantage of the perceived anonymity of computer-mediated communication to engage in behavior that many of them would not consider in real life. Online communities, social media platforms, and technology companies have been investing heavily in ways to cope with offensive content and prevent abusive behavior in social media. One of the most effective strategies for tackling this problem is to use computational methods to identify offense, aggression, and hate speech in user-generated content (e.g. posts, comments, and microblogs).

In this talk, I discuss some of the challenges of using NLP to recognize offensive content online. I present a new taxonomy created to annotate offensive language datasets. I also discuss the challenges of collecting and annotating multilingual datasets for offensive language identification. Finally, I present the set-up and the results of the two editions of the OffensEval competition hosted at SemEval-2019 and SemEval-2020 (https://sites.google.com/site/offensevalsharedtask/home).

Bio: Marcos Zampieri is a tenure-track assistant professor at the Rochester Institute of Technology in Rochester, NY, where he leads the Language Technology Group. He obtained his PhD from Saarland University in Germany with a thesis on computational approaches to language variation. He has previously held research and teaching positions in Germany and the UK. Marcos has published papers on many topics in Computational Linguistics and Natural Language Processing, such as language acquisition and variation, offensive language identification, and machine translation. Since 2014, he has been the main organizer of the workshop series on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located yearly with international top-tier NLP conferences such as COLING, EACL, and NAACL. He has co-edited a volume on the same topic, to appear in the series Studies in Natural Language Processing by Cambridge University Press.