In this question-and-answer feature, we speak to Dr. Hassan Sajjad, a senior scientist at Hamad Bin Khalifa University’s (HBKU) Qatar Computing Research Institute (QCRI) and one of the founders of Shaheen, a machine translation system that recently reached a significant milestone of 1 billion translated words.
Shaheen, the QCRI Arabic Language Team’s Machine Translation (MT) project, started as a major endeavor and a key platform for the research group. While statistical approaches were dominant in the beginning, the field has shifted toward Deep Learning methods in the last few years, and we sought to apply them as we created Shaheen.
In the first phase, we developed a state-of-the-art machine translation system for translating Modern Standard Arabic to and from English. With the advent of social media, dialectal Arabic became a de facto language for communication, especially for informal conversations such as those we see on Twitter and Facebook. Translation systems that are optimized for Modern Standard Arabic do not work well with dialects. In the current phase of the project, we have achieved a major milestone by developing an Arabic translation system that can effectively translate most of the dialects, as well as standard Arabic, to English.
Shaheen uses a transformer-based sequence-to-sequence model with hierarchical fine-tuning to adapt our Modern Standard Arabic-English translation system towards dialectal Arabic translation. This hierarchical fine-tuning enables a general translation system to be adapted so that a single system learns multiple variations of a language, which in our case are the different dialects and their genres.
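As a rough illustration of the idea, the sketch below shows one possible reading of hierarchical fine-tuning: a single checkpoint is carried through a sequence of stages that move from general to more specific data. It is not Shaheen's actual training code. The stage names, corpus labels, and the `fine_tune` stand-in are all assumptions made for this example, and a real system would run a training loop where the stand-in only builds a label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One fine-tuning stage (hypothetical); corpora are placeholder labels."""
    name: str
    corpora: tuple[str, ...]


def fine_tune(checkpoint: str, stage: Stage) -> str:
    """Stand-in for a real training run on stage.corpora.

    It only returns a label describing the lineage of the new checkpoint.
    """
    return f"{checkpoint} -> {stage.name}"


def hierarchical_fine_tune(base: str, stages: list[Stage]) -> str:
    """Carry one checkpoint through the stages, from general to specific."""
    checkpoint = base
    for stage in stages:
        checkpoint = fine_tune(checkpoint, stage)
    return checkpoint


# Assumed schedule: start from an MSA-English model, adapt to pooled
# dialectal data, then to a narrower genre. The labels are illustrative only.
STAGES = [
    Stage("all-dialects", ("nile", "gulf", "levantine", "maghrebi")),
    Stage("target-genre", ("social-media",)),
]

final = hierarchical_fine_tune("msa-en-base", STAGES)
print(final)
```

Because every stage starts from the output of the previous one, the result is a single model rather than one model per dialect, which matches the "single system" property described above.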
Nowadays, the world is interconnected and the need for information accessibility is more apparent than ever. The volume of information produced and disseminated via social media outlets is much larger than that of traditional information venues such as newspapers and television. Due to the informal nature of social media conversations, dialectal Arabic remains the most common form of communication.
Automated translation enables many other technologies and facilitates tasks that are related to information extraction, analysis and understanding. In addition, it eases communication by bridging the language barrier. It can also directly impact the economy, healthcare system, political sphere, and more. For example, the FIFA World Cup 2022 will be attracting people from all parts of the world. A translation tool that can effectively translate between dialectal Arabic and English can be regarded as an essential tool of communication.
While a lot of work has been done to support Modern Standard Arabic-to-English machine translation (MT), little effort has been exerted to translate Arabic dialects into English. The systems designed for MSA cannot translate dialects well, and it is essential that we enrich our systems to explicitly handle the translation of Arabic dialects.
Shaheen provides a one-size-fits-all solution that works for a large number of Arabic dialects and genres, an aspect seldom seen in competing translation platforms. In an extensive human evaluation of four dialects (Nile, Gulf, Levantine and Maghrebi), Shaheen outperformed popular online systems on the Nile, Gulf and Levantine dialects. Work is still in progress on the Maghrebi dialect, which requires large-scale pooling of dialectal data.
Shaheen can be deployed in the backend of multi-genre, multi-dialectal speech translation systems such as those that exist at Al Jazeera, where interviewees may speak different dialects or where the interview covers specialized domains such as education or medicine.
Other potential uses include translating Arabic content on social media into English, which helps disseminate information by narrowing the language gap.
It is true that the competition is fierce given that technology giants such as Google have an enormous amount of data and computation power. Shaheen, on the other hand, specializes in handling the linguistic intricacies of Arabic specifically and is now adaptable to dialects, and that is where we have our edge over other translation providers. We want to ensure the best performance, so we have been proactively creating data for a large variety of Arabic dialects and are continuously exploring newly emerging methods that can be integrated into Shaheen to boost translation quality.
The Shaheen team comprises Dr. Hassan Sajjad, senior scientist; Dr. Nadir Durrani, scientist; Dr. Ahmed Abdelali and Hamdy Mubarak, senior software engineers; and Fahim Dalvi, software engineer, all at QCRI. For more information on the platform, please visit https://mt.qcri.org/demos/dialect/