OPTDIAC: An Optimal Diacritization Scheme for Arabic Orthographic Representation

CMU-Q Point of Contact

Different languages use different scripts to represent sounds orthographically (that is, in spelling) in their writing systems. Typical Arabic orthography is mostly consonantal and is underspecified for short vowels and other phonemic markers, which are written as diacritics. This underspecification creates significant readability issues for both learners (L2) and native speakers (L1) of Arabic. It also poses significant challenges for automatic natural language processing (NLP) tools that handle Arabic, since it renders the text extremely ambiguous. Recently, researchers have devised automatic diacritization tools that render Arabic text fully diacritized. However, full diacritization has been shown to degrade performance in NLP tools. Psycholinguists and educators have also noted that full diacritization slows reading even among advanced, skilled Arabic speakers. We hypothesize that there is an intermediate level of diacritization that benefits both NLP and human readability. In this proposal, we explore the space of principled partial diacritizations in the context of both NLP and human readability, aiming to discover the optimal diacritization level. We will investigate the problem using advanced machine learning techniques for NLP applications. Simultaneously, we will collaborate with colleagues in education and language learning to measure the impact of our devised diacritization schemes on both L1 and L2 learners of Arabic.
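To make the notion of a partial diacritization concrete, the sketch below derives one from a fully diacritized word by keeping only a chosen subset of diacritic marks. It is purely illustrative and is not one of the schemes devised in this project: the function names `partially_diacritize` and `diacritic_count`, the choice of the eight Unicode marks U+064B to U+0652, and the "keep only shadda" rule are all assumptions made for the example.

```python
# Illustrative sketch only; the keep-set rule is an assumption, not the project's scheme.

# Common Arabic diacritic code points (Unicode block "Arabic").
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"  # nunation marks
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"           # short vowels
SHADDA, SUKUN = "\u0651", "\u0652"                           # gemination, no vowel

ALL_DIACRITICS = frozenset(
    {FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SHADDA, SUKUN}
)


def partially_diacritize(text, keep):
    """Remove every diacritic in ALL_DIACRITICS that is not in `keep`."""
    return "".join(
        ch for ch in text if ch not in ALL_DIACRITICS or ch in keep
    )


def diacritic_count(text):
    """Count how many of the listed diacritic marks appear in `text`."""
    return sum(ch in ALL_DIACRITICS for ch in text)


# Fully diacritized "kattaba": kaf+fatha, ta+shadda+fatha, ba+fatha.
full = "\u0643\u064E\u062A\u0651\u064E\u0628\u064E"

shadda_only = partially_diacritize(full, keep={SHADDA})  # intermediate level
bare = partially_diacritize(full, keep=set())            # typical undiacritized text

for label, word in [("full", full), ("shadda only", shadda_only), ("bare", bare)]:
    print(label, word, diacritic_count(word))
```

Varying the `keep` set gives different points between bare and fully diacritized text. A principled scheme would presumably choose marks based on linguistic context rather than on mark type alone.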

Project

NPRP 6 - 1020 - 1 - 199

Year

2014

Status

Closed