Learning from Comparable Corpora for Improved English-Arabic Statistical Machine Translation

Kemal Oflazer

CMU-Q Point of Contact

Parallel corpora are an essential resource for training state-of-the-art data-driven statistical machine translation systems. Unfortunately, large parallel corpora are available only for a handful of language pairs and for a few genres, such as political documents. We therefore propose to use the much richer body of comparable corpora to deal with this data sparseness problem. Comparable corpora are documents that contain the same or similar information in different languages. Examples are the multilingual newswire texts produced by news organizations such as Agence France Presse and BBC. These texts often describe the same event in multiple languages in varying degrees of detail. The proposed project addresses the question of how comparable corpora can be leveraged to improve translation systems. We will extend existing techniques and develop new ones to collect comparable corpora from a variety of data streams available on the Web, including news wires but also sources like the online encyclopedia Wikipedia. Cross-lingual information retrieval techniques and classifiers will be used to identify documents containing similar content. We will then develop new word and phrase alignment techniques to extract lexicons and phrase tables from these comparable documents. Our new alignment approaches will be validated by improving existing translation systems, especially for English-Arabic translation. We will also test the effect on low-resource languages.
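As a rough illustration of the document-matching step described above, the sketch below projects an English document into the target vocabulary through a small seed lexicon and ranks candidate target documents by cosine similarity. It is a toy example and not the project's actual method: the lexicon entries (romanized placeholder tokens), the function names `project`, `cosine`, and `best_match`, and the `threshold` value are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical seed lexicon: English word -> set of target-language tokens.
# The romanized tokens are toy placeholders, not a real dictionary.
SEED_LEXICON = {
    "president": {"raees"},
    "election": {"intikhabat"},
    "city": {"madina"},
    "match": {"mubarat"},
}


def project(tokens, lexicon):
    """Map source tokens into the target vocabulary; unknown words are dropped."""
    projected = Counter()
    for tok in tokens:
        for translation in lexicon.get(tok, ()):
            projected[translation] += 1
    return projected


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match(src_doc, tgt_docs, lexicon, threshold=0.3):
    """Return (index, score) of the most similar target document,
    or (None, score) if the best score is below the assumed threshold."""
    if not tgt_docs:
        return None, 0.0
    src_vec = project(src_doc.lower().split(), lexicon)
    scored = [(cosine(src_vec, Counter(t.lower().split())), i)
              for i, t in enumerate(tgt_docs)]
    score, idx = max(scored)
    return (idx, score) if score >= threshold else (None, score)


english = "The president won the election"
targets = ["mubarat fi al madina", "al raees faza fi al intikhabat"]
index, score = best_match(english, targets, SEED_LEXICON)
print(index, round(score, 2))
```

Here the second target document is selected because it shares the projected tokens for "president" and "election". A realistic pipeline would likely use a much larger lexicon, term weighting such as TF-IDF, and a trained classifier to filter candidate pairs, as the abstract suggests.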

Project

NPRP 09 - 1140 - 1 - 177

Year

2010

Status

Closed

Team

Stephan Vogel

Carnegie Mellon University

Silvia Pessoa

Carnegie Mellon University - Qatar

Alon Lavie

Carnegie Mellon University