Carnegie Mellon professor compiles decades of research into Turkish Natural Language Processing
Many people may not be familiar with the term “Natural Language Processing,” but English speakers experience it every time they ask Siri or Alexa a question or use Google Translate to render text from one language into another. There have been many noteworthy advances in natural language processing in English, and often these techniques can be applied to other languages. Some languages, however, pose significant computational challenges.
Kemal Oflazer completed his bachelor’s and master’s degrees at Middle East Technical University in Ankara. He then pursued his PhD in computer science at Carnegie Mellon University, studying and then working in the United States for a full decade.
When he returned to Turkey to teach at Bilkent University, he found that his time away had given him a new perspective: “You rarely get the chance to see what your first language looks like from an external point of view. I wrote a Turkish document, and I realized, there is no Turkish spell-checker.”
It was the early 1990s, and this observation raised questions that would guide Oflazer’s research interests for the next three decades.
Turkish is an agglutinative language, which means that suffixes attach to a root word like beads on a string. A single complex Turkish word with several suffixes can express the same meaning as an entire sentence in English.
“In English, the computer can check spelling against a finite list of words,” he explains. “In Turkish, a given verb root can give rise to about 1.5 million different word forms. It is rather amazing.”
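This combinatorial growth is easy to see with a toy sketch. The snippet below is purely illustrative, not a real morphological analyzer: it concatenates a few suffixes onto the root "ev" ("house"), and real Turkish morphology additionally involves vowel harmony and consonant changes that a production system must model.

```python
# Toy illustration of agglutinative word formation in Turkish.
# NOT a real analyzer: actual Turkish suffixation obeys vowel harmony
# and other phonological rules; "ev" happens to take these forms as-is.
from itertools import product

root = "ev"                      # "house"
plural = ["", "ler"]             # optional plural marker
possessive = ["", "im", "iniz"]  # "my", "your (plural)"
case = ["", "de", "den"]         # locative "in/at", ablative "from"

# Every combination of the optional suffix slots yields a distinct word form.
forms = {root + p + poss + c for p, poss, c in product(plural, possessive, case)}

print(len(forms))                   # 18 forms from one root and three small slots
print("evlerinizden" in forms)      # True: ev+ler+iniz+den, "from your houses"
```

With just three small suffix slots, one root already yields 18 surface forms; real Turkish has many more slots and suffixes, which is how a single verb root can reach on the order of 1.5 million forms, making a finite word list impractical for spell-checking.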
Agglutination brings other interesting properties, such as free word order: the subject, object, and verb of a Turkish sentence can be arranged in any order. In English, by comparison, word order is largely fixed.
In the early 1990s, no work was being done in Turkish natural language processing. With funding from the NATO Science for Stability Program, the European Union, and the Turkish Scientific and Technological Research Council, Oflazer and his graduate students carried out research and development on Turkish NLP.
In 2012, Oflazer was invited to deliver a talk at the Language Resources and Evaluation Conference (LREC) in Istanbul on the challenges of Turkish NLP. After the lecture, he was approached by Springer Verlag with a proposal to compile a book on the state of the art in Turkish NLP. Along with co-editor Murat Saraçlar of Boğaziçi University in Istanbul, Oflazer spent more than four years working with researchers—many of them their former graduate students—to bring together 25 years of work. The book was published in 2018 in both hard-copy and online versions, and so far more than 2,000 copies of its various chapters have been downloaded.
“We are very proud of this book,” says Oflazer. “This research has the potential to touch a vast number of people.” While Turkish is spoken by more than 70 million people in Turkey, the Middle East, and Europe, the wider family of Turkic languages is spoken natively by approximately 165 million people worldwide.
Oflazer stresses that this book is not the endpoint of research into Turkish NLP: “We are not even 10% into the job. There are many more techniques and resources that need to be developed before the state of the art in Turkish gets anywhere close to languages like English and Chinese.”
Kemal Oflazer continues his research in natural language processing with projects supported by the Qatar National Research Fund: “Multi-Arabic dialect applications and resources” and “Testing English reading comprehension through deep text analysis and question generation.” Oflazer has also completed five NPRP projects since 2009.
Information about the Turkish Natural Language Processing book is available at https://www.springer.com/gp/book/9783319901633