IXA group


News from OPENMT-2 project


Three pieces of news related to the OPENMT-2 project (2010-2012):

Gorka Labaka’s PhD thesis

In his PhD thesis (“EUSMT: Incorporating Linguistic Information to Statistical Machine Translation for Basque”), Labaka studied how Statistical Machine Translation (SMT) can handle translation from Spanish into Basque, a morphologically rich and less-resourced language. He found two ways to improve translation quality by using linguistic tools:

  • Morphological tools allowed him to translate at the level of word segments rather than whole words, thus avoiding sparseness problems in the corpora.
  • In addition, syntactic tools made it possible to rearrange the Spanish word segments into their corresponding Basque order. This reordering helped the SMT decoder to find correct translations.

Recent research tends to focus on statistical systems and to ignore rule-based machine translation (RBMT). However, according to Gorka Labaka’s evaluation, the RBMT system and the state-of-the-art basic SMT system perform at a similar level of quality when translating into Basque. His improved SMT system, based on segmentation and reordering, outperforms both the RBMT system and the basic SMT system by more than 10% on the HTER metric. In addition, he calculated that a hypothetical oracle system would yield a result a further 10% better; this oracle would select the improved SMT output for 55% of the sentences, the RBMT output for another 41%, and the example-based (EBMT) output for the remaining 4%. He therefore concluded that, at least for morphologically rich languages with few resources, and hence few parallel corpora, the SMT approach is limited and the RBMT approach should not be ignored. We are currently experimenting with hybrid architectures that combine the Matxin (rule-based) and EUSMT (statistical) translation engines.
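The oracle idea can be illustrated with a minimal sketch. It assumes that a per-sentence error score (lower is better, in the spirit of HTER) is already available for each system; the function names, system labels and numbers below are invented for illustration and are not taken from the thesis.

```python
from collections import Counter

def oracle_select(scores):
    """For each sentence, pick the system with the lowest error score.

    `scores` is a list with one dict per sentence, mapping a system
    name to its error score for that sentence (lower is better).
    """
    return [min(sentence, key=sentence.get) for sentence in scores]

def oracle_summary(scores):
    """Return the oracle's mean sentence score and each system's share."""
    choices = oracle_select(scores)
    mean_score = sum(s[c] for s, c in zip(scores, choices)) / len(scores)
    share = {name: n / len(choices) for name, n in Counter(choices).items()}
    return mean_score, share

# Made-up scores for four sentences and three hypothetical systems.
example_scores = [
    {"smt": 0.2, "rbmt": 0.5, "ebmt": 0.6},
    {"smt": 0.7, "rbmt": 0.3, "ebmt": 0.9},
    {"smt": 0.1, "rbmt": 0.4, "ebmt": 0.8},
    {"smt": 0.6, "rbmt": 0.5, "ebmt": 0.4},
]

mean_score, share = oracle_summary(example_scores)
print(oracle_select(example_scores))
print(mean_score, share)
```

An oracle of this kind is only an upper bound: it needs the error scores, which in practice are known only after human post-editing, so a real hybrid system has to predict which engine to trust for each sentence.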


Visiting researcher Lluís Màrquez (NLPRG, Technical University of Catalonia, UPC)

With the aim of collaborating in this line of research, Lluís Màrquez, the principal investigator of the UPC team within the OPENMT-2 project, will be in Donostia visiting the Ixa group until the summer. He is an expert in integrating Machine Learning techniques into Language Technology. Gorka Labaka’s first experiments on combining MT engines confirmed that there is room for improvement. Now we want to find the most suitable ways to achieve it.



Collaboration on Post-Editing with Basque Wikipedia (eu.wikipedia)

Within this project, a set of 60 long articles from the Spanish Wikipedia (adding up to more than 100,000 words) has been selected and translated into Basque using Matxin-Opentrad, our open-source rule-based machine translation system. Soon, in spring 2011, a group of Basque Wikipedia users will review them using a special interface that we have built by adapting OmegaT. They will correct the errors they find, a process known as post-editing, and the changes they make will be logged. The corrected articles will be added to the Basque Wikipedia. In addition, the resulting post-editing logs will be used to improve the machine translation process, either by manually improving the different modules of the MT system or by implementing an automated statistical post-editing step that is expected to improve translation accuracy. (paper in Wikimania 2010)
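As a rough illustration of what a post-editing log makes measurable, the sketch below computes a word-level edit rate between a machine-translated sentence and its post-edited version. It is a simplification: it counts only insertions, deletions and substitutions, whereas HTER also counts block shifts. The function names and the example sentences are assumptions made for illustration, not part of the project's tooling.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance between two sentences."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the distance between the hypothesis words seen so far
    # (before the current one) and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, 1):
        curr = [i]
        for j, ref_word in enumerate(ref, 1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]

def edit_rate(mt_output, post_edited):
    """Edits needed to turn the MT output into the post-edited sentence,
    divided by the length of the post-edited sentence in words."""
    ref_len = len(post_edited.split())
    if ref_len == 0:
        raise ValueError("post-edited sentence is empty")
    return word_edit_distance(mt_output, post_edited) / ref_len

# Invented example: the post-editor inserted one missing word.
print(edit_rate("the cat sat on mat", "the cat sat on the mat"))
```

Sentences with a high edit rate point to the places where the translation modules fail most often, and pairs of raw and corrected sentences are the kind of data a statistical post-editing step could be trained on.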
