IXA group


HAP/LAP master theses (2016-09-27)

Master HAP/LAP
EMLCT master
Master thesis defences

Date: September 27th
Place: Ada Lovelace room


Universal Dependencies for Buryat.
Author: Elena Badmaeva
Supervisors: Koldo Gojenola, Gosse Bouma

LexSynSimpleText, a lexical and syntactic simplifier: first steps.
Author: Maria Eguimendia
Supervisors: Arantza Diaz de Ilarraza and Gosse Bouma

Data Sparsity in Highly Inflected Languages: The Case of Morphosyntactic Tagging in Polish.
Author: Michael Ustaszewski
Supervisors: Rodrigo Agerri and German Rigau

Multilingual Central Repository version 3.0: improving a very large lexical knowledge base.
Author: Daniel Parera Perez
Supervisor: German Rigau Claramunt

Book: Microparameters in the Grammar of Basque

Edited by Beatriz Fernández (UPV/EHU) and Jon Ortiz de Urbina (Deusto University), this book is an endeavor to present and analyze some standard topics in the grammar of Basque from a micro-comparative perspective. From case and agreement to word order and the left periphery, and including an incursion into determiners, the book combines fine-grained theoretical analyses with empirically detailed descriptions. Working from a micro-parametric perspective, the contributions to the volume address in depth some of the exuberant variation attested in the different dialects and subdialects of Basque. At the same time, although the contributions focus mainly on Basque data, cross-linguistic evidence is also presented and discussed.
After all, the goal pursued in this book is to attempt to explain variation in Basque as a particular instantiation of variation in human language at large. The volume presents and analyzes a wide range of empirical phenomena, many typologically marked among European languages, and will therefore be a welcome resource to linguists looking for detailed description and/or theoretical discussion.

Nora Aranberri: Machine Translation for Translators (Innsbruck, 2016-07-20)

Our colleague Nora Aranberri was the lecturer for the workshop “Machine Translation for Translators: Taking Advantage of the New Technology” at SummerTrans 2016.

The International Translation Summer School SummerTrans was founded in Innsbruck in 2004. From 11 to 20 July 2016 the University of Innsbruck hosted the 7th International Translation Summer School, “SummerTrans VII: Quality and Competence in Translation”. Addressing trainee translators, professional translators and translation researchers alike, its varied programme featured cutting-edge courses and workshops aiming to advance participants’ theoretical knowledge of and practical skills in translation and interpreting, including state-of-the-art translation technology and human-machine interaction in translation.
SummerTrans VII welcomed more than 60 participants from 16 countries, ranging from Tunisia across half of Europe to India and China.
Michael Ustaszewski, one of our students in the Erasmus Mundus LCT master (2014-2016), is now a lecturer at the University of Innsbruck and one of the organizers of SummerTrans 2016 🙂
Michael told us that the workshop participants are now familiar with state-of-the-art translation technology and human-machine interaction in translation.


Nice results in Codefestdss2016 projects

This is a list of the aims of the projects in the CODEFEST 2016 summer school and the results achieved by each of them. Further information can be found on the Codefest_dss2016 website.


Quiz Bowl: Multilingual question-answering for trivia games with Wikipedia


The QUIZ Bowl team was the winner in our codefest competition. Congratulations!

Aims: The question-answering trivia quiz project is in progress. To start the first game prototype, the team is using some of the questions translated into Basque on Monday. This prototype matches the Basque Wikipedia articles with the questions or hints from the quiz, so that the answer to the hint pops out as an article.

Results: We had the chance to play a quiz based on Wikipedia trivia: Human vs. Computer. This time the humans won, but only by a very small margin.

The code is available here: github.com/dss2016eu/codefest/tree/master/quizbowl
References to all the code generated will also be posted there!
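The hint-to-article matching described in the aims can be illustrated with a minimal sketch. This is not the team's code (see the repository above for that); it assumes a simple TF-IDF word-overlap score, and the function names, the toy English articles and the scoring formula are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(articles):
    """Build per-article term counts and smoothed IDF weights.

    articles: dict mapping an article title to its text (toy data here).
    """
    counts = {title: Counter(tokenize(text)) for title, text in articles.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_docs = len(articles)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return counts, idf


def best_article(hint, counts, idf):
    """Return the title whose text best matches the hint.

    Score = sum over hint words of (relative term frequency * IDF).
    This scoring is an assumption for the sketch, not the team's method.
    """
    hint_terms = set(tokenize(hint))

    def score(title):
        term_counts = counts[title]
        length = sum(term_counts.values()) or 1
        return sum(term_counts[t] / length * idf.get(t, 0.0) for t in hint_terms)

    return max(counts, key=score)


# Toy stand-ins for Wikipedia articles (hypothetical, in English for readability).
articles = {
    "Donostia": "Donostia is a coastal city in Gipuzkoa on the Bay of Biscay",
    "Bilbo": "Bilbo is the largest city in Bizkaia and hosts the Guggenheim museum",
    "Gasteiz": "Gasteiz is the capital of Araba and seat of the Basque government",
}
counts, idf = build_index(articles)
print(best_article("Which city hosts the Guggenheim museum?", counts, idf))
```

The title of the best-scoring article is returned as the answer, which mirrors the idea that "the answer to the hint pops out as an article". A real system would need Basque tokenization and a much larger index.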

Create a morphological analyzer for your minority language

Aims: To help develop the morphological analyzer for the Hungarian language, Ixa group members Iñaki Alegria and Montse Maritxalar have joined the team to assist with the programming tasks. After creating a list of the lexical roots of Hungarian, they have made a selection based on verbs and adjectives, among other criteria. Next, they want to encode that selection in lexc format.

Results: They have explained several projects they have been developing over these days, all of them related to machine translation tools: for Hungarian, Buryat (a variety of Mongolian), Rif Berber (a language spoken mostly in Morocco), and Uyghur (a Turkic language spoken in Western China), among others.

NLP for Literature Analysis and Creation

Aims: Members of the group have chosen the name Story buffet for their tools for the analysis and creation of literary texts. The team is made up of linguists, programmers and other experts who consider themselves to be “hybrids” of the two.

On the second day, we had a break so that the people from the Ixa group in charge of this project could explain their work to us. Manex Agirrezabal is an expert on metrical analysis in poetry; combined with his programming knowledge, he sees this as a great chance to semantically alter short stories. Itziar Gonzalez-Dios originally studied linguistics, but she has joined the world of programming in the last few years; she is interested in the analysis of the complexity and synthesis of texts.

Results: They presented their webpage (Story buffet) for literature creation and analysis in quite a humorous way.


Aims: The team has continued developing the Behagunea project, making use of their different abilities. Victor (programmer) has visualized the results of the Ixa-pipes, and he is working on designing an attractive interface. Dani (IT expert) is trying to translate Ixa-pipes resources into Catalan. Sabrina (linguist), with the help of Iñaki (programmer), is starting an app based on tweets to study what countries think about each other. Finally, due to some problems, Kassandra has decided to put aside one of the projects: the one that aims to include social media in the website DSS2016EU Iritzien Behagunea (Opinion Observatory). Instead, she has chosen to examine the tweets about the DonostiCup football competition.

Results: They have accomplished their goals. Apart from adding new languages (Catalan, Italian) to the Behagunea project, they have managed to combine social media data with geolocation.

Enriching ZureTTS platform with new languages

Aims: Several aspects of the ZureTTS project have been addressed. On the one hand, the members of Aholab have focused on developing the platform to include the dialect from Iparralde (the northern part of the Basque Country), and they have started both writing the questions for the voice donors and designing the new interface. As for the Android app, they have spent the day identifying errors and preparing everything required to install the new platform. Finally, the “Ireland team” has translated the webpage interface into Gaelic and contacted some Irish experts within their university to get hold of a good, reliable database.

Results: By the end of the week, apart from adding the Lapurtera (a Basque dialect) version to the web, they had made great progress on Gaelic, thanks especially to the help of the Irish people.

SRL and Dockers

Aims: Members of the SRL (semantic role labeling) project have been structuring a database to add and handle information later on. As Suhail Sarwan says, developments in SRL bring a direct benefit to the field of semantics, particularly if we want to promote and improve the e-learning model. Aided by Rodrigo Agerri, among others, they have worked on SRL, and Eleanor Dutton intends to develop a tool for linguistic analysis and to apply it to Moroccan Arabic.

Results: They showed us a tool they have developed that uses sequence tagging methods to identify the participants of the events described by the predicates within a sentence.
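To make the sequence tagging idea concrete, here is a minimal sketch of the final step of such a tool: turning per-token tags into labelled participants. It assumes the tagger emits BIO tags with PropBank-style labels (ARG0, ARG1, V); the tag set, the function name and the example sentence are hypothetical and are not taken from the team's tool.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (label, text) spans.

    "B-X" opens a span labelled X, "I-X" continues an open span with the
    same label, and anything else (including "O" or a mismatched "I-")
    closes the current span and is itself skipped.
    """
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Hypothetical tagger output for one sentence and the predicate "translated".
tokens = ["The", "team", "translated", "the", "questions", "yesterday"]
tags = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "O"]
for role, text in bio_to_spans(tokens, tags):
    print(role, "->", text)
```

In a full system the tags would come from a trained sequence model run once per predicate; only the decoding step is shown here.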

Machine Translation for minority languages

Aims: Each member of the group is focusing on the pair of languages in which he or she is fluent. Using Apertium, for example, they have started working on a translator for the French-Occitan language pair, so that they can later develop a linguistic analyzer for Occitan. They have also been working on a Tetum-Portuguese translator (the two official languages spoken on the island of Timor) with the same program. Others have started preparing lexical transfers (they will try to do the same with dependency transfers) for the English-Spanish pair using Matxin. Matxin also allows the creation of an English-Welsh translator, as well as a translator for English-Basque (one such translator already exists, but some errors must be identified and corrected). The latter will be applied in the field of medicine.

Results: They have explained several projects they have been developing over these days, all of them related to machine translation tools: for Hungarian, Buryat (a variety of Mongolian), Rif Berber (a language spoken mostly in Morocco), and Uyghur (a Turkic language spoken in Western China), among others.

Erasmus Mundus LCT master. Annual Meeting 2016 in Donostia (June 09 - 10)

Seminar: Big Data and NLP at Trivago (Min Fang, 2016-06-08)

Talk: Big Data and NLP at Trivago
Speaker: Min Fang
    2013-2015: Master Erasmus Mundus Language and Communication Technologies, summa cum laude
    2015-…: (Trivago, hotel metasearch)
When: Wed, 8 June, 10pm – 11pm
Where: Room 3.2
I’m interested in getting insights from data by applying natural language processing, machine learning and statistical analyses. Ideally, those insights can then be turned into useful applications or facilitate higher level decisions.

Together with our software engineers I take care of our NLP capabilities: We work on improving and maintaining a highly flexible and scalable pipeline that is geared towards aspect-based sentiment analysis (and more in the future). Extracting knowledge from a large number of natural language texts allows us to understand our domain better and enhance the experience for our users.

Our technology stack includes:
– Python and Java
– R for analysis
– AWS for infrastructure

Ixa Group is one of the 15 institutional members of EAMT

Ixa Group has been an institutional member of the European Association for Machine Translation (EAMT) since 2012. The EAMT is the organization that serves the growing community of people interested in MT and translation tools, including users, developers, and researchers of this increasingly viable technology. We have now published a new page about the IXA Group on EAMT’s website.

The EAMT is one of three regional associations of the International Association for Machine Translation (IAMT). Its sister organizations are the Association for Machine Translation in the Americas (AMTA) and the Asia-Pacific Association for Machine Translation (AAMT).

Among other activities, the EAMT organizes the biennial MT Summit and the annual EAMT conferences, maintains the MT-List mailing list, and compiles listings of companies and products which are distributed free or at nominal cost to its members (Compendium of Translation Software).

The current 15 corporate and institutional members are the following:

Video: HAP-LAP master thesis (Mikel Artetxe)

Last week Mikel Artetxe presented his thesis for the HAP-LAP master.

The presentation can be seen here:



Distributional Semantics and Machine Learning for Statistical Machine Translation
Author: Mikel Artetxe Zurutuza
Supervisors: Eneko Agirre and Gorka Labaka

We have 14 papers at LREC

This week we are presenting 14 papers/posters at LREC (Language Resources and Evaluation Conference). 🙂

Three of them were written in collaboration with Elhuyar.


Links to download our papers:

  1. A Comparison of Domain-based Word Polarity Estimation using different Word Embeddings
  2. A Comparison of Named-Entity Disambiguation and Word Sense Disambiguation
  3. A Multilingual Predicate Matrix
  4. Addressing the MFS Bias in WSD systems
  5. Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation
  6. Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene
  7. Evaluating Translation Quality and CLIR Performance of Query Sessions
  8. Interoperability of Annotation Schemes: Using the Pepper Framework to Display AWA Documents in the ANNIS Interface
  9. QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages
  10. The Event and Implied Situation Ontology (ESO): Application and Evaluation
  11. Tools and Guidelines for Principled Machine Translation Development
  12. TweetMT: A Parallel Microblog Corpus
  13. Two Architectures for Parallel Processing of Huge Amounts of Text
  14. Word Sense-Aware Machine Translation: Including Senses as Contextual Features for Improved Translation Models

HAP/LAP Master theses defence

HAP/LAP master

Master theses defence

Date: May 17th
Place: Ada Lovelace room

Adverse Drug Reaction event extraction on Electronic Health Records written in Spanish.
Author: Sara Santiso González
Supervisors: Alicia Pérez and Arantza Casillas
Committee: Eva Navas, Montse Maritxalar, Arantza Casillas

Distributional Semantics and Machine Learning for Statistical Machine Translation
Author: Mikel Artetxe Zurutuza
Supervisors: Eneko Agirre and Gorka Labaka
Committee: Eva Navas, Montse Maritxalar, Gorka Labaka