The Ixa research group wins an award in the COVID-19 artificial intelligence competition promoted by the US government

The CORD-19 competition (COVID-19 Open Research Dataset Challenge) has been organized by several institutions, including the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University, Microsoft Research, the National Institutes of Health and the White House Office of Science and Technology Policy. The organizers have made more than 50,000 scientific articles on COVID-19, SARS-CoV-2 and other coronaviruses available to the global research community. At the same time, they have issued a call to action to artificial intelligence researchers to apply recent advances in natural language processing, in order to help scientists fighting COVID-19 find the information they need in the scientific literature.

In the first phase of the competition there were 10 awards, and the system developed in the Ixa group of the HiTZ centre received one of them. Researchers Arantxa Otegi and Jon Ander Campos and professors Eneko Agirre and Aitor Soroa, all from the University of the Basque Country, participated in the development of the system. The system finds answers to high-priority questions posed by experts on COVID-19 and the SARS-CoV-2 virus by analyzing the aforementioned scientific articles. It is therefore useful for answering questions on topics such as the history of coronaviruses, the transmission and diagnosis of the virus, prevention measures for contact between humans and animals, and the lessons of previous epidemiological studies. The results of the system were evaluated by a group of experts from the NIH of the United States, and it was selected as the system that best answered a set of questions on the topic “What do we know about diagnostics and surveillance?”. The answers given by the system can be seen here.

See some examples here


Five papers accepted at the 58th Annual Meeting of the Association for Computational Linguistics

The members of the Ixa group and their collaborators will present five papers at the 58th Annual Meeting of the Association for Computational Linguistics (ACL). ACL is one of the most important conferences on Natural Language Processing. It was to be held in July in Seattle, but this year it will take place online.

The accepted papers are the following:

Selecting Backtranslated Data from Multiple Sources for improved Neural Machine Translation (Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way): We analyse the impact that data backtranslated with diverse systems has on eu-es and de-en clinical domain NMT, and employ data selection (DS) to optimise the synthetic corpus. We further rescore the output of DS by considering the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora.

On the Cross-lingual Transferability of Monolingual Representations (Mikel Artetxe, Sebastian Ruder, Dani Yogatama): We challenge common beliefs of why multilingual BERT works by showing that a monolingual BERT model can also be transferred to new languages at the lexical level.

A Call for More Rigor in Unsupervised Cross-lingual Learning (Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre): In this position paper, we review motivations, definition, approaches and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them.

DoQA – Accessing Domain-Specific FAQs via Conversational QA (Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre): We present DoQA, a dataset for accessing FAQs via conversational Question Answering, showing that it is possible to build high quality conversational QA systems for accessing FAQs without in-domain training data.

A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation (Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak): We introduce a novel methodology to efficiently construct a corpus for question answering over structured data, with threefold manual annotation speed gains compared to previous schemes such as Spider. Our method also produces fine-grained alignment of query tokens to parsing operations. We train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.

Congratulations to all the authors!

Eneko Agirre wins the Google prize for the third consecutive year

Eneko Agirre won a Google prize again last March. He is one of the few researchers who have obtained the Google Faculty Research Award on three occasions. The $62,000 prize will fund the project ‘Conversational Question Answering agents that learn after deployment’ to develop user dialogue systems, chatbots and artificial intelligence.

Eneko Agirre, a member of the Ixa Group and professor at the Faculty of Computer Science of the UPV/EHU, is the director of the newly created HiTZ Research Center. The other six colleagues in the project are professors Aitor Soroa and Gorka Azkune, researcher Arantxa Otegi, doctoral student Jon Ander Campos, Aitor Agirre, a student of the Master in Language Analysis and Processing, and Eduardo Vallejo, a student of the Degree in Computer Science.

Although the project focuses mainly on English dialogues (questions about cooking and food), the team is also working with Basque dialogues. For this purpose, last year the Ixa Group launched a campaign to recruit volunteers for the collection of interviews in Basque. The campaign was successful and many personal interviews were collected in Basque.


“Itzulbide” project: a tool for normalizing the use of Basque in clinical histories

The use of machine translation tools between languages is common and widespread in today’s society. Our Ixa group at the University of the Basque Country (UPV/EHU) has extensive experience in Natural Language Processing for Basque. In this context, in 2019 the UPV/EHU and Osakidetza (the official health organization of the Basque Country) saw the opportunity to develop a tool adapted to the clinical field by using the new technological conditions (the successful paradigm of neural networks in machine translation) and by taking advantage of the new professional conditions (an increase in bilingual staff who want to work in Basque and a significant number of new young doctors trained in Basque at the university).

Neither translation nor the development of automatic translators is the final objective of the Basque Country’s official plans; they are potentially useful tools to reach it. The objective of those plans, as well as that of Osakidetza, is to increase the presence and use of the Basque language in everyday clinical histories, and it remains to be demonstrated whether this tool will contribute to that goal. In fact, Itzulbide has been launched as a research project based on the hypothesis that if a general-domain MT system is taught to translate in the clinical field, in the future we will have a fast and reliable translation tool. Within a few years it will be seen whether this hypothesis holds.

The project began in June 2019, and its promoters (the Ixa Group of the UPV/EHU and the Osakidetza Itzulbide working group) have begun to present the project openly, center by center, to address the opinions and doubts of the professionals and to collect their contributions. At the time of writing, 68 professionals from different specialties and categories were collaborating in the project, creating bilingual clinical texts. Encouragement and thanks to all the participants!

The “Itzulbide” Automatic Translator project does not prevent or condition the other complementary language objectives and normalization measures currently included in Osakidetza’s Basque Language Plan.

If its usefulness is demonstrated, the tool will be integrated into the information system of Osakidetza. In addition, it could be extended to the entire healthcare community (professionals of public and private companies, pharmacists, university and non-university health students and teachers, residents, professional associations) and to the whole geographical scope of the Basque language. It can also help professionals who are learning Basque. In summary, the possible use of Itzulbide could go beyond clinical histories.

A project of this type can raise doubts, but we will test and measure whether this tool brings us closer to the objectives of the Basque Country’s Language Plan. Give Itzulbide a chance.

EHU-Ixa and Itzulbide-Osakidetza


Visitor: Andrea Horbach, Automatic scoring

Andrea Horbach is visiting San Sebastian within the enetCollect network on crowdsourcing for language learning, as part of an ongoing collaboration with Itziar Aldabe, Oier Lopez de Lacalle and Montse Maritxalar on evaluating manually as well as automatically generated reading comprehension questions.

Andrea Horbach is a researcher at the Language Technology Lab headed by Prof. Torsten Zesch at the University of Duisburg-Essen, Germany. Last year, she defended her PhD thesis in computational linguistics, titled “Analyzing Short-Answer Questions and their Automatic Scoring: Studies on Semantic Relations in Reading Comprehension and the Reduction of Human Annotation Effort”, at Saarland University. Her main research interests include educational NLP, such as automatic scoring and exercise generation, as well as the processing of non-standard language.

Last Tuesday (2019-09-17) she gave us a talk entitled “Automated and Assisted Content Scoring in Mono- and Cross-Lingual Educational Settings”:

Automatic content scoring of free-text answers aims to reduce the scoring workload of teachers and to provide consistency in scoring. In high-stakes tests, fully automatic scoring is often not an option. Nevertheless, teachers can benefit from assisted scoring, where they are supported by NLP but are still in control of the scoring process. This talk presents ongoing work from two research projects related to educational scoring. First, we investigate content scoring in a cross-lingual setup, where a model trained on data in one language is applied to new data in a different language in order to foster educational equality as well as to overcome data sparseness. We present our cross-lingual data collection, as well as machine learning experiments using machine translation to bridge the language gap.

In the second part of the talk we present work on assisted scoring of listening comprehension data from language proficiency testing. We show assisted scoring studies where teachers are supported in scoring answers by the use of clustering techniques.


One of the best three papers on Clinical NLP in 2017 was published by Ixa Group

A paper written by Ixa members Arantza Casillas, Koldo Gojenola, Maite Oronoz and Alicia Perez is among the three best papers published in 2017 in the field of clinical Natural Language Processing.

The paper entitled “Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora”, by Pérez A, Weegar R, Casillas A, Gojenola K, Oronoz M, Dalianis H., published in the Journal of Biomedical Informatics, was considered one of the best three papers in the field of clinical Natural Language Processing in 2017.

A survey of the literature was performed in bibliographic databases. PubMed and the Association for Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to arrive at the three best clinical NLP papers for 2017.

The paper addresses “medical named entity recognition in clinical text in Spanish and Swedish; furthermore, they emphasize methods’ contribution in a context where little training data is available, which is often the case for languages other than English or when a new medical specialty is explored”.

The selection process is described and published in “Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook”, by Aurélie Névéol and Pierre Zweigenbaum, in the Yearbook of Medical Informatics.

Mitxelena Award for PhD theses 2018 to Olatz Perez-de-Viñaspre: Automatic medical term generation

Last week our colleague Olatz Perez de Viñaspre won the VI Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.

The thesis addressed the creation of computational tools to promote the use of Basque in health services.

The winners of Mitxelena Awards 2018

Title: Automatic medical term generation for a low-resource language: translation of SNOMED CT into Basque (pdf)
Supervisors: Arantza Diaz de Ilarraza and Maite Oronoz
Publications in English:

  • Design of EuSnomed:
    • Perez-de-Viñaspre O., and Oronoz M. Translating SNOMED CT Terminology into a Minor Language. Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), 38–45. Association for Computational Linguistics. Gothenburg, Sweden, 2014.
    • Perez-de-Viñaspre O., and Oronoz M. An XML Based TBX Framework to Represent Multilingual SNOMED CT for Translation. 12th Mexican International Conference on Artificial Intelligence, MICAI 2013. Lecture Notes in Artificial Intelligence, vol. 8265, 419–429. Springer, ISBN 978-3-642-45113-3. Mexico DF, Mexico, 2013.
  • Simple terms: lexical resources and neoclassical terms:
    • Perez-de-Viñaspre O., Oronoz M., Agirrezabal M., and Lersundi M. A finite state approach to translate SNOMED CT terms into Basque using medical prefixes and suffixes. Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, 99–103. St Andrews, Scotland, 2013.
    • Perez-de-Viñaspre O., and Oronoz M. SNOMED CT in a language isolate: an algorithm for a semiautomatic translation. BMC Medical Informatics and Decision Making, volume 15, number 2, S5. BioMed Central, 2015.
  • Complex terms: nested terms and automatic translator:
    • Perez-de-Viñaspre O., and Oronoz M. Osasun-zientzietako terminologiaren euskaratze automatikoaren ebaluazioa, osasungintzako euskal komunitatea inplikatuz. II. IkerGazte, Nazioarteko Ikerketa Euskaraz. Udako Euskal Unibertsitatea. Iruñea, Basque Country, 2017.
  • Other papers:
    • Perez-de-Viñaspre O., Oronoz M., and Patrick J. Osasun-txosten elebidunak posible ote? I. IkerGazte, Nazioarteko Ikerketa Euskaraz, 730–738. Udako Euskal Unibertsitatea, ISBN 978-84-8438-539-4. Durango, Basque Country, 2015. IkerGazte Special Award.
    • Perez-de-Viñaspre O., and Labaka G. IXA Biomedical Translation System at WMT16 Biomedical Translation Task. Proceedings of the First Conference on Machine Translation (WMT16), 477–482. Association for Computational Linguistics. Berlin, Germany, 2016.

Eneko Agirre awarded by Google Research

The Google Faculty Research Awards 2018 were announced in March. Eneko Agirre received a Google Faculty Research Award following Google’s annual open call for proposals on computer science and related topics, including quantum computing, machine learning, algorithms and theory, natural language processing and more. Google received 910 proposals covering 40 countries and over 320 universities. After expert reviews and committee discussions, they decided to fund 158 projects, twelve of them related to Natural Language Processing. (See: Google Faculty Research Awards 2018)

Eneko will spend the $80,000 prize on research on “Accessing FAQ and CQA sites via dialogue”.

This is not his first Google award: he received another one in 2015 to work on “Learning Interlingual Representations of Words and Concepts”.

The methods that support this kind of research are taught in Eneko Agirre’s course in the Master “Language Analysis and Processing” at the Faculty of Informatics of the University of the Basque Country in Donostia.


PhD Thesis: Multilingual Sentiment Analysis in Social Media (I. San Vicente, 2019/03/11)

Title: Multilingual Sentiment Analysis in Social Media
Author: Iñaki San Vicente
Supervisors: German Rigau / Rodrigo Agerri (Ixa Group)
Date: March 11, 2019, Monday


The main goal of this thesis was to research Multilingual Sentiment Analysis (SA) in order to develop a social media monitor on specific topics. The most relevant contributions are listed below:

  • Improvement of the state of the art for Spanish polarity classification, obtaining the first position in the TASS shared task twice.
  • Contribution to the state of the art in aspect-based SA for English, with notable results in the SemEval 2015 aspect-based SA shared task.
  • Pioneering work for Basque in the SA field, specifically:
    • The first sentiment lexicons for Basque.
    • The first polarity-annotated datasets for Basque.
    • The first resources for Basque microtext normalization.
  • EliXa, the first multilingual SA system including Basque.
  • Talaia, a real social media monitoring platform applying all the previous research.
  • A set of robust, open-domain tools and resources that are freely available.

Seminar: Irish Machine Translation and Resources (M. Dowling, 2019-03-11)

Seminar title: Irish Machine Translation and Resources
Speaker: Meghan Dowling (PhD student in the ADAPT Centre, Dublin City University)
When: Mon, 11 March, 15:00
Where: Room 3.2

Summary: As an official language in both Ireland and the EU, there is a high demand for Irish language translation in public administration. The difficulty that translators face in meeting this demand leads to an even greater need for English-Irish (EN-GA) machine translation (MT). This presentation will discuss the advances in EN-GA MT that have been made so far, the language resources gathered, as well as some possible avenues for future research.
Bio: Meghan Dowling is a PhD student in the ADAPT Centre, Dublin City University. Her PhD topic focuses on improving English-Irish machine translation. She is visiting the Ixa group for 3 months to learn how Basque resources and MT have improved, with a view to using this knowledge to improve resources and MT for Irish.