IXA group


Ixa group ranks first in the CAPITEL@IberLEF2020 competition

The three systems submitted by the IXA Group (HiTZ center) to the CAPITEL@IberLEF2020 competition ranked first in Sub-task 1 (Named Entity Recognition and Classification in Spanish News Articles). The systems were developed by Rodrigo Agerri with the help of German Rigau, Ander Barrena and Jon Ander Campos.

Zorionak, congratulations to Rodrigo and all the team!

Within the framework of the PlanTL, the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy signed an agreement to develop a linguistically annotated corpus of Spanish news articles, aimed at expanding the language resource infrastructure for Spanish. The corpus is called CAPITEL (Corpus del Plan de Impulso a las Tecnologías del Lenguaje) and is composed of contemporary news articles, obtained through agreements with a number of news media providers. CAPITEL has three levels of linguistic annotation: morphosyntactic (with lemmas and Universal Dependencies-style POS tags and features), syntactic (following Universal Dependencies v2), and named entities.

The linguistic annotation of a subset of the CAPITEL corpus has been revised using a procedure of machine annotation followed by human revision. The manual revision was carried out by a team of graduate linguists using the Annotation Guidelines created specifically for CAPITEL. In the revised annotations, the named entity layer comprises about 1 million words and the syntactic layer roughly 250,000.

Given the size of the corpus and the nature of the annotations, the organizers proposed two IberLEF sub-tasks under the umbrella task CAPITEL @ IberLEF 2020, using the revised subset of the CAPITEL corpus in two challenges, namely:

(1) Named Entity Recognition and Classification and

(2) Universal Dependency Parsing.

Master's students won the eHealth-KD 2020 subtask on Relation Extraction

Oscar Sainz and Edgar Andrés, students of the HAP-LAP master, obtained an excellent result in the eHealth-2020 challenge, in which they took part together with professors Oier Lopez de Lacalle and Aitziber Atutxa. Their team (IXA-NER-RE) was the “champion” in the Relation Extraction sub-task.

Although their main objective was to participate only in the Relation Extraction subtask, they also submitted small systems to the other two subtasks (Entity Recognition and Alternative Domain), and as a result their system ranked fourth in the main evaluation.

You have done a good job!

The results can be consulted here:

IXA-NER-RE was the “champion” in the “Relation Extraction” subtask

The Ixa research group wins an award in the COVID-19 artificial intelligence competition promoted by the US government

The CORD-19 competition (COVID-19 Open Research Dataset Challenge) has been organized by several institutions, including the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University, Microsoft Research, the National Institutes of Health and the White House Office of Science and Technology Policy. The organizers have made available to the global research community more than 50,000 scientific articles on COVID-19, SARS-CoV-2 and other coronaviruses. At the same time, they issued a call to action to artificial intelligence researchers to apply recent advances in natural language processing to help scientists fighting COVID-19 find the information they need in the scientific literature.

In the first phase of the competition there were 10 awards, and the system developed by the Ixa group of the HiTZ centre received one of them. Arantxa Otegi and Jon Ander Campos, researchers at the University of the Basque Country, and professors Eneko Agirre and Aitor Soroa took part in developing the system. By analysing the aforementioned scientific articles, the system finds answers to high-priority questions posed by experts about COVID-19 and the SARS-CoV-2 virus. It is thus useful for answering questions on topics such as the history of coronaviruses, the transmission and diagnosis of the virus, prevention measures for contact between humans and animals, and the lessons of previous epidemiological studies. The results were evaluated by a group of experts from the US NIH, who selected it as the system that best answered a set of questions on the topic “What do we know about diagnostics and surveillance?”. The answers given by the system can be seen here.

See some examples here


Five papers accepted at the 58th Annual Meeting of the Association for Computational Linguistics

Members of the Ixa group and their collaborators will present five papers at the 58th Annual Meeting of the Association for Computational Linguistics (ACL). ACL is one of the most important conferences on Natural Language Processing. It was to be held in Seattle in July, but this year it will take place online.

The accepted papers are the following:

Selecting Backtranslated Data from Multiple Sources for improved Neural Machine Translation (Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way): We analyse the impact that data backtranslated with diverse systems has on eu-es and de-en clinical domain NMT, and employ data selection (DS) to optimise the synthetic corpus. We further rescore the output of DS by considering the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora.

On the Cross-lingual Transferability of Monolingual Representations (Mikel Artetxe, Sebastian Ruder, Dani Yogatama): We challenge common beliefs of why multilingual BERT works by showing that a monolingual BERT model can also be transferred to new languages at the lexical level.

A Call for More Rigor in Unsupervised Cross-lingual Learning (Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre): In this position paper, we review motivations, definition, approaches and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them.

DoQA – Accessing Domain-Specific FAQs via Conversational QA (Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre): We present DoQA, a dataset for accessing FAQs via conversational Question Answering, showing that it is possible to build high quality conversational QA systems for accessing FAQs without in-domain training data.

A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation (Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak): We introduce a novel methodology to efficiently construct a corpus for question answering over structured data, with threefold manual annotation speed gains compared to previous schemes such as Spider. Our method also produces fine-grained alignment of query tokens to parsing operations. We train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.

Congratulations to all the authors!

Eneko Agirre wins the Google prize for the third consecutive year

Eneko Agirre won a Google prize again last March. He is one of the few researchers who have obtained the Google Faculty Research Award on three occasions. The $62,000 prize will fund the project ‘Conversational Question Answering agents that learn after deployment’ to develop user dialogue systems, chatbots and artificial intelligence.

Eneko Agirre, a member of the Ixa Group and professor at the Faculty of Computer Science of the UPV/EHU, is the director of the newly created HiTZ Research Center. His six colleagues in the project are professors Aitor Soroa and Gorka Azkune; researcher Arantxa Otegi; doctoral student Jon Ander Campos; Aitor Agirre, a student in the Master in Language Analysis and Processing; and Eduardo Vallejo, an undergraduate student in Computer Science.

Although the project focuses mainly on English dialogues (questions about cooking and food), they are also working with Basque dialogues. For this purpose, last year the Ixa Group launched a campaign to recruit volunteers for the collection of interviews in Basque. The campaign was successful and many personal interviews were collected in Basque (http://ixa.eus/lagundu).


The “Itzulbide” research project

In today's society, the use of automatic translation tools between languages is common and widespread. In the Basque Country, the Ixa group of the University of the Basque Country (EHU) has a long track record and broad experience in language processing. In this context, EHU and Osakidetza have seen an opportunity to develop a tool adapted to the clinical domain, both because of the paradigm shift that neural networks have brought to machine translation and because conditions for it are better than ever. These conditions are social (a growing number of bilingual staff who want to work in Basque, orally and in writing, and well-prepared young Basque speakers who received their university training in Basque) as well as technological.

Translation is not a goal in any Basque Language Plan, and neither is the development of machine translators; it is a means or a resource. The goal of the Basque Language Plans, Osakidetza's included, is to increase the presence and use of Basque, and it remains to be shown whether this tool will help toward that goal. Indeed, Itzulbide was launched as a research project and rests on one hypothesis: if a general-purpose translation system is taught how to translate in the clinical domain, we will have a fast and reliable translation tool in the future. Whether that hypothesis holds will be seen in a few years.

The project started in June 2019, and its promoters (EHU's Ixa Group and Osakidetza's Itzulbide working group) have begun giving open presentations of the project centre by centre, to address professionals' opinions and doubts and to gather their contributions first-hand. At the time of writing, 68 professionals from different specialties and categories were collaborating in the project by producing bilingual clinical texts. Keep it up, and many thanks to all participants!

The “Itzulbide” machine translation project neither hinders nor conditions the specific goals and normalization measures currently set out in Osakidetza's Basque Language Plan.

If it proves useful, the tool will be integrated into Osakidetza's information system. Beyond that, its development could be extended to the whole health community (professionals in public and private organizations, pharmacists, university and non-university health students and teachers, residents, professional associations) and to the entire geographical area of the Basque language. It may also be a helpful tool for professionals who are learning Basque. In short, Itzulbide's potential use could go beyond the clinical record.

A project of this kind may raise doubts, but let us test and measure whether this tool brings us closer to the goals of the Basque Language Plan; let us give Itzulbide a chance.

The Itzulbide working group of the EHU Ixa group and Osakidetza

Visitor: Andrea Horbach, Automatic scoring

Andrea Horbach is visiting San Sebastian within the enetCollect network on crowdsourcing for language learning, as part of an ongoing collaboration with Itziar Aldabe, Oier Lopez de Lacalle and Montse Maritxalar on evaluating manually as well as automatically generated reading comprehension questions.

Andrea Horbach is a researcher at the Language Technology Lab headed by Prof. Torsten Zesch at the University of Duisburg-Essen, Germany. Last year, she defended her PhD thesis in computational linguistics, titled “Analyzing Short-Answer Questions and their Automatic Scoring: Studies on Semantic Relations in Reading Comprehension and the Reduction of Human Annotation Effort”, at Saarland University. Her main research interests include educational NLP, such as automatic scoring and exercise generation, as well as the processing of non-standard language.

Last Tuesday (2019-09-17) she gave us a talk entitled “Automated and Assisted Content Scoring in Mono- and Cross-Lingual Educational Settings”:

Automatic content scoring of free-text answers aims to reduce the scoring workload of teachers and to provide consistency in scoring. In high-stakes tests, fully automatic scoring is often not an option. Nevertheless, teachers can benefit from assisted scoring, where they are supported by NLP but remain in control of the scoring process. This talk presents ongoing work from two research projects related to educational scoring. First, we investigate content scoring in a cross-lingual setup, where a model trained on data in one language is applied to new data in a different language, in order to foster educational equality as well as to overcome data sparseness. We present our cross-lingual data collection, as well as machine learning experiments using machine translation to bridge the language gap.

In the second part of the talk we present work on assisted scoring of listening comprehension data from language proficiency testing. We show assisted scoring studies where teachers are supported in scoring answers by the use of clustering techniques.


One of the three best papers on Clinical NLP in 2017 was published by Ixa Group

A paper written by IXA members Arantza Casillas, Koldo Gojenola, Maite Oronoz and Alicia Perez is among the 3 best papers published in 2017 in the field of clinical Natural Language Processing.

The paper entitled “Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora”, by Pérez A, Weegar R, Casillas A, Gojenola K, Oronoz M, Dalianis H., published in the Journal of Biomedical Informatics, was considered one of the three best papers in the field of clinical Natural Language Processing in 2017.

A survey of the literature was performed in bibliographic databases. PubMed and the Association for Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to arrive at the three best clinical NLP papers for 2017.

The paper addresses “medical named entity recognition in clinical text in Spanish and Swedish; furthermore, they emphasize methods’ contribution in a context where little training data is available, which is often the case for languages other than English or when a new medical specialty is explored”.

The selection process is described in “Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook”, by Aurélie Névéol and Pierre Zweigenbaum, published in the Yearbook of Medical Informatics.

Mitxelena Award for PhD theses: Olatz Perez-de-Viñaspre

Last week our colleague Olatz Perez de Viñaspre won the VI Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.

Her thesis addressed the creation of computational tools to promote the use of Basque in health services.

The winners of the Mitxelena Awards 2018

Title: Automatic medical term generation for a low-resource language: translation of SNOMED CT into Basque (pdf)
Supervisors: Arantza Diaz de Ilarraza and Maite Oronoz
Publications in English:

  • Design of EuSnomed:
    • Perez-de-Viñaspre O., and Oronoz M. Translating SNOMED CT Terminology into a Minor Language. Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), 38–45. Association for Computational Linguistics. Gothenburg, Sweden, 2014.
    • Perez-de-Viñaspre O., and Oronoz M. An XML Based TBX Framework to Represent Multilingual SNOMED CT for Translation. 12th Mexican International Conference on Artificial Intelligence, MICAI 2013. Lecture Notes in Artificial Intelligence, vol. 8265, 419–429. Springer, ISBN 978-3-642-45113-3. Mexico DF, Mexico, 2013.
  • Simple terms: lexical resources and neoclassical terms:
    • Perez-de-Viñaspre O., Oronoz M., Agirrezabal M., and Lersundi M. A finite state approach to translate SNOMED CT terms into Basque using medical prefixes and suffixes. Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, 99–103. St Andrews, Scotland, 2013.
    • Perez-de-Viñaspre O., and Oronoz M. SNOMED CT in a language isolate: an algorithm for a semiautomatic translation. BMC Medical Informatics and Decision Making, volume 15, number 2, S5. BioMed Central, 2015.
  • Complex terms: nested terms and automatic translator:
    • Perez-de-Viñaspre O., and Oronoz M. Osasun-zientzietako terminologiaren euskaratze automatikoaren ebaluazioa, osasungintzako euskal komunitatea inplikatuz. II. IkerGazte, Nazioarteko Ikerketa Euskaraz. Udako Euskal Unibertsitatea. Iruñea, Basque Country, 2017.
  • Other papers:
    • Perez-de-Viñaspre O., Oronoz M., and Patrick J. Osasun-txosten elebidunak posible ote? I. IkerGazte, Nazioarteko Ikerketa Euskaraz, 730–738. Udako Euskal Unibertsitatea, ISBN 978-84-8438-539-4. Durango, Basque Country, 2015. IkerGazte Special Award.
    • Perez-de-Viñaspre O., and Labaka G. IXA Biomedical Translation System at WMT16 Biomedical Translation Task. Proceedings of the First Conference on Machine Translation (WMT16), 477–482. Association for Computational Linguistics. Berlin, Germany, 2016.

Eneko Agirre awarded by Google Research

The Google Faculty Research Awards 2018 were announced in March. Eneko Agirre received a Google Faculty Research Award following Google's annual open call for proposals on computer science and related topics, including quantum computing, machine learning, algorithms and theory, natural language processing and more. Google received 910 proposals covering 40 countries and over 320 universities. After expert reviews and committee discussions, they decided to fund 158 projects, twelve of them related to Natural Language Processing. (See: Google Faculty Research Awards 2018)

Eneko will spend the $80,000 prize on research on “Accessing FAQ and CQA sites via dialogue”.

This is not his first Google award: he received another one in 2015 to work on “Learning Interlingual Representations of Words and Concepts”.

The methods that support this kind of research are taught in Eneko Agirre’s course in the Master “Language Analysis and Processing” at the Faculty of Informatics of the University of the Basque Country in Donostia.