IXA group


Visitor: Andrea Horbach, Automatic scoring

Andrea Horbach is visiting San Sebastian within the enetCollect network on crowdsourcing for language learning, as part of an ongoing collaboration with Itziar Aldabe, Oier Lopez de Lacalle and Montse Maritxalar on evaluating manually as well as automatically generated reading comprehension questions.

Andrea Horbach is a researcher at the Language Technology Lab headed by Prof. Torsten Zesch at the University of Duisburg-Essen, Germany. Last year, she defended her PhD thesis in computational linguistics, titled “Analyzing Short-Answer Questions and their Automatic Scoring: Studies on Semantic Relations in Reading Comprehension and the Reduction of Human Annotation Effort”, at Saarland University. Her main research interests include educational NLP, such as automatic scoring and exercise generation, as well as the processing of non-standard language.

Last Tuesday (2019-09-17) she gave us a talk entitled “Automated and Assisted Content Scoring in Mono- and Cross-Lingual Educational Settings”.

Automatic content scoring of free-text answers aims to reduce the scoring workload of teachers and to provide consistency in scoring. In high-stakes tests, fully automatic scoring is often not an option. Nevertheless, teachers can benefit from assisted scoring, where they are supported by NLP but are still in control of the scoring process. This talk presents ongoing work of two research projects related to educational scoring. First, we investigate content scoring in a cross-lingual setup, where a model trained on data in one language is applied to new data in a different language in order to foster educational equality as well as to overcome data sparseness. We present our cross-lingual data collection, as well as machine learning experiments using machine translation to bridge the language gap.

In the second part of the talk we present work on assisted scoring of listening comprehension data from language proficiency testing. We show assisted scoring studies where teachers are supported in scoring answers by the use of clustering techniques.
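To make the idea of clustering-assisted scoring more concrete, here is a minimal sketch. The abstract does not say which clustering technique or features were used, so everything below is an assumption chosen for illustration: answers are compared by Jaccard overlap of their word sets, grouped greedily, and the teacher (the hypothetical `score_representative` callback) scores only one answer per cluster, with that score propagated to the rest.

```python
def tokens(answer):
    """Lower-cased set of whitespace-separated words (a deliberately crude feature)."""
    return set(answer.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_answers(answers, threshold=0.5):
    """Greedy clustering: each answer joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one.
    Returns a list of clusters, each a list of indices into `answers`."""
    clusters = []
    for i, answer in enumerate(answers):
        tok = tokens(answer)
        for cluster in clusters:
            if jaccard(tok, tokens(answers[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def propagate_scores(answers, clusters, score_representative):
    """Ask for one human score per cluster and copy it to every member."""
    scores = [None] * len(answers)
    for cluster in clusters:
        label = score_representative(answers[cluster[0]])
        for i in cluster:
            scores[i] = label
    return scores


answers = [
    "the train leaves at nine",
    "The train leaves at nine o'clock",
    "she missed the bus",
    "she missed her bus",
]
clusters = cluster_answers(answers)
# Stand-in for the teacher: in practice this would be a manual judgement.
scores = propagate_scores(answers, clusters, lambda a: 1 if "train" in a else 0)
print(clusters, scores)
```

In this toy setting the teacher scores two answers instead of four. A real system would need a way for the teacher to inspect and correct clusters, since propagating a score to a badly formed cluster spreads the error.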


One of the best three papers on Clinical NLP in 2017 was published by Ixa Group

A paper written by IXA members Arantza Casillas, Koldo Gojenola, Maite Oronoz and Alicia Perez is among the three best papers published in 2017 in the field of clinical Natural Language Processing.

The paper, entitled “Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora”, by Pérez A, Weegar R, Casillas A, Gojenola K, Oronoz M, Dalianis H., published in the Journal of Biomedical Informatics, was considered one of the best three papers in the field of clinical Natural Language Processing in 2017.

A survey of the literature was performed in bibliographic databases. PubMed and the Association for Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to arrive at the three best clinical NLP papers for 2017.

The paper addresses “medical named entity recognition in clinical text in Spanish and Swedish; furthermore, they emphasize methods’ contribution in a context where little training data is available, which is often the case for languages other than English or when a new medical specialty is explored”.

The selection process is described in “Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook”, by Aurélie Névéol and Pierre Zweigenbaum, published in the Yearbook of Medical Informatics.

Mitxelena Award for PhD theses: Olatz Perez-de-Viñaspre

Our colleague Olatz Perez-de-Viñaspre last week won the VI Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.

The thesis addressed the creation of computational tools to promote the use of Basque in health services.

The winners of Mitxelena Awards 2018

Title: Automatic medical term generation for a low-resource language: translation of SNOMED CT into Basque (pdf)
Supervisors: Arantza Diaz de Ilarraza and Maite Oronoz
Publications in English:

  • Design of EuSnomed:
    • Perez-de-Viñaspre O., and Oronoz M. Translating SNOMED CT Terminology into a Minor Language. Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), 38–45. Association for Computational Linguistics. Gothenburg, Sweden, 2014.
    • Perez-de-Viñaspre O., and Oronoz M. An XML Based TBX Framework to Represent Multilingual SNOMED CT for Translation. 12th Mexican International Conference on Artificial Intelligence, MICAI 2013. Lecture Notes in Artificial Intelligence, vol. 8265, 419–429. Springer, ISBN 978-3-642-45113-3. Mexico DF, Mexico, 2013.
  • Simple terms: lexical resources and neoclassical terms:
    • Perez-de-Viñaspre O., Oronoz M., Agirrezabal M., and Lersundi M. A finite state approach to translate SNOMED CT terms into Basque using medical prefixes and suffixes. Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, 99–103. St Andrews, Scotland, 2013.
    • Perez-de-Viñaspre O., and Oronoz M. SNOMED CT in a language isolate: an algorithm for a semiautomatic translation. BMC Medical Informatics and Decision Making, volume 15, number 2, S5. BioMed Central, 2015.
  • Complex terms: nested terms and automatic translator:
    • Perez-de-Viñaspre O., and Oronoz M. Osasun-zientzietako terminologiaren euskaratze automatikoaren ebaluazioa, osasungintzako euskal komunitatea inplikatuz. II. IkerGazte, Nazioarteko Ikerketa Euskaraz. Udako Euskal Unibertsitatea. Iruñea, Basque Country, 2017.
  • Other papers:
    • Perez-de-Viñaspre O., Oronoz M., and Patrick J. Osasun-txosten elebidunak posible ote? I. IkerGazte, Nazioarteko Ikerketa Euskaraz, 730–738. Udako Euskal Unibertsitatea, ISBN 978-84-8438-539-4. Durango, Basque Country, 2015. IkerGazte Special Award.
    • Perez-de-Viñaspre O., and Labaka G. IXA Biomedical Translation System at WMT16 Biomedical Translation Task. Proceedings of the First Conference on Machine Translation (WMT16), 477–482. Association for Computational Linguistics. Berlin, Germany, 2016.

Eneko Agirre awarded by Google Research

The Google Faculty Research Awards 2018 were announced in March. Eneko Agirre received a Google Faculty Research Award following Google’s annual open call for proposals on computer science and related topics, including quantum computing, machine learning, algorithms and theory, natural language processing and more. Google received 910 proposals covering 40 countries and over 320 universities. After expert reviews and committee discussions, it decided to fund 158 projects, twelve of them related to Natural Language Processing. (See: Google Faculty Research Awards 2018)

Eneko will spend the $80,000 award on research on “Accessing FAQ and CQA sites via dialogue”.

This is not his first Google award: he received another one in 2015 to work on “Learning Interlingual Representations of Words and Concepts”.

The methods that support this kind of research are taught in Eneko Agirre’s course in the Master “Language Analysis and Processing” at the Faculty of Informatics of the University of the Basque Country in Donostia.


PhD Thesis: Multilingual Sentiment Analysis in Social Media (I. San Vicente, 2019/03/11)

Title: Multilingual Sentiment Analysis in Social Media
Author: Iñaki San Vicente
Supervisors: German Rigau / Rodrigo Agerri (Ixa Group)
Date: Monday, March 11, 2019


The main goal of this thesis was to investigate Multilingual Sentiment Analysis (SA) in order to develop a social media monitor for specific topics. The most relevant contributions are listed below:

  • Improvement of the state of the art for Spanish polarity classification, obtaining first position in the TASS shared task twice.
  • Contribution to the state of the art in aspect-based SA for English, with notable results in the SemEval 2015 aspect-based SA shared task.
  • Pioneering work for Basque in the SA field, specifically:
    • The first sentiment lexicons for Basque.
    • The first polarity-annotated datasets for Basque.
    • The first resources for Basque microtext normalization.
  • EliXa, the first multilingual SA system including Basque.
  • Talaia, a real social media monitoring platform applying all the previous research.
  • A set of robust, open-domain tools and resources that are freely available.

Seminar: Irish Machine Translation and Resources (M. Dowling, 2019-03-11)

Seminar title: Irish Machine Translation and Resources
Speaker: Meghan Dowling (PhD student in the ADAPT Centre, Dublin City University)
When: Mon, 11 March, 15:00
Where: Room 3.2

Summary: As an official language in both Ireland and the EU, there is a high demand for Irish language translation in public administration. The difficulty that translators face in meeting this demand leads to an even greater need for English-Irish (EN-GA) machine translation (MT). This presentation will discuss the advances in EN-GA MT that have been made so far, the language resources gathered, as well as some possible avenues for future research.
Bio: Meghan Dowling is a PhD student in the ADAPT Centre, Dublin City University. Her PhD topic focuses on improving English-Irish machine translation. She is visiting the IXA group for three months to learn how Basque resources and MT have improved, with a view to using this knowledge to improve resources and MT for Irish.

Meeting of LINGUATEC project in Donostia (2019-02-21)

LINGUATEC project:  Development of cross-border cooperation and knowledge transfer in language technologies.

LINGUATEC is a European project funded by FEDER via POCTEFA (Programa INTERREG V-A España-Francia-Andorra). The partners are the following:

  • Elhuyar Fundazioa
  • Lo Congrès Permanent de la Lenga Occitana
  • Universidad Del País Vasco / Euskal Herriko Unibertsitatea (Ixa Taldea)
  • CNRS (Centre National de la Recherche Scientifique) – Délégation Régionale Midi-Pyrénées
  • Euskaltzaindia – Real Academia de la Lengua Vasca
  • Sociedad De Promoción y Gestión del Turismo Aragonés

The main objective of LINGUATEC is to develop, test and disseminate new linguistic resources, tools and solutions to improve the level of digitalization of the Aragonese, Basque and Occitan languages. As a result, we will obtain, among others, (1) a road map for the digitalization of Aragonese, (2) new monolingual and bilingual lexicons and morphosyntactic and syntactic analysers for Occitan, (3) a Northern Basque speech recognition system and several linguistic tools, as well as (4) new solutions for Aragonese, Basque and Occitan.

This cross-border cooperation will enable the transfer of knowledge and the development of linguistic solutions with potential market uptake, benefiting language professionals, easing access to multilingual content, and fostering the development of a cross-border language technology cluster.

After one year of work, last Wednesday we had a project meeting in Donostia organized by Euskaltzaindia. Ixa Group presented its progress in the creation of an improved Neural Machine Translation system for the Spanish-Basque pair.


2 master-theses today (2019-02-25, 16:00)


Best Paper Award at CoNLL 2018

Last week our colleagues Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, and Eneko Agirre received the Best Paper Award at the 22nd Conference on Computational Natural Language Learning (CoNLL 2018) for the paper “Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation”.


Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than directly apparent. A linear transformation that adjusts the similarity order of the model without any external resource can tailor it to achieve better results in those aspects, providing a new perspective on how embeddings encode divergent linguistic information. In addition, we explore the relation between intrinsic and extrinsic evaluation, as the effect of our transformations in downstream tasks is higher for unsupervised systems than for supervised ones.

An open source implementation of our word embedding post-processing and evaluation framework, described in the paper, is available on GitHub.
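As an illustration of what "a linear transformation that adjusts the similarity order" can mean, here is a minimal NumPy sketch. It rests on assumptions that are not stated in the abstract: similarity between words is taken to be the dot product of their embeddings (so the first-order similarity matrix is X Xᵀ), and the transformation is taken to be W = Q Λ^α, where X ᵀX = Q Λ Qᵀ is an eigendecomposition. Under those assumptions, the similarity matrix of the transformed embeddings equals (X Xᵀ) raised to the power 2α + 1. The function name `similarity_order_transform` is hypothetical, and this is not necessarily how the released framework implements it.

```python
import numpy as np


def similarity_order_transform(X, alpha):
    """Return X @ W with W = Q * lam**alpha, where X.T @ X = Q diag(lam) Q.T.

    X has one embedding per row. alpha = 0 leaves dot-product similarities
    unchanged; alpha = 0.5 yields second-order similarities.
    """
    lam, Q = np.linalg.eigh(X.T @ X)   # symmetric matrix, so eigh applies
    lam = np.clip(lam, 1e-12, None)    # guard against tiny negative round-off
    W = Q * lam ** alpha               # scales column j of Q by lam[j]**alpha
    return X @ W


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # toy data: 6 "words", 3 dimensions

first_order = X @ X.T
X_half = similarity_order_transform(X, 0.5)
second_order = X_half @ X_half.T

# With alpha = 0.5 the new similarity matrix is the square of the original one.
print(np.allclose(second_order, first_order @ first_order))
```

No external resource is involved: the transformation is computed from the embedding matrix alone, which matches the claim in the abstract. Negative values of α move in the opposite direction but amplify small eigenvalues, which is why the sketch clips them.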

Talk: Neural Networks and Linguistics. Talking Past Each Other? (M. Hulden, 2018-11-08)

Speaker: Mans Hulden, University of Colorado.
When: Thursday, 8 November
12:00 – 1:00pm

Description: Neural networks have led to previously unimaginable advances in NLP engineering tasks. The main criticism against them from a linguistic point of view is that neural models – while fine for “language engineering tasks” – are thought of as being black boxes, and that their parameter opacity prevents us from discovering new facts about the nature of language itself, or specific languages. In this talk I will challenge that assumption to show that there are ways to uncover facts about language, even with a black box learner. I will discuss specific experiments with neural models and sound embeddings that reveal new information about the organization of sound systems in human languages (phonology), give us insight into the complexity of word-formation (morphology), give us models of why and when irregular forms – surely an inefficiency in a communication system – can persist over long periods of time (historical linguistics), and reveal what the boundaries of pattern learning are (how much information do we minimally need to learn a grammatical aspect of language such as its word inflection or sentence formation).