IXA group


Workshop: Resources and tools for the automatic processing of the languages of the Pyrenees (2021-04-12, Online, Free)

The European project EFA 227/16/LINGUATEC “Development of cross-border cooperation and knowledge transfer in language technologies” is organizing a workshop open to all researchers, with the aim of disseminating the work carried out within the project and presenting some of the advances made for Basque and Occitan.
This project is co-financed by the European Regional Development Fund (ERDF).

Free registration. Please use this registration form.

12 May, 2021
Online, with presentations in English, Spanish and French, with simultaneous translation into English, Spanish and French.

10h – Opening

10h15 Invited talks: Catalan processing

Lluis Padró (Universitat Politècnica de Catalunya)
Morphological and Syntactic Resources in FreeLing
Presentation in English – Simultaneous translation in Spanish and French

Mariona Taulé (Universitat de Barcelona)
AnCora: un corpus anotado a diferentes niveles lingüísticos
AnCora: a corpus annotated at different linguistic levels
Presentation in Spanish – Simultaneous translation in English and French

11h15 – Break

11h30 Presentations: Corpora for Occitan, Basque and other under-resourced languages

Assaf Urieli, Joliciel
Talismane, Jochre: automatic syntax analysis and OCR for under-resourced languages
Presentation in English – Simultaneous translation in Spanish and French

Aleksandra Miletic and Dejan Stosic, CLLE
Mutualisation des ressources pour la création de treebanks : le cas du serbe et de l’occitan
Pooling resources for the creation of syntactic tree banks: the case of Serbian and Occitan
Presentation in French – Simultaneous translation in English and Spanish

Ainara Estarrona (IXA, HiTZ, UPV/EHU)
Construcción del corpus histórico en euskera
Construction of a historical corpus in Basque
Presentation in Spanish – Simultaneous translation in English and French

13h – Break

14h30 Invited talk: Use of Neural Networks

Mans Hulden (University of Colorado)
Neural Networks in Linguistic Research
Presentation in English – Simultaneous translation in Spanish and French

15h30 Presentation: Language processing

Rodrigo Agerri (IXA, HiTZ, UPV/EHU)
Contextual lemmatization for inflected languages: statistical and deep-learning approaches
Presentation in English – Simultaneous translation in Spanish and French

16h – Break

16h15 – Presentations: Results of the LINGUATEC project

Myriam Bras, Aleksandra Miletic, Marianne Vergez-Couret, Clamença Poujade, Jean Sibille, Louise Esher, CLLE:
Automatic processing of Occitan: construction of the first annotated corpora
Video in Occitan with accessible subtitles in English, Spanish and French

Creation and improvement of Basque resources within the framework of Linguatec
Video in Occitan with accessible subtitles in English, Spanish and French

16h45 – Conclusions

Presentation in Spanish and French – No simultaneous translation

17h – Closing

HiTZ Research Center = IXA + Aholab (2020-11-05)

Jokin Bildarratz, Education Chancellor in Basque Government

Last week we presented the new HiTZ Research Center.
Director Eneko Agirre was joined at the presentation by Rector Nekane Balluerka; Jokin Bildarratz, Education Chancellor in the Basque Government; Markel Olano, Deputy General of Gipuzkoa; and Eneko Goia, Mayor of San Sebastian.

Eneko Agirre, Director of HiTZ Center

We have created the HiTZ Center by merging two research groups: IXA and Aholab. The Ixa and Aholab research groups, both from the University of the Basque Country (UPV/EHU), have been the main driving forces in the area of Language Technologies in the Basque Country since their creation in 1988 and 1998 respectively. The two groups have been collaborating since 2002, when they promoted the formation of a Basque consortium for strategic research after contacting and attracting to the area the Elhuyar Foundation and the Vicomtech-IK4T and Tecnalia technology centers (named Robotiker at the time). Since then, the consortium has maintained an uninterrupted line of collaboration within the strategic projects of the Basque Government as well as local and state calls for proposals.

Now we are a total of 60 members (professors, researchers, technicians and PhD students), a broad team with interdisciplinary experts: computer scientists, linguists, engineers, translators and sociologists, among others.

By bringing together what the two groups did separately in training, research and technology transfer, this new research center aims to be an international reference in Language Technology.

The IXA Group has a 32-year history in the processing of written content.
The Aholab Group has been working in speech technology since 1998.
We have been collaborating since 2002.

Together with other agents in the field of artificial intelligence, we are an important research and development center at the international level, and we are currently working on 36 research projects, six of them in Europe and the United States. In the last year we published our research results in 78 scientific articles.

As for training, we offer the Erasmus Mundus International Master’s Degree, a doctoral programme in linguistic technologies and an international course on “Deep Learning” techniques. In the newly created Degree in Artificial Intelligence at the Faculty of Informatics of San Sebastian we also have an important role.


PhD Thesis: Corpus based metrics for measuring distances between languages (José Ramom Pichel, 2020-10-29)

Title:  Medidas de distância entre línguas baseadas em corpus. Aplicação à linguística histórica do galego, português, espanhol e inglês
           / Corpus based metrics for measuring distances between languages. Application to historical linguistics of Galician, Portuguese, Spanish and English.

Where: Teleconference, and Faculty of Informatics (UPV/EHU), Ada Lovelace room
Date: October 29, 2020, Thursday,  18:30
Author: José Ramom Pichel
Supervisors: Iñaki Alegria & Pablo Gamallo
Languages:  Portuguese, mainly


Doubts about the historical and current phylogenetic classification of Galician, together with hesitations in automatic language identification and in the construction and design of machine translation systems, suggest that automatically computing the distance between Galician, Portuguese and Spanish from real written texts is an interesting challenge.

1. Can the distance between languages be measured automatically on the basis of corpora?
2. What role does orthography play in the distance between languages?
3. Is it possible to express this distance in a single robust metric?
4. Does the distance computed with this metric confirm the linguists' hypotheses?
Does it add new data on minority or controversial hypotheses?
5. Does the distance between historical periods of the same language change? How?
6. Does the distance between languages change historically, or is it always the same?
And if it changes, is this distance between languages linear?
7. Does the historical distance between recognized variants of the same language change?
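
To make the first question concrete, here is a minimal sketch of one possible corpus-based distance: the perplexity of a text in one language under a character bigram model trained on another. The choice of perplexity, the add-one smoothing, the function name `bigram_perplexity` and the toy sentences are all assumptions made for illustration; this post does not describe the metric actually used in the thesis.

```python
import math
from collections import Counter


def bigram_perplexity(train_text: str, test_text: str, alphabet_size: int) -> float:
    """Perplexity of test_text under an add-one-smoothed character bigram model of train_text."""
    bigram_counts = Counter(zip(train_text, train_text[1:]))
    context_counts = Counter(train_text[:-1])
    pairs = list(zip(test_text, test_text[1:]))
    log_prob = 0.0
    for first, second in pairs:
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        prob = (bigram_counts[(first, second)] + 1) / (context_counts[first] + alphabet_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(pairs))


# Toy sentences invented for this sketch; a real experiment would use large corpora.
portuguese_train = "a casa é grande e a língua é bonita"
portuguese_test = "a língua da casa é grande"
english_test = "the house is big and the language is nice"

# One shared alphabet so that both comparisons use the same smoothing.
alphabet_size = len(set(portuguese_train + portuguese_test + english_test))
distance_pt = bigram_perplexity(portuguese_train, portuguese_test, alphabet_size)
distance_en = bigram_perplexity(portuguese_train, english_test, alphabet_size)
print(f"pt -> pt: {distance_pt:.2f}, pt -> en: {distance_en:.2f}")
```

The lower the perplexity, the closer the test text is to the language of the training text, so the Portuguese test sentence should score lower than the English one.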

Related publications to his PhD work:

IXAmBERT: Good news for languages with few resources!

Good news for languages with few resources!
Pre-trained Basque monolingual and multilingual language models have proven to be very useful in NLP tasks for Basque!
And this even though they were built with a corpus 500 times smaller than the English one and a Wikipedia 80 times smaller.


An example of Conversational Question Answering, with its English translation.

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque.

Last April we showed that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora (crawled news articles from online newspapers) produced much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work was presented in the paper entitled “Give your Text Representation Models some Love: the Case for Basque”. The composition of the Basque Media Corpus (BMC) used in that experiment was as follows:

Source             Text type      Tokens
Basque Wikipedia   Encyclopedia   35M
Berria newspaper   News           81M
EiTB               News           28M
Argia magazine     News           16M
Local news sites   News           224.6M

Bear in mind that the original BERT language model for English was trained using the Google Books corpus, which contains 155 billion words in American English and 34 billion words in British English. The English corpus is almost 500 times bigger than the Basque one.
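
The “almost 500 times” figure can be checked with a few lines of arithmetic. The sketch below simply adds up the rows of the table above and divides the English word count quoted in the text by that total; the variable names are illustrative, and the comparison mixes words and tokens, so it is only a rough order of magnitude.

```python
# Figures copied from the table above, in millions of tokens.
basque_corpus_millions = {
    "Basque Wikipedia": 35.0,
    "Berria newspaper": 81.0,
    "EiTB": 28.0,
    "Argia magazine": 16.0,
    "Local news sites": 224.6,
}

# English figures quoted in the text, in billions of words.
english_billions = 155 + 34

basque_total_millions = sum(basque_corpus_millions.values())
ratio = english_billions * 1000 / basque_total_millions

print(f"Basque corpus: {basque_total_millions:.1f}M tokens")
print(f"English / Basque ratio: {ratio:.0f}")
```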




An example of a dialogue in which many of the questions refer back to previous answers.

Now, in September, we have published IXAmBERT, a multilingual language model pretrained for English, Spanish and Basque, and we have successfully experimented with it in a Basque Conversational Question Answering system. These transfer experiments could already be performed with Google’s official mBERT model, but because it covers so many languages, Basque is not very well represented in it. To create this new multilingual model, which contains just English, Spanish and Basque, we followed the same configuration as in the BERTeus model presented in April. We re-used the corpus of the monolingual Basque model and added the English and Spanish Wikipedias, with 2.5G and 650M tokens respectively. These Wikipedias are 80 and 20 times bigger than the Basque one.

The good news is that this model has been successfully used to transfer knowledge from English to Basque in a conversational Question/Answering system, as reported in the paper Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque. In the paper, the new language model called IXAmBERT performed better than mBERT when transferring knowledge from English to Basque, as shown in the following table:

Model                Zero-shot   Transfer learning
Baseline             28.7        28.7
mBERT                31.5        37.4
IXAmBERT             38.9        41.2
mBERT + history      33.3        28.7
IXAmBERT + history   40.7        40.0

This table shows the results on a Basque Conversational Question Answering (CQA) dataset. Zero-shot means that the model is fine-tuned only on QuAC, an English CQA dataset. In the transfer learning setting, the model is first fine-tuned on QuAC and then on a Basque CQA dataset.
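
As a rough illustration of how to read the table, the following sketch computes the difference between IXAmBERT and mBERT in each setting. The scores are copied from the table; the `gain` helper is a hypothetical name introduced here, and the post does not state which evaluation metric the scores correspond to.

```python
# (zero-shot, transfer learning) scores copied from the table above.
scores = {
    "Baseline": (28.7, 28.7),
    "mBERT": (31.5, 37.4),
    "IXAmBERT": (38.9, 41.2),
    "mBERT + history": (33.3, 28.7),
    "IXAmBERT + history": (40.7, 40.0),
}


def gain(model: str, reference: str) -> tuple:
    """Point difference between two models in each setting, rounded to one decimal."""
    return tuple(round(m - r, 1) for m, r in zip(scores[model], scores[reference]))


zero_shot_gain, transfer_gain = gain("IXAmBERT", "mBERT")
print(f"IXAmBERT vs mBERT: +{zero_shot_gain} zero-shot, +{transfer_gain} transfer learning")
```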

These works set a new state-of-the-art in those tasks for Basque.
All benchmarks and models used in this work are publicly available: https://huggingface.co/ixa-ehu/ixambert-base-cased

HAP/LAP Master Theses (2020-09-21 and 22)

'AI Basque' created by GAIA Cluster and HITZ Center

This summer we created AI Basque together with the GAIA cluster. Our aim is to give the work that our HiTZ Center carries out in language technology and artificial intelligence greater visibility in our local area.

Basque Cluster on ‘Knowledge and Applied Technology Industry’

With the AI Basque center we want to become an international reference in the field of artificial intelligence (AI) for industry, the services sector and public administration. Our goal is to make companies, institutions and citizens aware of the advantages of artificial intelligence and proper data management. We want to extend this effort to all economic and social areas, in collaboration with other sectors, promoting the development of new products and services.

Who makes up AI Basque at present:

  • 21 companies in the ‘Knowledge and Applied Technology Industry’ sector
  • 3 technology centers
  • the GAIA Cluster
  • and our HiTZ Center of the UPV/EHU.

See news published by GAIA Cluster:

Nace AI BASQUE de la mano del Clúster GAIA y del Grupo Hitz de la UPV-EHU para impulsar la Inteligencia Artificial en Euskadi

and by the Basque Government:

AI Basque sortu da, Gaia Klusterraren eta EHUko Hitz Taldearen eskutik, Inteligentzia Artifiziala sustatzeko Euskadin

PhD Thesis: Unsupervised Machine Translation (Mikel Artetxe, 2020-07-29)

Title:  Unsupervised Machine Translation
           / Itzulpen automatiko gainbegiratu gabea

Where: Teleconference: https://eu.bbcollab.com/guest/b22b606d9ae74bc5b3e067821c897617
Faculty of Informatics (UPV/EHU), Ada Lovelace room
Date: July 29, 2020, Wednesday,  11:00
Author: Mikel Artetxe Zurutuza 
Supervisors: Eneko Agirre & Gorka Labaka
Languages:  Basque (motivation, state of the art)  and English (second half, papers, conclusions, ~11:30…)


The advent of neural sequence-to-sequence models has led to impressive progress in machine translation, with large improvements in standard benchmarks and the first solid claims of human parity in certain settings. Nevertheless, existing systems require strong supervision in the form of parallel corpora, typically consisting of several million sentence pairs. Such a requirement greatly departs from the way in which humans acquire language, and poses a major practical problem for the vast majority of low-resource language pairs.

The goal of this thesis is to remove the dependency on parallel data altogether, relying on nothing but monolingual corpora to train unsupervised machine translation systems. For that purpose, our approach first aligns separately trained word representations in different languages based on their structural similarity, and uses them to initialize either a neural or a statistical machine translation system, which is further trained through back-translation.
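
The idea of aligning word representations "based on their structural similarity" can be illustrated with a toy example. In the sketch below, each word is described by the sorted list of its similarities to every word in its own language; that profile does not depend on how the embedding space is rotated, so words can be paired across languages without any bilingual data. The 2-D vectors, the word lists and the function names are hypothetical and chosen only for illustration. This is not the implementation used in the thesis, which works with high-dimensional embeddings and vocabularies that do not correspond one to one.

```python
import math


def unit_vector(angle_degrees: float) -> tuple:
    """A 2-D unit vector standing in for a word embedding."""
    angle = math.radians(angle_degrees)
    return (math.cos(angle), math.sin(angle))


def similarity_profile(word: str, embeddings: dict) -> list:
    """Sorted dot products between one word and every word of its own language."""
    x, y = embeddings[word]
    return sorted(x * ox + y * oy for ox, oy in embeddings.values())


def induce_dictionary(source: dict, target: dict) -> dict:
    """Pair each source word with the target word whose profile is closest."""
    target_profiles = {word: similarity_profile(word, target) for word in target}
    dictionary = {}
    for word in source:
        profile = similarity_profile(word, source)
        dictionary[word] = min(
            target_profiles,
            key=lambda t: sum((a - b) ** 2 for a, b in zip(profile, target_profiles[t])),
        )
    return dictionary


# Toy spaces: the Basque one is the English one rotated by 40 degrees,
# with the words listed in a different order.
english = {"dog": unit_vector(0), "cat": unit_vector(20),
           "house": unit_vector(70), "sea": unit_vector(150)}
basque = {"itsaso": unit_vector(190), "etxe": unit_vector(110),
          "txakur": unit_vector(40), "katu": unit_vector(60)}

print(induce_dictionary(english, basque))
```

In a full system, a dictionary induced this way would only be a starting point, to be refined and then used to initialize the translation model.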

Mikel Artetxe publications related to his PhD work:

The Ixa group ranks first in the CAPITEL@IberLEF2020 competition

The three systems presented by IXA Group (HiTZ center) to the competition CAPITEL@IberLEF2020 have ranked first in Sub-task 1 (Named Entity Recognition and Classification in Spanish News Articles). The systems were developed by Rodrigo Agerri with the help of German Rigau, Ander Barrena and Jon Ander Campos.

Zorionak, congratulations to Rodrigo and all the team!

Within the framework of the PlanTL, the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy signed an agreement to develop a linguistically annotated corpus of Spanish news articles, aimed at expanding the language resource infrastructure for the Spanish language. The corpus is named CAPITEL (Corpus del Plan de Impulso a las Tecnologías del Lenguaje), and it is composed of contemporary news articles obtained through agreements with a number of news media providers. CAPITEL has three levels of linguistic annotation: morphosyntactic (with lemmas and Universal Dependencies-style POS tags and features), syntactic (following Universal Dependencies v2), and named entities.

The linguistic annotation of a subset of the CAPITEL corpus has been revised using a machine-annotation-followed-by-human-revision procedure. Manual revision has been carried out by a team of graduate linguists using the Annotation Guidelines created specifically for CAPITEL. The revised annotations comprise about 1 million words for the named entity layer and roughly 250,000 for the syntactic layer.

Given the size of the corpus and the nature of the annotations, the organizers proposed two IberLEF sub-tasks under the more general umbrella task of CAPITEL @ IberLEF 2020, using the revised subset of the CAPITEL corpus in two challenges, namely:

(1) Named Entity Recognition and Classification and

(2) Universal Dependency Parsing.

Master students won the eHealth-KD 2020 subtask on Relation Extraction

Oscar Sainz and Edgar Andrés, students of the HAP-LAP master, obtained an excellent result in the eHealth-KD 2020 challenge, in which they took part together with professors Oier Lopez de Lacalle and Aitziber Atutxa. Their team (IXA-NER-RE) was the “champion” in the Relation Extraction subtask.

Although their main goal was to take part only in the Relation Extraction subtask, they also submitted small systems to the other two subtasks (Entity Recognition and Alternative Domain), and as a result their system ranked fourth in the main evaluation.

You have done a good job!

The results can be consulted here:

IXA-NER-RE was the “champion” in the “Relation Extraction” subtask

The Ixa research group wins an award in the artificial intelligence competition on COVID-19 promoted by the US government

The CORD-19 competition (COVID-19 Open Research Dataset Challenge) has been organized by several organizations, including the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University, Microsoft Research, the National Institutes of Health and the White House Office of Science and Technology Policy. The organizers have made available to the global research community more than 50,000 scientific articles on COVID-19, SARS-CoV-2 and other coronaviruses. At the same time, they have issued a call to action to artificial intelligence researchers to apply recent advances in natural language processing to help scientists fighting COVID-19 find the information they need in the scientific literature.

In the first phase of the competition there were 10 awards, and the system developed in the Ixa group of the HiTZ Center received one of them. Researchers Arantxa Otegi and Jon Ander Campos and professors Eneko Agirre and Aitor Soroa, all from the University of the Basque Country, took part in developing the system. By analyzing the aforementioned scientific articles, the system finds answers to high-priority questions posed by experts about COVID-19 and the SARS-CoV-2 virus. It is therefore useful for answering questions on topics such as the history of coronaviruses, the transmission and diagnosis of the virus, prevention measures for contact between humans and animals, and the lessons of previous epidemiological studies. The results of the system were evaluated by a group of experts from the NIH of the United States, and it was selected as the system that best answered a set of questions on the topic “What do we know about diagnostics and surveillance?”. The answers given by the system can be seen here.

See some examples here