IXA group


“Language Analysis and Processing” master theses (2018-06-26)

Four master theses have been presented in June:

Noisy Speech Recognition using Kaldi and Neural Architectures
Ikaslea/Student: Ander González Docasal
Zuzendariak/Supervisors: Vassilis Tsiaras, George P. Kafentzis, Yannis Stylianou

Unsupervised Methods to Predict Example Difficulty in Word Sense Annotation
Ikaslea/Student: Cristina Aceta Moreno
Zuzendariak/Supervisors: Oier Lopez de Lacalle, Eneko Agirre, Izaskun Aldezabal

To post‐edit or to translate… That is the question.
A case study of a recommender system for Quality Estimation of Machine Translation based on linguistic features
Ikaslea/Student: Ona de Gilbert Bonet
Zuzendaria/Supervisor: Nora Aranberri

Basque‐to‐Spanish and Spanish‐to‐Basque Machine Translation for the health domain
Ikaslea/Student: Xabier Soto García
Zuzendariak/Supervisors: Gorka Labaka, Olatz Perez de Viñaspre
Zuzendarikidea/Co‐advisor: Maite Oronoz

Talk: Karelian dialects, how to study variation between closely related languages? (I. Moshnikov, 2018-06-19)

Speaker: Ilia Moshnikov, Karelian Institute (Joensuu)
Date: Tuesday, June 19, 2018
Time: 15:00-16:00
Place: UPV/EHUko Informatika Fakultatea, Manuel de Lardizabal 1, 20018 Donostia (map)
Title:  Variants of the active past participle in the Border Karelian dialects:
how to study variation between closely related languages?

Karelian languages (Wikipedia)

During my visit I would like to present my research interests and say something about my home university. I will also say a few words about the current situation of the Karelian language and its use on the Internet. While working in the Kiännä research project, I investigated, from a virtual linguistic landscape point of view, which websites offer Karelian as a full interface language. I will then talk about my doctoral dissertation, which is the topic of this presentation: variants of the active past participle in the Border Karelian dialects, and how to study variation between closely related languages. I use some statistical methods. My theoretical background is in language contact and language variation research.

Short bio:
My name is Ilia Moshnikov and I am a visiting researcher from the University of Eastern Finland (Joensuu, Finland). I will be staying in San Sebastian for one month. I am a linguist, and my doctoral dissertation is about language contact between Finnish and Karelian in the Border Karelian dialects. My other interests include language revitalization and modern language usage; for example, I am involved in the Karelian Wikipedia. I am originally from Russian Karelia. I speak Karelian, Finnish, English and Russian (and a bit of Spanish as well). I work as a researcher at the Karelian Institute (Joensuu) and teach some Karelian (and Russian) courses.

Talk: The Open Multilingual Wordnet (F. Bond, 2018-06-13)

Speaker: Francis Bond, Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore
Date: June 13, 2018
Time: 15:00
Place: UPV/EHUko Informatika Fakultatea, Manuel de Lardizabal 1, 20018 Donostia (map)
Title: The Open Multilingual Wordnet


In this talk I introduce the Open Multilingual Wordnet, a large lexical network of words grouped into concepts and linked by typed semantic relations. The talk will cover how the resource has evolved over time (increases in both size and complexity) and introduce some of the latest extensions.
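As a rough illustration of that structure, a wordnet can be sketched in Python as a set of concepts (synsets), each holding the words that express it in one or more languages, with typed links between concepts. This is only a toy model and not the Open Multilingual Wordnet's data format or API: the class names, concept identifiers and sample entries below are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A concept, with the words that express it in each language."""
    synset_id: str
    # language code -> set of words (lemmas) for this concept
    lemmas: dict = field(default_factory=lambda: defaultdict(set))


class ToyWordnet:
    """Hypothetical in-memory wordnet: concepts plus typed relations."""

    def __init__(self):
        self.synsets = {}
        # (source concept id, relation type) -> set of target concept ids
        self.relations = defaultdict(set)

    def add_lemma(self, synset_id, lang, word):
        synset = self.synsets.setdefault(synset_id, Synset(synset_id))
        synset.lemmas[lang].add(word)

    def add_relation(self, source_id, rel_type, target_id):
        self.relations[(source_id, rel_type)].add(target_id)

    def related_words(self, synset_id, rel_type, lang):
        """Words in `lang` for every concept linked by `rel_type`."""
        words = set()
        for target_id in self.relations.get((synset_id, rel_type), set()):
            target = self.synsets.get(target_id)
            if target is not None:
                words |= target.lemmas.get(lang, set())
        return words


# Made-up sample entries: one concept shared by two languages.
wn = ToyWordnet()
wn.add_lemma("c-dog", "eng", "dog")
wn.add_lemma("c-dog", "jpn", "犬")
wn.add_lemma("c-animal", "eng", "animal")
wn.add_relation("c-dog", "hypernym", "c-animal")

print(wn.related_words("c-dog", "hypernym", "eng"))  # {'animal'}
```

Because words from different languages attach to the same concept identifier, a relation added once between two concepts is available to every language linked to them.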

Short bio:

Francis Bond is an Associate Professor at the Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore. He worked on machine translation and natural language understanding in Japan, first at Nippon Telegraph and Telephone Corporation and then at the National Institute of Information and Communications Technology, where his focus was on open source natural language processing. He is an active member of the Deep Linguistic Processing with HPSG Initiative (DELPH-IN) and the Global WordNet Association. His main research interest is in natural language understanding. Francis has developed and released wordnets for Chinese, Japanese, Malay and Indonesian, and coordinates the Open Multilingual Wordnet.

PROCESSING OF HISTORICAL CORPORA (Open day workshop, 2018-06-11)

The collection, tagging, analysis and querying of historical corpora are basic tasks in quantitative research on linguistic and cultural evolution. Collaboration between the areas of linguistics, history and technology is necessary for the success of these processes.

Several international projects are being carried out in this field, and some of these experiences will be presented at this workshop. In the Basque Country there are also projects in progress, but they are fragmented.

Date: June 11th, 11.00 a.m. (Ada Lovelace hall)
Place: Informatics Faculty UPV/EHU. Manuel Lardizabal 1, 20018 Donostia (map)
Language: English

11.00-11.30: Ricardo Etxepare: BIM project, Basque in the making (Sintaktikoki Etiketatutako Euskarazko Corpus Historikoa)
11.30-12.15: Martin Reynaert: Text-Induced Corpus Clean-up: current state-of-the-art
12.15-13.00: Eckhard Bick: Automatic Grammatical Annotation of Historical Brazilian Portuguese

Sponsors: UPPA  –  UPV/EHU  –  Clarin



Be a friend of the Minority SafePack!

We call upon the EU to adopt a set of legal acts to improve the protection of persons belonging to national and linguistic minorities and strengthen cultural and linguistic diversity in the Union. It shall include policy actions in the areas of regional and minority languages, education and culture, regional policy, participation, equality, audiovisual and other media content, and also regional (state) support.

A European citizens’ initiative is an invitation to the European Commission to propose legislation on matters where the EU has competence to legislate. A citizens’ initiative has to be backed by at least one million EU citizens, coming from at least 7 out of the 28 member states. A minimum number of signatories is required in each of those 7 member states.

The Minority SafePack initiative has gathered 849,888 signatures. 150,000 more are needed within two weeks.

 You can sign here

In the European Union there are about 50 million people who belong to a national minority or a minority language community.

Science journal: 'Ixa opens a new research avenue: Machine Translation without a dictionary?'

Science reported this week about the work recently published by our colleagues Mikel Artetxe, Eneko Agirre and Gorka Labaka: Artificial intelligence goes bilingual—without a dictionary
On October 30th our three colleagues published a preprint entitled Unsupervised Neural Machine Translation, written in collaboration with Kyunghyun Cho.
One day later G. Lample published another paper with similar content, entitled Unsupervised Machine Translation Using Monolingual Corpora Only. Both papers are under consideration at ICLR 2018.
Here are some excerpts from the article by Matthew Hutson, a freelance writer covering technology for Science:

[…] two new papers show that neural networks can learn to translate with no parallel texts—a surprising advance that could make documents in many languages more accessible.

[…] “Imagine that you give one person lots of Chinese books and lots of Arabic books, none of them overlapping, and the person has to learn to translate Chinese to Arabic. That seems impossible, right?” says the first author of one study, Mikel Artetxe, a computer scientist at the University of the Basque Country (UPV) in San Sebastián, Spain. “But we show that a computer can do that.”

[…]  “This is in infancy,” Artetxe’s co-author Eneko Agirre cautions. “We just opened a new research avenue, so we don’t know where it’s heading.”

[…] Artetxe says the fact that his method and Lample’s—uploaded to arXiv within a day of each other—are so similar is surprising. “But at the same time, it’s great. It means the approach is really in the right direction.”

Congratulations Mikel, Eneko, Gorka and Kyunghyun!

Course: Deep Learning for Natural Language Processing (4.5 ECTS, February)

Are the meanings of these two words related? (Eneko’s Google Award 2015)

Course: Deep Learning for Natural Language Processing

    Course open to anyone, see details and pre-requisite information below.
    Deep learning neural network models have been successfully applied to natural language processing and are now radically changing how we interact with machines (Siri, Amazon Alexa, Google Home, Skype Translator, Google Translate, or the Google search engine). These models can infer continuous representations of words and sentences instead of relying on hand-engineered features, as other machine learning approaches do. The seminar will introduce the main deep learning models used in natural language processing and give attendees hands-on experience understanding and implementing them in TensorFlow.


Topics: introduction to machine learning and NLP with TensorFlow, deep learning, word embeddings, language modeling and recurrent neural networks, convolutional neural networks, attention mechanisms

Instructors: Eneko Agirre & Oier Lopez de Lacalle

Practical details

Part of the Language Analysis and Processing master program
Schedule: Twelve days, February 5-8, 19-22, 26-28 and March 1 (2018)
Time: 17:30 – 20:00
Where: Lab 0.1, Computer science faculty, San Sebastian
Teaching language: English
Capacity: 20 students (selected according to CV)
Price: 180€
4.5 ECTS credits


Pre-registration and contact: send an e-mail with CV to amaia.lorenzo@ehu.eus and e.agirre@ehu.eus
Pre-registration open: from now until December 24th
Prerequisite: Basic programming experience, a university-level course in computer science and experience in Python.
Basic math skills (algebra or pre-calculus) are also needed.

Presentation: Research groups in the Faculty of Informatics (2017-10-10, 10:00-11:10)

Tomorrow morning the research groups in the Faculty of Informatics will present their work to the students.

Date: Tuesday, October 10
Time: 10:05-11:10
Where: Ada-Lovelace room
Audience: 3rd- and 4th-year students
Subject: Presentation of research subjects and groups in the Faculty.
IXA Group’s collaboration with students: job opportunities for undergraduate students, scholarships…

HAP/LAP master theses (2017-09-26)

Master HAP/LAP  —  EMLCT master
Master thesis defences


Izenburua / Title: Automatic Generation of Named Entity Taggers Leveraging Parallel Corpora
Egilea / Author: Yi-Ling Chung (EMLCT)
Tutoreak / Supervisors: Rodrigo Agerri and German Rigau


Izenburua / Title: Dialect normalisation with deep learning-based automatic speech recognition
Egilea / Author: Mahsa Vafaie (EMLCT)
Tutoreak / Supervisors: Inma Hernaez, Josef Van Genabith


Izenburua / Title: Mapping of Electronic Health Records in Spanish to the Unified Medical Language System Metathesaurus
Egilea / Author: Naiara Perez (HAP/LAP)
Tutoreak / Supervisors: Montse Cuadros and German Rigau

Best paper award in SEPLN2017

Last week in Murcia, our colleagues Begoña Altuna, María Jesús Aranzabe, and Arantza Diaz de Ilarraza received the best paper award at the 33rd International Conference of the Spanish Society for Natural Language Processing (SEPLN 2017).


The paper is available here: EusHeidelTime: Time Expression Extraction and Normalisation for Basque

Temporal information helps to organise the information in texts, as it places actions and states in time. It is therefore very important to identify the time points and intervals in a text, as well as the times they refer to. We developed EusHeidelTime for Basque time expression extraction and normalisation. To that end, we analysed time expressions in Basque, created the rules and resources for the tool, and built corpora for development and testing. Finally, we ran an experiment to evaluate EusHeidelTime’s performance. We achieved satisfactory results and demonstrated the adaptability of the tool to morphologically rich languages.
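To illustrate what extraction and normalisation mean here, the following minimal Python sketch finds English date expressions of one fixed form with a regular expression and normalises them to ISO 8601 values. It is not EusHeidelTime's rule set or code: the pattern, the function name and the sample sentence are assumptions made for illustration, and a tagger for Basque needs far richer rules to cope with the language's morphology.

```python
import re

# Month names mapped to their numbers (English only; an assumption of this sketch).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Extraction rule: matches expressions such as "June 19, 2018".
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b"
)


def extract_dates(text):
    """Return (matched text, ISO 8601 value) pairs for 'Month D, YYYY' dates.

    The day number is not validated against the month in this sketch.
    """
    results = []
    for match in DATE_PATTERN.finditer(text):
        month = MONTHS[match.group(1)]
        day = int(match.group(2))
        year = int(match.group(3))
        # Normalisation: map the surface form to a canonical value.
        results.append((match.group(0), f"{year:04d}-{month:02d}-{day:02d}"))
    return results


print(extract_dates("The talk takes place on June 19, 2018 at 15:00."))
# [('June 19, 2018', '2018-06-19')]
```

The two steps are kept separate on purpose: the pattern decides which span of text counts as a time expression, and the formatting step decides which point in time that span refers to.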