Dr. Nora Aranberri (UPV/EHU)

Nora Aranberri is a researcher at the IXA natural language processing group and a lecturer at the Faculty of Education of Bilbao at the University of the Basque Country. She specialises in machine translation (MT), where her research focuses on integrating linguistic knowledge into the systems and on their evaluation, paying special attention to aspects related to their use by both professional translators and regular users. Although not exclusively, the language pairs she mainly works with involve Basque, which gives her the opportunity to explore the implications MT can have for low-resource minority languages. She has also led hands-on workshops on post-editing with trainee and professional translators and collaborates with the Association of Translators, Correctors and Interpreters of Basque Language.

Parallel corpora in machine translation: opportunities, challenges, and... Basque

Parallel corpora are vital to the development and evaluation of many natural language processing applications. In many cases, however, compiling suitable parallel resources poses an enormous challenge. In this talk, we will focus on machine translation (MT) and consider a number of situations at different stages of the development and implementation cycle where parallel corpora play a key role. We will first concentrate on the development stage, and specifically consider the features of the data required to build the systems. We will look into ways in which researchers have tried to generate the parallel corpora, discussing examples of targeted manual generation and automatic generation, including the implications of back-translation. Secondly, we will examine the requirements of the parallel corpora used in the implementation stage to help users take full advantage of MT and also corpora compiled to draw conclusions on MT use by professional translators and regular users. Throughout the talk we will present specific examples where Basque is involved, allowing us to highlight the implications of working with a low-resource minority language.

Dr. Xavier Gómez Guinovart (Universidade de Vigo)

Xavier Gómez Guinovart is an assistant professor at the University of Vigo, where he teaches Computational Linguistics. He is the leader of the research group Tecnoloxías e Aplicacións da Lingua Galega (TALG, in its Galician acronym), in charge of running seminars on Computational Linguistics (http://sli.uvigo.gal). His research interests include linguistic applications of computing, the development of multilingual lexical resources and ontologies, and the construction and exploitation of corpora, both parallel and specialized. Dr. Gómez Guinovart has led numerous projects on technologies applied to the Galician language. He is an active member of research networks, and has organised and assessed scientific and academic activities and journals. He is the editor of the journal Linguamática (http://linguamatica.com), devoted to the computational processing of the languages of the Iberian Peninsula.

Semantic networks in the construction and exploitation of parallel corpora

In this talk, I will explain the research on parallel corpora recently conducted in the Seminars of Computational Linguistics at the University of Vigo. I will focus on the use of lexico-semantic information, as provided by WordNet, in the construction and exploitation of the CLUVI and SensoGal corpora respectively. This combination of resources is possible in both directions, namely, from the parallel corpus to WordNet and from WordNet to the parallel corpus.

On the one hand, a variety of equivalent-extraction techniques can be applied to the parallel corpora to widen the lexical coverage of the wordnets of the languages being aligned. Parallel corpora can also supply WordNet with contexts of use for the concepts compiled in the net, provided that the corpus has previously been processed with a suitable semantic framework.

On the other hand, WordNet may be used in the alignment of parallel corpora at the lexical level as well as in their lexico-semantic annotation. For example, a graph-based technique over the semantic relations in WordNet is used for constructing semantic taggers able to lexically disambiguate parallel corpora. Another resource used for this purpose is the English-language corpus SemCor, semantically annotated by the team who developed the English WordNet at Princeton.

I will attempt to provide a wide overview of the many facets of the research in progress, so that the audience can appreciate the benefits of lexico-semantic annotation in the construction and exploitation of parallel corpora.

Gómez Guinovart, X. & Solla Portela, M.A. (2020). Construction of a WordNet-based multilingual lexical ontology for Galician. In M. J. Domínguez Vázquez, M. Mirazo Balsa & C. Valcárcel Riveiro (Eds.), Studies on Multilingual Lexicography, pp. 179-196. De Gruyter, Berlin and Boston. DOI: https://doi.org/10.1515/9783110607659

Gómez Guinovart, X. (2019). Enriching parallel corpora with multimedia and lexical semantics: From the CLUVI Corpus to WordNet and SemCor. In I. Doval & M. Teresa Sánchez Nieto (Eds.), Parallel Corpora for Contrastive and Translation Studies: New resources and applications, pp. 141-158. John Benjamins, Amsterdam. DOI: https://doi.org/10.1075/scl.90.09gom

Simões, A. & Gómez Guinovart, X. (2018). Extending the Galician wordnet using a multilingual Bible through lexical alignment and semantic annotation. In P. Rangel Henriques, J. P. Leal, A. Menezes Leitão & X. Gómez Guinovart (Eds.), 7th Symposium on Languages, Applications and Technologies (SLATE 2018), pp. 14:1-14:13. Schloss Dagstuhl/Leibniz-Zentrum fuer Informatik, Dagstuhl. DOI: https://doi.org/10.4230/OASIcs.SLATE.2018.14

Gómez Guinovart, X. & Solla Portela, M.A. (2018). Building the Galician wordnet: Methods and applications. Language Resources and Evaluation, 52(1), 317-339. DOI: https://doi.org/10.1007/s10579-017-9408-5

Dr. Signe Oksefjell Ebeling (University of Oslo)

Signe Oksefjell Ebeling is Professor of English language at the University of Oslo, Norway. Her research focuses on corpus-based contrastive analysis on topics such as verb semantics, phraseology and idiomaticity. Her publications include several papers on these contrastive topics as well as the monograph (with J. Ebeling) Patterns in Contrast (2013). She has co-edited several volumes on contrastive analysis and was editor (with H. Hasselgård) of the international journal for contrastive linguistics Languages in Contrast (2014-2019). She has been a member of several corpus teams, including the English-Norwegian Parallel Corpus, its extension the English-Norwegian Parallel Corpus+, and the Oslo Multilingual Corpus. She was a member of the project team on the Computational Processing of Portuguese (now Linguateca). She is currently engaged in the compilation of two comparable corpora: the English-Norwegian Match Report Corpus and the International Comparable Corpus.

Bidirectional parallel corpora: Challenges and possibilities

In this talk I will start by outlining some of the main challenges relating to the use of bidirectional parallel corpora for contrastive research, offering some insights from my own experience of compiling and using parallel corpora of this kind. These challenges notwithstanding, I will then move on to describe the potential of bidirectional parallel corpora and give a snapshot of some of the possibilities they offer. More specifically, I will give examples of different kinds of contrastive studies that have benefitted from the bidirectional corpus design devised by Stig Johansson (Johansson & Hofland 1994). The selection of studies discussed, mainly from my own research, will range from lexical and lexico-grammatical studies of predefined items and patterns in two languages to more exploratory studies of n-grams.