What we call our Reference Corpus is a corpus of prose writings that appeared in print between 2000 and 2007. Altogether it contains some 25.1 million words, of which 13.1 million are drawn from 287 books chosen for their quality and 12 million come from newspaper articles published in Spain (Berria) and in France (Herria).
It is a closed corpus: after numerous trials, our team of researchers concluded that, given our objectives, the additional information to be gleaned from going beyond 25 million words would not be significant and would only add to the work and time spent on data processing. In any event, other linguistic corpora with different features will be added to the Institute for the Basque Language's website in the future.
This corpus makes it possible to look up words as they are used today by authors writing in Basque. Each word looked up appears in context, within a full sentence. Also shown are its frequency (the number of times the word appears in books and in newspapers), the author, and the title and page of the publication.
The information in this corpus has made possible various academic works, among them:
- Dictionary of Standard Basque in Contemporary Prose / Hiztegi Batua Euskal Prosan
- Dictionary of Contemporary Basque / Egungo Euskararen Hiztegia
- The Lexicon, Past and Present / Lexikoa, Atzo eta Gaur
These and other works are all available on the Institute for the Basque Language's website.
This part of the project was partially funded by the City Council of San Sebastián and the Gipuzkoa Provincial Council.