Goenkale Corpus

Goenkale is a Basque TV series that has aired without interruption since 1994 on the Basque TV channel ETB. Episode number 3,000 was shown in 2010, making Goenkale one of the longest-running series in Europe. This corpus was built from the sequences of text used in the series since its inception. The corpus contains the following:

  • Number of episodes: 2,418
  • Sequences: 38,821
  • Number of dialogs: 805,796
  • Number of words: 11,000,000
  • Number of words taken from dialogs: 7,700,000
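
As a quick sanity check, the figures above can be turned into a few derived ratios. The sketch below uses only the numbers in the list; the variable names are illustrative, and the two word counts are rounded in the source, so the results are approximate.

```python
# Figures copied from the list above (word counts are rounded in the source).
episodes = 2_418
sequences = 38_821
dialogs = 805_796
total_words = 11_000_000
dialog_words = 7_700_000

# Derived ratios; approximate, because the word counts are rounded.
dialog_share = dialog_words / total_words        # fraction of all words that come from dialogs
words_per_dialog = dialog_words / dialogs        # mean number of words per dialog
dialogs_per_episode = dialogs / episodes         # mean number of dialogs per episode
sequences_per_episode = sequences / episodes     # mean number of sequences per episode

print(f"Share of words taken from dialogs: {dialog_share:.0%}")
print(f"Mean words per dialog:             {words_per_dialog:.1f}")
print(f"Mean dialogs per episode:          {dialogs_per_episode:.0f}")
print(f"Mean sequences per episode:        {sequences_per_episode:.1f}")
```

Roughly 70% of the words come from dialogs, and each dialog averages fewer than ten words. That average suggests a "dialog" here is a single line or turn rather than a whole conversation, although the description does not define the unit.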

The main interest of this corpus lies in the dialogs. Large collections of text that correspond to conversation and dialog are very difficult to find. Moreover, this series has a distinctive property: its dialogs, written by specialists in Basque, reflect natural everyday language (as its viewers acknowledge). With almost 8 million words of dialog, this is an important corpus.