Patterns of Frequency in the Basque Lexicon (PFBL)

A Guide to Using Patterns of Frequency in the Basque Lexicon (PFBL) Correctly

Introduction

This application is a rich dictionary of frequency patterns for modern Basque words. Based on twenty-first-century texts, it measures many things, all of them related to word patterns. Although it may appear quite complex at first glance, it is actually straightforward to use for anyone interested in the topic. Clicking the "data" section shows the basic data which appear on this introductory page. There you can see that the foundation of the application is a corpus of 22,704,373 words which has 53,310 lemmas, along with further data.

There are two ways to get information: starting with data or starting with words. Either way, the same system is used: the data sought appear on the right-hand side, but we must first use the left-hand column to choose what kind of information we want and so find precisely what we are looking for.

First, on the right-hand side of the page that comes into view we see "Order the result" and "Details of the result", each with two choices below. Under the details we see "abbreviated" and "complete". These control how much of the information we are looking for is displayed: a basic version ("abbreviated") or a comprehensive one ("complete"). The second offers extensive information, but the result is larger and harder to read.

Therefore, we specify what kind of information we want on the left-hand side and on the right-hand side we see the information organised ("by frequency" or "by alphabet") and arranged ("abbreviated" or with the full details, "complete") according to how we choose it.

One Example to Test This

On the left-hand side of the option "From data to words", in the FREQUENCY section, we see "Number of appearances", "Per million words" and "Logarithm".

Only the first of these will be explained here, and only as regards the main data, since the other two work in exactly the same way: once the first is understood, the others are not at all difficult to understand.

"Number of appearances" offers several figures. We learn that at least one word in that corpus of 22,704,373 words is repeated 987,639 times, and that at least one other appears only once. We also learn that words are repeated 60 times on average (some of them far more, others much less).

Let's say that we want to see which word (or words) is repeated 987,639 times. We click "Number of appearances" on the left-hand side, and the right-hand window refreshes immediately. There we can enter a "maximum" and a "minimum". If we enter the number 987,639 in the "maximum" box, we are asking the application to display every word which appears between 1 and 987,639 times: because we have entered a maximum but not a minimum, the application has automatically selected everything which appears from once upwards. The application will need some time to do this because it is being asked for a lot of information: which words appear only once, twice, three times, four times, and so on, up to 987,639 times. We will thus see that "eta" ("and"), "ez" ("no") and "da" ("is") are the most repeated words in the corpus, and these appear at the top of the list:

Here we have just copied the first three words which appear on the first page. The application tells us that we can see the full information on 7,556 pages. Furthermore, the list of words is organised by frequency, but if we click "by alphabet" then we will see the same words listed in alphabetical order.

We can now do a test the other way around: seeing what appears when we enter 987,639 in the "minimum" box. Now, only "eta" appears and the process is much quicker:

This also returns more information: how many syllables each word has, and what the word's neighbors are.

Some Other Searches

The basic search has been described above. From here on, anyone wanting to search for frequencies only has to select the appropriate options. In the example above, for instance, one can select ranges: if we search for words which are repeated a maximum of 60,000 times and a minimum of 24,000 times, the application will return every word in that range. If one then puts the cursor over any of the column headings, for example "frequency", a note will appear explaining what it means. This is very useful information for understanding, for example, which words Basques use most frequently (it could be a valuable means of, for instance, determining levels of Basque with real word data).
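The three FREQUENCY measures are related by simple arithmetic. The sketch below assumes that "Per million words" is the raw count scaled to one million corpus words and that "Logarithm" is the base-10 logarithm of the raw count; the guide states neither definition, so both are assumptions, and the function names are hypothetical.

```python
import math

CORPUS_SIZE = 22_704_373  # total words in the corpus, as stated in the guide


def per_million(appearances: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Scale a raw count to occurrences per million corpus words."""
    return appearances / corpus_size * 1_000_000


def log_frequency(appearances: int) -> float:
    """Base-10 logarithm of the raw count (the base is an assumption)."""
    return math.log10(appearances)


top_count = 987_639  # the highest number of appearances quoted in the guide
print(round(per_million(top_count), 1))
print(round(log_frequency(top_count), 3))
```

Under these assumptions, the most frequent word accounts for roughly 43,500 occurrences per million words, that is, a little over 4% of the corpus.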

The same system is used to receive data according to "Per million words" and "Logarithm" beneath the "Number of appearances" choice in the option "From data to words". In both cases, it is best to close that search window on the right-hand side when we have finished with it. That way, the right-hand side will always be clear for another search. However, nothing goes wrong if it is left open: the criteria are simply combined and a more precise search is returned. Let's say, for example, that we want to know which words appear a maximum of 2,000 times and a minimum of 100 times, and at the same time we want to know which of these words have a minimum of 6 and a maximum of 8 syllables. One can ask for all this information at the same time:

And this information will automatically appear:

As one can see, it is very easy for anyone to get data from the application.
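Combining criteria amounts to keeping only the entries that satisfy every open search window at once. The sketch below illustrates this with the frequency and syllable limits from the example above; the `Entry` type, the function names and all four entries are hypothetical (the words are made-up placeholders, not corpus data).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entry:
    word: str
    appearances: int
    syllables: int


# Entirely invented entries, used only to exercise the filter.
ENTRIES = [
    Entry("hitz_a", 1500, 7),  # inside both ranges
    Entry("hitz_b", 1500, 3),  # too few syllables
    Entry("hitz_c", 50, 6),    # too few appearances
    Entry("hitz_d", 100, 8),   # on the edge of both ranges
]


def between(value: int, limits: Tuple[int, int]) -> bool:
    """True if value lies within the inclusive (minimum, maximum) limits."""
    minimum, maximum = limits
    return minimum <= value <= maximum


def combined_search(entries: List[Entry],
                    appearances: Tuple[int, int],
                    syllables: Tuple[int, int]) -> List[str]:
    """Keep the words that satisfy both criteria at the same time."""
    return [e.word for e in entries
            if between(e.appearances, appearances)
            and between(e.syllables, syllables)]


print(combined_search(ENTRIES, appearances=(100, 2000), syllables=(6, 8)))
```

The limits are treated as inclusive here, which is an assumption about how the application handles words that sit exactly on a boundary.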

The same information is available in the ORTHOGRAPHIC STRUCTURE section: this is used if we want to know the number of appearances of words and what these words are, or the number of syllables and what words have that number of syllables. Let's say we want to see words which have a maximum of 12 and a minimum of 11 syllables. If we set these limits the information begins thus:

It is perhaps a little harder to figure out what the "Consonant-Vowel structure" and "Syllabic structure" are. Here, we ourselves have to specify what we want to search for. Let's say we want to look for Basque words which have a VVCCV pattern, in other words, words which start with two vowels, followed by two consonants, and which end with a vowel. To look for this information, we click on "Consonant-vowel structure" and a window will open on the right-hand side as usual. There we have to write "VVCCV" (with no commas or other separators). The information will appear immediately. The following appears (only a part of which is copied here):

Those are Basque words which have that pattern (that is, words which appear in our corpus). We can go further still: we can tell the application that we want to know which words end that way, whatever they contain beforehand. In order to do so, one need only enter the following: "%VVCCV". And this will appear (this is how the information begins):

As one can see, the result gives all the previous words, plus those words which contain something else beforehand (the word "herrialde" or "territory", for example, was not on the previous information list).
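To make the two searches concrete, here is a minimal sketch of how a consonant-vowel pattern could be derived and matched. It assumes that each letter is mapped on its own ("a", "e", "i", "o", "u" to V, everything else to C), so that "rr" counts as two consonants, which is consistent with "herrialde" matching "%VVCCV" above. It also assumes that "%" stands for any sequence of characters, including none. The function names are hypothetical, and "aulki" ("chair") is used purely as an illustration, not as a quoted result.

```python
import re

VOWELS = set("aeiou")


def cv_structure(word: str) -> str:
    """Map every letter to V (vowel) or C (anything else)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern in which '%' means 'anything' into a regex."""
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts))


def matches(pattern: str, word: str) -> bool:
    """True if the word's whole CV structure fits the pattern."""
    return wildcard_to_regex(pattern).fullmatch(cv_structure(word)) is not None


print(cv_structure("aulki"))      # VVCCV
print(cv_structure("herrialde"))  # CVCCVVCCV
print(matches("VVCCV", "herrialde"))   # False: extra material comes first
print(matches("%VVCCV", "herrialde"))  # True: only the ending must fit
```

Because the whole structure must fit the pattern, "VVCCV" on its own only finds five-letter words, while the leading "%" relaxes the start of the word.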

In the "Syllabic structure" section, a word must be entered divided up into syllables, for example "ka-tu-a":

Yet one can also do more complex searches. Let's say, for example, that we are interested in getting any words which contain the text "ka" and the syllable "re". Thus, we would have to enter the following in the window: "%ka%-re-%". In other words, anything before or after that "ka", which would then be followed by the "re" syllable, and then finally anything else. If we ask for this, the following appears (as before, this is just the beginning of the list):

The "Repeated letters?" option gives straightforward information: whether a word in the corpus contains a repeated letter (in which case "1" should be selected) or not ("0"). Suppose we want to know which words do not repeat any letter. We make the second selection and this is the result:

Of course, we can also do a more detailed search. For example, let's say we want to look for words which do not contain repeated letters, but these words must appear a minimum of 4,000 times and be between 4 and 10 letters long. The search must be done in the following way:

And the list will appear thus:

As one can see, this list is different from the previous one.
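The detailed search above can be sketched as three checks applied together. This is a sketch under stated assumptions: "repeated" is taken to mean that some letter occurs more than once anywhere in the word (not necessarily side by side), length is counted in characters, and the function names are hypothetical. In the sample data, only the count for "aitak" is quoted in this guide; the other counts are invented.

```python
def has_repeated_letters(word: str) -> int:
    """Return 1 if any letter occurs more than once in the word, else 0."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    return 1 if len(set(letters)) < len(letters) else 0


def passes(word: str, appearances: int) -> bool:
    """No repeated letters, at least 4,000 appearances, 4 to 10 letters."""
    return (has_repeated_letters(word) == 0
            and appearances >= 4000
            and 4 <= len(word) <= 10)


# (word, appearances). Only the count for "aitak" comes from the guide;
# the counts for "gona", "dio" and "erosi" are invented placeholders.
SAMPLE = [("aitak", 4578), ("gona", 5000), ("dio", 9000), ("erosi", 120)]

print([word for word, count in SAMPLE if passes(word, count)])
```

Here "aitak" is excluded because "a" occurs twice, "dio" because it is too short, and "erosi" because of its low placeholder count, which leaves "gona".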

If one selects the "The word itself" option, information about a word is returned. In the case of "etxe" ("house"), for example, the following will appear:

The information received can be very useful in order to do several other tasks. For instance, if we want to get a list of words which end in the letters "ti", there are two ways of going about this:

1) In the "OTHERS" section select "The word itself" and in the "text" window which appears on the right-hand side enter "%ti". The list will appear immediately.

2) The second method offers more options. First, select the "Syllables and n-grams" tab on the left-hand side and from there select "Syllables". As always, a new window will open on the right-hand side. There, write "ti" in the "syllable" box. What information then, compared to the previous method, does this return? Well, you can search for any place in which that "ti" appears: at the end, at the beginning or in the middle of a word.
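The difference between the two methods can be sketched as follows. The first function mirrors the "%ti" text pattern; the second looks for "ti" as a whole syllable and reports where it occurs. The function names are hypothetical, and the hyphenated syllable divisions of the example words are supplied by hand as an assumption, not taken from the application.

```python
from typing import List


def ends_with_text(word: str, text: str) -> bool:
    """Method 1: the "%ti" text pattern, i.e. the word ends in these letters."""
    return word.endswith(text)


def syllable_positions(syllabified: str, syllable: str) -> List[str]:
    """Method 2: report where a whole syllable occurs in a hyphenated word."""
    parts = syllabified.split("-")
    last = len(parts) - 1
    positions = []
    for index, part in enumerate(parts):
        if part != syllable:
            continue
        if index == 0:
            positions.append("beginning")
        elif index == last:
            positions.append("end")
        else:
            positions.append("middle")
    return positions


print(ends_with_text("beldurti", "ti"))        # True
print(syllable_positions("bel-dur-ti", "ti"))  # ['end']
print(syllable_positions("po-li-ti-ka", "ti")) # ['middle']
print(syllable_positions("ti-pu-la", "ti"))    # ['beginning']
```

In this sketch a one-syllable word is reported as "beginning"; how the application itself classifies such a case is not stated in the guide.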

More Complex and Complete Searches

Let's now say that we want a list made up of words which fulfil the following criteria: between 4 and 6 syllables (a minimum of 4 and a maximum of 6) and ending in the syllable "-ti". We would click "Number of syllables" and "The word itself" and ask for the information like this:

Of course, if we want all the words to have exactly 6 syllables, we would have to ask for both a maximum and a minimum of 6. Let's look at what the aforementioned search returns (as always, only the initial words are listed here):

Although they have not been mentioned, one can make other selections in the searches: for example, if we want to know which of these words is a noun then we click on the "Morphology" tab on the left-hand side, then "POS category" and from the options there enter "ize". The following information now appears (we have now done a complete word search: all those words in the corpus which fulfil all the conditions mentioned):

All of these searches have been chosen on the basis of "abbreviated details". As an alternative to this, by selecting "complete details" on the right-hand side, we see a lot more information about each word: number of appearances, frequency, logarithm, length, syllables, CV structure, OUP, neighbors, neighbors 1, neighbors 2, neighbors 3, neighbors 4, neighbors +, and so on.

Here, simply by moving the cursor over the relevant term, you will immediately see that there is much more information: what the "neighbors" are, etc.

From words to data

So far, we have just started from data and the information this gives about words, arranged by frequency or alphabetically. But we can also do this the other way around, starting just from text or words. In order to do so, on the opening page you must click on "From words to data". This way, we will search for the complete information on every word.

Here, in the box provided on the left-hand side, we can write either just one word or a piece of text: for example, if we write "aitak amari gona gorri erosi dio" ("dad bought mum a red dress"), we get the following information:

In our corpus, these words appear as follows: "aitak" 4,578 times, "amari" 1,286 times, etc. You can enter any kind of text, including text uploaded from a file.
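Going from words to data is, in essence, a lookup of each word of the input text in the frequency table. The sketch below assumes that the text is split on whitespace and lower-cased, which the guide does not state; the function name is hypothetical, and the toy table holds only the two counts quoted above, so the remaining words come back without a value.

```python
from typing import Dict, List, Optional, Tuple

# Only the two counts quoted in the guide; the real table is far larger.
COUNTS: Dict[str, int] = {"aitak": 4578, "amari": 1286}


def words_to_data(text: str,
                  counts: Dict[str, int]) -> List[Tuple[str, Optional[int]]]:
    """Pair every word of the text with its count (None if unknown here)."""
    return [(token, counts.get(token)) for token in text.lower().split()]


for word, count in words_to_data("aitak amari gona gorri erosi dio", COUNTS):
    print(word, count if count is not None else "(not in this toy table)")
```

The same function could be applied to the contents of an uploaded file by reading the file into a string first.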

Finally, we should note that you can receive all of the information here in files, with the ".txt" extension.
