This study uses the root vocabulary as defined by Azkue in his dictionary, which lists 9853 words in this category. Modern philologists will probably dispute some of his entries, and some words known to exist were excluded because of his own political and religious persuasion. However, the list should be reasonably accurate, with perhaps no more than 5% of it in question.
This study is not meant to be a complete historical analysis in its own right. Rather, it explores how useful Azkue's list and a computational analysis strategy can be, and what understanding they might bring directly or lead to through further human interpretation. Its aim is to assist linguists, not to replace them. Only time will tell whether these hopes materialise as real progress in understanding Euskera.
The Method of Analysis
All the words in bold print in the dictionary were typed into the computer. The printouts were checked against the original dictionary, with errors found in about 0.1% of the words. The words were then modified to conform to Batua orthography using the following rules; the number of changes made by each rule is given first:
39 PH P pheskiza (39 groups PH replaced with P, for example pheskiza)
38 TH T artha
30 KH K ekhi
22 RH R erho
18 NH N ainhara
17 LH L zalhu
Palatalisation (in the order shown below)
74 IS= IS ishun
1152 S= X urrus=a, ts=uringa (this count includes TS= replaced with TX)
191 IN= IN makin=a
57 N= N~ n=imin=o
4 IT= IT arrapit=it=a
58 T= TT t=at=ar
322 IL= IL zurrunbil=o
67 L= LL akabal=a
11 D= DD d=and=ar
17 I= I sagarroi=
6 U= U jau=nsi
2 A= A eskala=npo
72 U: U aholku:
21 ! et! (just removed)
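The rule table above can be read as an ordered list of string substitutions. The following is a minimal sketch in Python, not the program actually used for this study. It assumes that the rules are applied to lower-case words, that the six H-digraph rules run before the palatalisation rules (the order between the two groups is not stated above), and that the N rule is triggered by "n=" as in the example n=imin=o. The names DIGRAPH_RULES, PALATAL_RULES and to_batua are invented for this illustration.

```python
# Ordered (old, new) pairs transcribed from the rule table, lower-cased.
# Assumption: the H-digraph rules are applied before the palatalisation rules.
DIGRAPH_RULES = [
    ("ph", "p"), ("th", "t"), ("kh", "k"),
    ("rh", "r"), ("nh", "n"), ("lh", "l"),
]

# The palatalisation rules must keep the order given in the table: for
# example "is=" has to be handled before the more general "s=".
PALATAL_RULES = [
    ("is=", "is"),
    ("s=", "x"),      # also turns "ts=" into "tx"
    ("in=", "in"),
    ("n=", "n~"),     # assumption: the rule is keyed on "n=", as in n=imin=o
    ("it=", "it"),
    ("t=", "tt"),
    ("il=", "il"),
    ("l=", "ll"),
    ("d=", "dd"),
    ("i=", "i"),
    ("u=", "u"),
    ("a=", "a"),
    ("u:", "u"),
    ("!", ""),        # the mark is simply removed
]


def to_batua(word):
    """Apply every substitution rule to one word, in table order."""
    for old, new in DIGRAPH_RULES + PALATAL_RULES:
        word = word.replace(old, new)
    return word


if __name__ == "__main__":
    for sample in ["pheskiza", "ts=uringa", "n=imin=o", "zurrunbil=o", "et!"]:
        print(sample, "->", to_batua(sample))
```

Because each rule is a plain left-to-right replacement, the order of the list is the only thing that decides which rule wins when two patterns overlap, which is why the table order is kept as it stands.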
Once these modifications were made, the word list was submitted to the XUXEN spell checker, which accepted about 4000 of the words and rejected the remainder. The reject list was submitted to linguists on the XUXEN team, who were able to give Batua equivalents for another 500 or more words. The result is a list of 4603 confirmed Batua words and 5250 dialectal words (9853 in total). These two groups are referred to as "Common" and "Uncommon" respectively, as there is no other particular way of classifying them. The subsequent analysis was performed independently on each word group.
The Computer Analysis
A PERL program was designed that took a list of affixes, applied them to each word in the root list, and decomposed the root into all possible structures given the lists of prefixes, suffixes and infixes. The list of decompositions can be accessed through the Table.
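To make the idea of "all possible structures" concrete, here is a minimal sketch in Python rather than the original PERL, which is not reproduced on this page. It assumes that a decomposition has at most one prefix, one suffix and one infix, and that some stem material must always remain; the real program may have allowed more. The function name decompose, the output notation and the sample word and affixes are all made up for illustration and are not taken from the XUXEN or Aulestia lists.

```python
def decompose(word, prefixes, suffixes, infixes=()):
    """List every split of `word` into [prefix-] stem [-infix- stem] [-suffix].

    Assumption: at most one affix of each kind, and a non-empty stem.
    Prefixes are written "ber-", suffixes "-tu" and infixes "-ra-".
    """
    found = []
    prefix_options = [""] + [p for p in prefixes if p and word.startswith(p)]
    for p in prefix_options:
        rest = word[len(p):]
        suffix_options = [""] + [s for s in suffixes if s and rest.endswith(s)]
        for s in suffix_options:
            core = rest[:len(rest) - len(s)]
            if not core:
                continue  # nothing left to serve as a stem
            lead = [p + "-"] if p else []
            tail = ["-" + s] if s else []
            if p or s:
                found.append(" ".join(lead + [core] + tail))
            for inf in infixes:
                if not inf:
                    continue
                # An infix must have stem material on both sides of it.
                i = core.find(inf, 1)
                while i != -1:
                    if i + len(inf) < len(core):
                        parts = [core[:i], "-" + inf + "-", core[i + len(inf):]]
                        found.append(" ".join(lead + parts + tail))
                    i = core.find(inf, i + 1)
    return found


if __name__ == "__main__":
    # Made-up word and affixes, purely to show the output format.
    for line in decompose("bersaltu", ["ber"], ["tu"]):
        print(line)
```

A purely mechanical split like this will propose many decompositions that are not linguistically meaningful, so its output is best treated as raw material for human interpretation.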
The process was run in four stages, each with a larger set of affixes: first, using the list of prefixes and suffixes used by XUXEN; second, using the first list plus the suffixes and prefixes given in the Basque-English dictionary of Gorka Aulestia; third, using the lists from stages 1 and 2 plus the noun declensions from Aulestia's dictionary and Geren~o's book "A New Method for learning Basque", cross-checked with Aulestia; and fourth, using the lists from the previous three stages plus the full set of infixes, prefixes and suffixes from the verb conjugation tables. Many may regard this last stage of processing as meaningless; however, it was done for completeness.
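Each stage therefore works with the union of its own affix source and every earlier one. A minimal Python sketch of that bookkeeping follows; the function name cumulative_stages, the stage labels and the placeholder affixes are hypothetical and do not reproduce the actual lists.

```python
def cumulative_stages(sources):
    """Turn [(label, affixes), ...] into [(label, all affixes so far), ...].

    Duplicates are dropped and the first-seen order is kept.
    """
    stages = []
    seen = []
    for label, affixes in sources:
        for affix in affixes:
            if affix not in seen:
                seen.append(affix)
        stages.append((label, list(seen)))
    return stages


if __name__ == "__main__":
    # Placeholder contents only; the real lists are far longer.
    sources = [
        ("stage 1: XUXEN", ["aa", "bb"]),
        ("stage 2: + Aulestia affixes", ["bb", "cc"]),
        ("stage 3: + noun declensions", ["dd"]),
        ("stage 4: + verb conjugation tables", ["ee", "aa"]),
    ]
    for label, affixes in cumulative_stages(sources):
        print(label, len(affixes), affixes)
```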
A Table of the statistical results is available.
The word lists are made available by selecting the numbers in the Table.
Notes on reading the word lists:
1. In the word lists, ~ stands for the letter n~ (n with a tilde). It is easier for us to process the data with a single character than with this special letter.
This project will be improved by constructive criticism: suggested improvements or variants to the processing method, commentary on the word lists, descriptions of how they are or could be used or improved, and corrections of errors.
No right to use the lists is to be assumed without the express permission of the author. The author reserves Copyright 1996.