A project to develop a Translator's Workbench that gives seamless access to multiple source documents regardless of their geographical location and storage method. This project is being done in conjunction with colleagues at Informatika Fakultatea, Unibertsitatea Euskal Herria.
2. Conversion of Paper Dictionaries to Databases
This work is part of project 1 above and involves converting three multilingual paper dictionaries to databases. They are:
Basque-English dictionary of G. Aulestia,
English-Basque dictionary of G. Aulestia and L. White,
Basque-Spanish-French dictionary of R.M. Azkue.
The process consists of scanning the dictionary and then converting the scanned files into text files using OCR software. The text files are then checked for correct layout but not spellchecked. Next, the files are read by a custom-designed parser that divides each entry into its distinct semantic fields, and the resulting record is written into a database with those fields separated out. The description of the database's construction can be found in the draft paper "A Structure and Query Language for Federated Multilingual Dictionary Databases".
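The parsing and loading step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual parser: the entry layout ("headword, part of speech, glosses separated by semicolons"), the table names entry and gloss, and the function names parse_entry and load_entries are all hypothetical.

```python
import re
import sqlite3

# Hypothetical entry layout: "<headword> <part of speech>. <gloss>; <gloss>"
ENTRY_RE = re.compile(r"^(?P<headword>\S+)\s+(?P<pos>[a-z]+)\.\s+(?P<glosses>.+)$")


def parse_entry(line):
    """Split one OCR text line into fields; return None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    glosses = [g.strip() for g in match.group("glosses").split(";") if g.strip()]
    return {
        "headword": match.group("headword"),
        "pos": match.group("pos"),
        "glosses": glosses,
    }


def load_entries(lines, conn):
    """Write parsed entries to the database; return the lines that failed to parse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry "
        "(id INTEGER PRIMARY KEY, headword TEXT, pos TEXT)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS gloss (entry_id INTEGER, gloss TEXT)")
    rejected = []
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no entry
        record = parse_entry(line)
        if record is None:
            rejected.append(line)  # keep for manual layout checking
            continue
        cursor = conn.execute(
            "INSERT INTO entry (headword, pos) VALUES (?, ?)",
            (record["headword"], record["pos"]),
        )
        conn.executemany(
            "INSERT INTO gloss (entry_id, gloss) VALUES (?, ?)",
            [(cursor.lastrowid, g) for g in record["glosses"]],
        )
    conn.commit()
    return rejected
```

Keeping the glosses in a separate table means that an entry with several translations does not need a fixed number of columns. Lines that do not match the assumed layout are returned rather than dropped, so they can be checked by hand.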
3. Relatedness of Natural Languages
This project aims to develop a measure of the relatedness of natural languages using the frequency and pattern of use of diachronic phonological rules in the transformation of older parent languages into more recent daughter languages. The method uses a large corpus of words known both in their old form and in their more recent form in at least two languages or dialects. The rules for transforming each word into each recent language are constructed. The rule sequences for the words of one language are used to construct a Probabilistic Finite State Automaton (PFSA). This automaton is then reduced to a minimum form using the Minimum Message Length encoding criterion. Algorithms have been developed for searching the space of likely compressed PFSAs. The dialect with the smallest reduced PFSA is taken to be the one most closely related to its parent. At the moment a study is being conducted on two modern Chinese dialects, Cantonese and Mandarin, using about 2700 words from Middle Chinese, the parent language of both dialects.
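As a rough illustration of the first step, the sketch below builds an unreduced prefix-tree PFSA from rule sequences and computes how many bits are needed to encode those sequences using the observed arc frequencies. It is a simplification under stated assumptions: the rule labels (for example "r1") and the END marker are hypothetical, the cost covers only the data part of a message and not the cost of stating the automaton, and the state-merging search that performs the actual Minimum Message Length reduction is not shown.

```python
import math
from collections import defaultdict

END = "$"  # hypothetical end-of-word marker


def build_prefix_pfsa(sequences):
    """Build a prefix-tree automaton: {state: {symbol: [next_state, count]}}."""
    trans = defaultdict(dict)
    next_id = 1  # state 0 is the start state
    for seq in sequences:
        state = 0
        for sym in list(seq) + [END]:
            if sym not in trans[state]:
                trans[state][sym] = [next_id, 0]
                next_id += 1
            trans[state][sym][1] += 1  # count how often this arc is used
            state = trans[state][sym][0]
    return dict(trans)


def transition_probabilities(pfsa):
    """Estimate each arc's probability as its relative frequency in its state."""
    probs = {}
    for state, arcs in pfsa.items():
        total = sum(count for _, count in arcs.values())
        probs[state] = {sym: count / total for sym, (_, count) in arcs.items()}
    return probs


def data_cost_bits(pfsa):
    """Bits needed to encode the training sequences with those probabilities."""
    bits = 0.0
    for arcs in pfsa.values():
        total = sum(count for _, count in arcs.values())
        for _, count in arcs.values():
            bits -= count * math.log2(count / total)
    return bits
```

For example, build_prefix_pfsa([("r1", "r2"), ("r1", "r3"), ("r1", "r2")]) gives a start state with a single arc for r1, followed by a state that branches on r2 and r3. A reduction procedure would then try merging states and keep a merge only when the total message length decreases.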
4. Investigation of the Core Vocabulary of the Basque Language - Euskara
Euskara is a highly agglutinative language, so most words are relatively easily decomposed into their component parts. R.M. Azkue produced a dictionary at the turn of the 20th century in which he annotated the words that he thought were "original" words in the language. These words have been processed by computer programs to determine their component parts and to derive a canonical list of words. More details can be found in my Basque page.
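One simple way to decompose a word into component parts is to enumerate every split of the word into items from a known morpheme list. The sketch below is only an illustration of that idea and is not the program used in this study; the function name segmentations and the tiny morpheme inventory in the usage note are hypothetical.

```python
def segmentations(word, morphemes):
    """Return every way to split word into a sequence of known morphemes."""
    if word == "":
        return [[]]  # one way to split the empty string: no parts
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in morphemes:
            # Keep this prefix and split the remainder recursively.
            for rest in segmentations(word[i:], morphemes):
                results.append([head] + rest)
    return results
```

With the inventory {"etxe", "a", "k", "ak"}, the word "etxeak" ("the houses") yields two candidate splits, etxe+a+k and etxe+ak, which shows why a further step is needed to choose a canonical decomposition.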
5. Chronology of Basque Flora and Fauna Words
This is an idea to determine how consistent the chronology of plant communities entering the Basque Country since the last glaciation is with the development of the words for those plants. The strategy is to use studies of pollen records from archaeological and geological excavations to determine the sequence of appearance of the major forest types over the last 10,000 years. The aim is then to arrange the words for the principal trees (and other plants) in the same relative sequence and determine whether the words for the later plant types are compositions of the words for the earlier plant types.
Progress to date: the pollen sequence has been determined from current publications. A study of the word compositions now has to be done.
6. Investigation of the Descriptive and Distancing Language used by Violent Men
This study aims to determine whether changes in the behaviour of men can be adequately detected by changes in the type of language they use. The study is of about 20 couples in which the man has sought assistance with changing violent behaviour towards his spouse and family. The man and his partner are each given an unstructured interview of about 30 minutes' duration immediately before and after the workshop and again 3 months later. The workshop is called a stage 2 workshop because each of the men must have completed a 29-hour stage 1 workshop before entering this workshop program. The interviews are tape-recorded and converted into text transcripts. The transcripts are analysed for three general forms of language, namely content, visual/auditory/kinesthetic language, and grammatical constructs that represent distancing from the people subjected to the man's violence. We are looking for changes in subliminal aspects of language that indicate a wider use of visual/auditory/kinesthetic language and less distancing. The workshops are run by myself and the partners are interviewed by my co-worker Kay O'Connor.