Research Interests

of

Prof Jon Patrick

1. NATURAL LANGUAGE PROCESSING

Natural language processing has had a checkered history, with phases of enthusiasm and despair over the last 40 years about the prospects for effective and comprehensive computational analysis. Despite the failures, natural language processing has proven effective in restricted domains, and we are now seeing the emergence of a language industry in Europe, driven by the need for document translation in the European Union. The field of study is very large and draws on many diverse subfields of Linguistics, Psychology and Computer Science. It has now matured to the point that the expertise of Information Systems specialists is needed to manage the projects and systems that have been developed.

1.1. Semantic Analysis of Text

The aim of this project is to develop a general purpose program for the semantic analysis of text. Pilot studies are oriented towards developing a suite of general resources that can be bound together to support a variety of domain studies, such as psychotherapy interviews, health management and classroom teaching.

1.2. Study of subliminal language

Subliminal language comprises the components of language that are below conscious awareness. These are typically the grammatical constructions people use, the systematic features of their descriptions and their preferred styles of expression. These features are said to vary according to a person's thinking strategies and their emotional state at the time of speaking. This project aims to construct a semantic analyser to identify and annotate the subliminal components of spoken text.

Topic: Change in language due to Therapy processes

We have a collection of interviews with men before and after they underwent a program of psychotherapy intended to change their experiences of, and consequent behaviors relating to, violence in their upbringing. The aim is to determine whether the subliminal language of the men has changed as a result of the therapeutic workshop.

1.3. Electronic Dictionary Databases

Electronic dictionaries emerged from early machine translation research, which discovered that large lexical stores holding specific grammatical information were necessary and that the paper dictionary represented an immediate supply of some of this information. However, converting paper dictionaries into electronic formats is a difficult task. First, they have to be read using a scanner and then converted into text using OCR software. This process typically misidentifies 1-5% of characters, which therefore requires extensive spellchecking. Subsequently, the dictionary entries have to be loaded into a database. This is also a difficult problem, as the structure of the data is ill-defined and not always consistent across all dictionary entries. In addition, attribute boundaries are most often implied by text formatting rather than marked by explicit symbols. If the dictionary information is to be used as a lexical database, the word list is extracted from the dictionary, perhaps with some other valuable information, and a linguist then adapts that data for these other uses.
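The point about attribute boundaries being implied by formatting can be illustrated with a small sketch. It assumes, purely for illustration, that the OCR stage preserves typography as simple tags, with bold marking the headword and italics marking the part of speech; the tag scheme, the field names and the sample entry are hypothetical and not taken from any of the dictionaries in this project.

```python
import re

# Hypothetical convention: <b>...</b> marks the headword, <i>...</i> marks
# the part of speech, and the rest of the line is the definition.
ENTRY_PATTERN = re.compile(
    r"<b>(?P<headword>[^<]+)</b>\s*"
    r"<i>(?P<pos>[^<]+)</i>\s*"
    r"(?P<definition>.+)"
)


def parse_entry(line):
    """Split one formatted entry into fields, or return None if it is irregular."""
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        return None  # irregular entry: leave it for manual inspection
    return {name: value.strip() for name, value in match.groupdict().items()}


print(parse_entry("<b>etxe</b> <i>n.</i> house, home"))
print(parse_entry("etxe n. house, home"))  # formatting lost, so no fields found
```

Entries that do not fit the pattern are returned as None rather than guessed at, which reflects how much of the structure is carried by formatting alone.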

The nature of the data in a dictionary means that dictionary databases have special structures due to the highly irregular organisation of their data. This complexity has proven difficult to grapple with over the years, and many ad hoc storage structures have been proposed, along with the shoe-horning of data into standard structures such as relational databases. No specialised structures and retrieval functions suited to dictionary and lexical databases have been developed; this is the target of this research, in the context of creating a worldwide federation of on-line dictionary databases.

Topic: Federation of Dictionary Databases

A special retrieval language has been designed to act as a lingua franca for sharing data between dictionary databases and to allow users to retrieve data without having to learn a formal query language. This project aims to implement the retrieval language and to use it to share data across multiple databases.

1.4. Knowledge Management

The management of knowledge is increasing in importance for both corporate and organisational needs. Corporations are motivated to identify the Intellectual Property held within their organisation, both in formal records and in the informal knowledge of their staff, so as to develop its commercial potential. More broadly, government and research institutions are motivated to increase the speed of sharing, distributing and retrieving the knowledge relevant to completing a specific task.

Two principal strategies are used for KM: Information Retrieval methods applied to extant documentation, and formal knowledge elicitation and representation in crafted data structures.

The development of the field will take at least two significant directions, i.e. the automatic extraction of knowledge from text stores, which requires a big improvement in Natural Language Processing techniques, and the identification of commercial opportunities from the knowledge extraction results.

1.5. Natural Language in Information Systems

The aim of this project is to review the use of natural language in information systems and to describe and catalogue the information systems facets that are needed for Natural Language Processing projects.

2. MACHINE LEARNING SOFTWARE TOOLS - DATA MINING

There is a great number of machine learning programs that can now be classified as Data Mining Tools. This diversity derives from different conceptualisations of the structure of the data and of the structures that are perceived to exist in the data. Until the advent of large volumes of data derived from historical databases, Machine Learning was an esoteric field of study. Now, however, it holds out the promise of saving corporations significant expense if hitherto unknown structures in data can be found. Data Mining does not have to be restricted to corporate databases. The analysis of corpora is important in the development of tools for effective natural language processing, and the field is awaiting the innovative application of data mining techniques.

2.1. Development of Probabilistic Finite State Automata (PFSA) and Push Automata for Inductive Inference of Structure in Data

This project has the aim of adapting the theory of PFSA to deal with a mixture of probabilistic and non-probabilistic data.
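As a minimal illustration of the kind of model involved, the sketch below represents a PFSA as a table of states whose outgoing transitions carry probabilities, and scores a symbol sequence by multiplying the probabilities along its path. The states, symbols and probabilities are invented for illustration and are not taken from the project; the extension to a mixture of probabilistic and non-probabilistic data is not attempted here.

```python
import math

# Hypothetical PFSA: state -> {symbol: (next_state, probability)}.
# All states, symbols and probabilities are illustrative assumptions;
# the outgoing probabilities of each state sum to 1.
PFSA = {
    "start": {"headword": ("head", 1.0)},
    "head": {"pos": ("pos", 0.8), "definition": ("defn", 0.2)},
    "pos": {"definition": ("defn", 1.0)},
    "defn": {"definition": ("defn", 0.3), "end": ("final", 0.7)},
}


def sequence_log_prob(pfsa, symbols, start="start"):
    """Return the log-probability of a symbol sequence, or -inf if rejected."""
    state = start
    log_p = 0.0
    for symbol in symbols:
        transitions = pfsa.get(state, {})
        if symbol not in transitions:
            return float("-inf")  # no such transition: sequence not accepted
        state, p = transitions[symbol]
        log_p += math.log(p)
    return log_p


entry = ["headword", "pos", "definition", "end"]
print(math.exp(sequence_log_prob(PFSA, entry)))
```

Working in log-probabilities avoids numerical underflow on long sequences, which matters once such a model is applied to real text rather than a four-symbol example.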

Topic: Intelligent Self-Learning Parser-editor

This project aims to build a parser-editor that can be trained to identify the structure of dictionary entries and can learn from examples to parse unseen entries. The software will need to cope with erroneous, missing and irregularly formatted data, intelligently prompt a user to intervene in the parsing process, and allow and record irregular structures. The basic program will be the underlying processor for a series of software functions that are needed by lexicographers. For example, the organisation Macquarie Online, which publishes the Macquarie Dictionary, has a requirement for: 1. a function to automatically tag the entries for subject code categories aligned with topics studied in the high school curriculum; 2. a function to determine the grammatical structures and collocations of words and other linguistic phenomena.

2.2. Decision Trees & Graphs

Decision Graphs (DGs) are an extension of Decision Trees that allows network structures, and they are an important technique for supervised learning and the identification of knowledge. New heuristic variations of DG technology are being actively developed; however, their analytical foundations are poorly understood. We aim to put DG systems to more effective use in NLP.
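The difference between a tree and a graph can be seen in a small sketch: in the structure below, the node "check_suffix" is reached from two different branches, a join that a tree cannot express without duplicating the subtree. The node names, attributes and class labels are invented for illustration, and the graph is written by hand rather than learned from data.

```python
# Hypothetical decision graph for tagging a token.
# Each internal node maps to (attribute tested, {attribute value: next node}).
# Any name that is not a key of GRAPH is a leaf, i.e. a class label.
GRAPH = {
    "root": ("capitalised", {True: "check_position", False: "check_suffix"}),
    "check_position": (
        "sentence_initial",
        {True: "check_suffix", False: "PROPER_NOUN"},
    ),
    "check_suffix": ("ends_in_ing", {True: "VERB", False: "NOUN"}),
}


def classify(graph, item, node="root"):
    """Follow attribute tests from `node` until a leaf label is reached."""
    while node in graph:
        attribute, branches = graph[node]
        node = branches[item[attribute]]
    return node


token = {"capitalised": True, "sentence_initial": True, "ends_in_ing": False}
print(classify(GRAPH, token))
```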

3. WORKFLOW AND REVISION CONTROL FOR THE REGENERATION OF MULTIMEDIA SYSTEMS

To create efficiencies in the production of revisable multimedia systems, it is necessary to define processes for the control of content revision and regeneration, together with workflow control of these processes. A model for managing Multimedia Run-time Systems (MRS) is presented as consisting of three parts: a revision control strategy for managing primary resources; regeneration processes that move data from one process to the next, incorporating derivative resources on the way and ultimately producing run-time resources; and a workflow control process to regulate and maintain the integrity of the regeneration process.
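One way to picture the regeneration process is as a dependency graph in which a derivative or run-time resource is rebuilt only when something it depends on has been revised. The sketch below is an assumption about how such a workflow check might look, not the MRS model itself; the resource names are hypothetical.

```python
# Hypothetical resources: each maps to the resources it is generated from.
# Primary resources have no dependencies.
DEPENDS_ON = {
    "dictionary.txt": [],
    "grammar.txt": [],
    "dictionary.index": ["dictionary.txt"],
    "cross_refs.db": ["dictionary.index", "grammar.txt"],
    "runtime.img": ["cross_refs.db"],
}


def stale_resources(depends_on, revised):
    """Return, in regeneration order, the resources affected by `revised`."""
    stale = []
    seen = set()

    def visit(name):
        # True if `name` was revised or depends on something that was.
        if name in seen:
            return name in revised or name in stale
        seen.add(name)
        dirty = name in revised
        for dep in depends_on[name]:
            if visit(dep):
                dirty = True
        if dirty and name not in revised:
            stale.append(name)  # appended only after all of its dependencies
        return dirty

    for name in depends_on:
        visit(name)
    return stale


print(stale_resources(DEPENDS_ON, {"grammar.txt"}))
```

Because each resource is appended only after its dependencies have been visited, the returned list can be regenerated from left to right without using an out-of-date input.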

Topic: Regeneration of MMS for 2nd Language Learning

The processes of second language learning and translation are difficult and laborious, and can be made significantly more comfortable with fast computer-based aids. A case study of one approach to tackling these problems is presented. This MRS, known as the English to Basque Learning Environment (EBLE), is a reference library of three books and concomitant sound files for second language learning of Basque.

Currently, we have available for this project three resources for the learning of Basque that can be usefully co-ordinated. The aim of this project is to bring together a Basque-English dictionary, an English-Basque dictionary and a Basque grammar book (written in English) into a software environment where a student can move from one document to another seamlessly. This involves the development of appropriate storage structures, cross-referencing strategies, retrieval mechanisms and a suitable user interface. In addition, the examples of Basque and English in the grammar book will be recorded by native speakers and will have to be linked into the software environment so that the user has immediate access to them. The final product should be executable from a CD-ROM.

4. ONTOLOGY OF HUMAN CONCEPTS - A STUDY OF PROTO-BASQUE

A universal categorisation of human knowledge has proven elusive due to the diversity of perspectives such a system must serve. Roget's thesaurus is the most enduring categorisation system we have, but it is of limited use to people attempting knowledge representation for computational purposes. Nowadays knowledge is encapsulated in computational systems by means of explicit structures designed to attain a particular engineering goal. There have been some efforts to reduce knowledge to its atomic parts through the analysis of dictionaries, but this has also proved an elusive goal, as circularity of definitions creates endless computational loops, such as in WordNet. Category systems are used by Internet search engines, but they rely on an army of people to categorise documents and to use a category nomenclature in a consistent way. This has also proven unreliable, as any web search indicates, with false hits reaching 95% or more.
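The circularity problem can be made concrete with a small sketch that follows the words used in each definition until it returns to a word already on the current path. The toy dictionary below is invented for illustration and is not drawn from WordNet or any real dictionary.

```python
# Hypothetical toy dictionary: each word maps to the words used to define it.
DEFINITIONS = {
    "big": ["large"],
    "large": ["great", "size"],
    "great": ["big"],
    "size": ["extent"],
    "extent": [],
}


def find_cycle(definitions, start):
    """Return one definitional cycle reachable from `start`, or None."""
    path = []          # words on the current chain of definitions
    on_path = set()    # same words, for fast membership tests
    finished = set()   # words already shown to lead to no cycle

    def visit(word):
        if word in on_path:
            return path[path.index(word):] + [word]
        if word in finished or word not in definitions:
            return None
        path.append(word)
        on_path.add(word)
        for used in definitions[word]:
            cycle = visit(used)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(word)
        finished.add(word)
        return None

    return visit(start)


print(find_cycle(DEFINITIONS, "big"))
```

A naive procedure that simply expands each definition into the definitions of its words would never terminate on "big"; tracking the current path is what allows the loop to be detected and reported instead.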

One way forward is not to look at the linguistic relationships and concepts we have in historically modern languages, but rather to look at languages that are ancient, at least in their core elements. This approach has the potential to reveal relationships between objects and concepts that in other languages are now clouded over by centuries of language change. One language that has the potential to allow us to see far into the conceptual past is Basque. It is non-Indo-European and predates the Indo-European family of languages. Its distinctiveness means that it can be relatively easily cleared of words imported over the last 2000 years. We are conducting an analysis of Basque to identify the complete set of native Basque words still present in the language. This involves first identifying all the extant monosyllabic words and then building up to more complex forms. It appears that we will end up with about 4000-5000 words. These words will then be developed into a semantic net, along with the words that are formed by compounding processes. Compounding is extremely common in Basque. The final semantic net will be compared with those of other languages to identify the differences and similarities between them and, we hope, to give us greater insight into a basic ontology of human concepts that might be useful for more generic computational processing.