Honours Projects 2003 - Professor Jon Patrick

Projects sponsored by the Capital Markets Co-operative Research Centre (CMCRC) carry a $5,000 scholarship for approved students and are eligible for one of the best-project prizes of $1,000. Students working on these projects are expected to participate in all Language Technology Research Group activities and functions. The number of scholarships is limited. Approval is obtained by applying for a scholarship and is subject to selection on academic criteria.

Language Technology Applications - Machine Inference of a Language Parser (Sponsor: Capital Markets CRC)

Topic: Develop a mechanism for inferring probabilistic automata for language data.

Task: Building parsers for language technology problems has recently shifted from a linguistic analytical process to a statistical induction process, with the advent of large corpora marked up with word class and syntactic structure. Simple inferencing has produced parsers of limited value, although they do enhance analytically composed parsers. The aim of this project is to build a statistical parser that contains features not previously applied to this field. This parser would consist of a Probabilistic Finite State Automaton (PFSA) extended to allow for re-entrant sub-cycles in the automaton. The PFSA would be minimisable according to a variety of methods, including direct algorithms, linguistic criteria and statistical criteria.
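As a rough illustration of the starting point for such a project, the sketch below builds a prefix-tree PFSA from tagged sequences and scores a new sequence using relative-frequency transition probabilities. The function names, the end marker and the toy tag sequences are all hypothetical, and the sketch does not attempt the re-entrant sub-cycles or the minimisation methods mentioned above.

```python
from collections import defaultdict

END = "</s>"  # hypothetical end-of-sequence marker, not part of any tag set


def build_pfsa(sequences):
    """Build a prefix-tree PFSA; states are integers and state 0 is the start."""
    counts = defaultdict(lambda: defaultdict(int))  # state -> symbol -> count
    next_state = {}                                 # (state, symbol) -> state
    n_states = 1
    for seq in sequences:
        state = 0
        for sym in list(seq) + [END]:
            counts[state][sym] += 1
            if sym != END:
                if (state, sym) not in next_state:
                    next_state[(state, sym)] = n_states
                    n_states += 1
                state = next_state[(state, sym)]
    return counts, next_state


def sequence_probability(pfsa, seq):
    """Probability of seq as the product of relative transition frequencies."""
    counts, next_state = pfsa
    state, prob = 0, 1.0
    for sym in list(seq) + [END]:
        total = sum(counts[state].values())
        seen = counts[state].get(sym, 0)
        if total == 0 or seen == 0:
            return 0.0  # unseen transition; no smoothing in this sketch
        prob *= seen / total
        if sym != END:
            state = next_state[(state, sym)]
    return prob


# Toy part-of-speech sequences (illustrative only)
training = [
    ["DT", "NN", "VB"],
    ["DT", "NN", "VB"],
    ["DT", "JJ", "NN", "VB"],
    ["NN", "VB"],
]
pfsa = build_pfsa(training)
print(sequence_probability(pfsa, ["DT", "NN", "VB"]))
```

A prefix tree is only the unminimised form of the automaton; merging equivalent or statistically similar states is where the minimisation criteria named in the task would come in.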

 

Theme: Language Technology (Sponsor: Capital Markets CRC)

Topic: Semantic Analysis of Text

Task: Semantic analysis of text is in its embryonic stages, with most efforts at deep analysis having been abandoned in favour of shallow parsing and statistical analysis. It is necessary to develop an approach to semantic analysis of text that uses a fairly general theory of semantics and to develop computational processing techniques to apply it to text. The task is to investigate different ways of defining the semantics of text and to establish computational methods for their automatic identification.

Project Aim: To create a computing system that identifies semantic elements in text.

Theme: Information Retrieval (Sponsor: Capital Markets CRC)

Topic: Adding Semantics to Information Retrieval Methods

Task: Information Retrieval (IR) methods have proven to be a reliable technique for the classification of documents. Many different IR methods are available, yet there is only limited knowledge about the context in which each method performs best. In addition, greater linguistic knowledge improves the performance of an IR classifier. Furthermore, research generally shows that combinations of IR classifiers tend to be more accurate than any one classifier.

Project Aim: To construct a suite of IR classifiers that incorporate supplementary linguistic knowledge, and to determine the scenarios in which each performs best and the effectiveness of combined classifiers.
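One simple way of combining classifiers is majority voting, sketched below. The keyword-based classifiers, their keyword sets and the labels are hypothetical stand-ins for real IR classifiers; the sketch shows only the combination step, not any particular IR method.

```python
from collections import Counter


def keyword_classifier(keywords, label, default="other"):
    """Return a toy classifier that assigns `label` if any keyword occurs."""
    def classify(text):
        tokens = set(text.lower().split())
        return label if tokens & keywords else default
    return classify


def majority_vote(classifiers, text):
    """Combine classifiers by taking the most frequent predicted label."""
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three hypothetical classifiers; an odd number avoids two-way ties
classifiers = [
    keyword_classifier({"shares", "market"}, "finance"),
    keyword_classifier({"dividend", "stock"}, "finance"),
    keyword_classifier({"bank"}, "finance"),
]

print(majority_vote(classifiers, "the market rallied as stock prices rose"))
print(majority_vote(classifiers, "the team won the match"))
```

Weighted voting, where each classifier's vote is scaled by its measured accuracy in a given scenario, would be a natural extension once the per-method performance has been established.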

Theme: Language Technology (Sponsor: Capital Markets CRC)

Topic: Appraisal of Machine Learning Methods in Computational Linguistics

Task: A number of machine learning techniques are popular in the field of computational linguistics for tackling the problems of part-of-speech tagging, syntactic analysis and Word Sense Disambiguation (WSD). It would be valuable to make a comparative appraisal of the different techniques reported in the literature and to apply some of the more up-to-date machine learning methods to the same domains.

Project Aim: To appraise the different methods of supervised and unsupervised learning, apply them to part-of-speech tagging, syntactic analysis and Word Sense Disambiguation, and build generic implementations of the methods suited to rapid deployment in human language technology (HLT) systems.

Theme: Language Technology (Sponsor: Capital Markets CRC)

Topic: Utilization of Verb Classes for Parsing and Semantic Tagging

Task: One study of verbs has produced a classification of about 300 classes. This study lists the words belonging to each class and the grammatical structures associated with each class. The challenge is to implement programs that can identify each class and the elements in a sentence that match the corresponding structure of that verb class.

Project Aim: The aim of the project is to design a system that enables one to specify the grammatical structure of a verb class, its word members and other relevant lexical information and subsequently to tag a text with appropriate structural markers. Identifying semantic contexts in which the verb class is used will hone the accuracy of a verb analysis system. The system will be tested on dialogue from psychotherapy interviews. The verb classes have been set up as a resource in an XML file for this project.
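A minimal sketch of the first step, looking up a verb's class and tagging tokens with it, might look like the following. The XML layout, element names, class identifiers and frame notation are assumptions made for illustration and are not the schema of the project's actual XML resource; the sketch also matches surface forms only, without lemmatisation or structural matching.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: the real project file may be organised differently
SAMPLE_XML = """
<verbclasses>
  <class id="transfer">
    <member>give</member>
    <member>lend</member>
    <frame>NP V NP PP.to</frame>
  </class>
  <class id="placement">
    <member>put</member>
    <member>place</member>
    <frame>NP V NP PP.loc</frame>
  </class>
</verbclasses>
"""


def load_verb_classes(xml_text):
    """Map each member verb to (class id, list of grammatical frames)."""
    root = ET.fromstring(xml_text)
    index = {}
    for cls in root.findall("class"):
        frames = [f.text.strip() for f in cls.findall("frame")]
        for member in cls.findall("member"):
            index[member.text.strip()] = (cls.get("id"), frames)
    return index


def tag_verbs(tokens, index):
    """Pair each token with its verb class id, or None if it is not listed."""
    tagged = []
    for tok in tokens:
        entry = index.get(tok.lower())
        tagged.append((tok, entry[0] if entry else None))
    return tagged


index = load_verb_classes(SAMPLE_XML)
print(tag_verbs("I lend him money".split(), index))
```

Matching the rest of the sentence against a class's frames would require part-of-speech or chunk information, which is where the structural tagging described in the project aim begins.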

Theme: Language Technology (Sponsor: Capital Markets CRC)

Topic: Integration of Lexical Databases

Task: Every new language technology system tends to design its own lexical database, with the result that there are hundreds of systems that are incompatible with each other. This creates a serious impediment to aggregating the knowledge in many systems and has to be solved by the development of a knowledge-sharing metalanguage.

Project Aim: Investigate the large variety of lexical databases and ontologies to build a system to allow sharing of knowledge in a larger text analysis system.

Theme: Software Engineering of Language Technology Systems (Sponsor: Capital Markets CRC)

Topic: Software Architecture of LT Systems

Task: There are a number of architectures for LT Systems. Most rely on a sequential pipeline architecture for the progressive processing of data from the simple to the complex stages. More recent approaches have used agent-based philosophies in which agents communicate the results of their processing. We are interested in developing a transaction-based model that makes decisions in a just-in-time manner using probabilistic criteria. We believe this architecture will allow for massive parallelisation of the processing tasks as well as significant interaction between processing functions. Experiments in parallelising the solution using both clusters and grids need to be conducted.