Chinese language - breaking the chain

30 November 2012

Student Cathy Xiao Yu and supervisor Professor Josiah Poon.

The written Chinese language is like a beautiful unbroken chain of symbols, say University of Sydney researchers who are developing IT solutions that will make information extraction from Chinese text easier and faster.

Without an intimate knowledge of the language's structure and its interconnections, it can be almost impossible to extract information from Chinese text unless a lengthy segmentation process is carried out first, says master's student Cathy Xiao Yu from the School of Information Technologies.

She says for most non-native speakers, Chinese text has no discernible beginning or end. Each word on its own does not have a definite grammatical role; it can be a verb, a noun or an adjective depending on the context.

But Cathy believes she is well on the way to developing an app that will make information extraction from Chinese text simpler and faster. Her work to date recently won her the 2012 Sydnovate Prize for Innovation.

"From a linguistics point of view, it is very difficult to extract the relevant messages within the text without first breaking the complex chain of symbols and unlocking its seemingly hidden meaning," says Cathy.

"Our invention is a rule-based approach to information extraction that looks for patterns in the order of the symbols as opposed to existing methods that usually undertake segmentation or other pre-processing for locating the potential sentences or paragraphs before doing the extraction.

"In addition, our method could be applied in different areas, whereas most other methods can only be used in one specific area.

"We have tested our method of information extraction on Chinese medical literature with great results."

"Also, in order to work effectively in China, it is crucial to know and to track changes in the positions held by government officials. Hence, we have also applied the same method to announcements of personnel movements in the Chinese government," says Professor Josiah Poon, who is supervising Cathy's project.

Together they are determined to find a solution that will make Chinese text more accessible to millions of people across the globe.

Professor Poon says his passion to develop an IT solution to the text extraction problem was originally motivated by his interest in Traditional Chinese Medicine (TCM) and the growing demand to prove the effectiveness of TCM with quantitative, empirical evidence.

"This is why we first tested our application on TCM journals. We aimed to extract prescriptions, herbs and dosages and clinical trial details from articles to test their effectiveness and identify the sets of core herbs for different diseases."
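To give a rough sense of what a rule-based extraction over unsegmented text can look like, here is a minimal Python sketch. It is not the researchers' method, which the article does not describe in detail. The rule, the names `DOSE_RULE` and `extract_doses`, and the sample prescription are all hypothetical, and the sketch assumes a herb name is two to four Chinese characters that follow punctuation and are immediately followed by a number and a weight unit.

```python
import re

# Hypothetical rule (an assumption, not the researchers' actual rule set):
# a herb name of 2-4 Chinese characters, directly followed by a number
# and a weight unit. No word segmentation is performed beforehand.
DOSE_RULE = re.compile(
    r"(?:^|[：:，,、；;。\s])"            # item begins at text start or after punctuation
    r"(?P<herb>[\u4e00-\u9fff]{2,4})"     # candidate herb name
    r"(?P<amount>\d+(?:\.\d+)?)"          # dosage amount
    r"(?P<unit>克|g)"                     # weight unit (grams)
)


def extract_doses(text):
    """Return (herb, amount, unit) tuples found by the pattern rule."""
    return [
        (m.group("herb"), float(m.group("amount")), m.group("unit"))
        for m in DOSE_RULE.finditer(text)
    ]


# Made-up sample prescription, for illustration only.
sample = "处方：黄芪30克，当归15克，甘草6克。"
for herb, amount, unit in extract_doses(sample):
    print(herb, amount, unit)
```

A single hand-written rule like this is brittle: it would miss herb names embedded in running prose or doses written in Chinese numerals. A practical system would need many more rules than this sketch shows.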

Furthermore, Professor Poon says the technique being developed has the potential to be applied to any Chinese text, including business and financial reports and newspapers, making previously unavailable literature accessible to non-native speakers around the world.

Commenting on the app, Professor David Goodman, Academic Director, China Studies Centre, at the University said: "The Chinese-English language interface is increasingly important. New technology that aids understanding is most necessary."


Media enquiries: Victoria Hollick, 9351 2579, 0401 711 361