TXP: a transaction-based XML parser

Raymond K. Wong, Basser Department of Computer Science, University of Sydney, NSW 2006, Australia, wong@cs.usyd.edu.au

James R. Curran, Basser Department of Computer Science, University of Sydney, NSW 2006, Australia, jcurran@cs.usyd.edu.au
 
 

Abstract

The recent emergence of eXtensible Markup Language (XML) as a new standard for data representation and exchange on the World-Wide Web has drawn significant interest in XML parsers. Almost all existing XML parsers are implemented based on either Lex and Yacc clones, or from scratch and in recursive manner. This paper describes a transaction-based XML parser called TXP which is equiped with a lightweight persistent XML store. We found the parser is extremely useful to support incremental parsing, parsing with rollback, concurrent parsing of multiple XML documents to the shared data structure, and parsing large, deeply nested XML documents without worrying stack overflow. The parser is a key component of the SODA (Semistructured Object Database) project at The University of Sydney.

Topics: Web Data Management
Keywords: XML, Semistructured Database, XML Parser
 
 

1. Introduction

There are several motivations for considering semistructured data. These include data integration [14], biological databases [7], querying or managing web sites [8, 11, 13], or dealing with semi-structured data directly [15]. In particular, the various data models proposed for semistructured data are similar with only minor variations: they model a database as a rooted, labeled graph. The nodes have an associated oid. The leaves are atomic values, like strings, integers, etc, or large objects, like images, sounds, etc, which are still atomic for the purpose of query evaluation. Labels are attached to the graph's edges. Several languages (e.g., [2, 7, 9]) have been proposed for semistructured data, which vary in style and expressive power. The recent emergence of eXtensible Markup Language (XML) [4] as a new standard for data representation and exchange on the World-Wide Web has drawn significant research attention. XML is a textual, markup language for information representation and exchange on the Web. Nested, tagged elements are the building blocks of XML. Each tagged element has a sequence of zero or more attribute/value pairs, and a sequence of zero or more sub-elements. These sub-elements may themselves be tagged elements, or they may be untagged segments of text data called CDATA. A well-formed XML document places no restrictions on tags, attribute names, or nesting patterns. Alternatively, a document can be accompanied by a Document Type Definition (DTD), essentially a grammar for restricting the tags and structure of a document. DTD is used to validate XML document.

In this paper, we briefly describe the SODA XML database being developed at the University of Sydney. We then present a transaction-based XML parser called TXP. TXP is specifically designed to

Finally we would describe in detail the design and implementation (by extending lex and yacc code to support persistence) of TXP.

The TXP shares much of its heritage with Lore [12]. Lore's Object Exchange Model (OEM) and external file format provide a similar hierarchical semi-structured object database description to TXP and XML. Both can be represented using a labelled directed graph. However OEM does not distinguish between attributes and children in the way that XML does. The literature regarding Lore has primarily focused on efficient retrieval and storage mechanisms for the semi-structured data.
 
 

2. The SODA XML Database and Its Data Model

SODA (for Semistructured Object DAtabase) is an object database designed specifically for managing XML data. It consists of a persistent object store specially designed for storing semistructured data such as text or web documents. It employs XML as its underlying information model. Since semistructured data is generally described as data which does not conform to a rigid schema [1, 5], data components can be missing, can be of different type from one item to another, can be atomic in one item and structured in another, collections can be heterogeneous, etc [15]. Implementing a persistent object store to store and manipulate such data requires normally rethinking all aspects of a POS, including object loading, storage management, buffer management, query processing and optimization, and application programming interfaces. Rather having a totally new data model and implementing a semistructured data repository from scratch (such as the Lore system developed at Stanford University [12]), we extend the existing object data model and its operators to support graph structure. For optimization reason, part of this extension is done in system-level.

In particular, we extend the techniques from object-role databases to develop our semistructured database. In object-role databases such as Fibonacci [3] and DOOR [16], a role extends an existing object with additional state and behavior. An object may have many roles evolve over time. Rather than being an instance of some unique subclass defined through multiple inheritance, an object simply is an instance of many types by virtue of having many roles. Every object reference is to a particular role, and the behavior of the object depends on which role is being referenced. Since an object can play multiple roles and acquire them at the instance level, instantiation of multiple specific role classes are supported such that the association between an instance and role classes are neither exclusive nor permanent.

Intuitively, we formalize a model which is similar to the object-role data model to model semistructured data. For instance, the objects in object-role databases represent the root nodes in those rooted, labeled graphs, since they can be referenced globally as the entry points to the semistructured data stored in the database. All the internal nodes and leaves would be represented by roles. Following the implementation of roles in DOOR, all these internal nodes and leaves will have a local identifier which is unique only under the umbrella of its root node. This would save us some oid space. The leaves are still atomic values, like strings, integers, etc, or large objects, like images, sounds, etc, and they can be augmented with different retrieval or update methods in their corresponding classes (called role classes in object-role databases). Names can be assigned to the link between each role and its player as in the label attached to each graph's edge. Moreover, we found that object-role query languages are very similar to those languages proposed for semistructured data, as they both describe the traversal paths in a declarative way based on path expressions.

To represent the type (or class) of each object or role, we extend XML in a gentle manner that only a type tag (::TypeName) is added to its syntax. For instance, consider the following project example in XML:

<project::Project who=''smith'' xml:lang=''en-US''>

<leader::Person> John Smith </leader>

<member::Person> Mary Jones </member>

<member::Person> Craig Lee </member>

<contact::Email> smith@cs.usyd.edu.au </contact>

<title> Issues on XML Parsing </title>

<abs::Abstract> This project address several issues </abs>

<abs> regarding parsing large XML documents. In </abs>

<abs> particular, we address the stack overflow </abs>

<abs> problem when a deeply nested XML is to be </abs>

<abs> parsed by a lex&yacc-based XML parser. </abs>

<keyword> XML parser </keyword>

<keyword> Object database </keyword>

<keyword> Semistructured data </keyword>

<sponsor>

<name> Future Parsing Technologies, Inc. </name>

<contact::Person> Donald Douglas </contact>

<amount> $200,000.00 </amount>

</sponsor>

<sponsor>

<name> TXP Corp </name>

<branch> Sydney </branch>

<amount> $180,000.00 </amount>

</sponsor>

</project>

Both leader and member are of the same type, Person. Once the type of a particular tag name is specified, further explicit declaration is unnecessary (e.g., abs) if they are within the same element boundary.
 
 

3. The Parser Design and Implementation

3.1. The Design

3.1.1. Children and Attributes

There are cases where the distinction between an element's children and attributes is unclear. Objects and properties that develop by naturally decomposition of a larger object into its constituents are usually encoded as child objects of the object. Objects and properties that are atomic in nature are usually encoded as attributes of the larger object. A common exception to this is when there is only one recognised attribute of the object such as

<leader::Person>John Smith</leader>

rather than

<leader::Person name="John Smith"/>

Whether a represented object is semantically closer to an object or an attribute does not always determine whether it should be rendered as such. There are many cases where attributes must be rendered as objects. For instance, the XML standard forbids more than one attribute to have the same name. This means that a property of an object which may have multiple values simultaneously must be encoded as an object even if it is semantically closer to an attribute.

There are also many cases where objects must be rendered or at least referred to using attributes. The following example shows a simple hierarchical database.

<leader::Person>

<name::Name>Mary Smith</name>

<PA::Person>

<name::Name>John Jones</name>

</PA>

</leader>

<PA::Person>

<name::Name>John Jones</name>

</PA>

The problem is that simple hierarchies cannot represent more complex relationship information without redundancy. For instance, to keep track of all of the people working on a project, a separate entry for the PA John Jones is required at the root of the hierarchy. This can be resolved without redundancy by using the ID and IDREF attribute types to create references to each object. Now the database becomes

<leader::Person id::ID=''P1'' PA=''P2''>

<name::Name>Mary Smith</name>

</leader>

<PA::Person id::ID=''P2''>

<name::Name>John Jones</name>

</PA>

If the PA object John Jones was to have a reference back to his boss, Mary Smith then there would be even more complication and redundancy without the use of references. The use of ID and IDREF attributes creates hierarchies interwoven with the parent-children structure. Because of this duality we have decided to consider attributes and children in the same manner with a marker to indicate whether the relationship is a child or an attribute.

A XML object is defined in terms of its name, type, attributes, child objects and parent object. The XML object's type, attributes and children are always absolute and context-free and in a simple root hierarchy database the object's name and parent are also absolute because there is only one context. These properties including the root hierarchy name and parent should be stored with the object. An object's name and parent however, are determined by its location(s) in the hierarchy. Because of their plurality these properties should be stored with the link not the object. The internal representation of the XML database is that each XML object holds a collection of links to other XML objects that it references. The objects themselves hang off the links so that their identity is dependent on their locational context.

3.1.2. Stacks and State

Small XML databases can be loaded into a SODA server using an ordinary LEX and YACC based XML parser. These generally efficient but stack intensive parsers perform poorly when used on large XML files due to the size of individual XML objects and the depth of nesting within the XML object. Precious stack memory can be saved by pushing these objects straight onto the database while their children are being parsed - which will be the ultimate destination of the objects anyway.

The difficulty is that the modification of the database occurs before there is any guarantee that the XML database file is syntactically valid. Recovering from a error in the XML database file may involve completely invalidating the entire XML object hierarchy pushed into the database or just ignoring individual erroneous XML objects. This means that a record of modifications to the database must be maintained while the file is being parsed. If this record is a persistant object within the database itself it can be used to support rollback for when TXP processes fail or are aborted. Further, this record can hold not only the XML object's state in the database but also the XML parse state so that parsing can be performed incrementally or continued after a system failure occurs. Finally, this record can be used as a signature to determine if another copy of the XML file is being loaded into the SODA server from a TXP client and which parsing is further advanced.

An appropriate point to add an object to the database and remove it from the parse stack is after the attributes of the XML element have been parsed and the children of the element are about to be parsed. A partial object is thus defined as the SODA database object representing an XML element that has been parsed beyond the element attributes but not the children or element end mark-up. This is stack space efficient because the parse scope of an element's attributes is generally smaller than that of the children parse scope. This is also time efficient since it allows child objects to be connected to their parents in the database as soon as they are created.

3.2. The Implementation

3.2.1. Stored Information

The YACC parser state with the SODA database state completely describe the parsing process and its state when the parsing is stopped. These two states contain all of the details required to continue the parsing process, undo parts or all of the changes to the database and recover the database state when parsing errors occur. The SODA database state is represented by a list of complete and partial object insertions that have modified the database as a result of the parsing process. The partial XML objects in the list can be considered as an embedded stack that keeps track of the incomplete objects which form the current YACC parse stack.

Each entry in the state list represents an XML object in the database and its associated extent within the XML file. The entry holds the SODA oid and the file positions of the start and end of the XML element in the XML file as well as a pointer to any object already in the database that have been deleted or modified as a result of the insertion of the new object. It also holds information used by the YACC parser discussed below.

The YACC parser state includes list of token, token type identifiers on the parse stack and the parse finite state machine (FSM) variable itself. YACC parse stack entries that represent XML entities that have had their attribute lists parsed are contained in the database modifications list. Each parsed entry that does not have a defined END file position is equivalent to an entry in the YACC parse stack. The YACC parse stack can be generated from partial object records in the list. However the state of the YACC FSM at each recursive layer must also be held in the object database so that it can be generated when parsing is restarted.

3.2.2. Parsing Algorithm

All operations on the stored parse state information such as the stack or object list include immediate updating of the persistent datastructure into the database.

  1. Ignore white space before the XML entity name and attributes
  2. Parse the XML entity name and attributes list. The name and attributes are the required properties for a partially-parsed object.
  3. Push a new object record onto the list.
  4. Set the new object record START property to the current file position. This file position is the character after the end of the opening entity tag.
  5. Add the partial object (name and attributes) to the database.
  6. Set the OID property of the object record to the partial object oid.
  7. Link new child (this object) to parent object
  8. Set the YACCSTATE property of the object record to the entry at the top of the YACC state stack.
  9. Recursively parse the children and connect to the parent as they are created.
  10. Set the END property of the object record to the file position after the end of the end entity tag.
The parse information is stored as snapshots of the parse state whenever a valid partial object is added to the database. The object records in the list provide a chronological record of the modifications to the database by inserting objects. The partial object records in the list provide the information required to recreate the YACC parse state information and continue parsing.

3.2.3. Recovery Algorithm and Rollback

The recovery algorithm uses the object record list entries to recover the YACC parse state or rollback to a stable YACC parse state so that parsing can continue.

  1. for each record in the object list
  2. if defined(END) then
  3. the object and all its children have been completed
  4. else if not defined(END) and defined(OID) then
  5. this is a partial object and some of the children may have been parsed
  6. if defined(YACCSTATE) then
  7. children may or may not have been parsed
  8. add the OID to the YACC object stack and YACCSTATE to the YACC state stack
  9. else
  10. rollback parse state
  11. else if not defined(END) and not defined(OID) then
  12. rollback parse state
At the end of this process a complete YACC parse state has been generated from which the parsing can continue.

The completeness of the recovery algorithm can be tested by considering interruptions before, during and after each step in the parsing algorithm.

Any interruption before Step 3 is ignored because it is to do with the microscopic parse state and must be rolled-back to reparsing the opening tag since only macroscopic parsing information is stored.

If the process is stopped during or just after Step 3 then an empty object record has been added to the list. This is removed from the list as part of the rollback process invoked by line 13.

If the process is stopped during or just after Step 4 then the START property has been defined but nothing else. The object itself has not been added to the database so only the object state record is in the database. This is removed in the rollback invoked by line 13.

If the process is stopped during or just after Step 5 then the object has been added to the database but the OID has not been set. Because no links to the parent node have been created and SODA removes unreachable objects only the object record remains in the database. This is also removed by the rollback invoked by line 13.

If the process is stopped during or just after Step 6 then the partial object has been added to the database. Rollback occurs on line 10 because the current parse state is not known. The partial object and the object record are removed from the database by the rollback procedure.

If the process is stopped during or just after Step 7 then the rollback occurs. In this case rollback must also remove the link object that associates the parent object with the child object in the database.

If the process is stopped during or just after Step 8 then the children have not been parsed but the parse state is stable. The parse stack is rebuilt using the object record for the partial object.

If the process is stopped during or just after Step 9 then some of the child objects may not have been parsed. The rest of the list will contain the information to determine the current parse state recursively.

If the process is stopped during or just after Step 10 then the parsing was complete and the object is stable in the database.

The file positions are given for the start and end of each tag in the object record list. Using this information it is possible to move the lexical analyser through the file (using seek) and rollback the parsing process. It is possible to request a rollback to a file location or rollback to a particular object.
 
 

4. Conclusion

We presented a transaction-based XML parser called TXP. Different from other XML parsers, TXP is specifically designed to: avoid stack overflow for large, deeply nested XML documents; support incremental parsing with rollback support; resume partially parsed documents without restarting from the beginning; support concurrent parsing of multiple XML documents from different processes; query on partially parsed data without waiting for the completion of parsing the whole document. We have described its design and detailed implementation (by extending lex and yacc code to support persistence).

As XML becomes the standard semistructured data exchange format for information of all kinds, XML parsers will have to efficiently cope with larger and larger XML files. We believe that the modest performance sacrifice for transaction-based XML parsing is required to handle parsing, manipulating and query of these large XML documents effectively. Alternatively, because the performance drawback may be a significant disadvantage in some circumstances (for example, small or flat XML documents), we have implemented TXP such that it can switch back to a typical way of parsing, i.e., storing the entire parse tree in memory, by runtime command argument.

References

[1] S. Abiteboul. Querying semi-structured data. In Proceedings of the International Conference on Database Theory (ICDT). Springer Verlag, 1997.

[2] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The lorel query language for semistructured data. Journal on Digital Libraries, 1(1):68-88, 1997.

[3] A. Albano, G. Ghelli, and R. Orsini. Fibonacci: A Programming Language for Object Databases. VLDB Journal, 4(3):403-444, 1995.

[4] T. Bray, J. Paoli, and C.M. Sperberg-McQueen. Extensible markup language (xml) 1.0. In W3C Recommendation, World Wide Web Consortium, 1998; available online at http://www.w3.org/TR/REC-xml.

[5] P. Buneman. Tutorial: Semistructured data. In International Conference on PODS, 1997. [6] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory (ICDT), pages 336-350. Springer Verlag, 1997.

[7] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In Proceedings of SIGMOD, 1996.

[8] M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu. Strudel: a web-site management system. In Proceedings of SIGMOD, May 1997.

[9] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for a web-site management system. SIGMOD Record, 26(3):4-11, 1997.

[10] R. Goldman and J. Widom. Dataguides: enabling query formulation and optimization in semistructured databases. In Proceedings of VLDB, September 1997.

[11] D. Konopnicki and O. Shmueli. Draft of w3qs: a query system for the world-wide web. In Proceedings of VLDB, 1995.

[12] J. McHugh et al. Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54-66, September 1997.

[13] A. Mendelzon, G. Mihaila, and T. Milo. Querying the world wide web. In Proceedings of the Fourth Conference on Parallel and Distributed Information Systems, December 1996.

[14] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proceedings of IEEE International Conference on Data Engineering, March 1995.

[15] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying semistructured heterogeneous information. In Proceedings of Deductive and Object Oriented Databases, 1995.

[16] R.K. Wong, H.L. Chau, and F.H. Lochovsky. A Data Model and Semantics of Objects with Dynamic Roles. In IEEE International Conference on Data Engineering, April 1997.