XML, Expat, and Xerces

I’ve been doing some research into handling XML (Extensible Markup Language) data on VOS and OpenVOS, and I thought it would be useful to share what I learned with a wider audience.

WHAT IS XML?

The following description of XML is taken from “XML: A Beginner’s Guide”, by Dave Mercer (Osborne/McGraw-Hill, 2001).

“SGML (Standard Generalized Markup Language) is an international standard (ISO 8879) for the definition of device-independent, system-independent methods of representing information, both human and machine-readable. Languages conforming to the rules of SGML are called applications, and HTML is an SGML application. […] HTML has no provisions for extending itself in a standard way into new tags, attributes, data structures, or content types… XML, rather than being a predefined language like HTML, is a predefined way of defining new languages while avoiding the overly complex nature of SGML. Technically, XML includes a subset of the capabilities found in SGML.”

An XML Schema is a method of defining a specific XML document (or class of documents). It describes the structure of a document, including the elements, attributes, data types, and constraints that can be used, much as one would define a database.

An XML Document Type Definition (DTD) serves a similar purpose to a schema.

XML has been around for over 10 years, and there are a number of competing products and methods, as well as a number of related standards. There are also many trade books on XML; O’Reilly publishes one called “Learning XML” that may be of help (see www.oreilly.com). The http://xml.com/ web site, maintained by O’Reilly, has a number of helpful articles and links.

The XML standards are published by the World Wide Web Consortium, which is at http://www.w3.org. The standards are pretty dry, so I don’t recommend trying to read them.

XML resembles HTML, and it has a similar heritage, but it is a much more general-purpose encoding. HTML describes how data should look when displayed on a web page. HTML, by itself, doesn’t record whether a number represents a quantity, price, or stock number. XML, on the other hand, is typically unconcerned with how the data looks; it describes what the data means. One common application of XML is to create a textual encoding of a specific database. XML has the ability to say that this field is a numeric price, and that field is an alphanumeric stock number, and so forth. All data in an XML document is encoded as text, which means that sticky problems like endianness or binary representation of floating-point numbers go away. Thus, XML is gaining acceptance as a good language for computer-to-computer communication of data.
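To make the "textual encoding of a database" idea concrete, here is a minimal sketch in Python. The inventory record, its element names, and the load_items helper are all invented for illustration; they do not come from any particular standard or product. The sketch shows that every field arrives as text and that the receiving program decides how to convert it.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical inventory record; element and attribute names are
# assumptions made up for this illustration.
INVENTORY_XML = """\
<inventory>
  <item stock_number="A1234-X">
    <description>Widget</description>
    <quantity>12</quantity>
    <price currency="USD">19.95</price>
  </item>
</inventory>
"""

def load_items(xml_text):
    """Return a list of dicts, converting text fields to native types."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("item"):
        items.append({
            # Alphanumeric stock number stays a string.
            "stock_number": item.get("stock_number"),
            "description": item.findtext("description"),
            # Numeric fields are plain text in the document, so the
            # sender's byte order and float format never matter.
            "quantity": int(item.findtext("quantity")),
            "price": Decimal(item.findtext("price")),
        })
    return items

if __name__ == "__main__":
    for record in load_items(INVENTORY_XML):
        print(record)
```

Because the price travels as the characters "19.95", the receiver is free to pick whatever numeric type suits it; here I chose a decimal type, but that is a design choice of the sketch, not something XML requires.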

TOOLS TO PROCESS XML

I am aware of two commonly used open-source packages that process XML. The first is Expat and the second is Xerces. Both of these packages were created in 1999. Expat was the work of one individual; Xerces came out of IBM.

There are two competing models for dealing with XML. One is the Document Object Model (DOM) and the other is the Simple API for XML (SAX). The SAX method (the event-driven style used by Expat) reads through an XML document in a linear fashion, calling a handler function every time a markup element occurs. The DOM method reads the entire XML document and creates a tree-structured hierarchy. You can think of SAX as a sequential access method and DOM as a random access method. Xerces supports both SAX and DOM APIs. There are third-party packages that provide a DOM API for expat (see “simkin”).
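The difference between the two models is easier to see side by side. The following is a minimal sketch using the SAX and DOM modules from Python's standard library rather than the Expat or Xerces C/C++ APIs. The order document, the ItemCollector class, and the two helper functions are hypothetical names invented for this example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document used only for this illustration.
ORDER_XML = b"<order><item>bolt</item><item>nut</item><item>washer</item></order>"

class ItemCollector(xml.sax.ContentHandler):
    """SAX style: the parser calls these methods as it walks the input."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._text = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_item:
            self._text.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._text))
            self._in_item = False

def items_via_sax(data):
    """Sequential access: see each item once, in document order."""
    handler = ItemCollector()
    xml.sax.parseString(data, handler)
    return handler.items

def last_item_via_dom(data):
    """Random access: build the whole tree, then jump to any node."""
    doc = minidom.parseString(data)
    nodes = doc.getElementsByTagName("item")
    return nodes[-1].firstChild.data

if __name__ == "__main__":
    print(items_via_sax(ORDER_XML))
    print(last_item_via_dom(ORDER_XML))
```

The SAX handler has to keep its own state (am I inside an item?), but it never holds more than one item in memory. The DOM version is shorter for this kind of lookup, at the cost of loading the entire document first.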

While we use the term “XML document”, in practice the XML-encoded text could be a file or a data stream. Typically, you supply the functions that collect the text and hand it to the parser, so it can come from any source.
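Here is a minimal sketch of that "you collect the text, the parser consumes it" arrangement, using Python's xml.parsers.expat module, which is a binding to the Expat library. The count_elements function, the chunk size, and the sample log data are assumptions made for illustration. In C, the corresponding Expat calls are XML_ParserCreate, XML_SetElementHandler, and XML_Parse, which follow the same pattern.

```python
import io
import xml.parsers.expat

def count_elements(stream, chunk_size=16):
    """Feed a byte stream to Expat in small chunks and count start tags."""
    counts = {}

    def on_start(name, attrs):
        # Called by the parser each time an element's start tag is seen.
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start

    while True:
        chunk = stream.read(chunk_size)   # could be a file, socket, pipe...
        if not chunk:
            break
        parser.Parse(chunk, False)        # False: more data may follow
    parser.Parse(b"", True)               # True: signal end of input
    return counts

if __name__ == "__main__":
    # An in-memory stream stands in for whatever source you really have.
    source = io.BytesIO(b"<log><entry/><entry/><entry/></log>")
    print(count_elements(source))
```

Note that the chunks do not need to line up with element boundaries; the parser keeps track of partial tags between calls.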

EXPAT

Expat is designed for reading XML and taking some action every time an XML element appears. I haven’t seen any documentation that suggests that it can create XML-based data streams. Expat is written in C. There is also a version of Xerces for Perl and Java.

A port of expat to VOS is available at:

ftp://ftp.stratus.com/pub/vos/posix/ga/v-series/expat-2.0.1.txt

ftp://ftp.stratus.com/pub/vos/posix/ga/v-series/expat-2.0.1.save.evf.gz

The master web site for expat is:

http://expat.sourceforge.net/

and

http://www.libexpat.org/

A nice introduction to expat is linked from the SourceForge home page; here is the link:

http://www.xml.com/pub/a/1999/09/expat/index.html

You can find links to other open-source packages that use expat at http://www.libexpat.org/.

XERCES

Xerces can both read and write XML data streams. Xerces is written in C++ and is designed to be called by C++ programs.
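To give a feel for what writing XML through a DOM-style API looks like, here is a minimal sketch using the minidom module from Python's standard library. This is not the Xerces C++ API; it only illustrates the general build-a-tree-then-serialize approach that DOM implementations share. The build_inventory_xml function and the element names are hypothetical.

```python
from xml.dom import minidom

def build_inventory_xml(rows):
    """Build an XML string from (stock_number, price) pairs via a DOM tree."""
    doc = minidom.Document()
    root = doc.createElement("inventory")
    doc.appendChild(root)

    for stock_number, price in rows:
        item = doc.createElement("item")
        item.setAttribute("stock_number", stock_number)

        price_node = doc.createElement("price")
        # Values are always written as text; escaping of characters
        # such as < and & is handled by the serializer.
        price_node.appendChild(doc.createTextNode(price))

        item.appendChild(price_node)
        root.appendChild(item)

    # Serialize only the root element, without an XML declaration.
    return root.toxml()

if __name__ == "__main__":
    print(build_inventory_xml([("A1234-X", "19.95"), ("B5678-Y", "4.50")]))
```

Building the tree first and serializing at the end means the library, not your code, is responsible for producing well-formed output.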

Tom Mallory and I ported Xerces version 2.8.0 for a Stratus customer in mid-2008. Our target was a V Series platform. The current version of Xerces is 3.0.1. If you are interested in obtaining a copy of this port, please contact your account team.

The master site for xerces is:

http://xerces.apache.org/

WHICH ONE TO USE?

If you just want to parse an XML-encoded document, and your processing is compatible with sequential reading of the data stream, then I recommend using expat. Although it is written in C, all VOS programming languages can call functions and subroutines written in any other programming language, so you can call expat from any VOS language.

If your requirements include the ability to perform random-access operations on an XML document, or you want to create XML, and you are comfortable programming in C++, then I recommend using Xerces.

If you need assistance adding the ability to handle XML documents to your application, contact your local Stratus account team.
