IV. XML processing

XML has attracted far more programmers than SGML, and the result is a nice pool of applications to choose from (for an overview, see e.g. Lars Marius Garshol's page about Free XML tools and software). One reason is certainly that XML was specifically designed to be easier to parse than SGML. The downside is that putting together an XML processing system requires a few decisions up front. The following chapters present a (necessarily subjective) selection of XML parsers and XSLT processors with various output types. You may choose what you need, or install all of them and compare. The table below is intended to give some guidance.

Note: You should also be aware that most processors implement extensions to the standards. Some XSLT stylesheets rely on specific extensions for special features, e.g. for chunking HTML output. Consult the documentation of the stylesheets you plan to use to find out which processor you should prefer.

The main considerations when picking one of the combos are:

  • Parser interface: There are two accepted standards for XML parser interfaces, SAX and DOM. Put simply, a SAX-capable parser calls a registered function for each start tag, each end tag, and the data in between. Parsing is sequential, so the whole document never has to be in memory at once (in some cases the whole document does not even have to be available; it may be received in chunks). The downside is that elements have to be processed as they are encountered during parsing. If an application needs access to earlier or later elements, it has to do some sort of buffering. A DOM-capable parser, on the other hand, creates an in-memory representation of the whole document. This may need a lot of memory for large documents, but all elements can be accessed freely at any time during processing. For you as the end user, the parser interface issue boils down to the question of which XSLT engine can be used with which parser.

  • Validating vs. non-validating: In the XML world, documents do not have to be valid (in contrast to SGML), but they can be validated against a DTD if necessary. You can therefore use either a validating or a non-validating parser, depending on your needs.

  • Programming language: XML uses Unicode to encode characters. The programming language Java also builds on Unicode from the ground up, so processing XML with Java is a natural match. The downsides are that Java programs tend to be a little slower than C/C++ programs and that you need the Java Runtime Environment (which includes a bytecode interpreter) to run the applications. C/C++ programs are faster and have a smaller footprint, but Unicode programming in C or C++ is not as common.
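To make the difference between the two parser interfaces more tangible, here is a minimal sketch. It uses the SAX and DOM modules from the Python standard library rather than any of the processors covered in this tutorial; the sample document and the names EventLogger, sax_events and dom_last_para are made up for illustration only.

```python
import xml.sax
from xml.dom import minidom

# Made-up sample document, used for illustration only.
SAMPLE = "<book><title>XML basics</title><para>First</para><para>Second</para></book>"


class EventLogger(xml.sax.ContentHandler):
    """SAX style: the parser calls these methods as it reads the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Character data may arrive in several pieces; skip pure whitespace.
        if content.strip():
            self.events.append(("text", content))


def sax_events(xml_text):
    """Return the callbacks in the order the SAX parser issued them."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


def dom_last_para(xml_text):
    """DOM style: build the whole tree first, then jump to any element."""
    doc = minidom.parseString(xml_text)
    paras = doc.getElementsByTagName("para")
    return paras[-1].firstChild.data


if __name__ == "__main__":
    for event in sax_events(SAMPLE):
        print(event)
    print(dom_last_para(SAMPLE))
```

With SAX, the handler only ever sees the current event, so picking out the last para element would require buffering inside the handler. With DOM, the tree is complete before any of our code runs, so the last para can be addressed directly, at the price of holding the whole document in memory.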

Table 1. XML parsers and XSLT processors covered in this tutorial

Name           Parser interface   Validating parser   Language
xsltproc       DOM, SAX           yes                 C
XP/XT          SAX                no                  Java
Xerces/Xalan   DOM, SAX           yes                 Java
Saxon/Ælfred   SAX                ?                   Java

None of the XSLT processors mentioned here can create printable output directly (all of them can produce HTML, though). Therefore we need a set of additional applications to transform our XML documents into PDF and RTF files.

Note: XML differs from SGML in that a SYSTEM identifier for the DTD file is mandatory. To keep the files portable, a URL is usually specified for this purpose instead of a local path. This means that most XML transformations require an internet connection. No connection is needed for editing XML files with PSGML, as PSGML does not attempt to resolve URLs.
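As an illustration of what such a declaration looks like, the following sketch reads the identifiers back from a DOCTYPE declaration. It uses the minidom module from the Python standard library, not one of the processors covered in this tutorial. The document, its public identifier and the example.org URL are hypothetical, and the sketch assumes that minidom, being a non-validating parser, only records the SYSTEM identifier and does not retrieve the DTD.

```python
from xml.dom import minidom

# Hypothetical document: the public identifier and the DTD URL are made up.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE note PUBLIC "-//EXAMPLE//DTD Note 1.0//EN"
  "http://example.org/dtd/note.dtd">
<note>Remember the SYSTEM identifier.</note>
"""


def doctype_identifiers(xml_text):
    """Return (public identifier, system identifier) of the DOCTYPE declaration."""
    doctype = minidom.parseString(xml_text).doctype
    return doctype.publicId, doctype.systemId


if __name__ == "__main__":
    public_id, system_id = doctype_identifiers(SAMPLE)
    print("PUBLIC:", public_id)
    print("SYSTEM:", system_id)
```

The second quoted string in the declaration is the SYSTEM identifier. A processor that wants to read the DTD has to resolve this URL, which is where the need for a connection comes from.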