The village of El Limon in the Dominican Republic is home to a remarkable development project that has built an irrigation and hydroelectric power system. The school has a laptop computer and will soon have an internet connection. Every night the children are entertained by multimedia CD-ROM titles originally designed for sale to people in affluent countries. Our mission was to develop the first multimedia product designed especially for them.

Olivia is a botanical illustrator and an expert on weeds, so we took notes and made illustrations of 32 weeds that we found in the village. We then had the task of converting a mass of text into presentable HTML. If I had to write 32 web pages by hand, even with the help of a WYSIWYG HTML editor, it would be difficult if not impossible to maintain a consistent look and feel. And then, if I didn't like the results or found that the pages were incompatible with a particular browser, making any changes would be a lot of work. If we were to get the descriptions translated into Spanish, we'd have to repeat all this work. In short, the project had grown beyond writing individual web pages to the next stage.

By the time we got home, the World Wide Web Consortium (W3C) had just issued XML 1.0 as a Recommendation, and XML seemed to be just the tool I needed. With XML I was able to design my own specialized markup language for descriptions of weeds and write a translator program that produces a set of web pages complete with indexes and navigational hints for humans and search engines alike. Content and style are now separated, so we can update the descriptions of the plants without worrying about the mechanics of HTML. Because the pages are automatically generated, the look is absolutely consistent, reaching a level of professionalism that would be difficult to attain creating pages by hand. And if we don't like the way it looks or if we want to add a new feature, changing all of the pages is as simple as pushing a button.

Choices

One application of XML which has been greatly hyped is the idea of sending a Java applet or other software package to the web browser, which then downloads one or more XML files and displays them. I decided to do something a bit more prosaic: write a program which "compiles" XML files to plain ordinary static HTML pages. This might not be as sexy as the alternatives, but it has a number of advantages. Static web pages don't need a special server, so they can be hosted by anybody, or stored on a hard drive, Zip disk or CD-ROM. A web server serving static pages can take many more hits than a dynamic web server, and static pages can also be served by push technologies and spiced up with JavaScript, Java applets and many other gimcracks. Static web pages work fine with older browsers, and allow one to insulate oneself from the falling bombs and shells of the browser wars. Because all my software is written in Java and I took some care designing it, it would be easy for me to migrate to different solutions in the future: I could write a Java servlet, for example, that generates HTML documents on the fly while allowing readers to customize the format by filling out a form or, for that matter, an applet that presents information graphically.

[Figure: Three-layer architecture of XML Compiler]

My publishing system has a three-layer architecture. At the top of the picture is an XML parser which, given a document type definition (DTD), parses XML into a collection of objects. You don't want to write your own; many are now available free with source code. As much as I hate to admit it, I chose MSXML because it was the only well-documented XML parser at the time. If I were starting this project today, I would probably use IBM's XML for Java.

The output of an XML parser, however, is structured more like an XML document than like a collection of plant descriptions. I also don't want to be tied down to a product from Microsoft, so I created my own collection of objects that represent a description of a plant in a natural way. One factory class uses MSXML to generate a Species object that describes a plant. When I want to change to another parser, I only need to rewrite that one class, since no other class contains a single reference to MSXML.
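The factory idea can be sketched roughly as follows. This is a minimal illustration, not the original source: it uses the DOM parser bundled with current JDKs (javax.xml.parsers) in place of MSXML, and the field names of Species, the method names, and the rendering of <CM>3</CM> as "3 cm" are all assumptions made for the example. The point is only that every parser import lives in one class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Parser-independent description of one plant (field names are hypothetical). */
final class Species {
    final String id;
    final String family;                          // null when FAMILY is absent
    final List<String> latinNames = new ArrayList<>();
    final List<String> commonNames = new ArrayList<>();
    final List<String> texts = new ArrayList<>();

    Species(String id, String family) {
        this.id = id;
        this.family = family;
    }
}

/** The only class that touches the XML parser API. */
final class SpeciesFactory {
    static List<Species> parse(String xml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse plant data", ex);
        }
        List<Species> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("SPECIES");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            NodeList fam = e.getElementsByTagName("FAMILY");
            String family = fam.getLength() > 0 ? fam.item(0).getTextContent().trim() : null;
            Species s = new Species(e.getAttribute("ID"), family);
            collect(e, "LATIN", s.latinNames);
            collect(e, "COMMON", s.commonNames);
            NodeList texts = e.getElementsByTagName("TEXT");
            for (int j = 0; j < texts.getLength(); j++) {
                s.texts.add(flatten(texts.item(j)));
            }
            result.add(s);
        }
        return result;
    }

    /** Copies the trimmed text of every descendant element named tag into out. */
    private static void collect(Element parent, String tag, List<String> out) {
        NodeList list = parent.getElementsByTagName(tag);
        for (int i = 0; i < list.getLength(); i++) {
            out.add(list.item(i).getTextContent().trim());
        }
    }

    /** Flattens mixed content; a CM child is rendered as its number plus " cm" (an assumption). */
    private static String flatten(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList kids = node.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (k.getNodeType() == Node.ELEMENT_NODE && "CM".equals(k.getNodeName())) {
                sb.append(k.getTextContent().trim()).append(" cm");
            } else if (k.getNodeType() == Node.ELEMENT_NODE || k.getNodeType() == Node.TEXT_NODE) {
                sb.append(k.getTextContent());
            }
        }
        return sb.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Swapping parsers would then mean rewriting SpeciesFactory alone, because Species itself imports nothing from the parser.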

The HTML generator lies in a third layer. Since it's separate from the representation of the plants, it could easily be replaced with a layer that generates, say, a LaTeX document, or one that runs as a client-side Java applet, or one that generates HTML dynamically on a server. If I were to write an applet version, I probably would not include an XML parser with the applet, since it would take time to download and verify the parser as well as to download and parse the XML. Instead, I would send the intermediate representation to the applet using Java's object serialization. The design of this program has been highly influenced by the books Design Patterns and Concurrent Programming in Java.
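A rough sketch of why a separate output layer is easy to replace appears below. The names Plant, PlantRenderer, HtmlRenderer and LatexRenderer are hypothetical and do not come from the original source; Plant is a cut-down stand-in for the intermediate representation, and the markup each renderer emits is only illustrative.

```java
import java.util.Arrays;
import java.util.List;

/** Cut-down stand-in for the parser-independent plant description (hypothetical). */
final class Plant {
    final String latinName;
    final List<String> commonNames;
    final String description;

    Plant(String latinName, List<String> commonNames, String description) {
        this.latinName = latinName;
        this.commonNames = commonNames;
        this.description = description;
    }
}

/** Output layer: anything that can turn a Plant into a document. */
interface PlantRenderer {
    String render(Plant plant);
}

/** Produces a static HTML page. */
final class HtmlRenderer implements PlantRenderer {
    public String render(Plant p) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head><title>").append(escape(p.latinName)).append("</title></head><body>\n");
        sb.append("<h1><i>").append(escape(p.latinName)).append("</i></h1>\n<ul>\n");
        for (String name : p.commonNames) {
            sb.append("<li>").append(escape(name)).append("</li>\n");
        }
        sb.append("</ul>\n<p>").append(escape(p.description)).append("</p>\n</body></html>\n");
        return sb.toString();
    }

    /** Escapes the three characters that are unsafe in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

/** Produces a LaTeX fragment from the same Plant; LaTeX escaping is omitted in this sketch. */
final class LatexRenderer implements PlantRenderer {
    public String render(Plant p) {
        StringBuilder sb = new StringBuilder();
        sb.append("\\section*{\\textit{").append(p.latinName).append("}}\n\\begin{itemize}\n");
        for (String name : p.commonNames) {
            sb.append("\\item ").append(name).append("\n");
        }
        sb.append("\\end{itemize}\n").append(p.description).append("\n");
        return sb.toString();
    }
}
```

Code that drives the compilation would depend only on PlantRenderer, so choosing HTML or LaTeX output comes down to which implementation is constructed.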

Writing XML

Another important step in using XML is writing a DTD, or document type definition. A DTD is a precise description of the syntax of a dialect of XML. I wanted a version of XML for describing a weed. A typical weed description looks like this:
6.xml
<?xml version="1.0"?>
<!DOCTYPE PLANTDATA SYSTEM "limon.dtd">
<PLANTDATA>
 <SPECIES ID="6">
<FAMILY>Cucurbitaceae</FAMILY>
<LATIN>Momordica charantia L.</LATIN>
<COMMON>balsam pear</COMMON>
<COMMON>balsam apple</COMMON>
<COMMON>cerasee bush</COMMON>
<COMMON>archucha</COMMON>
<COMMON>balsamina</COMMON>
<COMMON>achochilla</COMMON>
<COMMON>pepinillo</COMMON>
<COMMON>cunde amor</COMMON>
<COMMON>melao de Sao Caetano</COMMON>
<COMMON>carcilla</COMMON>

<TEXT TYPE="DESCRIPTION" SOURCE="Direnzo98">
Vine,  climbs by tendrils.  Leaves are alternate,  soft and lightly
hairy.  Leaves are deeply lobed with five lobes.  (Length about <CM>3</CM>)
Yellow flowers arise from leaf axils as do tendrils.  Flower has five
petals,  bright orange small clusters of pistils and stamen at center.
(Diameter about <CM>1.5</CM>) Pods are oval tapering to a point with rows of
 little spikes,  green turning orange as they mature.  Exploded
pods show bright orange peels and four red seeds.  Inside is sticky.
Pod length (about <CM>2.5</CM>)  Stem is hairy,  very hairy at terminal
end.  Found growing on fence along main road in full sun.
</TEXT>

 </SPECIES>
</PLANTDATA>

I had a vision in my mind of something like the above, but I started out knowing almost nothing about XML. I read the XML Specification and experimented until I had a DTD that did what I wanted. Here it is:

Limon.dtd
<!ELEMENT PLANTDATA ( SPECIES )+>
<!ELEMENT SPECIES ( FAMILY?,LATIN*,COMMON*,TEXT*,CITE*)>
<!ATTLIST SPECIES ID CDATA #REQUIRED>

<!ELEMENT FAMILY ( #PCDATA )>
<!ELEMENT LATIN ( #PCDATA )>
<!ELEMENT COMMON ( #PCDATA )>

<!ELEMENT TEXT ( #PCDATA | A | CM | REF )*>
<!ATTLIST TEXT TYPE CDATA #REQUIRED>
<!ATTLIST TEXT SOURCE CDATA #REQUIRED>
<!ATTLIST TEXT LANGUAGE CDATA "ENGLISH">

<!ELEMENT A (#PCDATA)>
<!ATTLIST A HREF CDATA #REQUIRED>

<!ELEMENT CM (#PCDATA)>

<!ELEMENT REF EMPTY>
<!ATTLIST REF ID CDATA #REQUIRED>

<!ELEMENT IMAGE (#PCDATA)>
<!ATTLIST IMAGE HREF CDATA #REQUIRED>
<!ATTLIST IMAGE SOURCE CDATA "">
<!ATTLIST IMAGE TYPE CDATA "PHOTO">

<!ELEMENT CITE EMPTY>
<!ATTLIST CITE SOURCE CDATA #REQUIRED>
<!ATTLIST CITE PAGE CDATA "">

<!ENTITY Agrave '&#192;'>
<!ENTITY Aacute '&#193;'>
<!ENTITY agrave '&#224;'>
<!ENTITY aacute '&#225;'>
This document type is still a work in progress. For instance, I've only added entities for the accented characters that I'm actually using; it would be nice to have a complete set of them. If you're going to share a document type with other people and expect them to interoperate with you, you should produce something mature and stable. For just learning XML, it's good to experiment.
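One way to experiment with a DTD like this is to let a validating parser report where a document departs from it. The sketch below is an illustration under stated assumptions rather than part of the original page compiler: it uses the validating DOM parser bundled with current JDKs, inlines most of the declarations above as an internal subset so the example is self-contained (the IMAGE and entity declarations are left out), and the class name DtdCheck and its method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdCheck {
    // Declarations copied from the DTD above, wrapped in an internal subset.
    static final String DOCTYPE =
          "<!DOCTYPE PLANTDATA [\n"
        + "<!ELEMENT PLANTDATA ( SPECIES )+>\n"
        + "<!ELEMENT SPECIES ( FAMILY?,LATIN*,COMMON*,TEXT*,CITE*)>\n"
        + "<!ATTLIST SPECIES ID CDATA #REQUIRED>\n"
        + "<!ELEMENT FAMILY ( #PCDATA )>\n"
        + "<!ELEMENT LATIN ( #PCDATA )>\n"
        + "<!ELEMENT COMMON ( #PCDATA )>\n"
        + "<!ELEMENT TEXT ( #PCDATA | A | CM | REF )*>\n"
        + "<!ATTLIST TEXT TYPE CDATA #REQUIRED>\n"
        + "<!ATTLIST TEXT SOURCE CDATA #REQUIRED>\n"
        + "<!ELEMENT A (#PCDATA)>\n"
        + "<!ATTLIST A HREF CDATA #REQUIRED>\n"
        + "<!ELEMENT CM (#PCDATA)>\n"
        + "<!ELEMENT REF EMPTY>\n"
        + "<!ATTLIST REF ID CDATA #REQUIRED>\n"
        + "<!ELEMENT CITE EMPTY>\n"
        + "<!ATTLIST CITE SOURCE CDATA #REQUIRED>\n"
        + "]>\n";

    /** Returns the validity problems found in body; an empty list means it is valid. */
    static List<String> validate(String body) {
        final List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);               // check against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String xml = "<?xml version=\"1.0\"?>\n" + DOCTYPE + body;
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            problems.add("not well-formed: " + e.getMessage());
        }
        return problems;
    }
}
```

Calling DtdCheck.validate on a SPECIES element that lacks its ID attribute, or that lists COMMON before FAMILY, should return at least one message, while a document shaped like 6.xml should return none.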

Downloads

The page compiler is still an experimental program, and I can't say I'm proud of every aspect of it. Still, you can take a look at the source code and the XML files if you wish. Both are copyright © 1998 by Honeylocust Media Systems; the source is available under the GNU General Public License, and all rights are reserved on the XML files. You can get MSXML with documentation and source code from our dear friends in Redmond, or you can download msxml.jar, MSXML in convenient JAR form.

© 1998 Honeylocust Media Systems. Contact: houle@msc.cornell.edu