Chapter 15 of Professional Java Server Programming, © 1999 Wrox Press.

Part 1
[ Part 2 ]
Part 3
Table of Contents
    XML,  SQL and Java Objects
        XML:  the legacy application
        Mapping XML to SQL
            Storing XML fragments in the database
        The Weeds class
        Mapping a relational database to objects
        Creating and loading the database
    URL Rewriting and file layout
                What the pages look like to a web browser
                What's really on the server

XML, SQL and Java Objects

XML: the legacy application

The software that generated Weeds of El Limon was a filter that converted a collection of XML files into HTML files. Thus, I started WEEDS 2 with plant descriptions in XML form, a document type descriptor (DTD) that defines the format of those files, and Java classes that transforms the document tree produced by an XML parser (MSXML) to a set of objects describing the weeds. Naturally, when adopting a database, I wanted to reuse as much of this work as possible.



limon.dtd

<!ELEMENT plantdata ( species )+>
<!ELEMENT species ( family?,latin*,common*,text*,cite*) >
<!ATTLIST species id CDATA #REQUIRED>
<!ELEMENT family ( #PCDATA ) >
<!ELEMENT latin ( #PCDATA ) >
<!ELEMENT common ( #PCDATA ) >
<!ATTLIST common xml:lang CDATA "en">
<!ELEMENT text ( #PCDATA | A | cm | ref )* >
<!ATTLIST text type CDATA #REQUIRED>
<!ATTLIST text source CDATA #REQUIRED>
<!ATTLIST text xml:lang CDATA "en">
<!ELEMENT a (#PCDATA)>
<!ATTLIST a href CDATA #REQUIRED>
<!ATTLIST a xml:link CDATA #FIXED "simple">
<!ELEMENT cm (#PCDATA)>
<!ELEMENT ref EMPTY>
<!ATTLIST ref id CDATA #REQUIRED>
<!ELEMENT cite EMPTY >
<!ATTLIST cite source CDATA #REQUIRED>
<!ATTLIST cite page CDATA "">
<!ENTITY %ISOlat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN//XML" "ISOlat1.pen" >
%ISOlat1;

I needed to make no changes in the DTD , or in the input XML files. To separate identified and unidentified weeds, I simply copied the identified weeds into a directory titled identified-weeds/. (ISOlat1.pen is a file that defines the entities such as &aacute; defined in HTML 4.0, included in the WEEDS2 package available from the tapir or Wrox sites.) This DTD specifies a very rigid file format, in which there is little room for variation. The <plantdata> element, for instance, can contain only a list of <species> elements, which in turn contains <family>, <latin>, <common> and <text> elements in that order. A typical file looks like



19.xml

<?XML VERSION="1.0"?>
<!DOCTYPE plantdata SYSTEM "limon.dtd">
<plantdata>
 <species id="19">
<family>Labiatae</family>
<latin>Leonotis nepetaefolia (L.) R. Br.</latin>
<common>lion's ear</common>
<common xml:lang="es">cordon de fraile</common>
<common xml:lang="es">rubim de bolas</common>
<common xml:lang="es">cordao de frade</common>
<common xml:lang="es">boton de cadete</common>
<common xml:lang="es">molinillo</common>
<common xml:lang="es">quina de pasto</common>
<text type="DESCRIPTION" source="Direnzo98">
Leaves are opposite, on long petioles, very soft, and coarsely toothed
with round teeth. Flowers occur in globes (<cm>2.5-3.4</cm>)
slightly prickly to touch with thin leaves immediately undeneath. Flowers
are orange and tubular with four stamens with white filaments, light
yellow anthers and one pistil. Stem is very rigid and square, plant
is taprooted. Common in fallow fields.
</text>
<cite source="Cardenas72" cite="181"/>
</species>
</plantdata>

Except for the contents of the <text> element, the structure is rigid, and, therefore, easy to map to a relational database.

Along with the DTD and XML files, WEEDS 2 also inherited a class, MSXMLSpeciesFactory, which converted XML plant descriptions into Java objects which implemented the interface Species,



Species.java

public interface Species {
  public String getId();
  public LanguageString getFamily();
  public LanguageString[] getLatin();
  public LanguageString[] getCommon();
  public Text[] getTexts();
};

MSXMLSpeciesFactory creates an implementation of Species called SpeciesImpl, which simply stores information in Java fields,



SpeciesImpl.java

package tapir.weeds.representation;
import java.util.Vector;
class SpeciesImpl implements Species {
  String d_id;
  LanguageString d_family;
  LanguageString d_latin[];
  LanguageString d_common[];
  Text d_texts[];

and several methods for reading attributes of a Species. There aren't any methods for setting attributes, because those are set directly by MSXMLSpecies factory (which can see them because, like SpeciesImpl it's a member of tapir.weeds.repre­sentation: these fields are available to the classes in tapir.weeds, where the beans and supporting classes are defined.)


  public String getId() {
    return d_id;
  };
  public LanguageString getFamily() {
    return d_family;
  };
  public LanguageString[] getLatin() {
    return (LanguageString[]) d_latin.clone();
  };
  public LanguageString[] getCommon() {
    return (LanguageString[]) d_common.clone();
  };
  public Text[] getTexts() {
    return (Text[]) d_texts.clone();
  };
};

The tapir.weeds.representation package also contains several more classes, including LanguageString, an immutable object that contains both a string and the ISO digraph denoting the language of the string (we record common names in both English and Spanish) and Text, an object that describes the content of the <text> element.

Mapping XML to SQL

When we map XML objects to SQL tables, we're confronted with the differences between XML and the relational data model. Relational databases store information in a rigid and inflexible format. A database contains one or more tables, which are collection of rows (or records), which are composed of columns (or fields.) The rigid structure of relational databases is a good thing: it's the mathematical purity of the relational data model that makes it possible to build databases that scale into the stratosphere and meet the needs of global businesses. XML, on the other hand, is much more flexible. Although some XML files, like 19.xml, can be divided into records (<species> elements) and fields (<latin>, <common>, ....) XML can support complex, even recursive, hierarchies. (such as WDDX, see http://www.wddx.org/.) Also XML files can contain free-form text similar to HTML, such as the contents of the <text> element.

Another difference between XML and SQL, for the time being, is that SQL is typed and XML is untyped. A SQL database can understand, for instance, that some columns contain strings and other columns contain numbers, and thus, compare quantities correctly. You certainly can store data of a particular type inside an XML element or in an attribute, and write programs to process it any way you wish, but XML processing software cannot, yet, automatically determine the type of data inside an element or attribute. Some proposals have been made to replace the DTD with a more specific schema that contains type information, such as XML-Data (http://www.w3c.org/TR/1998/NOTE-XML-data)and DCD (http://www.w3c.org/TR/NOTE-dcd/). Both proposals are discussed in the book XML Applications from Wrox Press.



Two kinds of products to aid the integration of XML and databases are forseeable: XML query languages and tools to automatically map XML documents and SQL tables. Although the W3 Consortium has sponsored a conference on XML query languages, no standard has emerged. The closest thing to an XML query language in existence is the extensible stylesheets transformation language (XSLT, see http://www.w3.org/Style/XSL/) To use XSLT as a query language, think of relationships between XML elements and attributes as representing relationships between things in the world. To take an example, in the above 19.xml, the <common> element that contains the text "lion's ear" is a child of a <species> tag for which the ID attribute is 19. This means that "lion's ear" is a common name of plant 19. If we put all of our weed description in one big file, the following XSLT stylesheet selects out the common names of weed 19,




<xsl:stylesheet>
  <xsl:templace match='species[attribute(id)]=="19"/common'>
      <xsl:copy-of>
  <xsl:template>
<xsl:stylesheet/>

with the output


<common>lion's ear</common>
<common xml:lang="es">cordon de fraile</common>
<common xml:lang="es">rubim de bolas</common>
<common xml:lang="es">cordao de frade</common>
<common xml:lang="es">boton de cadete</common>
<common xml:lang="es">molinillo</common>
<common xml:lang="es">quina de pasto</common>

SAXON (http://home.iclweb.com/icl2/mhkay/saxon.html) is an XSLT processor written in Java which can be extended by writing new XSL elements and new XSL functions in Java. SAXON, among other things, can insert data from XML documents into SQL databases. I've written a SAXON script to load the weed database from XML, however, I'm not reproducing it here because SAXON is in rapid development and the current version probably won't work when you get this. Look for it at http://www.tapir.org/weeds/ if you're interested. To be competitive with SQL, an XML query language will need a way to associate types with XML attributes and tags, and, likely, a feature similar to joins in SQL.

In WEEDS 2, I mapped XML to SQL with Java classes written by hand. Oracle, Microsoft, and other players in the database market are working on products to automate this process. As with query languages, with such tools you'll need to specify the types of elements and attributes, and how you'd like to map particular XML elements to particular SQL tables in either a GUI or a specialized programming language.



The first table in our database is species, which holds the master record for each species. It contains the id, the primary key of the plant, and the only attribute of a plant which has a one-to-one relationship with the plant, it's family name:


id

family

VARCHAR(32)

VARCHAR(32)

26

Labiatae

Table 1 Species


However, a species can have more than one latin name. For instance, it might have been discovered independently in different parts of the world 100 years ago, or, at some point botanists might have argued about the classification of the plant. Since there is a one-to-many relationship between species and common names, I created a new table for common names. Here, the column species is a foreign key that points to the species table. Each species is listed once on the top page. This is ensured by the principal column, which designates one name as the name which appears on the top page, the name for which principal=1.


species

name

principal

VARCHAR(32)

VARCHAR(64)

INT

19

Leonotis nepetaefolia (L.) R. Br.

1

Table 2 Latin

I also map the <common> element to a table called common, Each record contains the foreign key species, as well as the common name, in the name column, and the contents of the xml:lang attribute in the lang column.


species

name

lang

VARCHAR(32)

VARCHAR(64)

CHAR(2)

19

lion's ear

en

19

cordon de fraile

es

19

rubim de bolas

es

19

cordao de frade

es

19

boton de cadete

es

19

molinillo

es

19

quina de pasto

es

Table 3 Common


The <text> tag is more complex, partially because the <text> element has more attributes, but more importantly, because it contains free-formatted XML like "is <cm>5</cm> tall, see also <a href='/foo/'/>". Here, we're not interested in questions like, "in what plants is there a measurement of a length greater than 5 cm", or "how many pages contain hyperlinks to yahoo?" The contents of the <text> elements are text, not data, so I just copy them into the text column.


species

type

source

lang

text

VARCHAR(32)

VARCHAR(64)

VARCHAR(64)

CHAR(2)

TEXT

19

DESCRIPTION

Direnzo98

en

Leaves are opposite, on long...

Table 4 Text


The content of the text column, unfortunately, is not a well-formed XML document because a valid XML document must contain a single root element (plus processing instructions). For example,


<greeting>Hello!</greeting>


is a well-formed XML document while


This is <not/> a well-formed XML-document


is not. This is acceptable, because to reconstitute a <text> element from the database, I add the <text> and </text> start and end tags and fill in the attributes of the element from the other columns in the row. I do this to avoid denormalization. Denormalization happens when a piece of information is stored in two different places in the database. For instance, suppose I stored the entire <text> element in the text column, and an application changed the type column, but forgot to change the type attribute of the <text> element in the text column. Later on, one application might use the value stored in the type column, and another might use the value stored in the text field -- they don't agree, and this is a bug.


It's important to avoid accidental or unnecessary denormalization in database designs. Sometimes it's necessary to denormalize a database. Denormalization isn't itself a bug, it just invites bugs by making it possible for the database to hold inconsistent information. There are two limiting types of database application: online transaction processing (OLTP) and data warehousing. Relational databases were invented for OLTP applications, such as airline reservation systems. In OLTP, a large number of clients are simultaneously reading and writing from the database and it's important that everybody share a consistent view of the database -- the probability that two people get assigned the same airline seat should be close to zero. In data warehousing, which is increasing in importance, the database is updated infrequently, and complex and time-consuming queries are run against it. In data warehousing, it's common to create summary tables containing redundant information. If updates are infrequent enough, it's adequate to rebuild the summary tables manually after an update, or triggers can be used to update the summary table automatically after an update.


The cite field keeps track of references to literature that isn't online.


species

source

page

VARCHAR(32)

VARCHAR(64)

VARCHAR(8)

19

Cardenas72

181

Table 5 Cite



Unlike the previous tables, the Images table is not loaded from the XML document, but rather from the collection of images accompanying them. After scanning the images, we reduced the number of colors in the images to conserve download time. We felt the best tradeoff was 32 colors, so images are stored in identified-weeds/32colors/. Images come in two sizes: small images that appear on the individual species pages, and large images for viewing separately. These were stored as 19half.gif and 19.gif respectively. I store .gif images in the BLOB field img. I also store information about the format of the image (its MIME type) as well as its width and height so I can provide the correct width and height attributes for the <IMG> tag. By keeping metainformation on images, I could, for instance, store both a GIF and PNG version of an image, serving the PNG image only to browsers that support PNG. (This is one case where denormalization is justifiable. Even though the format, width, and height of an image can be determined by inspecting the BLOB, it would be a great deal of work. If the database supports user-defined functions, it might be possible to create a function which extracts the width and height from the img column, but it probably wouldn't be worth it.)


Species

type

img

format

width

height

VARCHAR(32)

VARCHAR(64)

BLOB

VARCHAR(64)

INT

INT

19

big

0x3846494702...

Image/gif

555

781

19

small

0x3846494701...

Image/gif

277

359

Table 6 Images



In Weeds of El Limon, we numbered plants in the order we found them. To make Common Weeds more like most plant books, I sorted the weeds on the top page by family name, genus and species. Since users often navigate a site by editing URLs, and because I wanted page numbers in the printed version of Common Weeds, I had to renumber the weeds. I didn't want to throw the old plant numbers away, since I might someday want to look at an entry in my old notebooks. Keying the weeds by the new numbers rather than the old numbers would be a mistake, because if I ever added a new weed into the middle of the alphabetical order, I'd have to change the keys. It would be safest, from the standpoint of database integrity, to sort the weeds every time I want to convert a new weed number to an old weed number, or vice versa. Although it's easy to sort the weeds in SQL,


SELECT species.id FROM species,latin WHERE latin.principal=1 AND species.id=latin.species ORDER BY family,latin.name.


it seemed wasteful to resort the weeds on each query. Instead, I chose to sort the weeds and compute the mapping between new and old numbers just once, after loading the database. Since the new numbers can be computed from the database, this is another case of denormalization, but it's acceptable if the database is changed infrequently. (Although currently the database is so small and the performance demands light enough that it probably would be reasonable to resort on each query.) With a database that supports triggers, it would be possible to have the database reconstruct the new_ordering table whenever the database is updated, however, MySQL doesn't support triggers.


old_number

new_number

INT

INT

19

10

Table 7 SQL table: new_ordering



Storing XML fragments in the database

Next, I'll discuss how I store the contents of the <text> element in the database. Consider the <text> element from 22.xml,



22.xml

<text type="DESCRIPTION" source="Direnzo98">
Tree-like plant, grows about 3m tall. Leaves are
large (<cm>45-60</cm> wide), deeply lobed with five to
seven lobes, leaves are serrate with prominent veins. Flowers occur
in raceme with female flowers above male flowers. Fruits occur in
terminal clusters and are covered with soft spines. Fruits have
three lobes with three seeds inside; beans are dark brown with irregular
white spots -- seeds are known to be poisonous. Pod length is
about <cm>2.8</cm> and bean length about
<cm>1.3</cm>. Found growing in field by fence.
</text>

To avoid needless denormalization, I store the content of the <text> tag in the text columns of the text table and the attributes in other columns. <text> elements are internally represented with instances of tapir.representation.Text. The textToXML() method of Weeds convert a Text object into the XML string, as stored in the text column.



Weeds.java

  public String textToXML(Text t) {
    StringBuffer sb=new StringBuffer();
    TextChunk[] chunks=t.getChunks();
    for(int i=0;i<chunks.length;i++) {
      addChunk(sb,chunks[i]);
    };
    return sb.toString();
  };
   void addChunk(StringBuffer sb,TextChunk chunk) {
    if (chunk instanceof AnchorChunk) {
      addAnchorChunk(sb,(AnchorChunk) chunk);
    } else if (chunk instanceof RefChunk) {
      addRefChunk(sb,(RefChunk) chunk);
    } else if (chunk instanceof CMChunk) {
      addCMChunk(sb,(CMChunk) chunk);
    } else
      addPlainChunk(sb,(PlainChunk) chunk);
  };
   void addAnchorChunk(StringBuffer sb,AnchorChunk chunk) {
    sb.append("<A HREF='"+chunk.getHref()+"'>"+chunk.getText()+"</A>");
  };
   void addRefChunk(StringBuffer sb,RefChunk chunk) {
    sb.append("<REF ID='"+chunk.getId()+"'/>");
  };
   void addCMChunk(StringBuffer sb,CMChunk chunk) {
    sb.append("<CM>");
    sb.append(chunk.getText());
    sb.append("</CM>");
  };
   void addPlainChunk(StringBuffer sb,TextChunk chunk) {
    sb.append(chunk.getText());
  };
Weeds.insertIntoText() stores a Text in the database,
Weeds.java

 private void insertIntoText(String id,Text t) throws SQLException {
    preparedStatement ps=prepareStatement("INSERT INTO text (species,type,source,lang,text) values (?,?,?,?,?)");
    ps.setString(1,id);
    ps.setString(2,t.getType());
    ps.setString(3,t.getSource());
    ps.setString(4,t.getLanguage());
    ps.setString(5,textToXML(t));
    ps.executeUpdate();
  };

It's straightforward to get Text objects out of the database. I didn't even need to rewrite MSXMLSpeciesFactory. . By adding a synthetic <!DOCTYPE> tag, I tell MSXML to expect <text> as a root element rather than <species>. MSXMLSpeciesFactory has a hierarchical structure which reflects the structure of the XML document. An XML parser based on the document object model (DOM) returns a tree of Element objects. MSXMLSpeciesFactory contains methods to parse each element, for instance Species parseSpecies(Element e) parses a <species> tag and Text parseText(Element e) parses a <text> tag. The only change I had to make to MSXMLSpeciesFactory was to make the parseText() method of MSXMLSpeciesFactory public, so I could access it from other packages. The method DBSpecies.parseText(), given type and source information and the contents of the <text> tag, wraps up the contents in start and end tags, feeds the resulting XML document into MSXML to produce a parse tree, and then uses MSXMLSpeciesFactory to convert the parse tree into a text object.


DBSpecies.java

public static Text parseText(String type,String source,String language,String xmlBody) throws ParseException {
    StringBuffer xml=new StringBuffer();
    xml.append("<?XML VERSION=\"1.0\"?>\n");
    xml.append("<!DOCTYPE TEXT SYSTEM \"file:/home/paul/identified-weeds/limon.dtd\">\n");
    xml.append("<TEXT TYPE=\"");
    xml.append(type);
    xml.append("\" SOURCE=\"");
    xml.append(source);
    xml.append("\" xml:lang=\"");
    xml.append(language);
    xml.append("\">\n");
    xml.append(xmlBody);
    xml.append("</text>");
    StringBufferInputStream feed=new StringBufferInputStream(xml.toString());
    Document d=new Document();
    d.setLoadExternal(true);
    d.load(feed);
    Element root=d.getRoot();
    return MSXMLSpeciesFactory.parseText(root);
  };

The Weeds class


A number of system resources, such as the connection to the database, are the same in every page in the Common Weeds system. I concentrated these resources a single class named Weeds, currently I create a new instance of Weeds with each request, but when it comes time to improve performance, I can pool instances of Weeds between requests, thus preserving database connections, prepared statements, and any other resources that I'd like to have last between requests.


Also, I decided to put all of the SQL statements in my application (except for those that create the database) in the Weeds class. Some authorities disapprove of this (such as web database pioneer and photographer Phillip Greenspun, at http://www.photo.net/) because it separates SQL code from the Java (or other language) code that uses it, thus, making software harder to read. It's also annoying to think up a good name for each SQL statement. While writing about this project, I had to split up the Weeds.java file so I could show you methods from Weeds next to the methods of other classes that call them to make the code more understandable. However, I get a few benefits from this decision: First, by concentrating my SQL in one place, I'll only need to change one class, Weeds, if I switch to a different database. Although SQL is supposed to be a standard, you'll never know how many quirks your favorite databases have until you try porting an application. Also, I can make changes in the layout of database tables by altering just the Weeds class. In both cases, queries that are used in more than one place are done by a single methods in Weeds, so I don't need to change SQL statements in two places.


Weeds has a constructor, which simply creates a new database connection. Although Weeds classes could be pooled, Weeds could also be adapted to use the connection pool object from Chapter 11

Weeds.java

package tapir.weeds;
import java.io.*;
import java.sql.*;
import tapir.weeds.representation.*;
public class Weeds {
  public String d_dbname="weeds";
  public Connection d_conn;
  public Weeds() throws ClassNotFoundException,SQLException {
	d_conn=getConnection(d_dbname);
  };

I provide static methods to get connections, so I can reuse them in both the Weeds class and in the CreateDB class which creates the weeds database. (This way when I switch databases, I only need to change one getConnection() method. getRootConnection() is used to create a connection to the database server before the weeds database exists,  so we can create it. The method of doing this is database-dependent.


Weeds.java

  public static Connection getConnection(String dbname) throws ClassNotFoundException,SQLException {
    Class.forName("org.gjt.mm.mysql.Driver");
    return DriverManager.getConnection("jdbc:mysql://localhost/"+dbname,"root","");
  };
  public static Connection getRootConnection() throws ClassNotFoundException,SQLException {
    return getConnection("mysql");
  };

JDBC provides three ways to execute SQL statements: with the Statement class, the PreparedStatement class, and the CallableStatement class. When you use plain statements, the database must recompile the statement each time, which wastes time. Prepared statements are created inside an application which holds a database connection, and live as long as the database connection. They can't be reused with another connection in the same application, or by another application. Callable, or stored, statements are compiled once and stored in the database. Callable statements can be used by multiple applications. Unfortunately, not all databases support stored statements (MySQL doesn't), and the procedure for storing statements in a database depends on the database.


In general, it's easier to use prepared and callable statements rather than plain statements in Java. For example, consider the insertIntoText() method of Weeds,


Weeds.java

  private void insertIntoText(String id,Text t) throws SQLException {
    PreparedStatement ps=prepareStatement("INSERT INTO text (species,type,source,lang,text) values (?,?,?,?,?)");
    ps.setString(1,id);
    ps.setString(2,t.getType());
    ps.setString(3,t.getSource());
    ps.setString(4,t.getLanguage());
    ps.setString(5,textToXML(t));
    ps.executeUpdate();
  };

This class prepares a statement with a number of empty slots to fill, each marked with a question mark. It then fills in the blanks with the setString() method of PreparedStatement, and executes the statement. The method


Weeds.java

  public PreparedStatement prepareStatement(String stmt) throws SQLException {
    return d_conn.prepareStatement(stmt);
  };

of Weeds is a hook -- right now it simply calls the prepareStatement on the database connection. However, because every call to prepareStatement goes through it, it gives me the freedom to change the behavior of prepareStatement. For instance, if I want to save prepared statements between requests (once I'm pooling instances of Weeds) I could keep a Hashtable of prepared statements indexed by strings, and fetch them out of the Hashtable rather than creating them in the database. This hook also provides a strategic place to take statistics on how often various queries are used.


Hypothetically, I could write the same method with the Statement method:



  private void insertIntoText(String id,Text t) throws SQLException {
    Statement s=d_conn.getStatement();
    s.executeUpdate("INSERT INTO text (species,type,source,lang,text) values ('"+id+"','"+t.getType()+"','"
       +t.getSource()+"','"+t.getLanguage()+"','"+t.textToXML()+"')");
  }

allthough this contains fewer characters than the PreparedStatement version, it's the typing equivalent of a tongue-twister, and worse, it won't work all the time. If any of the string, for instance, contained a single quote character, the database would think that the string ended and would return an error. (Or worse, some databases, such as Microsoft's SQL server, let you execute statements in batch mode -- specifying several in the same call to executeUpdate(). If a hacker can feed your web application strings that fool it into feeding two queries to your database, she could possibly alter or destroy your database or extract confidential information.) You can fix this by adding escape characters to your strings, putting a backslash before any backslashes, null characters, or single or double quotes. The PreparedStatement class automatically quotes strings and converts other types such as Java Dates to the appropriate type for the database. (Different databases use different date formats, and messing around with dates in string formats is what got us into the Year 2000 mess.)

PreparedStatement also lets you move chunks of data in and out of BLOB and TEXT fields through InputStreams. Other languages, such as Perl, have better syntax for manipulating strings, where you could do the above INSERT with just


$self->{CONN}->do{"INSERT INTO text(species,type,source,lang,text) values '$species', '$type','$source','$lang','$text'"}


but Java's weak string handling, combined with a strong PreparedStatement class make working with PreparedStatements preferable to working with Statements.

Mapping a relational database to objects

WEEDS 1 had a package, honeylocust.limon.representation, which contained a set of classes to store information about weeds. Other than MSXMLSpeciesFactory, which reads Species objects out of XML files, the classes in this package only have accessor methods to let other classes read values from them. Sensing that someday I might want to change my representation for the weeds, say by storing them in a database, I used Java interfaces to separate the details of the representation layer from other packages that use it. When I first started writing WEEDS 2, I used the original classes out of the honeylocust.limon.representation classes -- until I got tired of ugly warts in the original classes, moved them to tapir.weeds.representation and fixed them up.


The interface Species is implemented by classes that deliver information about weeds. It's an interface, so it can be implemented in different ways. WEEDS 2 contains two implementations of Species: SpeciesImpl (described previously, and carried over from WEEDS 1), which stores information about weeds in RAM, and DBSpecies, which holds only the ID number of a species in RAM and retrieves information about weeds directly from the database each time a method is called.


DBSpecies is a drop-in replacement for SpeciesImpl. Any class that can handle an instance of Species can handle an instance of DBSpecies. Java's access protection hides the reality that DBSpecies is based on a different principle than SpeciesImpl,


DBSpecies.java

DBSpecies.java
package tapir.weeds;
import java.sql.*;
import java.io.*;
import java.util.Vector;
import tapir.weeds.representation.*;
import honeylocust.msxml.om.*;
import honeylocust.msxml.parser.*;
public class DBSpecies {
  public final Weeds d_w;
  public final String d_id;
  public final String d_oldId;

DBSpecies contains only a reference to the database (the Weeds class) and the id of a weed: to get other information, DBSpecies queries the database and converts the results to the appropriate objects. The DBSpecies object is the place in the system where I translate the new id numbers of the weeds, determined by their order on the top page to the old id numbers used in the original XML data. I do it here, because this minimizes the amount of code required.


I construct a DBSpecies by passing in an instance of Weeds, and the id of a weed.


DBSpecies.java

  public DBSpecies(Weeds w,String id) throws SQLException {
    d_w=w;
    d_id=id;
    d_oldId=w.getOldId(d_id);
  };

The Weeds.getOldId() function looks up the old number in the new_ordering table. In this application, it's safe to store the old id in d_oldId since I never update the database after loading. A few accessors just return the id of the Species object,


DBSpecies.java

  public String getId() {
    return d_id;
  };
  public String getOldId() {
    return d_oldId;
  };

A few methods retrieve their values from the database, by running SQL queries that are stored in the Weeds object, then converting the ResultSet returned by JDBC into the Java objects we used in my internal representation.


DBSpecies.java

  public LanguageString getFamily() throws SQLException {
    ResultSet rs=d_w.selectFromSpecies(d_oldId);
    return new LanguageString(rs.getString(1),"la");
  };
  public LanguageString[] getLatin() throws SQLException {
    ResultSet rs=d_w.selectFromLatin(d_oldId);
    Vector v=new Vector();
    while(rs.next()) {
      v.addElement(new LanguageString(rs.getString(1),"la"));
    };
    LanguageString names[]=new LanguageString[v.size()];
    v.copyInto(names);
    return names;
  };

The getTexts() method is trickier, since Text objects are stored in XML form inside a TEXT column of the Text table titled text. We fetch the columns out of the database, and then feed the columns into the method DBSpecies.parseText() which uses MSXML to construct a Text object. The parseText() method is talked about previously in the section on SQL and XML.


DBSpecies.java

  public Text[] getTexts() throws SQLException,ParseException {
    ResultSet rs=d_w.selectFromTexts(d_oldId);
    Vector v=new Vector();
    while(rs.next()) {
      v.addElement(parseText(rs.getString(1),rs.getString(2),rs.getString(3),rs.getString(4)));
    };
    Text texts[]=new Text[v.size()];
    v.copyInto(texts);
    return texts;
  };

DBSpecies only scratches the surface of object-oriented design with relational databases. In this case, the Species class was designed to represent database-like information in the very beginning, so the mapping was easy. It's possible to create Java classes which reflect the structure of a SQL table, which is usually easy, or make SQL tables that reflect the structure of Java classes, which is generally harder. By adding an object-oriented layer over your database, you can distribute access to the database over RMI or IIOP, log access to the database, implement security policies, or work around the absence of triggers in some databases. Enterprise Java Beans (EJB) has support for entity beans, which represent objects in a database or another persistent store.


When mapping a hierarchy of objects that inherit from each other, for instance, it's necessary to decide if you want to (i) create a table for a superclass and additional tables to represent properties of subclasses, (ii) create a table for each class (without any special relationship between superclasses and subclasses) or (iii) create one big table that holds related classes and has a special column designating the type of the column (leaving columns not needed by a particular class NULL.) It can sometimes be faster to load all of the properties out of the database when an object is created, but sometimes we only need some of the properties, and any information stored in RAM runs the risk of becoming stale. Although it's possible to store serialized Java objects in BLOBs, (much like XML fragments can be stored in TEXTs) the database can treat them only as lumps of incomprehensible data. Since object-relational mappings try to hide the underlying database connections, it can be awkward to begin, commit, and roll back transactions, particularly if you're building systems that work with both RAM-based and database-backed objects. Finally, writing constructors for database-backed objects takes some thought, since there are two senses in which you might want to "create" a database-backed object: (i) to create a new instance of an object which represents an existing database record, or (ii) to create an entirely new object, including the database record. It helps here to create objects through factory methods.


Object databases, designed to work with OO languages, were a popular research topic in the early 80's, but have seen only limited commercial use because object databases are harder to query than relational databases, don't have a standard as developed as SQL, and have historically had difficulty scaling beyond 4 GB databases. Java, multimedia, and the internet are reviving interest in Object databases, particularly Computer Associates' Jasmin (http://www.cai.com/products/jasmine.htm) Object-relational databases add features to relational databases to simplify mapping objects to tables: Ingres and the free PostgreSQL (http://www.postgresql.org/) database are early examples, although major databases such as Oracle now support object-relational extensions, which are addressed by the forthcoming SQL 3 standard. A number of products automatically generate Java classes representing SQL tables: Sun's Java Blend

(http://www.sun.com/software/javablend/), ObjectMatter's VBSF (http://www.objectmatter.com/), Thought Inc.'s CocoBase (http://www.thoughtinc.com/cocofree/doc/index.html) and the free product Jora

(http://www.ispras.ru/~knizhnik/Jora-1.02/ReadMe.html)


Creating and loading the database

One of the greatest thrills of SQL programming is accessing your database through the SQL monitor. In MySQL, you can type the command mysql and then proceed to create a database and define tables. This is fun to do the first time you develop a database application, but it gets old fast -- particularly when you need to make a backup copy or migrate your database to a new computer or DBMS. When you build a database application, you must also build a set of utilities for initializing and maintaining the database.


CreateDB simply creates the weeds database, and the tables that make it up. Because we might want to make several databases for testing, CreateDB is a command line application which takes one argument: the name of the database we want to create.


CreateDB.java

package tapir.weeds;
import java.sql.*;
public class CreateDB {
  String d_dbname;
  public static void main(String argv[]) throws Exception {
    CreateDB self=new CreateDB(argv[0]);
    self.run();
  }
  public CreateDB(String dbname) {
    d_dbname=dbname;
  };

We face a problem when we wish to create a database. We need to connect to the database server, but we can't do that without the name of a valid database. In both MySQL and PostgresSQL, you can create a database when you're connected to any database, however, you need to know the name of a database that exists on your system. MySQL uses a database called mysql to store access control lists, so you can always connect to the MySQL database through mysql -- the procedure varies on other databases. We use getRootConnection() to get a connection to create the new database with, then we open a connection to the new database with getConnection().


CreateDB.java

  public void run() throws Exception {
    Connection conn=Weeds.getRootConnection();
    Statement s=conn.createStatement();
    s.executeUpdate("CREATE DATABASE "+d_dbname);
    s.close();
    conn.close();
    conn=Weeds.getConnection(d_dbname);
    s=conn.createStatement();

Next we execute the SQL statements to set up the tables. The SQL isn't hidden away in Weeds.java because it runs before the database exists, so it would be impossible to create an instance of Weeds.


CreateDB.java

    s.executeUpdate("CREATE TABLE species (id varchar(32),family varchar(64))");
    s.executeUpdate("CREATE TABLE latin (species varchar(32),name varchar(64),principal int)");
    s.executeUpdate("CREATE TABLE common (species varchar(32),name varchar(64),lang char(2))");
    s.executeUpdate("CREATE TABLE text (species varchar(32),type varchar(64),source varchar(64),
                                              lang char(2),text text)");
    s.executeUpdate("CREATE TABLE cite (species varchar(32),source varchar(64),page varchar(8))");
    s.executeUpdate("CREATE TABLE images (species varchar(32),type varchar(64),img blob,format varchar(64),
                                              height int,width int)");
    s.executeUpdate("CREATE TABLE new_ordering (old_number int,new_number int)");
    s.close();
    conn.close();
  };
}

With the database created, we're ready to fill it with information. InsertWeeds, given a list of XML files containing weed descriptions, inserts them into the database. I simply type


% cd identified-weeds

% java tapir.weeds.InsertWeeds *.xml

The class is simple since most of the thinking is done by other classes.


InsertWeeds.java

package tapir.weeds;
import tapir.weeds.representation.*;
import java.io.File;
import java.util.Enumeration;
import java.util.Vector;
public class InsertWeeds {
  public static void main(String argv[]) throws Exception {
    Weeds w=new Weeds();
    Vector v=new Vector();
    for(int i=0;i<argv.length;i++)
      MSXMLSpeciesFactory.parseLimon(v,new File(argv[i]));
    for(Enumeration e=v.elements();e.hasMoreElements();) {
      Species s=(Species) e.nextElement();
      w.insertWeed(s);
    }
    w.reorderWeeds();
  }
}

Weeds.java

  public void insertWeed(Species s) throws SQLException {
    if (!s.identified())
      throw new IllegalArgumentException("Weed not identified");
    String id=s.getId();
    insertIntoSpecies(id,s.getFamily().toString());
    LanguageString l[]=s.getLatin();
    for (int i=0;i<l.length;i++) {
      insertIntoLatin(id,l[i].toString(),i==0);
    }
    l=s.getCommon();
    for (int i=0;i<l.length;i++) {
      insertIntoCommon(id,l[i]);
    }
    Text t[]=s.getTexts();
    for (int i=0;i<t.length;i++) {
      insertIntoText(id,t[i]);
    }
  };

which calls methods that look like


Weeds.java

  private void insertIntoSpecies(String id,String family) throws SQLException {
    PreparedStatement ps=prepareStatement("INSERT INTO species (id,family) values (?,?)");
    ps.setString(1,id);
    ps.setString(2,family);
    ps.executeUpdate();
  };

The method insertIntoText() was discussed previously in the Section on "XML and SQL," because it converts the Text object back into SQL before storing it in the relational database.


The last task InsertWeeds does is compute the new sort order of weeds: sorted first by family name and then by genus and species. Fortunately, you can get SQL to output results in almost any order you like -- this is done as follows


weeds.java

  public void reorderWeeds() throws SQLException {
    Statement s=d_conn.createStatement();
    PreparedStatement ps=prepareStatement("INSERT INTO new_ordering (new_number,old_number) VALUES (?,?)");
    ResultSet rs=s.executeQuery("SELECT species.id FROM species,latin WHERE latin.principal=1
                        AND species.id=latin.species ORDER BY family,latin.name");
    int i=1;
    while(rs.next()) {
      // System.out.println(i+" "+rs.getInt(1));
      ps.setInt(1,i);
      ps.setInt(2,rs.getInt(1));
      ps.executeUpdate();
      i++;
    };
  };

URL Rewriting and file layout

WEEDS 2 uses Apache's URL rewriting to present a static-looking appearance to a dynamic page. I do this because I make a static copy of my site with a web crawler, so every URL must be a file name that could exist on a static site.

What the pages look like to a web browser

To a web browser, the weeds/ directory has the layout


index.html
common.html
error.gif
mountains.jpg
weed/
   1.html
   2.html
   ...
big/
   1.gif
   2.gif
   ...
small/
   1.gif
   2.gif
   ...

from the viewpoint of web clients, such as browsers as well as web crawlers and cache programs, it's exactly like a static site, and we can use a web crawler to make an exact copy.

What's really on the server

The contents of the weeds/ directory on the server are entirely different from what the browser sees. None of the web documents exist in the weeds/ directory, rather it looks like


.htaccess
jsp/
   common.jsp
   copy.jsp
   error.jsp
   footer.jsp
   index.jsp
   weed.jsp
   error.gif
mountains.jpg
weeds2.jar

error.gif and mountains.jpg are the only files which get served to the client. The jsp/ directory contains our Java Server pages, but these URLs are never seen by the client. Client URLs are converted into server URLs by rewriting rules specified in the .htaccess file. The .htaccess file contains server configuration commands specific to a directory. Most of the directives that can be put in a httpd.conf file, to, for instance, rewrite URLs, define new MIME types, or put a password on a directory, can be put in an .htaccess file. When setting up security in the httpd.conf file you can use the AllowOverride directive to specify exactly what directives can be used in .htaccess files in particular directories and their subdirectories. The .htaccess file is convenient, but if you're administering a web server with a large number of users who you don't completely trust: such as a departmental web server, it's only possible to exert central control on security policy if .htaccess is disabled or restricted.


The .htaccess file in weeds/ looks like:

.htaccess

RewriteEngine on
RewriteBase /weeds/
RewriteRule ^$ /weeds/jsp/index.jsp
RewriteRule ^index\.html /weeds/jsp/index.jsp
RewriteRule ^common\.html /weeds/jsp/common.jsp
RewriteRule ^weed/([0-9]+)\.html /weeds/jsp/weed.jsp?weed=$1
RewriteRule ^small/([0-9]+)\.gif /servlets/tapir.weeds.View Weed/small/$1.gif
RewriteRule ^big/([0-9]+)\.gif/servlets/tapir.weeds.ViewWeed
/big/$1.gif
RewriteRule ^weeds2.jar /weeds/jsp/weeds2.jar [forbidden]

the first two lines turn on URL rewriting and tell the URL rewriting module what directory you want to have rewritten. The rest of the file consists of rewriting rules. The format of a rule is



ReWriteRule search replace [options]

The first argument, search is a regular expression. Most characters, such as letters and numbers, match themselves. Other characters have special meanings. For instance, ^ matches the beginning of a URL, . Matches any character, and the expression [0-9] matches any digits. The * operator matches zero or more instances of the previous character, and + matches one or more. The character \ escapes the character that follows it, negating its special meaning (the following character matches itself), and the parenthesis characters groups several characters together. The character $ matches the end of the line, thus


RewriteRule ^$ /weeds/jsp/index.jsp

maps http://www0/weeds/ to http://www0/weeds/jsp/index.jsp and the rule


RewriteRule ^index.html /weeds/jsp/index.jsp

matches http://earth/weeds/index.html so the web server instead serves the URL at http://earth/weeds/jsp/index.jsp. Unlike URL redirection, where the web browser is instructed to try a different URL, the web browser sees no sign that URL rewriting takes place -- all processing takes place on the server. The next rule maps individual weed pages to the JSP page, taking the id of the weed as a GET argument passed after "?".


RewriteRule ^$ /weeds/jsp/index.jsp

This rule matches URLs such as


http://earth/weeds/weed/5.html
http://earth/weeds/6034.html

but doesn't match URLs such as


http://earth/weeds/.html
http://earth/weeds/54a3.html

The parenthesis in the rule groups the digits together, so in the case of http://earth/weeds/6034.html, $1=6034. This value is substituted into the replace string, so the URL http://earth/weeds/weed/5.html maps to http://earth/weeds/jsp/weeds.jsp?weed=5. The next two rules map image URLs to the image servlet, mapping, for instance, http://earth/weeds/big/5.gif to http://earth/servlets/tapir.weeds.ViewWeeds/big/5.gif. This points out a disadvantage of JSP relative to servlets and CGI. A servlet, such as http://earth/servlets/tapir.weeds.ViewWeeds/ , can have a path appended to it, such as big/5.gif. This makes it possible for a single servlet (or CGI script) to serve an entire "virtual directory tree" underneath the servlet URL. Unfortunately, you can't do this with JSP, so you can only pass arguments as form elements, namely, weed=5.gif.


In the last rule, I use the forbidden option to prevent web users from accessing the weeds2.jar; in this case I've got nothing to hide, since I'm making the source code available on the web and telling you all about it. In general web users should not be allowed to download server software from your server. Often, servlets, cgi-scripts, and other web programs contain passwords (for users to log in, or for a database) and other confidential information -- as Java class files can be decompiled, a hacker can get a head start at cracking your site by getting a copy of your server-side applications, so it's important to lock them down. I put weeds2.jar in the weeds/ directory so all of WEEDS 2 would be one convenient package for you to install. If I was feeling paranoid, I'd put it entirely outside of the web server directory.


This is a bare introduction to URL rewriting. Many more features are supported (such as serving different pages to different web browsers, serving pages off of other servers by proxy, and using external programs to decide which page to redirect to) and there are traps and pitfalls to avoid which you can read about in the online documentation at http://www.apache.org/docs/mod/mod_rewrite.html In particular, URL rewriting is much faster when the directives are stored in the central httpd.conf file rather than in a per-directory .htaccess, but for our development system which doesn't get many hits, convenience wins over raw speed.


Part 1
[ Part 2 ]
Part 3
Table of Contents

Chapter 15 of Professional Java Server Programming, © 1999 Wrox Press.
Produced by Honeylocust Media Systems, contact paul@honeylocust.com