Weeds of El Limon 2
A little more than a year ago, we visited the village of El Limon in the Dominican Republic to help string power lines for a micro-hydroelectric generator.
However, making Weeds of El Limon meant making 32 nearly identical web pages for 32 weeds. Writing the 32 pages once seemed reasonable, but what if I decided I didn't like the look and wanted to change it? The XML specification had just been approved, so I got the idea to write the weed descriptions in XML and write a program to convert them into attractive HTML. As an early application of XML, Weeds of El Limon attracted attention from the XML community, and we wrote about it in Chapter 12 of XML Applications from Wrox Press.
A year later, we'd received e-mail from people interested in the weeds. Some had typed "prickle poppy" into a search engine and found our description, and we heard from a Peace Corps volunteer who was about to visit the Dominican Republic. We asked ourselves, "How can we make our site more useful to people interested in plants?" Weeds of El Limon had problems: because we hadn't taken the right books with us, we weren't able to identify many of the plants, in particular the grasses. Also, our plants were numbered in the order in which we found them, not the conventional alphabetic ordering. We'd gotten the rights to put a list of recommended street trees online, and making an improved Weeds of El Limon would be a way to test ideas for a more powerful publishing system.
To make the new site effective, we'd
need a clear mission. We thought about the Peace Corps volunteer who
wrote us, and, being able to find almost nothing else about
tropical weeds online, decided that we could best help people like
him out by making a brief introduction to weeds in the Caribbean.
Our sampling procedure wasn't comprehensive or scientific: we just walked out of the schoolhouse we were staying in and dug up the first new weed we saw. This meant, however, that we observed and recorded the most ubiquitous species -- the ones you'd notice right after you stepped off the plane. Therefore, we bundled the 14 weeds we'd identified as Common Weeds of the Caribbean.
Common Weeds contains three kinds of database-generated page. Each plant has an individual page, and there are two index pages: an index by common name and a top page indexed by Latin name. When Common Weeds is completed, there will also be a few static pages with information about the authors, the software, and books about tropical weeds.
Common Weeds is a simpler site than the original Weeds of El Limon. In the original, with 32 weeds, I needed separate pages to make indices by common name, by Latin name, and by family. With just 14 weeds, it was practical to link to each weed from the top page, eliminating the need for separate indices by family and Latin name. Since we put more information on the top page, users need fewer clicks and the site is easier to use.
Beneath the surface, the HTML is simpler. In Weeds of El Limon, I used a table cell background to color the bar at the top of the page, making white letters on a black background. I first tried this with the BGCOLOR attribute of the <TD> tag and with <FONT COLOR>, but the result on Netscape 2, which doesn't support colored table cells, was disastrous: white letters on a white background are impossible to read. To solve this problem, I used cascading style sheets (CSS) to set the cell and text colors. Since then, our staff artist discovered just how bad the CSS implementations in Netscape 4 and IE 3 are -- and that, often, it is much better to use old-fashioned, conservative HTML that works, even if it does make the blood of the W3 Consortium boil. This time, to avoid trouble with table-cell backgrounds, I chose fail-safe colors. In a browser without table-cell backgrounds, the pale green bar at the top is white -- still readable with black text.
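The fail-safe idea can be sketched in markup. This is illustrative only, not the site's actual HTML; the point is that the text color works against both the intended background and the default white page:

```html
<!-- Illustrative sketch. Where table-cell backgrounds work, the bar
     is pale green with black text; in a browser that ignores BGCOLOR
     (such as Netscape 2), the black text still reads against the
     default white page. -->
<table width="100%">
  <tr>
    <td bgcolor="#ccffcc">
      <font color="black"><b>Common Weeds of the Caribbean</b></font>
    </td>
  </tr>
</table>
```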
Like the top page, the common name index is information-rich. For the roughly 80% of web users with screens 800x600 or larger, all of the names fit on one screen. Although I generate the page dynamically from a database, I set the break points of the columns by hand to guarantee an effective layout.
Compared to the indices, the changes to the individual weed pages were minor. Although some pundits think that page numbers are obsolete on the web, I decided to keep numbering the weeds. In the original, we numbered weeds by the order in which we found them; now they're numbered in the order in which they appear on the top page. I added a navigation widget that lets viewers jump to any number with a single click, to imitate the "feel" of a book.
Weeds of El Limon was a simple filter: it took a collection of XML files as input and created a set of static HTML files that I could put on my server. Afterwards, I built several database-driven sites and got hooked on the ability to provide multiple views of information stored in a database, and on the ability of two or more people to collaborate on maintaining a database-driven site. For the first phase of Common Weeds, I took advantage of the existing XML format. Rather than building a system to interactively add and update weed descriptions in the database, I could simply import them, in XML format, into the database. If I need to change the descriptions in the short term, I can edit the XML and reload the database. This way I could quickly develop a database-driven site, and in the future I can add pages to edit the database directly. Since the original Weeds of El Limon software, WEEDS 1, was written in Java, it seemed natural to use servlets and JSP, since I'd be able to reuse some of the software and design.
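The import step boils down to parsing each weed's XML and turning it into an SQL statement. Here is a minimal sketch of that idea; the element names (weed, latin, common) and the table layout are invented for illustration -- the real WEEDS DTD and schema may differ -- and a production importer would use a JDBC PreparedStatement rather than concatenating strings:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;

public class WeedImportSketch {
    // Parse one hypothetical weed description and build an INSERT.
    public static String toInsert(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Element root = doc.getDocumentElement();
        String latin = root.getElementsByTagName("latin")
                           .item(0).getTextContent();
        String common = root.getElementsByTagName("common")
                            .item(0).getTextContent();
        // Invented table and column names, for illustration only.
        return "INSERT INTO species (latin_name, common_name) VALUES ('"
               + latin + "', '" + common + "')";
    }

    public static void main(String[] args) throws Exception {
        String xml = "<weed><latin>Argemone mexicana</latin>"
                   + "<common>prickle poppy</common></weed>";
        System.out.println(toInsert(xml));
    }
}
```

Reloading the database then amounts to running the importer over every XML file again, which is why short-term edits to the descriptions stay cheap.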
For each of the three page types, there is a Java Server Page and a Java Bean. For instance, to generate individual weed pages, WEEDS 2 uses weed.jsp and WeedBean. The Java Server Page is, mostly, an HTML template filled with information it gets from its corresponding Bean. There are a few advantages to this division. If I make a change in a Java class, I have to recompile the .java file and, possibly, restart the servlet engine. (Some servlet engines, such as Apache JServ and JRun, can be configured to automatically reload class files that change. Others, such as Sun's Java Server Web Development Kit, cannot.) JSP files, however, are compiled into Java servlets by the JSP engine. As long as the JSP file stays the same, the JSP engine reuses the servlet; when you change the JSP file, the JSP engine detects this and recompiles. Thus, it's as easy to edit a JSP file on your server as it is to edit an HTML file. By defining the look of a page in a JSP, I can make simple changes without the hassle of recompiling. And by hiding complex Java methods inside beans, the Java left inside the JSP stays simple and stereotypical, generally of the form <%= bean.getProperty() %>. This means that JSPs can be written and edited by people who know graphic design and HTML, while the complexities of Java, databases, and object-oriented design are left to programmers.
A one-to-one mapping between JSPs and beans is only one possible design. You could, if you like, have multiple JSPs access the same bean, or have a page incorporate more than one bean. Java Beans can be used as reusable components across a site, to create navigation bars, or to insert advertisements. In my case, there were some methods (for instance, those that display certain objects in HTML form) and properties (such as the copyright notice) that were common to all the pages. I could have created an additional bean shared by all the pages, but instead I made IndexBean, CommonBean, and WeedBean subclasses of GeneralBean to share these functions.
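The superclass arrangement might look something like the sketch below. The property names and the copyright text are placeholders, not necessarily what WEEDS 2 uses; the point is that a JSP reaches inherited and subclass-specific properties the same way, e.g. <%= bean.getCopyright() %> and <%= bean.getLatinName() %>:

```java
public class GeneralBean {
    // Shared by every page type on the site. The text here is a
    // placeholder; the real notice would live in one place like this.
    public String getCopyright() {
        return "Copyright notice placeholder";
    }
}

class WeedBean extends GeneralBean {
    private final String latinName;

    public WeedBean(String latinName) {
        this.latinName = latinName;
    }

    // Page-specific property, alongside everything inherited above.
    public String getLatinName() {
        return latinName;
    }
}
```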
In addition to web pages, WEEDS 2 serves images of the weeds. The images are stored in .gif format inside BLOB columns in the database and are served by the ViewWeed servlet. I use a servlet here because ViewWeed simply retrieves the image from the database and sends it over the network verbatim, without filling in a template.
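The core of a ViewWeed-style servlet is nothing more than a verbatim byte copy from the database's BLOB stream to the HTTP response stream. This helper is a sketch of just that copy loop; the servlet wrapper, the JDBC lookup, and setting the image/gif content type are omitted:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    // Copy every byte from in to out, unchanged, and return the count.
    // In the servlet, "in" would come from the BLOB column and "out"
    // from the HTTP response.
    public static long copy(InputStream in, OutputStream out)
            throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }
}
```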
The beans and servlet depend on supporting classes, in particular DBSpecies, which provides an object-relational view of a single species in the database, and Weeds, which is the gateway to the database. In some sense, Weeds is the central class of the application, since it holds the database connection and all of the SQL statements used to access the database. Weeds also contains utility methods that I want to share throughout the application. Currently, I create a new instance of Weeds for each web request. Although Weeds itself is lightweight (about 32 bytes), it takes time to create the database connection. This is acceptable for now, because WEEDS 2 is fast enough for what I do with it.
If I need to speed my application up, I've got two options. The simplest is to change Weeds so that it gets its connections from an existing, mature, and efficient connection pool, such as the one presented in Chapter 11. This would be a snap, since Weeds encapsulates the connection -- I wouldn't have to change a line of code elsewhere. If, however, WEEDS becomes a more complex application that, say, accesses more than one database or pools additional resources, it would also be possible to pool the Weeds class itself. Pooling Weeds wouldn't be very different from pooling connections, although, unlike a connection pool, I couldn't simply copy the code off the net. Since, ultimately, only a single thread at a time can obtain an object from a pool, the time cost of pooling is determined by the time it takes to obtain a lock (that is, to call a synchronized method). Since it costs the same to pool a single object that contains references to N objects as it does to pool any other object, wrapping multiple objects that need to be pooled in a single object leads to a design that can be scaled to higher performance.
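A minimal pool along these lines could look like the following sketch. It is written with generics for brevity (which postdate the original WEEDS code), and a production pool would also handle growth limits, waiting, and stale objects:

```java
import java.util.ArrayList;
import java.util.List;

public class SimplePool<T> {
    private final List<T> idle = new ArrayList<T>();

    public SimplePool(List<T> initial) {
        idle.addAll(initial);
    }

    // The synchronized lock is the pool's only serialization point.
    // Its cost is the same whether T wraps one resource or N of them,
    // which is why bundling pooled resources into one object scales.
    public synchronized T acquire() {
        if (idle.isEmpty()) {
            return null; // a real pool might wait or create a new T
        }
        return idle.remove(idle.size() - 1);
    }

    public synchronized void release(T obj) {
        idle.add(obj);
    }
}
```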
Locating all of the SQL in the Weeds class has other consequences. One disadvantage is that SQL statements are declared in a file far away from the places where they are used, which makes the code harder for a newcomer to read. Each SQL statement is wrapped in a method, and it's a pain to think up a good name for each statement. On the other hand, since all of the SQL is in one place, all of the dependence on a particular database is in one place. Although SQL is supposed to be a standard, you'll discover many foibles in your database when you try to port an application from one database to another. If, for instance, I find the application doesn't work with database XYZ, I can make a subclass, XYZWeeds, which fixes the incompatibilities. This would also be a way to take advantage of special, nonportable features of a particular database, such as stored procedures, which could improve performance. With the SQL in one place, I can also change the names of columns and tables and make other changes to the structure of the database without affecting the rest of the program. (I've already done this several times.)
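Concretely, the subclassing trick might look like this. The method name, table, columns, and the dialect "fix" are all invented for illustration; the pattern is what matters -- each statement lives behind a method, so a subclass can patch only the statements that break:

```java
public class WeedsSql {
    // Every SQL statement hides behind a method with (one hopes)
    // a descriptive name.
    public String selectSpeciesByIdSql() {
        return "SELECT latin_name, common_name FROM species WHERE id = ?";
    }
}

class XYZWeedsSql extends WeedsSql {
    // Hypothetical override for a database whose dialect needs a
    // different form of the same query.
    public String selectSpeciesByIdSql() {
        return "SELECT latin_name, common_name FROM species WHERE id = ? "
             + "FOR READ ONLY";
    }
}
```

The rest of the program asks the Weeds object for its SQL and never notices which subclass it is talking to.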
Organizing a server application in several layers makes your application more flexible for the future. Because the JSPs, beans, and supporting classes form three distinct layers, it would be possible to, if desired, move one of the layers onto another server using Java RMI or IIOP. For large business applications, Enterprise Java Beans (EJB) provides a standard interface for application servers that provide services such as transaction management, security, distribution and persistence. I'll talk about EJB more when I discuss the Java Beans in this application.
I wish I could install the WEEDS 2 software and database on our web server in San Francisco and work on both the descriptions and the software from my computers at work and at home. Unfortunately, in Germany, where I currently live, residential phone calls are charged by the minute. Between high costs, a busy modem pool, and a congested transatlantic cable, it's not practical to work online.
To cope with this problem, we keep a master copy of each of our web sites on our development server in Germany, each on a virtual host in our home LAN. We don't have a database or servlet engine in San Francisco, but instead, we create a static copy of the site (plain HTML and GIF files) on our machine in Germany and install the static copy on our web server. This technique can't be used for sites which process complicated queries (such as full-text search) or that can be modified by users (such as a site with a message board). However, our static site can be stored on a floppy and viewed on computers without a network connection.
There are two steps to copying our site to our server. First, on our machine in Germany, we make a static copy with a web crawler, specifically pavuk. Pavuk crawls through a web site and makes a collection of directories and files that mirror the original site. Then we mirror the static copy of our site to our web server using rsync.
To copy our dynamic site with a web crawler, we need to make it look, to web clients, like a static site. That is, there can't be any .jsp, .cgi, or .asp files, or cgi-bin directories. Also, we can't pass parameters via the GET or POST methods (that is, no URLs that end like weed.html?id=7.) Although it's possible to write a servlet which pretends to be a directory, and reads the path after the servlet name (such as servlet/small/4.gif), JSPs can only read parameters via GET or POST. To get around this, I use Apache's mod_rewrite module to transform static URLs like weeds/weed/4.html into URLs with GET parameters, such as weeds/jsp/weed.jsp?id=4. URL rewriting is similar to servlet aliasing, but is much more powerful: mod_rewrite can use external programs to transform URLs, use mod_proxy to pass a request to another web server (this was useful in early testing, when the only JSP 1.0 compatible engine didn't work with Apache) or even distribute requests between multiple servers.
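The rewrite described above can be expressed in a few lines of Apache configuration. This is a sketch for the server or virtual-host config; the exact paths depend on where the JSPs are mounted, and the PT flag is there so the rewritten URL is passed on to the servlet engine's handler rather than served from disk:

```apache
RewriteEngine on
# Turn static-looking /weeds/weed/4.html into /weeds/jsp/weed.jsp?id=4
RewriteRule ^/weeds/weed/([0-9]+)\.html$ /weeds/jsp/weed.jsp?id=$1 [PT,L]
```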
There are other reasons to make a dynamic site look static. Major search engines and other robots often refuse to index sites that look dynamic, stopping when they see URLs that look dynamic (.jsp, .cgi, .asp, /cgi-bin, etc.), because some sites could send a crawler through millions of dynamically generated pages, wasting the crawler's time and bandwidth while overloading the dynamic site. Thus, many worthwhile dynamic sites lose the hits they could get by being indexed. If a finite subset of a database-driven site is worth indexing, making it look static can increase your traffic and help people find a useful resource. Also, when you hide the mechanics of your site, you make it a little harder for hackers to take advantage of published and unpublished weaknesses of your server software.
Next, I'll talk about the hardware and software that Common Weeds depends on and how to set up a system to run it. Our development system is a 350 MHz Pentium II with 64 megabytes of RAM, running Debian Linux 2.0. I use Apache as a web server, with the optional mod_rewrite module compiled in (it is included with the Apache source but disabled by default; see http://www.apache.org/), as well as the Apache JServ servlet engine. To add support for JSP, I use Sun's reference implementation.
Configuration depends on your web server and servlet engine. After getting Apache, JServ, and the JSP reference implementation working, I had to make two changes to run my application. First, I had to add the mm.mysql driver to the permanent classpath of the servlet engine.
Second, the Java class files (servlets, beans, and supporting files) are packaged in a JAR file, which I add as a JServ repository. Unlike the permanent classpath, JServ monitors files in the repository and reloads them when they change, so I can develop without restarting the servlet engine.
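In JServ these two changes live in its property files. The fragments below are illustrative -- the jar paths are made up, and property names and file locations can vary between JServ versions, so check the jserv.properties and zone properties files of your installation:

```properties
# jserv.properties -- classes here go on the permanent classpath
# and are NOT reloaded when they change:
wrapper.classpath=/usr/local/lib/mm_mysql.jar

# zone properties -- repositories ARE monitored and reloaded:
repositories=/home/weeds/weeds.jar
autoreload.classes=true
```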