Three Faces of XML in Zope

Services, Documents, Datastores

02/02/2000

Extra Bits

• Jon Udell is the author of "Practical Internet Groupware." He's that rare person who manages to see the entire field of computing with a rather unbiased view and who can speak from the experience derived from doing real work.

• See also our interview with Paul Everitt of Digital Creations, one of the founders of Zope.

• O'Reilly Editor-in-Chief Frank Willison went to the Python Conference and wrote a series of dispatches: Day 1, Day 2, Day 3

[ This is the text of a talk given on Tue Jan 25 2000, as the keynote for the Zope track of the 8th International Python Conference. ]

Introduction

I was reading an online discussion somewhere the other day, and somebody asked: "What's the best platform for building collaborative Web-based software?"

Somebody else answered: "There are really only two choices, Domino and Zope."

Five years ago, this would have been unthinkable. There would have been no way that a small team of script-language programmers could have put together an application-development platform that anybody would even think of comparing to Lotus Notes.

But the world has changed, and it's changed in precisely the ways that level the playing field for Zope. The key factors are: Internet standards; nearly-universal Web dialtone; object-oriented, network-aware scripting languages like Python; and most of all, a culture of open-source software development that thrives because not only the tools are freely exchanged, but also the knowledge of how and why to use them.

I think that what makes Zope so interesting to so many people -- me included -- is that it grew up in this environment. Zope didn't have to be retrofitted with Web support. It was built from the ground up to use -- and extend -- the Web.

Central to Zope's mission is its various kinds of support for XML, and that's the focus of my talk today. I'll admit that when Paul Everitt asked me to come and speak here, I was reluctant. After all, I'm a guy who's done almost all his Web programming in Perl and Java, and only recently begun to explore the world of Python and Zope. But after Paul and I talked for a while, I realized that we share the same vision for the future of software, and the future of the Web. It's a vision that looks beyond the parochial rivalries of our time: Windows vs Linux, Perl vs Python, Microsoft vs everybody else. When you focus on the big picture, I think there are just three things that matter:

a network-services architecture
the document interfaces through which people interact with these network services
the datastores that underlie these network services

Zope's architecture addresses all three of these points, and in each case, XML can play a crucial role.

A network-services architecture

Let's start by unpacking the notion of a network-services architecture, as it relates to the Web. To a remarkable degree, today's Web already is a vast collection of network services. So far, these services are mainly browser-oriented. My browser "calls" a service on Yahoo to receive a page of a directory. Or it "calls" a service on AltaVista to receive a page of search results.

One of the nicest things about the Web, however, is that browsers aren't the only things that can call on the services offered by websites. Programs written in any URL-aware language -- including Python, Perl, JavaScript, and Java -- can "call" these Web services too. To these programs, the Web looks like a library of callable components. What's more, it's very easy to build new Web services out of these existing components, by combining them in novel ways. I think of this as the Web's analog to the UNIX pipeline.

My favorite example of this is a thing I call the Web mindshare calculator. It's a script that ranks the sites listed in a Yahoo category according to the number of links in the AltaVista index that point to those sites. For example, in the Yahoo category that includes Zope there were 250 sites listed when I ran this script the other day. The top-ranked site was Sausage.com (that's the company that makes the HotDog HTML editor). IBM alphaWorks was ninth. Zope came in pretty strongly at 30th. Vignette.com was 37th, and midgard-project.com, had it been included in that Yahoo category, would have ranked 200th.

To do this analysis, my script regards Yahoo as a service that offers a kind of namespace traversal API -- but really, it has to create that API itself, by recursing on nodes of the Yahoo directory as they're expressed in the HTML pages that Yahoo produces. There's no formal API that you can call directly to ask for the list of sites underneath some node of the directory.
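The shape of such a script can be sketched roughly as follows. This is not the original mindshare calculator: the sample page, the host names, and the link counts are invented, and `count_inbound_links` is a hypothetical callback standing in for a query against a search engine's link index. The recursion over directory nodes is left out; the sketch only shows scraping site links out of one HTML page and ranking them.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_sites(html):
    """Return the distinct host names of the absolute links in a page."""
    collector = LinkCollector()
    collector.feed(html)
    hosts = []
    for href in collector.hrefs:
        host = urlparse(href).netloc  # empty for relative links
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def rank_by_mindshare(hosts, count_inbound_links):
    """Sort hosts by inbound-link count, highest first.

    count_inbound_links is supplied by the caller (hypothetical here).
    """
    return sorted(hosts, key=count_inbound_links, reverse=True)


# Invented sample data standing in for a directory page and a link index.
SAMPLE_PAGE = """
<ul>
  <li><a href="http://www.example.org/">Example Org</a></li>
  <li><a href="http://www.example.com/products">Example Com</a></li>
  <li><a href="/Computers/Software/">a relative directory link</a></li>
</ul>
"""
SAMPLE_COUNTS = {"www.example.org": 120, "www.example.com": 4500}

ranked = rank_by_mindshare(external_sites(SAMPLE_PAGE), SAMPLE_COUNTS.get)
print(ranked)
```

The point of the sketch is how much of the work is scraping: the "API" is whatever structure can be recovered from the HTML.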

What about the Netscape Open Directory project? It's true that you can download the whole thing as an XML file, but there's no API for retrieving just parts of the tree, or unrolling a subtree. Zope's support for XML-RPC, and its forthcoming support for SOAP (Microsoft's Simple Object Access Protocol), is the kind of thing that's going to blow this game wide open.

Since the birth of the Web, it's been the case that every Web application automatically exposes all its callable methods to HTTP clients. Zope isn't unique in this regard, though it is unusual in the richness of the set of such methods that it exposes in this way. Everything has a URL in Zope, right down to the individual elements of a parsed XML document.

I call this model the first-generation object Web. Scripts call URLs, which invoke actions, which return HTML pages, which can be processed by scripts, which can then call other URLs.

But Zope is helping to create what I call the second generation object Web. I got a taste of what this will be like when I built a Zope-based affiliate that received news feeds from Dave Winer's UserLand site. At the time, Dave was using XML-RPC to send these feeds to affiliates. Every hour, he gathered up his new stories into a Frontier data structure, turned that into an XML-RPC packet, and called a Zope method on my server, passing that data as an argument to the method. On my end, the data automatically appeared as a Python list-of-lists, which I unpacked and stuffed into a database. Distributed computing doesn't get any easier than this! The plumbing is literally invisible, and you can spend all your time doing the real work -- namely, putting the data to use.
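To make the "invisible plumbing" concrete, here is a minimal sketch using the XML-RPC support in today's Python standard library (`xmlrpc.client`), which is not the code used in the UserLand feed. The method name `receive_stories` and the story data are assumptions made up for illustration. The sketch marshals a list-of-lists into an XML-RPC request and unmarshals it again, without touching the network.

```python
import xmlrpc.client

# Invented sample feed: one inner list per story.
stories = [
    ["Story one", "http://www.example.com/one"],
    ["Story two", "http://www.example.com/two"],
]

# Sender side: wrap the data as the single argument of a method call.
# "receive_stories" is a hypothetical method name.
packet = xmlrpc.client.dumps((stories,), methodname="receive_stories")

# Receiver side: the XML packet turns back into native Python values.
params, method = xmlrpc.client.loads(packet)
received = params[0]

print(method)
print(received)
```

Over a real connection, the sender would typically go through `xmlrpc.client.ServerProxy`, which does the marshalling and the HTTP request in one call; the receiver still sees an ordinary list-of-lists.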

There's no question in my mind that this general approach to distributed computing is going to be wildly popular. But you don't have to take my word for it. Ask Dun and Bradstreet. Last year, they reengineered their whole online business around the idea of wrapping up back-end data sources and middle-tier components in XML interfaces that are accessible via HTTP or HTTPS.

Traditionally, D&B customers bought packaged reports, not raw data, and the creation of a customized report was a slow and painful process. The new scheme turned the old one upside down. It defined an inventory of data services, and empowered developers to transact against those services using an XML-over-HTTP protocol. In their case, the mechanism was based on webMethods' B2B server, but conceptually that's not too different from XML-RPC. Prior to last year, developers who needed custom D&B feeds had to ask D&B to create them, which took forever, and then had to maintain private network links to D&B and speak a proprietary protocol over those links. In the new scheme, D&B publishes a catalog listing a set of data products. A developer fetches the catalog over the Web, using any XML-oriented and HTTP-aware toolkit, and writes a little glue code to integrate that feed into an application.

So where's Zope in all this? Today, despite its very cool XML-RPC support, Zope is largely designed to emit HTML to browsers. It's clear where things are headed, though. In a great article for XML.com, Amos Latteier showed how a wxPython client can remotely manage a Zope server using XML-RPC. Admittedly, the vision's a bit ahead of the reality here, since you can't pass Zope objects over XML-RPC, and since Zope's management interface is still heavily browser-oriented.

For example, when I call manage_addFolder, what comes back isn't an XML packet containing a status, along with perhaps a reference to the new folder object that was added. What I get back instead, no matter whether I make that call using HTTP or XML-RPC, is the HTML text of the management page to which Zope redirects the browser that requests this action.

In his article, Amos acknowledges these limitations. He suggests that SOAP will enable richer communication of Zope objects across network channels -- in effect becoming a kind of RMI for Zope. But more interestingly, from my perspective, he mentions that Zope's API will be retooled to be less HTML-centric.

Personally, I'm fine with XML-RPC for the near future. It's dead simple, and that's just the way I like it. The things that SOAP might enable aren't the things I need to get my job done. In particular, I'm not too interested in recreating RMI for Zope, though I'm sure something like that would be useful to some people. The problem with RMI is that it presumes Java everywhere, and I don't think we're going to have Java everywhere, or Zope everywhere, or anything everywhere. What I hope we're going to continue to have is a diverse set of Internet platforms, each with special strengths, and all able to converse with one another using simple, easily-scriptable Internet protocols. What matters most to me is Amos' second point -- that Zope needs to grow an API that's fully XML-centric rather than HTML-centric.

As Web programmers, we're all in the game of creating -- and using -- network services. A Web server running CGI scripts makes a pretty shabby ORB, but compared to what was available before 1994, it's a wonderful thing. It will get a lot more wonderful with some fairly basic improvements. AltaVista, and Yahoo, and every other site that offers services to the Web ought to be implementing those services, internally, using XML interfaces. When the client is a browser, the XML can be rendered into HTML. When the client is a program, the XML can be delivered directly to that program. Today, the programs that consume these XML interfaces are typically other back-end services. But just in the last few weeks, there have been two dramatic demonstrations of what can happen when the user's client becomes XML-aware. One is the new ZopeStudio product, which enables Mozilla to run a Zope management interface that's driven by XML rather than HTML. The other is Amos' wxPython-based Zope client, which is elegantly simple, surprisingly fast, and immediately useful.
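The idea of one XML interface serving both kinds of client can be sketched in a few lines. Everything here is an assumption for illustration: the `<results>` vocabulary, the URLs, and the `deliver` function are invented, and a real service would pick the representation from the HTTP request rather than from a boolean flag.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented XML vocabulary for a page of search results.
RESULTS_XML = """<results query="zope">
  <hit url="http://www.example.org/a">First hit</hit>
  <hit url="http://www.example.org/b">Second &amp; last hit</hit>
</results>"""


def render_html(xml_text):
    """Transform the XML results into an HTML list for a browser."""
    root = ET.fromstring(xml_text)
    items = [
        '<li><a href="%s">%s</a></li>'
        % (escape(hit.get("url")), escape(hit.text))
        for hit in root.findall("hit")
    ]
    return "<ul>%s</ul>" % "".join(items)


def deliver(xml_text, client_is_browser):
    """Return (content type, body) suited to the kind of client."""
    if client_is_browser:
        return "text/html", render_html(xml_text)
    return "text/xml", xml_text  # programs get the XML untouched


print(deliver(RESULTS_XML, client_is_browser=True)[1])
```

The service is written once, against the XML; the HTML is just one rendering of it.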

The general notion of XML as an interface description language isn't rocket science, and there's huge leverage still to be gotten out of even something as simple as XML-RPC, never mind SOAP. We can all imagine how to build network services this way, and I'm sure there are people in this room who are already doing it. The challenge is to embed this mindset directly in a toolkit, so that everything you build with it is not only a trivial first-generation Web object, but also a more powerful second-generation Web object that delivers its services equally well to people and to other Web objects. XML-RPC and eventually SOAP will be key enablers, but what's also needed is a vision of how and why to integrate those things into a platform. I think Zope has the right set of ideas, and I'm eager to see where it takes them.

Document interfaces

So we have a bunch of network services, isolated from humans and from software by some kind of XML layer. How are people going to use those services? What the first-generation Web taught us is that documents aren't just things we read, they're the interfaces through which we interact with network services. We still call these things "pages," but on the Web, a page isn't just a place where I read, it's a place where I maintain my calendar, or order books, or manage my investments, or exchange messages with other people.

What we used to call software applications, we now increasingly call Web pages. One reason for this, clearly, is the power of the Web's thin-client model of computing. Why should I install software, when I can just bookmark it?

But the thin-client benefit isn't the whole story. Web applications are just different than conventional GUI applications. On the Web, a document has no boundaries. It can connect to anything, and aggregate anything, that's available on the Web. It's now possible to surround the user interface of a software application -- the checkboxes and the pushbuttons that I use to order a book, or manage my investments -- with a wealth of contextual information. The book page links me to reviews of the book, or to other books in the same category. The investment page links me to analyst reports, realtime data, historical charts.

The challenge here becomes organizing all that contextual information, and presenting it in useful ways. "Information architecture" is the buzzword that's used to describe this emerging discipline. Zope, from the beginning, was a power tool for the information architect. It encourages you to build a site whose structure maps to the structure of your content, and when you do that you get all sorts of powerful inheritance effects (or should I say, acquisition effects?). Equally important, it encourages an object-oriented approach to your data. Documents such as news items, or Confera messages, are strongly typed. They can extend the behaviors of their ancestral documents, and they can interact intelligently with ZCatalog. In the world of electronic publishing, where I spend a lot of my time, this object-oriented approach to documents is not yet well understood. But there's intense competitive pressure to build document interfaces on the Web that are highly personalized, that aggregate data from multiple sources, and that summarize complex relationships among data components. Object publishing is the way to get these things done, and of course that's what Zope is about.

So where does XML fit into this picture? Well, a common misconception I hear all the time is that XML will enable "smart searching." That's kind of silly. What enables smart searching is effective information architecture. You can achieve that without any special tools or languages, if you know what you're doing, just like you can write object-oriented code in C or BASIC if you know what you're doing. Or you can use XML, and screw it up. But XML can make it easier to do effective information architecture, just like Python or Java can make it easier to do object-oriented programming.

The real significance of XML is that it extends the object system down into the document. The document's no longer just a blob of data with some metadata attributes that hook it into the containing object system. It's a mini object system in its own right, with a well-defined structure right down to the tiniest element, and scriptable methods that can access and transform and rearrange those elements. You can see this quite dramatically in Zope when you create a document of type 'XML Document' and then inspect the resulting tree of parsed elements.
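Outside Zope, the same "document as a tree of objects" view can be approximated with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` rather than Zope's 'XML Document' type, and the sample document and the path notation are invented. It walks a parsed document and gives every element a path-like address.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = (
    "<article><title>Three Faces</title>"
    "<body><p>One</p><p>Two</p></body></article>"
)


def walk(element, path=""):
    """Yield (path, text) for an element and all of its descendants."""
    here = path + "/" + element.tag
    yield here, (element.text or "").strip()
    for child in element:
        yield from walk(child, here)


nodes = list(walk(ET.fromstring(DOC)))
for path, text in nodes:
    print(path, repr(text))
```

Each element is an object in its own right: it can be addressed, read, modified, or moved independently of the rest of the document.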

What's this good for? Well, it's true that for a while to come, we'll continue to need to transform this stuff into HTML for delivery into browsers. But as we create and store more of our documents in XML, we vastly improve our ability to add value to the resulting HTML pages. It's tempting to think that the limitations of HTML are what's keeping us from making richer and smarter document interfaces, and that as soon as XSL and SVG become pervasive, all our problems will be solved. I'm not holding my breath waiting for that to happen, though, because I can't help noticing that CSS -- to cite one painful example -- has been widely deployed since 1996, and still isn't universally acceptable on the Web.

The real bottleneck, I think, is lack of granular control over content. That's why I use XML. For example, one of my clients is a magazine, and I store all of its online content as XML, for delivery as HTML. It's very simple XML, just XHTML really, but the point is that I can reliably parse the content, transform it into richly-interconnected sets of Web pages, and vary that transformation at will to meet all sorts of rapidly-changing requirements.

An XML Document isn't just a header, containing structured metadata, and a body containing random stuff. The body has structure too. Most Web content managers haven't thought much about this kind of micro-structure. When they think in terms of information architecture, they consider the macro-structure of a site -- its functional areas, and how classes of documents relate to those areas. But documents do have internal structure, and you can leverage it to powerful effect. For example, the lead paragraph in a news item can also function as a teaser that appears in a summary view -- if you've tagged the lead paragraph so that you can later find and reuse it.
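The lead-paragraph example might look something like this. The markup convention (a `class='lead'` attribute on the first paragraph) and the news items are assumptions invented for the sketch, not a description of any particular site's content.

```python
import xml.etree.ElementTree as ET

# Invented news items; the lead paragraph is tagged with class='lead'.
NEWS_ITEMS = [
    "<item><title>Zope track opens</title><body>"
    "<p class='lead'>The Zope track began today.</p>"
    "<p>More detail follows.</p></body></item>",
    "<item><title>XML-RPC demo</title><body>"
    "<p class='lead'>A client managed a server remotely.</p>"
    "<p>Background.</p></body></item>",
]


def teaser(item_xml):
    """Return (title, lead paragraph text) for one news item."""
    item = ET.fromstring(item_xml)
    lead = item.find("body/p[@class='lead']")
    return item.findtext("title"), (lead.text if lead is not None else "")


# A summary view reuses each item's lead paragraph as its teaser.
summary = [teaser(item_xml) for item_xml in NEWS_ITEMS]
for title, lead in summary:
    print("%s: %s" % (title, lead))
```

Without the tag, the summary view would have to guess where the lead ends; with it, the reuse is a one-line lookup.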

Or consider the collection of How-To's on the Zope.org site. As a member of Zope.org, I can create a new How-To, and categorize it by level of difficulty and topic. ZCatalog can search the resulting document in fulltext mode, or by way of the attributes I've assigned it. This is a great idea, but it only goes so far. Really, this is still the old model in which a document combines a structured header with an unstructured body. Now suppose there were a DTD for writing How-To's, along with an authoring environment that made it easy to instantiate that DTD. Suppose there's an element of the DTD called DtmlCodeFragment. Suppose that I can query the How-To collection for these DtmlCodeFragment elements, in conjunction with some specific piece of DTML syntax, say the <DTML-TREE> tag. The result would be an extremely powerful view that would draw together, on a single page, many different examples of uses of the tree tag. Imagine what that search results page would look like, then go to the zope.org site, search for "tree tag," and observe what you get -- a list of titles which reveals nothing about which of the underlying documents contains useful examples of the DTML tree tag.
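A query of that kind can be sketched with a plain XML parser. No such DTD exists in the source, so the `HowTo`, `Title`, and `Para` element names, the sample How-To's, and the DTML snippets inside them are all invented; only `DtmlCodeFragment` comes from the text. The function returns just the code fragments that use a given piece of DTML syntax.

```python
import xml.etree.ElementTree as ET

# Invented How-To documents following a hypothetical DTD.
HOWTOS = [
    """<HowTo level="beginner">
  <Title>Building a site map</Title>
  <Para>Show the folder hierarchy with the tree tag.</Para>
  <DtmlCodeFragment>&lt;dtml-tree&gt;&lt;dtml-var id&gt;&lt;/dtml-tree&gt;</DtmlCodeFragment>
</HowTo>""",
    """<HowTo level="beginner">
  <Title>Printing a title</Title>
  <Para>The tree tag is not needed here.</Para>
  <DtmlCodeFragment>&lt;dtml-var title&gt;</DtmlCodeFragment>
</HowTo>""",
]


def find_fragments(howtos, needle):
    """Return (title, code) pairs whose code fragment contains needle."""
    matches = []
    for xml_text in howtos:
        root = ET.fromstring(xml_text)
        title = root.findtext("Title")
        for fragment in root.iter("DtmlCodeFragment"):
            code = fragment.text or ""
            if needle in code.lower():
                matches.append((title, code))
    return matches


tree_examples = find_fragments(HOWTOS, "<dtml-tree")
for title, code in tree_examples:
    print(title, "->", code)
```

Note that the second How-To mentions the tree tag in its prose but is correctly left out: the query looks only inside code fragments, which a fulltext search cannot do.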

Programmers learn by example, but usually, no single example will suffice. We need to collect all the variations on a theme, and consider them side by side, in order to build up the most complete mental model of the system. I think that most Zope newbies would agree that it's the difficulty of getting that mental model right that's the biggest obstacle to becoming productive in Zope. So information micro-architecture, if I can coin that term, is strategic for the Zope community, as well as for most of the problem domains in which we're inclined to use Zope.

Reusable content, at a fine level of granularity, is a game that we're all still learning how to play. Zope's 'XML Document' feature invites content managers to join that game. It's not a panacea, I'll admit. The embedded XML parser, expat, won't validate your content, and although you can arrange for ZCatalog to index and search your XML data, there's no support yet for XQL querying. But the immediate challenge is simply to get people over the XML activation threshold, by making it easy to start creating, and using, XML documents. I think Zope is well-positioned to do that. Since I'm a big believer in the "eat your own dogfood" principle, I'd suggest it might help to start managing Zope's own documentation in a more granular way -- including How-To's, Tips, and a lot of the message traffic that's currently happening outside of Zope on various lists.
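The difference between parsing and validating is easy to see with Python's own binding to expat (`xml.parsers.expat`). The sketch below is separate from Zope's embedded parser, and the sample documents are invented. A document that violates its own DTD still parses, while a document that is not well-formed is rejected.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if expat can parse the text without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Well-formed, but invalid: the DTD requires <title>, not <bogus/>.
INVALID_BUT_WELL_FORMED = (
    "<!DOCTYPE howto ["
    "<!ELEMENT howto (title)>"
    "<!ELEMENT title (#PCDATA)>"
    "]>"
    "<howto><bogus/></howto>"
)

# Not well-formed: <title> is never closed.
NOT_WELL_FORMED = "<howto><title></howto>"

print(is_well_formed(INVALID_BUT_WELL_FORMED))
print(is_well_formed(NOT_WELL_FORMED))
```

A non-validating parser will catch broken markup, but enforcing the structure a DTD describes has to happen somewhere else, such as in the authoring tool.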

Data stores

My third theme is the data stores that support network services. Like every Web application server, Zope certainly knows how to play nicely with SQL engines. But unlike most, Zope comes with a native, object-oriented data store: ZODB. So when you're building a Zope-based service, you can store the data in a relational database, or an object database, or some combination of the two.

It's worth noting that ZODB isn't the only imaginable object database that Zope and its applications could sit on top of. There are several commercial object databases, such as ObjectStore and POET, which come with what are called "bindings" to object-oriented programming languages, typically C++ and Java. In the case of Java, what this means is that your Java Hashtable, or Vector of Vectors, can persist in one of these databases.
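The flavor of a language binding, where native data structures simply persist, can be suggested with Python's standard `shelve` module. This is only a stand-in: `shelve` is a simple pickle-backed key/value store, not ZODB and not a commercial object database, and the stored data is invented.

```python
import os
import shelve
import tempfile

# Store the data in a throwaway directory for the purposes of the sketch.
path = os.path.join(tempfile.mkdtemp(), "objects")

# Invented nested structure: a dictionary holding a list of lists.
feeds = {"userland": [["Story one", "http://www.example.com/one"]]}

# First session: assign the native structure to a key; no schema,
# no mapping layer.
with shelve.open(path) as db:
    db["feeds"] = feeds

# Later session: the same structure comes back as ordinary Python objects.
with shelve.open(path) as db:
    restored = db["feeds"]

print(restored)
```

What a real object database adds on top of this is what matters at scale: transactions, concurrency, and the ability to query into the stored objects.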

Last year, Object Design noticed that XML's structures map nicely to the object data its engine can store, query, and manage. And, like other ODB vendors, it repositioned its product as an "XML data server."

The version of ObjectStore that works with XML is called eXcelon. Once eXcelon has parsed and stored a chunk of XML, you can do some really powerful things with it. For example, you can query with XQL -- leveraging all of its powerful syntax for dealing with tree-structured data. There's also an experimental update language, analogous to SQL UPDATE, that you can use to declaratively alter nodes in place, create new nodes, and remove nodes.
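The operations described here (query a node, alter it in place, create a node, remove a node) can be imitated on an in-memory tree. The sketch below uses neither XQL nor eXcelon's update language: it relies on the limited XPath subset in Python's `xml.etree.ElementTree`, the updates are ordinary imperative calls rather than declarative statements, and the catalog data is invented.

```python
import xml.etree.ElementTree as ET

# Invented catalog of data products.
catalog = ET.fromstring(
    "<catalog>"
    "<product id='p1'><name>Report A</name><price>10</price></product>"
    "<product id='p2'><name>Report B</name><price>25</price></product>"
    "<product id='p3'><name>Report C</name><price>40</price></product>"
    "</catalog>"
)

# Query: select one node by attribute value.
p2 = catalog.find("product[@id='p2']")

# Alter a node in place.
p2.find("price").text = "30"

# Create a new node under the selected one.
ET.SubElement(p2, "note").text = "price revised"

# Remove a node.
catalog.remove(catalog.find("product[@id='p1']"))

remaining = [product.get("id") for product in catalog.findall("product")]
print(remaining)
print(ET.tostring(p2, encoding="unicode"))
```

An XML data server offers the same kinds of operation against persistent, shared storage rather than a tree that lives in one process's memory.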

This is a far cry from XML-RPC. Does it make sense to regard XML not only as an interchange medium, but as a storage discipline? And if so, are XML and object databases really the natural partners that they might appear to be?

I don't think anybody really knows whether XML should evolve into a full-fledged data-management discipline, or if so, how. But it does seem clear that there's going to be a ton of XML content in the world, and that relational databases are not as naturally suited to the storage and management of that content as are object databases.

So where does Zope fit in here? I've said that XML is a way to extend the object system down into your documents. Zope's 'XML Document' feature already does that. It would be cool to see Zope offering some of the advanced features now available in commercial object databases -- like XQL querying and declarative updating of XML content. But it would also be cool if Zope could hook directly into those other object databases, just as today it can hook into Oracle or Sybase. I'm not sure just how that would work, but if anybody can figure it out, I'll bet Jim Fulton can. Why might this matter? Because network services increasingly want to use complex data that's painful to shoehorn into relational stores. Persistent object storage is one of the most wonderful features of Zope, and there ought to be options ranging from ZODB to Berkeley DB all the way up to industrial-strength commercial object databases.

Fini

To wrap this up, I'd like to thank Paul Everitt for inviting me -- a mere Zope newbie -- to give this talk. I'll admit that I've had a kind of a love/hate relationship with the product as I've gotten to know it over the last few months, but Zope and I are having more good days than bad days lately, and there's no denying I've caught the Zope fever.

Here's how somebody described the experience to me:

"It starts as a benign web server with online content management... and turns into an insidious learning curve that will destroy your every waking hour. That is, until you uncover the secret that lets you do what you wanted it to in the first place."

It's true that Zope has its secrets, but it's also true that Zope takes the 'open' in open source very seriously. XML is an important part of the story. It can help ensure that Zope's network services, document interfaces, and datastores remain open, and in the right ways. The whole story hasn't been written yet, but I'm enjoying it so far, and I can't wait to read the next chapter.