
A CMS as a pile of semi-structured data

Paul Everitt and I have long been communicating about the role of XML in the CMS world. Recently, he posted the following blog entry and asked for my opinions by email. I started writing a mail back to him, but then I realized I have a blog now too...

Firstly, I want the UdellCMS too. :) Infrae has been trying to build one for a while, staggering along the way.

Now for some comments on Paul's vision. Within Infrae I have the reputation of being a devil's advocate, and I've done it to Paul before, so here goes with my comments.

Paul mentions, referring to the demo he links to, that:

it isn't a "programming language" (though it sure looks like one)

This one is never going to fly. It's a programming language all right, though a highly declarative one. It's going to take someone with programming skills to write these. The advantage, similar to SQL, is reporting: you do not need to know a lot of APIs in order to extract data out of the system.

Now to what Paul and I both agree is the more interesting bit:

What's more interesting, IMO, is the effect this has on CMS design. It totally changes your approach to navigation. Instead of thinking very hard about folder structures or topic structures, you just throw everything into a big pile and let stored semi-structured and full-text queries create smaller piles. This allows numerous approaches to site navigation.

I think that this approach opens more possibilities than some traditional CMS approaches have offered, because of its non-API-oriented query/reporting nature. This also ties in to various content repository projects, such as Infrae's Railroad and content repository APIs like JSR-170.

But, I also think that you cannot just give up (up-front) thinking about things like folder structures or topic structures, or structured document content, for a number of reasons.

While the files in the referenced demo are perhaps in one big pile, these files contain a lot of structure that can be exploited. It's just not represented as folders and topics. Someone has to put this structure in there.
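To make that concrete, here is a minimal sketch in Python of a "smaller pile" query over a flat pile of XML. Everything in it is an assumption for illustration: the document format, the `section` element with its `class` attribute, and the `smaller_pile` function are hypothetical and are not taken from the demo Paul links to. The point is only that the query works because someone put that `class` attribute there in the first place.

```python
import xml.etree.ElementTree as ET

# A hypothetical "pile": a flat list of documents, no folders or topics.
PILE = [
    "<doc><title>Silva release notes</title>"
    "<section class='howto'><h2>Install</h2></section></doc>",
    "<doc><title>Meeting minutes</title>"
    "<section class='minutes'><h2>Agenda</h2></section></doc>",
    "<doc><title>Kupu setup</title>"
    "<section class='howto'><h2>Configure</h2></section></doc>",
]


def smaller_pile(pile, section_class):
    """Return the titles of documents that contain a section of the given class."""
    titles = []
    for source in pile:
        root = ET.fromstring(source)
        # ElementTree supports a limited XPath subset, including
        # attribute predicates such as [@class='howto'].
        if root.find(f".//section[@class='{section_class}']") is not None:
            titles.append(root.findtext("title"))
    return titles


print(smaller_pile(PILE, "howto"))
```

If an author forgets the `class` attribute, or spells it differently, the document silently drops out of the smaller pile, which is exactly the consistency problem.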

How will you get (non-computer-savvy) people to produce such structured content? I.e. how do you get enough consistency to actually be able to do smart queries like the ones demonstrated? Jon Udell alone can be trusted to produce semantic XHTML, but a whole organization? With Silva, we took a lot of care so that we get some semi-structured content out of it all (Silva XML).

In addition, organizations use structuring techniques like folders and (mandatory) metadata to have something to hold on to and some coherence in the produced content. You can use folders for authorization, for instance. You also need some uniformity in the topics used by people, and often organizations want to mandate this.

So, some structuring facility seems unavoidable in a CMS. Wikipedia has shown that even minimal facilities can lead to grand things, but wikis aren't right for all use cases.

Where the real merit lies is in giving up some APIs and going more to declarative data. In one way that's just moving the problem. But in another way, what you can gain is that you can move some CMS-style tasks away from a potentially limiting set of APIs to a more powerful query & reporting model. In other words, you manufacture serendipity; you make it easier to reuse content in new ways. It all being done in a standardized way (the host of XML standards) is cool of course.

We've been doing things like this at Infrae and with Silva for quite a while now, and we're trying to open it up more. It's not easy and the benefits are sometimes hard to see, but we're trying. But we can extract PDFs and Word documents from Silva, and with custom apps on top of Silva we can expose and relate data in a lot of interesting ways. When you start turning it into a real applicat And we'd like to do more in the future, as we're opening up Silva to accept more kinds of XML content.


Comments:

Posted by Paul Everitt at Wed Mar 02 2005 15:05

Quick rejoinder...you're right that you still have to work at getting structure in. But, the work for that can be baked into forms, not components. And changing your mind about the particular structures and meaning in your content doesn't involve a component change and software upgrade. Just a form.

Posted by Martijn Faassen at Wed Mar 02 2005 16:00

Some comments to your comment, Paul. :)

What form technology is powerful enough to make a non-technical user input semi-structured data, right now? With Silva, far more is necessary (either the XMLWidgets-based HTML forms editor, or Kupu with extensive transformation logic) than just forms, and that's just for a few types of document XML. We've got years of experience doing that kind of stuff. Changing "just a form" implies the complexity levels in doing so are lower than they really are.

That said, having the content in XML definitely helped us, as we could add in Kupu later during Silva development without having to redo the underlying content APIs significantly. XForms on the horizon should make life somewhat easier still, but I don't expect it'll ever be so *easy* that the word 'just' applies to real world settings. :)

The point where I'm really dubious is on avoiding the need for content upgrades. If you do *not* upgrade your content, you shift the problem to the code that tries to extract information from your content. Instead of dealing with a pool of uniform semistructured data it will have to know about "before" and "after". This is arguably harder to maintain than doing an upgrade.
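A rough Python sketch of the trade-off, under invented assumptions: the two document formats (`author` before, `metadata/creator` after) and all function names are hypothetical, made up only to show the shape of the problem.

```python
import xml.etree.ElementTree as ET

# Hypothetical "before" and "after" formats for the same information.
OLD_DOC = "<doc><author>Martijn Faassen</author></doc>"
NEW_DOC = "<doc><metadata><creator>Martijn Faassen</creator></metadata></doc>"


def author_without_upgrade(source):
    """Extraction code that has to know about every format ever used."""
    root = ET.fromstring(source)
    for path in ("metadata/creator", "author"):
        value = root.findtext(path)
        if value is not None:
            return value
    return None


def upgrade(source):
    """One-time content upgrade: move the old author element to the new place."""
    root = ET.fromstring(source)
    old = root.find("author")
    if old is not None:
        root.remove(old)
        metadata = ET.SubElement(root, "metadata")
        ET.SubElement(metadata, "creator").text = old.text
    return ET.tostring(root, encoding="unicode")


def author_after_upgrade(source):
    """Extraction code over uniform, upgraded content: one path only."""
    return ET.fromstring(source).findtext("metadata/creator")


print(author_without_upgrade(OLD_DOC))
print(author_after_upgrade(upgrade(OLD_DOC)))
```

With one format change the list of paths in `author_without_upgrade` is short. It grows with every change, and so does every other query that touches the same information.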

Again, I think these ideas and technologies can make a lot of sense. I do think they belong more in an intermediary layer between app server and CMS (or other particular web apps) than the outer CMS layer itself. It makes the CMS foundations more flexible, which is good, but I think the CMS layer will have to add back in some structure and structuring to make it work for end users.

Google Maps is perhaps a good example. Somewhere lurking underneath is a pile of XML, but there's a layer on top that transforms this XML into a structured user interface, deliberately limiting the possibilities for an end-user.

Usability often needs limitation, and in this sense there's some tension with the goals of flexibility and extensibility.

The challenge is to somehow do both at the same time.

Posted by Michel Pelletier at Thu Mar 03 2005 18:44

Hey guys, I got thrown into this a bit and did not follow the initial thread of discussion, but I have a few bits to toss in. Chris M and I spoke a few months ago about such a CMS, where everything was a "blob" in a pile and any hierarchical or other information layout imposed on the data was a view constructed from meta-data. Is this the UdellCMS? I don't have the time to track Jon's stuff and I really should.

This discussion happened just as I was beginning to prototype a semantic web catalog for Zope 3 (http://zemantic.org/) and organizing many different views of a big pile of blobs based on their meta-data is a prime use case. Of course it's not the only use case of the semantic web, but it follows from the strength of its design.

Posted by Martijn Faassen at Thu Mar 03 2005 19:23

Hey Michel. It is open to interpretation what the UdellCMS is; perhaps we should call it the PaulCMS instead. :)

I think Paul's idea isn't so much about outside metadata but more about extracting structural information from a pile of semistructured (XML) data. Provide nice structured overviews and tables of contents while the content is actually just a pile of XML, by smart querying and reporting.

The idea is also that this pile of XML documents can have all kinds of varieties of vastly or subtly different XML documents added to it, as the reporting is smart enough to make sense of it. This as opposed to upgrading your content model all the time if you add a new feature or change your document format.

Handwaving these ultra-smart queries into existence is where I think this model has some trouble. The other trouble is that not everybody can be expected to produce semi-structured data at all, and that good input tools for such users are a complicated topic.

That doesn't mean I'm rejecting the idea, as there's lots of potential, I'm just offering feedback to see whether we can't flesh it out better.

Imposing structure on a pile of unstructured information by outside metadata, such as RDF and topicmaps, is also very interesting, and I think could be complementary to extracting structure from content. This may actually be a useful way to start filling up some of the gaps in the concept.
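As a very small sketch of what outside metadata could look like, here is a Python version using plain subject/predicate/object tuples. This is a simplified stand-in and not real RDF or topic maps; the document ids, predicates and function names are all hypothetical.

```python
# Hypothetical metadata kept outside the documents themselves,
# as (subject, predicate, object) triples.
TRIPLES = [
    ("doc1", "topic", "zope"),
    ("doc2", "topic", "xml"),
    ("doc3", "topic", "zope"),
    ("doc1", "author", "paul"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def view_by(triples, predicate):
    """Build one view of the pile: group document ids by the value of a predicate."""
    view = {}
    for subject, _, obj in match(triples, predicate=predicate):
        view.setdefault(obj, []).append(subject)
    return view


print(view_by(TRIPLES, "topic"))
```

The documents never change here; a new view only needs new triples, which is why this could complement extracting structure from the content itself.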

I'll be checking out zemantic as soon as I can get some time.

Posted by Michel Pelletier at Thu Mar 03 2005 19:43

Yeah, ultra smart queries are a pain, I have been researching a whole range of query languages and the smarter they get the harder they bite. This gets into the whole AI thing and I came dangerously close to thinking the best query language for the semantic web was Prolog. Not surprisingly there are a lot of people out there championing that idea, after all Prolog was the first predicate logic language.

Less accurate, but perhaps more useful, are natural language queries using natural language toolkits like NLTK. Show the user a handful of natural language query syntax examples and let them work with the system interactively in their native language.

By the time you get time to check out zemantic, I might have time to check out Five. I keep telling people that they'll get their Zope 2 version of zemantic as soon as I figure Five out. ;)


Unless otherwise noted, all content licensed by Martijn Faassen
under a Creative Commons License.