From: Kurt Cagle [kurt@kurtcagle.net]
Sent: Sunday, March 02, 2003 2:23 AM
To: Terry Brooks
Subject: Re: Paper about the Web

All in all a very nice paper.

I have a few thoughts about the feasibility of creating a meta-data layer for 
the web, brought about by some comments within your paper.

The intrinsic notion of the semantic web as laid out by Tim Berners-Lee makes 
a fundamental (and I suspect fundamentally wrong) assumption: that a URL 
should map to a specific configuration of information - a document - into 
perpetuity. This notion fell apart with the introduction of CGI, and indeed 
has no real representation in what we currently call the web. Perhaps it is 
more appropriate to think of all loci in URL space as pointing not to 
specific documents but instead to functions; some functions produce a 
constant answer, others may produce wildly differing information depending 
upon a number of external parameters.
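
To put that in rough code terms (a toy sketch of my own, with made-up URLs
and names, nothing standardized), you might model each locus as a function of
its invocation context rather than as a stored document:

    import time

    # Toy sketch: a locus in URL space modeled as a function of its
    # invocation context, not as a pointer to a fixed document.

    def headlines(context):
        # Output depends on when the function is invoked (and, potentially,
        # on who is asking and what they will accept).
        hour = time.localtime(context["time"]).tm_hour
        if hour < 12:
            return ["Morning edition, item A", "Morning edition, item B"]
        return ["Evening edition, item A"]

    def robots_txt(context):
        # Some functions happen to be constant; that is the special case.
        return ["User-agent: *", "Disallow:"]

    # "URL space" as a mapping from loci to functions, not to documents.
    url_space = {
        "http://example.com/headlines": headlines,
        "http://example.com/robots.txt": robots_txt,
    }

    context = {"time": time.time(), "agent": "Mozilla/5.0", "accept": "text/xml"}
    print(url_space["http://example.com/headlines"](context))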

For instance, consider a URL that points to an RSS feed. The RSS feed can be 
thought of as a time-dependent function that returns a set of linkages and a 
"basic" semantic meaning associated with each link within that set, yet the 
likelihood that, over time, the RSS feed will consistently return the same 
set of information is actually fairly small, especially if the list is 
active. 
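
As a toy illustration of that (the item lists below are invented), the stable
thing is the feed-as-function, not its output:

    # The "same" RSS URL sampled on two different days.
    # The item tuples (title, link, category) are invented.

    snapshot_monday = {
        ("Story about XForms", "http://example.com/a", "standards"),
        ("Story about SVG",    "http://example.com/b", "graphics"),
    }
    snapshot_friday = {
        ("Story about SVG",    "http://example.com/b", "graphics"),
        ("Story about RDF",    "http://example.com/c", "semantics"),
    }

    # f(url, monday) != f(url, friday); only the function itself is stable.
    print("dropped:", snapshot_monday - snapshot_friday)
    print("added:  ", snapshot_friday - snapshot_monday)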

I could see taking this notion of mapping URLs to functions further. A URL by 
itself is the visible part of the linkage, but it tends to hide the agency of 
the invoking agent; beyond time, any such linkage will have associated agent 
type information, possible expectations about MIME type, information limiting 
what kind of data can be retrieved from the URL, the protocol used to 
retrieve the data, and so forth. A SOAP message may be a little more explicit 
with this set of associations, but it is still fundamentally a bundle of data 
that provides additional context to a URL.
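
Here's a rough sketch of that normally hidden bundle, spelled out as HTTP
headers (the URL and the header values are hypothetical):

    import urllib.request

    # Sketch: the context that usually hides behind a bare URL, made explicit.

    req = urllib.request.Request(
        "http://example.com/feed",
        headers={
            "Accept": "application/xml",          # expectation about MIME type
            "User-Agent": "NewsAggregator/0.1",   # agency of the invoking agent
            "If-Modified-Since": "Sat, 01 Mar 2003 00:00:00 GMT",  # time dependence
        },
    )
    # The protocol (HTTP), the method (GET), and these headers are all part of
    # the linkage; the bare URL string exposes almost none of it.
    # response = urllib.request.urlopen(req)   # would actually perform the retrieval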

We refer to web pages as being documents because the agents through which we 
view the contextual data present this information in a format that models our 
expectations of printed documents. However, this is more a property of the 
agent than it is of the bundled information itself. This is becoming more the 
case now that more of that information is moving out of HTML (or even XHTML) 
and into pure XML formats. For simplification then, you can basically say 
that all URLs point to either an XML entity, a binary entity that may or may 
not have an associated viewer/editor within the client, or a mixed entity in 
which the XML entity acts as a mechanism for providing both semantic content 
and linkages. Within this view, there's nothing explicit about the document 
characteristics associated with the URL; indeed, the base-level semantics of 
the XML entity are in fact imputed to it by an explicit or implicit schema 
that may have no association whatsoever with the notion of documentness.
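
In code, that three-way split might be keyed off the MIME type, something
like this (the type sets are illustrative, not exhaustive):

    # Sketch of the three-way split above, keyed off MIME type.

    MIXED_TYPES = {"application/rss+xml", "application/rdf+xml"}  # XML carrying semantics + linkages
    XML_TYPES   = {"text/xml", "application/xml", "application/xhtml+xml"}

    def classify(mime_type):
        if mime_type in MIXED_TYPES:
            return "mixed"    # XML acting as a carrier of semantic content and links
        if mime_type in XML_TYPES or mime_type.endswith("+xml"):
            return "xml"
        return "binary"       # opaque unless the client has a viewer/editor for it

    print(classify("application/rss+xml"))  # mixed
    print(classify("image/png"))            # binary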

The problem with "search" in the traditional sense is that it relies upon two 
principal assumptions, neither of which is really true on the web:
1) A URL provides an explicit mapping to a constant set of information.
2) The semantic associations contained within a search query can be readily 
mapped to the textual content of the information retrieved from that query.

If, as I stated before, the URL space in fact consists of a set of vectors to 
functions, the output of which can be mapped to XML or to some contextually 
opaque entity, then point #1 goes out the window. If you can't guarantee 
constancy (as I believe you cannot), then the notion of search in general 
becomes more questionable.

The second point is more subtle but no less important. When a person makes a 
Google query, they are making an implicit distinction about namespaces. The 
term "bush" could describe an inept president, a flowering shrub, a region in 
Australia, and so forth - in essence, there is an ambiguity of namespaces here.
Search engines such as Google use a modified Bayesian algorithm coupled with 
site popularity to make reasonable guesses about the degree of relevance of 
any given site, but they do not in fact know anything about the namespaces 
involved; they can only infer the namespace when enough terms are provided.
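
A toy version of that inference (with invented senses and word weights,
nothing like a real engine) might look like this:

    # Toy sketch: inferring the "namespace" (sense) of a query from the
    # co-occurring terms. The senses and weights are invented.

    sense_models = {
        "politics":     {"bush": 5, "president": 8, "white": 3, "house": 3},
        "horticulture": {"bush": 4, "flowering": 6, "shrub": 7, "prune": 2},
        "geography":    {"bush": 3, "australia": 9, "outback": 5},
    }

    def guess_sense(query_terms):
        def score(model):
            # crude relevance score: sum of weights for the terms present
            return sum(model.get(t, 0) for t in query_terms)
        return max(sense_models, key=lambda s: score(sense_models[s]))

    print(guess_sense(["bush"]))                        # weak, popularity-like guess
    print(guess_sense(["bush", "flowering", "shrub"]))  # horticulture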

Any XML document is associated with at least one namespace (though it could 
also be a subset of more than one other namespace, and hence a superset of 
one or more namespaces). That namespace correlates to a validational schema 
(such as a DTD or XSD document), though it is also possible that the namespace 
correlates to an ontological schema (such as an RDF schema).
The ontological schema makes it possible to identify an XML resource as being 
relevant to a given search, even if the text of the resource contains no 
explicit textual reference to the search terms.
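
A small sketch of that idea (the namespace URI, ontology, and resource are
all made up): the match comes from the ontology attached to the namespace,
not from the text itself.

    # An ontology attached to a namespace lets a resource match a query term
    # that never appears in its text.

    ontology = {
        "http://example.com/ns/horticulture": {"plant", "shrub", "bush", "garden"},
    }

    resource = {
        "namespace": "http://example.com/ns/horticulture",
        "text": "Prune hydrangeas in late winter for the best blooms.",
    }

    def relevant(doc, query_term):
        in_text = query_term in doc["text"].lower()
        in_ontology = query_term in ontology.get(doc["namespace"], set())
        return in_text or in_ontology

    print(relevant(resource, "bush"))   # True, via the ontology rather than the text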

Going back to RSS for a second. RSS provides a contextual linking of 
information; you have associations that are built into RSS by aggregating 
conceptually allied items and linking each to a human-readable title, a URL, 
and potentially one or more categories. This XML bundle is a "document" full 
of physical associations.
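
Pulled out explicitly (with an invented feed), those associations are just
title/URL/category triples:

    import xml.etree.ElementTree as ET

    # Sketch: the associations an RSS "document" carries, extracted explicitly.
    # The feed content is invented.

    rss = """<rss version="2.0"><channel>
      <item>
        <title>An article on XML namespaces</title>
        <link>http://example.com/namespaces</link>
        <category>xml</category>
        <category>semantics</category>
      </item>
    </channel></rss>"""

    root = ET.fromstring(rss)
    for item in root.findall(".//item"):
        title = item.findtext("title")
        link = item.findtext("link")
        categories = [c.text for c in item.findall("category")]
        # each item is an association: human-readable title <-> URL <-> categories
        print(title, link, categories)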

I've got more comments, but I'm falling asleep in my chair.

Kurt Cagle
Author, SVG Programming
http://www.metaphoricalweb.com