WWW 95 Trip report
In this trip report I present in some detail my thoughts from four
days in Darmstadt, Germany, attending the Third International WWW
conference.
This document is a little long, so if you want you can go to my
comments on paper presentations from Tuesday,
papers from Wednesday, papers from Thursday, or Alan Kay's
talk to close the conference, and my summary
comments.
Monday, April 10
I attended the Workshop on Indexing and Semantic Headers. Jim
Mayfield and I had submitted a position paper that was a quick overview of our work with KQML and intelligent
Web agents.
The attendees included authors of several important Web search
engines. Michael Mauldin from CMU discussed his Lycos engine, and
(CS grad student) Brian Pinkerton from U of Washington talked about
his WebCrawler. Martijn Koster talked about Nexor's Aliweb system.
David Eichmann talked about his MORE system, which is sponsored by
NASA as a byproduct of the Repository Based Software Engineering
Program.
- These search services are getting quite popular. Lycos is
getting so busy that its usefulness is restricted. Mauldin mentioned
that Lycos is running off a network of nine Sparc
workstations, with three more on order. Yahoo is also becoming very
busy, but with their new commercial backing maybe their hardware base
will be able to keep pace with demand.
- The growth of the Net in general and the Web in particular means
that whereas a robot could traverse the whole Web in a matter of hours
18 months ago, now it takes many days. This seems to be an inherent
weakness of robot-based search indices, which in my opinion can be
addressed only by running robots in parallel and pooling the results.
Replication of search engines is also clearly a good idea. In a
smaller, private network, a robot that keeps all the
servers abreast of what's new within that network might be very
useful. So robot-collected metadata may be the beginnings of
inter-server communication.
- Everybody seems interested in figuring ways for search engines to
share information, although nobody's got the details worked out yet.
One approach used by Aliweb involves having servers produce index
files for their local webspace, and then sharing these index files
when robots stop by.
- Robots are not themselves bad for the Net, since the ratio of
the volume of traffic generated by queries to the volume of traffic
generated by robots is about 30:1.
- Andrzej (Andrew) Duda from IMAG talked about the Discover system.
Discover provides a query routing service to over 500 WAIS servers.
It found three references in WAISspace to whois++, and one apparently
spurious reference to my hero Tom Swift.
- I'm looking forward to an upcoming article in IEEE Expert in
which some of these search engines are compared.
- But robots aren't the end of the story! Stu Weibel from OCLC
pointed out that indexing is not a viable option for truly large
collections. Again, in the context of a smaller network, I'm led to
wonder if a crude first pass through a corpus could be done, to be
followed by construction of an ad hoc index of that
subcollection for use with a set of related queries.
- This community is worried about charges of copyright
infringement. If an indexing service creates a copy of a document and
makes it available for you, especially if the service is robot-based
and the document was found by the robot rather than submitted by its
author, then maybe the service is violating the document's copyright.
Lycos gets around this by abstracting the document, and indexing that,
which (according to CMU's lawyers) falls under fair use.
- The query logs being collected by these search engines give an
interesting picture of what people search for on the Web. The top ten
query list includes Linux, and material of a sexual nature.
Mauldin says that his logs are invaluable as evidence of how people
really search for information. According to Mauldin, average users
issue relatively short queries and then browse the returned
documents, as opposed to the TREC approach, in which lengthy,
complicated queries are issued. In Mauldin's opinion, the
TREC project has spent millions of dollars trying to answer the wrong
question.
- My colleague Jim Mayfield says that Mauldin's data might
simply be an artifact of how good the search engines on the Web are at
the moment. Perhaps if the available engines could handle longer,
more complex queries, people would issue them. I think there's a
paradigm collision happening: at the document level, the Web still
has a hypertext structure. So with the tools we have, and this
hypertext structure, the user might very well adopt a short query plus
a bit of browsing style of search. In a large corpus with no
hypertext structure, as is the case in TREC, document-level
browsing isn't nearly as attractive as a follow-up to a short query
that yielded 700 possibly relevant documents.
- There were a total of about two hundred people in attendance at
the eight workshops. There were about seven hundred people in
attendance at the eight tutorials. Each participant was charged
150DM, or about $105. I suspect that they turned a profit.
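The suggestion above of running robots in parallel and pooling the results can be sketched roughly as follows. This is only an illustration under stated assumptions: the TOY_WEB dictionary, the example.* URLs, and the function names crawl, pool, and parallel_crawl are all made up for this sketch, and none of it describes how Lycos, WebCrawler, or any other engine at the workshop actually works.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the Web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("lycos search engine", ["http://a.example/robots"]),
    "http://a.example/robots": ("robots traverse links", []),
    "http://b.example/": ("webcrawler search", ["http://b.example/index"]),
    "http://b.example/index": ("index files for robots", []),
}


def crawl(seed):
    """One robot: visit every page reachable from `seed`, return word -> set of URLs."""
    index = defaultdict(set)
    seen, frontier = set(), [seed]
    while frontier:
        url = frontier.pop()
        if url in seen or url not in TOY_WEB:
            continue
        seen.add(url)
        text, links = TOY_WEB[url]
        for word in text.split():
            index[word].add(url)
        frontier.extend(links)
    return index


def pool(indices):
    """Merge the per-robot indices into one shared index."""
    merged = defaultdict(set)
    for index in indices:
        for word, urls in index.items():
            merged[word] |= urls
    return merged


def parallel_crawl(seeds):
    """Run one robot per seed concurrently, then pool what they found."""
    with ThreadPoolExecutor(max_workers=len(seeds)) as executor:
        return pool(executor.map(crawl, seeds))


if __name__ == "__main__":
    index = parallel_crawl(["http://a.example/", "http://b.example/"])
    print(sorted(index["robots"]))
```

Giving each robot its own seed keeps the robots from revisiting each other's pages in this toy setting; on the real Web the partitions would overlap, and the pooling step would also have to reconcile duplicates and stale entries.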
The session on Tuesday included several short speeches of welcome.
Since the papers and maybe the talks themselves appear at the
conference Web site in the on-line proceedings, I present my notes in
various levels of detail. Also, there were two parallel sessions, and
I was only able to go to the most interesting one (or the one in which
I was speaking :-)
The papers are listed in the order in which they appear in the printed
proceedings. These proceedings are available from Elsevier. They
also still seem to be available on the conference
Web server.
- The paper "Towards an intelligent publishing environment", by
Pitkow and Jones, discussed URLs, URNs and URCs. The "whois++" system
apparently does some URN to URL mapping.
- The DEC Web toolkit paper won the conference award for Best
Paper.
- My presentation of the Rowe and Nicholas paper was well-received,
I think. There were maybe 200 people in the lecture hall at the
time, and of course it was broadcast over the MBone, as was the rest
of the conference.
- The WebMake system looks like a good start in the area of
Web-based software development environments. This business of
distributed software engineering has been simmering slowly for years,
perhaps waiting for some effective way of tracking and resolving
dependencies between different artifacts. Apparently the CCI language
is used to implement client-side scripts.
- Pitkow and Recker reported on their WWW survey. Among their
findings: Only 10% of Web users are female. (And I think conference
attendance reflected this.)
- In the last session of the day, Quint lamented the lack of
respect for SGML that HTML represents. Specifically, he objects to
browsers that don't strictly enforce the HTML DTD, and to the use of
HTML to describe the appearance of a document, rather than just its
structure. The tone of the conference in general was tolerant of
presentation information in HTML, but opposed to lone wolf extensions
à la Netscape.
- I attended a "birds of a feather" session on Metadata and
Indexing. The NCSA/OCLC workshop, which resulted in the creation of a
semantic header format, was discussed at length. I personally am not
convinced that even a detailed header will really go far towards
solving the resource discovery problem. Several white papers have
been produced by an IEEE metadata group.
- The most memorable event on Wednesday was probably the
presentation by Thomas Reardon of Microsoft. Reardon is the Technical
Lead of the Windows 95 Networking Group, and has been deeply involved
in URLs and other Web resources being treated as first-class objects
in the new Windows 95 OS. Microsoft claims to be the "API company",
and is content to allow others to innovate at other layers. Reardon
was only mildly successful at preventing his talk from becoming too
much of an advertisement for Microsoft.
- The Network Information Discovery and Retrieval panel gave people
a chance to complain about the lack of metadata, and how robot-based
approaches don't seem to scale well. Tim Berners-Lee knew of five different
URN schemes out there, and there is now a movement towards URAs, or
uniform resource agents.
- The first paper session had several good papers. Pfeifer's paper
on freeWAIS with structured fields should be quite useful. The SFGate
software, which is a Web to WAIS interface, seems to be the natural
client for this version of WAIS. (Xwais does not seem to be the
natural client.) The Eyesight system from CNIDR is a competitor.
- There was a BOF for Sun's new Java
system. Java can be described as an object-oriented language for
distributed applications. HotJava is a dynamic, extensible Web
browser, written in Java. By extensible, they mean importation of new
Java code, called applets, from a server, to be run as part of a
client. Lots of thought went into preventing applets from becoming an
infobahn for viruses and trojan horses. In my opinion, if Java takes
off, then Web browsers as we know them now will seem very old
fashioned, and sooner rather than later. However, you need to be
running Solaris on your Sparc before you can run Java. CCI and
SafeTCL were mentioned as alternatives for client-side code. Since
Ousterhout now works for Sun, I expect development of SafeTCL and Java
to converge into a single effort.
- Pitkow gave yet another talk on Thursday. (He was a co-author on
three papers at this conference.) In this paper on browsing habits of
Web users, he reported that during a typical WWW session, the user
visits maybe five sites, and visits perhaps seven html files per site.
Alan Kay's closing talk
You probably know that Kay was at Xerox PARC in the early 70s when
they were doing all kinds of cool things, such as inventing the modern
workstation, and object-oriented software development. He is now an
Apple fellow. He referred to HTML as the MS-DOS of the Web, and
charged the Web community to "Please don't make HTML a tradition."
The objects in the Web shouldn't be HTML files, but should instead be
objects that may be evaluated as HTML documents, along with a
number of other applicable operations. (I think he's right - HTML
documents are too static. The fact that client-side and server-side
scripting mechanisms came into existence so quickly indicates the need
for dynamic Web objects.)
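To make Kay's point a little more concrete, here is a minimal sketch of a Web object that can be evaluated as an HTML document but also supports other operations. It is my own illustration, not anything Kay presented: the TripReport class and its as_html, as_text, and search methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TripReport:
    """Hypothetical Web object: data plus operations, rather than a stored HTML file."""
    title: str
    notes: list

    def as_html(self):
        """Evaluate the object as an HTML document (just one of its operations)."""
        items = "".join(f"<li>{escape(note)}</li>" for note in self.notes)
        return f"<h1>{escape(self.title)}</h1><ul>{items}</ul>"

    def as_text(self):
        """Evaluate the same object as plain text."""
        return "\n".join([self.title] + [f"- {note}" for note in self.notes])

    def search(self, word):
        """An operation that a static HTML file could not offer by itself."""
        return [note for note in self.notes if word.lower() in note.lower()]


if __name__ == "__main__":
    report = TripReport("WWW 95", ["Java BOF", "Metadata & indexing"])
    print(report.as_html())
    print(report.search("java"))
```

The HTML here is produced on demand from the object's state, so a client that understood the object could ask for a different rendering or run a query instead of receiving one fixed file.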
Summary comments
- The alphabet soup of URLs, URNs, and URCs is an active area of
research.
- Nobody said a word about automated tools for converting legacy
documents. Apparently the Web is growing quite fast enough with the
tools we have, thank you, crude as they may be.
- Two of the keys of the Web's success: anybody can run a browser,
and just about anybody can set up a simple server. Complicated
servers, with multiple machines sharing one IP address, lots of
proxying and caching etc., require a lot of energy.
- Nowadays, when somebody says "object-oriented", I tend to cringe.
However, Kay's argument that browsers and servers should traffic in
objects, rather than (say) static HTML files, is quite compelling. If
he's right, then the clients and servers of the future are going to
have to exploit and implement these operations, whatever they may be.
This will affect the whole architecture of the Web in ways that at the
moment are not easily understood.
- We should resist efforts to standardize HTML, I think. The Web
as a collaborative work environment, for training, software
development, information gathering and analysis, or whatever, will
accelerate as client-side and server-side scripting gets more
sophisticated and robust. Reardon spoke of the Web as a platform on
which other nifty applications will be developed. I wonder what those
applications will be?
The next Web conference will be in Boston, December 11-14 1995.
Last Modified: 4/19/95 by Charles Nicholas, nicholas@cs.umbc.edu