IIMS 1996: Yourlo, Ginige and Witana - publishing documents on the world wide web

A maintainable solution for publishing documents on the world wide web

Zhenya Yourlo, Athula Ginige and Varuni Witana
University of Technology, Sydney

Many organisations and individuals are now publishing many different types of information on the Internet via the World Wide Web. With this new push towards an electronic medium, better solutions need to be found for the maintenance and publishing of documents electronically. Presently, the majority of documents published on the World Wide Web are stored in a directory structure that is supported by the operating system installed at the site. Unfortunately, such an approach has drawbacks both for the initial construction of a Web site and for its subsequent maintenance. This paper presents one possible solution to this problem, which involves populating a database with nodes from the documents to be published. The goal is to increase the maintainability of the resulting product while minimising the time spent in realising it.

Introduction

No foresight can anticipate nor any document of reasonable length contain express provisions for all possible questions. Abraham Lincoln, March 4, 1861 (Congress, 1989)

The World Wide Web (WWW) is the culmination of a number of years' work, begun at CERN (the European Laboratory for Particle Physics) and carried forward by groups such as the NCSA (National Center for Supercomputing Applications), to integrate and extend existing information services present on the Internet. Earlier systems such as Gopher and WAIS attempted a similar feat, but did not succeed on as large a scale as the WWW. The success of the WWW is mainly attributed to its ease of use and to its incorporation of transparent interfaces to other existing information discovery tools such as Archie, Gopher, WAIS and Usenet News.

Publishing on the World Wide Web has accelerated at a phenomenal rate since its inception in November 1990 (Cailliau, 1995), and increasingly so in the last few months. Much of this growth has been fuelled by the large amount of media exposure that the Internet and the WWW receive virtually every day. The majority of those who publish on the WWW, though, do so without sufficient planning or thought with regard to the maintainability of their Web sites - unavoidably leading to much unnecessary maintenance work in the future as content is either extended or altered.

As part of one of the current projects at UTS, we have developed a maintainable solution to the problem of initial structuring and maintenance of Web sites based around modifications made to the HTTP daemon (with an accompanying graphical user interface). This paper describes the implementation of our system, and also the rationale behind our choice of implementation.

Publishing in the electronic domain

The electronic domain shares some similarities with its paper based counterpart, although overall the electronic domain and its associated paradigms are substantially different.

Documents in the paper domain are essentially linear in structure, and are intended to be read as such, whereas documents in the electronic domain are not necessarily bound to a linear format. Due to these differing document structures, a difference also exists in the communication of information from each of these different domains. The process of composing, and then reading a paper document follows that of linearisation, and delinearisation, whereas the process of composing, and then reading an electronic document follows that of externalisation, and internalisation (Ginige et al., 1995). Since the process of human thought is associative (Munn, 1961), it follows that it is more natural to communicate information by way of a non-linear medium rather than one that is linear.

Although significant differences exist between the electronic and paper domains, the majority of documents are still written using traditional linear authoring tools (such as the average word processor). Since many of the documents that we deal with at UTS - especially lecture notes - are being translated into the electronic domain, a straightforward conversion mechanism must be found.

The structure of the average document follows a set of rules that may be exploited by the electronic publisher. Most technical documents are composed of sections, subsections and paragraphs. This is convenient, as it allows a document to be easily decomposed into representative chunks that are suitable for publishing in electronic form.

Nodes

Nodes form the foundation of hypermedia systems. Nodes are pieces of information in a hyperspace that convey a single idea or theme - for example a paragraph within a document, or a scene within a video clip.

Figure 1: "Chunking" a document into nodes

Anchor points

In order to implement a hypermedia system, a number of anchor points must be defined inside document nodes. These anchor points are the words or images that form the starting point of a link. Another common term for an anchor point is a "hot spot."

Links

Several different types of links exist - structural, associative, and referential links.

Structural links preserve the original structure of documents converted for use in a hypermedia information system, and also form the foundation for the hypermedia information system as a whole.
Associative links provide associations between concepts present inside nodes that form part of a hypermedia information system. For instance, an associative link may exist between a node containing a quote from Shakespeare, and another node containing the complete text of the Shakespearean work that the quote was taken from.
Referential links in hypermedia information systems perform the task of providing dictionary/glossary functions to the hypermedia information system. For instance, the word "obfuscated" appearing within one node may be linked to its dictionary definition residing within another node in the same hypermedia information system (or even in another hypermedia information system).

The predominant forms of links present in the WWW are structural and associative links. Referential links in the WWW are presented in the same way as associative links.

Information structures

Relationships between document nodes may take a number of forms depending on circumstance, and original content source. The three main types of node arrangements that we will consider are linear, hierarchical and network structures.

A linear structure can be compared to a "hypertrail". This structure refers to a linear path through a set of nodes that each have a 1:1 structural link relationship with each other.
A hierarchical structure, or tree structure, implies that a set of document nodes form a parent-child relationship. Document nodes that are part of a hierarchical structure have a 1:1 structural link relationship with their parent nodes, but a 1:N structural link relationship with any child nodes.
A network structure of document nodes is an arbitrary (possibly cyclic) arrangement of document nodes that is common in hypermedia information systems. A set of document nodes that can be classified as being in a network structure have a 1:N structural link relationship with any other node in the network.

We now propose a new structure that combines features of both the hierarchical and network structures described above - the hierarchical network structure. The hierarchical network structure is essentially a network structure with certain rules imposed upon it. These rules state that any node in the hierarchical network may only have a 1:N structural link relationship with child nodes, and must only have a 1:1 structural link relationship with any other node in the hierarchical network that shares the following relationship with the current node: next, previous, up or home. This structure is particularly convenient in that it not only provides necessary navigational aids to the reader (hopefully helping to alleviate the problem of being "lost in hyperspace"), but also enables easy conversion of a large number of paper based documents.
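
As a minimal sketch of these rules, the following Python fragment derives the 1:1 and 1:N structural links of every node from a simple parent-to-children map. The names Nav and derive_links are hypothetical, and the choice to let next and previous follow the original reading order of the document (a pre-order walk of the tree) is an assumption made for illustration; the rules above do not fix that order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Nav:
    """Structural links of one node in a hierarchical network (hypothetical type)."""
    up: Optional[str] = None        # 1:1 link to the parent node
    home: Optional[str] = None      # 1:1 link to the root node
    previous: Optional[str] = None  # 1:1 link to the preceding node
    next: Optional[str] = None      # 1:1 link to the following node
    children: List[str] = field(default_factory=list)  # 1:N links to child nodes


def derive_links(tree: Dict[str, List[str]], root: str) -> Dict[str, Nav]:
    """Derive every node's structural links from a parent -> children map."""
    nav: Dict[str, Nav] = {}
    order: List[str] = []

    def visit(name: str, parent: Optional[str]) -> None:
        kids = tree.get(name, [])
        nav[name] = Nav(up=parent, home=root, children=list(kids))
        order.append(name)
        for kid in kids:
            visit(kid, name)

    visit(root, None)
    # Assumption: next/previous follow the original reading (pre-order) order.
    for before, after in zip(order, order[1:]):
        nav[before].next = after
        nav[after].previous = before
    return nav


tree = {"Document": ["Section 1", "Section 2"], "Section 1": ["Section 1.1"]}
links = derive_links(tree, "Document")
print(links["Section 1.1"])
```

Because every link is derived from the one parent-to-children map, moving a subtree only requires changing that map and deriving the links again.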

The world wide web as a publishing medium

The WWW is an ideal medium for electronic publishing due to the sheer enormity of the resulting document distribution, but even so, there are limitations imposed by the underlying document markup mechanism. The structure of documents present on the WWW conforms to that which is typical of most hypermedia systems - a structure consisting of nodes, links and anchors (see the section on publishing in the electronic domain). When designing and maintaining a set of electronic documents on the WWW which have a structure imposed on them, navigation issues must also be taken into account. From experience (Ginige et al., 1996), it has been found that a reasonable set of navigational aids to present to the browser of an electronic document are links to the nodes that share the following relationship to the current node: previous, next, up, home and children. These links would normally appear at the top and bottom of every document node, and should feature the title of the destination node as part of their description.

Figure 2: Illustration of the combination of hierarchical and network structures
to form a hierarchical network structure (using document nodes from Figure 1).

Another significant obstacle that must be overcome in order to use the WWW as a maintainable publishing environment lies in the structuring and logical storage organisation of document nodes. On a typical WWW site, the documents are arranged in a directory structure that is native to the operating system that the HTTP daemon is running under. This means that the original documents that are to be published on the Web must be coerced, in some way, to conform to that structure.

HTTP daemon directory structures

A number of choices do exist in determining the final directory structure that the document nodes will conform to. One choice of directory structure would be a flat structure, wherein all document nodes for the Web site would be placed into one directory. This structure has the advantage that reorganisation of document nodes does not require any maintenance of the directory structure. The major drawback of a flat structure is that there is no way to view the "big picture" of nodes on the Web site due to the fact that all the structural information is embedded within the HTML (HyperText Markup Language) content of each node.

Another choice of directory structure would be a structure similar to the structure that we used in the Hypermedia Technologies course notes at UTS. This structure involves creating a new directory level for each section and subsection in the original document, and placing the corresponding document nodes inside each of these directories. The advantage of this approach is that the relationship between each of the nodes on the Web site is clear due to their logical separation in the directory tree. Unfortunately, this advantage is outweighed by the large amount of directory maintenance that must be carried out when any reorganisation of the Web site occurs.

Typical document structure, as discussed in the section on publishing in the electronic domain, does not naturally fit a file/directory based structure. In order to accommodate the average document, the wisest option would be to modify the HTTP (HyperText Transfer Protocol) daemon to accept document nodes in a more easily maintainable structure.

Web experiences at UTS

During the markup of a set of lecture notes for one of the Hypermedia Technologies postgraduate subjects run at UTS, we went through this process of Web site structuring, document conversion and markup, and identified the most time consuming elements. On average, it took approximately half a day's work (about three and a half hours) to take a set of lecture notes, break these up into nodes, convert to HTML, and add structural links.

Activity                               Time taken (min)

Convert to HTML                                      10
Chunk document                                       30
Create directory structure                           10
Add structural links                                150
Move nodes into directory structure                  10

Total time                                          210

On an average Web site, it would be necessary to add and maintain each of these links by hand with some form of HTML editor. As can be seen from the data presented above, this is a time consuming and inefficient activity.
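
The weight of the structural links in these figures can be checked with a few lines of Python. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Times in minutes, copied from the table above.
minutes = {
    "Convert to HTML": 10,
    "Chunk document": 30,
    "Create directory structure": 10,
    "Add structural links": 150,
    "Move nodes into directory structure": 10,
}

total = sum(minutes.values())
link_share = minutes["Add structural links"] / total

print(f"total: {total} min ({total / 60:.1f} hours)")      # total: 210 min (3.5 hours)
print(f"structural links: {link_share:.0%} of the total")  # structural links: 71% of the total
```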

Even though a Web site may have a well planned structure and navigational aids as described above, once it has reached or surpassed a certain size the navigational aids and structural links become increasingly difficult and time consuming to maintain. This is readily illustrated by the following example. If we have an existing node structure similar to that shown in Figure 3, and it is necessary to move node A to position B, it would be necessary to sever one link and modify eight links, nine link edits in all. If this activity were repeated for ten nodes, 90 links in total would have to be modified by hand using an HTML editor. The chance of introducing error in this situation becomes quite large not only due to the magnitude of the task, but also due to its repetitive nature - verification of every one of the resulting modified links would also be particularly time consuming. This difficulty with maintenance creates a need for an automated solution to this problem.
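
The arithmetic behind this example can be written out as a short sketch. The function name hand_edits is hypothetical, and the per-move defaults are simply the counts quoted for Figure 3; other node arrangements would give different counts.

```python
def hand_edits(moves: int, severed_per_move: int = 1, modified_per_move: int = 8) -> int:
    """Number of links touched by hand when nodes are moved one at a time."""
    return moves * (severed_per_move + modified_per_move)


print(hand_edits(1))   # 9
print(hand_edits(10))  # 90
```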

HTML and its limitations

HTML itself is still in its infancy compared to other markup standards that are in use today such as SGML, and TeX. This immaturity leads to shortcomings not only in the look and feel of resulting documents, but also in the maintainability of the structure of electronic documents. As it is aptly put in (Greenspun, 1994), "Make a graph with time on the x-axis and formatting capabilities on the y-axis. Draw a line from the HTML level 1 to HTML level 3. You'll see that HTML reaches LaTeX's level of formatting capability around the year 2000".

HTML 2, the underlying document markup language of the WWW, is quite adequate for most publishing needs, and with the advent of HTML 3 will have the ability to accommodate the artistic and aesthetic whims of the document publisher. HTML, though, is more concerned with local rather than global or large scale document structure - i.e. it is not possible to impose an efficiently maintainable structure on a set of HTML documents without some form of external assistance.

Figure 3: Moving a document node from A to B

An object oriented approach

A document may be broken up in a number of ways that yield suitable elements for composing an electronically viable version. We will now consider an object oriented approach to this process of document "chunking" that flows naturally from the document's original form, and also allows for easy navigation of the resulting document nodes once they are in electronic form.

Let us now consider the effect of taking each topic in a document as a separate HTML node. Although the document was originally linear in structure, it has now been decomposed into manageable objects that are related to each other in some fashion. Each of these objects can now be thought of as having a set of attributes and operations associated with them that determine their properties.

A document node has a title, content and keywords, and can be queried about handles to its nearest neighbours (next, up, previous and children). In order to establish a neat package for these attributes, it would be desirable to define a datatype (or schema) that can represent any document node that represents a single subtopic of any original document.

An object network

Given a set of document nodes that have already been extracted from their original source, it is possible, although time consuming, to determine a structure that follows logically from the content of those nodes. Since the hierarchical network structure is derived from information present in the original document to be converted, it should be possible to write a tool to automatically do the "chunking" and structuring. This could potentially offer a significant time saving in the initial construction of a Web site, and also dramatically reduce the time spent in making additions to an existing Web site.
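
A rough sketch of such a tool is given below. It is not the tool developed for this project: it assumes, purely for illustration, that the source is plain text in which each heading sits on its own line and starts with a section number such as "2.1". The name chunk and the dictionary layout are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumption: a heading looks like "2.1 Title" on a line of its own.
# A content line that happens to start with a number would be misread as a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def chunk(text: str) -> Tuple[Dict[str, dict], Dict[str, List[str]]]:
    """Split text into nodes keyed by section number and record each node's children."""
    nodes: Dict[str, dict] = {}
    children: Dict[str, List[str]] = {"": []}  # "" stands for the document root
    current: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING.match(stripped)
        if match:
            number, title = match.groups()
            parent = number.rpartition(".")[0]  # "2.1" -> "2", "2" -> ""
            nodes[number] = {"title": title, "content": []}
            children.setdefault(parent, []).append(number)
            children.setdefault(number, [])
            current = number
        elif current is not None and stripped:
            nodes[current]["content"].append(stripped)
    return nodes, children


sample = """\
1 Introduction
The web is growing.
2 Links
2.1 Structural links
They preserve structure.
2.2 Associative links
They relate concepts.
"""
nodes, children = chunk(sample)
print(children[""])   # ['1', '2']
print(children["2"])  # ['2.1', '2.2']
```

The section numbers carry the hierarchy, so the parent of each node follows from its number alone and no structural link has to be written by hand.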

The HTML database system - a new server architecture

To combine all the ideas discussed above in order to create a maintainable Web site, the HTML database system was developed. Due to the discrete object nature of document nodes, an object oriented database was selected to form the basis for the whole system. This allowed for the definition of datatypes necessary for the construction of the HTML database system, and also provided speedy retrieval of document nodes.

For performance reasons, we chose to modify the HTTP daemon rather than to write CGI (Common Gateway Interface) scripts - this also gave much greater flexibility over the URLs (Uniform Resource Locators) of document nodes stored in the database. The need to maintain a directory structure also needed to be removed. This was achieved by maintaining document nodes within the database inside a flat directory structure, with the structural information relating the nodes stored separately. This approach retains the advantages of a flat directory structure, but since the structural information is now easily accessible, removes the drawback of having a flat directory structure (i.e. the relationships between the nodes will be able to be viewed easily with the assistance of an integrated GUI).

Eliminating the need for addition and maintenance of structural links that allow navigation between document nodes was an obstacle that was necessarily overcome by our implementation of the HTML database system. Modifications of the HTTP daemon not only allow the retrieval of node content from the database, but also allow the generation of structural links from data associated with the document nodes "on the fly" as the nodes are retrieved. So as not to limit the freedom of those using the HTML database system with regard to structuring of a Web site, no rigid structuring scheme is imposed on the end user other than that of not violating a hierarchical network structure. Fortunately, a good choice of structure for many documents is the hierarchical network structure as described above.
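
The idea of generating links at retrieval time can be sketched as follows. The two dictionaries are hypothetical in-memory stand-ins for the database, the function name render is invented for this example, and the node names are made up; the actual daemon performs the equivalent work inside the server.

```python
from html import escape
from typing import Dict, List

# Hypothetical stand-ins for the database: content and structure are kept apart.
CONTENT: Dict[str, str] = {
    "intro": "<h1>Introduction</h1>",
    "nodes": "<h1>Nodes</h1>",
    "links": "<h1>Links</h1>",
}
STRUCTURE: Dict[str, Dict[str, str]] = {
    "intro": {"next": "nodes"},
    "nodes": {"previous": "intro", "next": "links", "up": "intro"},
    "links": {"previous": "nodes", "up": "intro"},
}


def render(name: str) -> str:
    """Build the page for one node, generating its structural links on request."""
    parts: List[str] = []
    for relation, target in STRUCTURE[name].items():
        parts.append(f'<a href="/db/{escape(target)}">{escape(relation)}</a>')
    nav = " | ".join(parts)
    # Navigational links are placed at the top and bottom of the node.
    return f"{nav}\n{CONTENT[name]}\n{nav}"


print(render("nodes"))
```

Since the stored content never contains a structural link, restructuring the site only changes entries in the structure table; the stored HTML of each node is left untouched.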

One of the most important and necessary features of the HTML database system is the integrated graphical environment used to create and structure document nodes (see Figure 4). The GUI is designed to give potential authors a view of "the big picture" of the Web site at all times when manipulating the relationships between nodes in the database. This will be achieved by showing all nodes in the database represented in a tree like structure (with connecting lines representing structural links). Once the completed GUI is in place, adding and structuring nodes in the database will simply be a matter of "point and click". Restructuring existing nodes in the database will follow a familiar "drag and drop" methodology, wherein a node and its associated children may be detached from one point in the hierarchical network, and reattached at an arbitrary position. With this interface in place, the time required to undertake maintenance work on a Web site using the HTML database system can be greatly reduced, since there will no longer be any need to perform the time consuming task of editing structural links by hand with an HTML editor for every node whose structural relationship to other nodes is modified.

An automated conversion/structuring tool will also form part of the completed HTML database system and GUI, enabling documents in RTF format to be quickly and easily added to the existing hierarchical network node structure at a user nominated position. This tool has the capacity to significantly reduce the time required in the initial setup of a Web site, eliminating the need to manually chunk the document, convert to HTML, and add structural links/navigational aids to every document node added to the Web site.

Implementation details

The modified HTTP daemon is based on version 1.3 of the NCSA HTTP daemon. Modifications to the original HTTP daemon were undertaken in a number of stages. The first step was to implement a new set of configuration options for the HTTP daemon that allow the customisation of the look and feel of the automatically generated navigational links. This was done using two new configuration files that specify a template that the server uses to construct navigational links. An example configuration file may look like:

[ Up: %u% | Home: %h% ]<br>
[ Previous: %p% | Next: %n% ]<p>

The letters in between the percent signs represent the link that is to be inserted at that point by the server at run time eg. the HTML for a link to the node that is "up" relative to the current node. At this time, the style in which child nodes are presented (a bulleted list) is hard coded into the HTTP daemon. It is likely that this behaviour will be altered to also be user customisable through a template in the future.
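
The substitution step can be illustrated with a short sketch. This is not the daemon's code: the function name expand, the "-" shown when a node has no such neighbour, and the use of the node name as the link text (rather than the destination's title) are all assumptions made to keep the example small.

```python
import re
from typing import Dict

TEMPLATE = "[ Up: %u% | Home: %h% ]<br>\n[ Previous: %p% | Next: %n% ]<p>"

# Letter -> relation; u, h, p and n are the letters used in the example template.
PLACEHOLDERS = {"u": "up", "h": "home", "p": "previous", "n": "next"}


def expand(template: str, links: Dict[str, str]) -> str:
    """Replace each %x% with an anchor, or with "-" when the link is absent."""

    def substitute(match: "re.Match[str]") -> str:
        relation = PLACEHOLDERS[match.group(1)]
        target = links.get(relation)
        if target is None:
            return "-"  # assumption: how a missing link is shown is not specified
        return f'<a href="/db/{target}">{target}</a>'

    return re.sub(r"%([uhpn])%", substitute, template)


print(expand(TEMPLATE, {"up": "links", "home": "index", "next": "anchors"}))
```

For a node with no previous neighbour, the second line of output reads `[ Previous: - | Next: <a href="/db/anchors">anchors</a> ]<p>`.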

In the original incarnation of the HTTP daemon, upon receiving a request for an HTML document, it would examine the URL of the request in order to determine what kind of document was being requested. Usually, the first part of the path in the URL determines the document type (for example /cgi-bin would mean that the request is for the result of executing a CGI script). For the HTML database, we chose to use the prefix /db to represent any documents that reside in the database. Adhering to this convention means that the modified HTTP daemon still supports retrieval of conventional documents from the file system, as well as documents stored in the database.
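
The prefix convention amounts to a small dispatch on the first component of the path, sketched below. The function name classify and the returned labels are hypothetical, and a real daemon takes its prefixes from configuration rather than from fixed strings.

```python
def classify(path: str) -> str:
    """Pick a handler from the first component of the URL path."""
    first = path.lstrip("/").split("/", 1)[0]
    if first == "cgi-bin":
        return "cgi"       # run a CGI script
    if first == "db":
        return "database"  # retrieve a node from the HTML database
    return "file"          # conventional document from the file system


for path in ("/db/intro", "/cgi-bin/search", "/index.html", "/dbase/notes.html"):
    print(path, "->", classify(path))
```

Matching on the whole first component, rather than on a string prefix, keeps a path such as /dbase/notes.html on the file system route.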

Figure 4: Screen shots of developmental GUI

In order to enable retrieval of nodes from the database, a new function (send_dbdoc) was written. This function performs the task of retrieval of nodes from the database, and also has the ability to generate "on the fly" HTML for display of navigational links to previous, next, up, home and child nodes. Access control to the documents contained within the database works in the same way as with documents contained outside the database, except that the access control file (.htaccess in this case) is stored inside the database. This was implemented by adding a small parsing function to the HTTP daemon so that it can interpret .htaccess files that are stored within the database.

The design phase of datatypes for the database that we chose (OBST (Uhl et al., 1994)) resulted in two datatypes called HtmlObject and HtmlLink. Instances of the HtmlObject type contain the content and keywords of each document node, whilst instances of the HtmlLink datatype contain the structural information of the document node, and are related to the HtmlObject by sharing a unique name.
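
The split between the two datatypes can be pictured with the following sketch. Only the type names HtmlObject and HtmlLink and their division of responsibilities come from the design described above; the field names, the in-memory dictionaries and the store and fetch helpers are assumptions, and Python dataclasses stand in for the schema definitions of the object oriented database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HtmlObject:
    """Content and keywords of one document node."""
    name: str
    content: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class HtmlLink:
    """Structural information of one document node."""
    name: str  # same unique name as the matching HtmlObject
    up: Optional[str] = None
    previous: Optional[str] = None
    next: Optional[str] = None
    children: List[str] = field(default_factory=list)


objects: Dict[str, HtmlObject] = {}
links: Dict[str, HtmlLink] = {}


def store(obj: HtmlObject, link: HtmlLink) -> None:
    """Store both halves of a node, insisting that they share one name."""
    if obj.name != link.name:
        raise ValueError("HtmlObject and HtmlLink must share a unique name")
    objects[obj.name] = obj
    links[link.name] = link


def fetch(name: str) -> Tuple[HtmlObject, HtmlLink]:
    """Retrieve the content and the structure of a node by its shared name."""
    return objects[name], links[name]


store(HtmlObject("nodes", "<h1>Nodes</h1>", ["hypermedia"]),
      HtmlLink("nodes", up="intro", next="links"))
obj, link = fetch("nodes")
print(obj.content, link.up, link.next)
```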

Conclusion and future directions

We believe that the HTML database system in its present incarnation does have the capacity to act as a maintainable solution for publishing documents on the WWW. The current implementation fully supports retrieval of nodes from the database, and also "on the fly" HTML generation for structural links. The GUI has now been developed to a state from which it may be used to add, delete and edit nodes in the database, but as yet lacks a graphical display of the node structure that exists within the database.

A major undertaking that will be part of continuing work on this project will be an adaptation of the HTML database to the Matilda architecture (Lowe and Ginige, 1996). The Matilda architecture (and associated data model) form part of another major project that is being undertaken by the HyVis group at UTS in order to develop a flexible architecture for intelligently assisted construction of hypermedia information systems.

The HTML database will also at some stage be integrated with revision control software in order to maintain edit histories, change logs and exclusive edit locking (with associated ownership) for each node contained within the system. This would lead to a fully network distributable HTML structuring, editing and publishing environment, in which many users could contribute to the one knowledge base without interfering with the work of any other user on the network.

References

Cailliau, R. (1995). A Little History. http://www.w3.org/hypertext/WWW/History.html [verified 1 Dec 2000 at http://www.w3.org/History.html]

Congress, US. (1989). Inaugural Addresses of the Presidents of the United States: From George Washington 1789 to George Bush 1989. Senate document (United States. Congress. Senate) 101-110. http://www.columbia.edu/acis/bartleby/ [verified 1 Dec 2000 at http://www.bartleby.com/124/index.html]

Ginige, A., Lowe, D. B. and Robertson, J. (1995). Hypermedia authoring. IEEE Multimedia, 2(4), December 1995.

Ginige, A., Witana, V. and Yourlo, Z. (1996). Use of the world wide web in the delivery of education: A case study. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 140-148. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/ek/ginige.html

Greenspun, P. (1994). We Have Chosen Shame, and Will Get War. http://www-swiss.ai.mit.edu/philg/research/shame-and-war.html [verified 1 Dec 2000 at http://philip.greenspun.com/research/shame-and-war.html]

Lowe, D. B. and Ginige, A. (1996). MATILDA: A framework for the representation and processing of information in multimedia systems. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 229-236. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/lp/lowe2.html

Munn, N. L. (1961). Psychology, 4th Edition. Boston: Houghton Mifflin Company.

Uhl, L., Theobald, D., Schiefer, B., Ranft, M., Zimmer, W. and Alt, J. (1994). The Object Management System of STONE. Forschungszentrum Informatik (FZI), Haid-und-Neu-Strasse 10-14, D-76131 Karlsruhe.

Authors: Mr Zhenya Yourlo, A/Prof Athula Ginige, Ms Varuni Witana
School of Elec. Engineering
University of Technology, Sydney
PO Box 123, Broadway NSW 2007, Australia
Ph: +61 2 330 2393 Fax: +61 2 330 2435
Email: firefox@ec.uts.edu.au, athula@ec.uts.edu.au, varuni@ec.uts.edu.au
Please cite as: Yourlo, Z., Ginige, A. and Witana, V. (1996). A maintainable solution for publishing documents on the world wide web. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 439-446. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/ry/yourlo.html

This URL: http://www.aset.org.au/confs/iims/1996/ry/yourlo.html
© 1996 Promaco Conventions. Reproduced by permission. Last revision: 15 Jan 2004. Editor: Roger Atkinson
Previous URL 1 Dec 2000 to 30 Sep 2002: http://cleo.murdoch.edu.au/gen/aset/confs/iims/96/ry/yourlo.html
