Many organisations and individuals now publish many different types of information on the Internet via the World Wide Web. With this push towards an electronic medium, better solutions are needed for maintaining and publishing documents electronically. At present, the majority of documents published on the World Wide Web are stored in a directory structure supported by the operating system installed at the site. Unfortunately, this approach has drawbacks for both the initial construction and the ongoing maintainability of a Web site. This paper presents one possible solution to this problem: populating a database with nodes from the documents to be published. The goal is to increase the maintainability of the resulting product while minimising the time spent in realising it.
No foresight can anticipate nor any document of reasonable length contain express provisions for all possible questions. Abraham Lincoln, March 4, 1861 (Congress, 1989)
The World Wide Web (WWW) is the culmination of a number of years' work at CERN (The European Laboratory for Particle Physics) and the NCSA (National Center for Supercomputing Applications) to integrate and extend existing information services present on the Internet. Earlier systems such as Gopher and WAIS attempted a similar feat, but did not succeed on as large a scale as the WWW. The success of the WWW is mainly attributed to its ease of use and to its transparent interfaces to other existing information discovery tools such as Archie, Gopher, WAIS and Usenet News.
Publishing on the World Wide Web has accelerated at a phenomenal rate since its inception in November 1990 (Cailliau, 1995), and increasingly so in the last few months. Much of this growth has been fuelled by the large amount of media exposure that the Internet and the WWW receive virtually every day. The majority of those who publish on the WWW, though, do so without sufficient planning or thought for the maintainability of their Web sites. This unavoidably leads to much unnecessary maintenance work in the future as content is extended or altered.
As part of one of the current projects at UTS, we have developed a maintainable solution to the problem of initial structuring and maintenance of Web sites based around modifications made to the HTTP daemon (with an accompanying graphical user interface). This paper describes the implementation of our system, and also the rationale behind our choice of implementation.
Documents in the paper domain are essentially linear in structure, and are intended to be read as such, whereas documents in the electronic domain are not necessarily bound to a linear format. Because the document structures differ, the way information is communicated in each domain differs as well. Composing and then reading a paper document is a process of linearisation and delinearisation, whereas composing and then reading an electronic document is a process of externalisation and internalisation (Ginige et al., 1995). Since the process of human thought is associative (Munn, 1961), it follows that it is more natural to communicate information through a non-linear medium than through a linear one.
Although significant differences exist between the electronic and paper domains, the majority of documents are still written using traditional linear authoring tools (such as the average word processor). Since many of the documents that we deal with at UTS - especially lecture notes - are being translated into the electronic domain, a straightforward conversion mechanism must be found.
The structure of the average document follows a set of rules that may be exploited by the electronic publisher. Most technical documents are composed of sections, subsections and paragraphs. This is convenient, as it allows a document to be easily decomposed into representative chunks that are suitable for publishing in electronic form.
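The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the conversion tool used in the project: it assumes a plain-text document whose headings start with a section number (for example "2.1 Chunking"), and the names `Chunk`, `chunk_document` and `SAMPLE` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed heading convention: a section number followed by a title,
# e.g. "2 Document structure" or "2.1 Chunking".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Chunk:
    number: str                      # section number, e.g. "2.1"
    title: str
    paragraphs: List[str] = field(default_factory=list)

    @property
    def depth(self) -> int:
        # "2" has depth 1, "2.1" has depth 2, and so on.
        return self.number.count(".") + 1


def chunk_document(text: str) -> List[Chunk]:
    """Split a document into one chunk per numbered heading."""
    chunks: List[Chunk] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            chunks.append(Chunk(match.group(1), match.group(2)))
        elif chunks:
            # Body text belongs to the most recent heading.
            # Text that appears before the first heading is ignored.
            chunks[-1].paragraphs.append(line)
    return chunks


SAMPLE = """\
1 Introduction
The Web is growing quickly.
2 Document structure
Documents are made of sections.
2.1 Chunking
Each subsection becomes a node.
It may hold several paragraphs.
"""

if __name__ == "__main__":
    for chunk in chunk_document(SAMPLE):
        print("  " * (chunk.depth - 1) + chunk.number, chunk.title)
```

Each resulting chunk corresponds to one candidate document node, and the section number carries enough information to place the node in a hierarchy later.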
Figure 1: "Chunking" a document into nodes
Figure 2: Illustration of the combination of hierarchical and network structures to form a hierarchical network structure (using document nodes from Figure 1).
Another significant obstacle that must be overcome in order to use the WWW as a maintainable publishing environment lies in the structuring and logical storage organisation of document nodes. On a typical WWW site, the documents are arranged in a directory structure that is native to the operating system that the HTTP daemon is running under. This means that the original documents that are to be published on the Web must be coerced, in some way, to conform to that structure.
Another choice of directory structure is one similar to that used for the Hypermedia Technologies course notes at UTS. This structure involves creating a new directory level for each section and subsection in the original document, and placing the corresponding document nodes inside each of these directories. The advantage of this approach is that the relationship between each of the nodes on the Web site is clear due to their logical separation in the directory tree. Unfortunately, this advantage is outweighed by the large amount of directory maintenance that must be carried out whenever the Web site is reorganised.
Typical document structure, as discussed in section 2.0, does not naturally fit a file/directory based structure. In order to accommodate the average document, the wisest option would be to modify the HTTP (Hypertext Transfer Protocol) daemon to accept document nodes in a more easily maintainable structure.
| Activity | Time taken (min) |
|---|---|
| Convert to HTML | 10 |
| Chunk document | 30 |
| Create directory structure | 10 |
| Add structural links | 150 |
| Move nodes into directory structure | 10 |
| Total time | 210 |
On an average Web site, it would be necessary to add and maintain each of these links by hand with some form of HTML editor. As can be seen from the data presented above, this is a time consuming and inefficient activity.
Even though a Web site may have a well planned structure, and navigational aids as described above, once it has reached or surpassed a certain size the navigational aids and structural links become increasingly difficult and time consuming to maintain. This is readily illustrated by the following example. If we have an existing node structure similar to that shown in Figure 3, and it is necessary to move node A to position B, it would be necessary to sever one link and modify eight links. If this activity were repeated for ten nodes, 90 link changes in total (ten links severed and eighty modified) would have to be made by hand using an HTML editor. The chance of introducing error in this situation becomes quite large, not only due to the magnitude of the task but also due to its repetitive nature. Verification of every one of the resulting modified links would also be particularly time consuming. This difficulty with maintenance creates a need for an automated solution to this problem.
HTML 2, the underlying document markup language of the WWW, is quite adequate for most publishing needs and, with the advent of HTML 3, will have the ability to accommodate the artistic and aesthetic whims of the document publisher. HTML, though, is more concerned with local rather than global or large scale document structure, i.e. it is not possible to impose an efficiently maintainable structure on a set of HTML documents without some form of external assistance.
Figure 3: Moving a document node from A to B
Let us now consider the effect of taking each topic in a document as a separate HTML node. Although the document was originally linear in structure, it has now been decomposed into manageable objects that are related to each other in some fashion. Each of these objects can now be thought of as having a set of attributes and operations associated with them that determine their properties.
A document node has a title, content and keywords, and can be queried about handles to its nearest neighbours (next, up, previous and children). In order to establish a neat package for these attributes, it would be desirable to define a datatype (or schema) capable of representing any document node, where each node holds a single subtopic of an original document.
For performance reasons, we chose to modify the HTTP daemon rather than to write CGI (Common Gateway Interface) scripts. This also gave much greater flexibility over the URLs (Uniform Resource Locators) of document nodes stored in the database. The need to maintain a directory structure also had to be removed. This was achieved by maintaining document nodes within the database inside a flat directory structure, with the structural information relating the nodes stored separately. This approach retains the advantages of a flat directory structure but, since the structural information is now easily accessible, removes its drawback (i.e. the relationships between the nodes can be viewed easily with the assistance of an integrated GUI).
Eliminating the need to add and maintain the structural links that allow navigation between document nodes was an obstacle that our implementation of the HTML database system had to overcome. Modifications of the HTTP daemon not only allow the retrieval of node content from the database, but also allow the generation of structural links from data associated with the document nodes "on the fly" as the nodes are retrieved. So as not to limit the freedom of those using the HTML database system with regard to structuring of a Web site, no rigid structuring scheme is imposed on the end user other than that of not violating a hierarchical network structure. Fortunately, a good choice of structure for many documents is the hierarchical network structure as described below.
One of the most important and necessary features of the HTML database system is the integrated graphical environment used to create and structure document nodes (see Figure 4). The GUI is designed to give potential authors a view of "the big picture" of the Web site at all times when manipulating the relationships between nodes in the database. This will be achieved by showing all nodes in the database represented in a tree-like structure (with connecting lines representing structural links). Once the completed GUI is in place, adding and structuring nodes in the database will simply be a matter of "point and click". Restructuring existing nodes in the database will follow a familiar "drag and drop" methodology, wherein a node and its associated children may be detached from one point in the hierarchical network and reattached at an arbitrary position. With this interface in place, the time required to undertake maintenance work on a Web site using the HTML database system can be greatly reduced, since there would no longer be any need to edit structural links by hand with an HTML editor for every node whose structural relationship to other nodes is modified.
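The benefit of keeping structural information separate from node content can be shown with a small Python sketch. It is an illustration only, not the OBST-based implementation: the `Structure` class, its method names, and the assumption that "previous" and "next" refer to siblings under the same parent are all hypothetical. Moving a node together with its children is a single update to the stored structure, and every navigational link is then derived when a node is requested.

```python
from typing import Dict, List, Optional


class Structure:
    """Hypothetical store of parent/child relationships, keyed by node name."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.parent: Dict[str, Optional[str]] = {root: None}
        self.children: Dict[str, List[str]] = {root: []}

    def add(self, name: str, parent: str) -> None:
        if name in self.parent:
            raise ValueError("node names must be unique: " + name)
        self.parent[name] = parent
        self.children[name] = []
        self.children[parent].append(name)

    def _is_at_or_below(self, name: Optional[str], ancestor: str) -> bool:
        # Walk up from `name` towards the root looking for `ancestor`.
        while name is not None:
            if name == ancestor:
                return True
            name = self.parent[name]
        return False

    def move(self, name: str, new_parent: str) -> None:
        """Detach a node (with its subtree) and reattach it under new_parent."""
        if name == self.root:
            raise ValueError("the root node cannot be moved")
        if self._is_at_or_below(new_parent, name):
            raise ValueError("a node cannot be moved beneath itself")
        self.children[self.parent[name]].remove(name)
        self.children[new_parent].append(name)
        self.parent[name] = new_parent

    def links(self, name: str) -> dict:
        """Derive the navigational links for one node on request."""
        parent = self.parent[name]
        siblings = self.children[parent] if parent is not None else [name]
        i = siblings.index(name)
        return {
            "up": parent,
            "home": self.root,
            "previous": siblings[i - 1] if i > 0 else None,
            "next": siblings[i + 1] if i + 1 < len(siblings) else None,
            "children": list(self.children[name]),
        }


if __name__ == "__main__":
    site = Structure("home")
    site.add("ch1", "home")
    site.add("ch2", "home")
    site.add("sec1", "ch1")
    site.add("sec2", "ch1")
    site.move("sec1", "ch2")          # one update, no links edited by hand
    print(site.links("sec1"))
```

Because no link is stored inside the node content, the move touches one parent entry and two child lists, however many nodes would otherwise have needed their links edited.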
An automated conversion/structuring tool will also form part of the completed HTML database system and GUI, enabling documents in RTF format to be quickly and easily added to the existing hierarchical network node structure at a user nominated position. This tool has the capacity to significantly reduce the time required in the initial setup of a Web site, eliminating the need to manually chunk the document, convert to HTML, and add structural links/navigational aids to every document node added to the Web site.
[ Up: %u% | Home: %h% ]<br>
[ Previous: %p% | Next: %n% ]<p>
The letters in between the percent signs represent the link that is to be inserted at that point by the server at run time, e.g. the HTML for a link to the node that is "up" relative to the current node. At this time, the style in which child nodes are presented (a bulleted list) is hard coded into the HTTP daemon. It is likely that this behaviour will be altered to also be user customisable through a template in the future.
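As a rough illustration of how such a template might be expanded at run time, the following Python sketch substitutes each placeholder with an HTML anchor. It is not the code in the modified HTTP daemon: the function names, the `(url, title)` tuples and the fallback text shown when a link is absent are all assumptions made for this example.

```python
import re
from html import escape
from typing import Dict, Optional, Tuple

# The template shown above, with the same four placeholders.
TEMPLATE = "[ Up: %u% | Home: %h% ]<br>\n[ Previous: %p% | Next: %n% ]<p>"

# Placeholder letter -> name of the structural link it stands for.
PLACEHOLDERS = {"u": "up", "h": "home", "p": "previous", "n": "next"}

Target = Optional[Tuple[str, str]]   # (url, title), or None if no such link


def anchor(target: Target) -> str:
    """Return the HTML for one link; "none" is an assumed fallback text."""
    if target is None:
        return "none"
    url, title = target
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(title))


def render(template: str, links: Dict[str, Target]) -> str:
    """Replace every known %x% placeholder with the matching anchor."""
    def substitute(match):
        key = PLACEHOLDERS.get(match.group(1))
        if key is None:
            return match.group(0)    # leave unknown placeholders untouched
        return anchor(links.get(key))
    return re.sub(r"%([a-z])%", substitute, template)


if __name__ == "__main__":
    example_links = {
        "up": ("/db/ch1", "Chapter 1"),
        "home": ("/db/home", "Home"),
        "previous": None,
        "next": ("/db/sec2", "Section 2"),
    }
    print(render(TEMPLATE, example_links))
```

Keeping the mapping from letters to link names in one table means a new placeholder could be supported by adding a single entry.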
In its original incarnation, the HTTP daemon, upon receiving a request for an HTML document, would examine the URL of the request in order to determine what kind of document was being requested. Usually, the first part of the path in the URL determines the document type (for example, /cgi-bin would mean that the request is for the result of executing a CGI script). For the HTML database, we chose to use the prefix /db to represent any documents that reside in the database. Adhering to this convention means that the modified HTTP daemon still supports retrieval of conventional documents from the file system, as well as documents stored in the database.
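The prefix convention can be illustrated with a short Python sketch. The function name `classify_request` and the returned labels are hypothetical, and the actual daemon's dispatch logic is not reproduced here; the sketch only shows the idea of choosing a handler from the first part of the URL path.

```python
from typing import Tuple

# Path prefixes checked in order; anything else is served from the file system.
PREFIXES = (("/cgi-bin", "cgi"), ("/db", "database"))


def classify_request(path: str) -> Tuple[str, str]:
    """Return (kind, remainder) for a request path.

    kind is "cgi", "database" or "file"; remainder is the path with the
    matching prefix and any leading slashes removed.
    """
    for prefix, kind in PREFIXES:
        # Match the whole first path component, so "/dbase" is not "/db".
        if path == prefix or path.startswith(prefix + "/"):
            return kind, path[len(prefix):].lstrip("/")
    return "file", path.lstrip("/")


if __name__ == "__main__":
    for example in ("/db/intro", "/cgi-bin/search", "/notes/index.html"):
        print(example, "->", classify_request(example))
```

Requests that do not start with a recognised prefix fall through to the file system, which matches the backward compatibility described above.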
Figure 4: Screen shots of developmental GUI
In order to enable retrieval of nodes from the database, a new function (send_dbdoc) was written. This function performs the task of retrieval of nodes from the database, and also has the ability to generate "on the fly" HTML for display of navigational links to previous, next, up, home and child nodes. Access control to the documents contained within the database works in the same way as with documents contained outside the database, except that the access control file (.htaccess in this case) is stored inside the database. This was implemented by adding a small parsing function to the HTTP daemon so that it can interpret .htaccess files that are stored within the database.
The design phase of datatypes for the database that we chose (OBST (Uhl et al., 1994)) resulted in two datatypes called HtmlObject and HtmlLink. Instances of the HtmlObject type contain the content and keywords of each document node, whilst instances of the HtmlLink datatype contain the structural information of the document node, and are related to the HtmlObject by sharing a unique name.
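The split between the two datatypes might look roughly as follows when written as Python dataclasses. This is a sketch for illustration, not the OBST schema: only the type names HtmlObject and HtmlLink and the shared unique name come from the text, while the field names and the `NodeStore` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HtmlObject:
    """Content side of a document node."""
    name: str                                   # unique name shared with HtmlLink
    content: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class HtmlLink:
    """Structural side of a document node (field names are assumed)."""
    name: str                                   # unique name shared with HtmlObject
    up: Optional[str] = None
    previous: Optional[str] = None
    next: Optional[str] = None
    children: List[str] = field(default_factory=list)


class NodeStore:
    """Hypothetical in-memory stand-in for the database."""

    def __init__(self) -> None:
        self.objects: Dict[str, HtmlObject] = {}
        self.links: Dict[str, HtmlLink] = {}

    def put(self, obj: HtmlObject, link: HtmlLink) -> None:
        if obj.name != link.name:
            raise ValueError("object and link must share the same unique name")
        self.objects[obj.name] = obj
        self.links[link.name] = link

    def get(self, name: str) -> Tuple[HtmlObject, HtmlLink]:
        # The shared name is the only connection between the two records.
        return self.objects[name], self.links[name]


if __name__ == "__main__":
    store = NodeStore()
    store.put(
        HtmlObject("intro", "<h1>Introduction</h1>", ["web", "publishing"]),
        HtmlLink("intro", up="home", next="structure"),
    )
    obj, link = store.get("intro")
    print(obj.keywords, link.up, link.next)
```

Because content and structure live in separate records, a node can be re-linked without touching its content, and its content can be edited without touching its links.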
A major undertaking in the continuing work on this project will be an adaptation of the HTML database to the Matilda architecture (Lowe and Ginige, 1996). The Matilda architecture (and associated data model) forms part of another major project being undertaken by the HyVis group at UTS to develop a flexible architecture for intelligently assisted construction of hypermedia information systems.
The HTML database will also at some stage be integrated with revision control software in order to maintain edit histories, change logs and exclusive edit locking (with associated ownership) for each node contained within the system. This would lead to a fully network distributable HTML structuring, editing and publishing environment, in which many users could contribute to the one knowledge base without interfering with the work of any other user on the network.
Congress, US. (1989). Inaugural Addresses of the Presidents of the United States: From George Washington 1789 to George Bush 1989. Senate document (United States. Congress. Senate) 101-110. http://www.columbia.edu/acis/bartleby/ [verified 1 Dec 2000 at http://www.bartleby.com/124/index.html]
Ginige, A., Lowe, D. B. and Robertson, J. (1995). Hypermedia authoring. IEEE Multimedia, 2(4), December 1995.
Ginige, A., Witana, V. and Yourlo, Z. (1996). Use of the world wide web in the delivery of education: A case study. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 140-148. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/ek/ginige.html
Greenspun, P. (1994). We Have Chosen Shame, and Will Get War. http://www-swiss.ai.mit.edu/philg/research/shame-and-war.html [verified 1 Dec 2000 at http://philip.greenspun.com/research/shame-and-war.html]
Lowe, D. B. and Ginige, A. (1996). MATILDA: A framework for the representation and processing of information in multimedia systems. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 229-236. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/lp/lowe2.html
Munn, N. L. (1961). Psychology, 4th Edition. Boston: Houghton Mifflin Company.
Uhl, L., Theobald, D., Schiefer, B., Ranft, M., Zimmer, W. and Alt, J. (1994). The Object Management System of STONE. Forschungszentrum Informatik (FZI), Haid-und-Neu-Strasse 10-14, D-76131 Karlsruhe.
Authors: Mr Zhenya Yourlo, A/Prof Athula Ginige, Ms Varuni Witana
School of Elec. Engineering
University of Technology, Sydney
PO Box 123, Broadway NSW 2007, Australia
Ph: +61 2 330 2393 Fax: +61 2 330 2435
Email: firefox@ec.uts.edu.au, athula@ec.uts.edu.au, varuni@ec.uts.edu.au
Please cite as: Yourlo, Z., Ginige, A. and Witana, V. (1996). A maintainable solution for publishing documents on the world wide web. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 439-446. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/ry/yourlo.html