In a multimedia system, information is presented to the user as a combination of text, sound, graphics, video and still images. In a workstation based multimedia information system, passive presentation of information can be combined with the interactive features of a computer. An important feature of workstation based multimedia systems is therefore that they allow users to interact with the information presented to them.
Workstation based interactive multimedia systems can thus enrich educational and entertainment activities [4,7]. These systems will allow us to visit faraway places and explore the surroundings without physically travelling there. With such systems, teleworking also becomes a reality.
The main difficulty is to integrate digital video into a multimedia workstation environment in a cost effective way. The video stream also has to be synchronised with the other media streams, especially audio.
The way images are displayed on multimedia systems will have a significant impact on the wider use of these systems. The methods available to display video in a multimedia information system can be broadly classified into
Raw video requires a large amount of storage and high bandwidth to transmit, so it is necessary to code the raw video prior to transmission or storage. To make multimedia information systems commercially viable, a compression scheme is required that can produce good quality motion pictures at a variable bit rate in the range 0.1 - 2 Mbits/sec. Raw images can be compressed because of the high correlation each pixel has with its spatial and temporal neighbours.
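As a rough illustration (the arithmetic is ours): raw CIF size (352x288) colour video at 24 bits per pixel and 25 frames per second amounts to 352 x 288 x 24 x 25 = 61 Mbits/sec, so reaching the 0.1 - 2 Mbits/sec range calls for compression ratios of roughly 30:1 to 600:1.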
Figure 1: Multimedia system architecture
We also decided to use an iterative development process: initially using existing technology and methods wherever possible, developing new technology or algorithms only where no suitable existing method could be found, and then gradually refining the system based on the experience gained.
After surveying the current technology we concluded that we had to develop a new video coding scheme. We examined the existing schemes in terms of their ability to meet the following requirements and found none that met them all.
Much progress has been made in Discrete Cosine Transform (DCT) based coding schemes, resulting in the H.261, JPEG (Joint Photographic Experts Group) and MPEG (Moving Picture Experts Group) standards [8,9]. The main emphasis in these coding schemes has been data reduction, and very little consideration has been given to the requirements associated with integrating the coding scheme into workstation based multimedia information systems. As a result it is not possible to interact at the graphical information level with images coded using waveform coding techniques such as the DCT.
The level of interaction that is required for workstation based multimedia systems can only be achieved through coding techniques that make use of graphic information in images.
If we abstract the processes involved in computer graphics generation, we can view them as a transformation of a set of model data into an image or a sequence of images.
    model data ---[ computer graphics ]---> images
On the other hand, computer vision extracts data from an image. For example, in a typical pattern recognition application the various objects in the scene, their shape, position and orientation, and sometimes even their velocity vectors are extracted from the image. Although global scene understanding still presents a vast number of open problems, much progress has been made within constrained environments.
Again, if we abstract the various processes in computer vision, we can represent them as follows.
    images ---[ computer vision ]---> model data
Therefore, by combining these two areas, we developed a scheme to extract data from an image in a manner suitable for display on a workstation. This process can be modelled as follows.

    images ---[ computer vision ]---> model data ---[ computer graphics ]---> images
The field of computer animation demonstrates the power of this approach; indeed, many of today's movies are partially generated with computer animation techniques. It is then feasible that video should undergo the reverse transformation and be represented using animation techniques [10]. This reverse transformation is not straightforward: to accomplish it, one must utilise knowledge from the area of computer vision in order to find a suitably high level representation for the video. Computer animation utilises very high level information regarding the geometric configuration of objects and their motion. This high level representation is then decomposed into a raster scan representation, the lowest level representation possible. Conversely, to encode video, which is at the raster scan level, as computer animated graphics, one must identify and model the objects which compose the scene, together with their motion. This type of information about images and image sequences can only be obtained through image understanding as expressed in computer vision techniques. What we are discussing is in effect a unification of three different visual data processing fields: computer graphics, computer vision and image coding.
This scheme also allows asymmetric coding and decoding. We can even eliminate decoding entirely by coding the image in terms of primitives that a workstation can display directly. The scheme is therefore ideal for stored video applications, where coding can be done offline and the coded images displayed in real time.
Thus to implement this scheme it is necessary to analyse the characteristics of computer display systems and to establish the primitives that can be easily displayed on a graphic workstation.
One of the main characteristics of many frame buffers, however, is that they are colour mapped: commonly only 256 colours are available for image composition. Quantisation to 256 colours generally introduces unacceptable image degradation and artefacts such as contouring. A very large benefit of colour mapped displays, however, is the data compaction they offer, considering that most images contain fewer than 10,000 distinguishable colours. Coding of images for display on graphic workstations must take all of these properties into account and attempt to utilise the benefits while avoiding the liabilities.
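For example (the arithmetic is ours): a 352x288 image held as 8 bit indices into a 256 entry colour map occupies 352 x 288 = 99 Kbytes plus a 768 byte map, roughly a third of the 297 Kbytes the same image requires at 24 bits per pixel.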
Colour quantisation is a two stage operation: colour subset selection followed by the actual quantisation. The simplest adaptive scheme is the popularity algorithm, which is fast but suffers from one main problem: it does not attempt to ensure that colours are selected from the entire colour range of the image. Some image colours are therefore rendered with large errors because the nearest supported colour is too far away. A major reason for this is that several mutually indistinguishable colours may all have high frequencies and so will all be included in the colour map. This excludes other colours which may be quite visually important, and regions containing those colours suffer false colouring as a result. Other methods such as the Median Cut algorithm go some way towards removing this particular problem by dividing the colour space into cubes, ensuring a wider selection of representative colours and reducing false colouring. This, however, does nothing to alleviate contouring, and some indistinguishable colours may still find their way into the colour map.
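As an illustration of the failure mode just described, the following sketch (our own hypothetical code, not taken from any cited implementation) expresses the popularity algorithm in Python: the colour map is filled with the most frequent colours, and every pixel is then mapped to its nearest entry, however far away that entry is.

    from collections import Counter

    def popularity_quantise(pixels, map_size=256):
        # pixels: a flat list of (r, g, b) tuples in the range 0-255.
        # Stage 1: colour subset selection - keep the most frequent colours.
        colour_map = [c for c, _ in Counter(pixels).most_common(map_size)]

        # Stage 2: quantisation - map each pixel to its nearest map entry.
        # A rare but visually important colour may lie far from every
        # entry, producing the false colouring described above.
        def nearest(c):
            return min(colour_map,
                       key=lambda m: sum((a - b) ** 2 for a, b in zip(c, m)))

        return [nearest(c) for c in pixels], colour_map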
The solution is to exclude indistinguishable colours from the colour map, and hence maximise colour map utilisation, by first processing the colours according to a Human Visual System (HVS) colour perception model. This ensures that as many necessary colours as possible are selected for the colour map. The effect on the various quantisation schemes is to equalise and boost their performance.
Traditionally, eliminating quantisation generated artefacts requires dithering the image, so dithering has been essential for low colour, high quality image display. Dithering exploits the integrating properties of the HVS by trading colour resolution for spatial resolution. But dithering decomposes large clusters of pixels of the same colour into small clusters or individual pixels. This reduces the spatial correlation, increasing the amount of information that needs to be coded and hence the data rate. Since the artefacts are primarily caused by inefficient utilisation of the colour look up table's resources, they can be effectively overcome by exploiting the properties of the HVS; dithering then becomes unnecessary.
These issues are very important when considering multimedia video coding because they severely affect the quality of the video, which in turn determines how realistic, and therefore credible, the video is. We have made quality considerations of prime importance, as quality is one of the major limitations of current real time software decodeable video coding schemes.
4.3.1 Graphic primitives based coding
By graphic primitives we are generally referring to such things as points, lines and polygons, rather than lower level objects such as pixel maps. The intention is to decompose an image into a set of points, lines and polygons which can be readily manipulated by computers and graphics processors. The coding method should ideally be lossless and readily reversible. The easiest way to convert from a raster image to a vector based image is run length encoding, which generates lines. Raster to polygon algorithms also exist [11,12].
Run length coding schemes excel where the spatial correlation between pixels is high. However, they can actually expand the data when the spatial correlation is very low, as tends to occur when an image has been dithered. The solution is an adaptive technique such as a run length limited scheme, which encodes single pixels differently and so gives rise to points as well as lines. Standard run length coding requires on average 2 bytes of data per run; with run length limited coding, single points require only 1 byte, with 2 bytes for runs of two or more pixels. A fully adaptive scheme would differentiate between various types of pixel cluster, encoding each with only one byte on average. Two dimensional run length coding can also be used to generate polygons.
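The following sketch (our own illustration; the byte layout is an assumption, in particular the use of a 128 entry colour map so that the top bit of a byte can flag a run) shows how a run length limited scheme can spend one byte on an isolated pixel and two bytes on a run.

    def rll_encode(row):
        # row: colour map indices (0-127) for one scan line.
        out = bytearray()
        i = 0
        while i < len(row):
            run = 1
            while i + run < len(row) and row[i + run] == row[i] and run < 255:
                run += 1
            if run == 1:
                out.append(row[i])           # isolated pixel: colour only
            else:
                out.append(0x80 | row[i])    # top bit set: a run follows
                out.append(run)              # run length (2-255)
            i += run
        return bytes(out)

    def rll_decode(data):
        row, i = [], 0
        while i < len(data):
            if data[i] & 0x80:               # a run: colour plus length
                row.extend([data[i] & 0x7F] * data[i + 1])
                i += 2
            else:                            # an isolated pixel
                row.append(data[i])
                i += 1
        return row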
Raster to polygon algorithms tend to suffer from problems similar to those of the run length schemes, since they too are based on colour clustering. Some overcome this by forcing unclustered pixels into the nearest cluster [11]; however, this severely degrades image quality unless a non-adaptive, uniform quantisation scheme has been used and the quality is already low. Other algorithms attempt to improve the compression ratio by forming only half polygons [12]. One other approach fits square polygons to the data hierarchically through quadtrees. In terms of compression ratio, however, these schemes differ little from run length schemes.
All of these coding methods utilise colour clustering; the issue is how to structure the data. In all, three different variables need encoding in a graphic primitive: location, colour and shape. Of these, the location requires the largest amount of data to represent: typically 4 bytes, against one byte for the colour and, if dithering has been used, on average only one byte for the shape. This is why run length coding is so efficient: the location of each primitive is implicitly coded within the final data structure, so only the colour and length need to be explicitly coded. Structuring the data so that colour clusters are extracted from the image and grouped according to colour means only the location and shape of each primitive need to be explicitly coded, but this requires more space than a run length scheme. A hierarchical data structure can lower the amount of data to be coded by using multiple clustering levels, such as principal clusters of colour followed by subclusters of length.
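As a concrete illustration of this byte accounting (the record layouts are our own assumption), an explicitly located primitive packs into six bytes, against two bytes for a run whose location is implicit in stream order.

    import struct

    x, y, colour, shape, length = 100, 50, 17, 3, 12

    # Explicitly located primitive: 2 + 2 bytes of location,
    # 1 byte of colour and 1 byte of shape.
    explicit = struct.pack("<HHBB", x, y, colour, shape)
    assert len(explicit) == 6

    # Run length coded primitive: the location is implicit in stream
    # order, so only the colour and the run length are stored.
    implicit = struct.pack("<BB", colour, length)
    assert len(implicit) == 2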
For image sequences, however, only the first frame needs to be fully coded; after that, the display frame store can be utilised, so only the difference image needs to be coded. The difference image will mainly be composed of single and two pixel colour clusters, but the total number of primitives will be much lower.
There is, however, one major problem in doing this, which relates to the statistical properties of high quality natural images and the types of primitives required to express them. Traditional representations are extremely inefficient with respect to the amount of data needed to express the required graphic primitives. We are currently working on a new, efficient representation scheme for these graphic primitives.
4.3.2 Temporal coding
The temporal coding phase is where most of the data compression will occur. For now we limit temporal coding to conditional replenishment, because it gives good results without the need for work on more esoteric techniques. It is the only technique that fully exploits the frame buffer memory capabilities and requires no decoding or processing. However, various non-evident problems arose in defining the replenishment condition, making the technique less straightforward than originally envisaged.
One major problem was that of determining a robust replenishment condition that is insensitive to noise and yields only essential replenishment data. This is not simple. Typically, Euclidean distances are used to set a threshold with which to determine valid data. This proved unsatisfactory: it produced unnecessary replenishment data in some images and insufficient data in others, depending on their colour composition, so finding an optimum threshold was very difficult if not impossible. The reason is that a uniform change of colour in one instance will not produce an equal perceived change in another, owing to the non-linear and non-uniform properties of the human visual system. Various rather inelegant solutions have been proposed, such as introducing memory into the threshold decision making process and using dynamic thresholds.
The best and most elegant approach is rather simple: use the actual perceived colour change as the metric and apply a constant uniform threshold function to it. This gives (subjectively) optimum results under all tested conditions. Another problem which presented itself was that some pels change value very gradually over time and so are never detected. A second image therefore needs to be updated with the pels that have not been selected for replenishment, ensuring that the image we test against is always a true representation of the coded image.
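One possible realisation of this perceptual metric (a sketch of our own; both the choice of CIELAB and the threshold value are assumptions, not figures from this work) measures the colour change as a Euclidean distance in an approximately uniform colour space and applies a single constant threshold.

    def rgb_to_lab(r, g, b):
        # sRGB (0-255) -> CIE XYZ -> CIELAB, D65 white point.
        def lin(c):
            c /= 255.0
            return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        r, g, b = lin(r), lin(g), lin(b)
        x = (0.4124 * r + 0.3576 * g + 0.1805 * b) / 0.95047
        y = 0.2126 * r + 0.7152 * g + 0.0722 * b
        z = (0.0193 * r + 0.1192 * g + 0.9505 * b) / 1.08883
        def f(t):
            return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16.0 / 116.0
        fx, fy, fz = f(x), f(y), f(z)
        return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

    def needs_replenishment(old_rgb, new_rgb, threshold=3.0):
        # One constant threshold on the perceived change (delta E);
        # the value 3.0 is an assumed figure.
        old, new = rgb_to_lab(*old_rgb), rgb_to_lab(*new_rgb)
        delta_e = sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
        return delta_e > threshold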
One must also take into consideration the reduced spatial resolution of the human visual system in the time domain. For this reason, any pel which meets the conditions presented so far is still not accepted for replenishment until a further test is applied to it. This test eliminates from the replenishment data any pel or cluster which is sub-threshold with respect to visibility in terms of spatial resolution.
Replenishment becomes necessary not only in the case of motion but also when illumination changes and when obsolete colour map references need updating. This last point is a major problem: owing to quantisation noise, or to the shifting colour composition of images over time as new objects enter the scene, some pixels may reference colours which are no longer supported by the colour map. A test for unsupported colours must therefore be executed first, followed by a search of the image so that pixels referencing those colours are replenished.
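A minimal sketch of this two step test (our illustration; the data layout is assumed) first finds the colours no longer supported, then scans the image for the pixels that reference them.

    def pixels_to_replenish(desired_colours, colour_map):
        # desired_colours: rows of (r, g, b) values the display should show.
        # colour_map: the colours currently loaded in the look up table.
        supported = set(colour_map)
        return [(x, y)
                for y, row in enumerate(desired_colours)
                for x, colour in enumerate(row)
                if colour not in supported]   # these must be replenished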
This overall coding technique fully exploits the frame buffer memory capabilities and requires no decoding or processing. This ensures that no extra system components are necessary to display video on any graphic workstation.
The X Window architecture consists of client applications which communicate with the server, which runs the X display. Client applications individually interact with the server through Xlib, the lowest level interface to the X protocol.
As video was coded as graphic primitives, a single client program was able to handle both video and graphics. Since X Windows does not at present have good facilities for handling audio, audio was handled by a separate process. Another client program handles the text information, and the control information coming from the user is handled by a separate client program.
The information coming through the network or from the hard disk is separated into different buffers at the I/O interface. The synchronisation module makes this information available to the respective clients at the appropriate times. Figure 2 shows this logical architecture.
Figure 2: Logical architecture of the workstation
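A minimal sketch of such a synchronisation module (the structure and names are hypothetical; timestamps in seconds are an assumption) holds the demultiplexed data in a single time ordered buffer and releases each item to its client at its presentation time.

    import heapq
    import itertools
    import time

    class Synchroniser:
        def __init__(self, clients):
            self.clients = clients    # medium name -> client callback
            self.buffer = []          # min-heap ordered by timestamp
            self.seq = itertools.count()

        def enqueue(self, timestamp, medium, data):
            # Called by the I/O interface after demultiplexing.
            heapq.heappush(self.buffer,
                           (timestamp, next(self.seq), medium, data))

        def run(self, start):
            # Release each item at its presentation time, keeping the
            # video, audio and text clients mutually synchronised.
            while self.buffer:
                timestamp, _, medium, data = heapq.heappop(self.buffer)
                delay = (start + timestamp) - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                self.clients[medium](data)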
We have now developed a logical architecture for a multimedia workstation. This architecture can be mapped onto a graphic workstation running X Windows. We have successfully integrated video into this workstation environment by developing a novel coding scheme based on the graphic information in images [13]. We can display digital CIF size (352x288) colour images in real time on a Sun Sparc II workstation without using any additional hardware.
Comparative studies show that the coding scheme we have developed outperforms other schemes reported in the literature. The Tenet Group at the University of California, Berkeley is developing a similar system; theirs can display Quarter CIF (192x144) monochrome images in real time [14].
We are working on developing representations of still and video images at levels higher than graphic primitives. Such a high level representation can be used to index images for database applications and to create links in hypermedia applications. The representation can be simple, or as complex as we like: the trade off is between the computing power required to extract the high level representation and the amount of data required to represent the image at that level. Thus as more and more computing power becomes available, we can increase the complexity of this high level representation and move towards scene understanding.
As this overall architecture does not require any additional hardware, it can be used on any graphic workstation that runs X Windows. We therefore believe it will increase the popularity of digital multimedia applications.
Authors: Dr Athula Ginige and Mr Ruben Gonzalez
School of Electrical Engineering, University of Technology, Sydney
PO Box 123, Broadway NSW 2007

Please cite as: Ginige, A. and Gonzalez, R. (1992). A workstation architecture for multimedia information systems. In Promaco Conventions (Ed.), Proceedings of the International Interactive Multimedia Symposium, 337-347. Perth, Western Australia, 27-31 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1992/ginige.html