In a multimedia system, information is presented to the user as a combination of text, sound, graphics, video and still images. In a workstation based multimedia information system, passive presentation of information can be combined with the interactive features of a computer. An important feature of workstation based multimedia systems is therefore that they allow users to interact with the information presented to them.
Workstation based interactive multimedia systems can thus enrich educational and entertainment activities [4,7]. These systems will allow us to visit faraway places and explore the surroundings without physically travelling there. With such systems, teleworking also becomes a reality.
The main difficulty is to integrate digital video into a multimedia workstation environment in a cost effective way. The video stream also has to be synchronised with the other media streams, especially audio.
The way images are displayed on multimedia systems will have a significant impact on the wider use of these systems. The methods available to display video in a multimedia information system can be broadly classified into
Raw video requires a large amount of storage and high bandwidth to transmit, so it is necessary to code the raw video prior to transmission or storage. To make multimedia information systems commercially viable, a compression scheme is required that can produce good quality motion pictures at a variable bit rate in the range 0.1 - 2 Mbits/sec. Raw images can be compressed because of the high correlation each pixel has with its spatial and temporal neighbours.
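As a rough illustration (the arithmetic is ours): raw CIF size (352x288) colour video at 24 bits per pixel and 25 frames per second amounts to 352 x 288 x 24 x 25 = 61 Mbits/sec, so reaching the 0.1 - 2 Mbits/sec range calls for compression ratios of roughly 30:1 to 600:1.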
Figure 1: Multimedia system architecture
We also decided to use an iterative development process: initially using existing technology and methods wherever possible, developing new technology or algorithms only where no suitable existing method could be found, and then gradually refining the system based on the experience gained.
After surveying the current technology we concluded that we had to develop a new video coding scheme. We examined the existing schemes in terms of their ability to meet the following requirements and found none that met them all.
Much progress has been made in Discrete Cosine Transform (DCT) based coding schemes, resulting in the H.261, JPEG (Joint Photographic Experts Group) and MPEG (Moving Picture Experts Group) standards [8,9]. The main emphasis in these coding schemes has been data reduction, and very little consideration has been given to the requirements associated with integrating the coding scheme into workstation based multimedia information systems. As a result it is not possible to interact at the graphical information level with images coded using waveform coding techniques such as the DCT.
The level of interaction that is required for workstation based multimedia systems can only be achieved through coding techniques that make use of graphic information in images.
If we abstract the processes involved in computer graphics generation, we can view them as a transformation of a set of model data into an image or a sequence of images.
    model data ---[ computer graphics ]---> images
On the other hand, computer vision extracts data from an image. For example, in a typical pattern recognition application the various objects in the scene, their shape, position and orientation, and sometimes even their velocity vectors are extracted from the image. Although global scene understanding still presents a vast number of open problems, much progress has been made within constrained environments.
Again, if we abstract the various processes in computer vision, we can represent them as follows.
    images ---[ computer vision ]---> model data
Therefore, by combining these two areas, we developed a scheme to extract data from an image in a manner suitable for display on a workstation. This process can be modelled as follows.

    images ---[ computer vision ]---> model data ---[ computer graphics ]---> images
The field of computer animation demonstrates the power of this approach; indeed, many of today's movies are partially generated with computer animation techniques. It is then feasible that video should undergo the reverse transformation and be represented using animation techniques [10]. This reverse transformation is not straightforward: to accomplish it, one must utilise knowledge from the area of computer vision in order to find a suitably high level representation for the video. Computer animation utilises very high level information regarding the geometric configuration of objects and their motion. This high level representation is then decomposed into a raster scan representation, the lowest level representation possible. Conversely, to encode video, which is at the raster scan level, as computer animated graphics, one must identify and model the objects which compose the scene, together with their motion. This type of information about images and image sequences can only be obtained through image understanding as expressed in computer vision techniques. What we are discussing is in effect a unification of three different visual data processing fields: computer graphics, computer vision and image coding.
This scheme also allows asymmetric coding and decoding. We can even eliminate decoding entirely by coding the image in terms of primitives that a workstation can display directly. The scheme is therefore ideal for stored video applications, where coding can be done offline and the coded images displayed in real time.
Thus to implement this scheme it is necessary to analyse the characteristics of computer display systems and to establish the primitives that can be easily displayed on a graphic workstation.
One of the main characteristics of many frame buffers, however, is that they are colour mapped: commonly only 256 colours are available for image composition. Quantisation to 256 colours generally introduces unacceptable image degradation and artefacts such as contouring. A very large benefit of colour mapped displays, however, is the data compaction they offer, considering that most images contain fewer than 10,000 distinguishable colours. Coding of images for display on graphic workstations must take all of these properties into account and attempt to utilise the benefits while avoiding the liabilities.
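For example (the arithmetic is ours): a 352x288 image held as 8 bit indices into a 256 entry colour map occupies 352 x 288 = 99 Kbytes plus a 768 byte map, roughly a third of the 297 Kbytes the same image requires at 24 bits per pixel.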
Colour quantisation is a two stage operation: colour subset selection followed by the actual quantisation. The simplest adaptive scheme is the popularity algorithm, which is fast but suffers from one main problem: it does not attempt to ensure that colours are selected from the entire colour range of the image. Some image colours are therefore rendered with large errors because the nearest supported colour is too far away. A major reason for this is that several mutually indistinguishable colours may all have high frequencies and so will all be included in the colour map. This excludes other colours which may be quite visually important, and regions containing those colours suffer false colouring as a result. Other methods such as the Median Cut algorithm go some way towards removing this particular problem by dividing the colour space into cubes, ensuring a wider selection of representative colours and reducing false colouring. This, however, does nothing to alleviate contouring, and some indistinguishable colours may still find their way into the colour map.
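As an illustration of the failure mode just described, the following sketch (our own hypothetical code, not taken from any cited implementation) expresses the popularity algorithm in Python: the colour map is filled with the most frequent colours, and every pixel is then mapped to its nearest entry, however far away that entry is.

    from collections import Counter

    def popularity_quantise(pixels, map_size=256):
        # pixels: a flat list of (r, g, b) tuples in the range 0-255.
        # Stage 1: colour subset selection - keep the most frequent colours.
        colour_map = [c for c, _ in Counter(pixels).most_common(map_size)]

        # Stage 2: quantisation - map each pixel to its nearest map entry.
        # A rare but visually important colour may lie far from every
        # entry, producing the false colouring described above.
        def nearest(c):
            return min(colour_map,
                       key=lambda m: sum((a - b) ** 2 for a, b in zip(c, m)))

        return [nearest(c) for c in pixels], colour_map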
The solution is to exclude indistinguishable colours from the colour map, and hence maximise colour map utilisation, by first processing the colours according to a Human Visual System (HVS) colour perception model. This ensures that as many necessary colours as possible are selected for the colour map. The effect on the various quantisation schemes is to equalise and boost their performance.
Traditionally, eliminating quantisation generated artefacts requires dithering the image, so dithering has been essential for low colour, high quality image display. Dithering exploits the integrating properties of the HVS by trading colour resolution for spatial resolution. But dithering decomposes large clusters of pixels of the same colour into small clusters or individual pixels. This reduces the spatial correlation, increasing the amount of information that needs to be coded and hence the data rate. Since the artefacts are primarily caused by inefficient utilisation of the colour look up table's resources, they can be effectively overcome by exploiting the properties of the HVS; dithering then becomes unnecessary.
These issues are very important when considering multimedia video coding because they severely affect the quality of the video, which in turn determines how realistic, and therefore credible, the video is. We have made quality considerations of prime importance, as quality is one of the major limitations of current real time software decodeable video coding schemes.
4.3.1 Graphic primitives based coding
By graphic primitives we are generally referring to such things as points, lines and polygons, rather than lower level objects such as pixel maps. The intention is to decompose an image into a set of points, lines and polygons which can be readily manipulated by computers and graphics processors. The coding method should ideally be lossless and readily reversible. The easiest way to convert from a raster image to a vector based image is run length encoding, which generates lines. Raster to polygon algorithms also exist [11,12].
Run length coding schemes excel where the spatial correlation between pixels is high. However, they can actually expand the data when the spatial correlation is very low, as tends to occur when an image has been dithered. The solution is an adaptive technique such as a run length limited scheme, which encodes single pixels differently and so gives rise to points as well as lines. Standard run length coding requires on average 2 bytes of data per run; with run length limited coding, single points require only 1 byte, with 2 bytes for runs of two or more pixels. A fully adaptive scheme would differentiate between various types of pixel cluster, encoding each with only one byte on average. Two dimensional run length coding can also be used to generate polygons.
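The following sketch (our own illustration; the byte layout is an assumption, in particular the use of a 128 entry colour map so that the top bit of a byte can flag a run) shows how a run length limited scheme can spend one byte on an isolated pixel and two bytes on a run.

    def rll_encode(row):
        # row: colour map indices (0-127) for one scan line.
        out = bytearray()
        i = 0
        while i < len(row):
            run = 1
            while i + run < len(row) and row[i + run] == row[i] and run < 255:
                run += 1
            if run == 1:
                out.append(row[i])           # isolated pixel: colour only
            else:
                out.append(0x80 | row[i])    # top bit set: a run follows
                out.append(run)              # run length (2-255)
            i += run
        return bytes(out)

    def rll_decode(data):
        row, i = [], 0
        while i < len(data):
            if data[i] & 0x80:               # a run: colour plus length
                row.extend([data[i] & 0x7F] * data[i + 1])
                i += 2
            else:                            # an isolated pixel
                row.append(data[i])
                i += 1
        return row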
Raster to polygon algorithms tend to suffer from problems similar to those of the run length schemes, since they too are based on colour clustering. Some overcome this by forcing unclustered pixels into the nearest cluster [11]; however, this severely degrades image quality unless a non-adaptive, uniform quantisation scheme has been used and the quality is already low. Other algorithms attempt to improve the compression ratio by forming only half polygons [12]. One other approach fits square polygons to the data hierarchically through quadtrees. In terms of compression ratio, however, these schemes differ little from run length schemes.
All of these coding methods utilise colour clustering; the issue is how to structure the data. In all, three different variables need encoding in a graphic primitive: location, colour and shape. Of these, the location requires the largest amount of data to represent: typically 4 bytes, against one byte for the colour and, if dithering has been used, on average only one byte for the shape. This is why run length coding is so efficient: the location of each primitive is implicitly coded within the final data structure, so only the colour and length need to be explicitly coded. Structuring the data so that colour clusters are extracted from the image and grouped according to colour means only the location and shape of each primitive need to be explicitly coded, but this requires more space than a run length scheme. A hierarchical data structure can lower the amount of data to be coded by using multiple clustering levels, such as principal clusters of colour followed by subclusters of length.
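As a concrete illustration of this byte accounting (the record layouts are our own assumption), an explicitly located primitive packs into six bytes, against two bytes for a run whose location is implicit in stream order.

    import struct

    x, y, colour, shape, length = 100, 50, 17, 3, 12

    # Explicitly located primitive: 2 + 2 bytes of location,
    # 1 byte of colour and 1 byte of shape.
    explicit = struct.pack("<HHBB", x, y, colour, shape)
    assert len(explicit) == 6

    # Run length coded primitive: the location is implicit in stream
    # order, so only the colour and the run length are stored.
    implicit = struct.pack("<BB", colour, length)
    assert len(implicit) == 2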
For image sequences, however, only the first frame needs to be fully coded; after that, the display frame store can be utilised, so only the difference image needs to be coded. The difference image will mainly be composed of single and two pixel colour clusters, but the total number of primitives will be much lower.
There is, however, one major problem in doing this, which relates to the statistical properties of high quality natural images and the types of primitives required to express them. Traditional representations are extremely inefficient with respect to the amount of data needed to express the required graphic primitives. We are currently working on a new, efficient representation scheme for these graphic primitives.
4.3.2 Temporal coding
The temporal coding phase is where most of the data compression will occur. For now we limit temporal coding to conditional replenishment, because it gives good results without the need for work on more esoteric techniques. It is the only technique that fully exploits the frame buffer memory capabilities and requires no decoding or processing. However, various non-evident problems arose in defining the replenishment condition, making the technique less straightforward than originally envisaged.
One major problem was that of determining a robust replenishment condition that is insensitive to noise and yields only essential replenishment data. This is not simple. Typically, Euclidean distances are used to set a threshold with which to determine valid data. This proved unsatisfactory: it produced unnecessary replenishment data in some images and insufficient data in others, depending on their colour composition, so finding an optimum threshold was very difficult if not impossible. The reason is that a uniform change of colour in one instance will not produce an equal perceived change in another, owing to the non-linear and non-uniform properties of the human visual system. Various rather inelegant solutions have been proposed, such as introducing memory into the threshold decision making process and using dynamic thresholds.
The best and most elegant approach is rather simple: use the actual perceived colour change as the metric and apply a constant uniform threshold function to it. This gives (subjectively) optimum results under all tested conditions. Another problem which presented itself was that some pels change value very gradually over time and so are never detected. A second image therefore needs to be updated with the pels that have not been selected for replenishment, ensuring that the image we test against is always a true representation of the coded image.
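One possible realisation of this perceptual metric (a sketch of our own; both the choice of CIELAB and the threshold value are assumptions, not figures from this work) measures the colour change as a Euclidean distance in an approximately uniform colour space and applies a single constant threshold.

    def rgb_to_lab(r, g, b):
        # sRGB (0-255) -> CIE XYZ -> CIELAB, D65 white point.
        def lin(c):
            c /= 255.0
            return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        r, g, b = lin(r), lin(g), lin(b)
        x = (0.4124 * r + 0.3576 * g + 0.1805 * b) / 0.95047
        y = 0.2126 * r + 0.7152 * g + 0.0722 * b
        z = (0.0193 * r + 0.1192 * g + 0.9505 * b) / 1.08883
        def f(t):
            return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16.0 / 116.0
        fx, fy, fz = f(x), f(y), f(z)
        return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

    def needs_replenishment(old_rgb, new_rgb, threshold=3.0):
        # One constant threshold on the perceived change (delta E);
        # the value 3.0 is an assumed figure.
        old, new = rgb_to_lab(*old_rgb), rgb_to_lab(*new_rgb)
        delta_e = sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
        return delta_e > threshold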
One must also take into consideration the reduced spatial resolution of the human visual system in the time domain. For this reason, any pel which meets the conditions presented so far is still not accepted for replenishment until a further test is applied to it. This test eliminates from the replenishment data any pel or cluster which is sub-threshold with respect to visibility in terms of spatial resolution.
Replenishment becomes necessary not only in the case of motion but also when illumination changes and when obsolete colour map references need updating. This last point is a major problem: owing to quantisation noise, or to the shifting colour composition of images over time as new objects enter the scene, some pixels may reference colours which are no longer supported by the colour map. A test for unsupported colours must therefore be executed first, followed by a search of the image so that pixels referencing those colours are replenished.
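A minimal sketch of this two step test (our illustration; the data layout is assumed) first finds the colours no longer supported, then scans the image for the pixels that reference them.

    def pixels_to_replenish(desired_colours, colour_map):
        # desired_colours: rows of (r, g, b) values the display should show.
        # colour_map: the colours currently loaded in the look up table.
        supported = set(colour_map)
        return [(x, y)
                for y, row in enumerate(desired_colours)
                for x, colour in enumerate(row)
                if colour not in supported]   # these must be replenished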
This overall coding technique fully exploits the frame buffer memory capabilities and requires no decoding or processing. This ensures that no extra system components are necessary to display video on any graphic workstation.
The X Window architecture consists of client applications which communicate with the server, which runs the X display. Client applications individually interact with the server through Xlib, the lowest level interface to the X protocol.
As video was coded as graphic primitives, a single client program was able to handle both video and graphics. Since X Windows does not at present have good facilities for handling audio, audio was handled by a separate process. Another client program handles the text information, and the control information coming from the user is handled by a separate client program.
The information coming through the network or from the hard disk is separated into different buffers at the I/O interface. The synchronisation module makes this information available to the respective clients at the appropriate times. Figure 2 shows this logical architecture.
Figure 2: Logical architecture of the workstation
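A minimal sketch of such a synchronisation module (the structure and names are hypothetical; timestamps in seconds are an assumption) holds the demultiplexed data in a single time ordered buffer and releases each item to its client at its presentation time.

    import heapq
    import itertools
    import time

    class Synchroniser:
        def __init__(self, clients):
            self.clients = clients    # medium name -> client callback
            self.buffer = []          # min-heap ordered by timestamp
            self.seq = itertools.count()

        def enqueue(self, timestamp, medium, data):
            # Called by the I/O interface after demultiplexing.
            heapq.heappush(self.buffer,
                           (timestamp, next(self.seq), medium, data))

        def run(self, start):
            # Release each item at its presentation time, keeping the
            # video, audio and text clients mutually synchronised.
            while self.buffer:
                timestamp, _, medium, data = heapq.heappop(self.buffer)
                delay = (start + timestamp) - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                self.clients[medium](data)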
We have now developed a logical architecture for a multimedia workstation. This architecture can be mapped onto a graphic workstation running X Windows. We have successfully integrated video into this workstation environment by developing a novel coding scheme based on the graphic information in images [13]. We can display digital CIF size (352x288) colour images in real time on a Sun Sparc II workstation without using any additional hardware.
Comparative studies show that the coding scheme we have developed outperforms other schemes reported in the literature. The Tenet Group at the University of California, Berkeley is developing a similar system; theirs can display Quarter CIF (192x144) monochrome images in real time [14].
We are working on developing representations of still and video images at levels higher than graphic primitives. Such a high level representation can be used to index images for database applications and to create links in hypermedia applications. The representation can be simple, or as complex as we like: the trade off is between the computing power required to extract the high level representation and the amount of data required to represent the image at that level. Thus as more and more computing power becomes available, we can increase the complexity of this high level representation and move towards scene understanding.
As this overall architecture does not require any additional hardware, it can be used on any graphic workstation that runs X Windows. We therefore believe it will increase the popularity of digital multimedia applications.
Authors: Dr Athula Ginige and Mr Ruben Gonzalez
School of Electrical Engineering, University of Technology, Sydney
PO Box 123, Broadway NSW 2007

Please cite as: Ginige, A. and Gonzalez, R. (1992). A workstation architecture for multimedia information systems. In Promaco Conventions (Ed.), Proceedings of the International Interactive Multimedia Symposium, 337-347. Perth, Western Australia, 27-31 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1992/ginige.html