In computer systems, there is an inextricable relationship between algorithms (corresponding to processing tasks) and the data they act on. It is both efficient and appropriate to select a format (or representation) for data which is suited to the algorithms most commonly performed on the data. As an example, suppose we were required to uniformly brighten an image by adding a constant value to all of its pixels. If we were to perform this operation on a raw format image of dimension 256 x 256 it would take 65536 additions. If on the other hand a Fourier domain format were used (in which a single stored value specifies the average intensity of the data) the same operation may be performed with a single addition. We refer to this optimisation by saying that the data format supports a specified set of operations.
Of importance to image and video information are the operations "read from" and "write to" a storage device. These operations are made more efficient if a compressed format is chosen. On the other hand, if the most important operation is display, a format which reflects the layout of the display memory or the primitives used to communicate with display hardware will be most efficient. To date, these two issues have existed independently and driven image and video formats down divergent paths. Recently however, dedicated hardware for decoding standard image and video formats has become available. In the foreseeable future display hardware will be capable of accepting image and video information in a range of different formats. This provides an opportunity for the development of data formats which support more of the operations performed by application software - leaving the mapping of this format to the display memory up to the system hardware.
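The brightening example can be checked numerically. The following is a minimal sketch, assuming a discrete Fourier transform normalised by the pixel count (so that the zero-frequency coefficient equals the mean intensity); the helper `dft2`, the 4 x 4 test image and the offset value are illustrative choices and are not taken from any particular format.

```python
import cmath

def dft2(image):
    """Naive 2-D discrete Fourier transform, normalised by the pixel count."""
    rows, cols = len(image), len(image[0])
    n = rows * cols
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * cmath.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            out[u][v] = total / n
    return out

# Illustrative 4 x 4 image and brightness offset (assumed values).
image = [[10, 20, 30, 40],
         [50, 60, 70, 80],
         [15, 25, 35, 45],
         [55, 65, 75, 85]]
offset = 12

# Raw format: one addition per pixel.
brightened_raw = [[p + offset for p in row] for row in image]

# Fourier format: only the zero-frequency (average intensity) term changes.
spectrum = dft2(image)
spectrum[0][0] += offset

# The two routes should agree up to floating point rounding.
reference = dft2(brightened_raw)
max_error = max(abs(a - b)
                for row_a, row_b in zip(spectrum, reference)
                for a, b in zip(row_a, row_b))
```

With an unnormalised transform the same idea holds, but the single addition would be the offset multiplied by the number of pixels.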
In this paper we consider image and video formats in the context of the operations commonly found in multimedia computing. In section 2 we propose a set of requirements based on our own research into multimedia authoring and presentation systems. In section 3 we show how many of these requirements are linked to an ability to access (image and video) data at a range of resolutions. The concept of scalability is presented as a solution to this in section 4. The scalability of current standards as well as emerging formats is considered in section 4.1, from which some broader implications and concluding remarks (section 5) are drawn.
This particular problem has a second, more complex manifestation in distributed multimedia systems. In such systems, the network resources not only vary according to their physical properties (architecture of the host machine, bandwidth of network cables etc.) but may also vary over time with fluctuations in network activity. In such situations it is important that client and server applications can dynamically adjust certain quality of service parameters in order to meet any real time constraints - such as maintaining a relatively constant frame rate for video data.
In fact both of these problems may be addressed by allowing a trade-off between data volume and certain quality parameters. In order to cope with variations in system capabilities during presentation, this trade-off must be available at presentation time rather than (as currently predominates) at the time the data is encoded. Even in the case where system parameters are constant with time (as can be assumed for many single user environments), this approach carries the added advantage that a single set of distribution media may be "pruned" at presentation time to suit the architecture of the host machine. This would represent a significant improvement in the portability of multimedia (and in particular image and video) information.
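The idea of pruning at presentation time can be sketched as follows. This is only an illustration under stated assumptions: the data is taken to be stored as an ordered list of layers (a base layer followed by refinements), and the function name `layers_within_budget`, the layer sizes and the budgets are all hypothetical.

```python
def layers_within_budget(layer_sizes, budget_bytes):
    """Return (number of leading layers that fit, bytes used).

    Layers are assumed to be ordered from the base layer to the finest
    refinement, so only a prefix of the list can be decoded.
    """
    used = 0
    count = 0
    for size in layer_sizes:
        if used + size > budget_bytes:
            break
        used += size
        count += 1
    return count, used

# Hypothetical sizes in bytes: base layer followed by three detail layers.
layer_sizes = [2_000, 6_000, 24_000, 96_000]

# The same stored data serves two hosts with different capabilities.
slow_link = layers_within_budget(layer_sizes, 10_000)
fast_link = layers_within_budget(layer_sizes, 200_000)
```

Because the decision is made by the presenting application, the budget can be recomputed as network or processor load changes, without re-encoding the source.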
Currently the provision of this structuring information is difficult. Invariably it is entered entirely manually, as with the "image maps" used in the markup of images for HTML documents. This in turn represents a significant cost in the production of visual information systems and severely restricts the level of interactivity that can practically be provided by applications - especially for video information, which may require separate markup for each frame.
Possibly the most attractive solution to this problem is through application of image analysis techniques. If the computer is able to refine a user's rough input based on an analysis of the image data, the manual tracing of region boundaries is made considerably more efficient. The natural extension of such a system to video would see the machine able both to assist with the initial definition of regions and to track these regions through successive frames. A number of systems which make use of such techniques have been described recently (Ueda, Miyatake & Yoshizawa 1991; Flickner, Sawhney, Niblack, Ashley, Huang, Dom, Gorkani, Hafner, Lee, Petkovic, Steele & Yanker, n.d.). Image analysis can also play an important role in the extraction of syntactic structure from image data. In the QBIC system (Flickner et al. n.d.) for example, successive frames of a video sequence are analysed for the purpose of region extraction based on motion information. After this segmentation, each region is assigned a depth based on its interaction with other regions over time. The resulting structure can support queries which include simple 3D predicates such as "find an image of A in front of B". For complex scenes with many regions, the ability to automate this process represents a significant saving over manual processing.
A problem which will need to be addressed if analysis is to find more widespread use in multimedia, however, is that most analysis techniques are computationally expensive. Where analysis is combined with interaction, response time may become a problem. We can however view this problem as a generalisation of the "multiple architecture" problem of section 2.2. In particular, where the data volume has been reduced by trading away perceptual quality, analysis may yield a useful but more approximate result, and the reduced data volume shortens the processing time. Given appropriately designed algorithms, it is conceivable that the interactive use of analysis could be supported by this means.
A second approach to this problem, which should be examined in parallel with the above considerations, is the selection of a data format which makes explicit the information most directly useful in analysis (Lowe 1993, Burt 1984). Using the terminology of section 1, we note that such formats support certain analysis operations.
The second issue is ensuring that the required presentation is achieved consistently across a range of output devices. A problem here is that the display resolution of different output devices may vary widely. For example, screen resolution may vary in the range of 70-120 ppi (pixels per inch) while hard copy devices such as dye sublimation printers and slide recorders usually have a resolution in excess of 300 ppi. If presentation size is determined in terms of pixels (as it typically is), display size may vary dramatically between devices and machines. This problem is exacerbated by the fact that the resizing of data is (currently) handled in an ad hoc fashion according to the particular application being used and the particular class of device it is communicating with.
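The dependence of display size on device resolution is simple arithmetic, sketched below. The function names and the 256-pixel and 2-inch example values are illustrative assumptions; the 70 and 300 ppi figures are the ones quoted above.

```python
def physical_size_inches(width_px, height_px, ppi):
    """Physical size of an image drawn pixel-for-pixel on a device."""
    return width_px / ppi, height_px / ppi

def pixels_for_size(width_in, height_in, ppi):
    """Pixel dimensions needed to fill a given physical size on a device."""
    return round(width_in * ppi), round(height_in * ppi)

# A 256 x 256 image shown pixel-for-pixel on a screen and on a printer.
on_screen = physical_size_inches(256, 256, 70)    # roughly 3.7 inches square
on_printer = physical_size_inches(256, 256, 300)  # under 1 inch square

# Pixels needed for the same 2 x 2 inch presentation on each device.
screen_pixels = pixels_for_size(2, 2, 70)
printer_pixels = pixels_for_size(2, 2, 300)
```

Specifying the presentation in physical units and deriving the pixel dimensions per device, as in `pixels_for_size`, is one way to make the result consistent, provided the data can be accessed at the derived resolution.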
Thus the ability to finely control the size and resolution of image and video data is the key requirement for presentation. To make this a more precise requirement (as resizing of image and video data is obviously possible and performed all the time) we additionally require that this process incorporate an efficiency trade off. That is, there should be a proportional relationship between the resolution at which data is accessed and the processing (bandwidth etc.) required to perform the access. Only in this way can the full range of presentation modes (icon, normal, magnified, fast forward and rewind etc.) be readily supported on a single data set.
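As a worked example of the proportionality requirement, the sketch below counts the pixels that must be handled when a 256 x 256 image is accessed at reduced resolution. It assumes that access cost is dominated by the number of pixels touched; the function name `pixels_at_scale` is illustrative.

```python
def pixels_at_scale(width, height, scale):
    """Pixel count when each dimension is multiplied by `scale`."""
    return int(width * scale) * int(height * scale)

full = pixels_at_scale(256, 256, 1.0)       # 65536
half = pixels_at_scale(256, 256, 0.5)       # 16384
quarter = pixels_at_scale(256, 256, 0.25)   # 4096

# Halving the linear resolution quarters the data that must be accessed.
ratios = (full / half, half / quarter)
```

Under this assumption an icon-sized view at quarter resolution should cost about one sixteenth of a full resolution access, rather than requiring the full data to be read and then resized.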
A third issue, which we will not pursue in any depth, is that of providing perceptually correct colours on different output devices. This particular problem is well understood (Trussell 1993) and colour matching software is commercially available.
Figure 1: Trading image resolution for data size. The full resolution image (a) contains 65536 pixels; the same image at half resolution (b) contains 16384 pixels; and at quarter resolution (c), 4096 pixels. The effect on perceived quality is notably less significant.
Postponing a consideration of compression (about which much is already known) we observe that the problem of efficiently providing access to data at a range of resolutions is central to the requirements of multimedia. Specifically, a solution to this requirement provides a means by which processing and bandwidth demands may be traded (smoothly) against perceptual quality factors. It also provides a means of controlling the presentation of the data across a wide range of output (and input) devices by supporting access to the data at a resolution appropriate to any device. Due to the close relationship between resolution and size, the size of presentation may also be controlled by this mechanism.
A similar rationale may be applied to the use of computer analysis of visual data. As was noted in section 2.3, most analysis techniques are computationally expensive. However, good approximate results may be achieved by processing data at lower resolution. In addition, both the efficiency and reliability of certain types of analysis may be improved by use of so-called "multi-resolution" techniques (Rozenfeld 1984) in which an initial low resolution analysis is progressively refined.
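A coarse-to-fine search gives a small illustration of the multi-resolution idea. The sketch below locates a bright peak by first searching a low resolution version of the image and then refining the estimate within the corresponding 2 x 2 neighbourhood at each finer level. It is an assumed toy example rather than a technique taken from the cited works: the function names and test image are hypothetical, the image dimensions must be divisible by 2 raised to the number of levels, and the result is approximate in general (it is reliable only when the feature is smooth enough to survive averaging).

```python
def downsample(image):
    """Halve each dimension by averaging non-overlapping 2 x 2 blocks."""
    return [[(image[y][x] + image[y][x + 1]
              + image[y + 1][x] + image[y + 1][x + 1]) / 4
             for x in range(0, len(image[0]), 2)]
            for y in range(0, len(image), 2)]

def argmax(image, y_range, x_range):
    """Position of the largest value within the given row/column ranges."""
    return max(((y, x) for y in y_range for x in x_range),
               key=lambda p: image[p[0]][p[1]])

def coarse_to_fine_peak(image, levels):
    """Find a peak at the coarsest level, then refine it level by level."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    coarse = pyramid[-1]
    y, x = argmax(coarse, range(len(coarse)), range(len(coarse[0])))
    for level in range(levels - 1, -1, -1):
        finer = pyramid[level]
        # Only the 2 x 2 block under the current estimate is examined.
        y, x = argmax(finer, range(2 * y, 2 * y + 2), range(2 * x, 2 * x + 2))
    return y, x

# Illustrative 8 x 8 image with a single smooth peak at row 5, column 2.
image = [[100 - ((y - 5) ** 2 + (x - 2) ** 2) for x in range(8)]
         for y in range(8)]
peak = coarse_to_fine_peak(image, 2)
```

Here the pyramid is built explicitly for the sake of a self-contained example; with a format that already provides access at reduced resolutions, the low resolution versions would not need to be computed by the application.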
In summary, we note that compression and multi-resolution access are key requirements for image and video formats used in multimedia. One way of providing efficient multi-resolution access, which has received attention in the recent image coding literature, is through the use of a scalable code.
The smallest element in most scalable (image and video) codes is a significantly reduced version of the original data. This provides a floor to the resolution at which data may be accessed (without the express need for resizing by an application). The remainder of the file is composed of "detail elements". The resolution of the decoded image may be increased by the inclusion of more of these detail elements up to the full resolution of the data (see Figure 2).
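The structure of a reduced base plus detail elements can be illustrated on a single row of pixel values. The sketch below uses pairwise averages and differences (a Haar-like decomposition) purely as an assumed stand-in; it is not the transform of any particular codec, and the names `encode` and `decode` and the sample row are hypothetical.

```python
def encode(signal, levels):
    """Split a signal into a coarse base plus per-level detail elements."""
    details = []
    current = list(signal)
    for _ in range(levels):
        averages = [(current[i] + current[i + 1]) / 2
                    for i in range(0, len(current), 2)]
        diffs = [(current[i] - current[i + 1]) / 2
                 for i in range(0, len(current), 2)]
        details.append(diffs)
        current = averages
    details.reverse()  # coarsest detail first, so the stream can be truncated
    return current, details

def decode(base, details, levels_used):
    """Rebuild the signal using only the first `levels_used` detail elements."""
    current = list(base)
    for diffs in details[:levels_used]:
        finer = []
        for avg, diff in zip(current, diffs):
            finer.extend([avg + diff, avg - diff])
        current = finer
    return current

row = [8, 6, 7, 3, 2, 4, 9, 9]
base, details = encode(row, 2)

lowest = decode(base, details, 0)   # 2 samples: the floor resolution
middle = decode(base, details, 1)   # 4 samples
full = decode(base, details, 2)     # 8 samples: the original row
```

Each additional detail element doubles the decoded resolution, and a decoder that stops early simply never reads the remaining elements.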
In Taubman & Zakhor (1994), a scalable (multi-rate) video codec is described which is based on the wavelet transform. A software implementation of the codec has also been released (Taubman 1995). This supports a range of playback rates and resolutions (an example set of rates drawn from Taubman & Zakhor (1994) is given in Table 1) and provides an excellent example of work in progress in this field. The performance of this codec in terms of compression ratio for a specified quality level was found to be equivalent to or in excess of MPEG-1.
| Frame format | Available frame rates (fps) |
| 352 x 240 colour | 30, 15, 7.5, 3.75 |
| 352 x 240 monochrome | 30, 15, 7.5, 3.75 |
| 176 x 120 colour | 30, 15, 7.5, 3.75 |
| 176 x 120 monochrome | 30, 15, 7.5, 3.75 |
| 88 x 60 colour | 30, 15, 7.5, 3.75 |
| 88 x 60 monochrome | 30, 15, 7.5, 3.75 |
| 44 x 30 colour | 15, 7.5, 3.75 |
| 44 x 30 monochrome | 30, 15, 7.5, 3.75 |
| 22 x 15 monochrome | 15, 7.5, 3.75 |
Table 1: Playback rates supported on the scalable codec for a data source of size 352 x 240 and 30 fps frame rate. Note that all rates are related by a scale factor of 2.
The significant finding from this and other related works, from our perspective, is that it is possible to achieve compression and multi-resolution access within a single image (video) format. We are therefore able to provide support for the key requirements of multimedia within the framework of the data file format (or, more specifically, the data representation).
Providing scalability within the MPEG framework has also been considered in the literature. While there are several approaches to this, the (representative) conclusion drawn by Anastassiou (1994) is that not more than three hierarchical multi-resolution levels are practical for a single source and that the predominant application of MPEG be single resolution. In line with this, the current American HDTV standard does not provide for a hierarchical system.
There are several reasons why scalable image (and video) codes have not received the attention that single level codes (predominant in the JPEG and MPEG standards) have. One reason is their lesser maturity, but possibly the most significant reason is that there is a price to pay for scalability in both coder complexity and compression. This has been influential as much of the recent development of (image and video) codes has been motivated by the implementation of High Definition Television (HDTV) services. For such services, compression and low decoder cost are the key requirements as display sizes are predetermined and fixed. For these services, multi-resolution access is only needed in a very limited form - in order to provide backward compatibility with existing TV receivers. This is a very different environment to that of multimedia computing where, while complexity and compression are still important, efficient access across a finely graded and wide ranging set of resolutions is of utmost importance.
In addition to these issues, a second problem exists with the majority of scalable codecs which have been proposed in the literature. That is, they support decoding only at a limited set of resolutions, specifically dyadic multiples of the "minimum" resolution (i.e. 1, 2, 4, 8 times the minimum resolution). Providing a finer gradation remains an open area for research (Dorrell 1995).
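The practical effect of a dyadic resolution set can be seen in a short sketch. The resolutions generated below match the frame formats of Table 1; the function names and the 200 x 136 target window are hypothetical, chosen only to show a case that falls between two levels.

```python
def dyadic_resolutions(min_width, min_height, levels):
    """Resolutions at 1, 2, 4, ... times the minimum resolution."""
    return [(min_width * 2 ** k, min_height * 2 ** k) for k in range(levels)]

def smallest_covering(resolutions, target_width, target_height):
    """Smallest available resolution covering the target, else the largest."""
    for width, height in resolutions:
        if width >= target_width and height >= target_height:
            return width, height
    return resolutions[-1]

available = dyadic_resolutions(22, 15, 5)

# A display window slightly larger than 176 x 120 forces the next level up.
choice = smallest_covering(available, 200, 136)
overhead = (choice[0] * choice[1]) / (200 * 136)
```

In this case the decoder must produce roughly three times as many pixels as the window needs and then resize them, which is the inefficiency a finer gradation of resolutions would avoid.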
We argue that while current standards may have been appropriate for early multimedia systems, in which capture and fixed mode presentation predominated, they are inadequate for the current generation of interactive multimedia applications. The need for greater scalability in particular suggests that some rethinking is required of the current direction for image and video compression standards used in multimedia. For this reason, current efforts to develop a wavelet based compression standard are promising and should be embraced with enthusiasm by the multimedia community. But it must not stop there. The development of image and video formats for use in multimedia must take into account the scope of operations which are required to be performed on the data. The need to be able to perform analysis on the format, as well as to have fine control over its presentation on a wide range of output devices, must receive a similar level of consideration to compression. Perhaps only when this happens will there start to be truly integrated image and video information (until then we must continue to deliver the illusion and depend on tomorrow's faster architecture).
Arman, F., Hsu, A. & Chiu, M. (1993). Image processing on compressed data for large video databases. In ACM Multimedia'93.
Bove, Jr., V. M. (1993). Scalable (extensible, interoperable) digital video representations. In Digital Images and Human Vision, The MIT Press, Ch 3, pp.23-33.
Burt, P. (1984). The pyramid as a structure for efficient computation. In Multi-resolution Image Processing and Analysis. Springer-Verlag.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
Dorrell, A. (1995). Visual information representation for scalable coding and analysis, Doctoral assessment report available from URL http://www.ee.uts.edu.au/~andrewd/publications/docass/report.html
Dorrell, A. & Lowe, D. (1995). Fast image operations in wavelet spaces. In Digital Image Computing, Techniques and Applications, Australian Pattern Recognition Society, Brisbane, Australia.
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D. & Yanker, P. (n.d.). Query by image and video content: The QBIC system. Submitted to IEEE Computer, special issue on Content based picture retrieval systems. Preprint via URL: http://www.ibm.com/
Hilton, M. L., Jawerth, B. D. & Sengupta, A. (1994). Compressing still and moving images with wavelets. Multimedia Systems.
Lowe, D. (1993). Image Representation by Information Decomposition. PhD thesis, School of Electrical Engineering, University of Technology, Sydney.
Lowe, D. B. and Ginige, A. (1996). MATILDA: A framework for the representation and processing of information in multimedia systems. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 229-236. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/lp/lowe2.html
Rozenfeld, A. (Ed) (1984). Multi-resolution Image Processing and Analysis. Springer-Verlag.
Smith, B. C. & Rowe, L. A. (1993). A new family of algorithms for manipulating compressed images. IEEE Journal on Computer Graphics and Applications, 13(5), 34-42.
Taubman, D. S. (1995). Fully scalable, low latency video codec. Public software release available by ftp from robotics.eecs.berkeley.edu in /pub/multimedia/scalable2.tar.Z
Taubman, D. & Zakhor, A. (1994). Multi-rate 3-D subband coding of video. IEEE Transactions on Image Processing, 3(5), 572-588.
Trussell, H. J. (1993). DSP solutions run the gamut for colour systems. IEEE Signal Processing Magazine, pp. 8-23.
Ueda, H., Miyatake, T. & Yoshizawa, S. (1991). IMPACT: Interactive motion picture authoring system for creative talent. In Conference Proceedings on Human Factors in Computing Systems, ACM, New York, NY, USA, p.525.
Woods, J. W. & O'Neil, S. D. (1986). Subband coding of images. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(5), 1278-1288.
Zandi, A., Allen, J. D., Schwartz, E. L. & Boliek, M. (1995). CREW: Compression with reversible embedded wavelets. Preprint available from URL: http://www.crc.ricoh.com/misc/crc-publications.html
Authors: Andrew Dorrell and David Lowe
School of Electrical Engineering
University of Technology, Sydney
PO Box 123 Broadway 2007 Sydney Australia
Email: firstname.lastname@example.org, email@example.com
Please cite as: Dorrell, A. and Lowe, D. B. (1996). Scalable visual information in multimedia. In C. McBeath and R. Atkinson (Eds), Proceedings of the Third International Interactive Multimedia Symposium, 113-118. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/ad/dorrell.html