IIMS 1996: Myers - the perspective for content addressable multimedia databases

The perspective for content addressable multimedia databases

Douglas G Myers
Curtin University of Technology

The need to efficiently manage data in an information system was recognised over two decades ago. The databases that evolved, though, and which are now widely used, are homogeneous and involve only small insertions of highly structured data that is forced to be unique. Consequently, to a very large extent data is synonymous with information in such systems. Multimedia, by its very nature, is based on heterogeneous date. For many potential applications of multimedia information systems, data insertions will involve massive volumes about whose contents very little is known. For example, video data. Further, what constitutes information in this data can be quite dynamic and strongly application dependent That is to say, applications can be envisaged where what can be classified as information may well depend on some learning process within the fabric of the query system. Multimedia calls for radically new forms of databases. This paper presents an argument that the most appropriate management paradigm for multimedia date revolves about retrieval by content. Four common abstract models of this paradigm are examined which have Increasing level of flexibility. Three are realistic targets for development and application within distributed multimedia systems in the short to medium term. The fourth remains conceptually by far the most attractive, but formidable research problems make it unlikely it will be seen even in experimental form for some years.

Introduction

An information system stores, communicates and processes information. Storage centres about a database and the access mechanisms provided by the database management system. Communication relays information to where the user is located where that user is any entity with which the information system is interacting. Increasingly, communication is being provided by a computer network such as the Internet. That sets bandwidth limitations and introduces issues of latency. Processing is usually involved with the interface between the user and the information system, and often concerned with transformations such as visualisation. It is essentially local and so dependent on the limitations and strengths of the resources available to the user at the access point.

A multimedia information system is distinguished in that whatever quantum of information is defined has multiple media components. At the heart of such a system is a multimedia database. As the various media components of the data are usually added in differing time scales, variations in the capabilities of the insertion process often lead to this being called an image or video database. Multimedia databases m conceptually far more complex than any currently in existence and design theories still lie very much in the research domain. The problem is that in existing databases, the concepts of data and information are quite tightly coupled while in multimedia databases there is only a loose association. Images illustrate the point. Many are a two dimensional representation of a three dimensional scene. Their information content consequently relates to that scene, not the array of pixels in the data structure which forms them. Thus that information content can and usually is vast, depending on the context in which the image is examined. For multimedia databases, then, a radically different approach to database management is required.

The potential of multimedia information systems

For what problems does a multimedia information system offer a solution? The following illustrate part of the spectrum of applications:

Electronic catalogues of, say, artworks, consumer items or industrial products.
Mobile robots and other systems can easily acquire video data. An information system designed to accept such video input has application to tasks such as automated stock replenishment in retail outlets, automated fruit picking or detecting the presence of contraband at airports.
Medical applications of multimedia information systems can be divided into two broad areas; screening applications and patient support services. Screening applications include searching mammograms for cancers or electrophoresis scans for genetic abnormalities. Patient support services include providing video showing how a patient is responding to rehabilitative treatment, or, say, a 3D reconstruction of a patients liver drawn from several imaging modalities.
Two security applications for multimedia information systems are identifying the individual who belongs to a fingerprint or fingerprint fragment, and identifying faces. The latter has application in police work where the identification is from a witness description, but also in private security systems to identify the privileges extended to individuals within a secure area.
An interesting application is to incorporate legal documents, images of the crime site, reconstructions and so forth into an information system to assist courts in legal actions.

The concept of an architecture and an organisation

Any 'black box' may be described in terms of its function, organisation and implementation. Function refers to input-output behaviour which is the potential user's perception of the system. In computing, this description is increasingly called an architecture, although for databases it is usually called a scheme (Korth & Silbershatz, 1991). Two of its key components are a set of abstract data elements and a query language to access those elements.

In general terms, the organisation of a system is a mathematical model which logically describes how outputs are generated from inputs through some sequence of primitive actions. As a logical description it may not describe how the system is actually constructed, hence the need for an implementation description. For a database, the organisation defines the semantics of the query language and gives a more detailed description of the abstract data elements of the architecture and how they are manipulated.

An image is a data structure typically around one Mbyte in size. A single one hour video sequence is the equivalent of around 100,000 such images. Consequently, a multimedia database can be expected to be some terabytes in size; the equivalent of more than 1000 common hard disk drives. That suggests a distributed implementation which raises various practical problems and organisational design issues. This extreme size also makes data compression almost mandatory and that greatly complicates storage and access issues. Compression is a process of reducing data redundancy. Apart from the usual forms of coded compression such as MPEG, multimedia databases are likely to utilise the following:

Geometrical compression. For example, in video sequences it is possible to omit frames and interpolate to regenerate them. Morphing, another example, involves one data set being replaced by parameters for a given non-linear operation which is applied to another, smaller data set.
Knowledge compression. For example, it is possible to derive rules which describe ageing effects in human faces. If images of a person at different ages are required, it is only necessary to apply those rules to the one image.
Model based compression. The data set is replaced by a mathematical model plus some initialising data from which the data set may be generated. While extremely effective with an exceptional compression ability, creating models presents many challenges.

Navigation also presents numerous organisational issues. In its most general sense, database navigation can be created as one of.

a filtering process in which where the desired data may exist is progressively reduced from all elements in the database to just some simple set;
a process in which data from one element determines the path to the next.

The former is basically the mode of existing databases, especially relational, and is highly developed. The latter is not as well researched and has no preferred approaches. It is, though, a mode which seems well suited to object oriented databases and these in turn seem well suited to multimedia information systems.

Some architectural issues

The design of a multimedia database will be strongly influenced by the use of a multimedia information system. Two points on this issue:

In many multimedia applications, a user will be uncertain about what they identity as information. Consequently, an important interactive activity will be browsing; an inspection by the user of an abstraction of the data, and often of the class data. Browsing will be so important in multimedia information systems as to be considered almost another form of data manipulation.
User transactions in multimedia information systems will be lengthy, in part because user perceptions of what is information are very likely to vary over the period of the interaction. This argues for more sophisticated and better designed user interfaces than are commonly employed in current information systems.

Turning to the multimedia databases directly, three general comments can be made:

Most currently used databases are homogeneous. There is one data structure class - usually relational - and the data itself is one medium - usually alphanumeric. By definition, multimedia databases require several classes of data structure and the data can represent several media. They are therefore heterogeneous.
Database retrieval returns information, not data. Existing databases have an almost one to one mapping between data and the information of interest. Indeed, the insertion process itself often defines the key information relationships between the different data items. Such a database may be termed organised. Consider an image. Here, the data is relatively small but the information is enormous. Categorising it, though, means identifying all possible interpretations of the image in all possible contexts; a monumental task. This suggests a more general situation is one where data is simply inserted and the relationships which define information are decided by the database management mechanisms during a session. Such a database can be termed self organising.

Now:

A complete database is usually defined as one where every location of every data structure is filled. This assumes a close relationship between data and information. From purely an information perspective, it seems more appropriate to describe completeness as when a legitimately framed query always results in a response. An incomplete database is therefore one where there is no such guarantee and this describes self organising systems. Completeness is very difficult to prove when entities have a wide variety of information modes and some of those modes are very information rich.
In order to eliminate redundancy and inconsistency, the logical structure of existing databases allows each insertion of data to be checked against what is already there. Since information is not as well defined in a self organising database, it cannot provide such checks and so these must be expected to occur.
A given query in a current database always generates the same response. If a database has redundant data, however, then there can be similar responses to a query. That is, the response achieved to a given query may be dependent on how retrieval is implemented and the history of recent retrievals. For that reason, self organising databases can be termed noisy.
Retrieval is framed in terms of a query language that in turn is based on a logic. Existing databases use a monatonic logic; a logically complete or consistent logic without contradictions. Thus a response is always correct within the axioms of the logic used as correctness is purely a function of those axioms. For self organising databases, though, there is no clear concept of information, thus a perception about information held at one instance may not necessarily be held for all time. This means the query language must be based on a non-monotonic logic. That is, a logic which is not consistent and where a conclusion may well change in the light of newly inserted information.
Non-monotonic logics may be described as working with generalities and constructing logical arguments from them in spite of possible exceptions. For example, making deductions on the assumption all men like sport. Monotonic logics have importance in describing human communication and knowledge, but unfortunately their development is limited (Weigart & Tsai, 1994).

The focus of any general discussion on the architecture of a multimedia database needs to be on the abstract data model of that database and the query language. The data model is simply the classes of the data structures within the database and their properties. A key concern in selecting a model is its ability to accurately organise the information of interest within its data structures. Particularly for multimedia databases, this issue of representation is complex. A new discipline termed knowledge engineering, part of artificial intelligence, is addressing the problem. That work suggests the most appropriate models are object models or entity relationship ( ER ) models. Object models are the abstract foundation of object oriented systems and seem well suited to multimedia applications. Further, the paradigm flows easily into the organisation. ER models are quite general and seem well suited to situations where concepts of information are ill defined. The considerable interest in ER models in recent years has produced a significant research literature. Gyssens et al (1994), Tsuda et al.(1991), and Staube & Ozsu (1995) are examples of current thinking on the representation problem.

The retrieval function tends to figure prominently in query languages. Its design is remarkably complex given the many issues which must be addressed. How is a query to be framed when the information to be accessed is in multimedia form? ( Yoshitaka et al, 1994 ) Should it be in one medium only such as text, or should it consist of multiple media? If the latter, then how to frame the query in some consistent manner? Should the search be such that components of the query can only he used to examine like components in the database? For example, audio may only be used to search audio. If, however, a query is seen as general, then how to combine the different elements of the query given there may be inconsistency between them, what constraints should exist on seeking a combination of media and how to make comparisons?

Another non-trivial question is what is the outcome of a retrieval? Consider the operation of a database in abstract form. As an architecture. a database is simply a collection or set of entities where each is a structure of some form built around atomic elements. Therefore, retrieval may be viewed as a navigation process where at each point a two stage process of:

extracting entities;
applying some general masking operation based on components drawn from the query itself to extract particular atomic elements or sub-structure of interest is applied.

For a homogeneous database:

the information representation never changes ( and is often essentially the same as the data structure);
the information structure is usually simple, commonly a relation or ordered set, thus masking is essentially a selection process.

Contrast this with heterogeneous databases:

If information is uncertain, then what forms the structure and atomic elements will be context dependent. To illustrate, on some retrievals a complete image can be considered atomic, but on others it may be features within that same image. Equally, in the first case the data structure is a single atomic element while in the second it is likely to be a complex feature graph. Thus there is the question of defining what is meant by atomic and structure in any retrieval operation.

There may be inconsistency. How are multiple responses due to this to be resolved? There may also be redundancies where a query can be satisfied across several different media. Then should all components satisfying the query be returned and if so in what form?

The implication is that only part of a multimedia entity accessed is retrieved and that requires the query to include constraints which define that part.

If the information in a self organising database is unknown, then neither can effective retrieval mechanisms be known. That suggests learning might need to be an integral part of such a database and so that in some sense it becomes a deductive database. A deductive database has data, plus facts and rules. The facts and rules not only guide retrieval, but assist in learning new facts and rules that may be used to improve future performance. An apt analogy suggested in the recent literature is that such retrieval mechanisms, especially with regard to actions on very information rich sources such as images, may be best viewed as information mining mechanisms.

A taxonomy for multimedia databases

An abstract taxonomy for multimedia databases needs to be based on retrieval mechanisms where these in turn are based on the information content of entities. Given that the entities in a database are actually structures of atomic elements, then that suggests in the Most general terms and following Berra et al (1993) retrieval can be classified according to whether that accessing action is dependent on:

atomic elements alone;
structure alone;
structure and atomic elements;
the semantics of this representation.

More explicitly, this suggests a taxonomy as follows:

Level 0 Retrieval System. Retrieval is based about accessing a particular atomic element termed the identifier or index. Whether the structure is retrieved or not is a function of some test on the index. This method of retrieval effectively demands homogeneity.
Level 1 Retrieval System. Retrieval is based on some particular atomic element or elements being present. More particularly, that they have some given attribute. This implies knowing the meaning of that data; for example, that it is a time stamp. Equally, it may mean knowing that this data can derive some particular value. For example, it may represent a phoneme.
Level 2 Retrieval System. Retrieval is on the basis of some sub-structure being present. This effectively represents retrieval by pattern recognition. In effect, the query is framed in terms of a sketch and the search is on similarity to that.
Level 3 Retrieval System. Retrieval is not on any particular attribute of the data structure directly, but on its meaning. This is semantic retrieval. It is complex as it involves an interpretation of the information representation and an understanding of its significance.

While these levels describe an increasing level of complexity in the search process, the last is a very significant increase indeed. Three problems associated with it can be highlighted:

Any data will undoubtedly lead to multiple interpretations due, for example, to the occlusion of objects in scenes. This ambiguity requires guidelines to be included in retrieval for preferred or most likely interpretations and that requires additional information not formally part of the database proper.

There is the question of interpreting the meaning of a query. This has two aspects, an interpretation of the complete query and of the individual query terms, which are best illustrated by examples:

If a query is to search for an image of a house with particular features photographed on a springtime afternoon, then apart from requiring an understanding of what is meant by springtime and afternoon and how these should be expressed in specific query parameters, this also requires an understanding of what differences could exist between images because of this specification. A query may be to find an image where a red brick building abuts to a blue painted building. Are situations of back to back, side by side on the same street or physically touching all considered to meet this requirement?

Each of these requires additional interpretation and knowledge, and probably some interpretation of past retrievals to gauge the general desires of the user. That is to say, it requires some context in which to interpret user requests. Framing that is a difficult problem. Semantic analysis fall very strongly within the domain of artificial intelligence and while it has progressed significantly over the last decade or so it still has difficulty answering questions such as these with authority.

This taxonomy relates to architecture. With respect to organisations:

level 0 systems are well understood as they are in common use;
level 1 systems are a relatively simple extension of level 0;
similarity retrieval is computationally expensive, thus the question of efficiency looms as a significant problem in the creation of level 2 systems;
semantic retrieval is several orders of magnitude more computationally expensive than similarity retrieval, hence the question of efficiency alone acts as a major limitation on the adoption of level 3 systems for the foreseeable future.

While level 0 and level 1 may be easy to create, they lack the flexibility needed for many important multimedia applications (IEEE Computer Special Issue, 1995). The task, therefore, is to determine how to make level 2 and 3 databases practical.

Retrieval by similarity

Similarity measures can be broadly divided into comparing a reference against derived:

features or feature sets;
structural relationships.

The first is very similar to level 1 retrieval. However, in multimedia the number of features can be extremely large, possibly even infinite, and so the issue becomes one of whether the reference and image features are sufficiently close.

Features can be interpreted quite broadly. For images, for example, they can be interpreted as any of the following:

Derived Features. Derived features include computed quantities such as areas, moments, perimeter lengths, transform values, histograms, textures or ratios as well as morphological quantities or other quantities such as statistics or motion indicators. There is a long history of successful use of these in pattern recognition tasks.
Templates. A template is usually an image, part of an image such as a region or object, or possibly even a video sequence. A problem is that the similarity is physical, not content based. Templates tend to fail because of the variation in image parameters and the difficulty in compensating for that.
Sketches. A sketch is a derived image of some kind. It may be an edge image, or literally a handsketch or it may be an existing image to which extra elements - such as clouds, graphics or whatever - are added. As the end result is still an image, the performance of these similarity measures, with the slight exception of edge images, is not that much better than a template.

Derived features are the most widely used as they are easily determined albeit at some high computational cost. Being the outcome of a mathematical operation and not an interpretation, though, means they often relate poorly to content. It is a reflection of the degree of difficulty in deriving content that feature sets are still considered.

Structural similarity requires expressing the data model entity in the form of a connected graph, tree, pyramid or some similar form showing how the elements within that entity link to one another. Similarity measures based on structure tend to perform quite well. This is to be expected as they more directly relate to content. The problem is actually forming them, although in the case of images deriving graphs can be relatively easy. The most successful approach at present is to apply templates of likely image objects and correlate then or pass them through a recognition structure such as a neural network to identify them.

The strongest interest in similarity measures is with respect to graphs. The problems of recognising human faces illustrates why ( Pentland et al, 1993 ). As the significant difference between faces relate to the lips, eyes, nose, ears and chin and the distances between them, a mathematical measure is needed to include those. If that measure proves inadequate, then it is necessary to turn to secondary factors such as hair. The formulation of a similarity measure is quite complex, but a graph which links the components with the key comparisons first can greatly simplify the process while still offering a sophisticated analysis.

Semantic retrieval

Understanding information in any form is difficult. A fruitful way of examining this problem is to assume information is expressed in some abstract language. That is to say, it consists of tokens of some form organised into sentences. This may actually involve a hierarchy. In speech, for example, words are formed from phonemes according to some given rules. To understand this information requires several steps. The first is to interpret the tokens themselves. This may be taken as some general form of dictionary reference. Next is the syntactic problem; the problem of determining the relationship between the tokens and the significance of these. Viewed generally, it is the problem of using the grammar of the language to parse the sentence. Finally, there is the semantic problem; assigning a meaning to the sentence. Semantics are necessary as a sentence can he grammatically correct, yet have no meaning. A more difficult problem is that in most forms of information there are ambiguities and so there can be several meanings. That meaning may be resolved by looking at past information, but often it is only resolved by referring to some context; a body of knowledge assumed in the communication but never stated. To illustrate, there is no need in speech to outline what will happen at a birthday party because that is part of common experience.

Understanding information is vital to multimedia database management in order to be able to translate actions between the different media. It seems an immensely complex problem. It certainly may be if the general case is tackled. However, limited problems can be tackled and quite successfully as some computer games, inventory ordering systems and other examples show. This suggests building separate semantic modules, each directed at a particular task. This is in part the underlying principle of intelligent agents suggested for graphical user interfaces, network searching and similar tasks.

Creating an information understanding system involves capturing the knowledge needed to perform the semantic analysis and then forming that knowledge into a structure for use within a computing system. Knowledge capture is difficult, but far from impossible. However, representing it can present problems. There are two common approaches:

a declarative representation;
a procedural representation.

Declarative captures facts and assertions and relies on statements in a logic or some relational expression such as graph or semantic network. Procedural information captures actions or consequences and is usually based on a formal grammar. As the former is ideal for dictionary structures and the latter for semantic analysis, usually both are needed and used.

Semantic retrieval is complex, but with some determination useful structures can be created. The real problem is that the time taken to interpret information, especially complex information such as images, involves considerable computation. Current computing structures are poorly organised towards such problems. Ideally, symbolic computers should be used, but few are available and there is much work to be done before they can be remotely considered mainstream.

Research problems in multimedia databases

Multimedia databases are a highly complex technology. Their design still falls within the research domain and it will be some years before they offer the robustness and features expected of commercial systems. This is illustrated by some of the key problems which still need to be addressed:

A transaction model is needed for multimedia information systems. This would greatly assist in the design of graphical user interfaces and disclose the potential of sophisticated visualisation methods such as virtual reality.
The question of how to most effectively represent multimedia information in such a way as to achieve the desired functions of the database is still quite open.
A critical question is how to compress such that the process of representation is not hindered. That is to say, such that operations of similarity or semantic retrieval are not impaired.
More methods of achieving efficient search need to be investigated. This is particularly critical given the size of multimedia databases and their likely distributed nature.
Similarity involves comparing a structure of features against a template. This raises three issues; how to design the template, how to derive the features and how to make the comparison.
The first of these is particularly complex and at present seems very application dependent. There needs to be a study of various application areas, therefore, to ascertain what might be desired and how queries should be framed.
The final problem is quite complex. It would seem to be application dependent, but that needs to be tested. So far, most research work has been with respect to human face recognition, but this problem has some peculiarities which simplify the task. Namely, that the spacing between the eyes, the distance between the line of the eyes and the bottom of the nose and the position of the lips, and similar measurements do not change. Seeking similarity when the problem is, say, comparing flowers is likely to be far more testing.
The overall problem of semantic retrieval is complex. The theory needed for semantic retrieval, at least in restricted form, seems sufficiently well developed for application. Thus the need is to investigate a broader range of specific problems to more clearly identify the issues in extending semantic retrieval to a more general situation. For example, the heuristics needed to remove ambiguity in image analysis.

References

Berra, P. B., Golshani, E, Mehrotra, R.& Sheng, O. R. L. (1993). Guest Editor's Introduction: Multimedia Information Systems. IEEE Transactions on Knowledge and Data Engineering, 5(4), 545-550.

Gyssens, M., Pareadaens, L, Van Den Bussche, J. & Van Gucht, D.(1994). A graph-oriented object database mode. IEEE Transactions on Knowledge and Data Engineering, 6(4), 572-586.

IEEE Computer, Special Issue ( 1995). Finding the Right Image: Content-Based Image Retrieval Systems. IEEE Computer Society, 28(9).

Korth, H. F.& Silbershatz, A. (1991). Database System Concepts. New York, McGraw Hill.

Pentland, A., Ricand, R. W. & Sclaroff, S. (1993). Photobook: Tools for content-based manipulation of image databases. MIT Media Laboratory Perceptual Computing Technical Report 285.

Staube, D. D.& Ozsu, M. T. ( 1995 ). Query optimisation and execution plan generation in object oriented data management systems. IEEE Transactions on Knowledge and Data Engineering, 7(2), 210-227.

Tsuda, K., Yamamoto, K., Hirakawa, M., Tanaka, M. & Ichikawa, M. (1991). MORE: An object oriented data model with a facility for changing object structures. IEEE Transactions on Knowledge and Data Engineering, 3(4), 444-465.

Weigart, T. J. & Tsai, J. ( 1994 ). A computationally tractable non-monotonic logic. IEEE Transactions on Knowledge and Data Engineering, 6(1), 57-63.

Author: Douglas G Myers
School of Electrical and Computer Engineering
Curtin University of Technology
GPO Box U1987
Perth, Western Australia 6001
Please cite as: Myers, D. G. (1996). The perspective for content addressable multimedia databases. In C. McBeath and R. Atkinson (Eds), The Learning Superhighway: New world? New worries? Proceedings of the Third International Interactive Multimedia Symposium, 280-286. Perth, Western Australia, 21-25 January. Promaco Conventions. http://www.aset.org.au/confs/iims/1996/lp/myers1.html

[ IIMS 96 contents ] [ IIMS Main ] [ ASET home ]
This URL: http://www.aset.org.au/confs/iims/1996/lp/myers1.html
© 1996 Promaco Conventions. Reproduced by permission. Last revision: 15 Jan 2004. Editor: Roger Atkinson
Previous URL 17 Feb 2001 to 30 Sep 2002: http://cleo.murdoch.edu.au/gen/aset/confs/iims/96/lp/myers1.html