A Response to Jim Gray's "A Call to Arms"

by Lee Fesperman
FFE Software, Inc.

This article is also published (with modifications) on Database Debunkings under the title, On Jim Gray's "Call to Arms"

In April, 2005, Jim Gray with Mark Compton published "A Call to Arms" in ACM Databases, Vol. 3, No. 3, yet another Manifesto on the direction of database technology. Gray's basic premise is that traditional relational database is sagging under the onslaught of modern computing challenges. Naturally, what he really is indicting are SQL databases epitomized by the major vendors -- Oracle, Microsoft (SQL Server) and IBM (DB2). He appeals to the prejudices of many in the information technology field but provides little hard evidence. He uses simplistic anecdotes and distorts history and current reality to support his premise. Gray gathers a plethora of modern practices and intimates that all should be integrated into database systems.

This response examines a number of his points beginning with his indictment of relational technology and attempts to separate reality from opinion. This is a daunting task since Gray's article tries to encompass most aspects of current computing techniques. Rather than cover all his proposals, this article will concentrate on several salient items while touching on a few other topics.

Is Relational becoming Irrelevant?

Are classic relational database architectures (as represented by SQL DBMSs) really sagging at the knees? Or, are the major SQL vendors not keeping their eye on the ball?

The major SQL DBMS vendors pay little more than lip service to SQL Standards and, more importantly, virtually ignore the principles of the Relational Model. SQL itself is notorious for weaknesses in its relational fidelity, examples include poor primary key and domain support and the Exists bug. SQL92 did make some improvements in relational compliance over SQL89 and SQL86 (but didn't fix the above problems.) However, SQL99 and later standards made a sharp turn away from relational concepts in introducing features that are contrary to the relational model.

The major SQL vendors have numerous incompatibilities with the SQL Standard in their products. For example, both Oracle and SQL Server treat an empty string as a database null. The major open source SQL DBMS is even worse, almost defiant about non-conformance. Rather than "polishing the round ball" (in the words of Michael Stonebraker), the major vendors have concentrated on marketing, differentiation, non-database features and quirky optimizers requiring vendor-specific tuning that lock customers into their proprietary solutions. Improved relational compliance is the lowest priority.

In ComputerWorld in the middle '80s, Ted Codd published his 12 rules to measure relational fidelity. Vendors have mostly ignored these guidelines and made little movement towards complying with them. Only one or two smaller vendors have the goal of moving their implementations towards improved relational capabilities.

Yet full compliance or, at the least, better compliance with the relational model offers true solutions to many of the issues that Gray raises. He disregards this, instead treating the weakness of current implementations as evidence of the failings of the relational model. He offers XML and XQuery as alternatives, supposedly based on 'developer' preferences. This is a rash judgment, not supported by current experience.

XML and XQuery are insolubly flawed. XML is just another incarnation of the discredited Hierarchical Data Model. XQuery brings excessive complexity. A relational query language uses only method to access information -- by value. XQuery has 3 -- by value, by navigating the hierarchy and through external linkage. It is 3 times as complex.

XML is certainly appropriate for data transfer between systems, for streaming operations. XML is weak as a medium for persistent storage of data, except perhaps for property sets. When used for more complex data, data that must be accessed and manipulated in a variety of ways, the problems mentioned above in connection with XQuery become overwhelming.

Note also the XML exhibits a strong dichotomy between data and algorithms.

Dichotomy between Data and Procedures

Gray claims there is an artificial separation between data and procedures (or later in his article, algorithms) in 'traditional' relational databases. He asserts that this dichotomy began with the COBOL language and the COBOL DBTG (Database Task Group). He is distorting history and current reality. The Relational Model (RM) has nothing to do with the COBOL and actually predates DBTG. RM is not pure data; it defines powerful operators for manipulating data expressed in relational algebra and relational calculus languages. SQL is a type of relational calculus. It is primarily declarative rather than procedural.

Perhaps Gray is thinking of the fact that RM rejects procedural code for most database processing and requires declarative interfaces. There are very good reasons for this stance. Procedural code is convoluted and obtuse. It is difficult for a human to verify its correctness and even to understand its purpose. Machine verification of procedural code is still not possible and is not likely to occur in the near future.

Another major problem with procedural code is that optimization techniques yield limited gains in efficiency, even after years of research in this area. Functional, logic and declarative code allow much greater improvements. An optimizer for a declarative language can yield multiple levels of magnitude improvement. Similarly, declarative is much easier to verify as correct.

Note: the major Object-Oriented Languages -- Java, C#, ..., are primarily procedural in nature.

The above reasons illustrate why procedural code is generally a poor choice for binding with the database. Even so, most SQL DBMSs support stored procedures and triggers written in procedural languages. The actual impetus for adding stored procedures and triggers is weaknesses in SQL as a full declarative, relational language. This brings the question -- "Why not improve the relational capability of SQL or invent a better relational query language?". Gray ignores this choice, leaning of the assumption that we've polished the round ball enough. In general, he looks past the power of RM to Object-Oriented and XML query facilities for solutions, certainly a popular view. More on this direction later.

Even with an improved relational language, there is a place for stored procedures and triggers, as well as functions used in expressions. An Object-Oriented language is not a bad choice as the implementation language for these capabilities in database systems. The next section examines the reasons in more detail.

Embedding a procedural style language in the DBMS has a number of potential uses. Examples include -- logging, soundex, statistical and financial formulas, spatial and temporal calculations and other domain specific algorithms. While the DBMS could provide these functions natively, pluggable packages, including 3rd party ones, are a more flexible arrangement.

There are dangers to this type of embedding, however. Inappropriate procedures/functions could be resource intensive to the point of starving other actions in the DBMS, could open security holes, could crash the system and could engage in improper crosstalk that violates database principles. It is responsibility of the DBMS and system designers/administrators to guard against such circumstances.

Synergy between Objects and Database

Object-Orientation (OO) is the dominant programming paradigm today, and the Relational Model (RM) is the dominant one for database systems. Together, they power most applications in use. However, many consider this marriage to be in trouble. They describe the problem as an 'impedance mismatch' between OO applications and RM or more accurately, SQL database systems. They usually attribute this to the difference in the way the OO model represents information -- as a physically linked lattices of objects and the way RM represents information -- as logically linked tables.

One reason for this so-called mismatch is the fundamental distinction between the concerns of an individual application and those of a shared database system. An individual application works within a subset of the total problem space. The data it manipulates is a private working-set and is transient. A database system concerns itself with the persistence of shared data and with maintaining the security, integrity and accessibility of the data. The database system services a variety of applications, reporting capabilities, batch processing and ad-hoc access all using the same data. The upshot is that different applications may use different object representations of the same data internally.

Gray touts SQLJ (an OO style embedded SQL) as a 'nice' integration of SQL and Java. The reality is that SQLJ is not so nice and rarely used. Most OO applications either access the database API and use the values directly or construct their own objects, or they use Object/Relational (OR) wrappers like Hibernate for Java. All this occurs on the client, so the DBMS backend is not involved. Unfortunately, the special requirements of OR wrappers can sometimes drive the design of the database, to the detriment of general usability of the shared resource.

A second solution to combining objects and the database is an Object/Relational Database Management Systems (ORDBMS). The integration of OO and RM in such systems vary. Gary suggests a complete inversion -- records [sic] as vectors of objects (fields), tables as sequences of record objects and databases as collections of tables. This is a real problem because it dilutes/removes the power of RM, substituting OO which has no viable data model.

A more reasonable structure for an ORDBMS is to integrate at a lower level. The ORDBMS can use objects to provide complete support for domains, an area where SQL is lacking. Domains are extended, complex datatypes for column in addition to the basic datatypes - character, numeric, date, interval, etc.. The OO language provides the implementation for the extended column datatypes.

Stored procedures, triggers and expression functions would also use the same OO language for their implementation. This allows them to manipulate domain objects directly. Procedures and functions can receive them as arguments and return domain objects. In this way, an ORDBMS can provide improved capabilities completely within the Relational Model.

The modern OO languages, such as Java and C#, compile to an intermediate form which then is compiled to machine code at runtime. This allows full portability and greater efficiency. A runtime compiler can analyze usage patterns for improved optimization and can recompile portions dynamically as it gathers newer statistics.

These languages also offer greater security. The runtime environment employs a verifier for the intermediate code to protect against malicious code or inadvertent corruption. The compiled code runs in a 'sandbox' that prevents interference with other parts of the DBMS.

The FirstSQL/J ORDBMS is an example of a database system using this approach. FirstSQL/J uses Java as the OO implementation language for domains, stored procedures, triggers and expression functions. For more details on the FirstSQL/J implementation, see An ORDBMS that is Truly Relational.

Other Topics

This section briefly looks at other areas in Gray's article. Most, but not all, show promise in improving current database systems.

Self-Managing and Always-Up Database Systems

Like many of the concepts mentioned in Gray's article, self-managing and always-up database systems are not contrary to the Relational Model. RDBMSs already exist with these characteristics, proving again that Gray is overstating his case in claiming relational database architectures are slowly sagging to their knees. In fact, relational databases are uniquely able to deal with these situations because of their logical orientation. Additional information on this topic can be found in Highly Available Databases. Note that the major SQL DBMSs have limited capabilities in this area.

Improved Physical Structures

The Relational Model and SQL Standard define things at a logical level and are silent about physical details This allows implementations complete freedom in the physical structures used. However, current SQL DBMSs tend to use simple structures and access paths (primarily -- indexing, hashing and sorting). They use random access disks as the storage medium, converting data to octets (8 bit bytes) using serialization. Random access and serialization can be expensive, especially for objects.

There is much room for improvement here. In-memory storage is a promising technique supported by some newer RDBMSs. In-memory structures are direct access and can avoid conversion/serialization to achieve high-performance processing. This is especially important for systems with no or inefficient random access storage.

Active Databases and Publish/Subscribe Capabilities

An active database is a database system that keeps the client up-to-date about any changes to queried data as they occur. For example, the client sends a query to the database selecting and retrieving a set of data. As long as the client keeps the query open, it will receive notification of any change (Insert, Update, Delete) made to the backend database. A publish/subscribe facilities provides a similar capability -- retrieving an initial data set and then receiving notifications of changes to the set. Note: Relational implementations of active databases have existed since the '90s.

Federated Databases

A federated database system combines 2 or more distinct servers under a single interface. A number of implementations have been developed. A common example of their use is a single query that joins tables from different databases. The problem with federated databases is they are not efficient in the general case, because only limited optimization is possible.

If each participating database system is from the same vendor (homogeneous databases), they can communicate to achieve some reasonable optimization plan. However, this isn't feasible for heterogeneous databases. A standard optimization dialogue/protocol to support this is not possible because of divergent, proprietary implementations. An efficient federation of heterogeneous databases is a pipe dream.

Evolving Schemas

Gray is using an OO mindset when referring to schema evolution. Because OO lacks a coherent data model, constant restructuring of the object design becomes almost a necessity. This is also an essential component in the Extreme Programming (XP) paradigm.

Schema changes are much less common in a relational database. Relational theory and normalization techniques create flexible designs that can survive environment evolution. Should structural changes do become necessary, the normalized design and relational facilities like views can ease the transition and limit the effects of the change. Some applications may require no changes at all.