Missing Information Exists!


Missing Information

Dealing with missing information has been a problem since the beginnings of database management. Missing information refers to data values that are not recorded in the database along with associated data. Here we mean missing in its most basic sense - not present. The reason that data is missing includes a myraid of circumstances - the data was not available, inappropriate for the context, lost, refused, etc.

Let us start with a concrete example. In this example, we are storing data about employees. The data about each employee contains the usual information - name, employee #, social security #, pay rate, date of employment and birthdate.

For a given employee x, all information was available, except date of birth. The birthdate is missing information for employee x. The fact that the date of birth is missing might be filled in on a hard-copy employment form by various ad-hoc methods - 1) leaving the entry blank, 2) drawing a line through the entry, 3) entering the ubiquitous N/A meaning Not Available or Not Applicable.

Applications utilizing early database systems applied similar techniques. Fields with missing values were set to arbitrary ones - blank, zero, etc.. The database systems themselves had no way of representing missing information or even being aware of the existence of missing information. There was no systematic method of dealing with values that were missing, not present. It was incumbent on the application to deal with all aspects of missing data values.

The advent of relational database provided solutiions for processing of missing information. In the relational database model, missing information is represented by a meta-value called a Null. See How Relational solves the Missing Information Problem.

Unknown Answers

Missing information not only causes problems in representation but also affects operations on the database. Operations on a database, such as queries, often involve conditional expressions that are like questions about data values in the database. An example would be a query that asks - For each employee, is birthdate after 12/31/59? The answer to such a question on an employee in normally Yes or No (True or False), but this is not an appropriate response when an employee's date of birth is missing.

In systems where missing data values are represented by ad-hoc, default values, responses to questions about the data values are at best confusing, at worst simply incorrect. Such a system might represent a missing birthdate with a zero value. If Jerry is an employee whose date of birth is missing and the field is set to zero, then no is the answer to the question - Is Jerry's birthdate after 12/31/59?

To someone making the query and knowing that Jerry just turned 21, this would be obviously incorrect. The answer to the question when data is missing is not yes or no; it should be - I don't know. In other words, the answer to the question is unknown. Answers to queries involving potentially missing information must include yes, no and unknown.

The application had the responsibility to make the necessary tests for missing information in earlier systems. Application support was uneven and often incorrect. The problem is that neither the database system nor the application had ways to deal with answers that are not true or false but unknown.

Relational database solves the problem by providing what is known as 3 Valued Logic. In 3 Valued Logic, answers can be true, false or unknown. This is described in How Relational solves the Missing Information Problem.

Return to Contents Page: Should Nulls be considered harmful?     Return to Issues Page


Copyright © 1998 FFE Software, Inc. All Rights Reserved WorldWide