Chapter 2. Database Design Essentials

A relational database is essentially just a set of tables of data that are related to each other. These tables are useful because we can use a high-level language, such as SQL, to add, change, or delete the tables or the data in the tables. We can query the database and obtain the results of selecting data based on simple or complex relations. We can perform set operations, such as joins, on the data. These capabilities let us store, manage, and make use of even very large amounts of data. 

Tables can hold data of different types, allowing us to store, primarily, numbers, dates, and text. We can perform calculations, sort, or change these types of data. (There are also types used to store arbitrary data—which could be images or sounds, for example—but they are just "stuff" to the database and we cannot perform operations on these types apart from storing it and retrieving it. These cannot be manipulated using SQL and won't be considered until Chapter 9, "Advance JDBC Features.") The Oracle SQL datatypes correspond only roughly to Java types and mapping from one to the other requires some care.

Designing a database is largely a matter of determining the types of data in each table and determining the relationships between each of the tables in the database. Exploring a little of the theory behind relational database design will prevent problems such as redundant data or a database that is difficult to use.

Tables, Columns, and Rows

The table is the fundamental organizing unit in a database. There are several terminologies used, but generally we describe a table as having columns and rows. Let's take as an example a table designed to hold information about the CDs in our collection. There are a number of different pieces of information we would like to keep about each CD, such as artist, title, and release date. We call each of these a column, and we define them when we create the table.

As we start to enter data into our table, each item that we enter—in our example, the information for each CD—is called a row in the table.

Sometimes the columns in a table are called fields, and sometimes the rows are called records. In other words, for each CD, we will have a record in the table, and each record will have fields identifying the artist, the album title, etc. In general, we'll stick to the row and column terminology.

One of the requirements of a relational database is that it must have a high-level relational language. In practice, this language is SQL (Structured Query Language). The main purpose of SQL is to define and manipulate data through statements that apply to tables, rows, and sets of rows. Each SQL statement is independent of all others, and each one is executed, effectively, all at once. There is no provision in standard SQL (at least, as it is now most commonly implemented) for step-by-step, conditional processing. If we need to do that, we need to use a host procedural language, such as Java, that allows us to execute SQL statements sequentially, making decisions as we go, based on conditional expressions in constructs such as if statements and while loops.

Because SQL was intended to be hosted by another language, it has datatypes that are intended to be mapped to the datatypes of typical procedural languages. Perhaps because of this intention to be so generally useful, these mappings are not particularly ideal for any language. To complicate things further, different database vendors support the SQL standard types in different ways. Oracle, specifically, has a small set of native types that map to a larger set of standard SQL types. As a matter of principle, we should try to do things in as standard a way as possible, but in this case, we'll use the Oracle datatypes, rather than add an additional layer of complexity by struggling to maintain the illusion that we are using the standard SQL types.

Oracle Datatypes
To store data in a database, an appropriate set of tables must first be created, with columns of the appropriate types defined. These are the basic Oracle datatypes we have to choose from:


There are a few other types, particularly BLOB (Binary Large Objects) and CLOB (Character Large Objects), which are used to store arbitrary data. These types are opaque to the database—meaning that the database cannot access and perform operations on the contents. BLOBs and CLOBs require special processing that is difficult to perform using SQL alone.

String Types

There are two basic types for storing string data, CHAR and VARCHAR2. The difference between them is that CHAR has a fixed length up to a limit of 2,000 bytes, whereas VARCHAR2 has a variable length up to 4,000 bytes. A fixed amount of space is always used for a CHAR entry; if the entry is shorter than the reserved length, it is padded with blanks. This can lead to surprising results when doing string comparisons, for example. A VARCHAR2 entry, however, will take only as much space as necessary. Except perhaps for small, fixed-length fields, such as identifiers, it's generally preferable to use VARCHAR2.

Strings in SQL statements are delimited with single quotes. If we wish to include a single quote as part of a string, we need to escape it—that is, we need to precede it with a special character that indicates that it is to be treated as a literal single quote and not as a delimiter. To escape a single quote, we use another single quote. This is something like the backslash in Java: The backslash is the escape character, so in order to use it as a literal backslash, we have to precede it by another backslash. For example, if we want to use the string "Let's go" in an SQL statement, we write:'Let''s go'

Numeric Types

The numeric types in Oracle are three variations on NUMBER. The default option, NUMBER, with no parameters, allows 38-digit precision, with a maximum scale of 127 (i.e., the decimal point can be moved right or left up to 127 places). The second option, NUMBER(precision), implies a scale of zero, so it is an integer with the specified precision. The third option, NUMBER(precision, scale), allows fixed-length (as specified by precision), floating point numbers, where the scale parameter specifies the minimum number of digits preceding the decimal point if positive, and the minimum number of digits following the decimal point if negative.

Because the default, NUMBER, is basically good enough to store any Java numeric type, the easiest thing to do is simply use NUMBER for everything. The drawback of not specifying realistic limits for precision and scale is that it can lead to inefficient storage of numbers and potentially poor performance in applications with large amounts of floating point data; in this case, it would be better to determine appropriate limits for precision and scale.

Although any Java numeric type can be stored in a NUMBER, the same is not true when we need to store a NUMBER in a Java type. We need to identify the type of data that the NUMBER is holding; this is usually obvious, and integer values are the most common. For NUMBERs known to be integer values, the safest choice in Java is the long type. For NUMBERs with a specified precision, we can choose int, if the precision is relatively small—9 or less; if the precision is larger, we need to use the long type.

For floating point values similar considerations apply. For NUMBERS known to be floating point values, the safest choice is the Java double type. For NUMBERs with a specified precision and scale, we can use the Java float type if the precision is 6 or less and the scale is not greater than 38—if either scale or precision are larger than these values, we need to use the double type.
Date Types

DATE is a bit misleading because it actually represents both date and time in Oracle SQL. There are two ways to consider a DATE, depending on whether we are using an interactive interface, such as SQL*Plus, or using it inside a host language, such as Java. In Java, the JDBC interface will perform the appropriate mapping to the Java Date class, so we can deal with it strictly in Java terms.

In SQL*Plus, we can generally use a string representation of the date, based on the default format (e.g., '12-July-02') or a format that we specify. We need to be aware that there is an underlying numerical representation because, depending on the format, when we perform a query we may see only part of the information that is actually stored. For example, a DATE field that stores the date and time July 12, 2001, 10:54:45 a.m. would display only the date portion by default: 12-JUL-02.

When we insert or change data in the database, we may inadvertently omit some of the information we intended. If we insert the value '12-JUL-2000', the time will be zero minutes, zero seconds after midnight on that date.

This behavior makes sense, but we need to be aware of it, because if we were to compare the values, they would not be equal, even though, by default, when we print them out they appear to be the same. If we wanted to know whether two values are the same date, without regard to the time, we need to truncate them before comparing them.

TIMESTAMP(precision), an extension of DATE, was introduced in Oracle 9i and is accurate to fractions of a second. DATE is accurate only to a second, which may not be sufficient for logging events or timing processes. The default precision of TIMESTAMP, 6 decimal places, means it is accurate to a microsecond. The maximum precision, 9, is accurate to a nanosecond. Note that this precision reflects only the ability of Oracle to store a value; using the operating system to obtain a time value, for example, is restricted to the accuracy to the resolution of the system clock, which is typically on the order of a millisecond.
Null

The value NULL is one of the most problematic features in relational databases, and we'll be revisiting it in different contexts as we expand our database experience. In short, it can be described as representing an unknown value. For example, when we enter information about our CDs into a database, we might not know the release date of a CD—in which case, we could enter NULL.

NULL can cause several problems for the unwary. For example, comparing NULL with anything, including NULL, is always false. (This may remind you of the Not a Number [NaN] value in Java.) When NULL is used in arithmetic operations, the value of NULL propagates through the results. We'll explore this topic in more depth when we learn more about queries in the next chapter

Designing a Database
Let's design a database for our CD collection. Assume that, until now, we've been keeping all the information about our CDs on index cards and that, for each CD, we've written down the artist's name, what country the artist is from, the title of the album, the release date, the label, and a list of the songs on the album. We want to convert this system of cards into an Oracle database.

Based on our index cards, we first design a table with the following columns, omitting for now the list of songs:
CD_COLLECTION
   TITLE
   ARTIST
   COUNTRY
   RELEASE_DATE
   LABEL 


Notice that, to identify each CD uniquely, we can use the combination of the two columns TITLE and ARTIST. This combination is called the primary key for the table.

A primary key isn't required by Oracle, and for our current purposes, we could probably get by without one, but a primary key is required by the definition of a relational database. A relational database requires that we must be able to address the intersection of any row and column uniquely.
We'll assume that a database is available for you to use in order to follow along with the examples. If you've installed Oracle and didn't choose to have a database created at installation time, you can use the Oracle Database Configuration Assistant to create one now.

In general, you shouldn't be logged in as administrator except when you need to do database administration work. If you don't have a regular user account already, you should create one and log in as that user. To create a user account, using SQL*Plus, connect to the database as administrator and type the following command at the sqlplus prompt:

CREATE USER username IDENTIFIED BY password
Next, grant the rights to connect and create and use objects:

GRANT CONNECT, RESOURCE TO username 

Now you can connect to the database as the user you created:

CONNECT username/password 
If you are using a database created by someone else, you should get the administrator of that database to create a user account for you, perhaps in your own tablespace. You should ensure that you have the sufficient rights to create and drop tables.

In this chapter, we will be using SQL*Plus to create tables and enter information into the database using SQL. At this stage, apart from noticing that CREATE TABLE is used to create a table, INSERT is used to put a row of data into it, and SELECT is used to list the contents, you shouldn't worry about the details of SQL.

To create the table, we'll type the following at the sqlplus prompt:

CREATE TABLE CD_COLLECTION
 ( ALBUM_TITLE VARCHAR2(100),
 ARTIST VARCHAR2(100),
 COUNTRY VARCHAR2(25), 
 RELEASE_DATE DATE, 
 LABEL VARCHAR2(25), 
 PRIMARY KEY (ALBUM_TITLE, ARTIST) );

The reason that we've omitted the songs from this table is that relational databases don't have a list or array type, so there is no natural way to store them in this table. A table in a relational database should store only items that have a one-to-one relationship with the other items in the tables.

If it were just a matter of storing the titles of the songs, it would be easy to fake it by having a series of columns SONG_1, SONG_2, SONG_3, etc. But this quickly gets unmanageable if we later decide to store more information about each song, such as length and composer. Also, there is no absolute way to determine the maximum number of songs we should allow per CD. Instead, we'll need to create a separate table for the songs on the CDs.

Let's enter a couple of CDs into this table with the following SQL statements:

 INSERT INTO CD_COLLECTION VALUES('Black Sheets of Rain', 'Bob Mould', 'USA','1-JAN-1992','Virgin'); INSERT INTO CD_COLLECTION VALUES('Candy Apple Grey', 'Husker Du', 'USA','1-JAN-1986', 'Warner Brothers');

Representing the relationship between a whole, such as a CD, to its parts, such as the songs on the CD, is a fundamental part of designing a database and is termed a one-to-many relationship. The process of properly organizing data into tables—what the tables are, what the keys are, and what columns are in each table—is called normalization. We took the first step toward normalizing our database when we decided not to include songs in the CD_COLLECTION table
.
Normalization

Normalization is not a simple, all-or-nothing process—it's a step-by-step process. As we normalize, we apply a specific new rule at each step to our database. Each of these steps is called a normal form. We'll look at the first three: the first, second, and third normal forms. Each is stricter than the next, and its requirements are in addition to those of the previous forms. Thus, a database in third normal form, by definition, meets the requirements of second and first normal forms, as well.


First Normal Form
The first normal form formalizes the principle that we've already stated about the songs on our CDs; we shouldn't have aggregate or repeating types, such as lists or arrays, in a table. Where there is a one-to-many relationship, each type of data should be in its own table—the wholes in one and the parts in another.

In the case of our CD collection, this means that the songs should be in a separate table of their own. We'll have a table that lists each of the songs (many) on each of the tracks (many) on each (one) of our CDs.



We'll also include the length of the song and the composer, using the following columns:
SONGS
   SONG_TITLE 
   COMPOSER
   LENGTH 
   TRACK 
   ARTIST 
   TITLE


In this table, the primary key consists of three columns, TRACK, ARTIST, and TITLE, all of which are sufficient together to identify each song on each CD uniquely. To tie a set of songs in this table with a CD in the CD_COLLECTION table, we need to include the primary key (which identifies a unique CD) from the CD_COLLECTION table; here, in the SONGS table, this key is called a foreign key. In addition to including columns for ARTIST and ALBUM_TITLE, we explicitly identify the foreign key, in addition to the primary key, when we create the table:

CREATE TABLE SONGS ( SONG_TITLE VARCHAR2(100), COMPOSER VARCHAR2(100), LENGTH NUMBER, TRACK NUMBER, ARTIST VARCHAR2(100), ALBUM_TITLE VARCHAR2(100), FOREIGN KEY (ARTIST, ALBUM_TITLE) REFERENCES CD_COLLECTION(ARTIST, ALBUM_TITLE), PRIMARY KEY (SONG_TITLE, ARTIST, ALBUM_TITLE) );


Associating the ARTIST and TITLE columns in this table with the ARTIST and TITLE columns of the CD_COLLECTION table by specifying with a foreign key forms a referential constraint. The database will enforce this constraint to ensure that any songs we enter into this table are associated with a CD that we've entered in the CD_COLLECTION table. This means that we must first enter CD information before we enter song information.

Let's insert a couple of songs for each of our CDs:
INSERT INTO SONGS VALUES('Black Sheets of Rain','Mould', NULL,1, 'Bob Mould', 'Black Sheets of Rain); INSERT INTO SONGS VALUES('Crystal','Mould', 3.28, 1, 'Husker Du', 'Candy Apple Grey'); INSERT INTO SONGS VALUES('Don''t want to know if you are lonely ','Hart', 3.28, 2, 'Husker Du', 'Candy Apple Grey'); INSERT INTO SONGS VALUES('I don''t know for sure','Mould', 3.28, 3, 'Husker Du', 'Candy Apple Grey');

As we'll see later, SQL will allow us to query the two tables in ways that will put the information back together like it was on our index card. But because we have not thrown information away about the individual songs by lumping them together with the CD-as-a-whole information, we can also do new things that would've been hard to do with index cards, such as determining how many CDs contain a given song or finding all songs by a given composer.

Second Normal Form
The second normal form says that a column in a table should not depend on just part of the key. This is actually a problem in our first table with the column COUNTRY. Assume that our CD collection has CDs from around the world. Our primary key for this table is ARTIST and TITLE, but COUNTRY depends only on ARTIST. To straighten this out, we'll need to remove the COUNTRY column from our CD_COLLECTION table and add a new table, an ARTISTS table that lists information, including at least COUNTRY, about each of the artists represented in our CD collection:

ARTISTS 
    ARTIST 
    COUNTRY

We can probably add more information to this later, such as date of birth, Web site URL, etc. This is already starting to show one of the typical effects of normalization—lots of smaller tables.
Third Normal Form

The third normal form states that a column in the table must not depend on any other column of the table. To put it another way, all columns in the table must depend directly on the whole key and nothing but the key. There is no example of this in our database now, but it's not hard to see how a situation could arise if we extend our example a bit. Suppose that we wanted to use our database for a music store and that one of the pieces of information we needed to store was the list price. Our first thought might be to add a column LIST_PRICE to the CD_COLLECTION tables:
CD_COLLECTION
   ALBUM_TITLE 
   ARTIST 
   COUNTRY
   RELEASE_DATE 
   LABEL 
   LIST_PRICE

Now suppose that each label sets the list price, the same price, for all of its CDs. Because LIST_PRICE does not vary for each individual CD but instead depends on the non-key column LABEL, the third normal form suggests that we need to remove the LIST_PRICE column and add instead a new table, LIST_PRICES, which lists the price by label:
LIST_PRICES  
  LABEL 
   LIST_PRICE

This has clear advantages. When a recording label changes its list price, there's only one value to change. Further, consistency is also ensured because it eliminates the possibility of entering wrong values on individual CDs.

This is as far as we'll take normalization here. These first three forms cover most of the issues that we need to deal with common database designs. Finding a way to resolve them is usually as straightforward as we've seen. It's usually much harder to fix problems with databases involving fourth and fifth normal forms, where the solutions are more varied, depending largely on the specifics of each situation.

If your application attempts to solve a large and complex data-modeling problem, you may wish to consult with someone who has more experience or refer to a book dealing more specifically with data modeling.

Denormalization

As we saw above, there is at least one drawback to normalizing a database: You get more tables, some of which can be quite small. This will make your queries more complex and may require more programming, because the new tables may, for example, require new user interfaces for data input.

In our example, if the only information that we are keeping specific to each artist is COUNTRY, we may decide that the programming necessary to maintain a separate table is not worth the effort. The major drawback is that, unless we restrict what we allow as a valid entry in some other way, we may end up with different, inconsistent values for different CDs by the same artist. Maybe that's acceptable.

The best way to determine whether you've normalized your data enough (or perhaps gone too far) is to test your data model before investing significant development time in developing an application that uses it. Using SQL*Plus manually or, preferably, using scripts, create the database, add some data, and try out typical queries.

In later chapters, we will occasionally revisit this CD collection example, and as we do so, we'll find that we'll need to make changes as we take more considerations into account. This evolution of a database is a common part of the development process and needs to be taken into account when planning a project.

No comments:

Post a Comment