By Gavin JT Powell (BSc, OCP)
July 8th, 2002
A relational database relates subsets of a dataset to each other. How do we relate these subsets?† What do we mean by a subset and what is a dataset?
‚ A dataset is the equivalent of a database in Sybase or Ingres and a schema in Oracle. A dataset is a set of tables.
‚ A subset is a table. A table defines the structure and contains the row and column data for each subset.
‚ Tables are related to each other by linking them based on common items and values between two tables.
This paper assumes that the reader understands Normalization.
‚ Do you have performance problems? †Are your customers and users complaining about performance?† The root cause of 99% of database performance problems is poorly written SQL code. Quite often poorly designed SQL code can be the result of an inappropriate underlying table structure. The table structure may be over Normalized forcing SQL code to join many tables to find a single item of information.
‚ Do you have performance problems?† Are there obvious problems with the database, disk space usage or the network?† These things may cause performance degradation. Get your database and operating system administrators to search for bottlenecks. If you do not have in-house skills it might be cost-effective to hire a consultant who is an expert in their specific field. A highly skilled consultant could resolve bottlenecks in hours, which could take your current employees months of training and experimentation. In production environments experimentation can cause disasters and some serious down time.
‚ Further Normalization. Consider separating unused or infrequently accessed data from large tables into current and archived tables. Do this in such a way that frequently accessed data remains in the current table. This is obviously not Denormalization but it is not really Normalization either. When unused data is removed into a separate table there will be no direct relationship between the current and archived tables. If a relationship is required then the wrong data is being archived or do not archive at all.
‚ Do you have performance problems?† If you do not have performance problems then you should not be reading this document.
‚ Denormalization sometimes implies the undoing of some of the steps of Normalization.
‚ Denormalization is not necessarily the reverse of the steps of Normalization. Denormalization does not imply complete removal of specific Normal Form levels.
‚ Denormalization results in duplication. It is quite possible that table structure is much too granular or possibly even incompatible with structure imposed by applications. Data warehouse databases usually require copying in order to achieve Denormalization of an interactive application OLTP or client server database. The result is two databases. The OLTP database is highly Normalized and the data warehouse database is a Denormalized version of the OLTP database.
‚ Denormalization usually involves merging of multiple transactional tables or multiple static tables into single transactional or static tables respectively. Denormalization does not generally involve merging of transactional and static data tables. This implies that all Normal Forms but 2nd Normal Form can be Denormalized to assist with performance.
‚ It is possible to partially Denormalize by adding composite key elements to subset tables. This can often be required for databases utilized by object applications such as those written in Java. A puristís relational database structure will contain composite primary key columns in subset tables for all parent tables above in the hierarchy. Java applications typically add a single column unique identifier to each table as a primary key and exclude the composite columns from the parent tables. This table design is typical of top-down designed Java applications. This Java-Object based top-down design is not incorrect but can lead to extremely hard-hitting SQL join statements. I have seen SQL statements joining as many as fifteen tables. A query of this magnitude is impossible to tune. One possible answer is as already mentioned. Add composite key elements to subset tables when primary keys are unique identifiers (surrogate keys). Inclusion of these extra columns will generally reduce the number of tables in a SQL statement join.
Denormalization is not rocket-science. Denormalization can sometimes be viewed as being synonymous with Normalization except quite obviously, the opposite; both are common sense. There are a number of very simple steps to take initially when Denormalizing a set of tables.
‚ Look for one-to-one relationships. These may be unnecessary if the required removal of null values causes costly joins. Disk space is cheap. Complex SQL join statements can destroy performance. These one-to-one relationships are typical of 4th and 5th Normal Forms.
‚ Do you have many-to-many join resolution entities?† Are they all necessary?† Are they all used by applications? Many-to-many join resolution entities are typical of 3rd Normal Form. Quite often when a dataset is first designed there may be over usage of 3rd Normal Form leaving some superfluous, performance hitting entities which can become more and more difficult to remove in terms of application code changes.
i This is not actually 3rd Normal Form but this simplified interpretation of Normalization could possibly be twisted to allow it.
‚ When constructing SQL statement joins are you finding many tables in joins where those tables are scattered throughout the entity relationship diagram? Are you tracing logical paths across multiple entities, through multiple entity relationships to find the required columns? This is inefficient because every table passed through will probably add another table to the join. This is a potential indicator that a single dataset services multiple loosely connected applications. Loosely connected applications should not always be placed into the same dataset (Oracle schema).
‚ Compare the number of functions in the application and the number of tables. If you have far more tables than functions you might want to Denormalize or simply remove some of those data buckets of functionality because they may have become redundant.
‚ When searching for static data items such as customer details are you querying a single or multiple tables?† A single table is much more efficient than multiple tables.
Letís explain Denormalization by example. We will start with the reverse of the 3rd, 4th and 5th Normal Forms. Then we will proceed to applications type Denormalization and finally onto some simplistic forms of data warehouse Denormalization.
It is unlikely you will ever need to reverse 1st and 2nd Normal Forms so we will skip 1st and 2nd Normal Forms Denormalization at present; except perhaps for data warehouse tables or temporary reporting tables. Effectively temporary reporting tables are similar to data warehouse table structures.
i Data warehouses can use Denormalization but also use a dimensional data model.
A very common cause of performance problems is caused by the over application of 4th Normal Form to remove null values from entities. Examine the diagram in Figure 1. Note the one-to-one or zero relationships between Customer to Listed and Customer to Address entities. The zero part of these relationships implies that a customer does not have to have either a stock ticker symbol or an address. Thus 4th Normal Form allows the removal of these two sections to new entities, the Listed and Address entities.
Removal of null values from entities to new entities can save space but this is a debatable issue. Since null values are empty and Oracle has variable record lengths the space saving element may be negligible. However, SQL statements to retrieve customer details from these three entities could require a two or three table join to retrieve all of the customer information. When it is taken into account that other information will be retrieved with customers, such as orders and invoices, very large joins could result. The 4th Normal Form detail shown in the top part of Figure 1 could cause a very complex multiple table join to become even more unnecessarily complex, difficult to tune and thus lead to severe performance problems. These types of performance problems are often detected only when applications are only realistically sized production databases at the end of the development cycle. Making database level logical structural changes in the maintenance cycle can be so costly as to make them impossible.
Figure 1: 4th Normal Form Denormalization
Now letís take a look at 3rd Normal Form. One cannot really present the argument that 3rd Normal Form is for academics and is commercially unusable because it is. However, the point to make about 3rd Normal Form is that every 3rd Normal Form many-to-many join resolution entity should be closely examined. Is it really needed?
i This is not actually 3rd Normal Form but this simplified interpretation of Normalization could possibly be twisted to allow it.
Look at the diagram in Figure 2. Would you really need to locate the details of a single course for a single student? Yes you probably say. Think about it from the perspective of how your application interfaces with the reader; does it ever require you to list the details of a single course on which a single student is enrolled? Perhaps yes. If so then that information may have already been selected from the database? The point is this. 3rd Normal Form many-to-many join resolution entities should only be used when they have meaning.
Figure 2: 3rd Normal Form Denormalization
So 3rd Normal Form should create meaningful entities from many-to-many relationships. 3rd Normal Form is best not used to create many-to-many join resolution entities such as the CourseStudent entity in Figure 2. Why?
Look at the schema shown in Figure 3. Note how the SupportingAct entity does two things. Firstly, it does resolve a many-to-many join between the Act and Show entities. Secondly and more importantly the SupportingAct entity has meaning in itself since it defines a second, third or even more supporting acts for a main event. Technically speaking there can be many shows for every act since a pop music band could do many shows at many venues. Additionally every pop act could have one or more supporting pop music bands that perform with it as an introductory event. Since the supporting acts are also Act entities by definition and they perform at the same show then it is thus meaningful to define those supporting acts as extensions of both the Act and Show entities. A finer point to note is that a pop music band traveling all over the world may not necessarily always play with the same supporting act or acts. Thus the same act could use different supporting acts when the main event performs at different venues.
That is a little complicated and open to debate. However, the point should now be clear about the practical application of 3rd Normal Form. Many-to-many joins resolved into new entities should produce a meaningful entity, not meaningless unique identifiers, simply defined in order to define a theoretical but not necessarily practical required uniqueness. These extra unique entities may or may not ever be used. Do they exist for the sake of the data model or for the sake of the application? More often than not CourseStudent type many-to-many join resolution entities are not used and if they are then perhaps should not be. On the other hand it is often tempting for Developers to make use of these types of wasteful entities simply to make application coding easier. The result is often an over coded application.
Figure 3: The Concerts Schema
Now letís retrogress a little. Letís go back to the 4th Normal Form entity structure as presented in Figure 1 but with an extra entity. Examine the diagram in Figure 4. Note the new entity exchange and the addition to the Listing table of the EXCHANGE_ID foreign key column. The relationship between the Listing and Exchange entities is a many-to-one relationship, placing the tables into 2nd Normal Form. Why is this 2nd Normal Form? This is because the Exchange table will contain items such as NYSE (New York Stock Exchange) or NASDAQ. The names of exchanges are static relative to the companies listed on those exchanges. Ticker symbols are used to represent companies on stock exchanges.
The big green arrow in Figure 4 shows the removal of the Exchange entity by placing the name of the exchange into the Listed entity. This is effective Denormalization since there is no information on the Exchange entity other than itís unique identifier (EXCHANGE_ID) and the name of the exchange. In short, the Exchange entity in this situation is completely unnecessary and will probably ultimately hurt performance by requiring more complex multiple table join SQL statements.
Figure 4: 2nd Normal Form Denormalization
Disclaimer Notice: This information is available ďAS ISĒ. I am in no way responsible or liable for any mishaps as a result of using this information.