A Layman’s Approach to
Relational Database Normalization
By
Gavin JT Powell (BSc, OCP)
June 23rd,
2002
Application
of the Relational Database model to a data set involves the removal of
duplication. Removal of duplication is performed using a process called Normalization. Normalization is
comprised of a set of rules called Normal
Forms. Normalization is applied to subsets of data or tables in a database.
Tables are for placing directly associated data into. Tables can be related or
linked to each other through the use of index identifiers. An index identifier
identifies a row of data in a table much like an index is used in a book. The
index is used to locate an item of interest without having to read the whole
book.
There
are five levels or layers of Normalization called 1st, 2nd,
3rd, 4th and 5th Normal Forms. Each Normal
Form is a refinement of the previous Normal Form. 4th and 5th
Normal Forms are rarely applied. In designing tables for performance it is
common practice to ignore the steps of Normalization and jump directly to 2nd
Normal Form. 3rd Normal Form is often not applied either; unless
many-to-many joins cause an absolute need for unique values at the application
level.
i Over-Normalization can lead to
poor performance in both OLTP and Data Warehouse type databases. Over-Normalization is common in top-down
designed Java object applications. In this situation an object structure is imposed
onto a relational database. Object and relational data structures are
completely different methodologies.
That
is far too much jargon. Let us make the understanding of Normalization very
simple. Forget about it! Normalization
is for Academics and in its strictest form is generally impractical due to its
adverse effect on performance in a commercial environment, especially 3rd,
4th and 5th Normal Forms. The simplest way to describe
what Normalization attempts to achieve can be explained in three ways.
1.
Divide the whole into
smaller more manageable parts.
2.
Removal of duplicated
data into related subsets.
3.
Linking of two
indirectly related tables by the creation of a new table. The new table
contains indexes from the two indirectly related tables. This is commonly known
as a many-to-many join.
These
three points are meaningless without further explanation of Normalization. So
let us go through the rules and try to explain it in a non-Academic fashion.
Let us start with some relational database buzzwords.
â A table contains many repetitions of the same row. A
table defines the structure for a row. An example of a table is a list of
customer names and addresses.
â A row is a line of data. Many rows make up the data
in a table. An example of a row is a single customer name and address within a
table of many customers. A row is also known as a record or a tuple.
â The structure of a row in a table is divided up into columns.
Each column contains a single item of data such as a name or address. A column
can also be called a field or attribute.
â Referential
Integrity is a process of validation
between related tables where references between different tables are checked
against each other. A primary key is placed on a parent or superset table as
the primary identifier or key to each row in the table. The primary key will
always point to a single row only and it is unique within the table. A foreign
key is a copy of a primary key value in a subset table. An example of a
function of Referential Integrity is that it will not allow the deletion of a
subset record where a foreign key value exists in a parent table. Primary keys
in this document are referred to as PK and foreign keys as FK. Note that both primary and
foreign keys can consist of more than one column. A key consisting of more than
one column is known as a composite key.
â An index is used to gain fast access to a table and
to enforce relationships between tables. An index allows direct access to rows
by duplicating a small part of each row to an additional (index) file. An index
is a copy of the contents of a small number of columns in a table. The most
efficient indexes are made up of single columns containing integers.
i Primary
and foreign keys are special types of indexes, applying referential integrity.
1st Normal Form removes repetition
by creating one-to-many relationships. Data repeated many times in one table is removed to a subset table.
The subset table becomes the container for the removed repeating data. Each row
in the subset table will contain a single reference to each row in the original
table. The original table will then contain only non-duplicated data.
â Saves space.
â Reduces complexity.
â Ensures that every purchase order item will belong to
a purchase order.
i Inter-entity
relationships can be zero, one or many to zero, one or many.
Figure 1 – 1st Normal Form
Figure 2 – 2nd Normal Form
My version of 3rd
Normal Form is used to resolve many-to-many relationships into unique values.
i This is not actually 3rd Normal Form but this simplified interpretation of Normalization could possibly be twisted to allow it.
In
Figure
3 a student can be enrolled in many courses and a
course can have many students enrolled. The point to note is that it is
impossible to find a unique course-student item without joining every student
with every course. Therefore each unique item can be found with the combination
of values. Thus the coursestudent
entity in Figure
3 is a many-to-many join resolution entity. In a
commercial environment it is very unlikely that an application will ever need
to find this unique item, especially not a modern-day Java object web
application where the tendency is to drill-down through list collections rather
than display individual items. Many-to-many join resolutions should only be created when they are specifically required by the
application. It can sometimes be better to resolve these joins in the
application to improve database performance.
i Be very
careful using 3rd Normal Form and beyond.
Figure 3 – 3rd Normal Form
Figure 4 – 4th Normal Form
5th Normal Form divides related columns into separate tables
based on those relationships. In
Figure
5 product, manager and employee are all related to each
other. Thus three separate entities can be created to explicitly define those
inter-relationships. The result is information that can be reconstructed from
smaller parts. An additional purpose of 5th Normal Form is to remove
redundancy or duplication not covered by 1st to 4th
Normal Forms of Normalization.
Figure 5 – 5th Normal Form
â 1st Normal Form removes repetition by
creating one-to-many relationships.
â 2nd Normal Form creates not one-to-many
relationships but many-to-one relationships, effectively separating static from
dynamic information. 2nd NF removes items from tables independent of
the primary key.
â 3rd Normal Form is used to resolve
many-to-many relationships into unique values. 3rd NF allows for
uniqueness of information by creation of additional many-to-many join
resolution tables. These tables are rarely required in modern day applications.
i This is not actually 3rd Normal Form but this simplified interpretation of Normalization could possibly be twisted to allow it.
â 4th Normal Form is intended to separate
multi-valued facts in a single table into multiple tables. 5th
Normal Form divides related columns into separate tables based on those
relationships. 4th and 5th NF minimize nulls and composite
primary keys by removing null capable fields and subset composite primary key
dependent fields to new tables. 4th and 5th Normal Forms
are rarely useful.
Disclaimer
Notice: This information is available “AS IS”. I am in no way responsible or
liable for any mishaps as a result of using this information.
More
available on Denormalization in my latest book Oracle High Performance Tuning
for 9i and 10g