Informatica: Database Normalization

Introduction

The process by which an exiting database schema is modified to organise the data in an efficient, non-redundant manner is called Normalization. The primary goal of the process is to reduce data redundancy across tables and ensure that the data dependencies are meaningful, so that the data is logically stored and consumes less space. The process was first put forward by Edgar F Codd in his paper A Relation Model of Data for Large Shared Data Banks. Describing in generic terms the process decomposes the original non normalized table into two or more relations such that they convey the same meaning as the original table.

Problems addressed by normalization

Just to highlight various problems that can be faced while designing a database let's presume that we are to maintain the online library for the Browser Library Store. We will be needed to have a track of information pertaining to various books, publishers and authors. Initially we consider a single table for the purpose with fields such as Title, ISBN, Author, AuthorBio, Subject, Publisher, Price, Pages. Now using this schema for our database, can lead to various inconsistencies and anomalies. Storing unrelated group of information in a single table results in multiple records having same information, as in having same author for multiple books in this case leads to repeating the author information over these records. An improper/missing updating of any of the records leaves the database in an inconsistent state and can lead to false query results. This phenomenon is known as updating anomaly.

We won't be able to enter any information pertaining to author or publisher without a book being housed in our library or even cannot have multiple entries per book as we are having ISBN as the primary key here. This is known as insertion anomaly.

Similarly we can't delete record for any of the books from our library without deleting the author information, the scenario better known as the deletion anomaly.

Ideally, we should design our database so as to avoid these inconsistencies and anomalies, by normalizing the tables and enforcing referential integrity between them.

What are normal forms

A set of guidelines laid down by the database community for ensuring that the databases are normalized. The normal forms describe the degree of strictness of a table. The lesser the degree of normalization for a table, the greater are the chances of it having logical inconsistencies and data anomalies. The process tends to slow down the data retrieval, since data which may have been retrievable from one record in an non normalized design may have to be retrieved from several records in the normalized form. Therefore the rule of the thumb says that normalize while it doesn't have a measurable effect on performance. A typical database is normalized up to the third normal form, where in some cases may require the schema to consider higher degree of normalization. Lets take a look at what each of these normal forms has to offer us.

First Normal Form

The first normal form should be the simplest one to conform to. This rule calls for eliminating repeating groups in individual tables and creation of separate table for each related set of data. Each set of related data is further identified with a primary key. This requires values in each table to be atomic. In the example alluded to above, we have non atomic column author, which could be further broken into fields like first name, last name for better retrieval. Similarly there could be a subject field which needs to be atomic. We could achieve 1NF conformance by breaking down the original table into separate tables with related data. So we end up creating new tables namely Book, Author, Publisher, Subject and some tables to establish relations between them, which are all in 1NF.

Book Table

ISBN Number	Title	Price	PublisherID
0321194871	Exceptional C++	$79	1
2119487103	Introduction to Database System	$126	2
2342395603	Exceptional C++ Style	$89	1
4545323233	Inside C++ Object Model	$66	3

The information for the author, subject and publisher are taken out of the original table and the entire related set of information is moved to separate tables. The relation between the Book table and the Publisher table is modelled using the primary key of 'many' as the foreign key of the 'one' in the relation. Since the books and authors, books and subjects have many to many relationship the same is to be modelled as a separate tables.

Author Table

AuthorID	First Name	Last Name	AuthorInfo
1	Herb	Sutter	One of the most prominent C++ experts.
2	Stanley	Lippman	Architect with the Visual C++ development team at Microsoft
3	Chris	Date	Specializing in relational database technology

The author table contains the information about the author with AuthorID as the primary key.

Publisher Table

PublisherID	Name	Address	City	State	Zip	Phone
1	Addison-Wesley	75 Arlington Street, Suite 300	Boston	MA	02116	201-767-5021
2	Prentice Hall	Upper Saddle River	New Jersey	New Jersey	07458	800-677-7377
3	Sams Publishing	800 East 96th Street	Indianapolis	Indiana	46240	800-889-3358

Subject Table

SubjectID	Name
1	C++
2	Database

In order to establish the relationship between the books, authors and subject we'll consider two more tables viz. BookAuthor table and BookSubject table.

BookAuthor Table

ISBN Number	AuthorID
0321194871	1
2119487103	3
2342395603	1
4545323233	2

This table models the many to many relationship between books and authors. A book may have more than one authors, similarly an author may have penned a number of books.

BookSubject Table

ISBN Number	SubjectID
0321194871	1
2119487103	2
2342395603	1
4545323233	1

The relationship between books and subjects is again many to many as a book may be covering more than one subjects, similarly a subject may have a number of books on it.

The only drawback of such normalization process is that we might need to use joins for even a simple cross table query.

The article will be updated soon...

Informatica

Categories

Topics

Thursday, March 20, 2008

Database Normalization

What people said... (0)

Profile

Blog Archive

Recent Comments