Thursday, March 20, 2008

Database Normalization

Introduction

The process by which an exiting database schema is modified to organise the data in an efficient, non-redundant manner is called Normalization. The primary goal of the process is to reduce data redundancy across tables and ensure that the data dependencies are meaningful, so that the data is logically stored and consumes less space. The process was first put forward by Edgar F Codd in his paper A Relation Model of Data for Large Shared Data Banks. Describing in generic terms the process decomposes the original non normalized table into two or more relations such that they convey the same meaning as the original table.

Problems addressed by normalization

Just to highlight various problems that can be faced while designing a database let's presume that we are to maintain the online library for the Browser Library Store. We will be needed to have a track of information pertaining to various books, publishers and authors. Initially we consider a single table for the purpose with fields such as Title, ISBN, Author, AuthorBio, Subject, Publisher, Price, Pages. Now using this schema for our database, can lead to various inconsistencies and anomalies. Storing unrelated group of information in a single table results in multiple records having same information, as in having same author for multiple books in this case leads to repeating the author information over these records. An improper/missing updating of any of the records leaves the database in an inconsistent state and can lead to false query results. This phenomenon is known as updating anomaly.


We won't be able to enter any information pertaining to author or publisher without a book being housed in our library or even cannot have multiple entries per book as we are having ISBN as the primary key here. This is known as insertion anomaly.



Similarly we can't delete record for any of the books from our library without deleting the author information, the scenario better known as the deletion anomaly.



Ideally, we should design our database so as to avoid these inconsistencies and anomalies, by normalizing the tables and enforcing referential integrity between them.

What are normal forms

A set of guidelines laid down by the database community for ensuring that the databases are normalized. The normal forms describe the degree of strictness of a table. The lesser the degree of normalization for a table, the greater are the chances of it having logical inconsistencies and data anomalies. The process tends to slow down the data retrieval, since data which may have been retrievable from one record in an non normalized design may have to be retrieved from several records in the normalized form. Therefore the rule of the thumb says that normalize while it doesn't have a measurable effect on performance. A typical database is normalized up to the third normal form, where in some cases may require the schema to consider higher degree of normalization. Lets take a look at what each of these normal forms has to offer us.

First Normal Form

The first normal form should be the simplest one to conform to. This rule calls for eliminating repeating groups in individual tables and creation of separate table for each related set of data. Each set of related data is further identified with a primary key. This requires values in each table to be atomic. In the example alluded to above, we have non atomic column author, which could be further broken into fields like first name, last name for better retrieval. Similarly there could be a subject field which needs to be atomic. We could achieve 1NF conformance by breaking down the original table into separate tables with related data. So we end up creating new tables namely Book, Author, Publisher, Subject and some tables to establish relations between them, which are all in 1NF.

Book Table





























ISBN NumberTitlePricePublisherID
0321194871Exceptional C++$791
2119487103Introduction to Database System$1262
2342395603Exceptional C++ Style$891
4545323233Inside C++ Object Model$663


The information for the author, subject and publisher are taken out of the original table and the entire related set of information is moved to separate tables. The relation between the Book table and the Publisher table is modelled using the primary key of 'many' as the foreign key of the 'one' in the relation. Since the books and authors, books and subjects have many to many relationship the same is to be modelled as a separate tables.

Author Table
























AuthorIDFirst NameLast NameAuthorInfo
1HerbSutterOne of the most prominent C++ experts.
2StanleyLippmanArchitect with the Visual C++ development team at Microsoft
3ChrisDateSpecializing in relational database technology


The author table contains the information about the author with AuthorID as the primary key.

Publisher Table




































PublisherIDNameAddressCityStateZipPhone
1Addison-Wesley75 Arlington Street, Suite 300BostonMA02116201-767-5021
2Prentice HallUpper Saddle RiverNew JerseyNew Jersey07458800-677-7377
3Sams Publishing800 East 96th StreetIndianapolisIndiana 46240800-889-3358


Subject Table













SubjectIDName
1C++
2Database


In order to establish the relationship between the books, authors and subject we'll consider two more tables viz. BookAuthor table and BookSubject table.

BookAuthor Table



















ISBN NumberAuthorID
03211948711
21194871033
23423956031
45453232332


This table models the many to many relationship between books and authors. A book may have more than one authors, similarly an author may have penned a number of books.

BookSubject Table



















ISBN NumberSubjectID
03211948711
21194871032
23423956031
45453232331


The relationship between books and subjects is again many to many as a book may be covering more than one subjects, similarly a subject may have a number of books on it.

The only drawback of such normalization process is that we might need to use joins for even a simple cross table query.

The article will be updated soon...

What people said... (0)