What is the problem?
3rd normalised form database design is where every table has a unique
id field, and where every foreign table reference is made by id and by
nothing else. For example, in a 3rd normalised database design of address
and people, you never refer to the people in the address table by name,
but only by the id of their record.
The problem is this: adding, editing, updating and accessing the records
is complex. You therefore have to write code that manipulates the records
"one level up". Also, you can never be sure that what you are dealing with
is entirely correct, because if you look at the tables separately, all you
see is numbers. no names. no clues. no way to debug.
But, as I've found out: it's much worse than that - a lot worse. In fact,
it was incredibly naive of me, an experienced programmer, to wander in
blithely to the task of dealing with a 3rd normalised form database, and
expect that hard-coded functions to manipulate data would be a "good idea".
So this article will outline some of the lessons that I've learned, and
some of the tricks that, if you don't believe they are necessary, then,
putting it bluntly, you are quite simply a total failure at being able
to assess projects, and you should quit your job and go and do something
else, like flower arranging, because otherwise you will be a complete
liability to the project you're working on.
Also, I'm sorry to have to say this to Free Software people, but the
Microsoft .net data access framework provides the level of required
functionality to correctly manipulate data, and free software doesn't.
Anyone assessing free software for the manipulation of large data sets
(100 or more 3rd normalised form tables, and gigabytes of data) should
think twice, unless they have access to extremely competent programmers
and/or at least a 48 man-month budget for design and coding.
Lastly, I should particularly point out: it doesn't matter what programming
language you program your SQL access layer in: it's GOING to be complex.
You should therefore design (or find) an access layer that is in the
best possible programming language that you can find. And, I'm sorry to
have to inform people who love SQL stored procedures this, but stored
procedures is a shit way to deal with the level of required complexity.
Not only that, but you have to have seriously competent programmers
in SQL stored procedure language - a language which is completely
ill-equipped
to deal with the required complexities, but also you have to coordinate
their efforts very closely with the rest of your programmers. Forever.
Why is this so difficult?
There are many reasons - but here's one to illustrate. Let's take an
initial table, which then had to be extended somehow. But, for various
reasons (not least of which could be that the initial table is multi-purpose
i.e. it's like a Base Classs, and therefore cannot be edited) you have
to have an "extension" table (like a Derived Class) that refers to the
initial table, with a one-to-one mapping.
Let's suppose that there are existing entries in the "Base Class" table,
which you now need to "extend". Let's call the initial table "Person",
and you have lots of "people" in the table already, and now you want to
extend that to "Employee". in the "Employee" table, you want to put
which company they work for, what date they started etc.
So, the most sensible thing to do is to do a LEFT JOIN between the
"Person" and the "Employee" tables. Also, let's assume that each person's
name (or, better, social security number) is unique. So, from a usage
perspective, you never look up a person by name - only by their SSN. Your
security division swears blind that the Person table was legitimately
procured, but the fact that it contains 200 million entries, and there
are only 50,000 Employees in the company, leaves you a little suspicious.
Regardless: that's the setup. The Person table is pre-populated (mostly
inaccurately...), and you have existing employees to manage, and new ones
to add. Now we describe the problem. The issue is: when adding a new
person
across these two tables, what exactly is required? Well - the procedure
goes something like this:
First, do a LEFT JOIN against the Person and Employee tables, looking
for an existing record (remember, this is two tables: you can't just do a
straight REPLACE INTO).
If a record was found, then do an UPDATE into the Person fields,
rather
than an INSERT, of those fields that, in the input data, match with the
Person fields.
If in the Query, the Employee id field is NULL (and so will all of the
other Employee fields, but the important one to test is the Employee id
field)
then you know that there was an existing Person record, but no Employee
record. Under these circumstances, rather than doing an UPDATE (on a
non-existent Employee record) you have to do an INSERT. Oh - but
remember to grab the correct Person id from the query you just made!
And that's only if there are two tables. If there are three JOINed together
with LEFT JOINs, then you have to repeat the procedure AGAIN, just to make
sure that there might not be existing data which could end up now being
successfully JOINed in (remember: queries don't just go against id fields,
they can be against anything).
It gets worse, and here's why.
Imagine that there's a scenario where you cannot use Primary Key Indexes.
More specifically, to make a record unique requires conditional
indexes, that cannot be done using SQL: they can only be done in a
programming
language.
Conditionally Unique Indexes
Unique Indexes in SQL databases are typically the way to avoid data being
accidentally added that is supposed to already be there. So, you can't
end up with the same "Manager" doing more than one "Job", for example.
(where reality meets naievity in database design you often get conflicts.
resolving the conflict in this example is left as an exercise for the
reader - but - and here's a clue: you're not allowed to shoot the people
who pay your wages).
So, what happens when you want to keep, for example, history information
about a Manager's career? Instead of having a "Job" table which has a
single "Manager ID" field, you instead have a third table -
MANAGER_JOB_REF -
which stores not only the ID field of the "Mananger" and the ID field
of the "Job", but also the date stamp when the "Manager" started the
job, and a date stamp when the "Manager" ends the job - which can be
NULL to indicate "they're still in the job".
Now you've moved into a whole new ballgame of complexity in your 3rd
Normalised Form database, where you can't even create Unique Indexes
across the ID fields (Manager_ID and Job_ID) of the MANAGER_JOB_REF table.
Why? because a Manager might temporarily have two jobs as they transition
between roles in the company. In fact, for quite some time, they might
end up having the roles, and, realistically, if a company down-sizes,
they might end up with two roles, permanently (but we'll leave that
issue aside as the exercise for the reader, referred to above, where
you're not allowed to shoot your boss. or go mad and claim sick leave.)
So, we have a nice complex conditional index, where people are identified
uniquely like this:
SELECT * FROM
Person
LEFT JOIN Employee ON Person.id = Employee.person_id
LEFT JOIN Manager ON Employee.id = Manager.employee_id
LEFT JOIN Manager_Job_REF ON Manager_Job_REF.manager_id = Manager.id
LEFT JOIN Jobs ON Manager_Job_REF.job_id = Job.id
WHERE
(Person.SSN = '667051111') AND
(Manager.Level = 'Numpty Timewasting Level') AND
(Manager_Job_Ref.end_date IS NULL) AND
(Manager_Job_Ref.start_date > '21 Jun 1954')
Anyone spot the deliberate mistake? anyone? No - it's nothing to do
with the Manager being a Numpty, nor with the fact that they've been
in the job far too long. Here's a clue: it's to do with the LEFT JOINs.
This query, even if the Person exists, and in fact even if the Employee
record exists, will still not appear in the query if there is no Manager
record or even if there is, if there's no Manager_Job_REF record - because
of the WHERE clause tests.
We missed out NULL checking, required due to the LEFT JOINs.
Here's the actual query required to do the job correctly:
SELECT * FROM
Person
LEFT JOIN Employee ON Person.id = Employee.person_id
LEFT JOIN Manager ON Employee.id = Manager.employee_id
LEFT JOIN Manager_Job_REF ON Manager_Job_REF.manager_id = Manager.id
LEFT JOIN Jobs ON Manager_Job_REF.job_id = Job.id
WHERE
(Person.SSN = '667051111') AND
(
(Manager.Level IS NULL AND
Manager.id IS NULL)
OR
(Manager.Level = 'Numpty Timewasting Level')
) AND
(
(Manager_Job_Ref.id IS NULL AND
Manager_Job_Ref.end_date IS NULL AND
Manager_Job_Ref.start_date IS NULL)
OR
((Manager_Job_Ref.end_date IS NULL) AND
(Manager_Job_Ref.start_date > '21 Jun 1954')
)
)
To explain clearly: the LEFT JOINs result in the Manager.id and all
Manager fields being LEFT JOINED being NULL. But, ANSI SQL requires
that if a field is NULL, then any comparison operations such as equals
greater or less than must all return false. So, you have
to explicitly code in a test for NULL in order to make the LEFT JOIN
return the NULL Manager fields.
But, as you can see: the WHERE clause doesn't end there - the tests
also include Manager_Job_Ref, and so you have to test for NULLs there
as well! But, it doesn't end there - as you can
see from the Manager_Job_Ref.end_date example, you could actually have
a valid test (if this was INNER JOINs) for NULL values - remember,
in the example, we said that if the end_date was NULL, then the Manager
was still in the job.
So we have to explicity include a test which uniquely identifies whether
the table being LEFT JOINed is empty / missing, and for that, we have
to check that the primary index field is NULL as well (e.g.
Manager_Job_Ref.id)
This looks insane - but I kid you not, it's absolutely necessary -
and we've only just scratched the surface. I forgot to mention: when
you start doing LEFT JOINs, then anything that you would normally need
to be INNER JOINed must be converted to LEFT JOINs, because otherwise,
entries will be missed... but that you should already know about, as a
good SQL designer.
Lookup Tables
One final little "trick" that's helpful for lookup tables. In 2nd
normalised form, lookup tables are done by storing the name in the
table - effectively treating the database as a glorified spreadsheet,
and thus wasting enormous resources. In 3rd normalised form, you end
up instead with the enormous inconvenience of storing id fields which
you then can't debug properly - unless, of course, your SQL access layer
performs the translation for you.
So, let's describe a perfectly reasonable scenario where lookup tables
take on a whole new level of insane complexity underneath, in 3rd
normalised form, that look perfectly acceptable on the outside, to
ordinary sane people used to dealing with 2nd normalised form.
We'll extend the numpty Manager example with three further tables.
The first table is "Boss Hierarchy" and the second and third is the
"Corporate Structure".
The "Boss Hierarchy" table is called EMPLOYEE_EMPLOYEE_REF and it
contains fields "employee_id" and "my_boss_employee_id". The other table
is... let's say that it contains which part of the company employees
are in: e.g. the "Arizona Office", the "Numpty Division", the "Blue
Cheese Moonies Office". (personally I think the Arizona Office is the
funniest one, but my sense of humour is pretty sick). Finally, we need
a CORPSTRUCT_EMPLOYEE_REF table to link employees to corporate divisions,
with field names that are self-evident.
Also, we'll add that, for convenience, 'Employees' have a link to their
division head stored in the Employee, by id of the 'Person'. This is a
little contrived, and there are better ways to represent the relationship
with a better design, but it allows us to illustrate the issue.
We now want to present a "view" of this data which looks like a 2nd
normalised form, where we can pass in data to be changed that contains
things like "this employee named 'fred' has moved to the Blue Cheese
Office and the boss of that division is 'Mr Insanity'", instead of "this
employee with id 21767 that's in the Person table not the employee table"
has moved to the corporate division with id 107, and corp division with
id 107 is controlled by Person who is known as 6, Number 6".
So - let's write those out, roughly as pseudo-SQL 'create' statements:
CREATE TABLE Person (
id int, # unique index
name varchar,
stolen_ssn_number varchar
);
# table that extends Person
CREATE TABLE Employee (
id int,# unique index
person_id int,
division_head_id int
started_date datetime,
fired_from_cannon_date NULL datetime
);
# table that extends employee
CREATE TABLE Manager (
id int,# unique index
employee_id int,
);
CREATE TABLE CorpStruct (
id int, # unique index
name_of_office varchar;
);
# table that links employees to their boss
CREATE TABLE EMPLOYEE_EMPLOYEE_REF (
id int, # unique index
employee_id int
my_boss_employee_id int,
);
# table that links employees to a corporate office
CREATE TABLE CORPSTRUCT_EMPLOYEE_REF (
id int, # unique index
corpstruct_id int,
employee_id int
);
CREATE FOREIGN KEY ce1
CORPSTRUCT_EMPLOYEE_REF.corpstruct_id REFERENCES CorpStruct.id;
CREATE FOREIGN KEY ce2
CORPSTRUCT_EMPLOYEE_REF.employee_id REFERENCES Employee.id;
CREATE FOREIGN KEY ee1
EMPLOYEE_EMPLOYEE_REF.employee_id REFERENCES Employee.id;
CREATE FOREIGN KEY ee1
EMPLOYEE_EMPLOYEE_REF.my_boss_employee_id REFERENCES Employee.id;
CREATE FOREIGN KEY me1
Manager.employee_id REFERENCES Employee.id;
CREATE FOREIGN KEY pe1
Employee.person_id REFERENCES Person.id;
CREATE FOREIGN KEY eb1
Employee.division_head_id REFERENCES Person.id;
You might already start to be getting a sinking feeling about this one
already - but don't worry: it really is as bad as it looks,
but I spent nearly four weeks thinking this through, making mistakes,
so that you don't have to (hurrah!).
The issue is this: you need to translate the association between the
"Corporate Structure" table into a "Person", in order to correctly
display the boss of the Division. It's a horrendously complex lookup,
but it is a lookup nonetheless. And all you want to do is show that
"fred" answers to "Mr Insanity" in the "Blue Cheese Office"!
What I did was to do a recursive lookup, down the Foreign Key relationships,
to translate the (contrived) division head Person id into the "Corporate
Structure" office. You have to link from Person to Employee, from Employee
to CORPSTRUCT_EMPLOYEE_REF, from CORPSTRUCT_EMPLOYEE_REF to "CorpStructure",
and finally you get to "lookup" the name "Mr Insanity" as being in
the "Blue Cheese Office".
Of course, you could just put the "Division Head Honcho" id field into
the "CorpStruct" table, and this would make life much easier. However,
reality never meets sanity as far as Database Design is concerned and
you may have to just live with the existing database design. Plus, both
suggestions don't take into account the "Conditional Indexes" syndrome,
where Division Heads come and go and you might want the database to record
who to blame or sue (or take out a contract on, or even offer a nice fat
bonus: if you work for Enron - you never know what's going to happen).
Just in case something is discovered ten years down the line.
So you can't just expect "lookup tables" to be nice and straightforward -
you can expect to have to jump through a couple of "anklebone connected
to da.." hoops in order to put the right strings into that nice
drop-down menu you were told to add to the edit page.
Insanity by Design
I haven't covered everything - all of the required tricks, but
I hope I've impressed upon you by now quite how insane it would be to
have to reimplement, ten, fifteen - heck, even twice some of the
logic above, and I hope that it illustrates to you why it is necessary
to have a decent database abstraction layer. If you work for a big
corporation, such as Dow Jones, then you already know what I am talking
about, but your Non-Disclosure Agreement precludes you from being able
to tell anyone about it - plus, you make a hell of a lot of money from
being able to keep your job as complex as it really is, and you'd prefer
to keep receiving the nice bonuses.
As a "lazy" Free Software developer, I have difficulty in doing the same
thing more than about twice, especially if I know that I might have to do
it again, and so therefore want to go out of my way to write code that will
make it easier the next few times. But even I had no idea just
quite how insane 3rd normalised form database management really is.
The solutions effectively call for "Writeable Views". Every decent
database (that means excluding MySQL, whose designers should have been
put down at birth for not deciding to do ANSI SQL 92 compliance from the
word GO) should have the concept of "Writeable Views".
Microsoft .net provides a work-around for this lack of SQL writeability,
via its "recordset" abstraction layer, where you can merge two datasets
together, and all the appropriate deletes, updates and inserts will be
handled for you. Microsoft SQL 2000 endeavours to provide simplistic
"write" access to views - but the restrictions on it are way
below the required level of functionality as partly outlined in
this article.
Also, you should by now have a clear idea why I am so against the
use of Stored Procedures. To even attempt to implement an
abstraction layer in a programming language that was designed in the 80s,
which doesn't cope with object-orientated principles or even optional
parameters correctly without horrendous workarounds, would not be insane,
it would be stupid. At least if you're insane, you probably stand a
good chance of succeeding, whereas if you're just stupid enough to try
implementing a Database Abstraction layer in a SQL programming language,
you're probably not intelligent enough to pull it off.
Either that, or you are deliberately trying to make yourself indispensable
by pulling the wool over the eyes of your Management, which is much worse.
Whatever the reason: don't do it. Spend some time researching or writing
a decent 3rd Normalised Form database abstraction layer, in a proper
object-orientated modern programming language, and stay the hell away
from hand-coded hard-coding of database access functions. Oh - and don't
be fooled into thinking that Views will make your job easier, either,
because Views are only half the story (read-only...) and now you have
to maintain the Views forever, as well as your code...
In short: good luck...