Monday, 26 February 2007

Database triggers are bad

Triggers are a feature peculiar to relational databases: things magically happen when you change the data in a table. You can use them to maintain cache tables and to check the integrity of data before it is inserted. They sound great, but they cause all sorts of problems.

My experience


Twenty years ago I was part of a team producing an intelligent database in which triggers automatically maintained the data; my role was simply to test that everything worked as it should. My problem was that after changing any data I needed to check every table in the database to see what might have been affected. While these triggers might have sounded magical to the customers, they were starting to look demonic to me.

Take an ASP program that inserts a record into a table. Testing that the insert happened is quite simple, but how do you check what other tables might have been affected? Just looking at the original insert statement obviously isn't enough: you need to see which triggers may have fired, then check which other tables those triggers touch, and whether those tables have triggers of their own. In the end I wrote a program to take a snapshot of the database before and after each test and compare them to see which tables had been altered. It was a nightmare.
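To make the problem concrete, here is a minimal sketch of the kind of trigger that causes the trouble. The table names and the SQL Server flavoured syntax are purely illustrative; the point is that the ASP program only ever sees the insert into trade, yet two other tables change behind its back:

    CREATE TRIGGER trg_trade_insert ON trade
    AFTER INSERT
    AS
    BEGIN
        -- Silently maintain a summary cache table.
        UPDATE m
        SET    m.total_value = m.total_value + i.value
        FROM   branch_monthly_totals m
        JOIN   inserted i
          ON   i.branch_id         = m.branch_id
         AND   YEAR(i.trade_date)  = m.trade_year
         AND   MONTH(i.trade_date) = m.trade_month;

        -- Also write an audit row; the audit table may have triggers of its own.
        INSERT INTO trade_audit (trade_id, action, changed_at)
        SELECT trade_id, 'INSERT', GETDATE()
        FROM   inserted;
    END;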

Use stored procedures instead


Don't let your 3GL programs change data directly; have them call a stored procedure that does whatever is required. That way it is obvious what code is being run, whereas the firing of a trigger is hidden from the caller.

You can easily follow the complete logic of the code being called without needing to search for other hidden triggers.

Getting your 3GL programs to call stored procedures rather than updating the data directly ensures that database changes can happen without needing to rewrite all your 3GL programs.
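As a rough sketch of what that looks like (same illustrative table names as before, SQL Server flavoured syntax assumed), the stored procedure puts the insert and every related update in one visible place:

    CREATE PROCEDURE usp_insert_trade
        @branch_id  INT,
        @trade_date DATETIME,
        @value      DECIMAL(18,2)
    AS
    BEGIN
        INSERT INTO trade (branch_id, trade_date, value)
        VALUES (@branch_id, @trade_date, @value);

        -- Everything the trigger did in the dark now happens in plain sight.
        UPDATE branch_monthly_totals
        SET    total_value = total_value + @value
        WHERE  branch_id   = @branch_id
          AND  trade_year  = YEAR(@trade_date)
          AND  trade_month = MONTH(@trade_date);
    END;

The ASP program simply calls usp_insert_trade. If the underlying tables change shape later, only the procedure needs rewriting; the calling programs stay as they are.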

Limitations of cache tables


Limited reporting: Cache tables store summary data for quick reporting (e.g. monthly trading by branch). The reporting you can do from them is very limited: you can pick which months and which branches, but that's it. If your users want to see total trading by your female employees, you're out of luck unless that dimension was built specifically into the cache table.
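Here is a sketch of what such a cache table typically looks like, and why it can't answer the new question (table and column names are illustrative):

    -- The cache table only carries the dimensions it was designed with.
    CREATE TABLE branch_monthly_totals (
        branch_id    INT           NOT NULL,
        trade_year   INT           NOT NULL,
        trade_month  INT           NOT NULL,
        total_value  DECIMAL(18,2) NOT NULL,
        PRIMARY KEY (branch_id, trade_year, trade_month)
    );

    -- Fine: monthly trading by branch.
    SELECT branch_id, trade_month, total_value
    FROM   branch_monthly_totals
    WHERE  trade_year = 2006;

    -- Total trading by employee gender? That dimension was never summarised,
    -- so the cache cannot answer it and you are back at the trade table.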

Duplicate code: So perhaps you offer two types of reporting

  • one on the original trade table offering detailed reporting on recent trades
  • another on the cache table offering historical, but limited, reporting
The problem is you now have two sets of reporting code with different capabilities and performance, and over time those differences will only grow.

Reconciliation batches: Additional batch jobs are now required to check that the cache table and the original trade table stay perfectly aligned. I know the triggers are supposed to keep the cache totals identical to the trade table, but I'll guarantee that over time they'll drift apart. This means yet more code.
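That reconciliation batch might be little more than a query like this sketch (illustrative names again): recompute the totals from the trade table and flag any month where the cache has drifted:

    SELECT c.branch_id, c.trade_year, c.trade_month,
           c.total_value AS cached_total,
           SUM(t.value)  AS actual_total
    FROM   branch_monthly_totals c
    JOIN   trade t
      ON   t.branch_id         = c.branch_id
     AND   YEAR(t.trade_date)  = c.trade_year
     AND   MONTH(t.trade_date) = c.trade_month
    GROUP BY c.branch_id, c.trade_year, c.trade_month, c.total_value
    HAVING SUM(t.value) <> c.total_value;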

Confusion: Users can get detailed reporting for the last month but are limited to standard reporting before that. You can explain about triggers, cache tables and the disk space saved, but if your customer wants to see all of his trades over the last year he is not going to be impressed.


Removing cache tables


If you do all your reporting from the main trade table then you don't need:
  • triggers to update the cache tables
  • batches to check the cache against the main table
  • two sets of reporting programs
By removing wasted code and allowing detailed reports for any month, the cost of producing your system goes down while its functionality goes up.

Because you have removed all that wasted code, the system will cost less to maintain and bugs are less likely to appear.

Reporting performance is likely to degrade, but with proper indexing there is no reason detailed reporting shouldn't be nearly instant even over 100 million records.
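For example, a covering index matched to the reporting filter keeps detailed reports fast straight off the main trade table. The sketch below uses SQL Server's INCLUDE syntax and illustrative names:

    -- A covering index keyed on the columns the reports filter and sort by.
    CREATE INDEX ix_trade_branch_date
        ON trade (branch_id, trade_date)
        INCLUDE (value);

    -- Detailed reporting for any branch over any period, straight off the
    -- main table, with no cache table in sight.
    SELECT trade_date, value
    FROM   trade
    WHERE  branch_id  = 42
      AND  trade_date >= '2006-01-01'
      AND  trade_date <  '2007-01-01'
    ORDER BY trade_date;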

The space required by your system will go up if you were previously replacing your older trade data with the cache table. This is unavoidable, but seeing as my MP3 player holds 40GB, I figure you can probably afford a little more disk space.

Database locking may raise its head if you're running large reports against a table while rapidly inserting trades. Proper indexing should avoid locking the active pages, but you can always replicate the database and run the reports against the copy.

Wednesday, 14 February 2007

Databases rot over time

Your original database design was great and the system has been running fine for a couple of years. The original developers had all done their jobs well, received their bonuses and moved on to new projects or new jobs. So why is everything starting to fall apart now?

Squeezing round pegs into square holes

The current developers who inherited the system are faced with a simple choice every time they need to add new functionality:
  • Alter a complex design and rewrite all interdependent code (feeds, screens, searching, integrity checks)
  • Just squeeze the additional functionality into the existing database structures that weren't designed for that purpose.
The first option may be technically correct, but it costs the most and carries the most risk. The second option is cheap to implement and can always be rolled back if things go wrong.

The problem with this second option is that it leads to a slow rotting of the database design (assuming it was any good in the first place). New code then has to be added to deal with each of these new exceptions.

Pretty much the same thing happens when scope creep raises its ugly head in the later stages of development.

So what happens?

Functionality decreases, bugs appear, performance degrades and maintenance costs rise.

Those generic stored procedures that worked with any item stored in the table no longer work for the new items being stored there. This leads to extra code being written to cope with the exceptions. The exceptions mount up until your system becomes the maintenance nightmare we all dread.

Performance starts to appear random, depending on which set of exceptions needs to be called.

Functionality also starts to appear random as the exception code deviates further from the original logic.

Poor integrity checks
Saving time by not implementing integrity checks on your data means that bad data will creep into your database. Allowing bad data in doesn't actually save any time: it just means that all the rest of the code in the database now needs to protect itself against this bad data, and errors start appearing in code that assumed the data was clean.
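A minimal sketch of the alternative (illustrative names and rules): declare the integrity rules once, in the schema, so every other piece of code can assume the data is clean:

    -- Declare the rules once, in the schema, instead of in every program.
    CREATE TABLE trade (
        trade_id    INT           PRIMARY KEY,
        branch_id   INT           NOT NULL REFERENCES branch (branch_id),
        trade_date  DATETIME      NOT NULL,
        value       DECIMAL(18,2) NOT NULL,
        CONSTRAINT chk_trade_value CHECK (value > 0)
    );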

Thursday, 8 February 2007

Modern databases have too many tables

We have been using the rules of normal form to design databases for the past 20 years. The problem is that these rules quite often result in databases with far too many tables.

Why so many tables

Let's take a configuration management database (CMDB) for a bank, where we store data about all the items within the bank and how they relate to each other. We’d have tables for people, computers, printers, software, departments, projects, locations etc. Let's make a conservative estimate of 30 different entity types to store information on.

Unfortunately this list of tables is only just beginning as we would also want:
  • tables containing lookup values
  • tables to join M:M relationships
  • cache tables (where required)
  • a user table
  • permissions tables
  • a log table
  • a system parameters table
  • temporary work tables
  • data transfer tables for file loading
  • some people also like to split the main tables into current and history tables
The number of tables quickly grows; in our example above it will probably reach around 200.

Why is this a bad thing?



More tables mean more code

Every table in your database needs code to support it:
  • Code to add data to the table. This might be a form or a feed.
  • Code to check the integrity of the data. You do check your data, don't you?
  • Code to view the data and provide reports.
  • Code to search the data.
The more tables in your design, the more code you need.



More code is always bad

More code means:
  • it will take longer to develop the system
  • it will cost more
  • it will cost more to maintain
  • it will have more errors
  • users will have to wait longer for a solution

The last point hurts in many ways: the longer your users wait, the higher their expectations grow and the less likely you are to meet them.

It also means that there is more chance that the needs of the application will have changed radically by the time your solution is ready to roll out.


More tables means fewer features

Let's imagine we have a number of lookup tables in our database. The data in these tables will have been entered by hand, as it's probably not worth building a maintenance form for each individual table.

If we store all these lookup values in just one table, we can easily provide a single screen to maintain them. We can then afford to add features such as default values, display sequence and HTML code generation, since they will automatically apply to every lookup value in the database.
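A sketch of that single, deliberately denormalised lookup table (names and columns are illustrative):

    -- One lookup table keyed by type, so one maintenance screen covers every
    -- list of values in the system.
    CREATE TABLE lookup_value (
        lookup_type  VARCHAR(30)  NOT NULL,   -- e.g. 'CURRENCY', 'DEPARTMENT'
        lookup_code  VARCHAR(20)  NOT NULL,
        description  VARCHAR(100) NOT NULL,
        display_seq  INT          NOT NULL DEFAULT 0,
        is_default   CHAR(1)      NOT NULL DEFAULT 'N',
        PRIMARY KEY (lookup_type, lookup_code)
    );

    -- Every drop-down in the application is now one query away.
    SELECT lookup_code, description
    FROM   lookup_value
    WHERE  lookup_type = 'CURRENCY'
    ORDER BY display_seq;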

The problem is that the rules of normal form tell us we can't store different types of values in the same fields.

The question is - "does your design really need that many tables?"