randys.org

Wasting your precious bandwidth since 1999

How-not-to: Normalize Your Data

Disclaimer: I am not a database architect nor a certified database programmer. I might not be database guru, but I have been working with SQL Server long enough to know what works and what doesn’t.

I’m not a database programmer per say, but I’m expected to be one at work (even though I’m a code jockey). I don’t mind it most of the time, but there are times when it becomes really frustrating. Especially when you come across some schema that is poorly implemented. Normalization is a double edged sword. On one side, normalization is good for data integrity. On the other, it becomes a pain in the arse to get at your data. There are ways around the later (Views, de-normalized tables), but, a pain in the arse it still is. The one thing that really bugs me is normalization done wrong. We use Microsoft SQL server at work, but this should apply to any relational database.

Recently, at my day job, I’ve been working on a particular project which involved consuming data from a third party for use in a .NET web service. We were already consuming data from this vendor but they recently changed the schema (and data) and are deprecating the old format. The new feed is basically a complete dump of their data and schema. Not all of their data, just data we subscribe to (about 1M records total). Instead of modifying our import process to accommodate the current (old) schema (which would have been a huge effort on our part), we just created a new import process and modified the web service.

After working with this new schema for a couple weeks, I have come to the conclusion that, well, it’s crap. First and foremost, it’s not normalized properly. If you have an column in a table that looks like it should be a foreign key, then by all means, make it so. For example, say I have a table called Widgets with several columns. One of those columns is called WidgetTypeId and another called WidgetTypeDisplayName. There’s a prime candidate for another table. I don’t care if you only have two WidgetTypes, it belongs in its own table. Here’s why: When selecting records out of the table and filtering on WidgetTypeDisplayName, it gets really expensive. Sure, you could add an index on that column, but it still would be slower compared to joining on a WidgetType table. If I had ten million widgets and only 100 widget types, I’m only filtering 100 records vs. ten million.

What about filtering on the WidgetTypeId column? BZZZZZ. Where did you get that ID? Did you select it from the same table? Guess what, same difference. Maybe you hard-coded it in the query. That’s not smart either, especially if you have multiple environments that could possibly be out of sync (i.e. WidgetTypeId 4 in development might not be associated to the same thing in production). Put it in it’s own table.

I ran into this several times with this new schema at work and it was a complete pain. For one thing, in development, the queries were always relatively quick to return. In production, however, not the same. This database sat on the same server as another production database so there was more load on it. Enough to make the queries perform poorly. On the order of 10 to 40 seconds to return data in some cases. That’s not good. My solution: create a single de-normalized table (for some reason, we don’t use views) of commonly used data points. The result was millisecond returns and simple queries. Furthermore, instead of filtering/joining on tables with half a million records, it now filters on less than 20 thousand, properly indexed records.

Tags: ,

Leave a Reply

All content Copyright © Randy Sesser | Hosted by WebFaction
Entries (RSS) | Comments (RSS)

randys.org is Digg proof thanks to caching by WP Super Cache!