Saturday, February 14, 2009

Where Have All The Data Modelers Gone?





I noticed a curious thing at work recently. We have hundreds of instances of Oracle and SQL Server deployed on servers in our data center. Data warehouses that dwarf Home Depot archive many terabytes of information. There's an army of administrators keeping watch, ensuring that no SELECT goes unfulfilled. You can't throw a rock without hitting a Java or .NET developer, well-versed in SQL, chomping at the bit to write an application to deliver that information via the web to any and all clients. Everyone agrees that data are the crown jewels, the lifeblood of any modern business.

If this is true, where have all the data modelers gone?

I see very few people who are comfortable and conversant with concepts like normalization and dimensional modeling. Application developers who possess encyclopedic knowledge of SQL syntax, but lack sound relational design principles and best practices, are given free rein to create whatever tables satisfy the immediate needs of their current project. Extracts are taken, data is duplicated, modifications are made without communicating them back to the source system from whence they came. The number of servers and schemas multiplies. The disk storage required grows exponentially at an ever-increasing rate. Terabytes are maintained, but the unique data might fit on a modest USB key if it could be normalized and extracted. The database administrators don't have any passion for the discipline. Data warehouse designers, ignorant or dismissive of dimensional modeling ideas, create duplicates of source data that are little more than staging areas. There's no cleansing or de-duplication going on. Reports are generated from these staging tables, in spite of better advice.

How did we get here? Where did all that knowledge go?

When the COBOL dinosaurs walked the earth, and mainframes were the kings of business, developers were of one mind. The person who wrote the logic also handled the flat files for persistence and the green screens for the user interface. They were expected to know everything about the system.

Relational databases came along in the 1970s after Ted Codd published his seminal paper on the relational model. There was a land rush to create working implementations of the model. Where there once was a single dominant language for business logic and persistence, COBOL, now there were two: client and SQL. Client/server was all the rage.

Personal computing took processing power out of the data centers, with their raised floors and frigid temperatures, and put it out on desktops. Networks knitted those islands together, first using LANs inside companies and then the Internet over the whole world. Client/server fell out of favor. Fat clients gave way to thin. Two tiers became three; three became four or more. Object-oriented programming took over both the desktop and middle layers, relegating relational databases to the back room. First C++, then Java, and now C# became the brokers between the persistent data and the end users. Object-relational impedance mismatch became the order of the day.

I don't know if increased specialization and layering has put relational practitioners in a funk, but it looks to me like true expertise in this area is dying. I've been fortunate to work with someone lately who is both a well-read devotee of Ralph Kimball and experienced at standing up several successful data warehouses that use dimensional techniques. But he's a singleton; I don't know of anyone else in the enterprise who speaks the same language. Tools for entity-relationship modeling (e.g. ERWin) are buried in the DBA's toolbox, hidden by scripting and client tools (e.g. TOAD), like a specialized wrench reserved for out-of-the-ordinary tasks.

Contrast this with the state of the middle tier. There's a cacophony of voices whenever someone proposes an object design. Good UML tools, both commercial and free, help to make even the gnarliest whiteboard session intelligible and clean.

It has dawned on me that data are still the crown jewels of every business. Client and middle tier technologies continue to evolve rapidly, coming and going at a rapid pace. But the data lives on forever.

Data warehousing and OLAP were huge in the late 90s. Kimball and Inmon slugged it out for dominance. Articles and books were written; consultancies prospered; businesses gobbled up advice and software as quickly as their pocketbooks would allow.

Has the animal spirit gone out of this field? The furor has died down and moved elsewhere, but the need to manage information and deliver it in a timely way to customers is still with us. I don't know if the problem is "solved" so completely that it's become routine everywhere but at my current employer.

But from where I sit today, knowing relational databases and dimensional modeling very well looks like a good bet if you want to have a rare and valuable skill.

1 comment:

Unknown said...

It was a treat reading your blog, i hope i can keep up with it. This post was particularly interesting and so i thought i would comment.

As i see it, the reason data modeling got relegated was because of DB product companies, DBA's and project managers in that order.

DB product companies tried to make proprietary versions of what codd proposed, with out any interoperability. The next decision DB product companies made was to compete on dumbing down that very model. A singinificant consequence of this was DBA's for different DB products did not have an incentive to learn good DB design practice. They were able to grow as a great DBA for one particular flavor of DB. Project manager's are to blame (for pretty much everything that goes wrong with any IT project), because they were technically incompetent, but got to make decisions that were incentivised for the short term. Thanks to moore's law they were able to get away with it.