When to move beyond relational databases
One of my current projects is the review of an application built by a contractor for a major federal agency. The code relies heavily on queries and stored procedures against a relational database management system (RDBMS). These are easily the most complex I have ever seen. Normally, when complexity like this accrues in a database, IT managers should consider refactoring the database design and using views, indexes, precomputation and all the other goodies an RDBMS offers.
However, it may very well be that the relational paradigm, as venerable and successful as it is, is simply not the best choice for an application. There are alternative paradigms, notably document and graph databases – colloquially known as NoSQL databases – that have their own advantages and disadvantages. But let’s first remind ourselves why RDBMSs have dominated for so long before exploring how NoSQL databases compare.
With theoretical foundations in mathematics and powerful commercial and open-source implementations (e.g. Oracle and PostgreSQL respectively), RDBMSs have flourished for decades. They have powerful features we take for granted:
ACID guarantees: Atomicity, consistency, isolation and durability. Atomicity guarantees transactional scope where the failure of any single operation in a set triggers a rollback. Consistency guarantees data is always in a valid state (through referential integrity, for example). Isolation guarantees multiple clients hitting the same data don’t step on each other. Durability guarantees changes persist across catastrophic failures.
SQL: The standard language (with vendor-specific enhancements) for performing database operations. SQL is relatively easy to learn and supported by virtually every RDBMS.
Ad hoc queries: Queries that have to be run but were never anticipated. Relational databases usually run these quickly.
Commercial support: Available even for open-source options.
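To make the atomicity guarantee concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, its CHECK constraint and the transfer function are hypothetical, invented purely for illustration. The transfer is two updates, and when the second one fails, the first is rolled back with it.

```python
import sqlite3

# In-memory database; the accounts table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "  id INTEGER PRIMARY KEY,"
    "  balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint when src lacks the funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

Calling transfer(conn, 1, 2, 500) fails on the debit, so the credit that already ran is undone and both balances are left untouched. The same constraint also illustrates consistency: the database refuses to store a negative balance.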
Yet RDBMSs aren’t perfect. When data volumes grow large, they generally don’t scale out easily, even with sharding. Database schemas are also notoriously rigid in application development. Changes to a single column reverberate among views, stored procedures and application code. Finally, developers need to resolve the impedance mismatch between relational data and object-oriented code, either manually or with potentially complex object-relational mapping technologies like ActiveRecord in Ruby or Hibernate in Java.
Sometimes features like ACID can be more trouble than they’re worth. Enter document databases like MongoDB and Apache CouchDB – both open source. “Documents” are flexible JSON-like structures (MongoDB stores them in a binary JSON format called BSON) in which child records that would be modeled with relationships in an RDBMS are instead embedded within the documents.
Aside from scalability, the biggest advantage of document databases is simplicity. Once one becomes familiar with JSON, querying is straightforward. Without a predefined schema, data can evolve as needed. Joins are often unnecessary because related data is denormalized into a single document. And because JSON is the most common data format on the Web, developers may be able to use a driver to pass data from the database straight through to the front end.
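As a rough illustration of embedding, the sketch below uses plain Python dictionaries rather than a real database driver. The order and line-item data, the field names and the to_document helper are all hypothetical. It folds child rows into their parent document, adds a field without any schema change, and serializes the result to JSON for a front end.

```python
import json

# Relational shape: the order and its line items sit in separate tables,
# tied together by a foreign key (order_id). All data here is made up.
orders = [{"id": 1, "customer": "Ada"}]
line_items = [
    {"order_id": 1, "sku": "A-100", "qty": 2},
    {"order_id": 1, "sku": "B-200", "qty": 1},
]


def to_document(order, items):
    """Embed an order's child rows directly inside the order document."""
    return {
        "_id": order["id"],
        "customer": order["customer"],
        "items": [
            {"sku": item["sku"], "qty": item["qty"]}
            for item in items
            if item["order_id"] == order["id"]
        ],
    }


doc = to_document(orders[0], line_items)

# No predefined schema: a new field is just another key on this document.
doc["gift_wrap"] = True

# The document is already JSON-shaped, so it can go to the front end as-is.
payload = json.dumps(doc)
```

Everything needed to display the order now arrives in one read of one document, which is why no join is required.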
On the other hand, if records embedded in a document change frequently, they are often factored out into their own documents in a relational manner. The result may be foreign key relationships without ACID guarantees, meaning possible orphan records. And without joins, it can take multiple queries to fetch the required data. As in RDBMS development, great care is needed in document design and indexing.
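The orphan-record risk can be sketched without a real database. Below, two hypothetical collections are simulated with Python dictionaries. The names authors, posts and author_id are assumptions made for illustration. Nothing enforces the reference between them, and reading a post together with its author takes two lookups.

```python
# Two hypothetical "collections", simulated as dicts keyed by document id.
authors = {"a1": {"name": "Ada"}}
posts = {
    "p1": {"title": "Graphs", "author_id": "a1"},
    "p2": {"title": "Documents", "author_id": "a1"},
}


def delete_author(author_id):
    """Remove an author. Nothing cascades to the posts that reference it."""
    authors.pop(author_id, None)


def find_orphans():
    """Return ids of posts whose author_id no longer matches any author."""
    return sorted(
        post_id
        for post_id, post in posts.items()
        if post["author_id"] not in authors
    )


def post_with_author(post_id):
    """Without a join, assembling the full record takes two lookups."""
    post = posts[post_id]                      # first query
    author = authors.get(post["author_id"])    # second query
    return {**post, "author": author}
```

After delete_author("a1"), both posts still carry an author_id that points at nothing. In this design it falls to application code, not the database, to detect and clean up such references.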
Consider document databases when queries can be anticipated, when there is a need to scale, or when preexisting data is updated relatively rarely. There are support options too – especially for MongoDB.
Anyone who watched The Wire might remember the bulletin board used by the Major Crimes Unit to display an evolving org chart for crime syndicates in Baltimore. The hierarchy was determined through analysis of communications and other data. That was the first graph database I ever saw.
With a graph database, data is modeled as a collection of nodes connected by edges – both endowed with attributes. As always, the data model must be optimized for the anticipated queries – for example, when deciding whether certain data belongs in a node or edge.
This is a fundamental shift from RDBMSs. When working with network data (such as SIGINT, financial transactions, or migration patterns), modeling in tables and relationships can be awkward. Much worse, RDBMSs can be quite slow for the kinds of queries that matter on graphs, such as shortest paths, community detection and centrality.
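For a feel of the node-and-edge model, here is a small in-memory sketch in plain Python. It is not how a graph database is implemented, and the people, roles and edge attributes are invented for illustration. It stores attributes on both nodes and edges and answers a shortest-path question with a breadth-first search.

```python
from collections import deque

# Hypothetical network: nodes and edges both carry attributes.
nodes = {
    "alice": {"role": "broker"},
    "bob": {"role": "courier"},
    "carol": {"role": "courier"},
    "dave": {"role": "supplier"},
    "erin": {"role": "lookout"},
    "frank": {"role": "unknown"},  # no known connections
}
edges = [
    ("alice", "bob", {"type": "call", "count": 3}),
    ("bob", "carol", {"type": "call", "count": 1}),
    ("carol", "dave", {"type": "payment", "count": 2}),
    ("alice", "erin", {"type": "call", "count": 5}),
    ("erin", "dave", {"type": "call", "count": 1}),
]


def build_adjacency(edges):
    """Index undirected edges by node so neighbors are one lookup away."""
    adjacency = {}
    for source, target, attrs in edges:
        adjacency.setdefault(source, []).append((target, attrs))
        adjacency.setdefault(target, []).append((source, attrs))
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search: fewest hops from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor, _attrs in adjacency.get(current, []):
            if neighbor in previous:
                continue
            previous[neighbor] = current
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


adjacency = build_adjacency(edges)
```

Here shortest_path(adjacency, "alice", "dave") follows neighbor lists hop by hop and finds the two-hop route through erin. Expressing the same traversal over relational tables generally means a self-join or recursive query for each additional hop.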
Also built upon a mathematical foundation (graph theory), graph databases like the open-source Neo4j are ideal for storing and querying network data. Though other approaches are available, Cypher, the query language developed for Neo4j, is a good choice for querying a graph. It has a steep learning curve, but, in experienced hands, Cypher is a powerful and performant query language.
Like RDBMSs, Neo4j supports ACID transactions and indexing. Commercial support is available, as are drivers for all major programming languages.
Data is the lifeblood of applications. While RDBMSs will always be robust and powerful and perhaps most familiar, follow the advice of lean software development experts Mary and Tom Poppendieck to consider all options to make applications easier to develop and faster to run.
Neil A. Chaudhuri is the founder and president of Vidya. He has well over a decade of experience building complex software projects for commercial and government clients.