Data warehousing tools



The term data warehousing was coined just six years ago by W.H. Inmon, a noted database
expert and founder of Pine Cone Systems Inc. of Edgewood, Colo. Though widely used,
it’s also largely misunderstood. Before we look at specific types of data warehousing
tools, let’s clarify a few terms.


A data warehouse is a large collection of data from multiple sources within an
organization. The data is used to support enterprise decision-making, forecasting and
planning. It is the end use of that data—the support of enterprise-level
decision-making—that defines a data warehouse.


A related, sometimes misused, term is data mart, which is a smaller, more localized
version of a data warehouse. A data mart is designed to serve a division, department,
location or other entity within an enterprise.


Typically, agencies will implement a data warehouse by assembling data from legacy and
operational database systems throughout the enterprise, although the warehouse itself is
not a production database. Because the original data comes from multiple sources in a
variety of usually incompatible formats, a lot of processing is needed both to create the
data warehouse and to use it.


Hundreds of products are at play in the data warehousing market, which today may be
second only to the Internet as a fertile ground for marketers. Products range from
all-in-one integrated suites, which, their makers claim, can handle the entire data
warehousing job—compilation, management and analysis—to specialized niche tools
that do only one job. Broadly, these specialized tools form three groups:


A data warehouse is an enterprise database that’s gone about as far as it can go.
To take it to these ends, you’re going to need more than a few software tools.
We’re talking about serious computer hardware and massive amounts of online storage,
sometimes measured in multiple terabytes and, in some cases, conceivably even exabytes.


The database management system component of data warehousing is not fundamentally
different from any other database.


According to the Meta Group Inc. of Stamford, Conn., most data warehouses today are
managed with standard Unix relational DBMSes from Sybase Inc., Oracle Corp., IBM Corp. or
Informix Software Inc. of Menlo Park, Calif. It is worth noting, however, that some
DBMSes, such as Informix’s Red Brick Warehouse, are designed for data warehousing.


In this Buyers Guide, I will concentrate on tools for assembling and converting the
data from its original sources, and products that perform sophisticated queries and
analyses.


I’ve also eliminated those products oriented toward smaller-scale projects. For
example, if a particular server is specified to run on Microsoft Windows 95 and use
Microsoft Access as its main DBMS, it’s just not a realistic candidate for managing
anything that should be called a data warehouse. Database, yes; data mart, possibly; data
warehouse, no way.


When creating a data warehouse, it’s almost inevitable that you’ll get data
from a lot of systems that don’t usually talk to one another—assorted
mainframes, distributed departmental-level systems such as IBM AS/400s, servers running a
variety of Unix flavors on a wide variety of hardware, and even from LAN-based data on
Novell NetWare, Banyan Vines or Microsoft Windows NT systems, and smaller, proprietary
systems, too.


And you may well be getting data from outside sources: customers, suppliers, and state
and federal agencies.


Assembling data from such multiple sources requires a serious effort at conversion,
both in making the physical transfers workable and in mapping the data structures into
something that will handle all the incoming data. The second of those two jobs is where
the greatest amount of work is involved.


Keep in mind that you may have to accommodate both fixed-length data fields and
records—those 80-column card systems persist—and variable-length fields. Data
from multiple-table relational databases, maybe normalized and maybe not, will have to be
matched with data from hierarchical network-based and object-oriented databases.  


For example, you’ll encounter a variety of date structures that have to be
reconciled over and above year 2000 issues. Expect also to encounter different accounting
systems, different ways of handling null entries, mismatched file-naming conventions and
duplicated part numbers—everything except a simple answer.


Before you can select a product, you need to create a comprehensive inventory of all
the machines, physical locations, DBMSes and networks from which you will get the data to
populate the warehouse. And you’ll also have to figure out how much data will have to
be moved and how often.


After you’ve done such an inventory, you’re ready to look at some of the
acquisition tools in the accompanying table. Given the uniqueness of each organization and
its warehousing needs, there’s no shortcut to doing your homework and seeing just
which products might be useful in your situation.


OK, you’ve collected the data from all your sources, massaged it so that you can
make comparisons, and it’s all assembled in a database managed by a good DBMS. Now
what?


Most queries that are run against databases outside of warehouses are specific, so
you’re looking for definite information that’s presented in a definite form.
That’s fine as far as it goes, but it’s generally not enough to justify having a
data warehouse.


The value of warehousing lies in its ability to let you discover information you
didn’t know you had, explore new possibilities, follow a line of reasoning wherever
it takes you, and do what-if analyses and create new scenarios. For this work, you need a
package that slices and dices the data in as much detail and as many dimensions as you
need.


The process is generally called OLAP, but other related terms are data mining, drilling
down, or, for one vendor, data diving. One factor that differentiates these OLAP tools
from query and reporting packages is that you use them interactively.


With most OLAP products, when you want more detail, you just double-click on a category
to drill down. Or you can show data as a graph by simply clicking on a button, or bring in
a new data dimension with a click or two. Equally important, you can usually save the OLAP
queries in a form that lets others continue the interaction to extract even more
information from the data.


Using OLAP tools, you often can discover correlations and relationships in your data
that you didn’t even suspect existed. You can analyze trends and population factors
more accurately and in finer detail, look at time-series and statistical factors, and view
different types of graphical representations and visualizations to see what they reveal.


Some OLAP tools are designed to work directly with warehouse databases on networked
workstations, while others are designed to download pieces of a data warehouse onto a
desktop PC and do most processing and analysis locally.


Establishing and maintaining a data warehouse is one of the biggest information
technology projects your organization may ever tackle, short of a complete re-engineering
of procedures. It’s mandatory that you put a lot of time into investigating
alternatives and products to find which ones will serve you best. This guide is only a
starting point—a product that isn’t listed here may be the right one for your
organization.


More information is available on the Web. Three good starting points are: Data
Warehousing at http://www.datawarehousing.com,
  the Data Warehousing Institute at http://www.dw-institute.com,
and the International Data Warehousing Association at http://www.idwa.org/.


The following books also can help:


Russell Kay writes about and reviews software and hardware in Worcester, Mass. E-mail
him at russellk@bix.com.

inside gcn

  • analytics (Wright Studio/Shutterstock.com)

    3 data strategies to help crackdown on internal corruption

Reader Comments

Please post your comments here. Comments are moderated, so they may not appear immediately after submitting. We will not post comments that we consider abusive or off-topic.

Please type the letters/numbers you see above