Learn to speak the language of this blossoming technology field

Not sure of the difference between a data mart and a data warehouse? You’re not
alone.


Federal users who look to data warehousing for information retrieval must struggle with
a new lexicon. As they set up and learn to use the multimillion-dollar warehouses,
they’re often perplexed by shades of meaning.


“You have to be alert because there are so many overlapping definitions and slight
differences in meaning,” said Wendel Yale, a technical manager at Oracle Corp.


A data mart, also called a data store, typically contains several databases specific to
one department or function. Together, multiple stores, or subsets of data, make up a data
warehouse.


But a warehouse is not simply a consolidation and replication of databases from many
departments.


A true data warehouse supplies tools and procedures that let users manipulate the data
across the different marts to get the information they need for making decisions. This
breadth and flexibility is what makes the data warehouse different from the data mart.


The technological building blocks of a data warehouse include relational database management systems (RDBMSes) and online analytical processing (OLAP) tools.


Variations include database OLAP (DOLAP), an RDBMS designed to work with OLAP, and WOLAP, a Web browser-based OLAP tool.


The point of entry for the warehouse user is the SQL query. The user can write an ad
hoc, or self-structured, query; choose from preset lists of elements to construct a query;
or select a default query prewritten by the warehouse programmers.
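As a sketch of that point of entry, here is an ad hoc query run through Python's built-in sqlite3 module. The table and column names are made up for illustration; they come from no particular federal system.

```python
import sqlite3

# A hypothetical in-memory data mart standing in for one warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (region TEXT, year INTEGER, tons REAL)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?)",
    [("Northeast", 1996, 120.5), ("Northeast", 1997, 98.0), ("South", 1996, 75.2)],
)

# An ad hoc query the user writes directly, rather than picking from a preset list.
rows = conn.execute(
    "SELECT region, SUM(tons) FROM shipments GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('Northeast', 218.5), ('South', 75.2)]
```

A preset or default query would be the same statement stored by the warehouse programmers and filled in with the user's chosen elements.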


How data is arranged—the data model—affects how fast answers to queries are
returned and how valuable the data is to different users.


For example, a data model known as a star schema can greatly speed multidimensional
queries, such as how many instances of a particular set of circumstances occurred over a
three-year period. Imagine the large center of the star containing all the facts and
figures, represented as a huge fact table. Surrounding it, like points on a star, are
query elements such as time, infrastructure specifications and geographical location, each
represented by another table.


Under the star schema data model, an answer to a query would draw criteria from the
star points and populate the answer with data from the star’s central table.
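The star shape can be sketched in miniature with sqlite3. The fact table sits at the center; the time and location tables are the star's points, and the query draws its criteria from them. All names and figures here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Center of the star: a fact table keyed to the surrounding dimension tables.
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_location (loc_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE fact_incidents (time_id INTEGER, loc_id INTEGER, count INTEGER);
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?)", [(1, 1995), (2, 1996), (3, 1997)])
conn.executemany("INSERT INTO dim_location VALUES (?, ?)", [(1, "CT"), (2, "MA")])
conn.executemany(
    "INSERT INTO fact_incidents VALUES (?, ?, ?)",
    [(1, 1, 4), (2, 1, 7), (3, 2, 2), (3, 1, 5)],
)

# A multidimensional query: how many incidents in Connecticut over three years?
total = conn.execute("""
    SELECT SUM(f.count)
    FROM fact_incidents f
    JOIN dim_time t ON f.time_id = t.time_id
    JOIN dim_location l ON f.loc_id = l.loc_id
    WHERE l.state = 'CT' AND t.year BETWEEN 1995 AND 1997
""").fetchone()[0]
print(total)  # 16
```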


All the warehoused data must conform to the data model, and designers have several
models from which to choose. A snowflake model is essentially a complex star, with the
star's points further normalized into subtables. A cube represents data from different
databases in precalculated views, such as you might see in a 3-D spreadsheet. An
entity-relationship model represents files, consisting of tables, and the relationships
among them.
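The cube's trick of precalculating views can be sketched in a few lines of Python. For every grouping of two hypothetical dimensions, the totals are computed up front, so a later lookup is instant. The products and figures are invented.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, units).
facts = [("elm kit", "NE", 3), ("elm kit", "South", 2), ("spray", "NE", 5)]

# Precalculate a view for every grouping of the two dimensions,
# the way a cube stores answers before anyone asks the question.
cube = defaultdict(int)
groupings = [(), ("product",), ("region",), ("product", "region")]
for product, region, units in facts:
    for dims in groupings:
        key = tuple(v for d, v in (("product", product), ("region", region)) if d in dims)
        cube[(dims, key)] += units

print(cube[(("region",), ("NE",))])  # 8 -- all NE units, already summed
print(cube[((), ())])                # 10 -- the grand total
```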


“One person might look at the data as a salesman, another might look at the data
as a manager,” Yale said. “While they’re looking at the same data,
they’re looking at it from different dimensions.”


Data visualization can show in one image what would be less clear in a thousand words.
Make one of the points on your star schema a geographical information system database,
make a second point a database of satellite photographs, make another point a database of
vegetation data, add visualization tools and run a query. In seconds, megabytes of images,
photographs and flat files transform into a photogrammetric map showing all recorded
instances of, say, Dutch elm disease in southern New England.


A further way to query a data warehouse is by data mining. Sometimes referred to as
KDD, or knowledge discovery in databases, data mining extends the warehouse’s value.
A user need not ask a specific question to get a useful answer. Instead, the user
describes a concept or an idea, then launches a search.


Data mining uses statistical algorithms to comb the data for patterns and relationships
that users might not have known about. The result of the data mining query is presented
not as an answer, but as ideas or a list of items that might be used for a more specific
query.
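In miniature, that kind of pattern discovery amounts to counting which items turn up together without being asked about any one of them. The records below are invented stand-ins; a real mining tool applies far more sophisticated statistics.

```python
from itertools import combinations
from collections import Counter

# Hypothetical records; the user has described no specific question in advance.
records = [
    {"oak", "drought", "beetle"},
    {"elm", "drought", "fungus"},
    {"elm", "fungus", "beetle"},
    {"elm", "fungus"},
]

# Count co-occurring pairs across all records to surface patterns.
pairs = Counter(
    pair for rec in records for pair in combinations(sorted(rec), 2)
)

# The result is a ranked list of candidate patterns, not a final answer.
print(pairs.most_common(1))  # [(('elm', 'fungus'), 3)]
```

The top-ranked pair then becomes grist for a more specific query, exactly as the article describes.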


All of this presupposes that the data has already been extracted from the various
databases, transformed and converted into data marts and the data warehouse.


Databases hold dozens of different file formats in dozens of different programs running
under dozens of different operating systems. Data transformation makes them all accessible
by the warehouse user.


Database content within an organization often overlaps. The same person could appear in
one database as Doe_J., in another as Doe, Jane K., and in two instances as Doe, Jnae.
Data scrubber software flags the redundancies along with spelling inconsistencies,
outdated information and other discrepancies.


The data is considered normalized when it has been validated and any discrepancies
reconciled. Unless data has been normalized, queries will return distorted results.
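A toy version of what a scrubber flags can be written with Python's standard difflib. The threshold and the name list are arbitrary choices for the sketch; commercial scrubbers use much richer matching rules.

```python
from difflib import SequenceMatcher

# Hypothetical name fields pulled from different databases.
names = ["Doe_J.", "Doe, Jane K.", "Doe, Jnae", "Smith, Al"]

def similar(a, b, threshold=0.7):
    """Flag likely duplicates by raw string similarity; a real scrubber does far more."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Compare every pair once and collect the suspects for human review.
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similar(a, b)
]
print(flagged)  # [('Doe, Jane K.', 'Doe, Jnae')]
```

Note that "Doe_J." slips past this simple threshold, which is why flagged records go to a person for reconciliation before the data is considered normalized.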


To speed searches, a warehouse has a repository of metadata, a sort of Yellow Pages.


Metadata, literally data that describes other data, identifies such things as the
data’s source, how it is mapped and its reliability ranking. Some data warehouse
software lets users query the metadata to find out when data was last updated or how it
was transformed to go into the warehouse.
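A metadata lookup of that sort can be sketched as a small catalog. The entries here are fabricated; a warehouse keeps this information in its repository rather than a Python dictionary.

```python
# Hypothetical metadata catalog: data about the data.
metadata = {
    "shipments": {
        "source": "logistics RDBMS",
        "last_updated": "1997-06-30",
        "transformation": "tons converted from pounds",
        "reliability": "high",
    },
}

def describe(table):
    """Answer a metadata query: where did this table come from, and when?"""
    info = metadata.get(table)
    if info is None:
        return f"no metadata recorded for {table!r}"
    return f"{table}: from {info['source']}, last updated {info['last_updated']}"

print(describe("shipments"))
# shipments: from logistics RDBMS, last updated 1997-06-30
```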


If you were to take the Yellow Pages directory, duplicate and divide it, translate some
chunks into other languages, reorder some sections by location instead of topic and keep
each piece in a different place, you would have some idea of how different data warehouse
tools store metadata and why getting a fast—or any—answer isn’t always
easy.


For more on data warehousing, check out a white paper at http://www.redbrick.com and a helpful data warehousing
glossary at http://www.techguide.com/index1.html.


