What is your e-mail address?

My e-mail address is:

Do you have a password?

Forgot your password? Click here
close

    The database of databases

    Evolution in warehousing means better information sharing

    Related Links

    Data warehouses

    Illustration by Bruno Budrovic

    The Internet and IP-based networks have gone a long way toward removing barriers between agencies. To further a spirit of cooperation, agencies are also beginning
    to share information from disparate databases by sending some of that data to common data warehouses, where it can be merged, queried and analyzed.


    The Transportation Department's Bureau
    of Transportation Statistics, for example,
    has created the TranStats portal
    (www.transtats.bts.gov).


    It lets researchers, analysts or citizens
    view, search, run reports and generate
    graphics from 400 transportation-related
    tables in over 100 databases (one of them
    with more than 250 million rows) kept by
    different departments and agencies.


    'The Bureau of Transportation Statistics
    has three primary goals for
    TranStats,' said TranStats data team
    manager Cheryl Young. 'The first is to assist
    public users in obtaining data easily
    and efficiently, the second is to support
    public and internal users who perform
    online analysis and the third is to offer
    data to internal users to perform advanced
    statistical analysis using third-party tools.'


    But setting up a data warehouse is more
    involved than building a traditional database.
    'We learned that data collection and
    presentation involves much more than
    technology,' Young said. 'It is difficult to
    change the concept of data stewardship.
    We would like to have focused much earlier
    on this issue so that the technical
    progress can follow.'


    Common characteristics

    While they share many common characteristics,
    a data warehouse is different in design
    and function from a production database.
    The primary difference is whether the
    system will be used for online transaction
    processing or online analytic processing.
    Most production databases are primarily
    for OTLP and require high-speed access to
    individual data records. A motor vehicles
    department, for example, would need an
    OLTP database to let clerks quickly access
    the record of an individual car, determine if
    a registration is current and update data.


    The TranStats data warehouse, on the
    other hand, requires a database management
    system designed for OLAP. This affects
    the hardware needed to support the
    data warehouse, as well as the DBMS itself.
    TranStats runs on four Sun Solaris
    servers and a 2.5TB Sybase IQ Database
    Management System from Sybase Inc.
    Young said query speed, not transaction
    speed, led to selecting Sybase for this application.


    'Our data changes monthly or quarterly,
    but our queries must be performed within
    a split second,' Young said. 'Sybase IQ is
    much faster than other databases and the
    only one that can meet our dynamic Web
    presentation criteria.'


    In addition, data warehouses extract information
    from a number of databases, so
    they tend to be much larger than any one
    of their sources.


    'Scalability becomes a bigger issue with
    a data warehouse, as does the way in
    which the DBMS implements parallel
    queries because of the heavy query nature
    of the data warehouse,' said Craig Mullins,
    consultant and author of Database Administration:
    The Complete Guide to Practices
    and Procedures (Addison-Wesley
    Professional, 2002).


    Data mart or data warehouse?

    There are two main approaches an organization
    could take in setting up a data warehouse.
    Bill Inmon, considered the father of
    data warehousing, favors a top-down approach.
    The organization decides what it
    wants to contain in the central warehouse
    and how that data will be used. Lower-level
    analytic databases, or 'data marts,' then
    draw their data from the central warehouse.
    This method ensures consistency of
    data throughout the organization.


    The other method is a bottom-up approach.
    The data marts are set up first to
    meet specific business needs, and these feed data into the warehouse. This allows
    the organization to start realizing a return
    sooner. Whichever method is chosen, the
    overall data warehouse design should be
    decided early in the process.


    'In setting up a data warehouse, it is important
    to have the data model correct
    right from the get-go,' said Nathaniel
    Booth, IT specialist at the Geological Survey
    Data Center in Middleton, Wis. 'You
    have piles of applications that will depend
    on that data going forward, so it is really
    hard to change later.'


    USGS hired a consultant for the initial
    design of the data warehouse and early
    business intelligence deployments. In addition
    to settling on the data format, one
    must also select the DBMS. The decision is
    complicated by the fact that, in most cases,
    the organization already has some sort of
    DBMS in place, and any new system must
    be compatible with what already exists.


    In addition, it requires interoperability
    with systems at any other agency that will
    be feeding data into the data warehouse.


    'Assuming you already have a database,
    which almost everyone in government
    has, just buy more of what you have and
    do a sole-source acquisition,' advises John
    Kost, analyst for Gartner Inc. of Stamford,
    Conn. 'The risk in any procurement is
    [that] you are going to end up buying
    something you hadn't expected and don't
    have the internal skill sets to support.'


    That was the approach USGS took in setting
    up a central data warehouse for collecting
    water quality data. Water doesn't
    respect legislated boundaries. A pollutant
    dumped in a stream in one city will flow
    quickly into another municipality's water
    system without stopping at the city limits,
    finally winding up in a lake or ocean.


    Conversely, USGS takes data streams
    from state, local and federal agencies nationwide
    and aggregates them in Oracle
    databases at its data center. Researchers
    can then conduct browser-based queries
    through a flexible interface.


    The largest database contains the National
    Water Quality Assessment, with
    more than 11 million water and biological
    test results.


    'We stayed with Oracle because of its
    support for a wide array of functionality
    including Oracle spatial, text search and
    indexing,' Booth said. 'We had to bring in
    data from SQL Server and Ingres databases
    as well as text data.'


    The data center started with Oracle 7.3
    and now runs three instances of Oracle's
    10g database on Red Hat Linux.


    'We try to stay on the most recent version
    of the database,' Booth said, though
    this sometimes introduces problems.
    'Sometimes the new version doesn't support
    the applications you are running,
    and previous queries don't always work
    right, so just dealing with the change
    management through the versions of the
    database is somewhat of a challenge.'


    He also said ensuring all the data follows
    a consistent format is an ongoing effort.


    'Vendors still haven't come to us with a
    standardized metadata
    repository, so we have a
    difficult time having to
    manage the metadata
    about multiple products,'
    he said. 'In hindsight, it
    is nice to pick ones that
    play well with metadata.'


    'We learned that data collection and presentation involves much more than technology. It is difficult to change the concept
    of data stewardship.'
    'CHERYL YOUNG, TRANSPORTATION DEPARTMENT


    However, when one is depending on an
    array of entities with their own legacy systems
    and unique data needs, managing
    the data warehouse means constantly negotiating
    with all interested parties.


    Sonoma County, Calif., for example, set
    up an IBM DB2 Geographic Information
    Systems data warehouse for use by
    all agencies, including agricultural inspectors,
    tax assessors, the building department
    and emergency services. The
    county had already been using ESRI
    software for some of its applications, but
    there was no standard across the county.


    'We were looking for a central repository
    to share information, so people didn't have
    to guess which version is most up-to-date,'
    said GIS manager Tim Pudoff. 'By having
    a central database where people can publish
    their data, we can provide users with a
    one-source shopping point.'


    The GIS staff has real-time access to add
    or update information in the warehouse,
    and participating agencies periodically
    send automatic or manual feeds from their
    own servers. The system is overseen by a
    technical advisory committee with representatives
    from 16 agencies, who provide
    requirements and buy-offs on any upgrades
    or new services. In such a situation, Pudoff
    emphasized, it is vital to have an executive
    champion to help manage the competing
    interests of the different parties involved.


    'You need someone who can help you
    wade through the bureaucratic
    trials,' he said.
    'I have that from my
    manager, and it made all
    the difference.'


    Selecting the right
    hardware and software
    for a data warehouse involves
    more than just looking at its technical
    prowess or initial cost.


    Looking ahead

    'It is almost irrelevant what the vendor
    tells you the price is,' Kost cautions. 'You
    need to figure out what are the component
    parts of the implementation, including
    your own costs.'


    This includes the hardware, services in
    setting up the database, creating links to
    other databases, applications or agencies,
    and ongoing support.


    Although an initial database purchase
    might be geared to a single use, Mullins
    advised breaking out the hardware, software
    and services components and examining
    them separately, even if the entire
    contract does wind up going to a single
    vendor.


    'I would split out the hardware and
    programming services from the DBMS
    unless this particular DBMS is going to
    be used only by this one project,' he said.
    'A DBMS is designed for sharing data
    across multiple applications, and you
    don't have to always have a brand-new
    DBMS for every application.'


    In evaluating the DBMS, Mullins said to
    look at the analytical and reporting features,
    not the transaction speed.


    'An RFP for a data warehouse should
    focus on analytical processing instead of
    transactional processing,' he said. 'The reporting
    features embedded with the
    DBMS become much more important, as
    does the ability to manage very large
    amounts of data. A DBMS with embedded
    ETL [extract, transfer, load] features will
    also move to the head of the pack of data
    warehouse RFPs, because warehousing
    implementations require heavy data
    movement from operational databases to
    the warehouse.'


    One also has to consider who is going to
    support it once the consultants wrap up
    the project, which was one factor in USGS'
    decision to stick with Oracle rather than a
    product with a smaller installed base and
    local support network.


    'For us, it is important to have face time
    with the consultants,' Booth said. 'It is
    difficult, when you are not in a major metropolitan
    area, to find skills in these very
    specific technologies.'



    DATA WAREHOUSING





    Drew Robb is a freelance technology writer in Los Angeles.

    Reader Comments

    Please post your comments here. Comments are moderated, so they may not appear immediately after submitting. We will not post comments that we consider abusive or off-topic.

    Your Name:(optional)
    Your Email:(optional)
    Your Location:(optional)
    Comment:
    Please type the letters/numbers you see above

    GCN eNewsletters

    eSeminar