Data warehousing tools
Any of these 14 tools can help you extract, validate and load your data
The right products can help you build a foundation for successfully managing data
By John H. Mayer
Special to GCN
Recent developments in data warehousing have run toward the emerging issues of getting information to the Web, developing enterprise information portals and integrating wireless technology. But the basic steps are still a challenge for developers.
The process of identifying, acquiring, transforming and loading data into the warehouse remains the most complex, time-consuming and costly portion of the development cycle.
The difficulties information technology managers face when they tackle this task have been chronicled many times. A recent study by the Data Warehousing Institute of Gaithersburg, Md., for instance, found that almost one-third of IT managers surveyed indicated that their efforts to collect data did not meet expectations.
More often than not, projects fail simply because the data that IT departments want to collect from legacy sources is inaccurate, inconsistent or difficult to acquire in its original form.
This guide focuses on commercial data warehouse tools that perform the data extraction, transformation and loading (ETL) functions.
Industry analysts typically define extraction as the process of identifying and retrieving a set of data from an operational system. Transformation tools play a key role as IT managers integrate data from multiple sources. They permit a system administrator to develop rules for integrating data from different sources and tables to form a single table or entity.
Just as crucial is the process of formatting the output from the transformation process into a form acceptable to the target database's load utility. This typically includes the use of a scheduling mechanism to trigger the loading process.
All of these functions usually are performed by a single set of tools. The tool sets also often support a metadata repository and a scheduling engine.
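The extract-transform-load flow described above can be sketched in a few lines of Python; the source tables, field names and transformation rule here are purely illustrative, not drawn from any particular product.

```python
# A minimal ETL sketch. The two source tables stand in for operational
# systems; the transformation rule joins them into a single entity, and
# the load step formats records for a hypothetical bulk-load utility.

def extract():
    """Extract: pull rows from two operational sources (stubbed as lists)."""
    employees = [
        {"emp_id": 1, "name": "Ada", "dept_id": 10},
        {"emp_id": 2, "name": "Grace", "dept_id": 20},
    ]
    departments = [
        {"dept_id": 10, "dept_name": "Research"},
        {"dept_id": 20, "dept_name": "Operations"},
    ]
    return employees, departments

def transform(employees, departments):
    """Transform: a rule that integrates the two sources into one table."""
    dept_by_id = {d["dept_id"]: d["dept_name"] for d in departments}
    return [
        {"emp_id": e["emp_id"], "name": e["name"],
         "dept_name": dept_by_id[e["dept_id"]]}
        for e in employees
    ]

def load(rows):
    """Load: format rows as delimited records a load utility could accept."""
    return ["|".join(str(r[k]) for k in ("emp_id", "name", "dept_name"))
            for r in rows]

records = load(transform(*extract()))
```

A real tool set adds what this sketch omits: a metadata repository describing the mappings, and a scheduler to trigger the load.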
IT managers facing the prospect of integrating data from multiple sources into a single repository traditionally have had to start by grappling with the build-or-buy question.
There are excellent reasons to build your own ETL system. First, your in-house programming staff is invariably better acquainted with the legacy database and operating environment than any vendor could be. Second, few IT department personnel have the free time to evaluate, compare, select and install commercial tools, let alone learn to properly use them. And finally, few IT budgets today can afford the additional capital expense of the tools, especially for a first-time, proof-of-concept project.
Despite the valid arguments for building your own system, vendors have over the past few years made a convincing case for buying one instead.
The learning curve issue is a good example. Vendors have dramatically shortened the time it takes a developer to get up to speed with an ETL tool by integrating highly graphical and wizard-driven interfaces that guide developers through the implementation process.
Today, developers can construct data flow diagrams to visually model a task and automatically extract, transform, validate and load data without writing a single line of code. And highly intuitive interfaces are simplifying the task of mapping such things as complex enterprise resource planning data to a target database.
At the same time, vendors are extending the scope of their tool sets and putting an end to the days when developers had to write code to string together different functions. Products such as Oracle Warehouse Builder, for example, replace a bevy of individual point tools with a single common environment capable of modeling and design, data extraction, loading, aggregation and warehouse administration.
Formation from Informix Software offers a visual method of designing data transformation jobs. Pricing for Formation starts at $7,500 per processor.
In addition, many tools such as Warehouse Builder, IBM's Visual Warehouse, Computer Associates International Inc.'s DecisionBase and others add extensive metadata management capabilities.
Tool performance isn't taking a backseat, either. Many vendors have recently added enhanced loading capabilities for the leading target databases. New loaders in Hummingbird Ltd.'s Genio Suite, for instance, significantly reduce the time required to manage large volumes of transactions.
Other products, like Formation from Informix Software Inc., integrate parallelism into their architectures to take advantage of all available CPUs. By automatically segmenting a job across multiple CPUs, such tools boost performance while making more efficient use of enterprise resources. Users can even specify parallel settings on individual operators or groups of operators.
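The segmentation idea can be sketched as follows, assuming a hypothetical per-row transform. A thread pool stands in here to keep the sketch self-contained; a CPU-bound job would use a process pool to actually spread work across multiple CPUs.

```python
# A sketch of segmenting one transformation job across workers, in the
# spirit of the autoparallelism described above. The transform is a
# stand-in; ProcessPoolExecutor would replace ThreadPoolExecutor for
# genuinely CPU-bound work.
from concurrent.futures import ThreadPoolExecutor

def transform_segment(segment):
    """Apply a per-row transformation rule to one segment of the input."""
    return [row * 2 for row in segment]

def run_parallel(rows, workers=4):
    """Split the job into segments, process them concurrently, keep order."""
    size = max(1, len(rows) // workers)
    segments = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_segment, segments)
    return [row for seg in results for row in seg]
```

Because `map` returns results in submission order, the output matches what a sequential run would produce, which is what lets a tool segment a job transparently.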
Meanwhile, companies have introduced new techniques to help reduce the data processing workload, a particularly crucial consideration as warehouses sag under mountains of data and an escalating number of transactions.
Products such as ChangeDataMove from BMC Software Inc. let designers reflect incremental updates and changes from the operational transaction processing environment in the data warehouse almost instantly.
By tracking the changed records during the input/output operation, the tool captures changes virtually as soon as they occur. That not only keeps target databases more current and consistent with source databases than they have been in the past, but it also allows the tool set to transform and transport to the target database only the data that has changed, thereby reducing the workload.
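The net effect can be sketched with a simple snapshot comparison, though tools such as ChangeDataMove capture changes during the I/O operation itself rather than by diffing tables after the fact. The dictionaries below are stand-ins for source and target databases.

```python
# A sketch of change-data capture: only rows that are new or modified
# since the last sync are propagated, so unchanged data never has to be
# transformed or transported again.

def capture_changes(source, last_snapshot):
    """Return only the rows that are new or modified since the snapshot."""
    return {key: row for key, row in source.items()
            if last_snapshot.get(key) != row}

def propagate(changes, target):
    """Apply just the changed rows to the target, leaving the rest alone."""
    target.update(changes)
    return target

source = {"a": 1, "b": 2, "c": 3}
snapshot = {"a": 1, "b": 9}      # target last saw b=9 and never saw c
changes = capture_changes(source, snapshot)
target = propagate(changes, dict(snapshot))
```

Only two of the three rows cross the wire, which is the workload reduction the article describes.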
When purchasing an ETL tool, a number of issues are worth considering. Will the tool support the key legacy databases and data types your application requires? Will the tool support performance levels that could easily reach hundreds of transactions per second without adversely affecting traditional operational performance?
Take a close look at the architecture of the tool as well. Tools that offer reusable modules, for instance, often can save considerable development time and effort. If you have already designed a custom in-house process and are looking to replace it, examine where the current commercial product can improve efficiency. Will the purchase of this tool force you to revamp an established process? And if so, what are the costs involved?
At the same time, look for tools that offer performance enhancements, such as loading files into memory for lookup or adding parallelism to maximize performance.
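The in-memory lookup technique can be sketched like this, with illustrative reference data standing in for a real lookup file or table.

```python
# A sketch of the in-memory lookup optimization: load reference data
# into a dictionary once, so each row transformation resolves its code
# with an O(1) lookup instead of a repeated scan or database round trip.

def build_lookup(reference_rows):
    """Load the reference data into memory, keyed for fast lookups."""
    return {code: label for code, label in reference_rows}

def transform_rows(rows, lookup):
    """Resolve each row's code against the in-memory table."""
    return [{"id": rid, "label": lookup.get(code, "UNKNOWN")}
            for rid, code in rows]

reference = [("US", "United States"), ("DE", "Germany")]
rows = [(1, "US"), (2, "DE"), (3, "FR")]
out = transform_rows(rows, build_lookup(reference))
```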
|Tips for buyers|
• Before you decide to build a custom extraction, transformation and loading system, make sure you consider ongoing maintenance and enhancement costs, particularly as the system's environment grows increasingly complex.
• Before purchasing a commercial tool, talk to organizations that have used it in similar environments to judge whether the tool will save enough time to justify the investment.
• Consider requiring that the vendor provide an experienced consultant as a condition of the sale, to help accelerate the learning curve.
• Take a close look at the reusability of modules within the tool and their effect on your productivity.
• If the tool offers parallelism features, take full advantage of them to maximize performance.
Another issue to consider is the support a tool vendor can provide. Although vendors have made their products much easier to use, the tools are still complex software with a substantial learning curve.
One way to prevent the learning curve from undermining your project is to ask the vendor, as a condition of the sale, to provide someone who understands how to design with the tool. Such a consultant will come at a high price, but the investment will be worth it if it guarantees success.
Factor this
Although vendors continually have enhanced their products, a number of factors in your legacy systems will affect how well a tool extracts, transforms or loads data. The availability of network bandwidth, for example, is one issue. Bandwidth bottlenecks can easily slow down the performance of the most optimized tool.
The volume of data to be moved and collected in a project also is crucial. With the rise of the Web and electronic commerce, businesses are continually surprised at how much data they can collect in a short period of time. Yet the larger the prospective data warehouse, the longer and more complex the process of extracting, transforming and loading data.
Other issues, such as hardware configurations, including the amount of memory and the speed of disk performance on the source and target systems, also can influence the speed of the process, as will the type of interconnect used to access source and target data. Will the system have native access or use Open Database Connectivity or some other interface?
Finally, the construction of the target database itself also will have a major impact on data transfer. Are there constraints in the data warehouse in terms of the complexity of the data structure or number of indices? And has the database been tuned to maximize performance?
The task of extracting, transforming and loading the data from legacy sources into a central repository doesn't grow any simpler as the amount of data you collect skyrockets.
But the good news is that there is a wider choice of more capable and comprehensive tools today than there ever has been.
The key to selecting the right tools will lie in your own understanding of your operating environment and the unique requirements you face as you build your data warehouse.
John H. Mayer of Belmont, Mass., writes about information technology.
|Vendor||Product||Platform support||Legacy data sources||Targets supported||Comments||Price|
|BMC Software Inc.|
|ChangeDataMove||NT, Unix, OS/390||IMS, IMS fast path, CICS/VSAM, VSAM batch, DB2||Oracle, DB2 UDB, Sybase, SQL Server||Non-intrusive change capture tool; supports near-real-time transaction-based change data propagation to enable fast and complete synchronization of data||$20,000 up|
|Computer Associates International Inc.|
|DecisionBase||NT, Unix||IMS, VSAM, physical sequential files, DB2, ADABAS, CA-IDMS, enterprise resource planning systems||DB2, OS/390, OS/2, DB2/400, Informix, SQL Server, Oracle, Red Brick, Sybase||Integrates data transformation with enterprise repository; includes metadata management capabilities||$200,000|
|Constellar Corp.|
Redwood Shores, Calif.
|Warehouse Integrator||Client: Win95, NT; Server: NT, HP-UX, Solaris, AIX||All major RDBMSes, other data types through Constellar Hub product||Oracle, Informix, SQL Server, Sybase, CA Ingres, Red Brick, Essbase, PaBLO, MicroStrategy DSSAgent, SQL/MP, ODBC||Dimension reference model supports management of many types of data warehouses across different MOLAP and ROLAP platforms; new version adds native support for Oracle Express and metadata auditing||$100,000 up|
|Aclue Decision Supportware Version 2.4||Client: Win95, NT; Server: NT, Solaris, AIX||Flat files, all major RDBMSes, ODBC||Oracle Financial Analyzer, Oracle Express, Hyperion Essbase, SQL Server 7.0 OLAP Services, all major RDBMSes||Supports three OLAP platforms; includes comprehensive audit trail and drill-back capabilities, and advanced time-server management||$37,500 up|
|Evolutionary Technologies International Inc.|
|ETI*Extract||Client: Win95, NT; Server: NT, AIX, Solaris, HP-UX||Flat files, IMS, VSAM, Oracle, Sybase, Teradata, SQL Server, Informix, SAS, TurboI, IDMS, DB2, SAP R/3||Most major relational databases||Manages batch, legacy, near-real-time and clickstream data; latest release adds ability to control user access to objects in MetaStore, improved administration metadata audit trail||$100,000 up|
|Hummingbird Ltd.|
|Genio Suite 4.0||Client: Win95, NT; Server: NT, Win 2000, Solaris, AIX, HP-UX||All major file formats, relational databases and ERP products||Oracle Express, Hyperion Essbase||Provides universal data exchange; latest version adds views of third-party metadata repositories including Business Objects' Universes and Cognos' Catalogs; also adds enhanced loading capabilities for Oracle8 and NCR Teradata databases||Genio engine: $50,000 for NT, $75,000 for Unix; $20,000 for design tools|
|IBM Corp.|
|Visual Warehouse||AIX, Solaris||DB2, Oracle, Informix, SQL Server, CICS/VSAM, IMS||DB2, Hyperion Essbase||Program for large and small businesses provides a managed OLAP environment and Web-enabled infrastructure; comes bundled with DB2 UDB||$36,950 up|
|Informatica Corp.|
Palo Alto, Calif.
|PowerMart 4.6||Client: Win95, NT; Server: NT, HP-UX, AIX, Solaris||DB2, flat files, IMS, Informix, SQL Server, Access, Oracle, Sybase, VSAM, ODBC||Flat files, Informix, SQL Server, Access, Oracle, PeopleSoft ERP, SAP Business Info Warehouse, Sybase||Integrated suite of products includes a deploy folder wizard to guide developers, source-extraction mapping templates, flat-file acceleration, advanced session management and multilevel security||$45,375 up for NT; $66,000 up for Unix platforms|
|Informix Software Inc.|
Menlo Park, Calif.
|Formation||NT, Solaris, HP-UX||Informix, Oracle, SQL Server, ODBC, text files, EBCDIC files, binary files||Informix, Oracle, SQL Server, ODBC to others||Integrated autoparallelism automatically segments jobs across multiple CPUs; includes user-specifiable parallel settings and native loaders for Informix, Oracle, SQL Server||$7,500 up per processor|
|DataStage||NT, Unix||Flat files, VSAM, SQL Server||SQL Server||Former Ardent Software product provides an automated workflow environment with reusable components||$47,500 up|
|Oracle Corp.|
Redwood Shores, Calif.
|Warehouse Builder||Client: NT; Server: Oracle8i||Oracle, DB2 via gateway, ODBC, flat files, IMS, VSAM, Oracle ERP, SAP R/3||Oracle8i, 8i Catalog, Discoverer and Express||Integrates modeling and design, data extraction, movement and loading, aggregation, metadata management and tool integration into single solution; has a wizard-driven interface and is tightly integrated with Oracle8i; integrates metadata via Common Warehouse Model||$25 per Universal Power Unit (for UPU, multiply MHz speed of Intel chip by 1; multiply speed of RISC chip by 1.5)|
|SAS Institute Inc.|
|SAS/Access||NT, Unix||DB2, SQL Server, Oracle, Teradata, SAP R/3, Baan, flat files, text files, binary files||Most file structures||Supports a variety of file structures||$31,664 up for server license with OLAP tool|
|Taurus Software Inc.|
Palo Alto, Calif.
|DataBridge||Client: Win95, NT; Server: NT, HP-UX, AIX, Solaris, UnixWare, SGI||MPE, Oracle, Image, KSAM, Allbase, fixed files, flat files, text files, CSV files, freeze files||Unix, Oracle, fixed files, flat files, text files, CSV files, freeze files, DB2, Informix, SQL Server||ETL tool with portable scripting language and a GUI that simplifies script development||$29,000 to $79,000|
|BridgeWare||Same||MPE, Oracle, Image, KSAM, Allbase, fixed files, flat files, text files, CSV files, freeze files||Unix, Oracle, fixed files, flat files, text files, CSV files, freeze files, DB2, Informix, SQL Server||Adds real-time change-detection component||$20,000 to $44,000|