Agency Award: IRS | Terabytes at their fingertips
2007 GCN Award: The IRS' update of the Compliance Data Warehouse makes analysis less taxing

GRAND CENTRAL: Jeff Butler says centralizing data lets analysts 'address a wide range of challenging questions.'
Zaid Hamid

ONE-STOP SHOP: Employees at the Office of Research, Analysis and Statistics have integrated more than 150T of data into the warehouse.
Taxpayers who have trouble getting all of their tax information together can file for extensions, dragging out the filing process until the final deadline of Oct. 15. Internal Revenue Service analysts know how they feel: until recently, getting the data needed to answer taxpayers' questions could take just as long.
'Going through the request-for-information process would take six to eight months,' said Stan Griffin, research chief at the agency's Wage and Investment Research Division in Atlanta. But with the expanded capabilities of the IRS' 150-terabyte Compliance Data Warehouse, analysts can now get what they are looking for in a matter of hours or days.
'The CDW has made it possible for us to get current data rapidly to answer the business questions that are on our customers' minds,' he said.
Expansion plan
The IRS has assembled a massive amount of taxpayer data over the years and created the CDW to give research analysts a single, consistent view of the data.
'The CDW captures and integrates data from multiple legacy data sources, each with its own platform, production environment, storage format, naming convention and authorization policy,' said Jeff Butler, director of research databases. 'By standardizing data in a centralized location for researchers, the IRS is better positioned to effectively address a wide range of challenging questions, including those relating to channel preferences, program or treatment effectiveness, tax gap estimation, risk classification, workload optimization and trends in compliance.'
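As a rough illustration of what standardizing naming conventions across sources can involve, here is a minimal Python sketch. The source names, field names and values are hypothetical; the article does not describe the CDW's actual schemas or tooling.

```python
# Hypothetical mappings from each legacy source's field names to one shared
# naming convention. None of these names come from the IRS or the CDW.
SOURCE_FIELD_MAPS = {
    "legacy_a": {"TXPYR_ID": "taxpayer_id", "TAX_YR": "tax_year", "AMT_DUE": "balance_due"},
    "legacy_b": {"tpid": "taxpayer_id", "yr": "tax_year", "bal": "balance_due"},
}


def standardize(record, source):
    """Rename a record's fields to the shared convention, dropping unmapped fields."""
    mapping = SOURCE_FIELD_MAPS[source]
    return {mapping[name]: value for name, value in record.items() if name in mapping}


# Two records that describe the same kind of fact under different conventions.
rows = [
    standardize({"TXPYR_ID": "A1", "TAX_YR": 2006, "AMT_DUE": 120.0}, "legacy_a"),
    standardize({"tpid": "B2", "yr": 2006, "bal": 75.5}, "legacy_b"),
]
```

After this step both records carry the same field names, so a single query can run across them. A real integration layer would also have to reconcile storage formats and authorization policies, which this sketch leaves out.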
The IRS created the initial version of the CDW in the mid-1990s. By 1998, it held 1.2T of data, which has grown more than a hundredfold since. During the past three years, the agency has updated its technology while expanding from roughly 20T of data to more than 150T of data.
Moving all of those terabytes from distant data centers into the warehouse and keeping it up to date posed a challenge. The tapes the IRS had been using to transport the data held only 2G each. The CDW staff routinely had to copy, ship and load hundreds of tapes to keep the warehouse current. In addition, the tapes didn't support encryption, so all that taxpayer information was flying around the country unprotected.
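The scale of the problem is easy to check with back-of-the-envelope arithmetic. The short Python sketch below uses only figures quoted in the article (1.2T in 1998, 150T now, 2G per tape) and assumes decimal units, with 1T equal to 1,000G, which the article does not specify.

```python
import math

GB_PER_TB = 1000  # assumption: decimal units; the article does not say


def tapes_needed(data_tb, tape_gb=2):
    """Number of tapes of tape_gb gigabytes needed to carry data_tb terabytes."""
    return math.ceil(data_tb * GB_PER_TB / tape_gb)


# Growth from the 1998 figure to the current one: roughly 125-fold,
# consistent with "more than a hundredfold."
growth_factor = 150 / 1.2

# Tapes required to move a single terabyte at 2G per tape.
tapes_per_terabyte = tapes_needed(1)
```

At 2G per tape, each terabyte takes about 500 tapes, which shows why routine updates meant handling tapes by the hundreds.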
So last year, the agency replaced tape with 2T network-attached storage appliances. These NAS devices are about the size of a tissue box but hold the equivalent of 1,500 tapes. The appliances are shipped from the remote sites to the CDW, where someone attaches them to the network and extracts the data. Since the NAS devices have 256-bit encryption, the data is secure while in transit. Using an appliance rather than 1,500 tapes saves approximately two weeks' staff time on each update and millions of dollars during a five-year period, officials said.
Once the data arrives, it needs to be integrated into the warehouse. No single tool can handle all the types and formats that get sent to the data warehouse. The CDW uses assembly language to access older mainframe data.
'In most of today's data management environments, data integration is just one part of a more comprehensive strategy,' Butler said.
The CDW is one of the IRS' largest online repositories of searchable metadata, and it includes data definitions, lookup tables, profiles and other database artifacts. More data features are on the way.
'CDW is owned and managed by the Research, Analysis and Statistics organization,' Butler said. 'This means that new data, hardware, software and processes can be integrated in significantly less time than would be the case if it were managed in an enterprise IT production environment.'
By early 2008, an expansion covering the subject areas of filing and payment compliance will go online. Butler and his team will also be working to improve data quality and expand Web services during the next three years.
'Automated profiling, record matching, rules engines, and monitoring devices are playing an increasingly important role in mature data environments like CDW's,' he said.
Setting an example
While Butler has his eyes on how much better the CDW can become, others are already singing the recent update's praises.
'The CDW is unique due to the requirement for highly flexible queries against a very large data warehouse,' said Mike Daily, Mitre's project manager for the CDW expansion, who is helping the IRS develop detailed models of the CDW data model and architecture.
With the easier access to more data, the number of researchers using the system has already expanded eightfold during the past two years to more than 300.
'CDW has really become more of a one-stop shop for tax compliance data,' said Thomas Mielke, an economist doing small-business and self-employed research for the IRS in St. Paul, Minn. Locating specific tax compliance data used to be one of the primary business problems faced by his office. Now they can access most of the data needed through the CDW.
In addition to the speed and ease of access, Griffin said his researchers love being able to go to a centralized source that provides them accurate and consistent data instead of having to pull from various inconsistent sources.
'It is very well-organized,' he said. 'Everyone should consider modeling their platforms after what the CDW staff is doing.'