Before you mine, refine

Profiling helps manage mystery data

The lowdown

  • What is it? A data mining tool is one of many software programs that can search databases to uncover patterns and relationships, and perform tasks as varied as predicting users' needs, visualizing trends, integrating data in disparate formats and extracting and cleansing data.


  • What are the essentials in preparing for data mining? Knowing the size, location and format of the existing data is most important. Data mining needs quick, easy access to consistent, accurate data.


  • What preliminary steps are necessary before data mining? You must first acquire the original data, understand its formats, cleanse it of inconsistencies and inaccuracies, move it to the target platform, and transform it into accessible formats.


  • What other considerations are important? Know how the data mining system will integrate with existing systems and, if necessary, with the Internet.


  • Must-know info? The more steps of the process a package can perform without degradation of performance, the better. Look for products that can handle multiple formats and perform their functions automatically, although manual configuration for special situations can be important.

  • At the Civilian Health and Medical Program of the Uniform Services, data mining uncovered patterns suggesting fraud and abuse, which administrators were able to resolve, saving millions of dollars.

    Data mining tools help position your information resources

    Data mining is amazing technology. It lets you dig through mountains of information to turn up nuggets of previously unknown, but valuable, relationships in data.

    And nobody has more mountains of information than the federal government, for which data mining can tap vast stores of information to reveal patterns, expose inconsistencies and suggest solutions to problems.

    But the vastness of data available, and the variety of formats and structures of the data, can stop data mining in its tracks.

    Systems chiefs do have one thing on their side: Unlike mining for, say, coal or iron ore, you can refine before you mine. That is, you can convert data into a form that makes mining faster and more efficient. And there are tools to automate this process, giving data mining wider scope and greater depth than ever before.

    Government agencies use data mining for many purposes. In one example, a successful program at the Civilian Health and Medical Program of the Uniform Services, the health care system for military personnel and their dependents, uncovered patterns that suggested fraud and abuse, which administrators were able to resolve, saving millions of dollars.

    Agencies also use data mining to target potential program users, analyze experimental and scientific results, perform risk analysis and management, develop profiles for welfare returnees, and predict transportation and infrastructure breakdowns.

    There are plenty of examples of successful data mining, but that doesn't mean it's easy; a variety of obstacles stand in the way. Source data might reside on remote platforms. Databases might have multiple formats. Actual formats in some legacy systems might be unknown. Databases might have undefined fields or, worse, fields defined incorrectly. The data itself might be inconsistent (death dates before birth dates, for example), erroneous, or incomprehensible, as with codes known only to the coder.

    Panning for gold

    Although a large amount of raw material is good, databases might be too large and unwieldy to work with effectively.

    Finally, you would prefer to have data that includes only the information you need to mine. For instance, record numbers might be a part of a medical database but useless for your purpose.

    Start with an idea of what you want the eventual data store to look like. Where will it reside? What format will it be in? This gives you a target for further data preparation tasks.

    Next, analyze existing data sources. How big are they? In what format are they? Some tools specifically analyze such databases and report on what the data is like. This helps you prepare for later steps.
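    The kind of report such an analysis tool produces can be sketched in a few lines of Python. This is an illustrative profiler, not any particular product; the sample records and field names are invented.

```python
# Hypothetical sketch of automated source analysis: given sample records
# from an unknown source, infer each field's apparent type, its maximum
# length, and how often it is empty. All names and data are invented.

def profile(records):
    """Summarize type, max length, and empty count for each field."""
    stats = {}
    for rec in records:
        for field, value in rec.items():
            s = stats.setdefault(field, {"max_len": 0, "empty": 0, "types": set()})
            if value is None or value == "":
                s["empty"] += 1          # track missing values separately
                continue
            s["max_len"] = max(s["max_len"], len(str(value)))
            if str(value).lstrip("-").isdigit():
                s["types"].add("integer")
            else:
                s["types"].add("text")
    return stats

sample = [
    {"name": "Smith", "record_no": "1042", "dob": "1951-03-02"},
    {"name": "Jones", "record_no": "2525", "dob": ""},
]
for field, s in profile(sample).items():
    print(field, s["max_len"], s["empty"], sorted(s["types"]))
```

    A real profiler would also sample value distributions and cross-field dependencies, but even this much tells you what target schema the data can support.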

    Acquiring the data can be complex, especially with remote, multiple and heterogeneous data sources. Some tools specialize in accessing a dizzying variety of database types on a wide selection of platforms. They can move the data from its original format and location to your data mining platform.

    Once the data arrives, don't dive immediately into data mining. Cleansing the data of inconsistencies, errors and plain nonsense will pay off in smoother and more conclusive data mining outcomes. Automated cleansing tools employ sophisticated algorithms to identify and remedy problem data.
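    A minimal cleansing check, assuming date fields named "birth" and "death" in a record; the field names are illustrative, not taken from any agency schema. Records that fail a check are flagged for review rather than silently dropped.

```python
# Sketch of a consistency check of the kind cleansing tools automate:
# flag impossible date orderings (e.g. death before birth). Field names
# "birth" and "death" are invented for illustration.

from datetime import date

def check_record(rec):
    """Return a list of problems found in one record."""
    problems = []
    birth, death = rec.get("birth"), rec.get("death")
    if birth and death and death < birth:      # death date before birth date
        problems.append("death precedes birth")
    if birth and birth > date.today():         # birth date in the future
        problems.append("birth in the future")
    return problems

bad = {"birth": date(1980, 5, 1), "death": date(1975, 1, 1)}
good = {"birth": date(1950, 2, 10), "death": date(2001, 7, 4)}
print(check_record(bad))    # flags the inconsistency
print(check_record(good))   # nothing to flag
```

    In practice a tool would run dozens of such rules per record and route the flagged rows to a human reviewer.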

    After the data is cleansed, transform it into your format of choice, usually not a demanding chore, although it can be time-consuming for very large databases.

    Here's your cue

    The next step is to stage the data in a way that makes sense for your purposes. Many agencies find it useful to create data marts dedicated to a particular subset of the data, on which data mining and ad hoc queries can run quickly and efficiently.

    Internet, intranet or extranet access can also be added for agency personnel, interagency partners and the public. Indexing the data can be useful to support queries. Convera's RetrievalWare supports searches by synonyms, for example, so that a search for "rising gas prices" will also turn up hits on "increased cost of petrol."
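    The idea behind synonym search can be shown with a toy query expander. This is not RetrievalWare's actual mechanism, just an illustration of expanding query terms through a hand-built thesaurus before matching; the thesaurus and documents are made up.

```python
# Toy illustration of synonym-expanded search: each query term is
# widened to a set of synonyms, and a document matches if it contains
# some synonym of every term. Thesaurus and documents are invented.

SYNONYMS = {"gas": {"gas", "petrol"}, "rising": {"rising", "increased"}}

def expand(term):
    """Widen one query term to its synonym set (or just itself)."""
    return SYNONYMS.get(term, {term})

def search(query, documents):
    """Return documents containing every query term or a synonym of it."""
    hits = []
    for doc in documents:
        words = set(doc.lower().split())
        if all(expand(t) & words for t in query.lower().split()):
            hits.append(doc)
    return hits

docs = ["Increased cost of petrol hits commuters", "Rising gas prices expected"]
print(search("rising gas", docs))
```

    A production engine would use a full thesaurus, stemming and ranked scoring, but the expansion step works the same way.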

    Agencies have a lot to consider when choosing tools to support data mining. Usually, you are turning to an automated tool because a database is too large, complex or inconveniently distributed to handle manually. The tool you choose should handle these situations.

    Tools that manage a variety of data formats and platforms are most useful for bridging the gap between the original documents and the target data store. The tool should not dictate the database. If your data sources are large, or if you anticipate growth, the scalability of tools becomes important to maintain performance.

    Integration with existing systems is vital if you expect to deploy the system throughout an agency or across agencies.

    Don't neglect the human element. Tools should be reasonably easy to use, requiring little training or providing tutorials. Remember that you will be performing some of these tasks infrequently, so intuitive functionality will come in handy. Manual configuration to handle certain known problem areas should be an option.

    The market for data mining is growing rapidly, giving agencies a variety of choices, both in price and features.

    Meanwhile, standards for metadata and data mining are emerging, which will simplify deployment and product comparisons.

    Agencies are increasingly aware of the value they can get from data mining. Less than two years ago, the General Services Administration became the first agency to appoint a chief knowledge officer. The creation of CKO posts and knowledge management programs has been steady ever since, and the trend isn't likely to stop any time soon.

    Edmund X. DeJesus is a freelance writer in Norwood, Mass. Contact him at dejesus@compuserve.com.

    With more than 200 years' worth of information about one-eighth of the land in the United States, the Bureau of Land Management has a nightmare data management problem by any measure.

    The data includes land surveys, ownership titles, mineral rights and other varieties of documents, all changing over time, with the original data in a multitude of legacy formats. How can you mine this?

    Leslie Cone, project manager for BLM's Land and Resources Project Office in Denver, said the answer is in data profiling.

    Data profiling analyzes the structure and format of data sources, including the field lengths and data types. Once this is known, data can be logically recast into more modern and accessible formats.

    The problem is that manual data profiling takes too long. The data sources are too large, the documentation can be inaccurate, the meaning of certain fields might be misleading, different offices might use different standards, and some codes might be too obscure to interpret. "We had one data source where a code of '2525' kept turning up," Cone said. "We had no idea what it meant."
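    One tactic automated profilers use on opaque fields is simple value-frequency analysis, which at least makes recurring mystery codes like the '2525' Cone describes surface for human follow-up. A sketch in Python, with invented data:

```python
# Value-frequency profiling: tally how often each value appears in a
# column so recurring codes stand out. The column data is invented.

from collections import Counter

def frequent_values(values, top=3):
    """Return the most common values in a column with their counts."""
    return Counter(values).most_common(top)

column = ["A1", "2525", "B7", "2525", "2525", "A1"]
print(frequent_values(column))
```

    The tool can't say what the code means, but it tells analysts which codes are worth chasing down.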

    Plow and harvest

    Fortunately, there are tools to perform data profiling automatically. Cone's group chose Evoke Software Corp.'s Migration Architect and Axio to rapidly plow through data and reveal the true formats and structure.

    The program's sophisticated algorithms tease out information essential for reformatting, while also exposing inconsistencies and anomalies. "Once we know the true structure of the data, we can then transform it for use in the applications we develop," Cone said. The newly recast data is available both publicly and internally.

    Although this automated data profiling software is not cheap, starting at $300,000, Cone considers the cost worthwhile. "The job would have been impossible otherwise," she said. "Instead, we now have usable data we can leverage into valuable applications for BLM's management tasks."
