Before you mine data, refine it

 

Data mining is amazing technology. It lets you dig through mountains of information to turn up nuggets of previously unknown, but valuable, relationships in data.

And nobody has more mountains of information than the government, for which data mining can tap vast stores of information to reveal patterns, expose inconsistencies and suggest solutions to problems.

But the vastness of data available, and the variety of formats and structures of the data, can stop data mining in its tracks.

Systems chiefs do have one thing on their side: Unlike mining for, say, coal or iron ore, you can refine before you mine. That is, you can convert data into a form that makes mining faster and more efficient. And there are tools to automate this process, giving data mining wider scope and greater depth than ever before.

Government agencies use data mining for many purposes. In one example, a successful program at the Civilian Health and Medical Program of the Uniformed Services, the health care system for military personnel and their dependents, uncovered patterns that suggested fraud and abuse, which administrators were able to resolve, saving millions of dollars.

Agencies also use data mining to target potential program users, analyze experimental and scientific results, perform risk analysis and management, develop profiles for welfare returnees, and predict transportation and infrastructure breakdowns.

There are plenty of examples of successful data mining, but that doesn't mean it's easy; a variety of obstacles stand in the way. Source data might reside on remote platforms. Databases might have multiple formats. Actual formats in some legacy systems might be unknown. Databases might have undefined fields or, worse, fields defined incorrectly.
The data itself might be inconsistent (death dates before birth dates, for example), erroneous, or incomprehensible in the case of codes known only to the coder.

Although a large amount of raw material is good, databases might be too large and unwieldy to work with effectively.

Finally, you would prefer to have data that includes only the information you need to mine. For instance, record numbers might be a part of a medical database but useless for your purpose.

Start with an idea of what you want the eventual data store to look like. Where will it reside? What format will it be in? This gives you a target for further data preparation tasks.

Next, analyze existing data sources. How big are they? In what format are they? Some tools specifically analyze such databases and report on what the data is like. This helps you prepare for later steps.

Acquiring the data can be complex, especially with remote, multiple and heterogeneous data sources. Some tools specialize in accessing a dizzying variety of database types on a wide selection of platforms. They can move the data from its original format and location to your data mining platform.

Once the data arrives, don't dive immediately into data mining. Cleansing the data of inconsistencies, errors and plain nonsense will pay off in smoother and more conclusive data mining outcomes. Automated cleansing tools employ sophisticated algorithms to identify and remedy problem data. After the data is cleansed, transform it into your format of choice. This is usually not a demanding chore, although it's time-consuming for very large databases.

The next step is to stage the data in a way that makes sense for your purposes. Many agencies find it useful to create data marts, each dedicated to a certain subset of the data, on which data mining and ad hoc queries can be fast and efficient.

Internet, intranet or extranet access can also be added for agency personnel, interagency partners and the public. Indexing the data can be useful to support queries.
Convera's RetrievalWare supports searches by synonyms, for example, so that looking for "rising gas prices" will also turn up hits on the "increased cost of petrol."

Agencies have a lot to consider when choosing tools to support data mining projects. Usually, you are turning to an automated tool because a database is too large, complex or inconveniently distributed to handle manually. The tool you choose should be able to handle these situations.

Tools that manage a variety of data formats and platforms are most useful for bridging the gap between the original documents and the target data store. The tool should not dictate the database. If your data sources are large, or if you anticipate growth, the scalability of tools becomes important to maintain performance. Integration with existing systems is vital if you expect to deploy the system throughout an agency or across agencies.

Don't neglect the human element. Tools should be reasonably easy to use, either requiring little training or providing tutorials. Remember that you will be performing some of these tasks infrequently, so intuitive functionality will come in handy. Manual configuration to handle certain known problem areas should be an option.

The market for data mining is growing rapidly, giving agencies a variety of choices, both in price and features. Meanwhile, standards for metadata and data mining are emerging, which will simplify deployment and product comparisons.

Agencies are increasingly aware of the value they can get from data mining. Less than two years ago, the General Services Administration became the first agency to appoint a chief knowledge officer. The creation of CKO posts and knowledge management programs has been steady ever since, and the trend isn't likely to stop any time soon.
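To make the cleansing step described earlier more concrete, here is a minimal Python sketch of two of the checks the article mentions: rejecting records whose death date precedes the birth date, and dropping a field (the record number) that is useless for mining. The records and field names are hypothetical, invented purely for illustration; commercial cleansing tools do far more than this.

```python
from datetime import date

# Hypothetical sample records; the field names are illustrative assumptions.
records = [
    {"record_no": 1, "name": "A. Smith", "born": date(1930, 5, 1), "died": date(1999, 3, 2)},
    {"record_no": 2, "name": "B. Jones", "born": date(1975, 8, 9), "died": date(1970, 1, 1)},
    {"record_no": 3, "name": "C. Lee", "born": date(1960, 2, 3), "died": None},
]


def is_consistent(rec):
    """A death date, when present, must not precede the birth date."""
    return rec["died"] is None or rec["died"] >= rec["born"]


def cleanse(recs, drop_fields=("record_no",)):
    """Split records into clean and rejected lists.

    Clean records also lose the fields that are not needed for mining.
    Rejected records are kept whole so someone can review them.
    """
    clean, rejected = [], []
    for rec in recs:
        if is_consistent(rec):
            clean.append({k: v for k, v in rec.items() if k not in drop_fields})
        else:
            rejected.append(rec)
    return clean, rejected


clean, rejected = cleanse(records)
print(len(clean), "clean;", len(rejected), "rejected")
```

Setting the rejected records aside, rather than silently discarding them, leaves room for the manual review the article recommends for known problem areas.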
With more than 200 years' worth of information about one-eighth of the land in the United States, the Bureau of Land Management has a nightmare data management problem by any measure.

The data includes land surveys, ownership titles, mineral rights and other varieties of documents, all changing over time, with the original data in a multitude of legacy formats. How can you mine this?

Leslie Cone, project manager for BLM's Land and Resources Project Office in Denver, said the answer is in data profiling.

Data profiling analyzes the structure and format of data sources, including the field lengths and data types. Once this is known, data can be logically recast into more modern and accessible formats.

The problem is that manual data profiling takes too long. The data sources are too large, the documentation can be inaccurate, the meaning of certain fields might be misleading, different offices might use different standards, and some codes might be too obscure to interpret. "We had one data source where a code of '2525' kept turning up," Cone said. "We had no idea what it meant."

Fortunately, there are tools to perform data profiling automatically. Cone's group chose Evoke Software Corp.'s Migration Architect and Axio to rapidly plow through data and reveal the true formats and structure.

The program's sophisticated algorithms tease out information essential for reformatting, while also exposing inconsistencies and anomalies. "Once we know the true structure of the data, we can then transform it for use in the applications we develop," Cone said. The newly recast data is available both publicly and internally.

Although this automated data profiling software is not cheap at $300,000 and up, Cone considers the cost worthwhile. "The job would have been impossible otherwise," she said. "Instead, we now have usable data we can leverage into valuable applications for BLM's management tasks."
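For a rough sense of what data profiling reports, the Python sketch below summarizes a single column: its longest value, a guessed data type, the number of blanks and the most frequent values. It is an illustration only, not a description of how Evoke's products work; the function name and the sample codes are hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column of raw string values."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        inferred = "empty"
    elif all(v.isdigit() for v in non_empty):
        inferred = "integer"  # every non-blank value consists only of digits
    else:
        inferred = "text"
    return {
        "max_length": max((len(v) for v in non_empty), default=0),
        "inferred_type": inferred,
        "blank_count": len(values) - len(non_empty),
        "top_values": Counter(non_empty).most_common(3),
    }


# Hypothetical code column in which one unexplained value keeps turning up.
codes = ["2525", "0110", "2525", "", "2525", "0340"]
report = profile_column(codes)
print(report)
```

A frequency count like `top_values` is how an unexplained but common code would surface early, before anyone tries to recast the data.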

The lowdown

  • What is it? A data mining tool is one of many software programs that can search databases to uncover patterns and relationships, and perform tasks as varied as predicting users' needs, visualizing trends, integrating data in disparate formats and extracting and cleansing data.

  • What are the essentials in preparing for data mining? Knowing the size, location and format of the existing data is most important. Data mining needs access to consistent, accurate data quickly and easily.

  • What preliminary steps are necessary to data mining? You must first acquire the original data, understand the original data formats, cleanse the data of inconsistencies and inaccuracies, move the data to the target platform, and transform the data into accessible formats.

  • What other considerations are important? Know how the data mining system will integrate with existing systems, and the Internet, if necessary.

  • Must-know info? The more steps of the process a package can perform without degradation of performance, the better. Look for products that can handle multiple formats and perform their functions automatically, although manual configuration for special situations can be important.
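The last of the preliminary steps listed above, transforming data into accessible formats, often means recasting fixed-width legacy records into named fields. The Python sketch below shows the idea under stated assumptions: the column layout, field names and sample lines are all hypothetical, and a real layout would come from profiling the source.

```python
import csv
import io

# Hypothetical layout: (field name, start offset, end offset) per column.
LAYOUT = [("parcel_id", 0, 6), ("state", 6, 8), ("acres", 8, 14)]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of trimmed field values."""
    return {name: line[start:end].strip() for name, start, end in layout}


# Hypothetical legacy records, 14 characters each.
legacy_lines = [
    "000123CO  40.5",
    "000124UT 160.0",
]

rows = [parse_fixed_width(line) for line in legacy_lines]

# Write the recast rows as CSV, a format most mining tools can read.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[name for name, _, _ in LAYOUT])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping the layout in a single table makes it easy to correct once profiling reveals that a field was documented incorrectly.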

    Edmund X. DeJesus is a free-lance writer in Norwood, Mass. Contact him at dejesus@compuserve.com.








