Categorizing is key for effective data mining

By Florence Olsen

GCN Staff

Contrary to popular opinion, data mining is no clean, antiseptic process, according to data mining consultant Philip Matkovsky.

As with old-fashioned statistical analysis, organizations have to categorize data minutely before they can mine it for patterns such as fraud.

'The federal government, interestingly enough, has not built up a large structured data set of known fraud cases,' said Matkovsky, manager of operations for analytical systems at Federal Data Corp. of Bethesda, Md. Lacking such resources, federal data miners must use unsupervised learning methods, such as disjoint cluster analysis, to train their favorite algorithms to sniff out fraud, he said.
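
As a rough illustration of the unsupervised approach, the sketch below uses plain k-means, one common form of disjoint cluster analysis, to group unlabeled claims and then flags any claim that sits unusually far from its own cluster center. It is a minimal sketch under stated assumptions, not the method Matkovsky or Federal Data Corp. used: the claim records, the two features, the function names and the three-times-median cutoff are all hypothetical, and a real analysis would rescale the features first.

```python
import math
import statistics

def kmeans(points, k, iterations=20):
    """Split points into k disjoint clusters with plain k-means."""
    # Deterministic start: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def flag_unusual(points, centroids, labels, factor=3.0):
    """Return indices of points far from their own cluster center."""
    dists = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    flagged = []
    for c in range(len(centroids)):
        in_cluster = [d for d, lab in zip(dists, labels) if lab == c]
        if not in_cluster:
            continue
        # Hypothetical rule: flag anything beyond factor x the cluster median.
        cutoff = factor * statistics.median(in_cluster)
        flagged += [i for i, (d, lab) in enumerate(zip(dists, labels))
                    if lab == c and d > cutoff]
    return sorted(flagged)

# Hypothetical, unlabeled claims: (claim amount in dollars, visits per month).
claims = [(100, 2), (1000, 10), (110, 3), (95, 2), (105, 1),
          (990, 11), (1010, 9), (1005, 10), (400, 40)]

centroids, labels = kmeans(claims, k=2)
suspects = flag_unusual(claims, centroids, labels)
print(suspects)
```

No claim here is labeled as fraud in advance. The code only reports which records do not fit the groups it found, and in this made-up data that is the last claim alone; an analyst would still have to investigate it.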

If analysts had more examples of known fraud cases, Matkovsky said, they could fall back on supervised learning methods to train the algorithms to detect evidence of new frauds.
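
For contrast, a supervised method starts from cases whose outcome is already known. The following minimal sketch uses a nearest-centroid classifier, chosen here only because it is short, and not because the article names it. The labeled history, the features and the function names `train` and `predict` are hypothetical.

```python
import math
from collections import defaultdict

def train(examples):
    """Learn one mean feature vector (centroid) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Assign the label whose centroid lies closest to the new case."""
    return min(model, key=lambda label: math.dist(features, model[label]))

# Hypothetical labeled history: (amount in $100s, visits per month), outcome.
known_cases = [
    ((1.0, 2), "legitimate"), ((1.1, 3), "legitimate"), ((0.9, 2), "legitimate"),
    ((4.0, 40), "fraud"), ((3.5, 35), "fraud"), ((4.5, 45), "fraud"),
]

model = train(known_cases)
print(predict(model, (3.8, 38)))  # resembles the known fraud cases
print(predict(model, (1.2, 2)))   # resembles the legitimate cases
```

The quality of such a model depends entirely on how many confirmed cases are available to learn from, which is the shortage Matkovsky describes.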

In selecting data mining tools, Matkovsky looks for a toolset that allows both supervised and unsupervised learning 'because you don't always have perfect information,' he said. The information generally 'is going to be a mess.' With the cost of data mining tools running about $30,000 per seat, most agencies can afford only one toolset.

'Look for a toolset with multiple algorithms,' Matkovsky advised, because data mining problems can be unpredictable.

He said modern neural network algorithms are far more sensitive to subtle variations in the underlying data than were the earliest neural net programs. Differences in sensitivity are 'largely a matter of parameter setting,' he said.

Some neural network algorithms permit considerable user involvement in setting search parameters, 'which is a good thing,' he said, 'because you can fine-tune your algorithm and get rid of false alarms bit by bit.' Training a neural network is an abstract, creative activity that does not fit easily into a 9-to-5 schedule, Matkovsky said.
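
The trade-off behind that kind of fine-tuning can be shown with the simplest adjustable parameter, a cutoff on a model's output score. The sketch below does not train a neural network. It assumes some model has already scored each case between 0 and 1, and the scores, labels and the `count_errors` helper are invented for illustration.

```python
def count_errors(scores, is_fraud, threshold):
    """Count false alarms and missed frauds at a given score cutoff."""
    false_alarms = sum(
        1 for s, fraud in zip(scores, is_fraud) if s >= threshold and not fraud
    )
    missed = sum(
        1 for s, fraud in zip(scores, is_fraud) if s < threshold and fraud
    )
    return false_alarms, missed

# Hypothetical model scores and the true outcome of each case.
scores = [0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.91]
is_fraud = [False, False, False, False, True, False, True]

# Raise the cutoff step by step and watch the false alarms fall.
for threshold in (0.3, 0.5, 0.6, 0.85):
    false_alarms, missed = count_errors(scores, is_fraud, threshold)
    print(f"threshold={threshold}: {false_alarms} false alarms, {missed} missed")
```

In this made-up data, raising the cutoff from 0.3 to 0.6 removes false alarms without losing a fraud case, while pushing it to 0.85 starts to miss one, which is why the tuning proceeds bit by bit.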

But for anyone who has enthusiasm for tools such as Intelligent Miner from IBM Corp. and SAS Enterprise Miner from SAS Institute Inc. of Cary, N.C., the rewards are great, Matkovsky said.

'You're using a machine to explain variants with no human intervention except for the initial input. That's a very cool thing,' he said.

