In 1996, Usama Fayyad proposed a process, since very popular, for making a company's data useful for business needs. In it, Data Mining is described as one part of the KDD Process: a rather small part by this definition, but a very important one. After reading this article, you will understand why Data Mining needs pre- and post-processing and cannot stand alone against all the data.
Fayyad separated the approach to gaining knowledge from a set of data into individual steps. The steps are distinct because each one uses different tools and has to produce a different outcome.
KDD stands for Knowledge Discovery in Databases. According to Fayyad, the process has five steps: Selection, Pre-processing, Transformation, Data Mining and Interpretation. These five steps are passed through iteratively. Every step can be seen as a work-through phase that requires the supervision of a user and can produce multiple results. The best of these results is used for the next iteration; the others should be documented. The steps are briefly described below.
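The idea that a phase yields several results, of which the best is carried forward and the rest are documented, can be sketched in a few lines of Python. This is only an illustration: the function name `run_phase`, the scoring by sample size and the example data are assumptions made here, not part of Fayyad's description.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def run_phase(candidates: Sequence[T], score: Callable[[T], float]) -> Tuple[T, List[T]]:
    """Return the best-scoring result of a phase plus the remaining results.

    The remaining results are not thrown away: they are handed back so
    that they can be documented, as the process description asks.
    """
    if not candidates:
        raise ValueError("a phase must produce at least one result")
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[1:]


# Hypothetical example: three candidate samples from a Selection phase,
# scored simply by how many records they contain.
candidate_samples = [[1, 2], [1, 2, 3, 4], [1, 2, 3]]
best, documented = run_phase(candidate_samples, score=len)
```

In practice the score would be supplied by the supervising user, since judging which result is "best" is exactly the part that needs a human.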
The five KDD Steps by Fayyad
- In the Selection step, the significant data is selected or created. From then on, the KDD process operates on this gathered target data. Only relevant information is selected, along with metadata or data that represents background knowledge. Sometimes combining data from different sources can be useful, but possible compatibility issues have to be kept in mind.
- A good data mining result depends on appropriate data preparation at the beginning. Important elements of the provided data have to be detected and filtered out. These matters are settled in the Pre-processing phase. The main effort in detecting knowledge lies in pre-processing the data properly, not just in applying data mining tools. The less noise the data contains, the more efficient the data mining. Pre-processing includes cleaning wrong data, treating missing values and creating new attributes.
- The data also needs to be converted into a format that data mining tools can work with. The Transformation phase may produce several different data formats, since different data mining tools may require different formats. The data is also reduced, manually or automatically. The reduction can be done via lossless aggregation or via a lossy selection of only the most important elements. A representative selection can be used to draw conclusions about the entire data.
- In the Data Mining phase, the actual data mining task is carried out. Fayyad gives a classified overview of existing data mining techniques and suggests which technique may be used for which objective, although most of the techniques have been improved since then. The output of this step is a set of detected patterns. Data Mining will be the focus of following articles.
- The interpretation of the detected patterns reveals whether or not they are interesting, that is, whether they contain knowledge at all. This is why this step is also called Evaluation. The task is to present the result in an appropriate way so that it can be examined thoroughly. If a detected pattern is not interesting, the cause has to be found. It will probably be necessary to fall back on a previous step for another attempt.
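To make the five steps more tangible, here is a minimal sketch of them as a chain of plain Python functions. Everything concrete in it is an assumption made for illustration: the toy shopping records, the field names, the choice of counting item pairs as the "data mining" technique and the `min_support` threshold of 0.5. It is not Fayyad's method, just one possible reading of the steps above.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical raw records; a missing age is None, a wrong age is negative.
RAW = [
    {"customer": "a", "age": 34, "basket": ["bread", "butter"], "source": "shop"},
    {"customer": "b", "age": None, "basket": ["bread", "butter", "milk"], "source": "shop"},
    {"customer": "c", "age": -5, "basket": ["milk"], "source": "shop"},
    {"customer": "d", "age": 51, "basket": ["bread", "milk"], "source": "web"},
    {"customer": "e", "age": 29, "basket": ["bread", "butter"], "source": "test"},
]


def select(records):
    """Selection: keep only records from sources relevant to the question."""
    return [r for r in records if r["source"] in {"shop", "web"}]


def preprocess(records):
    """Pre-processing: drop wrong data, fill missing values, add an attribute."""
    valid = [r for r in records if r["age"] is None or r["age"] >= 0]
    fill = mean(r["age"] for r in valid if r["age"] is not None)
    return [
        {**r, "age": fill if r["age"] is None else r["age"],
         "basket_size": len(r["basket"])}
        for r in valid
    ]


def transform(records):
    """Transformation: lossy reduction to the format the mining step needs."""
    return [frozenset(r["basket"]) for r in records]


def mine(baskets):
    """Data Mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def interpret(pair_counts, n_baskets, min_support=0.5):
    """Interpretation: keep only patterns frequent enough to be interesting."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


cleaned = preprocess(select(RAW))
baskets = transform(cleaned)
patterns = interpret(mine(baskets), len(baskets))
```

If `patterns` came back empty, the sketch would be rerun with a different selection, cleaning rule or threshold, which mirrors the fall-back to a previous step described in the last bullet.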
The knowledge detected by the KDD process is usually used to support management decisions. It therefore flows into a Decision Support System (DSS) or into marketing automation for direct marketing purposes.
Other approaches to build a process
Modified approaches to analysing data still roughly follow the original proposal. Vendors in particular have recognized that only a systematic approach leads to successful data mining. Examples of such systematic solutions are the formerly so-called “5A’s”, proposed by SPSS, and “SEMMA”, proposed by SAS for use in the Enterprise Miner. (The “5A’s” of SPSS were Assess, Access, Analyze, Act and Automate; SPSS does not seem to use them any more since being bought by IBM. “SEMMA” by SAS stands for Sample, Explore, Modify, Model, Assess.) Rather than dwelling on these two solutions, which are closely tied to vendor products that come with plenty of information, another step-by-step proposal shall be highlighted: “CRISP-DM”, the Cross-Industry Standard Process for Data Mining. A new release of it is currently being worked on!
Data Mining always needs pre-processed data, and after it has been applied, the result needs to be understood and presented. Only then can you actually make use of the great benefits of Data Mining.