When your target variable is rare, oversampling it can improve your Data Mining results. This post shows how oversampling works and how to undo it when interpreting the results.
In cases where the target variable appears in a fraction of less than 10%, it is common to stratify the occurrence of the target variable. That should improve the result of your Data Mining challenge. SAS uses the term "oversampling" in their Enterprise Miner software for increasing the relative occurrence of the target variable without using copies, namely by reducing the occurrence of the non-target records instead. Be advised that "oversampling" is also used for duplicating the target records; you can read more about that at the zyxos Blog. We will stick to the quite simple SAS view.
10% target variable with 10,000 records (original fraction = 0.1)
90% non-target variable with 90,000 records
100% total with 100,000 records
So if the target only makes up a fraction of 10% in the beginning, you take a sample where it is oversampled to a 50%/50% distribution. That is quite easy: just take all of the target records and add a random sample of non-target records of the same size (see the numbers and the SAS sketch below).
50% target variable with 10,000 records (oversampled fraction = 0.5)
50% non-target variable with 10,000 records
100% total with 20,000 records
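Here is a minimal sketch in SAS Base of how such a 50/50 flatfile could be built. The data set name CUSTOMERS, the target flag TARGET, the sample size of 10,000 and the seed are just assumptions for this example; Enterprise Miner also offers sampling nodes for this, the code only shows the plain idea:

/* split the original flatfile into target and non-target records */
data targets nontargets;
   set customers;
   if target = 1 then output targets;
   else output nontargets;
run;

/* draw a simple random sample of non-target records,
   the same size as the number of target records (here 10,000) */
proc surveyselect data=nontargets out=nontargets_sample
   method=srs n=10000 seed=4711;
run;

/* stack both parts into the oversampled 50/50 flatfile */
data oversampled;
   set targets nontargets_sample;
run;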
Now you apply your usual Data Mining to this flatfile, and as a result you get a scoring model that can predict the occurrence probability of the target variable for given input data.
BUT the occurrence probabilities from this scoring are not the ones that are correct for the original distribution; they are the ones that are correct for the oversampled set!
If you would like to present the correct probabilities to your customer, you have to undo the oversampling after scoring.
To undo the oversampling
That is how it works:
corrected probability = 1 / (1 + (1/original fraction - 1) / (1/oversampled fraction - 1) * (1/scoring result - 1))

The formula as used in SAS Base:
FIXED_p_scoring1 = 1/(1+(1/0.1-1)/(1/0.5-1)*(1/p_scoring1-1));
FIXED_p_scoring0 = 1/(1+(1/0.9-1)/(1/0.5-1)*(1/p_scoring0-1));
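Applied to a scored data set, the correction could be wrapped in a small data step like this. The data set names SCORED and SCORED_FIXED are placeholders for whatever your scoring step produces; the variables p_scoring1 and p_scoring0 follow the naming above. As a quick sanity check: a record scored with p_scoring1 = 0.5 on the oversampled scale gets FIXED_p_scoring1 = 1/(1 + 9/1 * 1) = 0.1, which is exactly the original fraction.

/* undo the 50/50 oversampling on every scored record;
   0.1 and 0.9 are the original fractions, 0.5 is the oversampled one */
data scored_fixed;
   set scored;
   FIXED_p_scoring1 = 1/(1+(1/0.1-1)/(1/0.5-1)*(1/p_scoring1-1));
   FIXED_p_scoring0 = 1/(1+(1/0.9-1)/(1/0.5-1)*(1/p_scoring0-1));
run;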
BTW: if you just want to fetch the top x% of your records sorted by probability, you don't need to undo the oversampling. Since the correction is a strictly monotonic (order-preserving) transformation, the undoing does not change the sorting; you just take your "best customers", for instance, and you are done. I mostly applied the correction only when the customer wanted to know the "correct" probabilities.
Rule of thumb: how and when to use oversampling
You have to review the use of oversampling thoroughly. Since you are cutting away some of your data (part of the non-target records), you will lose some information. If that information is still represented by the remaining non-target records, the model should not get worse; it may even get better.
So how much oversampling, and when?
My rule of thumb:
5-10% target variable with more than 10,000 target records: 50/50 oversampling is OK
<5% target variable with more than 10,000 target records: 30/70 oversampling is recommended (30% target variable and 70% non-target variable)
<5% target variable with fewer than 10,000 target records: the resulting flatfile should not be smaller than 20,000 records, so do the oversampling in a way that maximizes your target variable fraction while you still keep more than 20,000 records in total (see the sketch below).
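A small sketch of that last rule, reusing the TARGETS and NONTARGETS data sets from the sampling sketch above: take all target records and fill up with just enough non-target records to reach 20,000 in total. The count of 3,000 target records is made up for illustration:

/* assume 3,000 target records: take all of them and add
   20,000 - 3,000 = 17,000 non-target records (15% target fraction) */
%let n_targets    = 3000;
%let n_nontargets = %eval(20000 - &n_targets);

proc surveyselect data=nontargets out=nontargets_sample
   method=srs n=&n_nontargets seed=4711;
run;

data oversampled;
   set targets nontargets_sample;
run;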
In practice that is the way I got the best results with oversampling. For example, in direct marketing for CRM I once had a fraction of only 3% of the target variable in the data source. But with the help of oversampling it was finally possible to create a model that achieved a 5-times higher response rate on a test set of customers.
So don't give up when the target variable appears to be rare. Just reduce the hay, and the needle will be found!