Overrepresentation - "SAS"-Oversampling


Typical Data Mining task: Looking for a needle in a haystack!

To improve your Data Mining result when only a small share of your records carries the target variable, it is useful to oversample the target. This article shows how that works, and how to undo it when working with the scored result.

In cases where the target variable appears in a fraction of less than 10%, it is common to stratify the occurrence of the target variable; that should improve the result of your Data Mining challenge. The term “oversampling” is used by SAS in their Enterprise Miner software to mean raising the relative occurrence of the target variable without using copies, but by reducing the occurrence of the non-target records. Be advised that “oversampling” is also used to mean duplicating records; you should check that out at the zyxos Blog. We will stick to the quite simple view of SAS.

Before:

10% target variable with 10,000 data sets (original fraction = 0.1)
90% non-target variable with 90,000 data sets
100% total with 100,000 data sets

So if the target only makes up a fraction of 10% in the beginning, you take a sample in which it is oversampled to a 50/50 distribution. That is quite easy: take the complete 10% of target data sets and add an equal number of randomly drawn non-target data sets (a small SAS sketch of this step follows the numbers below).

After:

50% target variable with 10,000 data sets (oversampled fraction = 0.5)
50% non-target variable with 10,000 data sets
100% total with 20,000 data sets
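
For illustration, here is a minimal SAS Base sketch of this sampling step. It assumes a source table WORK.CUSTOMERS with a 0/1 flag named TARGET; the table name, variable name, seed, and the use of PROC SURVEYSELECT are my own choices for the example (the Sample node in Enterprise Miner does the same job interactively).

/* keep all data sets that carry the target */
data targets;
  set customers;
  if target = 1;
run;

/* count them */
proc sql noprint;
  select count(*) into :n_targets from customers where target = 1;
quit;

/* draw an equally sized simple random sample of the non-target data sets */
proc surveyselect data=customers(where=(target=0)) out=nontargets
                  method=srs sampsize=&n_targets seed=20100101;
run;

/* stack both parts into the oversampled 50/50 flatfile */
data oversampled;
  set targets nontargets;
run;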

Now you apply your usual Data Mining to this flatfile, and as a result you get a scoring model that can predict the occurrence probability of the target variable for given input data.

BUT this occurrence probability from the scoring is not the one that is correct for the original distribution; it is the one that is correct for the oversampled set!

If you would like to present the correct probabilities to your customer, you have to undo the oversampling after scoring.

To undo the oversampling

That is how it works:

correct probabilities =
1/(1+(1/original fraction-1)/(1/oversampled fraction-1)*(1/scoring result-1));

Here is the formula:

[Graphic: the undo formula, captioned “Oversampling undo”]

Formula as used in SAS Base:

FIXED_p_scoring1 = 1/(1+(1/0.1-1)/(1/0.5-1)*(1/p_scoring1-1));

FIXED_p_scoring0 = 1/(1+(1/0.9-1)/(1/0.5-1)*(1/p_scoring0-1));
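
In practice these two statements sit inside a data step that reads the scored table. A minimal sketch, assuming the Miner scores were exported to a table called SCORED containing p_scoring1 and p_scoring0 (the table and output names are made up for this example):

data scored_fixed;
  set scored;
  /* target class: original fraction 0.1, oversampled fraction 0.5 */
  FIXED_p_scoring1 = 1/(1+(1/0.1-1)/(1/0.5-1)*(1/p_scoring1-1));
  /* non-target class: original fraction 0.9, oversampled fraction 0.5 */
  FIXED_p_scoring0 = 1/(1+(1/0.9-1)/(1/0.5-1)*(1/p_scoring0-1));
run;

A quick numeric check: a scoring result of 0.6 for the target class becomes 1/(1+9*(1/0.6-1)) = 1/7, roughly 0.14, and the corrected probabilities of the two classes still sum to 1.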

BTW: if you just want to fetch the top xy% of your data sets sorted by probability, you don’t need to undo the oversampling. Since the correction is a strictly monotonic transformation, it does not change the sorting; you just take your “best customers”, for instance, and you are done. I mostly applied the correction only when the customer wanted to know the “correct” probability.

Rule of thumb: how and when to use it

You have to review the use of oversampling thoroughly. Since you are cutting away some of your data (part of the non-target data sets), you will lose some information. But if that information is still represented by the remaining non-target data, the model does not get worse; it only gets better.

So how much oversampling, and when?

My rule of thumb:

5-10% target variable with more than 10,000 data sets: 50/50 oversampling is OK

<5% target variable with more than 10,000 data sets: 30/70 oversampling is recommended (30% target variable and 70% non-target variable)

<5% target variable with less than 10,000 data sets: the whole flatfile should not be smaller than 20,000 data sets, so do the oversampling in a way that maximizes your target variable fraction while you still keep more than 20,000 data sets in total (the sizing arithmetic is sketched below).
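
The arithmetic behind these rules is simple: for a desired target fraction f and n target data sets you need n*(1-f)/f non-target data sets. A small SAS Base sketch with made-up numbers:

%let n_target = 3000;  /* all available target data sets (made-up figure) */
%let frac     = 0.30;  /* desired oversampled target fraction */

data _null_;
  n_nontarget = &n_target * (1 - &frac) / &frac;  /* 7,000 for a 30/70 mix */
  n_total     = &n_target + n_nontarget;          /* 10,000 in this example */
  put 'non-target data sets needed: ' n_nontarget comma12.;
  put 'total flatfile size:         ' n_total comma12.;
run;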

In practice that is the way I got the best results with oversampling. For example, in direct marketing for CRM I once had a fraction of only 3% of the target variable in the data source. But with the help of oversampling it was finally possible to create a model that achieved a five times higher response rate on a test set of customers.

So don’t give up when the target variable appears to be rare. Just reduce the hay, and the needle will be found!

16 comments to Overrepresentation – “SAS”-Oversampling

  • J3ny

    Hi. Nice article. Thanks a lot.

    Out of pure curiosity: what is the difference between oversampling and undersampling? I am asking because I found some other articles about data mining, and it seems that some of them would rather call what you described here undersampling [source]. Basically, dropping non-target data sets makes the whole data set smaller, thus undersampling. On the other hand, oversampling (in the example that you provided) would look more like this:

    50% target variables with 90,000 data sets (oversampled fraction = 0.5)
    50% non-target variable with 90,000 data sets
    100% total with 180,000 data sets

    Obtained by duplicating the target data sets. Feel free to correct me if I’m wrong.
    Regards,
    J3ny

  • Guido Deutsch

    Hi J3ny, thank you for your comment. I think you are right; there obviously is more than one definition of oversampling.

    The definition used by SAS (video) is quite simple: “over”-sampling means that you represent the target variable above its original fraction. So instead of having 5%, you have 10%, achieved by reducing the non-target data sets. Check the video, it is quite useful if you want to try out SAS-oversampling with the Miner. Redundant stratification is not intended, though.

    But I think we are also talking about two different techniques. The disproportionate, non-redundant stratification of the occurrence of a variable is one thing; it is called oversampling by SAS and undersampling by your source, which sounds very reasonable. The redundant stratification is called oversampling in your source and is, imho, not a process step in the Miner, though it is a good technique for improving the result that I have used sometimes.

    I came across several differences between terms in the SAS understanding and in the scientific view. Thanks for this one, J3ny, I will revise the article.

  • Brian

    Hello. Does this formula work for ANY data mining algorithm that produces a probability? So, SVM to decision trees to logistic regression to neural nets?

    Thanks!

    PS excited to see how the blog expands

  • Guido Deutsch

    Thanks Brian, and the answer is yes, since your algorithm is bound to your subset. If you change the class mix of your subset before training, you can correct the scores back afterwards; technically this does not interfere with your algorithm in any way, so it does not depend on any particular data mining technique. Though chances are high that your result will be different (and better, hopefully). If you are into this you might want to read Logistic Regression in Rare Events Data by Gary King and Langche Zeng.

  • Guido, nice post.
    It is a useful formula, but I quit using it some years ago, because the probabilities calculated by the model are only the probabilities of the training data. To know the real value of the model, I always check it against a second dataset containing new data that are more recent than the training data. For it is obvious: you want to use the model to predict the future, so you should check your model against data from the future. The real world validation is the only thing you can trust in data mining, otherwise you never know whether the model is time-robust, overfitted etc.
    About simply duplicating records: plainly put, it adds no information whatsoever; it only consumes disk space. If you want to duplicate records you should do it in an intelligent way. This is nicely described by Dorian Pyle in his book “Data Preparation for Data Mining”, which I still find the best data mining book I have ever laid eyes on.

  • shirly

    Hi Guido, Thanks for the post.

    What if the target variable is very rare (~0.001%) and overall there are 1,240,251 observations; would you still do

    “<5% target variable with more than 10,000 data sets: 30/70 oversampling is recommended (30% target variable and 70% non-target variable)”?

  • Guido Deutsch

    Hi Shirly, 1,240 observations are not that many given the huge amount of data. The density you describe (0.001%) seems to indicate a very rare event that may resist forecasting. But you may try 20/80 in addition to the 30/70, that is 1,200 targets and 4,800 non-targets. I think it will be tough to find a good model, though.

  • Cristina

    Hi there,

    I am referring to the comment by J3ny.

    I am wondering if somebody could explain to me how to configure the “Sample” node for undersampling or oversampling in SAS Miner, as J3ny proposed.

    Thank you for your help!!

  • Mina

    Dear Sir:

    I am developing a prediction model with logistic regression using SAS Enterprise Miner. The original sample (N=342) has only 16 observations in the target “1” category, which corresponds to 4.7% (16/342) of the observations. To handle the imbalanced sample and the rare-events issue, I set the sample proportion to 50.0 for the level-based option in the general property panel of the Sample node. Hence I end up with a sample containing 32 observations (16 observations for the target “1” category and 16 observations for the target “0” category), which I used to develop the prediction model.

    However, from the standpoint of my PhD thesis, I must provide references showing that this “oversampling” methodology is legitimate.

    Could you suggest any references in order to show that this approach is legitimate and accepted by the statistical community?

    Thank you so much for your kind cooperation,
    Mina

  • What does “scoring result” mean in your formula?

    The thing is, I did oversampling using a 30:70 ratio, as the number of events in my actual population of 20,000 was 355, or 1.78%. Now I have built a model using 1,000 observations, which I know might not be that good. But what if I want to use the equation derived from the 1,000 oversampled data sets to calculate probabilities for the overall population of 20,000?

    One thing is the intercept correction. What else needs to be done?

    It will be of great help.

  • Thanks for the formula. However, the formulas presented do not match. The one in text format is:

    1/(1+(1/original fraction-1)/(1/oversampled fraction-1)*(1/scoring result-1));

    But the graphical one translated to text is (because of multiplication taking precedence before subtraction):

    1/(1+(1/original fraction-1)/(1/oversampled fraction-1)*(1/scoring result)-1);

    Since the 1’s would cancel each other out in this case, I am assuming the graphic is wrong?

  • Guido Deutsch

    @Lars: the graphic written out in text (with the operator precedence made explicit by brackets) is:
    1/(1+((1/original fraction)-1)/((1/oversampled fraction)-1)*((1/scoring result)-1));

    The same, right?!

    Those are the same; however, the extra brackets in the last part of your translation need to be present in the graphic as well. See the ones I’ve marked with square brackets below:

    1/(1+((1/original fraction)-1)/((1/oversampled fraction)-1)*[(1/scoring result)-1]);

    Without them, the multiplication would be done before subtracting 1, but you want to first subtract 1 and then multiply, as indicated by the extra brackets.

  • I have been using oversampling techniques for a while now and have been adjusting the probabilities according to the formulas on this page. However, there is one peculiar thing I have noticed. Let’s say that the original rare event occurs with a frequency of one in a hundred. Should not the average probability for that event, after undoing the oversampling, be close to 1 percent? Having looked at five different models, it seems that I get average probabilities of about twice the natural occurrence (2% in this case).

  • quants_mum

    Hi
    I am working on decision trees for the first time at job and request some clarity on the following :

    1. Should the oversampling/undersampling technique be applied to the entire dataset, which is then split into training and validation subsets, or should it be the other way around, i.e. split the dataset into training and validation first and then apply oversampling only to the training dataset while keeping the validation dataset as it is?

    2. I am working on problems of classifying customers with a high cheque-bounce percentage, identifying frauds, etc. Which technique would serve me better: CHAID or a classification tree?

    3. Can we have an ordinal categorical target variable for classification trees?

    Response would be highly appreciated.

    thanks
    quants_mum

  • JP

    Thanks, very good post. How do you do the oversampling to handle asymmetric costs? I am currently working on a research project.
