machine learning - Dealing with unbalanced datasets in Spark MLlib


I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems using Spark's MLlib.

I'm using MLlib's Random Forest implementation and already tried the simplest approach of randomly undersampling the larger class, but it didn't work as well as I expected.
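For reference, the undersampling I tried looks roughly like this (a minimal sketch, assuming label is a Double column where 0.0 is the majority class; undersampleMajority is just an illustrative name):

import org.apache.spark.sql.DataFrame

// Keep all minority-class rows (label == 1.0) and a random fraction of the
// majority class (label == 0.0) so both classes end up roughly the same size.
def undersampleMajority(dataset: DataFrame): DataFrame = {
  val numPositives = dataset.filter(dataset("label") === 1).count.toDouble
  val numNegatives = dataset.filter(dataset("label") === 0).count.toDouble
  val fraction = numPositives / numNegatives // assumes negatives are the majority
  dataset.stat.sampleBy("label", Map(0.0 -> fraction, 1.0 -> 1.0), seed = 42L)
}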

I would appreciate any feedback regarding your experience with similar issues.

Thanks,

Class weight with Spark ML

As of this moment, class weighting for the Random Forest algorithm is still under development (see here).

But if you're willing to try other classifiers - this functionality has already been added to Logistic Regression.

Consider a case where we have 80% positives (label == 1) in the dataset, so theoretically we want to "under-sample" the positive class. The logistic loss objective function should treat the negative class (label == 0) with a higher weight.

Here is an example in Scala of generating this weight. We add a new column to the DataFrame, with one weight for each record in the dataset:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def balanceDataset(dataset: DataFrame): DataFrame = {

  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val numNegatives = dataset.filter(dataset("label") === 0).count
  val datasetSize = dataset.count
  val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize

  val calculateWeights = udf { d: Double =>
    if (d == 0.0) {
      1 * balancingRatio
    }
    else {
      (1 * (1.0 - balancingRatio))
    }
  }

  val weightedDataset = dataset.withColumn("classWeightCol", calculateWeights(dataset("label")))
  weightedDataset
}

Then, we create a classifier as follows:

new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
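Putting the two pieces together, a minimal end-to-end sketch (assuming dataset is a DataFrame that already has "label" and "features" columns, and balanceDataset is the function above):

import org.apache.spark.ml.classification.LogisticRegression

// Add the per-record weight column, then train with it.
val weightedDataset = balanceDataset(dataset)

val lr = new LogisticRegression()
  .setWeightCol("classWeightCol")
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = lr.fit(weightedDataset)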

For more details, see here: https://issues.apache.org/jira/browse/spark-9610

Predictive power

A different issue you should check is whether your features have "predictive power" with respect to the label you're trying to predict. If, after under-sampling, you still have low precision, maybe that has nothing to do with the fact that your dataset is imbalanced by nature.


I would do some exploratory data analysis - if the classifier doesn't do better than a random choice, there is a risk that there simply is no connection between the features and the class.

  • Perform correlation analysis of every feature against the label (see the sketch after this list).
  • Generating class-specific histograms of features (i.e. plotting the histograms of the data for each class, for a given feature, on the same axis) can also be a good way to show whether a feature discriminates well between the two classes.
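A minimal sketch of the correlation analysis from the first bullet (featureCols is a placeholder for your actual numeric feature columns; stat.corr computes the Pearson correlation between two columns):

// Pearson correlation of each numeric feature column against the label.
// featureCols is illustrative: replace with your real column names.
val featureCols = Seq("feature1", "feature2", "feature3")
featureCols.foreach { c =>
  val corr = dataset.stat.corr(c, "label")
  println(s"corr($c, label) = $corr")
}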

Overfitting - a low error on your training set and a high error on your test set might be an indication that you overfit using an overly flexible feature set.


Bias variance - check whether your classifier suffers from a high bias or a high variance problem.

  • Training error vs. validation error - graph the validation error and the training set error as a function of the number of training examples (do incremental learning); a sketch of such a learning curve follows after this list.
    • If the lines seem to converge to the same value and are close at the end, then your classifier has high bias. In such a case, adding more data won't help. Change the classifier for one that has higher variance, or simply lower the regularization parameter of your current one.
    • If, on the other hand, the lines are quite far apart, and you have a low training set error but a high validation error, then your classifier has high variance. In this case, getting more data is very likely to help. If after getting more data the variance is still too high, you can increase the regularization parameter.
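A minimal sketch of producing such a learning curve with Spark ML (the fractions, the seed, and the use of areaUnderROC as the tracked metric are illustrative assumptions; dataset is again assumed to have "label" and "features" columns):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val Array(train, validation) = dataset.randomSplit(Array(0.8, 0.2), seed = 42L)

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderROC")

// Train on growing fractions of the training set and compare the metric
// on the training subset vs. the held-out validation set.
Seq(0.1, 0.25, 0.5, 0.75, 1.0).foreach { frac =>
  val subset = train.sample(withReplacement = false, fraction = frac, seed = 42L)
  val model = new LogisticRegression()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .fit(subset)
  val trainMetric = evaluator.evaluate(model.transform(subset))
  val valMetric = evaluator.evaluate(model.transform(validation))
  println(s"fraction=$frac  train AUC=$trainMetric  validation AUC=$valMetric")
}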
