
Resampling

Imbalanced data is a common problem in machine learning, where one class has a significantly higher number of observations than the other. This can lead to biased models and poor performance on the minority class. In this blog, we will discuss techniques for handling imbalanced data and improving model performance.   

Understanding imbalanced data 

Imbalanced data refers to datasets where the distribution of class labels is not equal, with one class having a significantly higher number of observations than the other. This can be a problem for machine learning algorithms, as they can be biased towards the majority class and perform poorly on the minority class.

Techniques for handling imbalanced data

Dealing with imbalanced data is a common problem in data science, where the target classes have an uneven distribution of observations. In classification problems, this often produces models that are biased toward the majority class and perform poorly on the minority class. Several techniques can be employed to handle imbalanced data.

 


1. Resampling Techniques

Resampling techniques involve modifying the original dataset to balance the class distribution. This can be done by either oversampling the minority class or undersampling the majority class. 

Oversampling techniques include random oversampling, the synthetic minority over-sampling technique (SMOTE), and adaptive synthetic sampling (ADASYN). Undersampling techniques include random undersampling, NearMiss, and Tomek links.
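
To make the two directions concrete, here is a minimal base-R sketch of random oversampling and random undersampling. It assumes a data frame df with a binary label column where 1 marks the minority class; the object and column names are illustrative only.

minority <- df[df$label == 1, ]   # rare class
majority <- df[df$label == 0, ]   # dominant class

# Random oversampling: draw minority rows with replacement until the classes match
over <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
balanced_over <- rbind(majority, over)

# Random undersampling: keep only as many majority rows as there are minority rows
under <- majority[sample(nrow(majority), nrow(minority)), ]
balanced_under <- rbind(minority, under)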

An example of a resampling technique is bootstrap resampling, where you generate new data samples by randomly selecting observations from the original dataset with replacement. These new samples are then used to estimate the variability of a statistic or to construct a confidence interval.

For instance, if you have a dataset of 100 observations, you can draw 100 new samples of size 100 with replacement from the original dataset. Then, you can compute the mean of each new sample, resulting in 100 new mean values. By examining the distribution of these means, you can estimate the standard error of the mean or the confidence interval of the population mean. 
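
A minimal version of this bootstrap in base R (using made-up data) might look like the following; the numbers are purely illustrative.

set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)       # original sample of 100 observations

# Draw 100 bootstrap samples of size 100 with replacement and record each sample mean
boot_means <- replicate(100, mean(sample(x, size = 100, replace = TRUE)))

sd(boot_means)                            # bootstrap estimate of the standard error of the mean
quantile(boot_means, c(0.025, 0.975))     # approximate 95% confidence interval for the mean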

2. Data Augmentation

Data augmentation is a powerful technique used to artificially increase the size and diversity of a dataset by creating new, synthetic samples from existing ones. In the context of imbalanced data, this helps balance the representation of minority classes. Techniques vary depending on the type of data:

  • For images, this might include rotating, flipping, zooming, cropping, or adjusting brightness.

  • For text, you might perform synonym replacement, random word insertion, or back-translation.

  • For tabular data, methods like SMOTE (Synthetic Minority Over-sampling Technique) generate new data points by interpolating between existing ones.

By enriching the dataset in this way, models are less likely to overfit and can generalize better, especially when learning patterns from underrepresented classes.
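
As a small illustration of the image case, an image can be treated as a numeric matrix of pixel intensities and transformed directly in base R; the matrix below is random placeholder data rather than a real image.

img <- matrix(runif(28 * 28), nrow = 28)   # stand-in for a 28 x 28 grayscale image

flipped  <- img[, ncol(img):1]             # horizontal flip: mirror the columns
brighter <- pmin(img * 1.2, 1)             # brightness increase, clipped to the [0, 1] range

# Each transformed copy is added to the training set with the same label as the original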

 


3. Synthetic Minority Over-Sampling Technique (SMOTE)

SMOTE is an advanced oversampling method used to balance datasets where one class is significantly underrepresented. Unlike simple oversampling, which duplicates minority class instances and may lead to overfitting, SMOTE takes a more intelligent approach by creating entirely new, synthetic examples.

How SMOTE Works

  1. Identify nearest neighbors: For each minority class instance, SMOTE identifies its k nearest neighbors—typically from the same class—using distance metrics like Euclidean distance.

  2. Interpolation: A random neighbor is selected, and a new synthetic data point is generated at a randomly chosen position along the line segment connecting the original data point and that neighbor.

  3. Repeat: This process is repeated until the desired balance between classes is achieved.

Example Scenario

Imagine you have only 50 positive cases (minority class) and 950 negative ones (majority class). SMOTE would generate new synthetic examples between similar positive cases to bring the number of positive instances closer to 950. This helps the model understand the underlying patterns in the minority class more effectively.
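
The interpolation step at the heart of SMOTE can be sketched in a few lines of base R. This is a simplified, illustrative version rather than a production implementation; minority is assumed to be a numeric matrix of minority-class rows with more rows than k.

smote_sample <- function(minority, n_new, k = 5) {
  d <- as.matrix(dist(minority))          # pairwise Euclidean distances between minority rows
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority))
  for (s in 1:n_new) {
    i   <- sample(nrow(minority), 1)      # pick a random minority point
    nn  <- order(d[i, ])[2:(k + 1)]       # its k nearest minority neighbours (skipping itself)
    j   <- sample(nn, 1)                  # choose one neighbour at random
    gap <- runif(1)                       # random interpolation factor in [0, 1]
    synth[s, ] <- minority[i, ] + gap * (minority[j, ] - minority[i, ])
  }
  synth
}

Called with n_new = 900 on the 50 positive cases above, this would generate enough synthetic positives to roughly balance the two classes.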

Benefits of SMOTE

  • Reduces overfitting: Since new samples are not direct copies, the model is less likely to memorize the data.

  • Improves generalization: The synthetic points help in learning a more generalized decision boundary.

  • Better minority class representation: It enhances the model’s ability to correctly classify minority class instances.

Limitations

  • May create ambiguous samples if the minority class overlaps heavily with the majority class.

  • Doesn’t consider feature correlations or outliers, which might introduce noise if not handled properly.

4. Ensemble Techniques

Ensemble techniques are powerful strategies in machine learning that combine predictions from multiple models to improve overall performance, especially when dealing with imbalanced datasets. Instead of relying on a single model, ensemble methods leverage the strengths of several models to produce more robust and accurate predictions.

Key Ensemble Methods

  • Bagging (Bootstrap Aggregating): This method builds multiple models (usually of the same type, like decision trees) using different random subsets of the data. The final prediction is made through majority voting or averaging. Random Forest is a popular bagging technique that performs well even with imbalanced data when class weights are adjusted.

  • Boosting: Boosting builds models sequentially, where each new model focuses on correcting the errors of the previous ones. Techniques like AdaBoost, Gradient Boosting, and XGBoost can be tuned to give more attention to misclassified (often minority class) instances, thereby improving classification performance on the minority class.

  • Stacking: This method combines different types of models (e.g., logistic regression, decision trees, SVM) and uses a meta-model to learn how best to blend their predictions. It allows for greater flexibility and can better capture complex patterns in the data.

Why It Works for Imbalanced Data

Ensemble methods reduce variance and bias, making the model more resilient to the pitfalls of imbalanced classes. When used with class weighting, resampling, or cost-sensitive learning, these techniques can significantly enhance the detection of minority class instances without sacrificing performance on the majority class.
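
As one concrete example in R, a random forest can be nudged toward the minority class through class weights. This is a hedged sketch: it assumes the randomForest package and a data frame train with a binary factor column label (names are illustrative), and the inverse-frequency weights are only a heuristic starting point, since classwt acts as class priors in this implementation.

library(randomForest)

# Weight each class inversely to its frequency so minority-class errors matter more
w <- nrow(train) / (2 * table(train$label))

rf <- randomForest(label ~ ., data = train,
                   ntree = 500,
                   classwt = as.numeric(w))   # one weight per class, in factor-level order

pred <- predict(rf, newdata = test)           # test is a held-out data frame with the same columns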

 


 

5. One-Class Classification

One-class classification is a specialized technique particularly useful when data from only one class (typically the majority or “normal” class) is available or reliable. The model is trained solely on this single class to learn its distribution and behavior, and then used to identify instances that deviate significantly from it—flagging them as anomalies or potential members of the minority class.

How It Works

The model essentially creates a profile of what “normal” looks like based on the training data. Any new instance that doesn’t fit this profile is considered an outlier. This is especially helpful in situations where minority class data is too rare, sensitive, or expensive to collect—for example, fraud detection or fault monitoring.

Common Algorithms

  • One-Class SVM (Support Vector Machine): Separates the training data from the origin in feature space, classifying anything outside the learned region as an anomaly (see the sketch after this list).

  • Isolation Forest: Randomly isolates observations by splitting data recursively; anomalies are easier to isolate and thus have shorter path lengths.

  • Autoencoders (for deep learning): Neural networks that learn to reconstruct input data. Poor reconstruction indicates an anomaly.
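
As a rough sketch of the first algorithm above, and assuming the e1071 package is available, a one-class SVM can be trained on "normal" data only; normal_data and new_data are placeholder numeric matrices.

library(e1071)

# Train only on examples of the "normal" (majority) class
oc <- svm(normal_data,
          type = "one-classification",   # one-class SVM
          kernel = "radial",
          nu = 0.05)                     # rough upper bound on the fraction of training outliers

inlier <- predict(oc, new_data)          # TRUE = consistent with the learned "normal" profile
anomalies <- new_data[!inlier, ]         # flagged as potential minority/anomalous cases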

Benefits

  • Doesn’t require balanced datasets.

  • Effective for anomaly and rare event detection.

  • Minimizes risk of overfitting to minority data that may be noisy or inconsistent.

Limitations

  • Less effective when the minority class has varied characteristics or when both classes are available in sufficient quality and quantity.

  • May produce false positives if the “normal” class has high variability.

 


 

6. Cost-Sensitive Learning

Cost-sensitive learning is an effective technique that directly addresses class imbalance by incorporating the cost of misclassification into the learning process. Instead of treating all errors equally, it penalizes mistakes on the minority class more heavily, encouraging the model to give greater attention to those instances.

How It Works

In imbalanced datasets, models tend to favor the majority class because minimizing overall error often means simply predicting the dominant class. Cost-sensitive learning changes this by assigning a higher misclassification cost to the minority class. For example, in a medical diagnosis scenario, missing a rare disease case (false negative) is far more serious than a false positive, so the cost of misclassification is adjusted accordingly.

Implementation Techniques

  • Weighted loss functions: Modify the loss function (e.g., cross-entropy) to include class weights, so the model is penalized more for misclassifying minority class instances.

  • Class weights in algorithms: Many machine learning libraries (like scikit-learn, XGBoost, LightGBM) offer built-in parameters to assign different weights to classes.

  • Custom cost matrices: In some cases, you can define a full cost matrix to dictate specific penalties for each type of misclassification, as sketched below.
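
For illustration, here is how a cost matrix might be passed to a decision tree in R. It assumes the rpart package and a data frame train whose binary factor label has the negative class as its first level; the 10:1 penalty is an arbitrary example, and in practice the costs should come from domain knowledge.

library(rpart)

# Loss matrix with zeros on the diagonal: rows index the true class, columns the prediction.
# Missing a minority (positive) case costs 10, while a false alarm costs 1.
costs <- matrix(c(0,  1,
                  10, 0), nrow = 2, byrow = TRUE)

fit <- rpart(label ~ ., data = train,
             method = "class",
             parms = list(loss = costs))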

Benefits

  • No need to alter the original data distribution.

  • Integrates seamlessly into most machine learning algorithms.

  • Helps in real-world cases where consequences of different types of errors vary (e.g., fraud detection, medical diagnostics).

Limitations

  • Requires domain knowledge to assign appropriate costs.

  • May lead to instability if the cost ratios are set too aggressively.

7. Evaluation Metrics for Imbalanced Data

When working with imbalanced data, traditional evaluation metrics like accuracy can be misleading. A model might achieve high accuracy by simply predicting the majority class, while completely ignoring the minority class. Therefore, it’s crucial to use metrics that truly reflect performance on imbalanced datasets.

Key Metrics to Use

  • Precision: Measures how many of the predicted positive instances are actually correct. High precision is important when false positives are costly.

  • Recall (Sensitivity): Measures how many actual positive instances were correctly identified. This is especially critical when detecting rare events in imbalanced data.

  • F1 Score: The harmonic mean of precision and recall. It provides a balanced measure, especially useful when dealing with imbalanced classes.

  • AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Evaluates the model’s ability to distinguish between classes. A higher AUC-ROC indicates better performance on imbalanced data.

  • AUC-PR (Area Under the Precision-Recall Curve): More informative than AUC-ROC when dealing with severely imbalanced datasets, as it focuses on minority class performance.

Why These Metrics Matter

In imbalanced data scenarios, focusing only on overall accuracy can hide poor performance on the minority class. These metrics ensure that the model is evaluated fairly, especially in cases where identifying rare but critical instances (like fraud, disease, or defects) is the main goal.
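
The core metrics are easy to compute directly from predictions in base R; the toy vectors below (1 = minority/positive class) are purely illustrative.

actual    <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
predicted <- c(1, 0, 0, 0, 0, 1, 0, 1, 0, 0)

tp <- sum(predicted == 1 & actual == 1)   # true positives
fp <- sum(predicted == 1 & actual == 0)   # false positives
fn <- sum(predicted == 0 & actual == 1)   # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

c(precision = precision, recall = recall, f1 = f1)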

Choosing the Best Technique for Handling Imbalanced Data

We have now covered a range of approaches for addressing class imbalance: resampling (oversampling and undersampling), data augmentation, SMOTE, ensemble methods, one-class classification, cost-sensitive learning, and evaluation metrics that reflect minority-class performance.

Undersampling reduces the size of the majority class to match that of the minority class, while oversampling creates new instances of the minority class to balance the data. Feature selection, the process of keeping only the most relevant features, can further reduce noise in the data.

In practice, combining undersampling and oversampling often works well, with oversampling methods such as SMOTE tending to be the most effective. However, the choice of technique ultimately depends on the specific characteristics of the dataset and the problem at hand.

 


 

March 21, 2023

This RHadoop tutorial shows how to resample from a large data set in parallel. It is written with beginners in mind.

How-to: RHadoop (with R on Hadoop) to resample from a large data set 

Reposted from Cloudera blog.

Internet-scale datasets present a unique challenge to traditional machine-learning techniques, such as fitting random forests or “bagging.” To fit a classifier to a large data set, it’s common to generate many smaller data sets derived from the initial large data set (i.e. resampling). There are two reasons for this:

  1. Large data sets typically live in a cluster, so any operations should have some level of parallelism. Separate models fit on separate nodes that contain different subsets of the initial data.
  2. Even if you could use the entire initial data set to fit a single model, it turns out that ensemble methods, where you fit multiple smaller models using subsets of the data, generally outperform single models. Indeed, fitting a single model with 100M data points can perform worse than fitting just a few models with 10M data points each (so less total data outperforms more total data; e.g. see this paper).

Furthermore, bootstrapping is another popular method that randomly chops up an initial data set to characterize distributions of statistics and also to build ensembles of classifiers (e.g., bagging). Parallelizing bootstrap sampling or ensemble learning can provide significant performance gains even when your data set is not so large that it must live in a cluster. The gains from purely parallelizing the random number generation are still significant.

Sampling with replacement

Sampling-with-replacement is the most popular method for sampling from the initial data set to produce a collection of samples for model fitting. This method is equivalent to sampling from a multinomial distribution where the probability of selecting any individual input data point is uniform over the entire data set.

Unfortunately, it is not possible to sample from a multinomial distribution across a cluster without using some kind of communication between the nodes (i.e., sampling from a multinomial is not embarrassingly parallel). But do not despair: we can approximate a multinomial distribution by sampling from an identical Poisson distribution on each input data point independently, lending itself to an embarrassingly parallel implementation.
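
A quick numeric check in R makes the idea concrete: giving every input point an independent Poisson(K / N) count requires no coordination between points, yet the total sample size still lands close to the target K. The values of N and K below are illustrative.

set.seed(1)
N <- 1e6                              # total number of input points
K <- 1e5                              # desired sample size
counts <- rpois(N, lambda = K / N)    # one independent draw per input point
sum(counts)                           # total points emitted; itself ~ Poisson(K), so close to 1e5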

Below, we will show you how to implement such a Poisson approximation to enable you to train a random forest on an enormous data set. As a bonus, we’ll be implementing it in R and RHadoop, as R is many people’s statistical tool of choice. Because this technique is broadly applicable to any situation involving resampling a large data set, we begin with a full general description of the problem and solution.

Formal problem statement for RHadoop

Our situation is as follows:

  • We have N data points in our initial training set {x_i}, where N is very large (10^6 to 10^9) and the data is distributed over a cluster.
  • We want to train a set of M different models for an ensemble classifier, where M is anywhere from a handful to thousands.
  • We want each model to be trained with K data points, where typically K << N. (For example, K may be 1-10% of N.)

The number of training data points available to us, N, is fixed and generally outside of our control. However, K and M are both parameters that we can set and their product KM determines the total number of input vectors that will be consumed in the model fitting process. There are three cases to consider:

  • KM < N, in which case we are not using the full amount of data available to us.
  • KM = N, in which case we can exactly partition our data set to produce independent samples.
  • KM > N, in which case we must resample some of our data with replacement.

The Poisson sampling method described below handles all three cases in the same framework. (However, note that for the case KM = N, it does not partition the data, but simply resamples it as well.)

(Note: The case where K = N corresponds exactly to bootstrapping the full initial data set, but this is often not desired for very large data sets. Nor is it practical from a computational perspective: performing a bootstrap of the full data set would require the generation of MN data points and M scans of an N-sized data set. However, in cases where this computation is desired, there exists an approximation called a “Bag of Little Bootstraps.”)

The goal

So our goal is to generate M data sets of size K from the original N data points where N can be very large and the data is sitting in a distributed environment. The two challenges we want to overcome are:

  • Many resampling implementations perform M passes through the initial data set, which is highly undesirable in our case because the initial data set is so large.
  • Sampling-with-replacement involves sampling from a multinomial distribution over the N input data points. However, sampling from a multinomial distribution requires message passing across the entire data set, so it is not possible to do so in a distributed environment in an embarrassingly parallel fashion (i.e., as a map-only MapReduce job).

Poisson-approximation resampling

Our solution to these issues is to approximate the multinomial sampling by sampling from a Poisson distribution for each input data point separately. For each input point x_i, we sample M times from a Poisson(K / N) distribution to produce M values {m_j}, one for each model j. For each data point x_i and each model j, we emit the key-value pair <j, x_i> a total of m_j times (where m_j may be zero). Because the sum of independent Poisson variables is Poisson, the number of times a data point is emitted is distributed as Poisson(KM / N), and the size of each generated sample is distributed as Poisson(K), as desired. Because the Poisson sampling occurs for each input point independently, this sampling method can be parallelized in the map portion of a MapReduce job.

(Note that our approximation never guarantees that every single input data point is assigned to at least one of the models, but this is no worse than multinomial resampling of the full data set. However, in the case where KM = N, this is particularly bad in contrast to the alternative of partitioning the data, as partitioning will guarantee independent samples using all N training points, while resampling can only generate (hopefully) uncorrelated samples with a fraction of the data.)

Ultimately, each generated sample will have a size K on average, and so this method will approximate the exact multinomial sampling method with a single pass through the data in an embarrassingly parallel fashion, addressing both of the big data limitations described above. Because we are randomly sampling from the initial data set, and similarly to the “exact” method of multinomial sampling, some of the initial input vectors may never be chosen for any of the samples. We expect that approximately exp{–KM / N} of the initial data will be entirely missing from any of the samples (see figure below).

[Figure: Poisson approximation. Amount of missed data as a function of KM / N; the value for KM = N is marked in gray.]
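
The missed-data claim is easy to verify numerically in R; the parameter values below are illustrative.

set.seed(2)
N <- 1e6; K <- 1e5; M <- 50
draws <- rpois(N, lambda = K * M / N)   # total times each point is emitted across all M samples
mean(draws == 0)                        # observed fraction of points never sampled
exp(-K * M / N)                         # theoretical fraction quoted above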

Finally, the MapReduce shuffle distributes all the samples to the reducers and the model fitting or statistic computation is performed on the reduce side of the computation.

The algorithm for performing the sampling is presented below in pseudocode. Recall that there are three parameters, N, M, and K, of which N is fixed; we choose to specify T = K / N as one of the parameters, as it eliminates the need to determine the value of N in advance.

# example sampling parameters
T = 0.1   # param 1: K / N; average fraction of input data in each model; 10%
M = 50    # param 2: number of models

def map(k, v):                # for each input data point
    for i in 1:M              # for each model
        m = Poisson(T)        # number of times the current point should appear in this sample
        if m > 0
            for j in 1:m      # emit the current input point the proper number of times
                emit (i, v)

def reduce(k, v):
    fit model or calculate statistic with the sample in v

Note that even more significant performance enhancements can be achieved if it is possible to use a combiner, but this is highly statistic/model-dependent.

Example: Kaggle Data Set on Bulldozer Sale Prices
We will apply this method to test out the training of a random forest regression model on a Kaggle data set found here. The data set comprises ~400k training data points. Each data point represents a sale of a particular bulldozer at an auction, for which we have the sale price along with a set of other features about the sale and the bulldozer. (This data set is not especially large, but will illustrate our method nicely.) The goal will be to build a regression model using an ensemble method (specifically, a random forest) to predict the sale price of a bulldozer from the available features.

[Image: A bulldozer. Could be yours for $141,999.99.]

The data are supplied as two tables: a transaction table that includes the sale price (the target variable) and some other features, including a reference to a specific bulldozer; and a bulldozer table that contains additional features for each bulldozer. As this post does not concern itself with data munging, we will assume that the data come pre-joined. In a real-life situation, we would incorporate the join into the workflow by, for example, processing it with a Hive query or a Pig script. Since the data are relatively small in this case, we simply use some R commands. The code to prepare the data can be found here.

Quick note on R and RHadoop

As so much statistical work is performed in R, it is highly valuable to have an interface to use R over large data sets in a Hadoop cluster. This can be performed with RHadoop, which is developed with the support of Revolution Analytics. (Another option for R and Hadoop is the RHIPE project.)

One of the nice things about RHadoop is that R environments can be serialized and shuttled around, so there is never any reason to explicitly move any side data through Hadoop’s configuration or distributed cache. All environment variables are distributed around transparently to the user. Another nice property is that Hadoop is used quite transparently to the user, and the semantics allow for easily composing MapReduce jobs into pipelines by writing modular/reusable parts.

The only thing that might be unusual for the “traditional” Hadoop user (but natural to the R user) is that the mapper function should be written to be fully vectorized (i.e., keyval() should be called once per mapper as the last statement). This is to maximize the performance of the mapper (since R’s interpreted REPL is quite slow), but it means that mappers receive multiple input records at a time and everything the mappers emit must be grouped into a single object.

Finally, I did not find the RHadoop installation instructions (or the documentation in general) to be in a very mature state, so here are the commands I used to install RHadoop on my small cluster.

Fitting an ensemble of random forests with Poisson sampling on RHadoop

We implement our Poisson sampling strategy with RHadoop. We start by setting global values for our parameters:

frac.per.model <- 0.1   # 10% of the input data to each sample, on average
num.models <- 50

As mentioned previously, the mapper must deal with multiple input records at once, so there needs to be a bit of data wrangling before emitting the keys and values:

#MAPPER
poisson.subsample <- function(k, v) {
  # parse the data chunk into a data frame;
  # raw is basically a chunk of a csv file
  raw <- paste(v, sep="\n")
  # convert to data.frame using read.table() in parse.raw()
  input <- parse.raw(raw)

  # this function is used to generate a sample from
  # the current block of data
  generate.sample <- function(i) {
    # generate one Poisson draw per row of the input block
    draws <- rpois(n=nrow(input), lambda=frac.per.model)
    # compute the index vector for the corresponding rows,
    # weighted by the number of Poisson draws
    indices <- rep((1:nrow(input))[draws > 0], draws[draws > 0])
    # emit the rows; RHadoop takes care of replicating the key appropriately
    # and rbinding the data frames from different mappers together for the
    # reducer
    keyval(rep(i, length(indices)), input[indices, ])
  }

  # here is where we generate the actual sampled data
  raw.output <- lapply(1:num.models, generate.sample)

  # and now we must reshape it into something RHadoop expects
  output.keys <- do.call(c, lapply(raw.output, function(x) {x$key}))
  output.vals <- do.call(rbind, lapply(raw.output, function(x) {x$val}))
  keyval(output.keys, output.vals)
}

Because we are using R, the reducer can be incredibly simple: it takes the sample as an argument and simply feeds it to our model-fitting function, randomForest():

#REDUCE function
fit.trees <- function(k, v) {
  # rmr rbinds the emitted values, so v is a data frame;
  # note that do.trace=TRUE is used to produce output to stderr to keep
  # the reduce task from timing out
  rf <- randomForest(formula=model.formula,
                     data=v,
                     na.action=na.roughfix,
                     ntree=10, do.trace=TRUE)

  # rf is a list, so wrap it in another list to ensure that only
  # one object gets emitted; this is because keyval is vectorized
  keyval(k, list(forest=rf))
}

Keep in mind that in our case, we are actually fitting 10 trees per sample, but we could easily only fit a single tree per “forest”, and merge the results from each sample into a single real forest.

Note that the choice of predictors is specified in the variable model.formula. R's random forest implementation does not support factors that have more than 32 levels, as the optimization problem grows too fast. To illustrate the Poisson sampling method, we chose to simply ignore those features, even though they probably contain useful information for regression. In a future blog post, we will address various ways that we can get around this limitation.

The MapReduce job itself is initiated like so:

mapreduce(input="/poisson/training.csv",
          input.format="text",
          map=poisson.subsample,
          reduce=fit.trees,
          output="/poisson/output")

The resulting trees are dumped in HDFS at /poisson/output.

Finally, we can load the trees, merge them, and use them to classify new test points:

raw.forests <- from.dfs("/poisson/output")[["val"]]
forest <- do.call(combine, raw.forests)

Conclusion

Each of the 50 samples produced a random forest with 10 trees, so the final random forest is an ensemble of 500 trees, fitted in a distributed fashion over a Hadoop cluster. The full set of source files is available here.

Hopefully, you have now learned a scalable approach for training ensemble classifiers or bootstrapping in a parallel fashion by using a Poisson approximation to multinomial sampling.

August 18, 2022
