

Machine learning practices are the guiding principles that transform raw data into powerful insights. By following best practices in algorithm selection, data preprocessing, model evaluation, and deployment, we unlock the true potential of machine learning and pave the way for innovation and success.

In this blog, we focus on machine learning practices—the essential steps that unlock the potential of this transformative technology. By adhering to best practices, such as selecting the right machine learning algorithms, gathering high-quality data, performing effective preprocessing, evaluating models, and deploying them strategically, we pave the path toward accurate and impactful results.

[Image: 5 essential machine learning practices]

Join us as we explore these key machine learning practices and uncover the secrets to optimizing machine-learning models for revolutionary advancements in diverse domains.

1. Choose the right algorithm

When choosing an algorithm, it is important to consider the following factors:

  • The type of problem you are trying to solve. Some algorithms are better suited for classification tasks, while others are better suited for regression tasks.
  • The amount of data you have. Some algorithms require a lot of data to train, while others can be trained with less data.
  • The desired accuracy. Some algorithms are more accurate than others.
  • The computational resources you have available. Some algorithms are more computationally expensive than others.

Once you have considered these factors, you can start to narrow down your choices of algorithms. You can then read more about each algorithm and experiment with different algorithms to see which one works best for your problem.
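As a minimal sketch of that kind of experimentation (the dataset here is synthetic and the three candidates are only examples, not a recommendation), you could compare a few scikit-learn classifiers with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for your own feature matrix X and labels y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```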

2. Get enough data

Machine learning models are only as good as the data they are trained on. If you don’t have enough data, your models will not be able to learn effectively. It is important to collect as much data as possible that is relevant to your problem. The more data you have, the better your models will be.

There are a number of different ways to collect data for machine learning projects. Some common techniques include:

  1. Web scraping: Web scraping is the process of extracting data from websites. This can be done using a variety of tools and techniques (a minimal sketch follows this list).
  2. Social media: Social media platforms can be a great source of data for machine learning projects. This data can be used to train models for tasks such as sentiment analysis and topic modeling.
  3. Sensor data: Sensor data can be used to train models for tasks such as object detection and anomaly detection. This data can be collected from a variety of sources, such as smartphones, wearable devices, and traffic cameras.
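As a tiny, hypothetical illustration of the first technique (the URL and the h2 tag are placeholders, and you should always check a site's terms of service and robots.txt before scraping):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> heading on the page as a toy example
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)
```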
[Image: Machine learning practices for data scientists]

3. Clean your data

Even if you have a lot of data, it is important to make sure that it is clean. This means removing any errors or outliers from your data. If your data is dirty, it will be difficult for your models to learn effectively. There are a number of different ways to clean your data; some common techniques include the following (a brief pandas sketch follows the list):

  • Identifying and removing errors: This can be done by looking for data that is missing, incorrect, or inconsistent.
  • Identifying and removing outliers: Outliers are data points that are significantly different from the rest of the data. They can be detected with visualizations or simple statistical rules and then removed from the dataset.
  • Imputing missing values: Missing values can be imputed by filling them in with the mean, median, or mode of the other values in the column.
  • Transforming categorical data: Categorical data can be transformed into numerical data by using a process called one-hot encoding.
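Here is a brief pandas sketch of how these techniques might be combined; the DataFrame, its column names, and the 0–120 age range are purely illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 32, None, 200],
    "color": ["red", "blue", "blue", "red", "green"],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values with the median
df = df[df["age"].between(0, 120)]                # drop an implausible outlier
df = pd.get_dummies(df, columns=["color"])        # one-hot encode the categorical column
print(df)
```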

Once you have cleaned your data, you can then proceed to train your machine learning models.

4. Evaluate your models

Once you have trained your models, it is important to evaluate their performance. This can be done by using a holdout set of data that was not used to train the models. The holdout set can be used to measure the accuracy, precision, and recall of the models.

  1. Accuracy: Accuracy is the percentage of data points that are correctly classified by the model.
  2. Precision: Precision is the percentage of data points that are classified as positive that are actually positive.
  3. Recall: Recall is the percentage of positive data points that are correctly classified as positive.

The ideal model would have high accuracy, precision, and recall. However, in practice, it is often necessary to trade off between these metrics. For example, a model with high accuracy may have low precision or recall.
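To make the metrics concrete, here is a small scikit-learn sketch with made-up labels and predictions, purely for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # made-up ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up model predictions

print("accuracy: ", accuracy_score(y_true, y_pred))    # share of all points classified correctly
print("precision:", precision_score(y_true, y_pred))   # of predicted positives, how many are truly positive
print("recall:   ", recall_score(y_true, y_pred))      # of actual positives, how many were found
```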

Once you have evaluated your models, you can then choose the model that has the best performance. You can then deploy the model to production and use it to make predictions.

5. Deploy your models

Once you are satisfied with the performance of your models, it is time to deploy them. This means making them available to users so that they can use them to make predictions. There are many different ways to deploy machine learning models, such as through a web service or a mobile app.
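As a hypothetical sketch of the web-service route (the model.pkl file, the feature format, and the choice of Flask are assumptions here; a production deployment would also need input validation, logging, and error handling):

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model.pkl", "rb") as f:      # hypothetical pickled, already-trained model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```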

Deploying your machine learning models is considered good practice because it enables their practical utilization by making them accessible to users. It also has the potential to reach a broader audience, maximizing their impact.

By making your models accessible, you enable a wider range of users to benefit from the predictive capabilities of machine learning, driving decision-making processes and generating valuable outcomes.

Popular machine-learning algorithms

Here are some of the most popular machine-learning algorithms:

  1. Decision trees: Decision trees are a simple but effective algorithm for classification tasks. They work by dividing the data into smaller and smaller groups until each group can be classified with a high degree of accuracy.
  2. Linear regression: Linear regression is a simple but effective algorithm for regression tasks. It works by finding a line that best fits the data.
  3. Support vector machines: Support vector machines are a more complex algorithm that can be used for both classification and regression tasks. They work by finding a hyperplane that separates the data into two groups.
  4. Neural networks: Neural networks are powerful but complex algorithms that can be used for a variety of tasks, including classification, regression, and natural language processing.

It is important to note that there is no single “best” machine learning practice or algorithm. The best algorithm for a particular problem will depend on the specific characteristics of that problem.

In a nutshell

Machine learning practices are essential for accurate and reliable results. Choose the right algorithm, gather quality data, clean and preprocess it, evaluate model performance, and deploy your models effectively. By following these practices, you improve accuracy, support better decision-making, and solve real-world problems.

 

May 24, 2023

This blog explores the important steps to follow in the data preprocessing stage, such as removing duplicates, fixing structural errors, detecting and handling outliers, type conversion, dealing with missing values, and data encoding.

What is data preprocessing?

A common mistake that many novice data scientists make is skipping the data wrangling stage and diving right into model building, which in turn produces a poor-performing machine learning model.


This reflects a popular concept in the field of data science called GIGO (Garbage In, Garbage Out), which means that inferior-quality data will always yield poor results, irrespective of the model and optimization technique used.

Hence, an ample amount of time needs to be invested in ensuring that the quality of the data is up to standard. In fact, data scientists spend around 80% of their time on the data pre-processing phase alone. But fret not, because we will walk through the various steps you can follow to ensure that your data is preprocessed before you move ahead in the data science pipeline.

 


 

Let’s look at the steps of data pre-processing to understand it better:

Removing duplicates: 

You may often encounter repeated entries in your dataset, which is not a good sign because duplicates are an extreme case of non-random sampling and tend to bias the model. Including repeated entries will lead to the model overfitting this subset of points, so they must be removed.

We will demonstrate this with the help of an example. Let’s say we had a movie data set as follows: 

As we can see, the movie title “The Dark Knight” is repeated at the 3rd index (fourth entry) in the data frame and needs to be taken care of. 

[Image: movie data frame with “The Dark Knight” repeated at index 3]

Using the code below, we can remove the duplicate entries from the dataset based on the “Title” column and only keep the first occurrence of the entry. 
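Here is a minimal pandas sketch of that step, assuming the movies live in a DataFrame called df (the rows below are illustrative stand-ins for the data frame above):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Shining", "Inception", "The Dark Knight", "The Dark Knight"],
    "Director": ["Stanley Kubrick", "Christopher Nolan", "Christopher Nolan", "Christopher Nolan"],
    "Duration_mins": [146, 148, 152, 152],
})

# Keep only the first occurrence of each title
df = df.drop_duplicates(subset=["Title"], keep="first").reset_index(drop=True)
print(df)
```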

[Image: data frame after removing the duplicate entry]

 

Just by writing a few lines of code, you ensure your data is free from any duplicate entries. That’s how easy it is! 

Fix structural errors: 

Structural errors in a dataset refer to entries that have typos or inconsistent spellings:

[Image: data frame containing typos and inconsistent spellings]

Here, you can easily spot the different typos and inconsistencies, but what if the dataset were huge? You can check all the unique values and their corresponding occurrences with a value count.


Once you identify the entries to be fixed, simply replace them with the correct version, as sketched below.
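Here is a minimal sketch of both steps; the “Genre” column and its values are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

df = pd.DataFrame({"Genre": ["Action", "action", "Act ion", "Drama", "drama"]})

print(df["Genre"].value_counts())    # reveals the typos and inconsistent spellings

df["Genre"] = df["Genre"].replace({"action": "Action", "Act ion": "Action", "drama": "Drama"})
print(df["Genre"].value_counts())    # every entry now uses a consistent spelling
```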


Voila! That is how you fix the structural errors. 

 

Detecting and handling outliers: 

Before we dive into detecting and handling outliers, let’s discuss what an outlier is.  

“An outlier is any value in a dataset that drastically deviates from the rest of the data points.”

Let’s say we have a dataset of a streaming service with the ages of users ranging from 18 to 60, but there exists a user whose age is registered as 200. This data point is an example of an outlier and can mess up our machine-learning model if not taken care of. 

 


 

There are numerous techniques that can be employed to detect and remove outliers in a dataset, but the two I am going to discuss are:

  1. Box plots
  2. Z-score

Let’s assume the following data set: 

[Image: sample data set with an Age column]

 

If we use the describe function of pandas on the Age column, we can see the five-number summary along with the count, mean, and standard deviation of that column. Combining this with domain-specific knowledge (in this case, we know that implausibly large ages are likely the result of human error), we can deduce that there are outliers in the dataset: the mean is 38.92 while the maximum value is 92.

[Image: summary statistics for the Age column, showing a mean of 38.92 and a maximum of 92]

As we have got some idea about what outliers are, let’s see some code in action to detect and remove the outliers 

Box Plots: 

Box plots, also called “box and whisker plots,” show the five-number summary of the features under consideration and are an effective way of visualizing outliers.
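As a minimal sketch with matplotlib (the ages below are illustrative, with 200 standing in for an extreme value like the age-200 example mentioned earlier):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative ages with one extreme value standing in for the outlier
df = pd.DataFrame({"Age": list(range(18, 61, 3)) + [200]})

plt.boxplot(df["Age"])    # the 200 shows up as an isolated point beyond the whisker
plt.ylabel("Age")
plt.show()
```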

[Image: box plot of the Age column with outlier points beyond the whiskers]

As we can see from the above figure, there are a number of data points that are outliers. So now we move on to the Z-score, a method through which we are going to set a threshold and remove the outlier entries from our dataset.

Z-Score: 

A z-score determines the position of a data point in terms of its distance from the mean when measured in standard deviation units. 

We first calculate the Z-score of the feature column and then filter the rows against a threshold, as sketched below.


For a standard normal distribution, about 99.7% of the values fall between Z-scores of –3 and +3, so in practice the threshold is often set to 3: anything beyond that is deemed an outlier and removed from the dataset if it is not a legitimate observation.
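Here is a minimal sketch of both steps, computing the Z-scores by hand and keeping only the rows within the ±3 threshold; the DataFrame and its ages are illustrative, not the exact data shown above:

```python
import pandas as pd

df = pd.DataFrame({"Age": list(range(18, 61, 3)) + [200]})   # illustrative ages with one extreme value

z = (df["Age"] - df["Age"].mean()) / df["Age"].std()   # Z-score: distance from the mean in std-dev units
df_clean = df[z.abs() <= 3]                             # keep rows within the +/-3 threshold; the 200 row is dropped
print(df_clean)
```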


Type Conversion: 

Type conversion is needed when certain columns are not of the appropriate data type. For instance, in the following data frame, three out of the four columns are of the object data type:

[Image: data frame dtypes, with three of the four columns stored as object]

Well, we don’t want that, right? It could produce unexpected results and errors. We are going to convert Title and Director to the string data type, and Duration_mins to the integer data type.
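A minimal sketch of these conversions, with illustrative rows and the column names from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Dark Knight", "Inception"],
    "Director": ["Christopher Nolan", "Christopher Nolan"],
    "Duration_mins": ["152", "148"],      # mistakenly stored as strings (object dtype)
})

df["Title"] = df["Title"].astype("string")
df["Director"] = df["Director"].astype("string")
df["Duration_mins"] = df["Duration_mins"].astype(int)
print(df.dtypes)
```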

 


Dealing with missing values: 

Often, a dataset contains numerous missing values, which can be a problem. To name a few issues, they can lead to biased estimates or decrease the representativeness of the sample under consideration.

This brings us to the question of how to deal with them.

One thing you could do is simply drop them all. Notice that index 5 has a few missing values; when the “dropna” command is applied, that row is dropped from the dataset.
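A minimal sketch of this approach, using an illustrative data frame with one missing duration:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Shining", "Inception", "The Dark Knight"],
    "Duration_mins": [146, None, 152],
})

df = df.dropna()    # any row containing at least one missing value is removed
print(df)
```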

[Image: data frame before and after dropna; the row at index 5 is removed]

 

But what do you do when you have a limited number of rows in a dataset? You could use different imputation methods, such as the measures of central tendency, to fill those empty cells.

The measures include:

  1. Mean: The mean is the average of a data set. It is sensitive to outliers.
  2. Median: The median is the middle value of the set of numbers. It is resistant to outliers.
  3. Mode: The mode is the most common number in a data set.

It is better to use the median instead of the mean because it does not deviate drastically in the presence of outliers. Allow me to elaborate with an example.

[Image: data frame with a missing duration and the 6000-minute “Hunger!” entry]

Notice how there is a documentary by the name “Hunger!” with “Duration_mins” equal to 6000. Now observe the difference when I replace the missing value in the duration column with the mean and with the median.
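Here is a minimal sketch of the comparison; the durations are illustrative stand-ins, with 6000 playing the role of the “Hunger!” outlier:

```python
import pandas as pd

durations = pd.Series([146, 152, 148, None, 6000], name="Duration_mins")

print(durations.fillna(durations.mean()))     # the mean is dragged far upward by the 6000 outlier
print(durations.fillna(durations.median()))   # the median barely moves, so the filled value stays plausible
```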

[Image: the duration column filled with the mean (1129) and with the median (152)]

 

If you search the internet for the duration of the movie “The Shining,” you’ll find that it’s about 146 minutes. So isn’t 152 minutes (the median) much closer than the 1129 calculated from the mean?

A few other techniques to fill the missing values that you can explore are forward fill and backward fill. 
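Both can be sketched in a couple of lines; the series below is illustrative, with the missing cell sitting between 209 and 6000 as in the example that follows:

```python
import pandas as pd

durations = pd.Series([146, 209, None, 6000], name="Duration_mins")

print(durations.ffill())   # forward fill: the missing cell takes the previous value, 209
print(durations.bfill())   # backward fill: the missing cell takes the next value, 6000
```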

Forward fill works on the principle that the last valid value of a column is carried forward into the missing cell of the dataset.

[Image: data frame after forward fill, with 209 propagated into the missing cell]

Notice how 209 propagated forward. 

Let’s observe backward fill too:

[Image: data frame after backward fill, with the following value propagated backward]

From the above example, you can clearly see that the value following the empty cell was propagated backwards to fill in that missing cell. 

The final technique I’m going to show you is called linear interpolation. What we do is take the mean of the values prior to and following the empty cell and use it to fill in the missing value. 
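A one-line sketch with pandas, using the same illustrative series:

```python
import pandas as pd

durations = pd.Series([146, 209, None, 6000], name="Duration_mins")
print(durations.interpolate())   # linear interpolation fills the gap with (209 + 6000) / 2 = 3104.5
```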

[Image: data frame after linear interpolation, with the missing cell filled as 3104.5]

3104.5 is the mean of 209 and 6000. As you can see, this technique is also affected by outliers.

That was a quick run-down on how to handle missing values; now let’s move on to the next section.

Feature scaling: 

Another core concept of data preprocessing is feature scaling. In simple terms, feature scaling refers to the technique of scaling multiple (quantitative) columns of your dataset to a common scale.

Assume a banking dataset has an age column, which usually ranges from 18 to 60, and a balance column, which can range from 0 to 10,000. There is an enormous difference between the values each column can assume, and a machine learning model would be skewed by the balance column: it would assign higher weights to it, treating the larger magnitude of balance as carrying more importance than age, which has a relatively lower magnitude.

To rectify this, we use the following two methods: 

  1. Normalization
  2. Standardization

Normalization scales the data into the range [0, 1], or sometimes [-1, 1]. It is affected by outliers in a dataset and is useful when you do not know the distribution of the dataset.

Standardization, on the other hand, is not bound to be within a certain range; it’s quite resistant to outliers and useful when the distribution is normal or Gaussian. 

Normalization:  
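One way to sketch this is with scikit-learn’s MinMaxScaler; the column names and values are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Age": [18, 25, 40, 60], "Balance": [0, 1500, 4200, 10000]})

df[["Age", "Balance"]] = MinMaxScaler().fit_transform(df[["Age", "Balance"]])
print(df)   # both columns now lie in the [0, 1] range
```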


Standardization:  
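The same idea, sketched with scikit-learn’s StandardScaler on the same illustrative columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Age": [18, 25, 40, 60], "Balance": [0, 1500, 4200, 10000]})

df[["Age", "Balance"]] = StandardScaler().fit_transform(df[["Age", "Balance"]])
print(df)   # each column now has mean 0 and unit variance
```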


Data encoding 

The last step of the data preprocessing stage is data encoding. This is where you encode the categorical features (columns) of your dataset into numeric values.

There are many encoding techniques available, but I’m just going to show you the implementation of one-hot encoding (pro tip: you should use this when the order of the data does not matter).

For instance, in the following example, the gender column is nominal data, meaning that no category takes precedence over another. To further clarify the concept, let’s assume, for the sake of argument, that we had a dataset of examination results for a high school class with a rank column. Rank is an example of ordinal data, as it follows a certain order and higher-ranking students take precedence over lower-ranked ones.
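Here is a minimal sketch using pandas’ get_dummies (one of several ways to one-hot encode); the rows are illustrative stand-ins for the example data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Gender": ["Female", "Male", "Male"]})

encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)
print(encoded)   # Gender_Female and Gender_Male columns of ones and zeros replace the Gender column
```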

[Image: data frame after one-hot encoding the gender column]

 

If you notice in the above example, the gender column could assume one of two values: male or female. What the one-hot encoder did was create one column per available option and then, for each row, put a one in the column matching its value (why one? Because one is the binary representation of true) and a zero everywhere else (you guessed it, zero represents false).

If you do wish to explore other techniques, here is an excellent resource for this purpose:

Blog: Types of categorical data encoding

 

Conclusion: 

It might have been a lot to take in, but you have now explored a crucial concept of data science: data preprocessing. Moreover, you are now equipped with the steps to curate your dataset in a way that will yield satisfactory results.

The journey to becoming a data scientist can seem daunting, but with the right mentorship, you can learn it seamlessly and take on real-world problems in no time. To embark on the journey of becoming a data scientist, enroll in the Data Science Bootcamp and grow your career.

External resource: 

Tableau: What is Data Cleaning? 

November 22, 2022
