

Machine learning practices are the guiding principles that transform raw data into powerful insights. By following best practices in algorithm selection, data preprocessing, model evaluation, and deployment, we unlock the true potential of machine learning and pave the way for innovation and success.

In this blog, we focus on machine learning practices—the essential steps that unlock the potential of this transformative technology. By adhering to best practices, such as selecting the right machine learning algorithms, gathering high-quality data, performing effective preprocessing, evaluating models, and deploying them strategically, we pave the path toward accurate and impactful results.

[Image: 5 essential machine learning practices]

Join us as we explore these key machine learning practices and uncover the secrets to optimizing machine-learning models for revolutionary advancements in diverse domains.

1. Choose the right algorithm

When choosing an algorithm, it is important to consider the following factors:

  • The type of problem you are trying to solve. Some algorithms are better suited for classification tasks, while others are better suited for regression tasks.
  • The amount of data you have. Some algorithms require a lot of data to train, while others can be trained with less data.
  • The desired accuracy. Some algorithms are more accurate than others.
  • The computational resources you have available. Some algorithms are more computationally expensive than others.

Once you have considered these factors, you can start to narrow down your choices of algorithms. You can then read more about each algorithm and experiment with different algorithms to see which one works best for your problem.
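As a minimal sketch of that kind of experimentation (the dataset here is synthetic and the three candidates are only examples, not a recommendation), you could compare a few scikit-learn classifiers with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for your own feature matrix X and labels y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```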

2. Get enough data

Machine learning models are only as good as the data they are trained on. If you don’t have enough data, your models will not be able to learn effectively. It is important to collect as much data as possible that is relevant to your problem. The more data you have, the better your models will be.

There are a number of different ways to collect data for machine learning projects. Some common techniques include:

  1. Web scraping: Web scraping is the process of extracting data from websites. This can be done using a variety of tools and techniques (a minimal sketch follows this list).
  2. Social media: Social media platforms can be a great source of data for machine learning projects. This data can be used to train models for tasks such as sentiment analysis and topic modeling.
  3. Sensor data: Sensor data can be used to train models for tasks such as object detection and anomaly detection. This data can be collected from a variety of sources, such as smartphones, wearable devices, and traffic cameras.
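As a tiny, hypothetical illustration of the first technique (the URL and the h2 tag are placeholders, and you should always check a site's terms of service and robots.txt before scraping):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> heading on the page as a toy example
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)
```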
[Image: Machine learning practices for data scientists]

3. Clean your data

Even if you have a lot of data, it is important to make sure that it is clean. This means removing any errors or outliers from your data. If your data is dirty, it will be difficult for your models to learn effectively. There are a number of different ways to clean your data; some common techniques include the following (a brief pandas sketch follows the list):

  • Identifying and removing errors: This can be done by looking for data that is missing, incorrect, or inconsistent.
  • Identifying and removing outliers: Outliers are data points that are significantly different from the rest of the data. They can be detected with visualizations or simple statistical rules and then removed from the dataset.
  • Imputing missing values: Missing values can be imputed by filling them in with the mean, median, or mode of the other values in the column.
  • Transforming categorical data: Categorical data can be transformed into numerical data by using a process called one-hot encoding.
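Here is a brief pandas sketch of how these techniques might be combined; the DataFrame, its column names, and the 0–120 age range are purely illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 32, None, 200],
    "color": ["red", "blue", "blue", "red", "green"],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values with the median
df = df[df["age"].between(0, 120)]                # drop an implausible outlier
df = pd.get_dummies(df, columns=["color"])        # one-hot encode the categorical column
print(df)
```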

Once you have cleaned your data, you can then proceed to train your machine learning models.

4. Evaluate your models

Once you have trained your models, it is important to evaluate their performance. This can be done by using a holdout set of data that was not used to train the models. The holdout set can be used to measure the accuracy, precision, and recall of the models.

  1. Accuracy: Accuracy is the percentage of data points that are correctly classified by the model.
  2. Precision: Precision is the percentage of data points that are classified as positive that are actually positive.
  3. Recall: Recall is the percentage of positive data points that are correctly classified as positive.

The ideal model would have high accuracy, precision, and recall. However, in practice, it is often necessary to trade off between these metrics. For example, a model with high accuracy may have low precision or recall.
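To make the metrics concrete, here is a small scikit-learn sketch with made-up labels and predictions, purely for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # made-up ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up model predictions

print("accuracy: ", accuracy_score(y_true, y_pred))    # share of all points classified correctly
print("precision:", precision_score(y_true, y_pred))   # of predicted positives, how many are truly positive
print("recall:   ", recall_score(y_true, y_pred))      # of actual positives, how many were found
```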

Once you have evaluated your models, you can then choose the model that has the best performance. You can then deploy the model to production and use it to make predictions.

5. Deploy your models

Once you are satisfied with the performance of your models, it is time to deploy them. This means making them available to users so that they can use them to make predictions. There are many different ways to deploy machine learning models, such as through a web service or a mobile app.
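As a hypothetical sketch of the web-service route (the model.pkl file, the feature format, and the choice of Flask are assumptions here; a production deployment would also need input validation, logging, and error handling):

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model.pkl", "rb") as f:      # hypothetical pickled, already-trained model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```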

Deploying your machine learning models is considered good practice because it enables their practical utilization by making them accessible to users. It also has the potential to reach a broader audience, maximizing their impact.

By making your models accessible, you enable a wider range of users to benefit from the predictive capabilities of machine learning, driving decision-making processes and generating valuable outcomes.

Popular machine-learning algorithms

Here are some of the most popular machine-learning algorithms:

  1. Decision trees: Decision trees are a simple but effective algorithm for classification tasks. They work by dividing the data into smaller and smaller groups until each group can be classified with a high degree of accuracy.
  2. Linear regression: Linear regression is a simple but effective algorithm for regression tasks. It works by finding a line that best fits the data.
  3. Support vector machines: Support vector machines are a more complex algorithm that can be used for both classification and regression tasks. They work by finding a hyperplane that separates the data into two groups.
  4. Neural networks: Neural networks are powerful but complex algorithms that can be used for a variety of tasks, including classification, regression, and natural language processing.

It is important to note that there is no single “best” machine learning practice or algorithm. The best algorithm for a particular problem will depend on the specific characteristics of that problem.

In a nutshell

Machine learning practices are essential for accurate and reliable results. Choose the right algorithm, gather quality data, clean and preprocess it, evaluate model performance, and deploy your models effectively. By following these practices, you improve accuracy, support better decision-making, and solve real-world problems.

 

May 24, 2023

This blog explores the important steps to follow in the data preprocessing stage, such as removing duplicates, fixing structural errors, detecting and handling outliers, type conversion, dealing with missing values, and data encoding.

What is data preprocessing?

A common mistake that many novice data scientists make is skipping the data wrangling stage and diving right into model building, which in turn produces a poor-performing machine learning model.


This reflects a popular concept in the field of data science called GIGO (Garbage In, Garbage Out), which means that inferior-quality data will always yield poor results, irrespective of the model and optimization technique used.

Hence, an ample amount of time needs to be invested in ensuring that the quality of the data is up to standard. In fact, data scientists spend around 80% of their time on the data pre-processing phase alone. But fret not, because we will walk through the various steps you can follow to ensure that your data is preprocessed before you move ahead in the data science pipeline.

 


 

Let’s look at the steps of data pre-processing to understand it better:

Removing duplicates: 

You may often encounter repeated entries in your dataset, which is not a good sign because duplicates are an extreme case of non-random sampling and tend to bias the model. Including repeated entries will lead to the model overfitting this subset of points, so they must be removed.

We will demonstrate this with the help of an example. Let’s say we had a movie data set as follows: 

As we can see, the movie title “The Dark Knight” is repeated at the 3rd index (fourth entry) in the data frame and needs to be taken care of. 

[Image: movie data frame with “The Dark Knight” repeated at index 3]

Using the code below, we can remove the duplicate entries from the dataset based on the “Title” column and only keep the first occurrence of the entry. 
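Here is a minimal pandas sketch of that step, assuming the movies live in a DataFrame called df (the rows below are illustrative stand-ins for the data frame above):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Shining", "Inception", "The Dark Knight", "The Dark Knight"],
    "Director": ["Stanley Kubrick", "Christopher Nolan", "Christopher Nolan", "Christopher Nolan"],
    "Duration_mins": [146, 148, 152, 152],
})

# Keep only the first occurrence of each title
df = df.drop_duplicates(subset=["Title"], keep="first").reset_index(drop=True)
print(df)
```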

[Image: data frame after removing the duplicate entry]

 

Just by writing a few lines of code, you ensure your data is free from any duplicate entries. That’s how easy it is! 

Fix structural errors: 

Structural errors in a dataset refer to entries that have typos or inconsistent spellings:

[Image: data frame containing typos and inconsistent spellings]

Here, you can easily spot the different typos and inconsistencies, but what if the dataset were huge? You can check all the unique values and their corresponding occurrences with a value count.


Once you identify the entries to be fixed, simply replace them with the correct version, as sketched below.
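Here is a minimal sketch of both steps; the “Genre” column and its values are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

df = pd.DataFrame({"Genre": ["Action", "action", "Act ion", "Drama", "drama"]})

print(df["Genre"].value_counts())    # reveals the typos and inconsistent spellings

df["Genre"] = df["Genre"].replace({"action": "Action", "Act ion": "Action", "drama": "Drama"})
print(df["Genre"].value_counts())    # every entry now uses a consistent spelling
```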


Voila! That is how you fix the structural errors. 

 

Detecting and handling outliers: 

Before we dive into detecting and handling outliers, let’s discuss what an outlier is.  

“An outlier is any value in a dataset that drastically deviates from the rest of the data points.”

Let’s say we have a dataset of a streaming service with the ages of users ranging from 18 to 60, but there exists a user whose age is registered as 200. This data point is an example of an outlier and can mess up our machine-learning model if not taken care of. 

 


 

There are numerous techniques that can be employed to detect and remove outliers in a dataset, but the two I am going to discuss are:

  1. Box plots
  2. Z-score

Let’s assume the following data set: 

[Image: sample data set with an Age column]

 

If we use the describe function of pandas on the Age column, we can see the five-number summary along with the count, mean, and standard deviation of that column. Combining this with domain-specific knowledge (in this case, we know that implausibly large ages are likely the result of human error), we can deduce that there are outliers in the dataset: the mean is 38.92 while the maximum value is 92.

[Image: summary statistics for the Age column, showing a mean of 38.92 and a maximum of 92]

As we have got some idea about what outliers are, let’s see some code in action to detect and remove the outliers 

Box Plots: 

Box plots, also called “box and whisker plots,” show the five-number summary of the features under consideration and are an effective way of visualizing outliers.
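As a minimal sketch with matplotlib (the ages below are illustrative, with 200 standing in for an extreme value like the age-200 example mentioned earlier):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative ages with one extreme value standing in for the outlier
df = pd.DataFrame({"Age": list(range(18, 61, 3)) + [200]})

plt.boxplot(df["Age"])    # the 200 shows up as an isolated point beyond the whisker
plt.ylabel("Age")
plt.show()
```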

[Image: box plot of the Age column with outlier points beyond the whiskers]

As we can see from the above figure, there are a number of data points that are outliers. So now we move on to the Z-score, a method through which we are going to set a threshold and remove the outlier entries from our dataset.

Z-Score: 

A z-score determines the position of a data point in terms of its distance from the mean when measured in standard deviation units. 

We first calculate the Z-score of the feature column and then filter the rows against a threshold, as sketched below.


For a standard normal distribution, about 99.7% of the values fall between Z-scores of –3 and +3, so in practice the threshold is often set to 3: anything beyond that is deemed an outlier and removed from the dataset if it is not a legitimate observation.
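Here is a minimal sketch of both steps, computing the Z-scores by hand and keeping only the rows within the ±3 threshold; the DataFrame and its ages are illustrative, not the exact data shown above:

```python
import pandas as pd

df = pd.DataFrame({"Age": list(range(18, 61, 3)) + [200]})   # illustrative ages with one extreme value

z = (df["Age"] - df["Age"].mean()) / df["Age"].std()   # Z-score: distance from the mean in std-dev units
df_clean = df[z.abs() <= 3]                             # keep rows within the +/-3 threshold; the 200 row is dropped
print(df_clean)
```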


Type Conversion: 

Type conversion is needed when certain columns are not of the appropriate data type. For instance, in the following data frame, three out of the four columns are of the object data type:

[Image: data frame dtypes, with three of the four columns stored as object]

Well, we don’t want that, right? It could produce unexpected results and errors. We are going to convert Title and Director to the string data type, and Duration_mins to the integer data type.
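A minimal sketch of these conversions, with illustrative rows and the column names from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Dark Knight", "Inception"],
    "Director": ["Christopher Nolan", "Christopher Nolan"],
    "Duration_mins": ["152", "148"],      # mistakenly stored as strings (object dtype)
})

df["Title"] = df["Title"].astype("string")
df["Director"] = df["Director"].astype("string")
df["Duration_mins"] = df["Duration_mins"].astype(int)
print(df.dtypes)
```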

 


Dealing with missing values: 

Often, a dataset contains numerous missing values, which can be a problem. To name a few issues, they can lead to biased estimates or decrease the representativeness of the sample under consideration.

This brings us to the question of how to deal with them.

One thing you could do is simply drop them all. Notice that index 5 has a few missing values; when the “dropna” command is applied, that row is dropped from the dataset.
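A minimal sketch of this approach, using an illustrative data frame with one missing duration:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Shining", "Inception", "The Dark Knight"],
    "Duration_mins": [146, None, 152],
})

df = df.dropna()    # any row containing at least one missing value is removed
print(df)
```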

[Image: data frame before and after dropna; the row at index 5 is removed]

 

But what do you do when you have a limited number of rows in a dataset? You could use different imputation methods, such as the measures of central tendency, to fill those empty cells.

The measures include:

  1. Mean: The mean is the average of a data set. It is sensitive to outliers.
  2. Median: The median is the middle value of the set of numbers. It is resistant to outliers.
  3. Mode: The mode is the most common number in a data set.

It is better to use the median instead of the mean because it does not deviate drastically in the presence of outliers. Allow me to elaborate with an example.

[Image: data frame with a missing duration and the 6000-minute “Hunger!” entry]

Notice how there is a documentary by the name “Hunger!” with “Duration_mins” equal to 6000. Now observe the difference when I replace the missing value in the duration column with the mean and with the median.
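Here is a minimal sketch of the comparison; the durations are illustrative stand-ins, with 6000 playing the role of the “Hunger!” outlier:

```python
import pandas as pd

durations = pd.Series([146, 152, 148, None, 6000], name="Duration_mins")

print(durations.fillna(durations.mean()))     # the mean is dragged far upward by the 6000 outlier
print(durations.fillna(durations.median()))   # the median barely moves, so the filled value stays plausible
```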

[Image: the duration column filled with the mean (1129) and with the median (152)]

 

If you search the internet for the duration of the movie “The Shining,” you’ll find that it’s about 146 minutes. So isn’t 152 minutes (the median) much closer than the 1129 calculated from the mean?

A few other techniques to fill the missing values that you can explore are forward fill and backward fill. 
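Both can be sketched in a couple of lines; the series below is illustrative, with the missing cell sitting between 209 and 6000 as in the example that follows:

```python
import pandas as pd

durations = pd.Series([146, 209, None, 6000], name="Duration_mins")

print(durations.ffill())   # forward fill: the missing cell takes the previous value, 209
print(durations.bfill())   # backward fill: the missing cell takes the next value, 6000
```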

Forward fill works on the principle that the last valid value of a column is carried forward into the missing cell of the dataset.

[Image: data frame after forward fill, with 209 propagated into the missing cell]

Notice how 209 propagated forward. 

Let’s observe backward fill too:

[Image: data frame after backward fill, with the following value propagated backward]

From the above example, you can clearly see that the value following the empty cell was propagated backwards to fill in that missing cell. 

The final technique I’m going to show you is called linear interpolation. What we do is take the mean of the values prior to and following the empty cell and use it to fill in the missing value. 
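A one-line sketch with pandas, using the same illustrative series:

```python
import pandas as pd

durations = pd.Series([146, 209, None, 6000], name="Duration_mins")
print(durations.interpolate())   # linear interpolation fills the gap with (209 + 6000) / 2 = 3104.5
```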

[Image: data frame after linear interpolation, with the missing cell filled as 3104.5]

3104.5 is the mean of 209 and 6000. As you can see, this technique is also affected by outliers.

That was a quick run-down on how to handle missing values; now let’s move on to the next section.

Feature scaling: 

Another core concept of data preprocessing is feature scaling. In simple terms, feature scaling refers to the technique of scaling multiple (quantitative) columns of your dataset to a common scale.

Assume a banking dataset has an age column, which usually ranges from 18 to 60, and a balance column, which can range from 0 to 10,000. There is an enormous difference between the values each column can assume, and a machine learning model would be skewed by the balance column: it would assign higher weights to it, treating the larger magnitude of balance as carrying more importance than age, which has a relatively lower magnitude.

To rectify this, we use the following two methods: 

  1. Normalization
  2. Standardization

Normalization scales the data into the range [0, 1], or sometimes [-1, 1]. It is affected by outliers in a dataset and is useful when you do not know the distribution of the dataset.

Standardization, on the other hand, is not bound to be within a certain range; it’s quite resistant to outliers and useful when the distribution is normal or Gaussian. 

Normalization:  
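One way to sketch this is with scikit-learn’s MinMaxScaler; the column names and values are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Age": [18, 25, 40, 60], "Balance": [0, 1500, 4200, 10000]})

df[["Age", "Balance"]] = MinMaxScaler().fit_transform(df[["Age", "Balance"]])
print(df)   # both columns now lie in the [0, 1] range
```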


Standardization:  
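The same idea, sketched with scikit-learn’s StandardScaler on the same illustrative columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Age": [18, 25, 40, 60], "Balance": [0, 1500, 4200, 10000]})

df[["Age", "Balance"]] = StandardScaler().fit_transform(df[["Age", "Balance"]])
print(df)   # each column now has mean 0 and unit variance
```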


Data encoding 

The last step of the data preprocessing stage is data encoding. This is where you encode the categorical features (columns) of your dataset into numeric values.

There are many encoding techniques available, but I’m just going to show you the implementation of one-hot encoding (pro tip: you should use this when the order of the data does not matter).

For instance, in the following example, the gender column is nominal data, meaning that no category takes precedence over another. To further clarify the concept, let’s assume, for the sake of argument, that we had a dataset of examination results for a high school class with a rank column. Rank is an example of ordinal data, as it follows a certain order and higher-ranking students take precedence over lower-ranked ones.
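Here is a minimal sketch using pandas’ get_dummies (one of several ways to one-hot encode); the rows are illustrative stand-ins for the example data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Gender": ["Female", "Male", "Male"]})

encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)
print(encoded)   # Gender_Female and Gender_Male columns of ones and zeros replace the Gender column
```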

[Image: data frame after one-hot encoding the gender column]

 

If you notice in the above example, the gender column could assume one of two values: male or female. What the one-hot encoder did was create one column per available option and then, for each row, put a one in the column matching its value (why one? Because one is the binary representation of true) and a zero everywhere else (you guessed it, zero represents false).

If you do wish to explore other techniques, here is an excellent resource for this purpose:

Blog: Types of categorical data encoding

 

Conclusion: 

It might have been a lot to take in, but you have now explored a crucial concept of data science: data preprocessing. Moreover, you are now equipped with the steps to curate your dataset in a way that will yield satisfactory results.

The journey to becoming a data scientist can seem daunting, but with the right mentorship, you can learn it seamlessly and take on real-world problems in no time. To embark on the journey of becoming a data scientist, enroll in the Data Science Bootcamp and grow your career.

External resource: 

Tableau: What is Data Cleaning? 

November 22, 2022
