

In data analysis, understanding data distributions is crucial. Just as important is understanding the difference between discrete and continuous data distributions, so you can make informed decisions.

Whether analyzing customer behavior, tracking weather, or conducting research, understanding your data type and distribution leads to better analysis, accurate predictions, and smarter strategies.

Think of it as a map that shows where most of your data points cluster and how they spread out. This map is essential for making sense of your data, revealing patterns, and guiding you on the journey to meaningful insights.

Let’s take a deeper look into the world of discrete and continuous data distributions to elevate your data analysis skills.

 


 

What is Data Distribution?

A data distribution describes how points in a dataset are spread across different values or ranges. It helps us understand patterns, frequencies, and variability in the data. For example, it can show how often certain values occur or if the data clusters around specific points.

This mapping of data points provides a snapshot of the data’s behavior. Understanding these distributions is crucial for choosing the right tools and visualizations for analysis and effective storytelling.

These distributions can be represented in various forms. Some common examples include histograms, probability density functions (PDFs) for continuous data, and probability mass functions (PMFs) for discrete data. All of these representations fall into two main categories: discrete and continuous data distributions.

 

Explore 7 types of statistical distributions with examples

 

Discrete Data Distributions

Discrete data consists of distinct, separate values that are countable: the data can take only a specific number of possible values. It often represents whole numbers or counts, such as the number of students in a class or the number of cars passing through an intersection. This type of data does not include fractions or decimals.

Some common types of discrete data distributions include:

1. Binomial Distribution

The binomial distribution measures the probability of getting a fixed number of successes in a specific number of independent trials, each with the same probability of success. It is based on two possible outcomes: success or failure.

Common examples include flipping a coin multiple times and counting the number of heads, or counting the number of defective items in a batch of products.
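As a quick illustration, here is a short Python sketch of the coin-flip example, assuming SciPy is available (the numbers are purely illustrative):

```python
from scipy.stats import binom

# Probability of getting exactly 6 heads in 10 fair coin flips
p_six_heads = binom.pmf(k=6, n=10, p=0.5)
print(f"P(exactly 6 heads in 10 flips) ≈ {p_six_heads:.3f}")  # ≈ 0.205
```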

2. Poisson Distribution

The Poisson distribution describes the probability of a given number of events happening in a fixed interval of time or space. This distribution is used for events that occur independently and at a constant average rate.

It can be used in instances such as counting the number of emails received in an hour or recording the number of accidents at a crossroads in a week.
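For instance, a minimal sketch of the email example, assuming a made-up average of 4 emails per hour and SciPy installed:

```python
from scipy.stats import poisson

# Emails arrive independently at an average rate of 4 per hour (illustrative)
p_six_emails = poisson.pmf(k=6, mu=4)
print(f"P(exactly 6 emails in an hour) ≈ {p_six_emails:.3f}")  # ≈ 0.104
```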

 

Read more about the Poisson process in data analytics

 

3. Geometric Distribution

The geometric distribution measures the probability of the number of failures before achieving the first success in a series of independent trials. It focuses on the number of trials needed to get the first success.

Some scenarios to use this distribution include:

  • The number of sales calls made before making the first sale
  • The number of attempts needed to get the first heads in a series of coin flips
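For the sales-call scenario above, a minimal sketch (the 20% close rate is a made-up figure, and note that SciPy's geometric distribution counts the trial on which the first success occurs):

```python
from scipy.stats import geom

# Each sales call closes a sale with probability 0.2 (illustrative)
p_first_sale_on_5th_call = geom.pmf(k=5, p=0.2)
print(f"P(first sale happens on the 5th call) ≈ {p_first_sale_on_5th_call:.3f}")  # ≈ 0.082
```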

These discrete data distributions provide essential tools for understanding and predicting scenarios with countable outcomes. Each type has unique applications that make it powerful for analyzing real-world events.

Continuous Data Distributions

Continuous data consists of values that can take on any number within a given range. Unlike discrete data, continuous data can include fractions and decimals. It is often collected through measurements and can represent very precise values.

Some unique characteristics of continuous data are:

  • it is measurable – obtained through measuring values
  • infinite values – it can take on an infinite number of values within any given range

For instance, if you measure the height and weight of a person, take temperature readings, or record the duration of any events, you are actually dealing with and measuring continuous data points.

A few examples of continuous data distributions can include:

1. Normal Distribution

The normal distribution, also known as the Gaussian distribution, is one of the most commonly used continuous distributions. It is represented by a bell-shaped curve where most data points cluster around the mean. Normal distributions are suitable when you are measuring the heights of people or test scores in a large population.
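A small sketch of what this looks like in practice (the mean and standard deviation below are made up): roughly 68% of simulated values fall within one standard deviation of the mean.

```python
import numpy as np

# Simulate 10,000 test scores from a Normal(mean=70, std=10) distribution
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=10_000)

# For a normal distribution, about 68% of values lie within one standard deviation
within_one_sd = np.mean((scores > 60) & (scores < 80))
print(f"Share of scores between 60 and 80: {within_one_sd:.2f}")  # ≈ 0.68
```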

2. Exponential Distribution

The exponential distribution models the time between consecutive events in a Poisson process. It is often used to describe the time until an event occurs. Common examples of data measurement for this distribution include the time between bus arrivals or the time until a radioactive particle decays.

3. Weibull Distribution

The Weibull distribution is used primarily for reliability testing and predicting the time until a system fails. It can take various shapes depending on its parameters. This distribution can be used to measure the lifespan of mechanical parts or the time to failure of devices.
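As a rough sketch of how these two distributions can be queried in Python (SciPy assumed; the rate, shape, and scale values are illustrative):

```python
from scipy.stats import expon, weibull_min

# Exponential: time between bus arrivals, averaging one bus every 10 minutes
bus_gap = expon(scale=10)
print(f"P(next bus arrives within 5 minutes) ≈ {bus_gap.cdf(5):.3f}")  # ≈ 0.393

# Weibull: lifespan of a mechanical part with shape 1.5 and scale 1,000 hours
lifespan = weibull_min(c=1.5, scale=1000)
print(f"P(part fails before 500 hours) ≈ {lifespan.cdf(500):.3f}")     # ≈ 0.298
```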

Understanding these types of continuous distributions is crucial for analyzing data accurately and making informed decisions based on precise measurements.

Discrete vs Continuous Data Distribution Debate

Understanding the differences between discrete and continuous data distributions is essential for effective data analysis. Each type models data in distinct ways and requires different statistical approaches.

 

Discrete vs continuous data distributions

 

Let’s break down the key aspects of the debate.

Nature of Data Points

Discrete data consists of countable values. You can count these distinct values, such as the number of cars passing through an intersection or the number of students in a class.

Continuous data, on the other hand, consists of measurable values. These values can be any number within a given range, including fractions and decimals. Examples include height, weight, and temperature. Continuous data reflects measurements that can vary smoothly over a scale.

Discrete Data Representation

Discrete data is represented using bar charts or histograms. These visualizations are effective for displaying and comparing the frequency of distinct categories or values.

Bar Graph

Each bar in a bar chart represents a distinct value or category. The height of the bar indicates the frequency or count of each value. Bar charts are effective for displaying and comparing the number of occurrences of distinct categories. Here are some key points about bar charts:

  • Distinct Bars: Each bar stands alone, representing a specific, countable value.
  • Clear Comparison: Bar charts make it easy to compare different categories or values.
  • Simple Visualization: They provide a straightforward visual comparison of discrete data.

For example, if you are counting the number of students in different classes, each bar on the chart will represent a class and its height will show the number of students in that class.

Histogram

A histogram is similar to a bar chart but is used for the grouped frequency of discrete data. Each bar of a histogram represents a range of values, helping visualize the distribution of data across different intervals. Key features include:

  • Adjacent Bars: Bars have no gaps between them, indicating that the bins cover a continuous range of values
  • Interval Width (Bins): The width of each bar (bin) represents a specific range of values – narrow bins show more detail, while wider bins provide a smoother overview
  • Central Tendency and Variability: Histograms reveal the central tendency (mean, median, mode) and variability (spread) of the data, as well as the shape of the distribution, such as normal, skewed, or bimodal
  • Outlier Detection: They help in detecting outliers or unusual observations in the data

 

Master the top 7 statistical techniques for data analysis

 

Continuous Data Representation

On the other hand, continuous data is best represented using line graphs, frequency polygons, or density plots. These methods effectively show trends and patterns in data that vary smoothly over a range.

Line Graph

It connects data points with a continuous line, showing how the data changes over time or across different conditions. This is ideal for displaying trends and patterns in data that can take on any value within a range. Key features of line graphs include:

  • Continuous Line: Data points are connected by a line, representing the smooth flow of data
  • Trends and Patterns: Line graphs effectively show how data changes over a period or under different conditions
  • Detailed Measurement: They can display precise measurements, including fractions and decimals

For example, suppose you are tracking the temperature changes throughout the day. In that case, a line graph will show the continuous variation in temperature with a smooth line connecting all the data points.

Frequency Polygon

A frequency polygon connects points representing the frequencies of different values. It provides a clear view of the distribution of continuous data, making it useful for identifying peaks and patterns in the data distribution. Key features of a frequency polygon are as follows:

  • Line Segments: Connect points plotted above the midpoints of each interval
  • Area Under the Curve: Helpful in understanding the overall distribution and density of data
  • Comparison Tool: Used to compare multiple distributions on the same graph

Density Plot

A density plot displays the probability density function of the data. It offers a smoothed representation of data distribution. This representation of data is useful to identify peaks, valleys, and overall patterns in continuous data. Notable features of a density plot include:

  • Peaks and Valleys: Plot highlights peaks (modes) where data points are concentrated and valleys where data points are sparse
  • Area Under the Curve: Total area under the density curve equals 1
  • Bandwidth Selection: Smoothness of the curve depends on the bandwidth parameter – a smaller bandwidth results in a more detailed curve, while a larger bandwidth provides a smoother curve

Probability Function for Discrete Data

Discrete data distributions use a Probability Mass Function (PMF) to describe the likelihood of each possible outcome. The PMF assigns a probability to each distinct value in the dataset.

A PMF gives the probability that a discrete random variable is exactly equal to some value. It applies to data that can take on a finite or countable number of values. The sum of the probabilities for all possible values in a discrete distribution is equal to 1.

For example, if you consider rolling a six-sided die – the PMF for this scenario would assign a probability of 1/6 to each of the outcomes (1, 2, 3, 4, 5, 6) since each outcome is equally likely.
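This example is easy to verify with a few lines of Python using only the standard library:

```python
from fractions import Fraction

# PMF of a fair six-sided die: each outcome has probability 1/6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

print(pmf[3])             # 1/6 -- probability of rolling a 3
print(sum(pmf.values()))  # 1   -- the probabilities of a PMF always sum to 1
```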

 

Read more about the 9 key probability distributions in data science

 

Probability Function for Continuous Data

Meanwhile, continuous data distributions use a Probability Density Function (PDF), which describes the likelihood of a continuous random variable falling within a particular range of values.

It applies to data that can take on an infinite number of values within a given range. The area under the curve of a PDF over an interval represents the probability of the variable falling within that interval. The total area under the curve is equal to 1.

For instance, you can look into the distribution of heights in a population. The PDF might show that the probability of a person’s height falling between 160 cm and 170 cm is represented by the area under the curve between those two points.
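A small sketch of that height example, assuming (purely for illustration) that heights follow a normal distribution with a mean of 165 cm and a standard deviation of 10 cm, and that SciPy is available:

```python
from scipy.stats import norm

# Hypothetical population of heights: Normal(mean=165 cm, std dev=10 cm)
heights = norm(loc=165, scale=10)

# Area under the PDF between 160 cm and 170 cm
prob = heights.cdf(170) - heights.cdf(160)
print(f"P(160 cm <= height <= 170 cm) ≈ {prob:.3f}")  # ≈ 0.383
```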

Understanding these differences is an important step towards better data handling. Let’s take a closer look at why knowing the distinction between continuous and discrete data distributions matters.

 


 

Why is it Important to Understand the Type of Data Distribution?

Understanding the type of data you’re working with is crucial. It can make or break your analysis. Let’s dive into why this is so important.

Selecting the Right Statistical Tests and Tools

Knowing the distribution of your data helps you make more accurate decisions. Different types of distributions provide insights into various aspects of your data, such as central tendency, variability, and skewness. Hence, knowing whether your data is discrete or continuous helps you choose the right statistical tests and tools.

Discrete data, like the number of customers visiting a store, requires different tests than continuous data, such as the time they spend shopping. Using the wrong tools can lead to inaccurate results, which can be misleading.

 

Explore the 6 key AI tools for data analysis

 

Making Accurate Predictions and Models

When you understand your data type, you can make more accurate predictions and build better models. Continuous data, for example, allows for more nuanced predictions. Think about predicting customer spending over time. With continuous data, you can capture every little change and trend. This leads to more precise forecasts and better business strategies.

Understanding Probability and Risk Assessment

Data types also play a key role in understanding probability and risk assessment. Continuous data helps in assessing risks over a range of values, like predicting the likelihood of investment returns. Discrete data, on the other hand, can help in evaluating the probability of specific events, such as the number of defective products in a batch.

 


 

Practical Applications in Business

Data types have practical applications in various business areas. Here are a few examples:

Customer Trends Analysis

By analyzing discrete data like the number of purchases, businesses can spot trends and patterns. This helps in understanding customer behavior and preferences. Continuous data, such as the duration of customer visits, adds depth to this analysis, revealing more about customer engagement.

Marketing Strategies

In marketing, knowing your data type aids in crafting effective strategies. Discrete data can tell you how many people clicked on an ad, while continuous data can show how long they interacted with it. This combination helps in refining marketing campaigns for better results.

Financial Forecasting

For financial forecasting, continuous data is invaluable. It helps in predicting future revenue, expenses, and profits with greater precision. Discrete data, like the number of transactions, complements this by providing clear, countable benchmarks.

 

Understand the important data analysis processes for your business

 

Understanding whether your data is discrete or continuous is more than just a technical detail. It’s the foundation for accurate analysis, effective decision-making, and successful business strategies. Make sure you get it right! Remember, the key to mastering data analysis is to always know your data type.

Take Your First Step Towards Data Analysis

Understanding data distributions is like having a map to navigate the world of data analysis. It shows you where your data points cluster and how they spread out, helping you make sense of your data.

Whether you’re analyzing customer behavior, tracking weather patterns, or conducting research, knowing your data type and distribution leads to better analysis, accurate predictions, and smarter strategies.

Discrete data gives you countable, distinct values, while continuous data offers a smooth range of measurements. By mastering both discrete and continuous data distributions, you can choose the right methods to uncover meaningful insights and make informed decisions.

So, dive into the world of data distribution and learn about continuous vs discrete data distributions to elevate your analytical skills. It’s the key to turning raw data into actionable insights and making data-driven decisions with confidence. You can kickstart your journey in data analytics with our Data Science Bootcamp!

 


November 22, 2024

In the world of machine learning, evaluating the performance of a model is just as important as building the model itself. One of the most fundamental tools for this purpose is the confusion matrix. This powerful yet simple concept helps data scientists and machine learning practitioners assess the accuracy of classification algorithms, providing insights into how well a model is performing in predicting various classes.

In this blog, we will explore the concept of a confusion matrix using a spam email example and highlight the 4 key metrics you must understand when working with a confusion matrix.

 


 

What is a Confusion Matrix?

A confusion matrix is a table that is used to describe the performance of a classification model. It compares the actual target values with those predicted by the model. This comparison is done across all classes in the dataset, giving a detailed breakdown of how well the model is performing. 

Here’s a simple layout of a confusion matrix for a binary classification problem:

                      Predicted Positive      Predicted Negative
Actual Positive       True Positive (TP)      False Negative (FN)
Actual Negative       False Positive (FP)     True Negative (TN)

In a binary classification problem, the confusion matrix consists of four key components: 

  1. True Positive (TP): The number of instances where the model correctly predicted the positive class. 
  2. False Positive (FP): The number of instances where the model incorrectly predicted the positive class when it was actually negative. Also known as Type I error. 
  3. False Negative (FN): The number of instances where the model incorrectly predicted the negative class when it was actually positive. Also known as Type II error. 
  4. True Negative (TN): The number of instances where the model correctly predicted the negative class.

Why is the Confusion Matrix Important?

The confusion matrix provides a more nuanced view of a model’s performance than a single accuracy score. It allows you to see not just how many predictions were correct, but also where the model is making errors, and what kind of errors are occurring. This information is critical for improving model performance, especially in cases where certain types of errors are more costly than others. 

For example, in medical diagnosis, a false negative (where the model fails to identify a disease) could be far more serious than a false positive. In such cases, the confusion matrix helps in understanding these errors and guiding the development of models that minimize the most critical types of errors.

 

Also learn about the Random Forest Algorithm and its uses in ML

 

Scenario: Email Spam Classification

Suppose you have built a machine learning model to classify emails as either “Spam” or “Not Spam.” You test your model on a dataset of 100 emails, and the actual and predicted classifications are compared. Here’s how the results could break down: 

  • Total emails: 100 
  • Actual Spam emails: 40 
  • Actual Not Spam emails: 60

After running your model, the results are as follows: 

  • Correctly predicted Spam emails (True Positives, TP): 35
  • Incorrectly predicted Spam emails (False Positives, FP): 10
  • Incorrectly predicted Not Spam emails (False Negatives, FN): 5
  • Correctly predicted Not Spam emails (True Negatives, TN): 50

                      Predicted Spam      Predicted Not Spam
Actual Spam           TP = 35             FN = 5
Actual Not Spam       FP = 10             TN = 50

Understanding 4 Key Metrics Derived from the Confusion Matrix

The confusion matrix serves as the foundation for several important metrics that are used to evaluate the performance of a classification model. These include:

1. Accuracy


  • Formula for Accuracy in a Confusion Matrix:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Explanation: Accuracy measures the overall correctness of the model by dividing the sum of true positives and true negatives by the total number of predictions.

  • Calculation for accuracy in the given confusion matrix:

Accuracy = (35 + 50) / (35 + 50 + 10 + 5) = 85 / 100 = 0.85

This means the model correctly classified 85% of the emails.

2. Precision


  • Formula for Precision in a Confusion Matrix:

Precision = TP / (TP + FP)

Explanation: Precision (also known as positive predictive value) is the ratio of correctly predicted positive observations to the total predicted positives.

It answers the question: Of all the positive predictions, how many were actually correct?

  • Calculation for precision of the given confusion matrix

Precision = 35 / (35 + 10) = 35 / 45 ≈ 0.78

Of all the emails predicted as Spam, about 78% were actually Spam.

 


 

3. Recall (Sensitivity or True Positive Rate)


  • Formula for Recall in a Confusion Matrix

Recall = TP / (TP + FN)

Explanation: Recall measures the model’s ability to correctly identify all positive instances. It answers the question: Of all the actual positives, how many did the model correctly predict?

  • Calculation for recall in the given confusion matrix

Recall = 35 / (35 + 5) = 35 / 40 = 0.875

The model correctly identified 87.5% of the actual Spam emails.

4. F1 Score

  • F1 Score Formula:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Explanation: The F1 score is the harmonic mean of precision and recall. It is especially useful when the class distribution is imbalanced, as it balances the two metrics.

  • F1 Calculation:

F1 Score = 2 × (0.78 × 0.875) / (0.78 + 0.875) ≈ 0.82

An F1 score of about 0.82 shows that the model balances Precision and Recall well, summarizing performance in a single metric.
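As a quick sanity check, all four metrics can be reproduced from the confusion matrix counts with a few lines of Python:

```python
# Counts from the spam example above
TP, FP, FN, TN = 35, 10, 5, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.85
print(f"Precision: {precision:.2f}")  # ≈ 0.78
print(f"Recall:    {recall:.3f}")     # 0.875
print(f"F1 score:  {f1:.2f}")         # ≈ 0.82
```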

 

Understand the basics of Binomial Distribution and its importance in ML

 

Interpreting the Key Metrics

  • High Recall: The model is good at identifying actual Spam emails (high Recall of 87.5%). 
  • Moderate Precision: However, it also incorrectly labels some Not Spam emails as Spam (Precision of 78%). 
  • Balanced Accuracy: The overall accuracy is 85%, meaning the model performs well, but there is room for improvement in reducing false positives and false negatives. 
  • Solid F1 Score: The F1 Score of 82% reflects a good balance between Precision and Recall, meaning the model is reasonably effective at identifying true positives without generating too many false positives. This balanced metric is particularly valuable in evaluating the model’s performance in situations where both false positives and false negatives are important.

 


 

Conclusion

The confusion matrix is an indispensable tool in the evaluation of classification models. By breaking down the performance into detailed components, it provides a deeper understanding of how well the model is performing, highlighting both strengths and weaknesses. Whether you are a beginner or an experienced data scientist, mastering the confusion matrix is essential for building effective and reliable machine learning models.

September 23, 2024

In the world of data analysis, drawing insights from a limited dataset can often be challenging. Traditional statistical methods sometimes fall short when it comes to deriving reliable estimates, especially with small or skewed datasets. This is where bootstrap sampling, a powerful and versatile statistical technique, comes into play.

In this blog, we’ll explore what bootstrap sampling is, how it works, and its various applications in the field of data analysis.

What is Bootstrap Sampling?

 

A visual representation of the bootstrap sampling scheme

 

Bootstrap sampling is a resampling method that involves repeatedly drawing samples from a dataset with replacement to estimate the sampling distribution of a statistic.

Essentially, you take multiple random samples from your original data, calculate the desired statistic for each sample, and use these results to infer properties about the population from which the original data was drawn.

 

Learn about boosting algorithms in machine learning

 

Why do we Need Bootstrap Sampling?

This is a fundamental question I’ve seen machine learning enthusiasts grapple with. What is the point of bootstrap sampling? Where can you use it? Let me take an example to explain this. 

Let’s say we want to find the mean height of all the students in a school (which has a total population of 1,000). So, how can we perform this task? 

One approach is to measure the height of a random sample of students and then compute the mean height. I’ve illustrated this process below.

Traditional Approach

 

Traditional method of sampling a distribution

 

  1. Draw a random sample of 30 students from the school. 
  2. Measure the heights of these 30 students. 
  3. Compute the mean height of this sample. 

However, this approach has limitations. The mean height calculated from this single sample might not be a reliable estimate of the population mean due to sampling variability. If we draw a different sample of 30 students, we might get a different mean height.

To address this, we need a way to assess the variability of our estimate and improve its accuracy. This is where bootstrap sampling comes into play.

Bootstrap Approach

 

Implementing bootstrap sampling

 

  1. Draw a random sample of 30 students from the school and measure their heights. This is your original sample. 
  2. From this original sample, create many new samples (bootstrap samples) by randomly selecting students with replacement. For instance, generate 1,000 bootstrap samples. 
  3. For each bootstrap sample, calculate the mean height. 
  4. Use the distribution of these 1,000 bootstrap means to estimate the mean height of the population and to assess the variability of your estimate.

 


 

Implementation in Python

To illustrate the power of bootstrap sampling, let’s calculate a 95% confidence interval for the mean height of students in a school using Python. We will break down the process into clear steps.

Step 1: Import Necessary Libraries

First, we need to import the necessary libraries. We’ll use `numpy` for numerical operations and `matplotlib` for visualization.
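A minimal version of this step might look like the following:

```python
import numpy as np               # numerical operations
import matplotlib.pyplot as plt  # visualization
```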

 

 

Step 2: Create the Original Sample

We will create a sample dataset of heights. In a real-world scenario, this would be your collected data.
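For example, a sample of 30 student heights could be simulated like this (the mean and spread below are made-up values used purely for illustration):

```python
np.random.seed(42)  # make the example reproducible

# Hypothetical original sample: heights (in cm) of 30 students
heights = np.random.normal(loc=165, scale=10, size=30)
```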

 

 

Step 3: Define the Bootstrap Function

We define a function that generates bootstrap samples and calculates the mean for each sample. 
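A sketch of such a function is shown below; the function name `bootstrap_sample_means` is an illustrative choice, and the variable names follow the descriptions listed after the code.

```python
def bootstrap_sample_means(data, n_iterations=1000):
    bootstrap_means = []  # stores the mean of each bootstrap sample
    n_size = len(data)    # each bootstrap sample matches the original sample's size
    for _ in range(n_iterations):
        # Draw a bootstrap sample: select elements with replacement
        sample = np.random.choice(data, size=n_size, replace=True)
        sample_mean = np.mean(sample)  # mean of this bootstrap sample
        bootstrap_means.append(sample_mean)
    return bootstrap_means
```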

 

 

  • data: The original sample. 
  • n_iterations: Number of bootstrap samples to generate. 
  • bootstrap_means: List that stores the mean of each bootstrap sample. 
  • n_size: The size of each bootstrap sample, which is the same as the original sample’s size. 
  • np.random.choice: Randomly selects elements from the original sample with replacement to create a bootstrap sample. 
  • sample_mean: Mean of the bootstrap sample.

 

Explore the use of Gini Index and Entropy in data analytics

 

Step 4: Generate Bootstrap Samples

We use the function to generate 1,000 bootstrap samples and calculate the mean for each.
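Continuing the sketch from the previous steps (the `heights` array and `bootstrap_sample_means` function come from the earlier illustrative snippets):

```python
# Generate 1,000 bootstrap sample means from the original sample of heights
bootstrap_means = bootstrap_sample_means(heights, n_iterations=1000)
```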

 

 

Step 5: Calculate the Confidence Interval

We calculate the 95% confidence interval from the bootstrap means.
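Using the bootstrap means from the previous step, one way to compute the interval is:

```python
# 95% confidence interval: the 2.5th and 97.5th percentiles of the bootstrap means
lower, upper = np.percentile(bootstrap_means, [2.5, 97.5])
print(f"95% confidence interval for the mean height: ({lower:.1f} cm, {upper:.1f} cm)")
```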

 

 

  • np.percentile: Computes the specified percentile (2.5th and 97.5th) of the bootstrap means to determine the confidence interval.

Step 6: Visualize the Bootstrap Means

Finally, we can visualize the distribution of bootstrap means and the confidence interval. 
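A simple version of this plot, following on from the earlier steps, could be:

```python
# Histogram of the bootstrap means with the confidence interval marked
plt.hist(bootstrap_means, bins=30, edgecolor="black")
plt.axvline(lower, color="red", linestyle="--", label="2.5th percentile")
plt.axvline(upper, color="red", linestyle="--", label="97.5th percentile")
plt.xlabel("Bootstrap mean height (cm)")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```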

 

 

  • plt.hist: Plots the histogram of bootstrap means. 
  • plt.axvline: Draws vertical lines for the confidence interval.

By following these steps, you can use bootstrap sampling to estimate the mean height of a population and assess the variability of your estimate. This method is simple yet powerful, making it a valuable tool in statistical analysis and data science.

 

Read about ensemble methods in machine learning

 

Applications of Bootstrap Sampling

Bootstrap sampling is widely used across various fields, including the following:

Economics

Bootstrap sampling is a versatile tool in economics. It excels in handling non-normal data, commonly found in economic datasets. Key applications include constructing confidence intervals for complex estimators, performing hypothesis tests without parametric assumptions, evaluating model performance, and assessing financial risk.

For instance, economists use bootstrap to estimate income inequality measures, analyze macroeconomic time series, and evaluate the impact of economic policies. The technique is also used to estimate economic indicators, such as inflation rates or GDP growth, where traditional methods might be inadequate.

Medicine

Bootstrap sampling is applied in medicine to analyze clinical trial data, estimate treatment effects, and assess diagnostic test accuracy. It helps in constructing confidence intervals for treatment effects, evaluating the performance of different diagnostic tests, and identifying potential confounders.

Bootstrap can be used to estimate survival probabilities in survival analysis and to assess the reliability of medical imaging techniques. It is also suitable to assess the reliability of clinical trial results, especially when sample sizes are small or the data is not normally distributed.

Machine Learning

In machine learning, bootstrapping is used to estimate model uncertainty, improve model generalization, and select optimal hyperparameters. It aids in tasks like constructing confidence intervals for model predictions, assessing the stability of machine learning models, and performing feature selection.

Bootstrap can create multiple bootstrap samples for training and evaluating different models, helping to identify the best-performing model and prevent overfitting. For instance, it can evaluate the performance of predictive models through techniques like bootstrapped cross-validation.

Ecology

Ecologists utilize bootstrap sampling to estimate population parameters, assess species diversity, and analyze ecological relationships. It helps in constructing confidence intervals for population means, medians, or quantiles, estimating species richness, and evaluating the impact of environmental factors on ecological communities.

Bootstrap is also employed in community ecology to compare species diversity between different habitats or time periods.

 


 

Advantages and Disadvantages

Advantages:

  • Non-parametric method: It makes no assumptions about the underlying distribution of the data, making it highly versatile for various types of datasets.
  • Flexibility: It can be used with a wide range of statistics and datasets, including complex measures like regression coefficients and other model parameters.
  • Simplicity: It is conceptually straightforward and easy to implement with modern computational tools, making it accessible even for those with basic statistical knowledge.

Disadvantages:

  • Computationally intensive: It requires many resamples, which can be computationally expensive, especially with large datasets.
  • Not always accurate: It may not perform well with very small sample sizes or highly skewed data. The quality of the bootstrap estimates depends on how representative the original sample is of the population.
  • Outlier sensitivity: Because sampling is done with replacement, outliers can appear multiple times in bootstrap samples, potentially biasing the estimated statistics.

 


 

To Sum it Up 

Bootstrap sampling is a powerful tool for data analysis, offering flexibility and practicality in a wide range of applications. By repeatedly resampling from your dataset and calculating the desired statistic, you can gain insights into the variability and reliability of your estimates, even when traditional methods fall short.

Whether you’re working in economics, medicine, machine learning, or ecology, understanding and utilizing bootstrap sampling can enhance your analytical capabilities and lead to more robust conclusions.

August 14, 2024

In data science and machine learning, decision trees are powerful models for both classification and regression tasks. They follow a top-down greedy approach to select the best feature for each split. Two fundamental metrics determine the best split at each node – Gini Index and Entropy.

This blog will explore what these metrics are, and how they are used with the help of an example.

 


 

What is the Gini Index?

The Gini Index is a measure of impurity (non-homogeneity) widely used in decision trees. It captures the probability of misclassifying a randomly chosen element from the dataset: the greater the value of the Gini Index, the greater the chance of misclassification.

Formula and Calculation

The Gini Index is calculated using the formula:

Gini(t) = 1 − Σj [ p( j | t ) ]²

where p( j | t ) is the relative frequency of class j at node t.

  • The maximum value is (1 – 1/n) indicating that n classes are equally distributed.
  • The minimum value is 0 indicating that all records belong to a single class.

Example

Consider the following dataset.

 

ID Color (Feature 1) Size (Feature 2) Target (3 Classes)
1 Red Big Apple
2 Red Big Apple
3 Red Small Grape
4 Yellow Big Banana
5 Yellow Small Grape
6 Red Big Apple
7 Yellow Small Grape
8 Red Small Grape
9 Yellow Big Banana
10 Yellow Big Banana

 

This is also the initial root node of the decision tree, with the Gini Index as:

Gini = 1 − [ (3/10)² + (3/10)² + (4/10)² ] = 1 − 0.34 = 0.66

This result shows that the root node is highly impure: the Gini Index is close to its maximum of 1 − 1/3 ≈ 0.67, since the records are almost equally distributed among the three output classes.
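This calculation is easy to reproduce in Python; the helper function below is an illustrative sketch:

```python
def gini(counts):
    """Gini impurity of a node, given the class counts at that node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Root node of the fruit dataset: 3 Apples, 3 Bananas, 4 Grapes
print(round(gini([3, 3, 4]), 2))  # 0.66
```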

 


 

Gini Split

The Gini Split determines the best feature to use for splitting at each node. It is calculated as a weighted sum of the Gini indices of the sub-nodes created by the split. The feature with the lowest Gini Split value is selected for splitting the node.

Formula and Calculation

The Gini Split is calculated using the formula:

Gini Split = Σi ( ni / n ) × Gini(i)

where

  • ni represents the number of records at child/sub-node i.
  • n represents the number of records at node p (parent-node).

Example

Using the same dataset, we will determine which feature to use to perform the next split.

  • For the feature “Color”, there are two sub-nodes as there are two unique values to split the data with:

 

Gini(Color = Red) = 1 − [ (3/5)² + (2/5)² ] = 0.48
Gini(Color = Yellow) = 1 − [ (3/5)² + (2/5)² ] = 0.48

Gini Split (Color) = (5/10)(0.48) + (5/10)(0.48) = 0.48

 

  • For the feature “Size”, the case is similar to that of “Color”, i.e., there are also two sub-nodes when we split the data using “Size”:

Gini(Size = Big) = 1 − [ (3/6)² + (3/6)² ] = 0.5
Gini(Size = Small) = 1 − [ (4/4)² ] = 0

Gini Split (Size) = (6/10)(0.5) + (4/10)(0) = 0.3

 

Since the Gini Split for the feature “Size” (0.3) is lower than that for “Color” (0.48), “Size” is the best feature to select for this split.

What is Entropy?

Entropy is another measure of impurity, and it is used to quantify the state of disorder, randomness, or uncertainty within a set of data. In the context of decision trees, like the Gini Index, it helps in determining how a node should be split to result in sub-nodes that are as pure (homogenous) as possible.

Formula and Calculation

The Entropy of a node is calculated using the formula:

Entropy(t) = − Σj p( j | t ) × log2 [ p( j | t ) ]

where p( j | t ) is the relative frequency of class j at node t.

  • The maximum value is log2(n) which indicates high uncertainty i.e., n classes are equally distributed.
  • The minimum value is 0 which indicates low uncertainty i.e., all records belong to a single class.

 

Explore the key boosting algorithms in ML and their applications

 

Example

Using the same dataset and table as discussed in the example of the Gini Index, we can calculate the Entropy (impurity) of the root node as:

Entropy = − [ (3/10) log2(3/10) + (3/10) log2(3/10) + (4/10) log2(4/10) ] ≈ 1.571

This mirrors the result obtained in the Gini Index example: the Entropy is close to its maximum of log2(3) ≈ 1.585, so the root node has near-maximum impurity.
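The same check in Python, using an illustrative helper function:

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given the class counts at that node."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Root node of the fruit dataset: 3 Apples, 3 Bananas, 4 Grapes
print(round(entropy([3, 3, 4]), 3))  # ≈ 1.571
```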

Information Gain

Information Gain’s objective is similar to that of the Gini Split – it aims to determine the best feature for splitting the data at each node. It does this by calculating the reduction in entropy after a node is split into sub-nodes using a particular feature. The feature with the highest information gain is chosen for the node.

Formula and Calculation

The Information Gain is calculated using the formula:

Information Gain = Entropy(Parent Node) – Average Entropy(Children)

where

Average Entropy(Children) = Σi ( ni / n ) × Entropy(i)

  • ni represents the number of records at child/sub-node i.
  • n represents the number of records at the parent node.

Example

Using the same dataset, we will determine which feature to use to perform the next split:

  • For the feature “Color”

Entropy(Color = Red) = − [ (3/5) log2(3/5) + (2/5) log2(2/5) ] ≈ 0.971
Entropy(Color = Yellow) = − [ (3/5) log2(3/5) + (2/5) log2(2/5) ] ≈ 0.971

Information Gain (Color) = 1.571 − [ (5/10)(0.971) + (5/10)(0.971) ] ≈ 0.60

 

  • For feature “Size”:

Entropy(Size = Big) = − [ (3/6) log2(3/6) + (3/6) log2(3/6) ] = 1.0
Entropy(Size = Small) = 0

Information Gain (Size) = 1.571 − [ (6/10)(1.0) + (4/10)(0) ] ≈ 0.97

 

Since the Information Gain of the split using the feature “Size” (≈ 0.97) is higher than that of “Color” (≈ 0.60), “Size” is the best feature to select at this node for splitting.

Gini Index vs. Entropy

Both metrics are used to determine the best splits in decision trees, but they have some differences:

  • The Gini Index is computationally simpler and faster to calculate because it avoids logarithms.
  • Entropy considers the distribution of data more comprehensively, but it can be more computationally intensive because it is a logarithmic measure.

Use Cases

  • The Gini Index is often preferred in practical implementations of decision trees due to its simplicity and speed.
  • Entropy is more commonly used in theoretical discussions and algorithms like C4.5 and ID3.

 


 

Applications in Machine Learning

Decision Trees

Gini Index and Entropy are used widely in decision tree algorithms to select the best feature for splitting the data at each node/level of the decision tree. This helps improve accuracy by selecting and creating more homogeneous and pure sub-nodes.

Random Forests

Random forest algorithms, which are ensembles of decision trees, also use these metrics to improve accuracy and reduce overfitting by determining optimal splits across different trees.

Feature Selection

Both metrics also help in feature selection as they help identify features that provide the most impurity reduction, or in other words, the most information gain, which leads to more efficient and effective models.

 

Learn more about the different ensemble methods in machine learning

 

Practical Examples

  1. Spam Detection
  2. Customer Segmentation
  3. Medical Diagnosis
  4. And many more

The Final Word

Understanding the Gini Index and Entropy metrics is crucial for data scientists and anyone working with decision trees and related algorithms in machine learning. These metrics aid in creating splits that lead to more accurate and efficient models by selecting the optimal feature for splitting at each node.

 


 

While the Gini Index is often preferred in practice due to its simplicity and speed, Entropy provides a more detailed understanding of the data distribution. Choosing the appropriate metric depends on the specific requirements and details of your problem and machine learning task.

August 9, 2024

Data Analysis Expressions (DAX) is a language used in Analysis Services, Power BI, and Power Pivot in Excel. DAX formulas include functions, operators, and values to perform advanced calculations and queries on data in related tables and columns in tabular data models. 

 The Basics of DAX for Data Analysis 

DAX is a powerful language that can be used to create dynamic and informative reports that can help you make better decisions. By understanding the basics of Data Analysis Expressions, you can: 

  • Perform advanced calculations on data 
  • Create dynamic filters and calculations 
  • Create measures that can be used in reports 
  • Build tabular data models 

Creating DAX Tables, Columns, and Measures 

Data Analysis Expression tables are similar to Excel tables, but they can contain calculated columns and measures. Calculated columns are formulas that are applied to all rows in a column, while measures are formulas that are calculated based on data in multiple columns. 

To create a DAX table, right-click on the Tables pane and select New Table. In the Create Table dialog box, enter a name for the table and select the columns that you want to include. 

To create a calculated column, right-click on the Columns pane and select New Calculated Column. In the Create Calculated Column dialog box, enter a name for the column and type in the formula that you want to use.

To create a measure, right-click on the Measures pane and select New Measure. In the Create Measure dialog box, enter a name for the measure and type in the formula that you want to use. 
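For example, two simple measures might look like the following sketch (the 'Sales' table and its column names are hypothetical):

```dax
// Measure: total of the Amount column, evaluated in the current filter context
Total Sales = SUM ( Sales[Amount] )

// Measure: total sales restricted to one region using CALCULATE and FILTER
West Region Sales =
CALCULATE ( [Total Sales], FILTER ( Sales, Sales[Region] = "West" ) )
```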

Executing DAX Operators 

Data Analysis Expressions operators are used to perform calculations on data. Some common DAX operators include: 

  • Arithmetic operators: These operators are used to perform basic arithmetic operations, such as addition, subtraction, multiplication, and division. 
  • Comparison operators: These operators are used to compare two values and return a Boolean value (true or false). 
  • Logical operators: These operators are used to combine Boolean values and return a Boolean value. 
  • Text operators: These operators are used to manipulate text strings. 

Read more –> Data Analysis Roadmap 101: A step-by-step guide

Discussing Basic Math & Statistical Functions 

DAX includes a wide variety of mathematical and statistical functions that can be used to perform calculations on data. Some common mathematical and statistical functions include: 

  • SUM: This function returns the sum of all values in a column or range. 
  • AVERAGE: This function returns the average of all values in a column or range. 
  • COUNT: This function returns the number of non-empty values in a column or range. 
  • MAX: This function returns the maximum value in a column or range. 
  • MIN: This function returns the minimum value in a column or range. 

Implementing Date & Time Functions 

Data Analysis Expressions includes many date and time functions that can be used to manipulate date and time data. Some common date and time functions include: 

  • DATEADD: This function shifts a date or set of dates by a specified number of days, months, quarters, or years. 
  • DATEDIFF: This function returns the number of days, months, years, or hours between two dates. 
  • TODAY: This function returns the current date. 
  • NOW: This function returns the current date and time. 

Using Text Functions 

DAX includes several text functions that can be used to manipulate text data. Some common text functions include: 

  • LEFT: This function returns the leftmost characters of a string. 
  • RIGHT: This function returns the rightmost characters of a string. 
  • MID: This function returns a substring from a string. 
  • LEN: This function returns the length of a string. 
  • TRIM: This function removes leading and trailing spaces from a string. 

Using calculate & filter functions 

Data Analysis Expressions includes several calculate and filter functions that can be used to create dynamic calculations and filters. Some common calculate and filter functions include: 

  • CALCULATE: This function allows you to create dynamic calculations that are based on the current context. 
  • FILTER: This function allows you to filter data based on a condition. 

Summing up Data Analysis Expressions (DAX) 

Data Analysis Expressions is a powerful language that can be used to perform advanced calculations and queries on data in Analysis Services, Power BI, and Power Pivot in Excel. By understanding the basics of DAX, you can create dynamic and informative reports that can help you make better decisions. 

July 21, 2023

Many people who operate internet businesses find the concept of big data to be rather unclear. They are aware that it exists, and they have been told that it may be helpful, but they do not know how to make it relevant to their company’s operations. 

Using small amounts of data at first is the most effective strategy to begin a big data revolution. There is a need for meaningful data and insights in every single company organization, regardless of size.

Big data plays a crucial role in getting to know your target audience and your customers’ preferences; it even enables you to predict their requirements. The right data has to be presented understandably and assessed thoroughly, and with its help a corporate organization can accomplish a variety of objectives. 

 


 

Nowadays, you can choose from a plethora of Big Data organizations. However, selecting a firm that can provide Big Data services heavily depends on the requirements that you have.

Big data companies in the USA not only provide corporations with frameworks, computing facilities, and pre-packaged tools, but they also assist businesses in scaling with cloud-based big data solutions. They help organizations determine their big data strategy and provide consulting services on how to improve company performance by revealing the potential of data. 

The big data revolution has the potential to open up many new opportunities for business expansion. It offers the below ideas. 

 

Competence in certain areas

You can be a start-up company with an idea or an established company with a defined solution roadmap. The primary focus of your efforts should be directed toward identifying the appropriate business that can materialize either your concept or the POC. The amount of expertise that the data engineers have, as well as the technological foundation they come from, should be the top priorities when selecting a firm. 

Development team 

Getting your development team and the Big Data service provider on the same page is one of the many benefits of forming a partnership with a Big Data service provider. These individuals have to be imaginative and forward-thinking, in a position to comprehend your requirements and to be able to provide even more advantageous choices.

You may be able to assemble the most talented group of people, but the collaboration won’t bear fruit until everyone on the team shares your perspective on the project. After you have determined that the team members’ hard talents meet your criteria, you may find that it is necessary to examine the soft skills that they possess. 

 

Cost and placement considerations 

The geographical location of the organization and the total cost of the project are two other elements that might affect the software development process. For instance, you may decide to go with in-house development services, but keep in mind that these kinds of services are almost usually more expensive.

It’s possible that rather than getting the complete team, you’ll wind up with only two or three engineers who can work within your financial constraints. But why should one pay extra for a lower-quality result? When outsourcing your development team, choose a nation that is located in a time zone that is most convenient for you. 

Feedback 

In today’s business world, feedback is the most important factor in determining which organizations come out on top. Find out what other people think about the firm you’d like to associate with so that you can avoid any unpleasant surprises. Online reviews and similar resources will be of great help in reaching a conclusion.

 

What role does big data play in businesses across different industries?

Among the most prominent sectors now using big data solutions are the retail and financial sectors, followed by e-commerce, manufacturing, and telecommunications. When it comes to streamlining their operations and better managing their data flow, business owners are increasingly investing in big data solutions. Big data solutions are becoming more popular among vendors as a means of improving supply chain management. 

  • In the financial industry, it can be used to detect fraud, manage risk, and identify new market opportunities.
  • In the retail industry, it can be used to analyze consumer behavior and preferences, leading to more targeted marketing strategies and improved customer experiences.
  • In the manufacturing industry, it can be used to optimize supply chain management and improve operational efficiency.
  • In the energy industry, it can be used to monitor and manage power grids, leading to more reliable and efficient energy distribution.
  • In the transportation industry, it can be used to optimize routes, reduce congestion, and improve safety.


Bottom line to the big data revolution

Big data, which refers to extensive volumes of historical data, facilitates the identification of important patterns and the formation of more sound judgments. Big data is affecting our marketing strategy as well as affecting the way we operate at this point. Big data analytics are being put to use by governments, businesses, research institutions, IT subcontractors, and teams to delve more deeply into the mountains of data and, as a result, come to more informed conclusions.

 

Written by Vipul Bhaibav

May 8, 2023

The COVID-19 pandemic threw businesses into uncharted waters. Suddenly, digital transformation was more important than ever, and companies had to pivot quickly or risk extinction. And the humble QR code – once dismissed as a relic of the past – became an unlikely hero in this story. 

QR tech’s versatility and convenience allowed businesses, both large and small, to stay afloat amid challenging circumstances and even inspired some impressive growth along the way. But the real magic happened when data analytics was added to the mix. 


You see, when QR codes were paired with data analytics, companies could see the impact of their actions in real time. They were able to track customer engagement, spot trends, and gain precious new insights into their customers’ preferences. This newfound knowledge enabled companies to create superior strategies, refine their campaigns, and more accurately target their audience.

The result? Faster growth that’s both measurable and sustainable. Read on to find out how you, too, can use data analytics and QR codes to supercharge your business growth. 

Why use QR codes to track data? 

Did you ever put in a lot of effort and time to craft the perfect marketing campaign only to be left wondering how effective it was? How many people viewed it, how many responded, and what was the return on investment?  

Before, tracking offline campaigns’ MROI (Marketing Return on Investment) was an inconvenient and time-consuming process. Businesses used to rely on coupon codes and traditional media or surveys to measure campaign success.

For example, say you put up a billboard ad. Now without any coupon codes or asking people how they found out about you, it was almost impossible to know if someone had even seen the ad, let alone acted on it. But the game changed when data tracking enabled QR codes came in.

Adding these nifty pieces of technology to your offline campaigns allows you to collect valuable data and track customer behavior. All the customers have to do is scan your code, which will take them to a webpage or a landing page of your choosing. In the process, you’ll capture not only first-party data from your audience but also valuable insights into the success of your campaigns. 

For instance, if you have installed the same billboard campaign in two different locations, a QR code analytics dashboard can help you compare the results to determine which one is more effective. Say 2000 people scanned the code in location A, while only 500 scanned it in location B. That’s valuable intel you can use to adjust your strategy and ensure all your offline campaigns perform at their best. 

How does data analytics fit in the picture? 

Once you’ve employed QR codes and started tracking your campaigns, it’s time to play your trump card – analytics. 

Extracting wisdom from your data is what turns your campaigns from good to great. Analytics tools can help you dig deep into the numbers, find correlations, and uncover insights to help you optimize your campaigns and boost conversions. 

For example, using trackable codes, you can find out the number of scans. However, adding analytics tools to the mix can reveal how long users interacted with the content after scanning your code, what locations yielded the most scans, and more.

This transforms your data from merely informative to actionable. And arming yourself with these kinds of powerful insights will go a long way in helping you make smarter decisions and accelerate your growth. 

Getting started with QR code analytics 

Ready to start leveraging the power of QR codes and analytics? Here’s a step-by-step guide to getting started: 

Step 1: Evaluate QR codes’ suitability for your strategy 

Before you begin, ask yourself if a QR code project is actually in line with your current resource capacity and target audience. If you’re trying to target a tech-savvy group of millennials who lead busy lives, QR codes could be the perfect solution. But they may not be the best choice if you’re aiming for an older demographic that may struggle with technology.

Plus, keep in mind that you’ll also need dedicated resources to continually track and manage your project and the data it’ll yield. As such, make certain you have the right resource support lined up before diving in. 

Step 2: Get yourself a solid QR code generator 

The next step is to find a reliable and feature-rich QR code generator. A good one should allow you to customize your codes, track scans, and easily integrate with your other analytics tools. The internet is full of such QR code generators, so do your research, read reviews, and pick the best one that meets your needs. 

Step 3: Choose your QR code type 

QR codes come in two major types:  

  1. Static QR codes – They are the most basic type of code that points to a single, predefined destination URL and don’t allow for any data tracking.  
  2. Dynamic/ trackable QR codes – These are the codes we’ve been talking about. They are far more sophisticated as they allow you to track and measure scans, collect vital data points, and even change the destination URL on the fly if needed.

For the purpose of analytics, you will have to opt for dynamic/trackable QR codes. 

Step 4: Design and generate QR code

Now that you have your QR code generator and type sorted, you can start with the QR code creation process. Depending on the generator you picked, this can take a few clicks or involve a bit of coding.

But be sure to dress up your QR codes with your brand colors and an enticing call to action to encourage scans. A visually appealing code will be far more likely to pique people’s interest and encourage them to take action than a dull, black-and-white one. 

Step 5: Download and print out the QR code 

Once you have your code ready, save it and print it out. But before printing a big batch of copies to use in your campaigns, test your code to ensure it works as expected. Scan it from different devices and check the destination URL to verify everything is good before moving ahead with your campaign. 

Step 6: Start analyzing the data 

Most good QR code generators come with built-in analytics or allow you to integrate with popular tools like Google Analytics. So you can either go with the integrated analytics or hook up your code with your analytics tool of choice. 

Industry use cases using QR codes and analytics 

QR codes, when combined with analytics tools, can be incredibly powerful in driving business growth. Let’s look at some use cases that demonstrate the potential of this dynamic duo. 

1. Real estate – Real estate agents can use QR codes to give potential buyers a virtual tour of their properties. This tech can also be used to provide comprehensive information about the property, like floor plans and features. Furthermore, with analytics integration, real estate agents can track how many people access property information and view demographic data to better understand each property’s target market.  

2. Coaching/Mentorship – A coaching business can use QR codes to target potential clients and measure the effectiveness of their coaching materials. For example, coaches could test different versions of their materials and track how many people scanned each QR code to determine which version resonated best with their target audience. The statistics derived from this method let them refine their materials, boost engagement, and build a higher-end curriculum. 

3. Retail – QR codes are an excellent way for retailers to engage customers in their stores and get detailed metrics on their shopping behavior. Retailers can create links to product pages, add loyalty programs and coupons, or offer discounts on future purchases. All these activities can be tracked using analytics, so retailers can understand customer preferences and tailor their promotions accordingly. 

QR codes and data analytics: A dynamic partnership

No longer confined to the sidelines, QR codes have been propelled to the forefront of modern marketing and technology. By combining these codes with analytics tools, you can unlock boundless opportunities to streamline processes, engage customers, and drive your business further. This tried-and-true, powerful partnership is a reliable way to move your company digitally forward.

Written by Ahmad Benny

March 22, 2023

Data analytics is the driving force behind innovation, and staying ahead of the curve has never been more critical. That is why we have scoured the landscape to bring you the crème de la crème of data analytics conferences in 2023.  

Data analytics conferences provide an essential platform for professionals and enthusiasts to stay current on the latest developments and trends in the field. By attending these conferences, attendees can gain new insights, and enhance their skills in data analytics.

These events bring together experts, practitioners, and thought leaders from various industries and backgrounds to share their experiences and best practices. Such conferences also provide an opportunity to network with peers and make new connections.  

Data analytics conferences to look forward to

In 2023, there will be several conferences dedicated to this field, where experts from around the world will come together to share their knowledge and insights. In this blog, we will dive into the top data analytics conferences of 2023 that data professionals and enthusiasts should add to their calendars.

Top Data Analytics Conferences in 2023
      Top Data Analytics Conferences in 2023 – Data Science Dojo

Strata Data Conference   

The Strata Data Conference is one of the largest and most comprehensive data conferences in the world. It is organized by O’Reilly Media and will take place in San Francisco, CA in 2023. It is a leading event in data analytics and technology, focusing on data and AI to drive business value and innovation. The conference brings together professionals from various industries, including finance, healthcare, retail, and technology, to discuss the latest trends, challenges, and solutions in the field of data analytics.   

Leading data scientists, engineers, and executives from across the world will cover a wide range of topics, including artificial intelligence, machine learning, big data, cloud computing, and more. 

Big Data & Analytics Innovation Summit  

The Big Data & Analytics Innovation Summit is a premier conference that brings together experts from various industries to discuss the latest trends, challenges, and solutions in data analytics. The conference will take place in London, England in 2023 and will feature keynotes, panel discussions, and hands-on workshops focused on topics such as machine learning, artificial intelligence, data management, and more.  

Attendees can attend keynote speeches, technical sessions, and interactive workshops, where they can learn about the latest technologies and techniques for collecting, processing, and analyzing big data to drive business outcomes and make informed decisions. The connection between the Big Data & Analytics Innovation Summit and data analytics lies in its focus on the importance of big data and the impact it has on businesses and industries. 

Predictive Analytics World   

Predictive Analytics World is among the leading data analytics conferences that focus specifically on the applications of predictive analytics. It will take place in Las Vegas, NV in 2023. Attendees will learn about the latest trends, technologies, and solutions in predictive analytics and gain valuable insights into this field’s future.  

At PAW, attendees can learn about the latest advances in predictive analytics, including techniques for data collection, data preprocessing, model selection, and model evaluation. For the unversed, predictive analytics is a branch of data analytics that uses historical data, statistical algorithms, and machine learning techniques to make predictions about future events. 

AI World Conference & Expo   

The AI World Conference & Expo is a leading conference focused on artificial intelligence and its applications in various industries. The conference will take place in Boston, MA in 2023 and will feature keynote speeches, panel discussions, and hands-on workshops from leading AI experts, business leaders, and data scientists. Attendees will learn about the latest trends, technologies, and solutions in AI and gain valuable insights into this field’s future.  

The connection between the AI World Conference & Expo and data analytics lies in its focus on the role of AI and data in driving business value and innovation. The event offers attendees an opportunity to learn from leading experts in the field, connect with other professionals, and stay informed about the most recent developments in AI and data analytics. 

Data Science Summit   

Last on the data analytics conference list we have the Data Science Summit. It is a premier conference focused on data science applications in various industries. The summit will take place in San Diego, CA in 2023 and feature keynote speeches, panel discussions, and hands-on workshops from leading data scientists, business leaders, and industry experts. Attendees will learn about the latest trends, technologies, and solutions in data science and gain valuable insights into this field’s future.  

Special mention – Future of Data and AI

Hosted by Data Science Dojo, Future of Data and AI is an unparalleled opportunity to connect with top industry leaders and stay at the forefront of the latest advancements. Featuring 20+ industry experts, the two-day virtual conference offers a diverse range of expert-level knowledge and training opportunities.

Don’t worry if you missed out on the Future of Data and AI Conference! You can still catch all the amazing insights and knowledge from industry experts by watching the conference on YouTube.

Bottom line

In conclusion, the world of data analytics is constantly evolving, and it is crucial for professionals to stay updated on the latest trends and developments in the field. Attending conferences is one of the most effective ways to stay ahead of the game and enhance your knowledge and skills.  

The 2023 data analytics conferences listed in this blog are some of the most highly regarded events in the industry, bringing together experts and practitioners from all over the world. Whether you are a seasoned data analyst, a new entrant in the field, or simply looking to expand your network, these conferences offer a wealth of opportunities to learn, network, and grow.

So, start planning and get ready to attend one of these top conferences in 2023 to stay ahead of the curve. 

 

March 2, 2023

Have you ever heard a story told with numbers? That’s the magic of data storytelling, and it’s taking the world by storm. If you’re ready to captivate your audience with compelling data narratives, you’ve come to the right place.

what is data storytelling
What is data storytelling – Detailed analysis by Data Science Dojo

 

Everyone loves data—it’s the reason your organization is able to make informed decisions on a regular basis. With new tools and technologies becoming available every day, it’s easy for businesses to access the data they need rather than search for it. This also means that audiences increasingly expect data to be presented in a clear, understandable way.

The rise of social media has allowed people to share their experiences with a product or service without having to look them up first. As a result, businesses are being forced to present data in a more refined way than ever before if they want to retain customers, generate leads, and build brand loyalty. 

What is data storytelling? 

Data storytelling is the process of using data to communicate the story behind the numbers, and it’s becoming more relevant as more people learn how to use data to make decisions. A good data story allows a business to dive deeper into the numbers and delve into the context that led to those numbers.

For example, let’s say you’re running a health and wellness clinic. A patient walks into your clinic, and you diagnose that they have low energy, are stressed out, and have an overall feeling of being unwell. Based on this, you recommend a course of treatment that addresses the symptoms of stress and low energy. This data story could then be used to inform the next steps that you recommend for the patient.   

Why is data storytelling important in three main fields: Finance, healthcare, and education? 

  • Finance – With online banking and payment systems becoming more common, the demand for data storytelling is greater than ever. Data can be used to improve the customer journey, improve the way your organization interacts with customers, and provide personalized services.
  • Healthcare – With medical information becoming increasingly complex, data storytelling is more important than ever.
  • Education – With more and more schools turning to data to provide personalized education, data storytelling can help drive outcomes for students. 

 

The importance of authenticity in data storytelling 

Authenticity is key when it comes to data storytelling. The best way to understand the importance of authenticity is to think about two different data stories. Imagine that in one, you present the data in a way that is true to the numbers, but the context is lost in translation. In the other example, you present the data in a more simplified way that reflects the situation, but it also leaves out key details. This is the key difference between data storytelling that is authentic and data storytelling that is not.

As you can imagine, the data story that is not authentic will be much less impactful than the first example. It may help someone, but it likely won’t have the positive impact that the first example did. The key to authenticity is to be true to the facts, but also to be honest with your readers. You want to tell a story that reflects the data, but you also want to tell a story that is true to the context of the data. 

 

Register for our conference ‘Future of Data and AI’ to learn from esteemed leaders and discover how to put data storytelling into action. Don’t miss out!

 

How to do data storytelling in action?

Start by gathering all the relevant data together. This could include figures from products, services, and your business as a whole; it could also include data about how your customers are currently using your product or service. Once you have your data together, you’ll want to begin to create a content outline.

This outline should be broken down into paragraphs and sentences that will help you tell your story more clearly. Invest time into creating an outline that is thorough but also easy for others to follow.

Next, you’ll want to begin to find visual representations of your data. This could be images, infographics, charts, or graphs. The visuals you choose should help you to tell your story more clearly.
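If your visuals end up being charts, even a short script can carry a data story. The sketch below uses invented sign-up numbers and annotates the single point you want your audience to remember, which is usually more effective than showing every series at once:

import matplotlib.pyplot as plt

# Invented monthly sign-up numbers used purely to illustrate the idea
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
signups = [120, 135, 128, 210, 232, 251]
x = range(len(months))

fig, ax = plt.subplots()
ax.plot(x, signups, marker="o", color="steelblue")
ax.set_xticks(list(x))
ax.set_xticklabels(months)

# Call out the moment the story turns on: a hypothetical April campaign launch
ax.annotate("Referral campaign launched",
            xy=(3, 210), xytext=(0.5, 235),
            arrowprops=dict(arrowstyle="->"))

ax.set_title("Sign-ups jumped after the April referral campaign")
ax.set_ylabel("New sign-ups per month")
plt.show()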

Once you’ve finished your visual content, you’ll want to polish off your data stories. The last step in data storytelling is to write your stories and descriptions. This will give you an opportunity to add more detail to your visual content and polish off your message. 

 

The need for strategizing before you start 

While the process of data storytelling is fairly straightforward, the best way to begin is by strategizing. This is a key step because it will help you to create a content outline that is thorough, complete, and engaging. You’ll also want to strategize by thinking about who you are writing your stories for. This could be a specific section of your audience, or it could be a wider audience. Once you’ve identified your audience, you’ll want to think about what you want to achieve.

This will help you to create a content outline that is targeted and specific. Next, think about what your content outline will look like and what it will include, so that the outline ends up detailed, engaging, and complete. 

Planning your content outline 

There are a few key things that you’ll want to include in your content outline. These include audience pain points, a detailed overview of your content, and your strategy. With your strategy, you’ll want to think about how you plan to present your data. This will help you to create a content outline that is focused, and it will also help you to make sure that you stay on track. 

Watch this video to know what your data tells you

 

Researching your audience and understanding their pain points 

With the planning complete, start researching your audience. Understanding who they are and what pain points they face will make your content outline more focused, detailed, and honest, and will help you make sure nothing important is left out.

With those pain points in mind, think about how best to present your information so that the outline stays detailed, engaging, and focused. 

The final step in creating your content outline is to decide where you’re going to publish your data stories. If you’re going to publish your content on a website, you should think about the layout that you want to use. You’ll want to think about the amount of text and the number of images you want to include. 

 

The need for strategizing before you start 

Just as a good story always has a beginning, a middle, and an end, so does a good data story. The best way to start is by gathering all the relevant data together and creating a content outline. Once you’ve done this, you can begin to strategize and make your content more engaging, and you’ll want to make sure that you stay on track. 

 

Mastering your message: How to create a winning content outline

When planning your content outline, start with your strategy; it will help you stay on track. Then think about your audience’s pain points, which will keep you focused on the most important aspects of your content.  

 

Researching your audience and understanding their pain points 

The final thing you’ll want to do before creating your content outline is to research your audience. Knowing who they are and which pain points matter to them keeps your content focused on the aspects that matter most.  

By approaching data storytelling in this way, you should be able to create engaging, detailed, and targeted content. 

 

The bottom line: What we’ve learned

In conclusion, data storytelling is a powerful tool that allows businesses to communicate complex data in a simple, engaging, and impactful way. It can help to inform and persuade customers, generate leads, and drive outcomes for students. Authenticity is a key component of effective data storytelling, and it’s important to be true to the facts while also being honest with your readers.

With careful planning and a thorough content outline, anyone can create powerful and effective data stories that engage and inspire their audience. As data continues to play an increasingly important role in decision-making across a wide range of industries, mastering the art of data storytelling is an essential skill for businesses and individuals alike.

February 21, 2023

In this blog, we will discuss what Data Analytics RFP is and the five steps involved in the data analytics RFP process.


December 1, 2022

In this article, we’re going to talk about how data analytics can help your business generate more leads and why you should rely on data when making decisions regarding a digital marketing strategy. 

Some people believe that marketing is about creativity – unique and interesting campaigns, quirky content, and beautiful imagery. Contrary to their beliefs, data analytics is what actually powers marketing – creativity is simply a way to accomplish the goals determined by analytics. 

Now, if you’re still not sure how you can use data analytics to generate more leads, here are our top 10 suggestions. 

1. Know how your audience behaves

Most businesses have an idea or two about who their target audience is. But having an idea or two is not good enough if you want to grow your business significantly – you need to be absolutely sure who your audience is and how they behave when they come to your website. 

Now, the best way to do that is to analyze the website data.  

You can tell quite a lot by simply looking at the right numbers. For instance, if you want to know whether the users can easily find the information they’re looking for, keep track of how much time they spend on a certain webpage. If they leave the webpage as soon as it loads, they probably didn’t find what they needed. 
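To make this concrete, here is a small hypothetical pandas sketch (the column names and values are invented) that estimates time on page and a simple bounce rate from an exported page-view log:

import pandas as pd

# Hypothetical page-view export: one row per visit to a page
views = pd.DataFrame({
    "page": ["/pricing", "/pricing", "/blog/intro", "/pricing", "/blog/intro"],
    "seconds_on_page": [4, 95, 180, 6, 150],
})

# Flag visits that left almost immediately, then summarize per page
views["bounced"] = views["seconds_on_page"] < 10
report = views.groupby("page").agg(
    visits=("seconds_on_page", "count"),
    median_seconds=("seconds_on_page", "median"),
    bounce_rate=("bounced", "mean"),
)
print(report.sort_values("bounce_rate", ascending=False))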

We know that looking at spreadsheets is a bit boring, but you can easily obtain Power BI Certification and use Microsoft Power BI to make data visuals that are easy to understand and pleasing to the eye. 

 

Data analytics books
Books on Data Analytics – Compilation by Data Science Dojo

Read the top 12 data analytics books to learn more about it

 

2. Segment your audience

A great way to satisfy the needs of different subgroups within your target audience is to use audience segmentation. Using that, you can create multiple funnels for the users to move through instead of just one, thereby increasing your lead generation. 

Now, before you segment your audience, you need to have enough information about these subgroups so that you can divide them and identify their needs. Since you can’t individually interview users and ask them for the necessary information, you can use data analytics instead. 

Once you have that, it’s time to identify their pain points and address them differently for different subgroups, and voilà – you’ve got yourself more leads. 
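If you are curious what the segmentation step can look like in code, here is a minimal sketch using scikit-learn’s KMeans on invented per-user metrics; your real inputs and the number of segments will differ:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented per-user behaviour metrics exported from an analytics tool
users = pd.DataFrame({
    "visits_per_month": [2, 3, 25, 30, 4, 28, 1, 27],
    "avg_order_value":  [15, 20, 90, 110, 18, 95, 12, 105],
    "pages_per_visit":  [2, 3, 8, 9, 2, 7, 1, 8],
})

# Scale the features so no single metric dominates, then cluster into segments
scaled = StandardScaler().fit_transform(users)
users["segment"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(scaled)

# Inspect the average profile of each segment to name and target it
print(users.groupby("segment").mean())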

3. Use data analytics to improve buyer persona

Knowing your target audience is a must but identifying a buyer persona will take things to the next level. A buyer persona doesn’t only contain basic information about your customers. It goes deeper than that and tells you their exact age, gender, hobbies, location, and interests.  

It’s like describing a specific person instead of a group of people. 

Of course, not all your customers will fit that description to a T, but that’s not the point. The point is to have that one idea of a person (or maybe two or three buyer personas) in your mind when creating content for your business.  

buyer persona - Data analytics
Understanding buyer persona with the help of Data analytics  [Source: Freepik] 

 

4. Use predictive marketing 

While data analytics should absolutely be used in retrospectives, there’s another purpose for the information you obtain through analytics – predictive marketing. 

Predictive marketing is basically using big data to develop accurate forecasts of customers’ behavior. It uses complex machine-learning algorithms to build predictive models. 

A good example of how that works is Amazon’s landing page, which includes personalized recommendations.  

Amazon doesn’t only keep track of the user’s previous purchases, but also what they have clicked on in the past and the types of items they’ve shown interest in. By combining that with the season and time of purchase, it can make remarkably accurate recommendations. 
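Amazon’s system is far more sophisticated than anything we can show here, but as a toy-scale sketch, a propensity model can be as simple as a logistic regression over behavioural features (all names and numbers below are invented):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy behavioural features and outcomes; real systems use far richer signals
data = pd.DataFrame({
    "past_purchases":    [0, 1, 5, 2, 7, 0, 3, 6, 1, 8],
    "clicks_last_week":  [1, 2, 9, 3, 12, 0, 4, 10, 2, 15],
    "is_holiday_season": [0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    "bought":            [0, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

X, y = data.drop(columns="bought"), data["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# A simple propensity model: estimated probability that each user will buy
model = LogisticRegression().fit(X_train, y_train)
print("Purchase probabilities:", model.predict_proba(X_test)[:, 1].round(2))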

lead generation
Acquiring customers – Lead generation

 

If you’re curious to find out how data science works, we suggest that you enroll in the Data Science Bootcamp

 

5. Know where website traffic comes from 

Users come to your website from different places.  

Some have searched for it directly on Google, some have run into an interesting blog piece on your website, while others have seen your ad on Instagram. This means that the time and effort you put into optimizing your website and creating interesting content pay off. 

But imagine creating a YouTube ad that doesn’t bring much traffic – that doesn’t pay off at all. You’d then want to rework your campaign or redirect your efforts elsewhere.  

This is exactly why knowing where website traffic comes from is valuable. You don’t want to invest your time and money into something that doesn’t bring you any benefits. 

6. Understand which products work 

Most of the time, you can determine what your target audience will like and dislike. The more information you have about your target audience, the better you can satisfy their needs.  

But no one is perfect, and anyone can make a mistake. 

Heinz, a company known for producing ketchup and other food products, once released a new product: EZ Squirt ketchup in shades of purple, green, and blue. At first, the kids loved it, but this didn’t last for long. Six years later, Heinz halted production of these products. 

As you can see, even big and experienced companies flop sometimes. A good way to avoid that is by tracking which product pages have the least traffic and don’t sell well. 

7. Perform competitor analysis 

Keeping an eye on your competitors is never a bad idea. No matter how well you’re doing and how unique you are, others will try to surpass you and become better. 

The good news is that there are quite a few tools online that you can use for competitor analysis. SEMrush, for instance, can help you see what the competition is doing to get qualified leads so that you can use it to your advantage. 

Even if there isn’t a tool that does exactly what you need, you can always enroll in a Python for Data Science course and learn to build your own tools to track the data that drives your lead generation. 

competitor analysis - data analytics
Performing competitor analysis through data analytics [Source: Freepik] 

8. Nurture your leads

Nurturing your leads means developing a personalized relationship with your prospects at every stage of the sales funnel in order to get them to buy your products and become your customers. 

Because lead nurturing offers a personalized approach, you’ll need information about your leads: what is their title, role, industry, and similar info, depending on what your business does. Once you have that, you can provide them with the relevant content that will help them decide to buy your products and build brand loyalty along the way. 

This is something B2B lead generation companies can help you with if you’re hesitant to do it on your own.  

9. Gain more customers

Having insight into your conversion rate, churn rate, sources of website traffic, and other relevant data will ultimately lead to more customers. For instance, your sales team will be able to calculate which sources convert most effectively and prepare resources before running a campaign. 
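As a quick illustration (with invented data), the source-level conversion calculation mentioned above takes only a few lines of pandas once you export sessions from your analytics tool:

import pandas as pd

# Hypothetical sessions export: one row per visit, with its traffic source
sessions = pd.DataFrame({
    "source":    ["organic", "paid", "email", "organic", "paid", "email", "organic", "paid"],
    "converted": [1, 0, 1, 0, 1, 1, 0, 0],
})

# Conversion rate and volume per source show where to focus the next campaign
by_source = sessions.groupby("source").agg(
    visits=("converted", "count"),
    conversion_rate=("converted", "mean"),
).sort_values("conversion_rate", ascending=False)
print(by_source)

# Rough resource planning: expected conversions if a campaign drives 1,000 paid visits
print("Expected paid conversions:", round(1000 * by_source.loc["paid", "conversion_rate"]))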

The more information you have, the better you’ll perform, and this is exactly why Data Science for Business is important – you’ll be able to see the bigger picture and make better decisions. 

data analysts performing data analysis of customer's data
Data analysts performing data analysis of customer’s data

10. Avoid significant losses 

Finally, data can help you avoid certain losses by halting the launch of a product that won’t do well. 

For instance, you can use a “Coming soon” page to research the market and see if your customers are interested in a new product you planned on launching. If enough people show interest, you can start producing; if not, you won’t waste your money on something that was bound to fail. 

 

Conclusion:

Applications of data analytics go beyond simple data analysis, especially for advanced analytics projects. The majority of the labor is done up front in the data collection, integration, and preparation stages, followed by the creation, testing, and revision of analytical models to make sure they give reliable findings.

Data engineers, who build data pipelines and aid in the preparation of data sets for analysis, are frequently included within analytics teams in addition to data scientists and other data analysts.

 

Written by Ava-Mae

November 17, 2022

Data Science Dojo is offering Metabase for FREE on Azure Marketplace, packaged as a web-accessible Metabase: Open Source server. 

Metabase query
Metabase query

 

Introduction 

Organizations often adopt strategies that enhance the productivity of their sales channels. One such strategy is to use prior business data to identify key patterns around a product and then make decisions accordingly. However, this work is hectic, costly, and requires domain experts. Metabase bridges that skills gap: it provides marketing and business professionals with an easy-to-use query builder notebook to extract the required data and visualize it without any SQL coding, in just a few clicks. 

What is Metabase, and what are its questions? 

Metabase is an open-source business intelligence framework that provides a web interface to import data from diverse databases and then analyze and visualize it in a few clicks. Metabase’s methodology is based on questions and their answers, which form the foundation of everything else it provides. 

           

A question is any kind of query that you want to run on your data. Once you have specified the query steps in the notebook editor, you can visualize the query results. You can then save the question for reuse and turn it into a data model for business-specific purposes. 

Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to become an expert in data science and analytics. 

Challenges for businesses  

For businesses that lack expert analysts, engineers, and a substantial IT department, it is costly and time-consuming to hire new domain experts, or for managers to learn to code themselves and then explore and visualize data. On top of that, few pre-existing applications provide diverse data source connections, which is another challenge. 

In this regard, a straightforward interactive tool that even newcomers can adopt immediately, and thus get the job done, is the ideal solution. 

Data analytics with Metabase  

Metabase’s concept is based on questions, which are essentially queries, and data models (special saved questions). It provides an easy-to-use notebook through which users can gather raw data, filter it, join tables, summarize information, and apply other customizations without any need for SQL coding.

Users can select columns and dimensions from tables, create various visualizations, and embed them in different sub-dashboards. Metabase is frequently used for pitching business proposals to executive decision-makers because the visualizations are very simple to produce from raw data. 
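Purely for illustration, the sketch below shows in pandas roughly what a Metabase question does behind the scenes with clicks instead of code: filter the raw table, join it to another, and summarize. The table and column names here are invented.

import pandas as pd

# Invented tables standing in for the database tables Metabase would connect to
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "product_id": [10, 11, 10, 12],
    "amount": [25.0, 40.0, 30.0, 15.0],
    "status": ["paid", "paid", "refunded", "paid"],
})
products = pd.DataFrame({
    "product_id": [10, 11, 12],
    "category": ["books", "toys", "books"],
})

# The same steps a question performs: filter, join, then summarize
paid = orders[orders["status"] == "paid"]
joined = paid.merge(products, on="product_id")
summary = joined.groupby("category")["amount"].sum()
print(summary)  # ready to be plotted as a bar chart on a dashboard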

 

visualization on sample data
Figure 1: A visualization on sample data 

 

Query builder notebook 
Figure 2:  Query builder notebook

 

Major characteristics 

  • Metabase delivers a notebook that enables users to select data, join other tables, filter, and perform other operations just by clicking on options instead of writing a SQL query 
  • For complex queries, users can also use a built-in optimized SQL editor 
  • The choice of various data sources, such as PostgreSQL, MongoDB, Spark SQL, and Druid, makes Metabase flexible and adaptable 
  • Under the Metabase admin dashboard, users can troubleshoot the logs for different tasks and jobs 
  • Supports public sharing, enabling admins to create publicly viewable links for questions and dashboards  

What Data Science Dojo has for you  

The Metabase instance packaged by Data Science Dojo serves as an open-source, easy-to-use web interface for data analytics without the burden of installation. It contains numerous pre-designed visualization types waiting for your data.

It has a query builder used to create questions (customized queries) in a few clicks. In our service, users can also use an in-browser SQL editor for complex queries. Anyone who wants to measure the impact of their product from raw business data can use this tool. 

Features included in this offer:  

  • A rich web interface running Metabase: Open Source 
  • A no-code query building notebook editor 
  • In-browser optimized SQL editor for complex queries 
  • Beautiful interactive visualizations 
  • Ability to create data models 
  • Email configuration and Slack support 
  • Shareability feature 
  • Easy specification for metrics and segments 
  • Feature to download query results in CSV, XLSX and JSON format 

Our instance supports the following major databases: 

  • Druid 
  • PostgreSQL 
  • MySQL 
  • SQL Server 
  • Amazon Redshift 
  • Big Query 
  • Snowflake 
  • Google Analytics 
  • H2 
  • MongoDB 
  • Presto 
  • Spark SQL 
  • SQLite 

Conclusion  

Metabase is business intelligence software that is especially useful for marketing and product managers. By making it possible to share analytics with various teams within an enterprise, Metabase makes it simple to create reports and collaborate on projects. Because this offer runs on Microsoft cloud services, its responsiveness and processing speed are better than a traditional desktop environment. 

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Metabase server dedicated specifically to data analytics operations on Azure Marketplace. Hurry up and install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!  

Click on the button below to head over to the Azure Marketplace and deploy Metabase for FREE by clicking on “Get it now”. 

CTA - Try now

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

November 5, 2022

From customer relationship management to tracking analytics, marketing analytics tools are important in the modern world. Learn how to make the most of these tools.

What do you usually find in a toolbox? A hammer, screwdriver, nails, tape measure? If you’re building a bird house, these would be perfect for you, but what if you’re creating a marketing campaign? What tools do you want at your disposal? It’s okay if you can’t come up with any. We’re here to help.

Industry’s leading marketing analytics tools

These days marketing is all about data. Whether it’s a click on an email or an abandoned cart on Amazon, marketers are using data to better cater to the needs of the consumer. To analyze and use this data, marketers have a toolbox of their own.

So what are some of these tools and what do they offer? Here, at Data Science Dojo, we’ve come up with our top 5 marketing analytics tools for success:

Customer relationship management platform (CRM)

A CRM is a tool used for managing everything there is to know about the customer. It can track where and when a consumer visits your site, track their interactions on your site, and create profiles for leads. A few examples of CRMs are:

HubSpot logo
HubSpot logo

HubSpot, along with the two others listed above, took the idea of a CRM and made it into an all-inclusive marketing resort. Along with the traditional CRM uses, HubSpot can be used to:

  • Manage social media
  • Send mass email campaigns
  • View traffic, campaign, and customer analytics
  • Associate emails, blogs, and social media posts to specific marketing campaigns
  • Create workflows and sequences
  • Connect to your other analytics tools such as Google Analytics, Facebook Ads, YouTube, and Slack.

HubSpot rounds this out with reporting that allows its users to analyze what is and isn’t working.

This is just a brief description revealing the tip of the iceberg of what HubSpot does. If you want to see below the water line, visit its website.

Search software

Search engine optimization (SEO) is the process of improving a website’s ranking on search engines. It’s how you can find everything you have ever searched for on Google. Search software helps marketers analyze how to best optimize websites for potential consumers to find.

A few search software companies are:

I would love to describe each one of the above businesses, but I only have experience with Moz. Moz focuses on a “less invasive way (of marketing) where customers are earned rather than bought”.

Its entire business is focused on upgrading your SEO. Moz offers 9 different services through its Moz Pro toolkit:

Moz Pro Services
Moz Pro Services

I love Moz Keyword Explorer. This is the tool I use to check different variations of titles, keywords, phrases, and hashtags. It gives four different scores, which you can see in the photo below.

Moz Keyword Explorer
Moz Keyword Explorer

Now, there’s not enough data to show the average monthly volume for my name, but, according to Moz, it wouldn’t be that difficult to rank higher than my competitors, and people have a high likelihood of clicking. However, the Priority score shows that my name is not a “sweet spot” of high volume, low difficulty, and high CTR. In conclusion, using my name as a keyword to optimize the Data Science Dojo Blog isn’t the best idea.

 

Read more about marketing analytics in this blog

 

Web analytics service

We can’t talk about marketing tools and not mention web analytics services. These are some of the most important pieces of equipment in the marketer’s toolbox. Google Analytics (GA) is a free web analytics service that integrates your company’s website data into a meticulously organized dashboard.

I wouldn’t say GA is the be-all and end-all piece of equipment, and there are many different services and tools out there; however, it can’t be denied that Google Analytics is a great tool to integrate into your company’s marketing strategy.

Some similar Web Analytics Services include:

Google analytics logo
Google Analytics logo

Some of the analytics you’ll be able to understand are

  • Real-time data – Who’s on your site right now? Where are the users coming from? What pages are they looking at?
  • Audience Information – Where do your users live, age range, interests, gender, new or returning visitor, etc.?
  • Acquisition – Where did they come from (Organic, Direct, Paid Ads, Referrals, Campaigns)? What day/time do they land on your website? What was the final URL they visited before leaving? You can also link to any Google Ads campaigns you have running.
  • Behavior – What is the path people take to convert? How is your site speed? What events took place (Contact form submission, newsletter signup, social media share)?
  • Conversions – Are you attributing conversions by first touch, last touch, linear, or decay? (A small sketch of these attribution models follows this list.)
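For readers curious how those attribution models differ, here is a small illustrative sketch (not how Google Analytics computes attribution internally) that splits credit for one conversion across an invented three-touch journey:

# One converting user's journey, oldest touchpoint first (invented example)
touchpoints = ["Paid Ad", "Email", "Organic Search"]

def attribute(touches, model="linear", decay=0.5):
    """Split 100% of conversion credit across touchpoints for a few common models."""
    n = len(touches)
    if model == "first_touch":
        weights = [1.0] + [0.0] * (n - 1)
    elif model == "last_touch":
        weights = [0.0] * (n - 1) + [1.0]
    elif model == "linear":
        weights = [1.0 / n] * n
    elif model == "time_decay":
        raw = [decay ** (n - 1 - i) for i in range(n)]  # later touches weigh more
        weights = [w / sum(raw) for w in raw]
    else:
        raise ValueError(f"unknown model: {model}")
    return dict(zip(touches, [round(w, 2) for w in weights]))

for m in ["first_touch", "last_touch", "linear", "time_decay"]:
    print(m, attribute(touchpoints, m))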

Understanding these metrics is amazingly effective in narrowing down how users interact with your website.

Another way to integrate Google Analytics into your marketing strategy is by setting up goals. Goals are set up to track specific actions taken on your website. For example, you can set up goals to track purchases, newsletter signups, video plays, live chat, and social media shares.

If you want a more in-depth look at what Google Analytics can offer, you can learn the basics through their Analytics Academy.

marketing analytics tool
Google analysis feedback

Analysis and feedback platform (A&F)

A&Fs are another great piece of equipment in the marketer’s toolbox, more specifically for looking at how users interact with your website. One such A&F, HotJar, does this in the form of heatmaps and recordings. HotJar’s integrated tracking pixel allows you to see how far users scroll on your website and what items were clicked the most.

You can also watch recordings of a user’s experience and even filter down to the URL of the page you wish to track (e.g., /checkout/). This allows you to capture the user’s unique journey until they make a purchase. For each recording, you can view audience information such as geographical location, country, browser, operating system, and a documented list of user actions.

In addition to UX/UI metrics, you can also integrate polls and forms on your website for more intricate data about your users.

As a marketing manager, these tools help to visualize all of my data in ways that a pivot table can’t display. And while I am a genuine user of these platforms, I must admit that it’s not the tool that makes the man, it’s the strategy. To get the most use out of these platforms, you will need to understand what business problem you are trying to solve and what metrics are important to you.

There is a lot of information that these dashboards can provide you. However, it’s up to you to filter through the noise. Not every accessible metric applies to you, so you will need to decide what is the most important for your marketing plan.

A few similar platforms include:

Experimentation platforms

Experimentation platforms are software for experimenting with different variations of a sample. Their purpose is to run A/B tests, something HubSpot also does, but these platforms dive head-first into them.

Experimentation Platforms
Experimentation Platforms

Where HubSpot only tests versions A and B, experimentation platforms let you test versions A, B, C, D, E, F, and so on. They don’t just test the different versions; they also test different audiences and how each responds to every test version. Searching “definition experimentation platforms” is a good place to start in understanding what they are. I can tell you they are a dream come true for marketers who love to get their hands dirty in behavioral targeting.
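Under the hood, deciding whether variant B really beats variant A is a statistical question. As a rough sketch with invented numbers, a two-proportion z-test on the conversion rates looks like this:

from math import sqrt
from scipy.stats import norm

# Invented A/B test results: conversions out of visitors per variant
conv_a, n_a = 120, 2000   # variant A: 6.00% conversion
conv_b, n_b = 151, 2000   # variant B: 7.55% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                  # two-sided test

print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p-value = {p_value:.3f}")
# A small p-value (commonly < 0.05) suggests the lift is unlikely to be random noise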

Optimizely is one such example of a company offering in-depth A/B testing. Optimizely’s goal is to let you spend more time experimenting with the customer experience and less time wading through statistics to learn what works and what doesn’t. If you are unsure what to do, you can test it with Optimizely.

Using companies like Optimizely or Split is just one way to experiment. Many name-brand companies like Netflix, Microsoft, eBay, and Uber have built their own experimentation platforms for internal use.

Not perfect

No toolbox is perfect, and everyone’s will be different. One piece of advice I can give is to always understand the problem before deciding which tool is best to solve it. You wouldn’t use a hammer to do a job where a drill would be more effective, right?

Top 5 marketing analytics tools for success | Data Science Dojo

You could, it just wouldn’t be the most efficient method. The same concept goes for marketing. Understanding the problem will help you know which tools should be in your toolbox.

August 18, 2022

In this blog, we discuss the applications of AI in healthcare and take a deep dive into one of them: prognosis prediction. We build a simple prognosis detector, with an explanation of each step. Our predictor takes symptoms as inputs and predicts the prognosis using a classification model.

Introduction to prognosis prediction

The role of data science and AI (Artificial Intelligence) in the healthcare industry is not limited to predicting and tracking disease spread. It has now become possible to learn the likely causes of whatever symptoms you are experiencing, such as cough, fever, and body pain, and address them at home without visiting a doctor. Platforms like Ada Health and Sensely can assess the symptoms you report.

If you have not already, please go back and read AI & Healthcare. If you have already read it, you will remember I wrote, “Predictive analysis, using historical data to find patterns and predict future outcomes can find the correlation between symptoms, patients’ habits, and diseases to derive meaningful predictions from the data.”

This tutorial will do just that: Predict the prognosis with symptoms as our input.

Exercise: Predict prognosis using symptoms as input

Prognosis Prediction Process
Prognosis Prediction Process

Import required modules

Let us start by importing all the libraries needed in the exercise. We import pandas, as we will be reading CSV files into DataFrames, and NumPy, which we will later use to reshape the input features for prediction. We import LabelEncoder from the sklearn.preprocessing package; LabelEncoder is a utility class for converting non-numerical labels into numerical labels. In this exercise, we predict prognosis from symptoms, so it is a classification task.

We are using RandomForestClassifier, which consists of many individual decision trees that work as an ensemble. Learn more about RandomForestClassifier by enrolling in our Data Science Bootcamp, a remote instructor-led Bootcamp. We also require the classification_report and accuracy_score metrics to measure the model’s performance.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

Read CSV files

We are using this Kaggle dataset for our exercise.

It has two files, Training.csv and Testing.csv, containing training and testing data, respectively. You can download these files by going to the data section of the above link.

Read the CSV files into DataFrames using the pandas read_csv() function. It reads a comma-separated file at the supplied path into a DataFrame. It takes a file path as a parameter, so provide the right file path where you have downloaded the files.

train = pd.read_csv("File path of Training.csv")
test = pd.read_csv("File path of Testing.csv")

Check samples of the training dataset

To check what the data looks like, let us grab the first five rows of the DataFrame using the head() function.

We have 133 columns. We want to predict prognosis, so that is our target variable. The remaining 132 columns are symptoms that a person experiences. The classifier will use these 132 symptom features to predict the prognosis.

train.head()
data frame
Head Data frame

The training set holds 4920 samples and 133 columns, as shown by the shape attribute of the DataFrame.

train.shape
Output
(4920, 133)

Descriptive analysis

A description of the data can be seen using the DataFrame’s describe() method. We see no missing values in our DataFrame, as the count for every feature is 4920, which is also the number of samples. We also see that all the numeric features are binary, taking a value of either 1 or 0.

train.describe()
Describe data frame
Describe data frame
train.describe(include=['object'])
data frame objects
Describe data frame objects

Our target variable, prognosis, has 41 unique values, so there are 41 diseases into which the model can classify an input. There are 120 samples for each unique prognosis in our dataset.

train['prognosis'].value_counts()
Prognosis Column
Value Count of Prognosis Column

There are 132 symptoms in our dataset. Their names can be listed with this code block:

possible_symptoms = train[train.columns.difference(['prognosis'])].columns
print(list(possible_symptoms))

Output
['abdominal_pain', 'abnormal_menstruation', 'acidity', 'acute_liver_failure', 'altered_sensorium', 'anxiety', 'back_pain', 'belly_pain', 'blackheads', 'bladder_discomfort', 'blister', 'blood_in_sputum', 'bloody_stool', 'blurred_and_distorted_vision', 'breathlessness', 'brittle_nails', 'bruising', 'burning_micturition', 'chest_pain', 'chills', 'cold_hands_and_feets', 'coma', 'congestion', 'constipation', 'continuous_feel_of_urine', 'continuous_sneezing', 'cough', 'cramps', 'dark_urine', 'dehydration', 'depression', 'diarrhoea', 'dischromic _patches', 'distention_of_abdomen', 'dizziness', 'drying_and_tingling_lips', 'enlarged_thyroid', 'excessive_hunger', 'extra_marital_contacts', 'family_history', 'fast_heart_rate', 'fatigue', 'fluid_overload', 'fluid_overload.1', 'foul_smell_of urine', 'headache', 'high_fever', 'hip_joint_pain', 'history_of_alcohol_consumption', 'increased_appetite', 'indigestion', 'inflammatory_nails', 'internal_itching', 'irregular_sugar_level', 'irritability', 'irritation_in_anus', 'itching', 'joint_pain', 'knee_pain', 'lack_of_concentration', 'lethargy', 'loss_of_appetite', 'loss_of_balance', 'loss_of_smell', 'malaise', 'mild_fever', 'mood_swings', 'movement_stiffness', 'mucoid_sputum', 'muscle_pain', 'muscle_wasting', 'muscle_weakness', 'nausea', 'neck_pain', 'nodal_skin_eruptions', 'obesity', 'pain_behind_the_eyes', 'pain_during_bowel_movements', 'pain_in_anal_region', 'painful_walking', 'palpitations', 'passage_of_gases', 'patches_in_throat', 'phlegm', 'polyuria', 'prominent_veins_on_calf', 'puffy_face_and_eyes', 'pus_filled_pimples', 'receiving_blood_transfusion', 'receiving_unsterile_injections', 'red_sore_around_nose', 'red_spots_over_body', 'redness_of_eyes', 'restlessness', 'runny_nose', 'rusty_sputum', 'scurring', 'shivering', 'silver_like_dusting', 'sinus_pressure', 'skin_peeling', 'skin_rash', 'slurred_speech', 'small_dents_in_nails', 'spinning_movements', 'spotting_ urination', 'stiff_neck', 'stomach_bleeding', 'stomach_pain', 'sunken_eyes', 'sweating', 'swelled_lymph_nodes', 'swelling_joints', 'swelling_of_stomach', 'swollen_blood_vessels', 'swollen_extremeties', 'swollen_legs', 'throat_irritation', 'toxic_look_(typhos)', 'ulcers_on_tongue', 'unsteadiness', 'visual_disturbances', 'vomiting', 'watering_from_eyes', 'weakness_in_limbs', 'weakness_of_one_body_side', 'weight_gain', 'weight_loss', 'yellow_crust_ooze', 'yellow_urine', 'yellowing_of_eyes', 'yellowish_skin']

There are 41 unique prognoses in our dataset. Their names can be listed with this code block:

list(train['prognosis'].unique())
Output
['Fungal infection','Allergy','GERD','Chronic cholestasis','Drug Reaction','Peptic ulcer diseae','AIDS','Diabetes ','Gastroenteritis','Bronchial Asthma','Hypertension ','Migraine','Cervical spondylosis','Paralysis (brain hemorrhage)','Jaundice','Malaria','Chicken pox','Dengue','Typhoid','hepatitis A','Hepatitis B','Hepatitis C','Hepatitis D','Hepatitis E','Alcoholic hepatitis','Tuberculosis','Common Cold','Pneumonia','Dimorphic hemmorhoids(piles)','Heart attack','Varicose veins','Hypothyroidism','Hyperthyroidism','Hypoglycemia','Osteoarthristis','Arthritis','(vertigo) Paroymsal  Positional Vertigo','Acne','Urinary tract infection','Psoriasis','Impetigo']

Data visualization

new_df = train[train.columns.difference(['prognosis'])]
# Maximum number of symptoms present for a prognosis is 17
new_df.sum(axis=1).max()
# Minimum number of symptoms present for a prognosis is 3
new_df.sum(axis=1).min()
# Plot the 15 most frequent symptoms as a horizontal bar chart
series = new_df.sum(axis=0).nlargest(n=15)
pd.DataFrame(series, columns=["Occurrence"]).loc[::-1, :].plot(kind="barh")
bar chart
Horizontal bar chart for Occurrence column

Fatigue and vomiting are the symptoms most often seen.

Encode object prognosis

Our target variable is a categorical feature. Let us create an instance of LabelEncoder and fit it on the prognosis column of the train and test data. It will encode all possible categorical values as numerical values.

label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([train['prognosis'], test['prognosis']]))

This concludes the data preparation step. Now, we can move on to model training with this data.

Training and evaluating model

Let us train a RandomForestClassifier with the prepared data. We initialize the RandomForestClassifier, fit it on the features and encoded labels, and finally make predictions on our test data.

Because the encoder was already fitted on all prognosis values, we use its transform() method to encode the training and test labels, and we pass label_encoder.classes_ as the class names for the classification report.

random_forest = RandomForestClassifier()
random_forest.fit(train[train.columns.difference(['prognosis'])], label_encoder.transform(train['prognosis']))
y_pred = random_forest.predict(test[test.columns.difference(['prognosis'])])
y_true = label_encoder.transform(test['prognosis'])
print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=label_encoder.classes_))
Classification report
Classification report

Predict prognosis by taking symptoms as input

We have our model trained and ready to make predictions. We need a function that takes symptoms as input and predicts the prognosis as output. The function predict_prognosis() below does just that.

We take the input as a string of symptom names separated by spaces. We strip the string to remove leading and trailing spaces, then split it into a list of symptoms. We cannot feed this list to the model directly, because the model expects a vector of 1s and 0s marking the presence or absence of each of the 132 symptoms, so we build that vector first. Finally, with the features in the desired form, we predict and print the prognosis.

def predict_prognosis():
  print("List of possible Symptoms you can enter: ", list(train[train.columns.difference(['prognosis'])].columns))
  input_symptoms = list(input("\nEnter symptoms space separated: ").strip().split())
  print(input_symptoms)
  # Build the binary feature vector: 1 if the symptom was entered, 0 otherwise
  test_value = []
  for symptom in train[train.columns.difference(['prognosis'])].columns:
    if symptom in input_symptoms:
      test_value.append(1)
    else:
      test_value.append(0)
  # Predict only once the full 132-length vector is ready, then decode the label
  np_test = np.array(test_value).reshape(1, -1)
  encoded_label = random_forest.predict(np_test)
  predicted_label = label_encoder.inverse_transform(encoded_label)[0]
  print("Predicted Prognosis: ", predicted_label)
predict_prognosis()

Give input symptoms:

Effective prognosis prediction | Data Science Dojo

Predicted prognoses

Suppose we have the symptoms abdominal_pain, acidity, anxiety, and fatigue. To predict the prognosis, we enter the symptom names separated by spaces. The system will split the symptoms, transform them into the form the model expects, and finally output the prognosis.
Output prognosis
Output prognosis

Conclusion

To sum up, we discussed the applications of AI in healthcare, took a deep dive into one such application, prognosis prediction, and created a prognosis predictor with an explanation of each step. Finally, we tested our predictor by giving it input symptoms and getting the prognosis as output.

Full Code Available!

August 18, 2022

Learning data analytics is a challenge for beginners. Take your data analytics learning experience one step further with these twelve books. Explore a range of topics, from big data to artificial intelligence.

 

Data analytics books
Books on Data Analytics

Data Analytics Books

1. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking by Foster Provost and Tom Fawcett

This book is written by two globally esteemed data science experts who introduce their readers to the fundamental principles of data science and then dig deep into the important role data plays in business-related decision-making. They do a great job of demonstrating different techniques and ideas related to analytical thinking without getting into too many technicalities.

Through this book, you can not only begin to appreciate the importance of communication between business strategists and data scientists but can also discover how to approach business problems analytically to generate value.

2. The Data Science Design Manual (Texts in Computer Science) by Steven S. Skiena

To survive in a data-driven world, we need to adopt the skills necessary to analyze the datasets we acquire. Data science draws on statistics, data visualization, machine learning, and mathematical modeling, and in this book Skiena gives beginners an overview of this emerging discipline.

The second part of the book highlights the essential skills, knowledge, and principles required to collect, analyze and interpret data. This book leaves learners spellbound with its step-by-step guidance to develop an inside-out theoretical and practical understanding of data science.

The Data Science Design Manual is a thorough guide for learners eager to kick off their journey in data science. Lastly, Skiena adds real-world applications of data science, a wide range of exercises, Kaggle challenges, and, most interestingly, examples from the data science show The Quant Shop to excite learners. 

3. Data Analytics Made Accessible by Anil Maheshwari

Are you a data enthusiast looking to finally dip your toes in the field? Start with Data Analytics Made Accessible by Anil Maheshwari.  Get a sense of what data analytics is all about and how significant a role it plays in real-world scenarios with this informative, easy-to-follow read.

In fact, this book is considered such a vital resource that numerous universities across the globe have added it to their required textbooks list for their analytics courses. It sheds light on the relationship between business and data by talking at length about business intelligence, data mining, and data warehousing.  

4. Python for Data Analysis by Wes McKinney

Written by the main author of the Pandas library, Python for Data Analysis is a book that spells out the basics of manipulating, processing, cleaning, and crunching data in Python. It is a hands-on book that walks its readers through a broad set of real-world case studies and enables them to solve different types of data analysis problems. 

It introduces different data science tools in Python to the readers in order to get them started on loading, cleaning, transforming, merging, and reshaping data. It also walks you through creating informative visualizations using Matplotlib. 
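As a tiny, hypothetical taste of the workflow the book teaches, a few lines of pandas already cover loading, cleaning, merging, and reshaping data, plus a quick Matplotlib chart (the data here is made up):

import pandas as pd
import matplotlib.pyplot as plt

# Invented sales records with a missing value to clean up
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "revenue": [100, None, 130, 90],
})
regions = pd.DataFrame({"region": ["North", "South"], "manager": ["Ana", "Bo"]})

clean = sales.fillna({"revenue": 0})                                       # cleaning
merged = clean.merge(regions, on="region")                                 # merging
wide = merged.pivot(index="region", columns="quarter", values="revenue")   # reshaping

wide.plot(kind="bar", title="Revenue by region and quarter")
plt.show()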

5. Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier

This book is tailor-made for those who want to know the significance of data analytics across different industries. In this work, these two renowned domain experts bring the buzzword ‘big data’ under the limelight and try to dissect how it’s impacting our world and changing our lives, for better or for worse. 

It does not delve into the technical aspects of data science algorithms or applications, rather it’s more of a theoretical primer on what big data really is and how it’s becoming central to different walks of life. Apart from encouraging the readers to embrace this ground-breaking technological development, it also reminds them of the potential digital hazards it poses and how we can protect ourselves from them.

6. Business Unintelligence: Insight and Innovation beyond Analytics and Big Data by Barry Devlin

This book is great for someone who is looking to read through the past, present, and future of business intelligence. Highlighting the great successes and overlooked weaknesses of traditional business intelligence processes, Dr. Devlin delves into how analytics and big data have transformed the landscape of modern-day business intelligence. 

It identifies tried-and-tested business intelligence practices and provides insights into how the trinity of information, people, and process comes together to generate competitive advantage and drive business success in this rapidly advancing world. Furthermore, Dr. Devlin recommends several new models and frameworks that businesses and companies can employ for an even better tomorrow.

Join our Data Science Bootcamp today to start your career in the world of data.

7. Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic

Globally, our culture is visual: everything we consume, from art and advertisements to TV, is visual. Data visualization is the art of narrating stories with a purpose. In this book, Knaflic highlights the key points of effectively telling a story backed by data. The book journeys through the importance of situating your data story within a context, guides you toward the most suitable charts, graphs, and maps to spot trends and outliers, and discusses how to declutter and retain focus on the key points.

This book is a valuable addition for anyone eager to grasp the basic concepts of data communication. Once you finish reading the book, you will gain a general understanding of several graphs that add a spark to the stories you create from data. Knaflic instills in you the knowledge to tell a story with an impact.

Learn about lead generation through data analytics in this blog

10 ways data analytics can help you generate more leads 

 

8. Developing Analytic Talent: Becoming a Data Scientist by Vincent Granville

Granville leveraged his lifetime’s experience of working with big data, business analytics, and predictive modeling to compose a “handbook” on data science and data scientists. In this book, you will find learnings that are rarely found in traditional statistical, programming, or computer science textbooks as the author writes from experiential knowledge rather than theoretical. 

Moreover, this book covers all the most valuable information to help you excel in your career as a data scientist. It talks about how data science came to the fore in recent times and became indispensable for organizations using big data. 

The book is divided into three components:

  • What is data science and how does it relate to other disciplines
  • Data science technical applications along with tutorials and case studies
  • Career resources for future and practicing data scientists

This data science book also helps decision-makers build a better analytics team by informing them about specialized solutions and their uses. Lastly, if you plan to launch a startup around data science, giving this book a read will give you an edge, with quick ideas drawn from Granville's 20+ years of industry experience.

9. Learning R: A Step-By-Step Function Guide to Data Analysis by Richard Cotton

Non-technical users are often scared off by programming languages, and this book is an asset for all non-technical learners of the R language. The author compiles a list of tools that make access to statistical models much easier. Step by step, the book introduces the reader to R without digging into the details of statistics and data modeling.

The first part of this data science book introduces you to the basics of the R programming language, discussing data structures, the data environment, looping constructs, and packages. If you are already familiar with the basics, you can begin with the second part of the book to learn the steps involved in data analysis, like loading, cleaning, and transforming data, and to gain more insight into exploratory analysis and modeling.

10. Data Analytics: A Comprehensive Beginner’s Guide to Learn About the Realms of Data Analytics From A-Z by Benjamin Smith

Smith pens down the path to learning data analytics from A to Z in easy-to-understand language. The book offers simplified explanations of challenging topics like sophisticated algorithms and even the Euclidean Square Estimate. At no point while reading this book will you feel overwhelmed by technical jargon or menacing formulas.

The author quickly introduces each topic, explains a real-world use case, and only then brings in the technical jargon. Smith demonstrates almost every practical topic with Python, enabling learners to recreate the projects by themselves. The handy tips and practical exercises are a bonus.

11. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing, and Presenting Data by EMC Education Services

With the implementation of Big Data analytics, you can explore greater avenues to investigate data and generate authentic outcomes that support businesses. It enables deeper insights that were previously out of reach for most practitioners. Readers of Data Science and Big Data Analytics work on integrating real-time feeds and queries of structured and unstructured data. As you progress through the chapters, you will open new paths to insight and innovation.

In this book, EMC Education Services introduces some of the key techniques and tools practitioners recommend for Big Data analytics. Mastering these tools opens up the opportunity to become an active contributor to challenging Big Data analytics projects. This data science book consists of twelve chapters, taking the reader from the basics of Big Data analytics to a range of advanced analytical methods, including classification, regression analysis, clustering, time series, and text analysis.

All these lessons are meant to assist multiple stakeholders, including business and data analysts looking to add Big Data analytics skills to their portfolio; database professionals and managers of business intelligence, analytics, or Big Data groups looking to enrich their analytic skills; and college graduates investigating data science as a career field.

12. An Introduction to Statistical Methods and Data Analysis by Lyman Ott

Lyman Ott discusses the powerful techniques used in statistical analysis for both advanced undergraduate and graduate students. The book helps students find solutions to problems encountered in research projects. Not only does it greatly benefit students in decision-making, but it also trains them to become critical readers of statistical analyses. The book has gained positive feedback from learners at different levels because it presumes little or no mathematical background, explaining complex topics in an easy-to-understand way.

Ott covers introductory statistics extensively in the first 11 chapters. The book also targets students who struggle with their undergraduate capstone courses. Lastly, it provides research studies and examples that connect the statistical concepts to data analysis problems.

Upgrade your data science skillset with our Python for Data Science training!

August 17, 2022

From customer relationship management to tracking analytics, marketing tools are important in the modern world. Learn how to make the most of these tools.

What do you normally find in a toolbox? A hammer, screwdriver, nails, tape measure? If you’re building a bird house, these would be perfect for you, but what if you’re creating a marketing campaign? What tools do you want at your disposal? It’s okay if you can’t come up with any. We’re here to help.

These days marketing is all about data. Whether it’s a click on an email or an abandoned cart on Amazon, marketers are using data to better cater to the needs of the consumer. In order to analyze and use this data, marketers have a toolbox of their own.

So what are some of these tools and what do they offer? Here, at Data Science Dojo, we’ve come up with our top 5 marketing analytics tools for success.

Customer Relationship Management Platform (CRM)

A CRM is a tool used for managing everything there is to know about the customer. It can track where and when a consumer visits your site, record their interactions on your site, and create profiles for leads. A few examples of CRMs are:

HubSpot

HubSpot, along with other popular CRMs, took the idea of a CRM and made it into an all-inclusive marketing resort. Along with the traditional CRM uses, HubSpot can be used to:

  • Manage social media
  • Send mass email campaigns
  • View traffic, campaign, and customer analytics
  • Associate emails, blogs, and social media posts with specific marketing campaigns
  • Create workflows and sequences
  • Connect to your other analytics tools, such as Google Analytics, Facebook Ads, Amazon seller competitor analysis, YouTube, and Slack.

 

HubSpot continues its effectiveness by creating reports allowing its users to analyze what is and isn’t working.

This is just a brief description revealing the tip of the iceberg of what HubSpot does. If you want to see below the waterline, visit its website.

Search Software

Search engine optimization (SEO) is the process of improving a website's ranking on search engines. It's how you are able to find everything you have ever searched for on Google. Search software helps marketers analyze how best to optimize websites so that potential consumers can find them.

There are a number of search software companies out there.

 

I would love to describe each of them, but I only have experience with Moz. Moz focuses on a

“less invasive way (of marketing) where customers are earned rather than bought”.

In fact, its entire business is focused on upgrading your SEO. Moz offers 9 different services through its Moz Pro toolkit:

Moz Pro services

Personally, I love the Moz Keyword Explorer. This is the tool I use to check different variations of titles, keywords, phrases, and hashtags. It gives four different scores, which you can see in the photo below.

keyword Search

Now, there’s not enough data to show the average monthly volume for my name, but, according to Moz, it wouldn’t be that difficult to rank higher than my competitors, people have a high likelihood of clicking, and the Priority explains that my name is not a “sweet spot” for high volume, low difficulty, and high CTR. In conclusion, using my name as a keyword to optimize the Data Science Dojo Blog probably isn’t the best idea.

Web Analytics Service

We can't talk about marketing tools and not mention web analytics services. These are some of the most important pieces of equipment in the marketer's toolbox. Google Analytics (GA) is a free web analytics service that integrates your company's website data into a neatly organized dashboard. I wouldn't say GA is the be-all and end-all piece of equipment, and there are many different services and tools out there; however, it can't be refuted that Google Analytics is a great tool to integrate into your company's marketing strategy.

There are several similar web analytics services available as well.


Some of the analytics you'll be able to understand are:

  • Real time data – Who’s on your site right now? Where are the users coming from? What pages are they looking at?
  • Audience Information – Where do your users live, age range, interests, gender, new or returning visitor, etc.?
  • Acquisition – Where did they come from (Organic, Direct, Paid Ads, Referrals, Campaigns)? What day/time did they land on your website? What was the final URL they visited before leaving? You can also link to any Google Ads campaigns you have running.
  • Behavior – What is the path people take to convert? How is your site speed? What events took place (Contact form submission, newsletter signup, social media share)?
  • Conversions – Are you attributing conversions by first touch, last touch, linear, or decay?

 

Understanding these metrics is very effective in narrowing down how users interact with your website.

Another way to integrate Google Analytics into your marketing strategy is by setting up goals. Goals are set up to track specific actions taken on your website. For example, you can set up goals to track purchases, newsletter signups, video plays, live chat, and social media shares.

If you want a more in-depth look at what Google Analytics can offer, you can learn the basics through their Analytics Academy.


Analysis and Feedback Platform (A&F)

A&Fs are another great piece of equipment in the marketer’s toolbox; more specifically for looking at how users are interacting on your website. One such A&F, HotJar, does this in the form of heatmaps and recordings. HotJar’s integrated tracking pixel allows you to see how far users scroll on your website and what items were clicked the most.

You can also watch recordings of a user’s experience and even filter down to the url of the page you wish to track, (i.e. /checkout/). This allows you to really capture the user’s unique journey until they make a purchase. For each recording, you can view audience information such as geographical location, country, browser, operating system, and a documented list of user actions.

In addition to UX/UI metrics, you can also integrate polls and forms on your website for more intricate data about your users.

As a marketing manager, I find these tools really help me visualize all of my data in ways that can't be displayed by a pivot table. And while I am a fervent user of these platforms, I must admit that it's not the tool that makes the man, it's the strategy.

To get the most use out of these platforms, you will need to understand what business problem you are trying to solve and what metrics are important to you. There is a lot of information that these dashboards can provide you. However, it's up to you to filter through the noise. Not every accessible metric is applicable to you, so you will need to decide what is most important for your marketing plan.

A few similar platforms are available as well.

Experimentation Platforms

Experimentation platforms are software for experimenting with different variations of a sample. Their purpose is to run A/B tests, something HubSpot also does, but these platforms dive headfirst into them.


Where HubSpot only tests versions A and B, experimentation platforms let you test versions A, B, C, D, E, F, etc. They don't just test the different versions; they also test different audiences and how each responds to each test version. Searching “definition experimentation platforms” is a good place to start in understanding what experimentation platforms are. I can tell you they are a dream come true for marketers who love to get their hands dirty in behavioral targeting.

Optimizely is one such example of a company offering in depth A/B testing. Optimizely’s goal is to let you spend more time experimenting with the customer experience and less time wading through statistics to learn what works and what doesn’t. If you are unsure what to do, you can test it with Optimizely.

Using companies like Optimizely or Split is just one way to experiment. Many name-brand companies like Netflix, Microsoft, eBay, and Uber have all built their own experimentation platforms to use internally.

Not a perfect marketing tool

No one toolbox is perfect, and everyone's is going to be different. One piece of advice I can give is to always understand the problem before deciding which tool is best to solve it. You wouldn't use a hammer to do a job that a drill would be more effective at, right?


You could, it just wouldn’t be the most efficient method. The same concept goes for marketing. Understanding the problem will help you know which tools should be in your toolbox.

June 14, 2022

HR analytics and employee churn rate prediction: a classification and regression tree applied to a company's HR data. This article explains how churn rate prediction can help overcome the trend of people resigning from companies.

People are expected to give their all – labor, passion, and time – to their jobs. But if their jobs don’t give back enough, they will leave. As have 4.5 million burned-out American employees who quit their jobs since November 2021 due to low satisfaction. Could their HRs have retained them if churn rate prediction identified those ready to leave?

HR analytics refers to the collection of employee data, its analysis, and reporting of actionable insights. Information from HR analytics can be used to:

  1. generalize standards for working conditions to avoid burnout
  2. assign projects that align with employees’ strengths for better performance
  3. launch initiatives that align with career aspirations for higher satisfaction
  4. evaluate performance to uncover sources of talent

So, corporations are using data to retain talented employees, increase employee satisfaction, boost company loyalty, predict churn rates, and reduce hiring and retention costs.

Churn rate prediction using machine learning

Classification and regression trees (CART) enable companies to characterize loyalty and identify who is likely to resign. Not only that, but the method also reveals the conditions that affect employees' loyalty and/or make them unsatisfied. So, in this analysis, we will not only conduct churn rate prediction but also identify the possible factors that pushed resigning employees over the edge.

When you perform CART, you can identify two paths: what makes an employee loyal, and what makes an employee leave. Each path has a set of attributes that leads to a greater sense of loyalty, as well as those that lead to higher dissatisfaction.

Then, each of these attributes is ranked in order of importance to know which has a greater influence on the employee’s decision to stay or to leave.  There are different solutions available in the market for HR analytics, but we will apply the CART algorithm using the R programming language.

This is a simulated dataset with several measures that can be used to predict which employees are at a risk to leave the company. Here, the CART algorithm unfolds actionable insights in the following steps:

  1. Business case
  2. Data exploration and preparation
  3. Split data into training and validation
  4. Develop an initial model and interpret two complete paths
  5. Identify important variables

You can follow the steps from this notebook to perform it on your device by clicking here.

1. Business case

In this case study, we will visualize two paths of attributes that affect loyalty and dissatisfaction among employees. The business case is formed around the question: Can we predict those employees who are likely to churn?

2. Data exploration and preparation

There are eight continuous variables and two categorical variables in the data set, which offers information about 14,999 employees. Continuous variables are those with numerical values, and categorical variables group things under category headers, like “department,” which can have values such as sales, marketing, consumer, operations, and so on.

 The variables are explained in the data dictionary below:

  1. satisfaction_level: Satisfaction rating of an employee's job
  2. last_evaluation: Rating between 0 and 1, received by an employee for their job performance during the last evaluation
  3. number_projects: Number of projects an employee is involved in
  4. average_monthly_hours: The average number of hours in a month spent by an employee at the office
  5. time_spend_company: Number of years spent in the company
  6. work_accident: 0 = no accident during the employee's stay, 1 = accident during the employee's stay
  7. promotion_last_5years: Number of promotions during the employee's stay
  8. resigned: 0 indicates the employee stayed in the company, 1 indicates the employee resigned from the company
  9. salary_grade: Salary earned by an employee (low, medium, or high)
  10. department: The department to which an employee belongs

We will plot the variables to explore:

Plotting No. of Employees and Frequency
  • Satisfaction level: Most employees are highly satisfied.
  • Last evaluation: Most employees are good performers with 75% of the data set being evaluated between 56%-87%.
  • Number of projects: most employees do a reasonable number of projects.
  • Average monthly hours: Most employees spend, fairly, a higher number of hours at work.
  • Time spent in the company: Fewer employees stay beyond 4 years.
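
If you want to reproduce histograms like the ones summarized above, here is a minimal R sketch. It assumes the data has been loaded into a data frame called hr_data with the column names from the data dictionary; the file name is a placeholder for your own copy.

```r
# Read the HR data; the file name here is a placeholder.
hr_data <- read.csv("HR_data.csv")

# Histograms of the continuous variables described above
par(mfrow = c(2, 3))
hist(hr_data$satisfaction_level,    main = "Satisfaction level",    xlab = "")
hist(hr_data$last_evaluation,       main = "Last evaluation",       xlab = "")
hist(hr_data$number_projects,       main = "Number of projects",    xlab = "")
hist(hr_data$average_monthly_hours, main = "Average monthly hours", xlab = "")
hist(hr_data$time_spend_company,    main = "Time spent in company", xlab = "")
par(mfrow = c(1, 1))  # reset the plotting grid
```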

Let us take a second glance at the binary variables: work_accident, resigned, and promotion_last_5years.

Frequency of accidents at work

Frequency of Accidents at Work Graph
  • Most employees (85.5%) did not have an accident

Frequency of resignations

Frequency of resignations graph
  • Most employees (76.2%) stayed with the organization and did not resign.

Frequency of promotions in the last 5 years

Frequency of Promotions in the Last 5 Years Graphs
  • Most employees (97.9%) did not receive a promotion in the last 5 years.

Exploring categorical variables: salary_grade and department.

Salary grade of employees

Salary Grade of Employees Graph
  • 8.2% of the organization is at the top level with the highest pay, 42.9% of the employees are paid a medium salary, and 48.7% of the employees are paid a low salary.

Number of employees in each department

No. of Employees in Different Departments Graph
  • The department 'sales' has the highest share of employees at 27%, and management has the lowest at only 4.2%.

3. Split data into training and validation

We will split the data into two parts, training and validation, but first let's understand why we do that. We train humans to perform a skill; similarly, we can train an algorithm to perform. To train a human, we let them practice toward perfecting their ability. For algorithms, we input data so that they can learn.

The algorithm identifies the pattern in the data and learns the intricacies and nuances of that pattern to build an ability to predict accurately. Therefore, we split our dataset so that we can test the trained model on a representative dataset where we already know the correct predictions. This will let us know how well the model that we trained is performing.
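
A minimal splitting sketch in R, assuming the hr_data frame from the exploration step; the 70/30 proportion here is illustrative rather than taken from the original notebook.

```r
set.seed(42)  # make the split reproducible

n          <- nrow(hr_data)
train_rows <- sample(seq_len(n), size = round(0.7 * n))

train      <- hr_data[train_rows, ]   # used to fit the model
validation <- hr_data[-train_rows, ]  # held back to check performance
```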

But before we train the model, we will create factors of the following variables:

  1. Department: The department each employee belongs to. There are a total of 10 departments; sales has the highest number of employees at 27% and management the lowest at only 4.2%.
  2. Salary grade: Represents the salary as low, medium, or high. 8.2% of the organization is at the top level with the highest pay, 42.9% of the employees are paid a medium salary, and 48.7% are paid a low salary.
  3. Resigned: 0 denotes those who stayed and 1 denotes those who resigned from the organization.

We create factors when we wish for each type within a variable to be treated as a category. For example, in R's memory, factorizing the variable 'salary_grade' will mean treating 'low,' 'medium,' and 'high' as individual categories. This ensures that the modeling functions treat each type correctly.
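
A minimal factoring sketch, applied to the train and validation frames from the split above; the column names follow the data dictionary and are assumptions about the actual file.

```r
# Convert the categorical and binary columns to factors so that R's
# modeling functions treat their values as categories, not numbers.
to_factors <- function(df) {
  df$department   <- factor(df$department)
  df$salary_grade <- factor(df$salary_grade, levels = c("low", "medium", "high"))
  df$resigned     <- factor(df$resigned)  # levels "0" (stayed) and "1" (resigned)
  df
}

train      <- to_factors(train)
validation <- to_factors(validation)
```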

4. Develop an initial model

The initial model is developed on the training data set.
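The article doesn't name the exact function used, so, as an assumption, here is a minimal sketch with the rpart package, using the factored train frame from the previous steps:

```r
# install.packages(c("rpart", "rpart.plot"))  # if not already installed
library(rpart)
library(rpart.plot)

# Grow a classification tree that predicts 'resigned' from all other columns.
cart_model <- rpart(resigned ~ ., data = train, method = "class")

# Draw the tree; each node shows the predicted class and class proportions.
rpart.plot(cart_model)
```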

Initial Model of Training Data Set

How to read the tree?

  • 1 denotes ‘resigned,’ and 0 denotes ‘stayed’
  • At the top, when no condition is applied to the training data set (train), the best guess is determined as 0 (stayed)
  • Of the total observations, 76% did not leave and 24% left

Interpreting two complete paths

Path 1: Will not leave (Loyal)

  • first condition: satisfaction level >= 47%
  • second condition: time_spend_company < 5 years
  • third condition: last_evaluation < 81%

Hence, those who did NOT leave are highly satisfied (at least 47%), have spent no more than 4 years in the organization, and had a last evaluation below 81%.

Path 2: Will leave (Resign)

  • first condition: satisfaction_level < 47%
  • second condition: number_project >= 3 projects
  • third condition: last_evaluation >= 58%

Hence, those who leave have low or moderate satisfaction (below 47%), a workload of 3 or more projects, and a last evaluation of at least 58%.

5. Identify the important variables

Identifying Important Variables
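
If you fit the tree with rpart as sketched above, the importance scores behind a chart like this can be read straight off the model object:

```r
# Named vector of importance scores, largest first
imp <- cart_model$variable.importance

# Scale to percentages so the variables are easy to compare
round(100 * imp / sum(imp), 1)
```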

Summary

Characterizing loyalty

11,428 employees, which is 76% of the data set, are loyal. Three conditions that affect loyalty are:

  • a high level of satisfaction (satisfaction_level >= 47%)
  • no more than 4 years spent in the organization (time_spend_company < 5 years)
  • a last evaluation below 81% (last_evaluation < 81%)

Characterizing those who left

3,571 employees, which is 24% of the data set, left. Three conditions that affect 'resigned' are:

  • low or moderate satisfaction (satisfaction_level < 47%)
  • a workload of 3 or more projects (number_project >= 3 projects)
  • a last evaluation of at least 58% (last_evaluation >= 58%)

HR analytics, the province of a few leading companies a decade ago, is now being widely applied by growing businesses to uncover surprising sources of talent and counterintuitive insights about what drives employees to be loyal to their organization. We hope this encourages you to leverage the power of HR analytics to retain talent and save hiring costs. You can follow the steps from this notebook to perform the analysis on your device by clicking on the button below:

Click For Code

June 10, 2022

This article lists 54 of the most shared data science quotes, covering data as an analogy, the importance of data, data analytics adoption, data wrangling, data privacy and security, and the future of data.

 

The growing reliance on data analytics has reset business practices, opening frontiers from innovation to productivity and competition. Moreover, these technologies are available at a much cheaper cost, making data a growing torrent flowing into every area of the global economy.

In this data-driven world of technological innovation, let’s take a look at some of the most popular data science quotes.

Learn with amazing data science quotes

 

Experts from every area of the economy have spoken of its capability and impact. We have a curated list for you of some of the famous and useful data science quotes:


 

Data science quotes about “data as an analogy”

 

1. “Information is the oil of the 21st century, and analytics is the combustion engine.”- Peter Sondergaard, Chairman Of The Board at DecideAct.

2. “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.”- Geoffrey Moore, management consultant and author of Crossing the Chasm.

3. “If you wanna do data science, learn how it is a technical, cultural, economic, and social discipline that has the ability to consolidate and rearrange societal power structures.” – Hugo Bowne-Anderson, Head of Developer Relations at OuterBounds.

4. “Possessed is the right word. I often tell people, I don't necessarily want to be a data scientist. You just kind of are a data scientist. You just can't help but look at that data set and go, I feel like I need to look deeper. I feel like that's not the right fit.” – Jennifer Shin, data science/machine learning/AI expert and founder of 8 Path Solutions.

5. “My least favorite description [of Deep Learning] is, “It works just like the brain.” I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain does.” – Yann LeCun, VP & Chief AI Scientist at Meta.

Data science quote – Yann LeCun

6. “AI is the new electricity. Just as electricity transformed industry after industry 100 years ago, I think AI will do the same.” – Andrew Ng, Founder & CEO of Landing AI, Founder of deeplearning.ai, Co-Chairman and Co-Founder of Coursera, and is currently an Adjunct Professor at Stanford University.

7. “Much of the power of artificial intelligence stems from its very mindlessness. Immune to the vagaries and biases that attend conscious thought, computers can perform their lightning-quick calculations without distraction or fatigue, doubt or emotion. The coldness of their thinking complements the heat of our own.” – Nicholas G. Carr, American writer on technology and business.

8. “We’ve defined our relationship with technology not as that of body and limb or even sibling and sibling, but as that of master and slave.” […] “With roles reversed, the metaphor also informs society’s nightmares about technology. As we become dependent on our technological slaves…we turn into slaves ourselves.” – Nicholas G. Carr, American writer on technology and business.

PRO TIP: Join our data science bootcamp program today to enhance your data analysis skillset!


Data science quotes about “the importance of data”

 

9. “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days.” – Eric Schmidt, Founding Partner, Innovation Endeavors.

 

10. “We are moving slowly into an era where big data is the starting point, not the end.” – Pearl Zhu, Author.

 

11. “Most of the world will make decisions by either guessing or using their guts. They will be either lucky or wrong.” – Suhail Doshi, chief executive officer, Mixpanel.

 

12. “We’re entering a new world in which data may be more important than software.” – Tim O’Reilly, founder, O’Reilly Media.

 

13. “Without big data, you are blind and deaf in the middle of a freeway.” – Geoffrey Moore, management consultant and theorist.

 

14. “Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.” – Aaron Levenstein, business professor at Baruch College.

 

15. “A data scientist is someone who can obtain, scrub, explore, model, and interpret data, blending hacking, statistics, and machine learning. Data scientists not only are adept at working with data but appreciate data itself as a first-class product.” – Hillary Mason, founder, Fast Forward Labs.

 

16. “Data Scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.” – Mike Loukides, editor, O’Reilly Media.

 

17. “Too often we forget that genius, too, depends upon the data within its reach, that even Archimedes could not have devised Edison’s inventions.” – Ernest Dimnet, priest, writer, and lecturer.

 

18. “The core advantage of data is that it tells you something about the world that you didn’t know before.”- Hilary Mason, data scientist and founder of Fast Forward Labs.

 


Data science quotes about “data analytics adoption”

 

19. “The biggest challenge of making the evolution from a knowing culture to a learning culture—from a culture that largely depends on heuristics in decision making to a culture that is much more objective and data-driven and embraces the power of data and technology—is really not the cost. Initially, it ends up being imagination and inertia…

 

What I have learned in my last few years is that the power of fear is quite tremendous in evolving oneself to think and act differently today, and to ask questions today that we weren’t asking about our roles before.

And it’s that mindset change—from an expert-based mindset to one that is much more dynamic and much more learning-oriented, as opposed to a fixed mindset—that I think is fundamental to the sustainable health of any company, large, small, or medium.” – Murli Buluswar, chief science officer, AIG.

 

20. “What we found challenging, and what I find in my discussions with a lot of my counterparts that is still a challenge, is finding the set of tools that enable organizations to efficiently generate value through the process.

 

I hear about individual wins in certain applications but having a more cohesive ecosystem in which this is fully integrated is something we are all struggling with, in part because it’s still very early days. Although we’ve been talking about it seemingly quite a bit over the past few years, the technology is still changing; the sources are still evolving.” – Ruben Sigala, former EVP and chief marketing officer, Caesars Entertainment.

 

21. “The human side of analytics is the biggest challenge to implementing big data.” – Paul Gibbons, author of “The Science of Successful Organizational Change.”

 

22. “Every day, three times per second, we produce the equivalent of the amount of data that the Library of Congress has in its entire print collection, right? But most of it is like cat videos on YouTube or 13-year-olds exchanging text messages about the next Twilight movie.” – Nate Silver, founder and editor in chief of FiveThirtyEight.

 

23. “One of the biggest challenges is around data privacy and what is shared versus what is not shared. And my perspective on that is consumers are willing to share if there's value returned. One-way sharing is not going to fly anymore. So how do we protect and how do we harness that information and become a partner with our consumers rather than kind of just a vendor for them?” – Zoher Karu, head of data and analytics, APAC and EMEA.

 

24. “The human side of analytics is the biggest challenge to implementing big data.” – Paul Gibbons, author of “The Science of Successful Organizational Change.”

 

25. “The first change we had to make was just to make our data of higher quality. We have a lot of data, and sometimes we just weren’t using that data, and we weren’t paying as much attention to its quality as we now need to… The second area is working with our people and making certain that we are centralizing some aspects of our business.

We are centralizing our capabilities, and we are democratizing its use. I think the other aspect is that we recognize as a team and as a company that we ourselves do not have sufficient skills, and we require collaboration across all sorts of entities outside of American Express.

 

This collaboration comes from technology innovators, it comes from data providers, it comes from analytical companies. We need to put a full package together for our business colleagues and partners so that it’s a convincing argument that we are developing things together, that we are co-learning, and that we are building on top of each other.” – Ash Gupta, former American Express executive; president, Payments and E-Commerce Innovation, LLC.

 

26. “On average, people should be more skeptical when they see numbers. They should be more willing to play around with the data themselves.” – Nate Silver, founder, and editor in chief of FiveThirtyEight.

 

27. “Think analytically, rigorously, and systematically about a business problem and come up with a solution that leverages the available data.” – Michael O’Connell, chief analytics officer, TIBCO.


 

Data science quotes about “data wrangling”

 

28. “The data fabric is the next middleware.” – Todd Papaioannou, entrepreneur, investor, and mentor.

 

29. “The goal is to turn data into information and information into insight.” – Carly Fiorina, former chief executive officer, Hewlett-Packard.

 

30. “No data is clean, but most is useful.” – Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ

 

31. “Errors using inadequate data are much less than those using no data at all.” – Charles Babbage, mathematician, engineer, inventor, and philosopher.

 

32. “Data are just summaries of thousands of stories–tell a few of those stories to help make the data meaningful.” – Chip and Dan Heath, authors of “Made to Stick” and “Switch.”

 

33. “In the spirit of science, there really is no such thing as a ‘failed experiment.’ Any test that yields valid data is a valid test.” –  Adam Savage, creator of MythBusters.

 

34. “If somebody tortures the data enough (open or not), it will confess anything.” – Paolo Magrassi, former vice president, research director, Gartner.

 

35. “I think you can have a ridiculously enormous and complex data set, but if you have the right tools and methodology, then it’s not a problem.” – Aaron Koblin, entrepreneur in data and digital technologies.

 

36. “Data that is loved tends to survive.” – Kurt Bollacker, computer scientist.

 

37. “Data is like garbage. You'd better know what you are going to do with it before you collect it.” – Mark Twain.

 

38. “We are surrounded by data but starved for insights.” – Jay Baer, marketing and customer experience expert.

 

39. “With data collection, ‘the sooner the better’ is always the best answer.”- Marissa Mayer, IT executive and co-founder of Lumi Labs, former Yahoo! President and CEO.

 

40. “Errors using inadequate data are much less than those using no data at all.”- Charles Babbage, mathematician, philosopher, inventor, and mechanical engineer.

 

Learn more about data wrangling

 


Data science quotes about “data privacy and security”

 

41. “The price of freedom is eternal vigilance. Don’t store unnecessary data, keep an eye on what’s happening, and don’t take unnecessary risks.” – Chris Bell, former U.S. congressman.

 

42. “It’s so cheap to store all data. It’s cheaper to keep it than to delete it. And that means people will change their behavior because they know anything they say online can be used against them in the future.”- Mikko Hypponen, security and privacy expert.

 

43. “In (the) digital era, privacy must be a priority. Is it just me, or is secret blanket surveillance obscenely outrageous?” – Al Gore, former vice president of the United States.

 

44. “You happily give Facebook terabytes of structured data about yourself, content with the implicit tradeoff that Facebook is going to give you a social service that makes your life better.” – John Battelle, founder, Wired magazine.

 

45. “Better be despised for too anxious apprehensions than ruined by too confident security.” – Edmund Burke, British philosopher and statesman.

 

46. “Everything we do in the digital realm—from surfing the web to sending an email to conducting a credit card transaction to, yes, making a phone call—creates a data trail. And if that trail exists, chances are someone is using it—or will be soon enough.” – Douglas Rushkoff, author of “Throwing Rocks at the Google Bus.”

 


 

Data science quotes about “the future of data”

 

47. “The world is one big data problem.” – Andrew McAfee, principal research scientist at MIT.

 

48. “Big data will spell the death of customer segmentation and force the marketer to understand each customer as an individual within 18 months or risk being left in the dust.” – Virginia M. (Ginni) Rometty, chairman, president, and CEO of IBM.

 

49. “Every company has big data in its future, and every company will eventually be in the data business.” – Thomas H. Davenport, American academic and author specializing in analytics, business process innovation, and knowledge management.

 

50. “We should teach the students, as well as executives, how to conduct experiments, how to examine data, and how to use these tools to make better decisions.” – Dan Ariely, professor of psychology and behavioral economics at Duke University and a founding member of the Center for Advanced Hindsight.

 

51. “Autodidacts—the self-taught, un-credentialed, data-passionate people—will come to play a significant role in many organizations' data science initiatives.” – Neil Raden, founder and principal analyst, Hired Brains Research.

 

52. “There’s a digital revolution taking place both in and out of government in favor of open-sourced data, innovation, and collaboration.”- Kathleen Sebelius, former U.S. Secretary of Health and Human Services.

 

53. “Big data will replace the need for 80% of all doctors.” – Vinod Khosla, co-founder of Sun Microsystems and founder of Khosla Ventures.

 

54. “I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding.” – Hal Varian, chief economist at Google.

Here’s a list of Techniques for Data Scientists to Upskill with LLMs

 

This extensive list of data science quotes highlights the growing impact of the field on modern-day businesses and how they are run. Take inspiration from the opinions of leaders about data analytics, data wrangling, data privacy, and a lot more. These data science quotes provide a unique insight into the world of data to get you started!

June 10, 2022

Take a look at wRC+, wOBA, and wRAA to determine if the shift is really creating a problem for Major League Baseball.

Argue all you want; nobody was better on the diamond than Ted Williams. The last player to finish a season with a .400 avg, Teddy Ballgame was also one of the first recipients of the defensive archetype that is taking Major League Baseball by storm today: The Shift.

What is the shift in major league baseball?

Typically, it is deployed when a pull-heavy power hitter is at the plate. To oversimplify, the defense moves to one side of the field with the sole purpose of creating a higher chance that the batter grounds out, pops out, or otherwise gets out.

In order to tell if this defensive style is working, we need to look at some data. We will be looking at data provided by FanGraphs. So before moving on, we need to understand how FanGraphs defines a defensive shift.

  • Shift – Traditional: This breaks out all plays where a traditional shift is employed. Generally, this implies there are three infielders to the right of second base (and it's how I filtered the data).
Traditional Defensive Shift for Baseball

 

  • Shift – Non-Traditional: This breaks out all plays which would not be considered traditional.
  • Shift – All: This breaks out all of it, traditional or non-traditional.
  • No Shift: This breaks out all plays where it was not used.

 

Video courtesy of the Seattle Mariners

The Oakland Athletics might have been exaggerating a bit against Ichiro, but that’s how some fans feel when the defense shifts. So what is it meant to do?

What it does

In baseball, the shift is meant to take away the part of the field the batter is most likely to hit the ball toward, theoretically making them more likely to get out. It's perfectly legal within the written rules of baseball, but baseball's a game full of religiously followed unwritten rules. To many people who argue against the shift, it's these unwritten rules that are being broken and why MLB should ban it.

Whether you or I believe that doesn't matter. What matters is if the data is telling us this works. Sure, you can say the only proof you need is that more teams are deploying this look on defense. The Chicago White Sox performed some variation of the shift against 1,079 batters in 2016, then nearly doubled that in 2018, shifting against 2,150 batters.

No. of Total Batters Faced by the Chicago White Sox

 

No. of Total Batters Faced by the Houston Astros

 

Unfortunately, this doesn't hold true for all teams. For example, the Houston Astros, who are notorious for using the shift, shifted against 2,052 batters in 2016 but only 1,892 batters in 2018. Looking at the trend in the number of times a team uses the shift only gives us a surface-level understanding of whether or not it's working. Let's dig deeper.

PRO TIP: Join our data science bootcamp today to learn more about data analysis!

Weighted on-base average

Weighted On-Base Average, or wOBA, “is a rate statistic which attempts to credit a hitter for the value of each outcome (single, double, etc.) rather than treating all hits or times on base equally.” Essentially, it puts an assigned weight on every outcome to account for the amount of value each outcome is perceived to carry. League-average wOBA is always scaled to the league average on-base percentage, but we're going to use a wOBA league average of .320 (because that's what FanGraphs says is typical for an average player).
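
To make “an assigned weight on every outcome” concrete, here is a rough R sketch of the wOBA calculation. It follows the general FanGraphs form, but the exact weights are season-specific constants published each year, so treat the coefficients below as illustrative only.

```r
# Illustrative wOBA: weighted outcomes divided by a plate-appearance-like denominator.
# uBB = unintentional walks, HBP = hit by pitch, X1B/X2B/X3B = singles/doubles/triples.
woba <- function(uBB, HBP, X1B, X2B, X3B, HR, AB, BB, IBB, SF) {
  num <- 0.69 * uBB + 0.72 * HBP + 0.88 * X1B +
         1.25 * X2B + 1.58 * X3B + 2.00 * HR
  den <- AB + BB - IBB + SF + HBP
  num / den
}

# A hypothetical hitter's season line
woba(uBB = 50, HBP = 5, X1B = 100, X2B = 30, X3B = 3, HR = 25,
     AB = 520, BB = 55, IBB = 5, SF = 4)  # about .377, well above the .320 average used here
```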

If we look at that magical .320 in the chart below, we see there were only five teams that had a team wOBA above the league average against the shift. That's one less team than in 2016, which had six teams above league average at the end of the season.

Weighted On-Base Average with Shift for all MLB

 

Now, I don’t know about you, but that doesn’t really tell me anything other than teams really didn’t change that much between years (and the trends would agree).

So now let’s look at the data from the 2018 season. The graph below shows us the wOBA of teams when the defense is in a traditional shift versus a normal defense (no shift).

Weighted On-Base Average with Shift and No-Shift

 

The difference isn't staggering, but it is noticeable. We can see there are 4 teams with a wOBA above the .320 mark against the shift, while none of the teams met the average with no shift. Take this with a grain of salt. Typically big, pull-heavy power hitters are most often shifted against, and home runs have a higher weight added to them than any other outcome. It could be that the shift is showing a higher wOBA because more players are attempting to beat the shift by hitting over it. With Statcast reporting 1.9% of pitches in a shift resulting in a ground ball versus 2.5% with the ball in the air, it looks like hitters are choosing not to sacrifice power for on-base percentage.

Weighted runs above average

Weighted Runs Above Average, or wRAA, lets us measure “the number of offensive runs a player contributes to their team compared to the average player.” Zero is considered the league average, so anything positive is helping the team out.
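
The conversion from wOBA to wRAA is a one-line formula. In this sketch, the league wOBA (.320, as above) and the wOBA scale (roughly 1.15 to 1.25) are season-specific constants published by FanGraphs, so the defaults here are illustrative.

```r
# Illustrative wRAA: how many runs a hitter adds relative to a league-average bat.
wraa <- function(wOBA, PA, lg_wOBA = 0.320, wOBA_scale = 1.20) {
  ((wOBA - lg_wOBA) / wOBA_scale) * PA
}

# A hitter with a .350 wOBA over 600 plate appearances
wraa(wOBA = 0.350, PA = 600)  # = 15 runs above average
```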

Like wOBA, I created a graph comparing wRAA between 2018 and 2016 when players are batting against a shift. And like wOBA, it doesn’t really tell us much. It looks like some teams made adjustments, while others didn’t.

Weighted Runs Above Average with Shift Comparison

 

This is where things get interesting. We see a big difference when comparing the 2018 shift statistics vs. no shift. Teams typically have a higher wRAA with the shift than without.

Weighted Runs Above Average with Shift and No-Shift Comparison

 

Once again, this should be taken with a grain of salt (that makes two now), but it does look like the shift doesn’t stop people from scoring. In fact, you could argue that the shift is allowing more teams to score.

Weighted runs created plus

Weighted Runs Created Plus, or wRC+, is similar to wOBA in that it assigns weights to outcomes in order to credit a hitter for a higher-valued outcome, but it also takes into account that all ballparks create a different environment for scoring runs. wRC+ quantifies a player’s total offensive value measured by runs. The league average is scaled to 100.

In the graph below, teams didn't fare much differently when batting against the shift in 2018 than they did in 2016. The trend lines are almost identical, which leads me to believe the shift really hasn't changed much about the game when it comes to creating runs.

Weighted Runs Created Plus with Shift Comparison

 

But if we look at the difference in 2018 between batting against a shift and no shift, there is a subtle difference (like 3 percentage points). Not really enough to convince me the shift is creating this major problem in baseball that must be stopped.

Weighted Runs Created Plus with Shift and No-Shift Comparison

 

If anything, it’s helping teams like the Rays and Marlins actually score runs. Both teams are named after ocean creatures. Both had a wRC+ against the shift of more than 100 and a positive wRAA against the shift in 2018. Coincidence? I’ll let you decide.

Recap

To recap, wOBA, wRAA, and wRC+ suggest the shift might not be creating the defensive outcome teams are looking for. Personally, I don’t think we have quite enough data to draw insightful conclusions about the shift.

However, from the limited data available, we can see a 2:1 ratio of outs to hits as a percentage of pitches thrown while teams are using the shift during the 2018 season. To break it down, 2.9% of pitches thrown in a shift resulted in an out, while 1.4% resulted in a hit. We also see a 2:1 ratio when teams are in a no-shift defense. 8.6% of pitches resulted in an out versus 4.3% resulting in a hit.


Before you make a decision, please read up about what other people are saying. Here are a few good articles you can read to help you form an opinion about the shift.

Do you want to learn data science and higher-level analytics? Check out Data Science Dojo’s data science bootcamp!

June 9, 2022
