Data Analytics

How to create a Data Analytics RFP in 2023? 
Ava Mae
| December 1, 2022

In this blog, we will discuss what a Data Analytics RFP is and the five steps involved in the data analytics RFP process.


Data storytelling for successful brand building
Bilal Awan
| November 30, 2022

In this blog, we discuss data storytelling for successful brand building, its components, and data-driven brand storytelling.

What is data storytelling? 

Data storytelling is the process of deriving insights from a dataset through analysis and making them presentable through visualization. It not only captures insights but also makes the content visually accessible so that stakeholders can make data-driven decisions.

With data storytelling, you can influence and inform your audience based on your analysis.  

 

There are 3 important components of data storytelling.  

  1. Data: The data you analyze builds the foundation of your story. This could involve descriptive, diagnostic, predictive, or prescriptive analysis to give the full picture. 
  2. Narrative: Also known as a storyline, the narrative is used to communicate the insights gained from your analysis. 
  3. Visualization: Visualization helps communicate the story clearly and effectively, making use of graphs, charts, diagrams, and audio-visuals. 

 

The benefits of data storytelling

data storytelling - infographic
Data storytelling 

 

So, the question arises: why do we even need storytelling for data? The simple answer is that it helps with decision-making. But let’s take a look at some of the benefits of data storytelling. 

  • Adding value to your data and insights. 
  • Interpreting complex information and highlighting essential key points for the audience. 
  • Providing a human touch to your data. 
  • Offering value to your audience and industry. 
  • Building credibility as an industry and topic thought leader.

 

For example, Airbnb uses data storytelling to help guests find the right stay at the right price and to help hosts set up their listings in the most lucrative locations.

 

Data storytelling helps Airbnb deliver personalized experiences and recommendations. Its price tips feature is constantly updated to guide hosts on how likely they are to get a booking at a chosen price. Other features, available in real time through the app, draw on host/guest interactions, current events, and local market history.

 

Data-driven brand storytelling 

Now that we have an understanding of data storytelling, let’s talk about how brand storytelling works. Data-driven brand storytelling is when a company uses research, studies, and analytics to share information about a brand and tell a story to consumers.  

It turns complex datasets into an insightful, easy-to-understand, visually comprehensible story. This differs from purely creative storytelling, where the brand focuses only on creating a perception; here, the story is based on factual data.

Storytelling is a great way to build brand association and connect with your consumers. Data-driven storytelling uses visualization that captures attention.  

 

Learn how to create and execute data visualization and tell a story with your data by enrolling in our 5-day live Power BI training 

 

Studies show that our brains process images 60,000 times faster than words, that 90% of the information transmitted to the brain is visual, and that we are 65% more likely to retain information that is visual.

That’s why infographics, charts, and images are so useful.  

For example, Tower Electric Bikes, a direct-to-consumer e-bike brand, used an infographic to rank the most and least bike-friendly cities across the US. This way, they turned an enormous amount of data into a visually friendly infographic that bike consumers can interpret at a glance.

 

bike friendly cities infographic
Bike friendly cities infographic – Source: Tower electric bikes

  

Using the power of storytelling for marketing content 

Even though consumers interpret all content as data, visual content provides the most value in terms of memorability, impact, and capturing attention. The job of any successful brand is to build a positive association in consumers’ minds.

Storytelling helps create those positive associations by providing high-value engaging content, capturing attention, and giving meaning to not-so-visually appealing datasets. 

We live in a world that is highly cluttered by advertising and paid promotional content. To make your content stand out from competitors you need to have good visualization and a story behind it. Storytelling helps assign meaning and context to data that would otherwise look unappealing and dry.  

Consumers gain clarity and a better understanding, and they share more when content makes sense to them. Data storytelling helps extract and communicate insights that, in turn, support your consumers’ buying journey.

It could be content relevant to any stage of their buyer journey or even outside of the sales cycle. Storytelling helps create engaging and memorable marketing content that helps grow your brand.

Learn how to use data visualization, narratives, and real-life examples to bring your story to life with our free community event Storytelling Data. 

 

Executing flawless data-driven brand storytelling 

Now that we have a better understanding of brand storytelling, let’s look at how to craft a story and the important steps involved.

Craft a compelling narrative 

The most important element in building a story is the narrative. You need a compelling narrative for your story. There are 4 key elements to any story. 

Characters: These are your key players or stakeholders in your story. They can be customers, suppliers, competitors, environmental groups, government, or any other group that has to do with your brand.  

Setting: This is where you use your data to reinforce the narrative, whether it’s an improved product feature that increases safety or a manufacturing process that accounts for environmental impact. This is the stage where you define the environment that concerns your stakeholders.

Conflict: Here you describe the root issue or problem you’re trying to solve with data. For example, if certain marketing content generated sales revenue, you may want your team to understand why, so it can create more helpful content for the sales team. Conflict plays a crucial role in making your story relevant and engaging; there needs to be a problem for a data solution.

Resolution: Finally, you want to propose a solution to the identified problem. You can present a short-term fix along with a long-term pivot depending on the type of problem you are solving. At this stage, your marketing outreach should be consistent with a very visible message across all channels.

You don’t want to create confusion: whatever resolution or result you’ve achieved through analysis should be clearly indicated, with supporting evidence and compelling visualization to bring your story to life.

Your storytelling needs all of these elements to communicate your message effectively to the desired audience. With them in place, your audience will walk through a compelling, engaging, and impactful story.

 

Start learning data storytelling today

Our brains are hard-wired to love stories and visuals. Storytelling is not something new; it dates back to 1700 BCE, from cave paintings to symbol language. That is why it resonates so well in today’s fast-paced, cluttered consumer environment.

Brands can use storytelling based on factual data to engage, create positive associations and finally encourage action. The best way to come up with a story narrative is to use internal data, success stories, and insights driven by your research and analysis. Then translate those insights into a story and visuals for better retention and brand building. 

 


 

 


 

 

10 ways data analytics can help you generate more leads 
Ava-Mae
| November 17, 2022

In this article, we’re going to talk about how data analytics can help your business generate more leads and why you should rely on data when making decisions regarding a digital marketing strategy. 

Some people believe that marketing is about creativity – unique and interesting campaigns, quirky content, and beautiful imagery. Contrary to their beliefs, data analytics is what actually powers marketing – creativity is simply a way to accomplish the goals determined by analytics. 

Now, if you’re still not sure how you can use data analytics to generate more leads, here are our top 10 suggestions. 

1. Know how your audience behaves

Most businesses have an idea or two about who their target audience is. But having an idea or two is not good enough if you want to grow your business significantly – you need to be absolutely sure who your audience is and how they behave when they come to your website. 

Now, the best way to do that is to analyze the website data.  

You can tell quite a lot by simply looking at the right numbers. For instance, if you want to know whether the users can easily find the information they’re looking for, keep track of how much time they spend on a certain webpage. If they leave the webpage as soon as it loads, they probably didn’t find what they needed. 
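To make this concrete, here is a minimal pandas sketch of such a check, assuming a hypothetical export from your analytics tool with a page column and the seconds each visitor spent on it (the column names and the 10-second threshold are illustrative assumptions, not a standard):

import pandas as pd

# Hypothetical analytics export: one row per page view.
page_views = pd.DataFrame({
    "page": ["/pricing", "/pricing", "/blog/analytics", "/blog/analytics", "/contact"],
    "seconds_on_page": [4, 210, 95, 130, 8],
})

# Average dwell time per page and the share of near-immediate exits
# (arbitrarily defined here as views shorter than 10 seconds).
summary = page_views.groupby("page")["seconds_on_page"].agg(
    avg_seconds="mean",
    quick_exit_rate=lambda s: (s < 10).mean(),
)
print(summary)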

We know that looking at spreadsheets is a bit boring, but you can easily obtain Power BI Certification and use Microsoft Power BI to make data visuals that are easy to understand and pleasing to the eye. 

 

 

 

 

Data analytics books
Books on Data Analytics – Compilation by Data Science Dojo

Read the top 12 data analytics books to learn more about it

 

2. Segment your audience

A great way to satisfy the needs of different subgroups within your target audience is to use audience segmentation. Using that, you can create multiple funnels for the users to move through instead of just one, thereby increasing your lead generation. 

Now, before you segment your audience, you need to have enough information about these subgroups so that you can divide them and identify their needs. Since you can’t individually interview users and ask them for the necessary information, you can use data analytics instead. 

Once you have that, it’s time to identify their pain points and address them differently for different subgroups, and voilà – you’ve got yourself more leads.
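As a rough illustration of how segments can be derived from data, the sketch below clusters synthetic behavioral data with scikit-learn's KMeans; the features, values, and the choice of three segments are assumptions for the example, not a recommendation:

import pandas as pd
from sklearn.cluster import KMeans

# Synthetic behavioral data: one row per user.
users = pd.DataFrame({
    "visits_per_month": [1, 2, 12, 15, 3, 20, 18, 2],
    "avg_order_value":  [20, 25, 90, 110, 30, 150, 95, 15],
})

# Group users into three behavioral segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
users["segment"] = kmeans.fit_predict(users)

# Profile each segment to decide how to address it in a separate funnel.
print(users.groupby("segment").mean())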

3. Use data analytics to improve buyer persona

Knowing your target audience is a must but identifying a buyer persona will take things to the next level. A buyer persona doesn’t only contain basic information about your customers. It goes deeper than that and tells you their exact age, gender, hobbies, location, and interests.  

It’s like describing a specific person instead of a group of people. 

Of course, not all your customers will fit that description to a T, but that’s not the point. The point is to have that one idea of a person (or maybe two or three buyer personas) in your mind when creating content for your business.  

buyer persona - Data analytics
Understanding buyer persona with the help of Data analytics  [Source: Freepik] 

 

4. Use predictive marketing 

While data analytics should absolutely be used in retrospectives, there’s another purpose for the information you obtain through analytics – predictive marketing. 

Predictive marketing is basically using big data to develop accurate forecasts of customers’ behavior. It uses complex machine-learning algorithms to build predictive models. 

A good example of how that works is Amazon’s landing page, which includes personalized recommendations.  

Amazon doesn’t only keep track of the user’s previous purchases, but also what they have clicked on in the past and the types of items they’ve shown interest in. By combining that with the time and season of purchase, they are able to make highly accurate recommendations.
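Amazon's actual models are far more sophisticated, but a minimal sketch of the idea, using synthetic data and a simple scikit-learn logistic regression to estimate each user's probability of purchasing, might look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features: past purchases and product-page clicks in the last 30 days.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[2, 10], size=(500, 2))
# Toy labeling rule: more activity means a higher chance of buying again.
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=500) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("holdout accuracy:", model.score(X_test, y_test))
# Purchase probabilities per user, usable for ranking recommendations.
print(model.predict_proba(X_test)[:5, 1])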

lead generation
Acquiring customers – Lead generation

 

If you’re curious to find out how data science works, we suggest that you enroll in the Data Science Bootcamp

 

5. Know where website traffic comes from 

Users come to your website from different places.  

Some have searched for it directly on Google, some have run into an interesting blog piece on your website, while others have seen your ad on Instagram. This means that the time and effort you put into optimizing your website and creating interesting content pay off.

But imagine creating a YouTube ad that doesn’t bring much traffic – that doesn’t pay off at all. You’d then want to rework your campaign or redirect your efforts elsewhere.  

This is exactly why knowing where website traffic comes from is valuable. You don’t want to invest your time and money into something that doesn’t bring you any benefits. 
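As a small illustration, assuming a hypothetical session log that records each visit's source and whether it converted, a quick pandas summary can show which channels bring traffic that actually pays off:

import pandas as pd

# Hypothetical session log exported from an analytics tool.
sessions = pd.DataFrame({
    "source":    ["google", "instagram", "youtube", "google", "blog", "youtube", "instagram"],
    "converted": [1, 0, 0, 1, 1, 0, 1],
})

by_source = sessions.groupby("source").agg(
    visits=("converted", "size"),
    conversion_rate=("converted", "mean"),
).sort_values("conversion_rate", ascending=False)

# Channels that bring visits but rarely convert stand out immediately.
print(by_source)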

6. Understand which products work 

Most of the time, you can determine what your target audience will like and dislike. The more information you have about your target audience, the better you can satisfy their needs.  

But no one is perfect, and anyone can make a mistake. 

Heinz, a company known for producing ketchup and other food, once released a new product: EZ Squirt ketchup in shades of purple, green, and blue. At first, kids loved it, but this didn’t last for long. Six years later, Heinz halted production of these products.

As you can see, even big and experienced companies flop sometimes. A good way to avoid that is by tracking which product pages have the least traffic and don’t sell well. 

7. Perform competitor analysis 

Keeping an eye on your competitors is never a bad idea. No matter how well you’re doing and how unique you are, others will try to surpass you and become better. 

The good news is that there are quite a few tools online that you can use for competitor analysis. SEMrush, for instance, can help you see what the competition is doing to get qualified leads so that you can use it to your advantage. 

Even if the tool you need doesn’t exist, you can always enroll in a Python for Data Science course and learn to build your own tools to track the data that drives your lead generation.

competitor analysis - data analytics
Performing competitor analysis through data analytics [Source: Freepik] 

8. Nurture your leads

Nurturing your leads means developing a personalized relationship with your prospects at every stage of the sales funnel in order to get them to buy your products and become your customers. 

Because lead nurturing offers a personalized approach, you’ll need information about your leads: what is their title, role, industry, and similar info, depending on what your business does. Once you have that, you can provide them with the relevant content that will help them decide to buy your products and build brand loyalty along the way. 

This is something b2b lead generation companies can help you with if you’re hesitant to do it on your own.  

9. Gain more customers

Having an insight into your conversion rate, churn rate, sources of website traffic, and other relevant data will ultimately lead to more customers. For instance, your sales team will be able to calculate which sources convert most effectively and prepare resources before running a campaign. 

The more information you have, the better you’ll perform, and this is exactly why Data Science for Business is important – you’ll be able to see the bigger picture and make better decisions. 

data analysts performing data analysis of customer's data
Data analysts performing data analysis of customer’s data

10. Avoid significant losses 

Finally, data can help you avoid certain losses by halting the launch of a product that won’t do well. 

For instance, you can use a Coming Soon page to research the market and see if your customers are interested in a new product you plan on launching. If enough people show interest, you can start producing; if not, you won’t waste your money on something that was bound to fail.

 

Conclusion:

Applications of data analytics go beyond simple data analysis, especially for advanced analytics projects. The majority of the labour is done up front in the data collection, integration, and preparation stages, followed by the creation, testing, and revision of analytical models to make sure they give reliable findings. Data engineers, who build data pipelines and aid in the preparation of data sets for analysis, are frequently included within analytics teams in addition to data scientists and other data analysts.

Scrape Twitter data without Twitter API using SNScrape for timeseries analysis 
Syed Umair Hasan
| November 16, 2022

A hands-on guide to collecting and storing Twitter data for time-series analysis

“A couple of weeks back, I was working on a project in which I had to scrape tweets from Twitter and, after storing them in a CSV file, plot some graphs for time-series analysis. I requested access to the Twitter developer API, but unfortunately my request was not fulfilled. Then I started searching for Python libraries that would allow me to scrape tweets without the official Twitter API.

To my amazement, there were several libraries through which you can scrape tweets easily, but for my project I found ‘Snscrape’ to be the library that best met my requirements!”

What is Snscrape? 

Snscrape is a scraper for social networking services (SNS). It retrieves objects, such as relevant posts, by scraping things like user profiles, hashtags, or searches.

 

Install Snscrape 

Snscrape requires Python 3.8 or higher. The Python package dependencies are installed automatically when you install Snscrape. You can install using the following commands. 

  • pip3 install snscrape 

  • pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git (Development Version) 

 

For this tutorial we will be using the development version of Snscrape. Paste the second command into the command prompt (cmd), and make sure you have Git installed on your system.

 

Code walkthrough for scraping

Before starting, make sure you have the following Python libraries installed:

  • Pandas 
  • Numpy 
  • Snscrape 
  • Tqdm 
  • Seaborn 
  • Matplotlib 

Importing Relevant Libraries 

To run the scraping program, you will first need to import the libraries 

import pandas as pd
import numpy as np
import snscrape.modules.twitter as sntwitter
import datetime
from tqdm.notebook import tqdm_notebook
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

 

 

Taking User Input 

To scrape tweets, you can provide several filters, such as the username, start date, or end date. We will take the following user inputs, which will then be passed to Snscrape.

  • Text: The query to be matched. (Optional) 
  • Username: Specific username from twitter account. (Required) 
  • Since: Start Date in this format yyyy-mm-dd. (Optional) 
  • Until: End Date in this format yyyy-mm-dd. (Optional) 
  • Count: Max number of tweets to retrieve. (Required) 
  • Retweet: Include or Exclude Retweets. (Required) 
  • Replies: Include or Exclude Replies. (Required) 

 

For this tutorial we used the following inputs: 

text = input('Enter query text to be matched (or leave it blank by pressing enter)')
username = input('Enter specific username(s) from a twitter account without @ (or leave it blank by pressing enter): ')
since = input('Enter startdate in this format yyyy-mm-dd (or leave it blank by pressing enter): ')
until = input('Enter enddate in this format yyyy-mm-dd (or leave it blank by pressing enter): ')
count = int(input('Enter max number of tweets or enter -1 to retrieve all possible tweets: '))
retweet = input('Exclude Retweets? (y/n): ')
replies = input('Exclude Replies? (y/n): ')

 

Which fields can we scrape?

Here is the list of fields we can scrape using the Snscrape library.

  • url: str 
  • date: datetime.datetime 
  • rawContent: str 
  • renderedContent: str 
  • id: int 
  • user: ‘User’ 
  • replyCount: int 
  • retweetCount: int 
  • likeCount: int 
  • quoteCount: int 
  • conversationId: int 
  • lang: str 
  • source: str 
  • sourceUrl: typing.Optional[str] = None 
  • sourceLabel: typing.Optional[str] = None 
  • links: typing.Optional[typing.List[‘TextLink’]] = None 
  • media: typing.Optional[typing.List[‘Medium’]] = None 
  • retweetedTweet: typing.Optional[‘Tweet’] = None 
  • quotedTweet: typing.Optional[‘Tweet’] = None 
  • inReplyToTweetId: typing.Optional[int] = None 
  • inReplyToUser: typing.Optional[‘User’] = None 
  • mentionedUsers: typing.Optional[typing.List[‘User’]] = None 
  • coordinates: typing.Optional[‘Coordinates’] = None 
  • place: typing.Optional[‘Place’] = None 
  • hashtags: typing.Optional[typing.List[str]] = None 
  • cashtags: typing.Optional[typing.List[str]] = None 
  • card: typing.Optional[‘Card’] = None 

 

For this tutorial we will not scrape all the fields but a few relevant fields from the above list. 

The search function

Next, we will define a search function which takes the following inputs as arguments and creates a query string to be passed to Snscrape's Twitter search scraper.

  • Text 
  • Username 
  • Since 
  • Until 
  • Retweet 
  • Replies 

 

def search(text,username,since,until,retweet,replies):
    global filename
    q = text
    if username!='':
        q += f" from:{username}"
    if until=='':
        until = datetime.datetime.strftime(datetime.date.today(), '%Y-%m-%d')
    q += f" until:{until}"
    if since=='':
        since = datetime.datetime.strftime(datetime.datetime.strptime(until, '%Y-%m-%d') -
                                           datetime.timedelta(days=7), '%Y-%m-%d')
    q += f" since:{since}"
    if retweet == 'y':
        q += f" exclude:retweets"
    if replies == 'y':
        q += f" exclude:replies"
    if username!='' and text!='':
        filename = f"{since}_{until}_{username}_{text}.csv"
    elif username!="":
        filename = f"{since}_{until}_{username}.csv"
    else:
        filename = f"{since}_{until}_{text}.csv"
    print(filename)
    return q

 

Here we have defined different conditions and build the query string based on them. For example, if the variable until (end date) is empty, we assign it the current date and append it to the query string; if the variable since (start date) is empty, we assign it the date seven days before the end date. Along with the query string, we create a filename string which will be used to name our CSV file.

 

 

Calling the Search Function and creating Dataframe 

 

q = search(text,username,since,until,retweet,replies)

# Creating list to append tweet data
tweets_list1 = []

# Using TwitterSearchScraper to scrape data and append tweets to list
if count == -1:
    for i,tweet in enumerate(tqdm_notebook(sntwitter.TwitterSearchScraper(q).get_items())):
        tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username, tweet.lang,
                             tweet.hashtags, tweet.replyCount, tweet.retweetCount, tweet.likeCount,
                             tweet.quoteCount, tweet.media])
else:
    with tqdm_notebook(total=count) as pbar:
        for i,tweet in enumerate(sntwitter.TwitterSearchScraper(q).get_items()):
            if i >= count:  # stop once the requested number of tweets has been scraped
                break
            tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username, tweet.lang,
                                 tweet.hashtags, tweet.replyCount, tweet.retweetCount, tweet.likeCount,
                                 tweet.quoteCount, tweet.media])
            pbar.update(1)

# Creating a dataframe from the tweets list above
tweets_df1 = pd.DataFrame(tweets_list1, columns=['DateTime', 'TweetId', 'Text', 'Username', 'Language',
                                                 'Hashtags', 'ReplyCount', 'RetweetCount', 'LikeCount',
                                                 'QuoteCount', 'Media'])

 

 

 

In this snippet we have invoked the search function, and the query string is stored in the variable ‘q’. Next, we define an empty list which will be used for appending tweet data. If the count is specified as -1, the for loop iterates over all the tweets. The TwitterSearchScraper constructor takes the query string as an argument, and we then invoke its get_items() method to retrieve the tweets. Inside the for loop we append the scraped data to the tweets_list1 variable defined earlier. If a count is defined, we use it to break out of the for loop. Finally, using this list, we create the pandas dataframe by specifying the column names.

 

tweets_df1.sort_values(by='DateTime',ascending=False) 
Data frame - pandas library
Data frame created using the pandas library

 

Data Preprocessing

Before saving the data frame in a csv file, we will first process the data, so that we can easily perform analysis on it. 

 

 

Data Description 

tweets_df1.info() 
Data frame - pandas library
Data frame created using the pandas library

 

Data Transformation 

Now we will add more columns to facilitate time-series analysis.

tweets_df1['Hour'] = tweets_df1['DateTime'].dt.hour
tweets_df1['Year'] = tweets_df1['DateTime'].dt.year
tweets_df1['Month'] = tweets_df1['DateTime'].dt.month
tweets_df1['MonthName'] = tweets_df1['DateTime'].dt.month_name()
tweets_df1['MonthDay'] = tweets_df1['DateTime'].dt.day
tweets_df1['DayName'] = tweets_df1['DateTime'].dt.day_name()
tweets_df1['Week'] = tweets_df1['DateTime'].dt.isocalendar().week

 

The DateTime column contains both date and time, so it is better to split them into separate Date and Time columns.

tweets_df1['Date'] = [d.date() for d in tweets_df1['DateTime']] 

tweets_df1['Time'] = [d.time() for d in tweets_df1['DateTime']] 

 

After splitting we will drop the DateTime column. 

tweets_df1.drop('DateTime',axis=1,inplace=True) 

tweets_df1 

 

Finally, our data is prepared. We will now save the dataframe as a CSV using the df.to_csv() function, which takes the filename as an input parameter.

tweets_df1.to_csv(f"{filename}",index=False)

Visualizing timeseries data using barplot, lineplot, histplot and kdeplot 

It is time to visualize our prepared data so that we can find useful insights. First, we will load the saved CSV into a dataframe using the read_csv() function of pandas, which takes the filename as an input parameter.

tweets = pd.read_csv("2018-01-01_2022-09-27_DataScienceDojo.csv") 

tweets 

 

Data frame - pandas library
Data frame created using the pandas library

 

Count by Year 

The countplot function of seaborn allows us to plot the count of tweets by year.

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Year']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12) 

 
Plot count of tweets - Bar graph
Plot count of tweets – Bar graph

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Year.value_counts()) 

ax.set_xlabel("Year") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(2018,2023,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Year,stat='count',binwidth=1,kde='true',discrete=True) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Year,fill=True) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Year,fill=True,bw_adjust=3) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

Plot count of tweets - per year
Plot count of tweets – per year

 

Count by Month 

We will follow the same steps for count by month, by week, by day of month and by hour. 

 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Month']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12) 

 
Monthly Tweet counts - chart
Monthly Tweet counts – chart

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Month.value_counts()) 

ax.set_xlabel("Month") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(1,13,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Month,stat='count',binwidth=1,kde='true',discrete=True) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Month,fill=True) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Month,fill=True,bw_adjust=3) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

Monthly tweets count chart
Monthly tweets count chart

 

 

Count by Week 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Week']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.005, p.get_height()+5), fontsize = 10) 

 

Weekly tweets count chart
Weekly tweets count chart

 

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Week.value_counts()) 

ax.set_xlabel("Week") 

ax.set_ylabel('Count') 

 

plt.subplot(222) 

sns.histplot(x=tweets.Week,stat='count',binwidth=1,kde='true',discrete=True) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Week,fill=True) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Week,fill=True,bw_adjust=3) 

plt.grid() 

 

plt.tight_layout() 

plt.show()  

 

Weekly tweets count charts
Weekly tweets count charts

 

 

Count by Day of Month 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['MonthDay']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+5), fontsize = 12) 

 

 

Daily tweets count chart
Daily tweets count chart
plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.MonthDay.value_counts()) 

ax.set_xlabel("MonthDay") 

ax.set_ylabel('Count') 

 

plt.subplot(222) 

sns.histplot(x=tweets.MonthDay,stat='count',binwidth=1,kde='true',discrete=True) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.MonthDay,fill=True) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.MonthDay,fill=True,bw_adjust=3) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

 
Daily tweets count charts
Daily tweets count charts

 

 

 

 

 

 

 

Count by Hour 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Hour']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12) 
hourly tweets count chart
hourly tweets count chart

 

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Hour.value_counts()) 

ax.set_xlabel("Hour") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(0,24,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Hour,stat='count',binwidth=1,kde='true',discrete=True) 

plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Hour,fill=True) 

plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Hour,fill=True,bw_adjust=3) 

#plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

Hourly tweets count charts
Hourly tweets count charts

 

 

Conclusion 

From the above time-series visualizations, we can clearly see that the peak tweeting hours for this account are between 7 pm and 9 pm, and that the handle is quiet from 4 am to 1 pm. We can also see that most of the tweets on this topic were posted in the month of August. Similarly, we can tell that the Twitter handle was not very active before 2021.

In conclusion, we saw how easily we can scrape tweets without the Twitter API using Snscrape. We then performed some transformations on the scraped data and stored it in a CSV file, which we later used for time-series visualizations and analysis. We appreciate you following along with this hands-on guide and hope it makes it easy for you to get started on your upcoming data science project.

<<Link to Complete Code>> 

Metabase: Analyze and learn data with just a few clicks
Saad Shaikh
| November 5, 2022

Data Science Dojo is offering Metabase for FREE on Azure Marketplace packaged with web accessible Metabase: Open-Source server. 

Metabase query
Metabase query

 

Introduction 

Organizations often adopt strategies to make their selling points more productive. One strategy is to use prior business data to identify key patterns for a product and then make decisions for it accordingly. However, this work is hectic and costly, and it requires domain experts. Metabase bridges that skills gap: it gives marketing and business professionals an easy-to-use query builder notebook to extract the required data and simultaneously visualize it, without any SQL coding, in just a few clicks.

What is Metabase and what are its questions? 

Metabase is an open-source business intelligence framework that provides a web interface to import data from diverse databases and then analyze and visualize it with a few clicks. Metabase’s methodology is based on questions and the answers to them; these form the foundation of everything else it provides.

           

A question is any kind of query that you want to perform on your data. Once you have specified the query functions in the notebook editor, you can visualize the query results. After that, you can save the question for reuse and turn it into a data model for business-specific purposes.

Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to become an expert in data science and analytics.

Challenges for businesses  

For businesses that lack expert analysts, engineers, and a substantial IT department, it is costly and time-consuming to hire new domain experts or to have managers learn to code themselves and then explore and visualize the data. Apart from that, not many pre-existing applications provide diverse data source connections, which is another challenge.

In this regard, a straightforward interactive tool that even newcomers can pick up immediately to get the job done would be the ideal solution.

Data analytics with Metabase  

Metabase concept is based on questions which are basically queries and data models (special saved questions). It provides an easy-to-use notebook through which users can gather raw data, filter it, join tables, summarize information, and add other customizations without any need for SQL coding.

Users can select the dimensions of columns from tables and then create various visualizations and embed them in different sub-dashboards. Metabase is frequently utilized for pitching business proposals to executive decision-makers because the visualizations are very simple to achieve from raw data. 

 

visualization on sample data
Figure 1: A visualization on sample data 

 

A visualization on sample data 
Figure 2:  Query builder notebook

 

Major characteristics 

  • Metabase delivers a notebook that enables users to select data, join with other tables, filter, and other operations just by clicking on options instead of writing a SQL query 
  • In case of complex queries, a user can also use an in-built optimized SQL editor 
  • The choice to select from various data sources like PostgreSQL, MongoDB, Spark SQL, Druid, etc., makes Metabase flexible and adaptable 
  • Under the Metabase admin dashboard, users can troubleshoot the logs regarding different tasks and jobs 
  • Public sharing can be enabled, allowing admins to create publicly viewable links for Questions and Dashboards 

What Data Science Dojo has for you  

Metabase instance packaged by Data Science Dojo serves as an open-source easy-to-use web interface for data analytics without the burden of installation. It contains numerous pre-designed visualization categories waiting for data.

It has a query builder which is used to create questions (customized queries) with few clicks. In our service users can also use an in-browser SQL editor for performing complex queries. Any user who wants to identify the impact of their product from the raw business data can use this tool. 

Features included in this offer:  

  • A rich web interface running Metabase: Open Source 
  • A no-code query building notebook editor 
  • In-browser optimized SQL editor for complex queries 
  • Beautiful interactive visualizations 
  • Ability to create data models 
  • Email configuration and Slack support 
  • Shareability feature 
  • Easy specification for metrics and segments 
  • Feature to download query results in CSV, XLSX and JSON format 

Our instance supports the following major databases: 

  • Druid 
  • PostgreSQL 
  • MySQL 
  • SQL Server 
  • Amazon Redshift 
  • Big Query 
  • Snowflake 
  • Google Analytics 
  • H2 
  • MongoDB 
  • Presto 
  • Spark SQL 
  • SQLite 

Conclusion  

Metabase is business intelligence software that benefits marketing and product managers. By making it possible to share analytics with various teams within an enterprise, Metabase makes it simple for developers to create reports and collaborate on projects. Responsiveness and processing speed are better than in a traditional desktop environment because it uses Microsoft cloud services.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Metabase server dedicated specifically for Data Analytics operations on Azure Market Place. Hurry up and install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!  

Click on the button below to head over to the Azure Marketplace and deploy Metabase for FREE by clicking on “Get it now”. 

CTA - Try now

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

Countly: A support for real-time product analytics and tracking
Saad Shaikh
| October 26, 2022

Data Science Dojo is offering Countly for FREE on Azure Marketplace packaged with web accessible Countly Server. 

Purpose of product analytics  

Product analytics is a comprehensive collection of mechanisms for evaluating the performance of digital ventures created by product teams and managers. 

Businesses often need to measure the metrics and impact of their products, e.g., how the audience perceives a product: how many visitors read a particular page or click on a specific button. This gives insight into what future decisions need to be taken about the product: should it be modified, removed, or kept as it is? Countly makes this work easier by providing a centralized web analytics environment to track user engagement with a product and monitor its health.

 

Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to become an expert in data science and analytics.

Challenges for individuals  

Many platforms require developers to write code to visualize analytics, which is not only time-consuming but also comes at a cost. At the application level, an app crash leaves anyone in shock and is followed by the hectic, time-consuming task of determining the root cause of the problem. At the corporate level, current and past data needs to be analyzed appropriately for the future strength of the company, and that requires robust analysis that anyone can easily perform, which has been a challenge for many organizations.

Countly analytics 

Countly enables users to monitor and analyze the performance of their applications in real time, irrespective of the platform. It can compile data from numerous sources and present it in a manner that makes it easier for business analysts and managers to evaluate app usage and client behavior. It offers a customizable dashboard with the freedom to innovate and improve your products in order to meet important business and revenue objectives while also ensuring privacy by design. It is a world leader in product analytics, tracking more than 1.5 billion unique identities on more than 16,000 applications and more than 2,000 servers worldwide.

 

Analytics based technology - countly
Figure 1: Analytics based on type of technology

 

 

Analytics based on user activity - Countly
Figure 2: Analytics based on user activity

 

 

Figure 3: Analytics based on views - Countly
Figure 3: Analytics based on views

 

Major characteristics 

  • Interactive web interface: User-friendly web environment with customizable dashboards for easy accessibility along with pre-designed metrics and visualizations 
  • Platform-independent: Supports web analytics, mobile app analytics, and desktop application analytics for macOS and Windows 
  • Alerts and email reporting: Ability to receive alerts based on the metric changes and provides custom email reporting 
  • Users’ role and access manager: Provides global administrators the ability to manage users, groups, and their roles and permissions 
  • Logs Management: Maintains server and audit logs on the web server regarding user actions on data 

What Data Science Dojo has for you  

Countly Server packaged by Data Science Dojo provides a web analytics service that delivers insights about your product in real time, whether it’s a web application, a mobile app, or even a desktop application, without the burden of installation. It comes with numerous pre-configured metrics and visualization templates to import data and observe trends. It helps businesses identify application usage and determine the client response to their apps.

Features included in this offer:  

  • A VM configured with Countly Server: Community Edition accessible from a web browser 
  • Ability to track user analytics, user loyalty, session analytics, technology, and geo insights  
  • Easy-to-use customizable dashboard 
  • Logs manager 
  • Alerting and reporting feature 
  • User permissions and roles manager 
  • Built-in Countly DB viewer 
  • Cache management 
  • Flexibility to define data limits 

Conclusion  

Countly makes it feasible to analyze data in real time. It is highly extensible and has various features to manage operations like alerting, reporting, logging, and job management. Analytics throughput can be increased by using multiple cores on an Azure Virtual Machine. Countly can also handle applications on different platforms at once, which might slow down the server if you have thousands upon thousands of active client requests across applications; CPU and RAM usage may also be affected, but an Azure Virtual Machine takes care of all these problems.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Countly Server dedicated specifically for Data Analytics operations on Azure Market Place. Hurry up and install this offer by Data Science Dojo, your ideal companion in your journey to learn data science! 

Click on the button below to head over to the Azure Marketplace and deploy Countly for FREE by clicking on “Try now”. 

CTA - Try now

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

 

13 most common Data Analysts interview questions you must prepare for
Ayesha Saleem
| October 24, 2022

Get hired as a Data Analyst by confidently responding to the most commonly asked interview questions. No matter how qualified or experienced you are, stumbling over your thoughts while answering the interviewer might reduce your chances of getting on board.

 

data analyst interview question
Data analyst interview question – Data Science Dojo

In this blog, you will find the top data analyst interview questions, covering both technical and non-technical areas of expertise.

List of Data Analysts interview questions 

1. Share about your most successful/most challenging data analysis project? 

In this question, you can also share your strengths and weaknesses with the interviewer.   

When answering questions like these, data analysts must attempt to share both their strengths and weaknesses. How do you deal with challenges and how do you measure the success of a data project? You can discuss how you succeeded with your project and what made it successful.  

Take a look at the original job description to see if you can incorporate some of the requirements and skills listed. If you were asked the negative version of the question, be honest about what went wrong and what you would do differently in the future to fix the problem. Despite our human nature, mistakes are a part of life. What’s critical is your ability to learn from them. 

Further, talk about any SaaS platforms, programming languages, and libraries you used. Why did you choose them, and how did you use them to accomplish your goals?

Discuss the entire pipeline of your project, from collecting data to turning it into valuable insights. Describe the ETL pipeline, including data cleaning, data preprocessing, and exploratory data analysis. What did you learn, what issues did you encounter, and how did you deal with them?

Enroll in Data Science Bootcamp today to begin your journey

2. Tell us about the largest data set you’ve worked with? Or What type of data you have worked with in the past? 

What they’re really asking: Can you handle large data sets?  

Data sets of varying sizes and compositions are becoming increasingly common in many businesses. Answering questions about data size and variety requires a thorough understanding of the type of data and its nature. What data sets did you handle? What types of data were present? 

You don’t have to mention only datasets you worked with on the job. You can also talk about datasets of varying sizes, particularly large ones, that you worked with as part of a data analysis course, bootcamp, certificate program, or degree. As you put together a portfolio, you may also complete some independent projects where you find and analyze a data set. All of this is valid material to build your answer.

The more versatile your experience with datasets, the greater your chances of getting hired.

Read more about several types of datasets here:

32 datasets to uplift your skills in data science

 

3. What is your process for cleaning data? 

The expected answer to this question will include details about how you handle missing data, outliers, duplicate data, etc.

Data analysts are widely responsible for data preparation, also called data cleansing or data cleaning. Organizations expect data analysts to spend a significant amount of time preparing data. As you answer this question, explain to the employer in detail why data cleaning is so important.

In your answer, give a short description of what data cleaning is and why it’s important to the overall process. Then walk through the steps you typically take to clean a data set. 
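As one possible way to make that walkthrough concrete, here is a minimal pandas sketch of typical cleaning steps; the column names and values are made up for illustration:

import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, None],
    "signup_date": ["2022-01-05", "2022-01-05", "2022-02-05", "2022-03-01", "2022-03-02"],
    "revenue":     ["100", "100", "250", None, "75"],
})

clean = (
    raw.drop_duplicates()                  # remove exact duplicate rows
       .dropna(subset=["customer_id"])     # drop records missing the key field
       .assign(
           signup_date=lambda d: pd.to_datetime(d["signup_date"]),  # fix data types
           revenue=lambda d: pd.to_numeric(d["revenue"]),
       )
)
# Impute any remaining numeric gaps with the column median.
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())
print(clean)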

 

4. Name some data analytics software you are familiar with. OR What data software have you used in the past? OR What data analytics software are you trained in? 

What they need to know: Do you have basic competency with common tools? How much training will you need? 

Before you appear for the interview, it’s a good idea to look at the job listing to see what software is mentioned. As you answer this question, describe how you have used that software, or something similar, in the past. Show your knowledge of the tool by using its associated terminology.

Mention software solutions you have used across a variety of data analysis phases. You don’t need to provide a lengthy explanation; stating which data analytics tools you used and for what purpose will satisfy the interviewer.

  

5. What statistical methods have you used in data analysis? OR what is your knowledge of statistics? OR how have you used statistics in your work as a Data Analyst? 

What they’re really asking: Do you have basic statistical knowledge? 

Data analysts should have at least a rudimentary grasp of statistics and know how statistical analysis supports business goals. Organizations look for sound statistical knowledge in data analysts so they can handle complex projects. If you have used any statistical calculations in the past, be sure to mention them. If you haven’t yet, familiarize yourself with the following statistical concepts:

  • Mean 
  • Standard deviation 
  • Variance
  • Regression 
  • Sample size 
  • Descriptive and inferential statistics 

While speaking of these, share information that you can derive from them. What knowledge can you gain about your dataset? 
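If it helps to rehearse these, here is a small NumPy sketch that computes the listed statistics on a made-up sample (the sales and ad-spend figures are purely illustrative):

import numpy as np

# Toy sample: daily sales and ad spend for two weeks.
sales = np.array([12, 15, 14, 10, 18, 20, 17, 13, 16, 19, 11, 14, 15, 18])
ad_spend = np.array([3, 4, 4, 2, 6, 7, 5, 3, 5, 6, 2, 4, 4, 6])

print("sample size:", sales.size)
print("mean:", sales.mean())
print("sample standard deviation:", sales.std(ddof=1))
print("sample variance:", sales.var(ddof=1))

# Simple linear regression of sales on ad spend (slope and intercept).
slope, intercept = np.polyfit(ad_spend, sales, deg=1)
print(f"regression: sales ~ {slope:.2f} * ad_spend + {intercept:.2f}")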

Read these amazing 12 Data Analytics books to strengthen your knowledge

12 excellent Data Analytics books you should read in 2022

 

 

6. What scripting languages are you trained in? 

In order to be a data analyst, you will almost certainly need both SQL and a statistical programming language like R or Python. If you are already proficient in the programming language of your choice at the job interview, that’s fine. If not, you can demonstrate your enthusiasm for learning it.  

In addition to your current languages’ expertise, mention how you are developing your expertise in other languages. If there are any plans for completing a programming language course, highlight its details during the interview. 

To gain some extra points, do not hesitate to mention why and in which situations SQL is used, and when R and Python are used.

 

7. How can you handle missing values in a dataset? 

This is one of the most frequently asked data analyst interview questions, and the interviewer expects a detailed answer here, not just the names of the methods. There are four common methods to handle missing values in a dataset, and a short illustrative code sketch follows the list.

  • Listwise Deletion 

In the listwise deletion method, an entire record is excluded from analysis if any single value is missing. 

  • Average Imputation  

Take the average value of the other participants’ responses and fill in the missing value. 

  • Regression Substitution 

You can use multiple-regression analyses to estimate a missing value. 

  • Multiple Imputations 

It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by incorporating random errors in your predictions. 
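A minimal sketch of these approaches in Python, using pandas and scikit-learn on a made-up two-column dataset (IterativeImputer stands in for regression substitution here, and repeating it with different random seeds and averaging the results would approximate multiple imputation):

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"age": [25, None, 31, 40, None],
                   "income": [40, 52, None, 80, 61]})

# 1. Listwise deletion: drop any record with a missing value.
listwise = df.dropna()

# 2. Average imputation: fill each gap with the column mean.
averaged = df.fillna(df.mean())

# 3. Regression-style imputation: estimate each missing value from the other columns.
regressed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                         columns=df.columns)

print(listwise, averaged, regressed, sep="\n\n")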

 

8. What is Time Series analysis? 

Time series analysis deals with data points collected at regular intervals over time, and data analysts are often responsible for analyzing them. While answering this question, you should also talk about the correlation between observations that is evident in time-series data.
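A brief pandas sketch on synthetic daily data can illustrate the basic moves of time series analysis, such as resampling, smoothing, and checking autocorrelation (the data here is randomly generated for the example):

import numpy as np
import pandas as pd

# Synthetic daily metric indexed by date, e.g., tweets per day.
idx = pd.date_range("2022-01-01", periods=90, freq="D")
daily = pd.Series(np.random.default_rng(1).poisson(20, size=90), index=idx, name="tweets")

weekly_total = daily.resample("W").sum()      # aggregate to a coarser interval
rolling_avg = daily.rolling(window=7).mean()  # smooth out short-term noise
lag7_corr = daily.autocorr(lag=7)             # correlation with a 7-day lagged copy of itself

print(weekly_total.head())
print(rolling_avg.tail(3))
print(f"lag-7 autocorrelation: {lag7_corr:.2f}")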

Watch this short video to learn in detail:

 

9. What is the difference between data profiling and data mining?

Data profiling examines attributes such as data type, frequency, and length, along with their discrete values and value ranges, to provide valuable information about a dataset. It also assesses source data to understand its structure and quality through data collection and quality checks.

On the other hand, data mining is a type of analytical process that identifies meaningful trends and relationships in raw data. This is typically done to predict future data. 
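To make the contrast tangible, here is a tiny pandas sketch on made-up order data: the profiling half inspects structure and quality, while the mining half looks for relationships in the values themselves:

import pandas as pd

orders = pd.DataFrame({
    "order_value": [20, 35, 50, 80, 120, 65, 40],
    "items":       [1, 2, 2, 4, 6, 3, 2],
    "channel":     ["web", "web", "app", "app", "web", "app", "web"],
})

# Data profiling: inspect the structure and quality of the source data.
print(orders.dtypes)        # data types per column
print(orders.describe())    # value ranges and basic distribution
print(orders.isna().sum())  # completeness check

# Data mining (a tiny example): look for meaningful relationships in the raw data.
print(orders["order_value"].corr(orders["items"]))      # do larger baskets cost more?
print(orders.groupby("channel")["order_value"].mean())  # does channel relate to order value?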

 

10. Explain the difference between R-Squared and Adjusted R-Squared.

The most vital difference between adjusted R-squared and R-squared is that adjusted R-squared accounts for the number of independent variables tested against the model, while R-squared does not.

An R-squared value is an important statistic for comparing two variables. However, when examining the relationship between a single stock and the rest of the S&P500, it is important to use adjusted R-squared to determine any discrepancies in correlation. 
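A small Python sketch on synthetic data shows the difference in action: adding pure-noise predictors nudges R-squared upward, while adjusted R-squared, computed here with the usual 1 - (1 - R²)(n - 1)/(n - p - 1) correction, penalizes the extra variables (the data and model are illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X_useful = rng.normal(size=(n, 1))
y = 3 * X_useful[:, 0] + rng.normal(size=n)

def r2_and_adjusted(X, y):
    r2 = LinearRegression().fit(X, y).score(X, y)
    p = X.shape[1]  # number of independent variables
    adjusted = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return round(r2, 3), round(adjusted, 3)

# Append five columns of pure noise and compare.
X_noisy = np.hstack([X_useful, rng.normal(size=(n, 5))])
print("1 useful predictor:  ", r2_and_adjusted(X_useful, y))
print("plus 5 noise columns:", r2_and_adjusted(X_noisy, y))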

 

11. Explain univariate, bivariate, and multivariate analysis.

Univariate analysis, the simplest of the three, is used when the data set has only one variable and does not involve causes or relationships.

Bivariate analysis is used when the data set has two variables and researchers want to compare them or examine the relationship between them.

Multivariate analysis is the right statistical approach when the data set has three or more variables and researchers are investigating the relationships among them.

 

12. How would you go about measuring the business performance of our company, and what information do you think would be most important to consider?

Before appearing for an interview, make sure you study the company thoroughly and gain enough knowledge about it. It will leave an impression on the employer regarding your interest and enthusiasm to work with them. Also, in your answer, talk about the added value you will bring to the company by improving its business performance.

 

13. What do you think are the three best qualities that great data analysts share?

List some of the most critical qualities of a data analyst, such as problem-solving, research, and attention to detail. Apart from these, do not forget to mention the soft skills necessary to communicate with team members and across departments.

 

Did we miss any Data Analysts interview questions? 

Share with us in the comments below and help each other to ace the next data analyst job. 

  

Apache Superset : Empowering Business Intelligence
Saad Shaikh
| October 15, 2022

Data Science Dojo is offering Apache Superset for FREE on Azure Marketplace packaged with pre-installed SQL lab and interactive visualizations to get started. 

 

What is Business Intelligence?  

 

Business Intelligence (BI) is built on the idea of using information to drive action. It aims to give business leaders actionable insights through data processing and analytics. For instance, a business analyzes its KPIs (Key Performance Indicators) to identify its strengths and weaknesses, so decision-makers can determine in which departments the organization can work to increase efficiency.

Recently, two elements of BI have driven dramatic improvements in speed and efficiency. The two elements are:

 

  • Automation  
  • Data Visualization  

 

Apache Superset focuses largely on the latter, which has changed the course of business insights.

 

But what were the challenges faced by analysts before there were popular exploratory tools like Superset?  

 

Pro Tip: Join our 6-months instructor-led Data Science Bootcamp to master data science. 

 

Challenges of Data Analysts

 

Scalability, framework compatibility, and the absence of business-specific customization were a few of the challenges data analysts faced. Apart from that, exploring and visualizing petabytes of data could cause the system to collapse or hang at times.

In these circumstances, a tool that could query data according to business needs and visualize it in various diagrams and plots was required. Additionally, a system scalable and elastic enough to handle and explore large volumes of data would be the ideal solution.

 

Data Analytics with Superset  

 

Apache Superset is an open-source tool that equips you with a web-based environment for interactive data analytics, visualization, and exploration. It provides a vast collection of vibrant, interactive visualizations, charts, and tables. Layouts and dynamic dashboard elements can be customized, and quick filtering makes it flexible and user-friendly. Apache Superset is extremely beneficial for businesses and researchers who want to identify key trends and patterns in raw data to aid the decision-making process.

 

Sales analytics - Apache superset
Video Game Sales Analytics with different visualizations

 

 

It is also a SQL powerhouse: it not only allows connections to several databases but also provides an in-browser SQL editor called SQL Lab.

SQL lab - Apache superset
SQL Lab: an in-browser powerful SQL editor pre-configured for faster querying

 

Key attributes  

 

  • Superset delivers an interactive UI that enriches plots, charts, and other diagrams. You can customize your dashboard and canvas as required, and the hover feature and side-by-side layout keep everything coherent  
  • An open-source, easy-to-use tool with a no-code environment; drag-and-drop and one-click alterations make it even more user-friendly  
  • Contains a powerful built-in SQL editor to query data from any connected database quickly  
  • The choice of various databases like Druid, Hive, MySQL, SparkSQL, etc., and the ability to connect additional databases, make Superset flexible and adaptable  
  • Built-in functionality to create alerts and notifications triggered by specific conditions on a particular schedule  
  • A section for managing users, roles, and permissions, plus a tab for logging ongoing events  

 

What does Data Science Dojo have for you  

 

The Superset instance packaged by Data Science Dojo serves as a web-accessible, no-code environment with a broad range of analysis capabilities, without the burden of installation. It includes many sample charts and datasets to get started, and users can customize dashboards and canvases to their business needs.

It supports drag and drop, which makes it user-friendly and easy to use, and users can create different visualizations to detect key trends in any volume of data.  

 

What is included in this offer:  

 

  • A VM configured with a web-accessible Superset application  
  • Many sample charts and datasets to get started  
  • In-browser optimized SQL editor called SQL Lab  
  • User access and roles manager  
  • Alert and report feature  
  • Drag-and-drop support  
  • Built-in event logging  

 

Our instance supports the following major databases:  

 

  • Druid  
  • Hive  
  • SparkSQL  
  • MySQL  
  • PostgreSQL  
  • Presto  
  • Oracle  
  • SQLite  
  • Trino  
  • Apart from these, any data engine that has a Python DB-API driver and a SQLAlchemy dialect can be connected  

 

Conclusion  

 

The resources required to explore and visualize large volumes of data were one area of concern when working in traditional desktop environments. The other was ad-hoc SQL querying of data across different database connections. With our Superset instance, both concerns are put to rest.

Coupled with the processing power of Microsoft's cloud services, it outperforms its traditional counterparts, since data-intensive computations are performed in the cloud rather than locally. It also has a lightweight semantic layer and a cloud-native architecture.  

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Superset instance dedicated specifically to Data Science & Analytics on Azure Marketplace. Now hurry up and avail this offer by Data Science Dojo, your ideal companion in your journey to learn data science!  

 

Click on the button below to head over to the Azure Marketplace and deploy Apache Superset for FREE by clicking on “Get it now”. 

 

Superset

 

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

 

 

 

 

 

 

 

6 marketing analytics features to drive greater revenue
Gibran Saleem
| September 23, 2022

Marketing analytics tells you about the most profitable marketing activities of your business. The more effectively you target the right people with the right approach, the greater value you generate for your business.

However, it is not always clear which of your marketing activities are effective at bringing value to your business. This is where marketing analytics comes in: it guides you to use data to evaluate your marketing campaigns and identify which activities are effective at engaging your audience, improving user experience, and driving conversions.

The same data-driven mindset applies to competitive research. If you sell on a marketplace, for example, running an Amazon seller competitor analysis with a consistent framework for monitoring competitors' efforts helps you respond to them on their own terms.

Grow your business with Data Science Dojo 

 

Marketing analytics
6 marketing analytics features by Data Science Dojo

Data-driven marketing is imperative for optimizing your campaigns to generate net positive value from all your marketing activities in real time. Without analyzing your marketing data and customer journey, you cannot identify what you are doing right and what you are doing wrong when engaging with potential customers. The 6 features listed below can give you the start you need to begin analyzing and optimizing your marketing strategy using marketing analytics. 

 Learn about marketing analytics tools in this blog

1. Impressions 

In digital marketing, impressions are the number of times any piece of your content has been shown on a person's screen. The content can be an ad, a social media post, a video, etc. However, it is important to remember that impressions are not views. A view is an engagement: somebody actually watches your video. An impression, on the other hand, is counted any time your video appears on their screen, for example in the recommended videos on YouTube or in their Facebook newsfeed, regardless of whether they watch it. 

Learn more about impressions in this video

 

It is also important to distinguish between impressions and reach. Reach is the number of unique viewers, so for example if the same person views your ad three times, you will have three impressions but a reach of one.  

Impressions and reach are important in understanding how effective your content was at gaining traction. However, these metrics alone are not enough to gauge how effective your digital marketing efforts have been: neither impressions nor reach tells you how many people engaged with your content. So, tracking impressions is important, but it does not tell you whether you are reaching the right audience.  
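
As a rough illustration, the short R sketch below (with made-up data and a hypothetical user_id column) computes impressions and reach from a log of content views, where each row is one time the content appeared on someone's screen.

library(dplyr)

# hypothetical view log: one row per time the content appeared on a screen
view_log <- tibble::tibble(user_id = c("u1", "u1", "u1", "u2", "u3", "u3"))

impressions <- nrow(view_log)                 # every appearance counts: 6
reach       <- n_distinct(view_log$user_id)   # unique viewers only: 3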

 

2. Engagement rate 

In social media marketing, engagement rate is an important metric. Engagement is when a user comments, likes, clicks, or otherwise interacts with any of your content. Engagement rate is a metric that measures the amount of engagement of your marketing campaign relative to each of the following: 

  • Reach 
  • Post 
  • Impressions  
  • Days
  • Views 

Engagement rate by reach is the percentage of people who chose to interact with the content after seeing it; it is calculated as total engagements divided by reach, multiplied by 100. Reach is a more accurate base than follower count, because not all of your brand's followers may see the content, while people who do not follow your brand may still be exposed to it. 

Engagement rate by post is the rate at which followers engage with the content. This metric shows how engaged your followers are with your content. However, it does not account for organic reach, and as your follower count goes up, your engagement rate by post tends to go down. 

Engagement rate by impressions is the rate of engagement relative to the number of impressions. If you are running paid ads for your brand, engagement rate by impressions can be used to gauge your ads' effectiveness.  

Average Daily engagement rate tells you how much your followers are engaging with your content daily. This is suitable for specific use cases for instance, when you want to know how much your followers are commenting on your posts or other content. 

Engagement rate by views gives the percentage of people who chose to engage with your video after watching it. This metric, however, does not use unique views, so it may double or triple count views from a single user. 
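
As a rough sketch, the different engagement-rate variants for a single post could be computed in R as below. The numbers are hypothetical, and these are generic formulas rather than the exact definitions used by any particular platform.

engagements <- 120    # likes + comments + clicks + shares (hypothetical)
reach       <- 2000   # unique viewers
followers   <- 5000   # account followers
impressions <- 3500   # total times the post was shown
views       <- 1800   # video views (not unique)

engagement_rate_by_reach       <- engagements / reach       * 100  # 6%
engagement_rate_by_post        <- engagements / followers   * 100  # 2.4%
engagement_rate_by_impressions <- engagements / impressions * 100  # ~3.4%
engagement_rate_by_views       <- engagements / views       * 100  # ~6.7%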

Learn more about engagement rate in this video

 

3. Sessions 

Sessions are another especially important metric in marketing campaigns that helps you analyze engagement on your website. A session is a set of activities by a user within a certain period. For example, a user spends 10 minutes on your website loading pages, interacting with your content, and completing an action; all of these activities are recorded in the same 10-minute session.  

In Google Analytics, you can use sessions to check how much time a user spent on your website (session length), how many times they returned to your website (number of sessions), and what interactions users had with your website. Tracking sessions can help you determine how effective your campaigns were in directing traffic towards your website. 

If you have an e-commerce website, another very helpful tool in Google Analytics is behavioral analytics, which shows what key actions are driving purchases on your website. The sessions report can be accessed under the Conversions tab in Google Analytics and can help you understand user behaviors such as abandoned carts, so you can target those users with ads or offer incentives to complete their purchase. 

Learn more about sessions in this video

 

4. Conversion rate 

Once you have engaged your audience, the next step in the customer's journey is conversion. A conversion is when a customer or user completes a specific desired action, anything from submitting a form to purchasing a product or subscribing to a service. The conversion rate is the percentage of visitors who completed the desired action.

So, if you have a form on your website and you want to find its conversion rate, you would simply divide the number of form submissions by the number of visitors to that form's page (total conversions / total interactions). 
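
For instance, a minimal R calculation of the form example above, with hypothetical numbers:

form_submissions <- 45     # total conversions (hypothetical)
page_visitors    <- 1500   # total interactions with the form's page (hypothetical)

conversion_rate <- form_submissions / page_visitors * 100  # 3%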

 

Conversion rate is a very important metric that helps you assess the quality of your leads. While you may generate a large number of leads or visitors, if you cannot get them to perform the desired action, you may be targeting the wrong audience. Conversion rate can also help you gauge how effective your conversion strategy is: if you aren't converting visitors, it might indicate that your campaign needs optimization. 

 

5. Attribution  

Attribution is a sophisticated model that helps you measure which channels are generating the most sales opportunities or conversions. It helps you assign credit to specific touchpoints on the customer's journey and understand which touchpoints are driving conversions the most. But how do you know which touchpoint to attribute a specific conversion to? Well, that depends on which attribution model you are using. There are four common attribution models. 

First touch attribution models assign all the credit to the first touchpoint that drove the prospect to your website. This model focuses on the top of the marketing funnel and tells you what is attracting people to your brand. 

Last touch attribution models assign credit to the last touchpoint. It focuses on the last touchpoint the visitor interacted with before they converted. 

Linear attribution model assigns an equal weight to all the touchpoints in the buyer’s journey. 

Time decay attribution is based on how close the touchpoint is to the conversion, where a weighted percentage is assigned to the most recent touchpoints. This can be used when the buying cycle is relatively short. 

Which model you use depends on what product or subscription you are selling and the length of your buying cycle. While attribution is very important for identifying the effectiveness of your channels, to get the complete picture you need to look at how each touchpoint drives conversions. 
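
The R sketch below shows how the four models assign credit differently across an invented four-touchpoint journey; the touchpoints and decay weights are purely illustrative, not a standard taken from any specific tool.

touchpoints <- c("social ad", "blog post", "email", "retargeting ad")  # hypothetical journey, in order
n <- length(touchpoints)

first_touch <- c(1, rep(0, n - 1))             # all credit to the first touchpoint
last_touch  <- c(rep(0, n - 1), 1)             # all credit to the last touchpoint
linear      <- rep(1 / n, n)                   # equal credit to every touchpoint
time_decay  <- 2^(seq_len(n) - n)              # weights double as touchpoints get closer to conversion
time_decay  <- time_decay / sum(time_decay)    # normalize so the weights sum to 1

data.frame(touchpoint = touchpoints, first_touch, last_touch, linear, time_decay)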

 Learn more about attribution in this video

 

6. Customer lifetime value 

Businesses prefer retaining customers over acquiring new ones, and one of the main reasons is that attracting new customers has a cost. The customer acquisition cost (CAC) is the total cost your business incurs to acquire a customer; it is calculated by dividing your marketing and sales cost by the number of new customers. 

Learn more about CLV in this video

 

So, as a business, you must weigh the value of each customer against the associated acquisition cost. This is where the customer lifetime value, or CLV, comes in. The customer lifetime value is the total value of a customer to your business over the period of your relationship.

CLV also helps you forecast revenue: the larger your average CLV, the better your forecasted revenue will be. A simple way to estimate CLV is to multiply the average annual revenue generated per customer by the average retention period (in years). If your CAC is higher than your CLV, then on average you are losing money on every customer you acquire.
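
A back-of-the-envelope check in R, with hypothetical numbers, of how the two formulas fit together:

marketing_and_sales_cost <- 50000   # total spend over the period (hypothetical)
new_customers            <- 400
cac <- marketing_and_sales_cost / new_customers                 # 125 per customer

avg_annual_revenue_per_customer <- 60
avg_retention_years             <- 3
clv <- avg_annual_revenue_per_customer * avg_retention_years    # 180

clv > cac   # TRUE here: on average each customer is worth more than they cost to acquire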

A CAC that exceeds your CLV is a huge problem. Metrics like CAC and CLV are very important for driving revenue: they help you identify high-value and low-value customers so you can understand how to serve them better, make more informed decisions about your marketing efforts, and build a healthy customer base. 

 

 Integrate marketing analytics into your business 

Marketing analytics is a vast field. There is no one method that suits the needs of all businesses. Using data to analyze and drive your marketing and sales is a continuous process that you will find yourself constantly improving upon. Furthermore, finding the right metrics to track, ones that have a genuine impact on your business activities, is a difficult task.

So, this list is by no means exhaustive; however, the features listed here can give you the start you need to analyze and understand which actions are important in driving engagement, conversions, and, eventually, value for your business.  

 

Business Analytics vs Data Science – Pick and choose your career path
Afsah Ur Rehman
| September 19, 2022

Data is growing at an exponential rate in the world. It is estimated that the world will generate 181 zettabytes of data by 2025. With this increase, we are also seeing an increase in demand for data-driven techniques and strategies.

According to Forbes, 95% of businesses cite the need to manage unstructured data as a problem for their business. In fact, Business Analytics vs Data Science is one of the hottest debates among data professionals nowadays.

Many people might wonder: what is the difference between Business Analytics and Data Science? Or which one should they choose as a career path? If you are one of them, keep reading to learn more about both of these fields!

Business analytics - Data science
Team working on Business Analytics

First, we need to understand what both these fields are. Let’s take a look. 

What is Business Analytics? 

Business Analytics is the process of collecting and analyzing business data to derive insights that inform business decisions. It helps in optimizing processes and improving productivity.

It also helps in identifying potential risks, opportunities, and threats. Business Analytics is an important part of any organization’s decision-making process. It is a combination of different analytical activities like data exploration, data visualization, data transformation, data modeling, and model validation. All of this is done by using various tools and techniques like R programming, machine learning, artificial intelligence, data mining, etc.

Business analytics is a very diverse field that can be used in every industry. It can be used in areas like marketing, sales, supply chain, operations, finance, technology and many more. 

Now that we have a good understanding of what Business Analytics is, let’s move on to Data Science. 

What is Data Science? 

Data science is the process of discovering new information, knowledge, and insights from data. Data scientists apply different machine-learning algorithms to any form of data, from numbers and text to images, videos, and audio, to draw understanding from it. Data science is all about exploring data to identify hidden patterns and make decisions based on them.

It involves implementing the right analytical techniques and tools to transform the data into something meaningful. It is not just about storing data in the database or creating reports about the same. Data scientists collect and clean the data, apply machine learning algorithms, create visualizations, and use data-driven decision-making tools to create an impact on the organization.

Data scientists use tools like programming languages, database management, artificial intelligence, and machine learning to clean, visualize, and explore the data.

Pro tip: Learn more about Data Science for business 

What is the difference between Business Analytics and Data Science? 

Technically, Business analytics is a subset of Data Science. But the two terms are often used interchangeably because of the lack of a clear understanding among people. Let’s discuss the key differences between Business Analytics and Data Science. Business Analytics focuses on creating insights from existing data for making better business decisions.

Data Science, on the other hand, focuses on creating insights from new data by applying the right analytical techniques. Business Analytics is a more established field that combines several analytical activities like data transformation, modeling, and validation, while Data Science is a relatively new field that is evolving every day. Business Analytics takes a more hands-on approach to managing and interpreting existing data, whereas Data Science is more focused on building models and extracting new insights from the data.

Both fields also differ a bit in their required skills. Business Analysts mostly use interpretation, data visualization, analytical reasoning, statistics, and written communication skills to interpret and communicate their work, whereas Data Scientists rely on statistical analysis, programming skills, machine learning, calculus and algebra, and data visualization to perform most of their work.

Which should one choose? 

Business analytics is a well-established field, whereas data science is still evolving. If you are inclined towards decisive and logical skills with little or no programming knowledge or computer science background, you can take up Business Analytics. It is a beginner-friendly domain and is easy to catch on to.

But if you are interested in programming and are familiar with machine learning algorithms or even interested in data analysis, you can opt for Data Science. We hope this blog answers your questions about the differences between the two similar and somewhat overlapping fields and helps you make the right data-driven and informed decision for yourself! 

 

4 event metrics that matter to organize a successful event
Fatima Rafique
| September 14, 2022

Looking at the right event metrics not only helps us in gauging the success of the current event but also facilitates understanding the audience’s behavior and preferences for future events.   

Creating, managing, and organizing an event seems like a lot of work and surely it is. The job of an event manager is no doubt a hectic one, and the job doesn’t end once the event is complete. After every event, analyzing it is a crucial task to continuously improve and enhance the experience for your audience and presenters.

In a world completely driven by data, if you are not measuring your events, you are surely missing out on a lot. The questions arise about how to get started and what metrics to look for. The post-Covid world has adopted the culture of virtual events which not only allows the organizers to gather audiences globally but also makes it easier for them to measure it.

There are several platforms and tools available for collecting the data, or if you are hosting it through social media then you can easily use the analytics tool of that channel. You can view our Marketing Analytics videos to better understand the analytical tools and features of each platform. 

event metrics
Successful event metrics

You can take the assistance of tools and platforms to collect the data but utilizing that data to come up with insightful findings and patterns is a critical task. You need to hear the story your data is trying to tell and understand the patterns in your events.  

Event metrics that you should look at 

1. RSVP to attendance rate 

RSVP is the number of people who sign up for your event (through landing pages or social sites), while the attendance rate is the proportion of those sign-ups who actually show up.

Attendance rate

You should expect at least 30% of your RSVPs to actually attend; if they don't, something is wrong. Possible reasons include: 

  • The procedure for joining the event is not provided or clarified 
  • They forgot about the event as they signed up long before 
  • The information provided regarding the event day or date is wrong  

Or any of several other likely reasons. You need to dig into each channel to find the cause, because when a person signs up, it shows clear intent to attend.  
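
A quick R example with made-up numbers shows how the rate is computed and compared against that rough 30% benchmark:

rsvps     <- 250   # sign-ups (hypothetical)
attendees <- 80    # people who actually showed up (hypothetical)

attendance_rate <- attendees / rsvps * 100   # 32%, just above the ~30% rule of thumb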

2. Retention rate 

A few channels, such as LinkedIn and YouTube, have built-in analytics to gauge retention rate, but you can always integrate third-party tools for other platforms. The retention rate shows how long your audience stayed in your webinar and the points where they dropped off.

It is usually shown as a line graph with the duration of the webinar on the x-axis and the number of viewers on the y-axis, so you can see how many people were still watching at any given point in the webinar and spot where views drop or rise.

Retention rate
Graph representing retention rate  
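
If your platform only exports raw numbers, a retention curve like the one described above can be sketched in a few lines of R (ggplot2), using made-up data:

library(ggplot2)

retention <- data.frame(
  minute  = seq(0, 60, by = 5),                                     # minutes into the webinar
  viewers = c(120, 95, 90, 88, 85, 83, 80, 78, 75, 70, 65, 55, 40)  # viewers still watching (hypothetical)
)

ggplot(retention, aes(x = minute, y = viewers)) +
  geom_line() +
  labs(x = "Minutes into webinar", y = "Viewers still watching",
       title = "Webinar retention curve (illustrative)")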

 

Use-case
For instance, at Data Science Dojo our webinars experienced a huge drop in audience during the initial 5 minutes. It was worrisome for the team, so we dug into it and conducted a critical analysis of our webinars. We realized this was happening because we usually spent the first 5 minutes waiting for the audience to join in, but that is exactly when the audience that had already joined started leaving.  

We decided to fill those 5 minutes with engaging activities, such as a poll, and initiated conversations with our audience directly through chat. This improved our overall retention: the audience felt more connected and stayed longer. You can explore our webinars here. 

3. Demographics of audience 

It is important to know where your audience comes from. To make more targeted decisions in the future, every business must understand its audience demographics and what type of people find its events beneficial.  

Understanding the demographics helps with future events. For example, you can select a time that works in your audience's time zone, and you can choose topics they are more interested in.  

Demographic data
Statistics showing demographic data

Demographic data opens many new avenues for your business: it introduces you to audience segments you might not be targeting yet, so you can expand. It shows the industries, locations, seniority levels, and many other crucial attributes of your audience.  

By analyzing this data, you can also see whether your content is attracting the right target audience, and if not, what kind of audience you are pulling in and whether that is beneficial for your business.  

4. Engagement rate 

Your event might receive a large number of views, but if that audience is not engaging with your content, you should be concerned. The engagement rate shows how involved your audience is. Today's audiences have many distractions, especially in online events, so grabbing your audience's attention and keeping them involved is a major task.  

Engagement rate
Audience engagement shown by chat messages

The more engaged the audience is, the higher the chance that they will benefit from it and come back to you for other services. There are several techniques to keep your audience engaged; you can look up a few engagement activities to build connections. 

 Make your event a success with event metrics

On that note, if you have just hosted an event or have an event on your calendar, you know what you need to look at. These metrics will help you continuously improve your event’s quality to match the audience’s expectations and requirements. Planning your strategies based on data will help you stay relevant to your audience and trends.    

Exploratory data analysis in R | Spooky author identification
Pier Lorenzo Paracchini
| October 31, 2017

This blog is based on some exploratory data analysis performed on the corpora provided for the “Spooky Author Identification” challenge at Kaggle.

The Spooky challenge

A Halloween-themed challenge [1] with the following goal: use data analysis to predict which of three authors (Edgar Allan Poe, HP Lovecraft, or Mary Wollstonecraft Shelley) wrote a given sentence of a possibly spooky story.

“Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before.” Edgar Allan Poe

“That is not dead which can eternal lie, And with strange eons, even death may die.” HP Lovecraft

“Life and death appeared to me ideal bounds, which I should first break through, and pour a torrent of light into our dark world.” Mary Wollstonecraft Shelley

The toolset for data analysis

The only tools available to us during this exploration will be our intuition, curiosity, and the selected packages for data analysis. Specifically:

  • tidytext package, text mining for word processing, and sentiment analysis using tidy tools
  • tidyverse package, an opinionated collection of R packages designed for data science
  • wordcloud package, pretty word clouds
  • gridExtra package, supporting functions to work with grid graphics
  • caret package, supporting function for performing stratified random sampling
  • corrplot package, a graphical display of a correlation matrix
# Required libraries
# if packages are not installed:
# install.packages("packageName")
library(tidytext)
library(tidyverse)
library(gridExtra)
library(wordcloud)
library(dplyr)
library(corrplot)
library(RColorBrewer)   # provides brewer.pal(), used below for the word cloud palettes

The beginning of the data analysis journey: The Spooky data

We are given a CSV file, the train.csv, containing some information about the authors. The information consists of a set of sentences written by different authors (EAP, HPL, MWS). Each entry (line) in the file is an observation providing the following information:

  • id, a unique id for the excerpt/sentence (as a string)
  • text, the excerpt/sentence (as a string)
  • author, the author of the excerpt/sentence (as a string), a categorical feature that can assume three possible values: EAP for Edgar Allan Poe, HPL for HP Lovecraft, and MWS for Mary Wollstonecraft Shelley

 # loading the data using readr package

  spooky_data <- readr::read_csv(file = "./../../../data/train.csv",

                    col_types = "ccc",

                    locale = locale("en"),

                    na = c("", "NA"))


  # readr::read_csv does not transform string into factor

  # as the "author" feature is categorical by nature

  # it is transformed into a factor

  spooky_data$author <- as.factor(spooky_data$author)

The overall data includes 19579 observations with 3 features (id, text, author). Specifically, 7900 excerpts (40.35 %) of Edgar Allan Poe, 5635 excerpts (28.78 %) of HP Lovecraft, and 6044 excerpts (30.87 %) of Mary Wollstonecraft Shelley.
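
The per-author counts and percentages reported here can be reproduced with a short dplyr pipeline (the original post uses the small helper functions listed in the appendix for the same purpose):

spooky_data %>%
  dplyr::count(author) %>%
  dplyr::mutate(percentage = round(100 * n / sum(n), digits = 2))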

Read about Data Normalization in predictive modeling before analytics in this blog

Avoid the madness!

It is forbidden to use all of the provided spooky data for finding our way through the unique spookiness of each author.

We still want to evaluate how our intuition generalizes on an unseen excerpt/sentence, right?

For this reason, the given training data is split into two parts (using stratified random sampling)

  • an actual training dataset (70% of the excerpts/sentences), used for
    • exploration and insight creation, and
    • training the classification model
  • test dataset (the remaining 30% of the excerpts/sentences), used for
    • evaluation of the accuracy of our model.
# setting the seed for reproducibility

  set.seed(19711004)

  trainIndex <- caret::createDataPartition(spooky_data$author, p = 0.7, list = FALSE, times = 1)

  spooky_training <- spooky_data[trainIndex,]

  spooky_testing <- spooky_data[-trainIndex,]

The training set specifically contains 5530 excerpts (40.35 %) of Edgar Allan Poe, 3945 excerpts (28.78 %) of HP Lovecraft, and 4231 excerpts (30.87 %) of Mary Wollstonecraft Shelley.

Moving our first steps: from darkness into the light

Before we start building any model, we need to understand the data, build intuitions about the information contained in the data, and identify a way to use those intuitions to build a great predictive model.

Is the provided data usable?
Question: Does each observation have an id? An excerpt/sentence associated with it? An author?

missingValueSummary <- colSums(is.na(spooky_training))

As we can see from the column-wise summary of missing values, there are no missing values in the dataset.

Some initial facts about the excerpts/sentences

Below we can see, as an example, some of the observations (and excerpts/sentences) available in our dataset.

Sample observations
Example excerpts/sentences by EAP

Question: How many excerpts/sentences are available per author?

 no_excerpts_by_author <- spooky_training %>%

  dplyr::group_by(author) %>%

  dplyr::summarise(n = n())

ggplot(data = no_excerpts_by_author,

          mapping = aes(x = author, y = n, fill = author)) +

     geom_col(show.legend = F) +

     ylab(label = "number of excerpts") +

     theme_dark(base_size = 10)
Excerpt graph
Number of excerpts mapped against author name

Question: How long (# of chars) are the excerpts/sentences per author?

spooky_training$len <- nchar(spooky_training$text)

ggplot(data = spooky_training, mapping = aes(x = len, fill = author)) +

  geom_histogram(binwidth = 50) +

  facet_grid(. ~ author) +

  xlab("# of chars") +

  theme_dark(base_size = 10)
Count graph
Count and number of characters graph
ggplot(data = spooky_training, mapping = aes(x = 1, y = len)) +

  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +

  facet_grid(. ~ author) +

  xlab(NULL) +

  ylab("# of chars") +

  theme_dark(base_size = 10)
characters graph
Number of characters

There are some excerpts that are very long. As we can see from the boxplot above, there are a few outliers for each author; a possible explanation is that the sentence segmentation has a few hiccups (see details below):


For example, Mary Wollstonecraft Shelley (MWS) has an excerpt of around 4600 characters:

“Diotima approached the fountain seated herself on a mossy mound near it and her disciples placed themselves on the grass near her Without noticing me who sat close under her she continued her discourse addressing as it happened one or other of her listeners but before I attempt to repeat her words I will describe the chief of these whom she appeared to wish principally to impress One was a woman of about years of age in the full enjoyment of the most exquisite beauty her golden hair floated in ringlets on her shoulders her hazle eyes were shaded by heavy lids and her mouth the lips apart seemed to breathe sensibility But she appeared thoughtful unhappy her cheek was pale she seemed as if accustomed to suffer and as if the lessons she now heard were the only words of wisdom to which she had ever listened The youth beside her had a far different aspect his form was emaciated nearly to a shadow his features were handsome but thin worn his eyes glistened as if animating the visage of decay his forehead was expansive but there was a doubt perplexity in his looks that seemed to say that although he had sought wisdom he had got entangled in some mysterious mazes from which he in vain endeavoured to extricate himself As Diotima spoke his colour went came with quick changes the flexible muscles of his countenance shewed every impression that his mind received he seemed one who in life had studied hard but whose feeble frame sunk beneath the weight of the mere exertion of life the spark of intelligence burned with uncommon strength within him but that of life seemed ever on the eve of fading At present I shall not describe any other of this groupe but with deep attention try to recall in my memory some of the words of Diotima they were words of fire but their path is faintly marked on my recollection It requires a just hand, said she continuing her discourse, to weigh divide the good from evil On the earth they are inextricably entangled and if you would cast away what there appears an evil a multitude of beneficial causes or effects cling to it mock your labour When I was on earth and have walked in a solitary country during the silence of night have beheld the multitude of stars, the soft radiance of the moon reflected on the sea, which was studded by lovely islands When I have felt the soft breeze steal across my cheek as the words of love it has soothed cherished me then my mind seemed almost to quit the body that confined it to the earth with a quick mental sense to mingle with the scene that I hardly saw I felt Then I have exclaimed, oh world how beautiful thou art Oh brightest universe behold thy worshiper spirit of beauty of sympathy which pervades all things, now lifts my soul as with wings, how have you animated the light the breezes Deep inexplicable spirit give me words to express my adoration; my mind is hurried away but with language I cannot tell how I feel thy loveliness Silence or the song of the nightingale the momentary apparition of some bird that flies quietly past all seems animated with thee more than all the deep sky studded with worlds” If the winds roared tore the sea and the dreadful lightnings seemed falling around me still love was mingled with the sacred terror I felt; the majesty of loveliness was deeply impressed on me So also I have felt when I have seen a lovely countenance or heard solemn music or the eloquence of divine wisdom flowing from the lips of one of its worshippers a lovely animal or even the graceful undulations of trees inanimate objects have excited 
in me the same deep feeling of love beauty; a feeling which while it made me alive eager to seek the cause animator of the scene, yet satisfied me by its very depth as if I had already found the solution to my enquires sic as if in feeling myself a part of the great whole I had found the truth secret of the universe But when retired in my cell I have studied contemplated the various motions and actions in the world the weight of evil has confounded me If I thought of the creation I saw an eternal chain of evil linked one to the other from the great whale who in the sea swallows destroys multitudes the smaller fish that live on him also torment him to madness to the cat whose pleasure it is to torment her prey I saw the whole creation filled with pain each creature seems to exist through the misery of another death havoc is the watchword of the animated world And Man also even in Athens the most civilized spot on the earth what a multitude of mean passions envy, malice a restless desire to depreciate all that was great and good did I see And in the dominions of the great being I saw man reduced?”

Thinking Point: “What do we want to do with those excerpts/outliers?”

Some more facts about the excerpts/sentences using the bag-of-words

The data is transformed into a tidy format (unigrams only) in order to use the tidy tools to perform some basic and essential NLP operations.

spooky_trainining_tidy_1n <- spooky_training %>%

  select(id, text, author) %>%

  tidytext::unnest_tokens(output = word,

                      input = text,

                      token = "words",

                      to_lower = TRUE)

Each sentence is tokenized into words (normalized to lower case, with punctuation removed). The example below shows how the data (each excerpt/sentence) looked before and after the transformation.

Example excerpt before and after tokenization

Question: Which are the most common words used by each author?

Let's start by counting how many times each word has been used by each author and plotting the results.

words_author_1 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,

                                     author = "EAP",

                                     greater.than = 500)


words_author_2 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,

                                     author = "HPL",

                                     greater.than = 500)

words_author_3 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,

                                     author = "MWS",

                                     greater.than = 500)


gridExtra::grid.arrange(words_author_1, words_author_2, words_author_3, nrow = 1)
common words graph
Most common words used by each author

From this initial visualization we can see that the authors quite often use the same set of words – like the, and, of. These words do not give any real information about the vocabulary used by each author; they are common words that represent just noise when working with unigrams, and they are usually called stopwords.

If the stopwords are removed, using the list of stopwords provided by the tidytext package, it is possible to see that the authors do actually use different words more frequently than others (and it differs from author to author, the author vocabulary footprint).

words_author_1 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,

                                     author = "EAP",

                                     greater.than = 70,

                                     remove.stopwords = T)


words_author_2 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,

                                     author = "HPL",

                                     greater.than = 70,

                                     remove.stopwords = T)


words_author_3 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,

                                     author = "MWS",

                                     greater.than = 70,

                                     remove.stopwords = T)


gridExtra::grid.arrange(words_author_1, words_author_2, words_author_3, nrow = 1)
Data analysis graph
Most common words used comparison between EAP, HPL, and MWS

Another way to visualize the most frequent words by author is to use wordclouds. Wordclouds make it easy to spot differences: the importance of each word is reflected in its font size and color.

par(mfrow = c(1,3), mar = c(0,0,0,0))

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,

                       author = "EAP",

                       remove.stopwords = TRUE)

mypal <- brewer.pal(8,"Spectral")

wordcloud(words = c("EAP", words_author$word),

      freq = c(max(words_author$n) + 100, words_author$n),

      colors = mypal,

      scale=c(7,.5),

      rot.per=.15,

      max.words = 100,

      random.order = F)

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,

                       author = "HPL",

                       remove.stopwords = TRUE)

mypal <- brewer.pal(8,"Spectral")

wordcloud(words = c("HPL", words_author$word),

      freq = c(max(words_author$n) + 100, words_author$n),

      colors = mypal,

      scale=c(7,.5),

      rot.per=.15,

      max.words = 100,

      random.order = F)

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,

                       author = "MWS",

                       remove.stopwords = TRUE)

mypal <- brewer.pal(8,"Spectral")

wordcloud(words = c("MWS", words_author$word),

      freq = c(max(words_author$n) + 100, words_author$n),

      colors = mypal,

      scale=c(7,.5),

      rot.per=.15,

      max.words = 100,

      random.order = F)
Most common words
Most common words used by authors

From the word clouds, we can infer that EAP loves to use the words time, found, eyes, length, day, etc.; HPL loves to use the words night, time, found, house, etc.; and MWS loves to use the words life, time, love, eyes, etc.

A comparison cloud can be used to compare the different authors. From the R documentation

‘Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents (∑_i p_{i,j} / ndocs). The size of each word is mapped to its maximum deviation (max_i(p_{i,j} - p_j)), and its angular position is determined by the document where that maximum occurs.’

See below the comparison cloud between all authors:

comparison_data <- spooky_trainining_tidy_1n %>%

     dplyr::select(author, word) %>%

  dplyr::anti_join(stop_words) %>%

  dplyr::count(author,word, sort = TRUE)


comparison_data %>%

 reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%

 comparison.cloud(colors = c("red", "violetred4", "rosybrown1"),

               random.order = F,

               scale=c(7,.5),

               rot.per = .15,

               max.words = 200) 
Comparison cloud
Comparison cloud between authors

Below are the comparison clouds between pairs of authors, two at a time.

par(mfrow = c(1,3), mar = c(0,0,0,0))

comparison_EAP_MWS <- comparison_data %>%

 dplyr::filter(author == "EAP" | author == "MWS")

comparison_EAP_MWS %>%

 reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%

 comparison.cloud(colors = c("red", "rosybrown1"),

               random.order = F,

               scale=c(3,.2),

               rot.per = .15,

               max.words = 100)

comparison_HPL_MWS <- comparison_data %>%

dplyr::filter(author == "HPL" | author == "MWS")

comparison_HPL_MWS %>%

 reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%

comparison.cloud(colors = c("violetred4", "rosybrown1"),

               random.order = F,

               scale=c(3,.2),

               rot.per = .15,

               max.words = 100)


comparison_EAP_HPL <- comparison_data %>%

dplyr::filter(author == "EAP" | author == "HPL")

comparison_EAP_HPL %>%

reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%

comparison.cloud(colors = c("red", "violetred4"),

               random.order = F,

               scale=c(3,.2),

               rot.per = .15,

               max.words = 100)
Comparison cloud
Comparison cloud between EAP, HPL, and MWS

Question: How many unique words are needed in the author dictionary to cover 90% of the used word instances?

words_cov_author_1 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "EAP")

words_cov_author_2 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "HPL")

words_cov_author_3 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "MWS")


gridExtra::grid.arrange(words_cov_author_1, words_cov_author_2, words_cov_author_3, nrow = 1)
Word coverage plot
Percentage of word instances covered vs number of unique words for EAP, HPL, and MWS

From the plot above we can see that for the EAP and HPL corpora, circa 7500 unique words are needed to cover 90% of word instances, while for the MWS corpus circa 5000 words are enough.

Question: Is there any commonality between the dictionaries used by the authors?

Are the authors using the same words? A commonality cloud can be used to answer this specific question: it emphasizes the similarities between authors and plots a cloud showing the words the different authors have in common. It shows only those words that are used by all authors, sized by their combined frequency across authors.

See below the commonality cloud between all authors.

comparison_data <- spooky_trainining_tidy_1n %>%

 dplyr::select(author, word) %>%

dplyr::anti_join(stop_words) %>%

dplyr::count(author,word, sort = TRUE)


mypal <- brewer.pal(8,"Spectral") comparison_data %>%

reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%

commonality.cloud(colors = mypal,

               random.order = F,

               scale=c(7,.5),

               rot.per = .15,

               max.words = 200)
Commonality cloud
Words common to all authors, sized by their combined frequency of use

Question: Can Word Frequencies be used to compare different authors?

First of all, we need to prepare the data by calculating the word frequencies for each author.

 word_freqs <- spooky_trainining_tidy_1n %>%

  dplyr::anti_join(stop_words) %>%

  dplyr::count(author, word) %>%

  dplyr::group_by(author) %>%

  dplyr::mutate(word_freq = n/ sum(n)) %>%

  dplyr::select(-n)

 


Then we need to spread the author (key) and the word frequency (value) across multiple columns (note how NAs have been introduced for words not used by an author).

word_freqs <- word_freqs %>%
  tidyr::spread(author, word_freq)


Let's start by plotting the word frequencies (log scale), comparing two authors at a time, and see how the words are distributed on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts, while words that are far from the line are found more in one set of texts than the other.

As we can see in the plots below, some words lie close to the line, but most are scattered around it, showing differences between the authors' word frequencies.
# Removing incomplete cases - not all words are common for the authors

# when spreading words to all authors - some will get NAs (if not used

# by an author)

word_freqs_EAP_vs_HPL <- word_freqs %>%

  dplyr::select(word, EAP, HPL) %>%

  dplyr::filter(!is.na(EAP) & !is.na(HPL))

ggplot(data = word_freqs_EAP_vs_HPL, mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +

  geom_abline(color = "red", lty = 2) +

  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +

  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +

  scale_x_log10(labels = scales::percent_format()) +

  scale_y_log10(labels = scales::percent_format()) +

  theme(legend.position = "none") +

  labs(y = "HP Lovecraft", x = "Edgard Allan Poe")

Word frequencies: Edgar Allan Poe vs HP Lovecraft

# Removing incomplete cases - not all words are common for the authors

# when spreading words to all authors - some will get NAs (if not used

# by an author)

word_freqs_EAP_vs_MWS <- word_freqs %>%

  dplyr::select(word, EAP, MWS) %>%

  dplyr::filter(!is.na(EAP) & !is.na(MWS))

ggplot(data = word_freqs_EAP_vs_MWS, mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +

  geom_abline(color = "red", lty = 2) +

  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +

  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +

  scale_x_log10(labels = scales::percent_format()) +

  scale_y_log10(labels = scales::percent_format()) +

  theme(legend.position = "none") +

  labs(y = "Mary Wollstonecraft Shelley", x = "Edgard Allan Poe")   

Word frequencies: Edgar Allan Poe vs Mary Wollstonecraft Shelley

# Removing incomplete cases - not all words are common for the authors

# when spreading words to all authors - some will get NAs (if not used

# by an author)

word_freqs_HPL_vs_MWS <- word_freqs %>%

  dplyr::select(word, HPL, MWS) %>%

  dplyr::filter(!is.na(HPL) & !is.na(MWS))

ggplot(data = word_freqs_HPL_vs_MWS, mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +

  geom_abline(color = "red", lty = 2) +

  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +

  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +

  scale_x_log10(labels = scales::percent_format()) +

  scale_y_log10(labels = scales::percent_format()) +

  theme(legend.position = "none") +

  labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")

Word frequencies: HP Lovecraft vs Mary Wollstonecraft Shelley

In order to quantify how similar or different these sets of word frequencies are between authors, we can calculate a correlation measurement between the sets (Spearman's rank correlation in the code below). There is a correlation of around 0.48 to 0.5 between the different authors (see plot below).

word_freqs %>%

  select(-word) %>%

  cor(use="complete.obs", method="spearman") %>%

  corrplot(type="lower",

       method="pie",

       diag = F)
Correlation graph
Correlation between EAP, HPL, and MWS

Get started with R programming with this free course: Beginner R programming course.

References

[1] Kaggle challenge: Spooky Author Identification
[2] “Text Mining in R – A Tidy Approach” by J. Silge & D. Robinson, O’Reilly 2017
[3] “Regular Expressions, Text Normalization, and Edit Distance”, draft chapter by D. Jurafsky & J. H. Martin, 2018

Appendix: Supporting functions

getNoExcerptsFor <- function(x, author){

  sum(x$author == author)

}

getPercentageExcerptsFor <- function(x, author){

  round((sum(x$author == author)/ dim(x)[1]) * 100, digits = 2)

}

get_xxx_length <- function(x, author, func){

  round(func(x[x$author == author,]$len), digits = 2)

}

plot_common_words_by_author <- function(x, author, remove.stopwords = FALSE, greater.than = 90){

  the_title = author

  if(remove.stopwords){

x <- x %>% dplyr::anti_join(stop_words)

  }

  x[x$author == author,] %>%

dplyr::count(word, sort = TRUE) %>%

dplyr::filter(n > greater.than) %>%

dplyr::mutate(word = reorder(word, n)) %>%

ggplot(mapping = aes(x = word, y = n)) +

geom_col() +

xlab(NULL) +

ggtitle(the_title) +

coord_flip() +

theme_dark(base_size = 10)

}

get_common_words_by_author <- function(x, author, remove.stopwords = FALSE){

  if(remove.stopwords){

x <- x %>% dplyr::anti_join(stop_words)

  }

  x[x$author == author,] %>%

dplyr::count(word, sort = TRUE)

}

plot_word_cov_by_author <- function(x,author){

  words_author <- get_common_words_by_author(x, author, remove.stopwords = TRUE)

  words_author %>%

mutate(cumsum = cumsum(n),

       cumsum_perc = round(100 * cumsum/sum(n), digits = 2)) %>%

ggplot(mapping = aes(x = 1:dim(words_author)[1], y = cumsum_perc)) +

geom_line() +

geom_hline(yintercept = 75, color = "yellow", alpha = 0.5) +

geom_hline(yintercept = 90, color = "orange", alpha = 0.5) +

geom_hline(yintercept = 95, color = "red", alpha = 0.5) +

xlab("no of 'unique' words") +

ylab("% Coverage") +

ggtitle(paste("% Coverage unique words -", author, sep = " ")) +

theme_dark(base_size = 10)

}
sessionInfo()
## R version 3.3.3 (2017-03-06)

## Platform: x86_64-apple-darwin13.4.0 (64-bit)

## Running under: macOS  10.13

##

## locale:

## [1] no_NO.UTF-8/no_NO.UTF-8/no_NO.UTF-8/C/no_NO.UTF-8/no_NO.UTF-8

##

## attached base packages:

## [1] stats     graphics  grDevices utils     datasets  methods   base     

##

## other attached packages:

##  [1] bindrcpp_0.2       corrplot_0.84      wordcloud_2.5     

##  [4] RColorBrewer_1.1-2 gridExtra_2.3      dplyr_0.7.3       

##  [7] purrr_0.2.3        readr_1.1.1        tidyr_0.7.1       

## [10] tibble_1.3.4       ggplot2_2.2.1      tidyverse_1.1.1   

## [13] tidytext_0.1.3    

##

## loaded via a namespace (and not attached):

##  [1] httr_1.3.1         ddalpha_1.2.1      splines_3.3.3     

##  [4] jsonlite_1.5       foreach_1.4.3      prodlim_1.6.1     

##  [7] modelr_0.1.1       assertthat_0.2.0   highr_0.6         

## [10] stats4_3.3.3       DRR_0.0.2          cellranger_1.1.0  

## [13] yaml_2.1.14        robustbase_0.92-7  slam_0.1-40       

## [16] ipred_0.9-6        backports_1.1.0    lattice_0.20-35   

## [19] glue_1.1.1         digest_0.6.12      rvest_0.3.2       

## [22] colorspace_1.3-2   recipes_0.1.0      htmltools_0.3.6   

## [25] Matrix_1.2-11      plyr_1.8.4         psych_1.7.8       

## [28] timeDate_3012.100  pkgconfig_2.0.1    CVST_0.2-1        

## [31] broom_0.4.2        haven_1.1.0        caret_6.0-77      

## [34] scales_0.5.0       gower_0.1.2        lava_1.5          

## [37] withr_2.0.0        nnet_7.3-12        lazyeval_0.2.0    

## [40] mnormt_1.5-5       survival_2.41-3    magrittr_1.5      

## [43] readxl_1.0.0       evaluate_0.10.1    tokenizers_0.1.4  

## [46] janeaustenr_0.1.5  nlme_3.1-131       SnowballC_0.5.1   

## [49] MASS_7.3-47        forcats_0.2.0      xml2_1.1.1        

## [52] dimRed_0.1.0       foreign_0.8-69     class_7.3-14      

## [55] tools_3.3.3        hms_0.3            stringr_1.2.0     

## [58] kernlab_0.9-25     munsell_0.4.3      RcppRoll_0.2.2    

## [61] rlang_0.1.2        grid_3.3.3         iterators_1.0.8   

## [64] labeling_0.3       rmarkdown_1.6      gtable_0.2.0      

## [67] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2    

## [70] R6_2.2.2           lubridate_1.6.0    knitr_1.17        

## [73] bindr_0.1          rprojroot_1.2      stringi_1.1.5     

## [76] parallel_3.3.3     Rcpp_0.12.12       rpart_4.1-11      

## [79] tidyselect_0.2.0   DEoptimR_1.0-8

Top 5 marketing analytics tools for success
Nathan Piccini
| December 18, 2018

From customer relationship management to tracking analytics, marketing analytics tools are important in the modern world. Learn how to make the most of these tools.

What do you usually find in a toolbox? A hammer, screwdriver, nails, tape measure? If you’re building a bird house, these would be perfect for you, but what if you’re creating a marketing campaign? What tools do you want at your disposal? It’s okay if you can’t come up with any. We’re here to help.

Industry’s leading marketing analytics tools

These days marketing is all about data. Whether it’s a click on an email or an abandoned cart on Amazon, marketers are using data to better cater to the needs of the consumer. To analyze and use this data, marketers have a toolbox of their own.

So what are some of these tools and what do they offer? Here, at Data Science Dojo, we’ve come up with our top 5 marketing analytics tools for success:

Customer relationship management platform (CRM)

A CRM is a tool used for managing everything there is to know about the customer. It can track where and when a consumer visits your site, track their interactions on your site, and create profiles for leads. A few examples of CRMs are:

HubSpot logo
HubSpot logo

HubSpot, along with the two others listed above, took the idea of a CRM and made it into an all-inclusive marketing resort. Along with the traditional CRM uses, HubSpot can be used to:

  • Manage social media
  • Send mass email campaigns
  • View traffic, campaign, and customer analytics
  • Associate emails, blogs, and social media posts to specific marketing campaigns
  • Create workflows and sequences
  • Connect to your other analytics tools such as Google Analytics, Facebook Ads, YouTube, and Slack.

HubSpot extends its usefulness by creating reports that allow its users to analyze what is and isn't working.

This is just a brief description revealing the tip of the iceberg of what HubSpot does. If you want to see below the water line, visit its website.

Search software

Search engine optimization (SEO) is the process of improving how a website ranks on search engines. It's how you find everything you have ever searched for on Google. Search software helps marketers analyze how best to optimize websites so potential consumers can find them.

A few search software companies are:

I would love to describe each one of the above businesses, but I only have experience with Moz. Moz focuses on a “less invasive way (of marketing) where customers are earned rather than bought”.

Its entire business is focused on upgrading your SEO. Moz offers 9 different services through its Moz Pro toolkit:

Moz Pro Services
Moz Pro Services

I love Moz Keyword Explorer. This is the tool I use to check different variations of titles, keywords, phrases, and hashtags. It gives four different scores, which you can see in the photo below.

Moz Keyword Explorer
Moz Keyword Explorer

Now, there's not enough data to show the average monthly search volume for my name, but, according to Moz, it wouldn't be that difficult to rank higher than my competitors, people have a high likelihood of clicking, and the Priority score shows that my name is not a "sweet spot" of high volume, low difficulty, and high CTR. In conclusion, using my name as a keyword to optimize the Data Science Dojo Blog isn't the best idea.

Read more about marketing analytics in this blog

Web analytics service

We can't talk about marketing tools and not mention web analytics services. These are among the most important pieces of equipment in the marketer's toolbox. Google Analytics (GA) is a free web analytics service that integrates your company's website data into a meticulously organized dashboard. I wouldn't say GA is the be-all and end-all, and there are many other services and tools out there; however, it can't be disputed that Google Analytics is a great tool to integrate into your company's marketing strategy.

Some similar Web Analytics Services include:

Google analytics logo
Google Analytics logo

Some of the analytics you’ll be able to understand are

  • Real-time data – Who’s on your site right now? Where are the users coming from? What pages are they looking at?
  • Audience Information – Where do your users live, age range, interests, gender, new or returning visitor, etc.?
  • Acquisition – Where did they come from (Organic, Direct, Paid Ads, Referrals, Campaigns)? What day/time do they land on your website? What was the final URL they visited before leaving? You can also link to any Google Ads campaigns you have running.
  • Behavior – What is the path people take to convert? How is your site speed? What events took place (Contact form submission, newsletter signup, social media share)?
  • Conversions – Are you attributing conversions by first touch, last touch, linear, or decay?

Understanding these metrics is amazingly effective in narrowing down how users interact with your website.

Another way to integrate Google Analytics into your marketing strategy is by setting up goals. Goals are set up to track specific actions taken on your website. For example, you can set up goals to track purchases, newsletter signups, video plays, live chat, and social media shares.

If you want a more in-depth look at what Google Analytics can offer, you can learn the basics through their Analytics Academy.

marketing analytics tool
Google analysis feedback

Analysis and feedback platform (A&F)

A&Fs are another great piece of equipment in the marketer’s toolbox; more specifically for looking at how users are interacting on your website. One such A&F, HotJar, does this in the form of heatmaps and recordings. HotJar’s integrated tracking pixel allows you to see how far users scroll on your website and what items were clicked the most.

You can also watch recordings of a user’s experience and even filter down to the URL of the page you wish to track (e.g., /checkout/). This allows you to capture the user’s unique journey until they make a purchase. For each recording, you can view audience information such as geographical location, country, browser, operating system, and a documented list of user actions.

In addition to UX/UI metrics, you can also integrate polls and forms on your website for more intricate data about your users.

As a marketing manager, these tools help to visualize all of my data in ways that a pivot table can’t display. And while I am a genuine user of these platforms, I must admit that it’s not the tool that makes the man, it’s the strategy. To get the most use out of these platforms, you will need to understand what business problem you are trying to solve and what metrics are important to you.

There is a lot of information that these dashboards can provide you. However, it’s up to you to filter through the noise. Not every accessible metric applies to you, so you will need to decide what is the most important for your marketing plan.

A few similar platforms include:

Experimentation platforms

Experimentation platforms are software for experimenting with different variations of a sample. Their purpose is to run A/B tests, something HubSpot does, but these platforms dive headfirst into them.

Experimentation Platforms

Where HubSpot only tests versions A and B, experimentation platforms let you test versions A, B, C, D, E, F, etc. They don’t just test the different versions, they will also test different audiences and how they respond to each test version. Searching “definition experimentation platforms” is a good place to start in understanding what experimentation platforms are. I can tell you they are a dream come true for marketers who love to get their hands dirty in behavioral targeting.

Optimizely is one such example of a company offering in-depth A/B testing. Optimizely’s goal is to let you spend more time experimenting with the customer experience and less time wading through statistics to learn what works and what doesn’t. If you are unsure what to do, you can test it with Optimizely.
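Under the hood, these platforms automate the statistics of controlled experiments. Purely as a hypothetical sketch (not how Optimizely or any specific vendor implements it), a two-proportion z-test comparing the conversion rates of versions A and B might look like this in Python:

```python
from math import sqrt
from scipy.stats import norm

def ab_test_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test; returns the p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                          # two-sided p-value

# Made-up results: 120/2400 conversions on version A vs. 156/2400 on version B.
print(f"p-value: {ab_test_p_value(120, 2400, 156, 2400):.4f}")
```

A small p-value suggests the difference between the two versions is unlikely to be noise; experimentation platforms essentially run this kind of test continuously, across many variants and audience segments.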

Using companies like Optimizely or Split is just one way to experiment. Many name-brand companies like Netflix, Microsoft, eBay, and Uber have all built their own experimentation platforms to use internally.

Not perfect

No one toolbox is perfect, and everyone’s will be different. One piece of advice I can give is to always understand the problem before deciding which tool is best to solve it. You wouldn’t use a hammer to do a job where a drill would be more effective, right?

Hammer in wall gif

You could, it just wouldn’t be the most efficient method. The same concept goes for marketing. Understanding the problem will help you know which tools should be in your toolbox.

Redash: A turnkey solution to easy data analysis
Ali Mohsin
| July 6, 2022

Data Science Dojo has launched Redash, one of the most in-demand data analytics tools, as a virtual machine offer on the Azure Marketplace.

Introduction

As data grows more complex, organizations need complete control over it. Analysts are sometimes hindered in specific use cases, especially when working internally with a dedicated team that requires unrestricted access to information. A solution is needed to perform data-driven tasks efficiently and extract actionable insights.

What is Redash?

Redash, a data analytics tool, helps organizations become more data-driven by providing tools to democratize data access. It simplifies the creation of dashboards and visualizations of your data by connecting to any data source.

Data analysis with Redash

As a business intelligence tool, it has more powerful integration capabilities than many other data analytics platforms, making it a favorite among businesses that have implemented a variety of apps to manage their business processes. Reviewers also find it more user-friendly, manageable, and business-friendly than comparable platforms.

PRO TIP: Join our Data Science Bootcamp to learn more about data analytics.

analytics graphs
Data Analytics with Redash

Key features of Redash

  • It offers a user-friendly graphical user interface to carry out complex tasks with a few clicks.
  • Allows users to work with small as well as big data; it supports many SQL and NoSQL databases.
  • The Query Editor allows users to query the database by utilizing the Schema Browser and autocomplete features.
  • Users can utilize the drag-and-drop feature to build visualizations (like charts, boxplot, cohort, counter, etc.) and then merge them into a single dashboard.
  • Enables peer evaluation of reports and searches and makes it simple for users to share visualizations and the queries that go with them.
  • Allows charts and dashboards to be updated automatically at defined time intervals.

Redash with Azure Services

It leverages the power of Azure services to integrate quickly with data sources. Write SQL queries to pull subsets of data for visualization, plot different charts, and share dashboards within the organization with greater ease.

Conclusion

Other open-source business intelligence solutions offer strong competition to Redash. Deciding which business intelligence and data analysis tool to invest in can be challenging because all corporate departments, including product, finance, marketing, and others, now use multiple platforms to carry out day-to-day operations and analytics tasks to strengthen their control over data.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We, therefore, know the importance of data and the insights it encapsulates. Through this offer, we are confident that you can analyze, visualize, and query your data in a collaborative environment with greater ease. Install the Redash offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Try Redash!

Text mining: Easy steps to convert unstructured data to structured
Phuc Duong
| March 22, 2016

All of these written texts are unstructured; text mining algorithms and techniques work best on structured data.

Text analytics for machine learning: Part 1

Have you ever wondered how Siri can understand English? How can you type a question into Google and get what you want?

Over the next week, we will release a five-part blog series on text analytics that will give you a glimpse into the complexities and importance of text mining and natural language processing.

This first section discusses how text is converted to numerical data.

In the past, we have talked about how to build machine learning models on structured data sets. However, life does not always give us data that is clean and structured. Much of the information generated by humans has little or no formal structure: emails, tweets, blogs, reviews, status updates, surveys, legal documents, and so much more. There is a wealth of knowledge stored in these kinds of documents which data scientists and analysts want access to. “Text analytics” is the process by which you extract useful information from text.

Some examples include:

All these written texts are unstructured; machine learning algorithms and techniques work best (or often, work only) on structured data. So, for our machine learning models to operate on these documents, we must convert the unstructured text into a structured matrix. Usually this is done by transforming each document into a sparse matrix (a big but mostly empty table). Each word gets its own column in the dataset, which tracks whether a word appears (binary) in the text OR how often the word appears (term-frequency). For example, consider the two statements below. They have been transformed into a simple term frequency matrix. Each word gets a distinct column, and the frequency of occurrence is tracked. If this were a binary matrix, there would only be ones and zeros instead of a count of the terms.

Make words usable for machine learning

Text Mining
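The example image above uses two short statements; as a minimal sketch with two hypothetical sentences of my own, scikit-learn’s CountVectorizer builds exactly this kind of term-frequency matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "the team won the game",           # hypothetical document 1
    "the team lost the championship",  # hypothetical document 2
]

vectorizer = CountVectorizer()           # counts term frequency; use binary=True for a binary matrix
matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Densify only for display; real corpora should stay sparse.
print(pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out()))
```

(`get_feature_names_out` is the scikit-learn 1.0+ name; older releases call it `get_feature_names`.)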

Why do we want numbers instead of text? Most machine learning algorithms and data analysis techniques assume numerical data (or data that can be ranked or categorized). Similarity between documents is calculated by determining the distance between the frequency of words. For example, if the word “team” appears 4 times in one document and 5 times in a second document, they will be calculated as more similar than a third document where the word “team” only appears once.
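To make the “team” example concrete, here is a small sketch using Euclidean distance on hypothetical count vectors; documents whose word counts sit close together are treated as more similar:

```python
import numpy as np

# Hypothetical term-frequency vectors over the vocabulary ["team", "win", "coach"].
doc1 = np.array([4, 2, 1])
doc2 = np.array([5, 2, 1])
doc3 = np.array([1, 0, 3])

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance between two frequency vectors.
    return float(np.linalg.norm(a - b))

print(euclidean(doc1, doc2))  # small distance: doc1 and doc2 look similar
print(euclidean(doc1, doc3))  # larger distance: doc3 looks less similar
```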

 

Clusters
Sample clusters

Text mining: Build a matrix

While our example was simple (6 words), term frequency matrices on larger datasets can be tricky.

Imagine turning every word in the Oxford English Dictionary into a matrix: that’s 171,476 columns. Now imagine adding everyone’s names, every corporation or product or street name that ever existed. Now feed it slang. Feed it every rap song. Feed it fantasy novels like Lord of the Rings or Harry Potter so that our model will know what to do when it encounters “The Shire” or “Hogwarts.” Good, now that’s just English. Do the same thing again for Russian, Mandarin, and every other language.

After this is accomplished, we are approaching a several-billion-column matrix, and two problems arise. First, it becomes computationally infeasible and memory intensive to perform calculations over this matrix. Second, the curse of dimensionality kicks in and distance measurements become so absurdly large in scale that they all seem the same. Most of the research and time that goes into natural language processing is less about the syntax of language (which is important) and more about how to reduce the size of this matrix.

Now that we know what we must do and the challenges we must face to reach our desired result, the next three blogs in the series will address these problems directly. We will introduce you to 3 concepts: conforming, stemming, and stop word removal.

Want to learn more about text mining and text analytics?

Check out our short video on our data science bootcamp curriculum page OR watch our video on tweet sentiment analysis.

Text analytics: Making text machine-readable
Phuc Duong
| March 28, 2016

Develop an understanding of text analytics, text conforming, and special character cleaning. Learn how to make text machine-readable.

Text analytics for machine learning: Part 2

Last week, in part 1 of our text analytics series, we talked about text processing for machine learning. We wrote about how we must transform text into a numeric table, called a term frequency matrix, so that our machine learning algorithms can apply mathematical computations to the text. However, we found that our textual data requires some data cleaning.

In this blog, we will cover the text conforming and special character cleaning parts of text analytics.

Understand how computers read text

The computer sees text differently from humans. Computers cannot see anything other than numbers. Every character (letter) that we see on a computer is actually a numeric representation, with the mapping between numbers and characters determined by an “encoding table.” The simplest, and most common, encoding in text analytics is ASCII. A small sample ASCII table is shown below.

ASCII Code

Below is a look at six different ways the word “CAFÉ” might be encoded in ASCII. The word on the left is what the human sees, and its ASCII representation (what the computer sees) is on the right.

Any human would know that this is just six different spellings for the same word, but to a computer these are six different words. These would spawn six different columns in our term-frequency matrix. This will bloat our already enormous term-frequency matrix, as well as complicate or even prevent useful analysis.

 

ASCII Representation

Unify words with the same spelling

To unify the six different “CAFÉ’s”, we can perform two simple global transformations.

Casing: First, we must convert all characters to the same casing, uppercase or lowercase. This is a common enough operation; most programming languages have a built-in function that converts all characters in a string to either lowercase or uppercase. We can choose either global lowercasing or global uppercasing; it does not matter as long as it’s applied globally.

String normalization: Second, we must convert all accented characters to their unaccented variants. This is often called Unicode normalization, since accented and other special characters are usually encoded using the Unicode standard rather than the ASCII standard. Not all programming languages have this feature out of the box, but most have at least one package which will perform this function.

Note that implementations vary, so you should not mix and match Unicode normalization packages. What kind of normalization you do is highly language dependent, as characters which are interchangeable in English may not be in other languages (such as Italian, French, or Vietnamese).
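As a minimal Python sketch of both transformations (global lowercasing plus Unicode NFKD normalization to strip accents), keeping in mind the language-dependence caveat above:

```python
import unicodedata

def conform(text: str) -> str:
    # Lowercase globally, then decompose accented characters and drop the accents.
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

variants = ["CAFÉ", "Café", "café", "CAFE\u0301"]  # a few hypothetical spellings
print({v: conform(v) for v in variants})            # every variant maps to "cafe"
```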

Remove special characters and numbers

The next thing we have to do is remove special characters and numbers. Numbers rarely contain useful meaning. Examples of such irrelevant numbers include footnote numbering and page numbering. Special characters, as discussed in the string normalization section, have a habit of bloating our term-frequency matrix. For instance, representing a quotation mark has been a pain-point since the beginning of computer science.

Unlike a letter, which may only be capital or not capital, quotation marks have many popular representations. A quotation character has three main properties: curly, straight, or angled; left or right; single, double, or triple. Depending on the encoding used, not all of these may exist.

ASCII Quotations
Properties of quotation characters

The table below shows how quoting the word “café” in both straight quotes and left-right curly quotes would look in a UTF-8 table in Arial font.

UTF 8 Form
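A common, if blunt, way to perform this removal is a regular expression that keeps only letters and whitespace. A minimal sketch with a made-up sentence, which also hints at the over-cleaning risk discussed next:

```python
import re

def strip_specials_and_digits(text: str) -> str:
    # Replace anything that is not a letter or whitespace with a space,
    # then collapse repeated whitespace.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_specials_and_digits('Traffic on I-405 was "terrible" today!!!'))
# -> 'Traffic on I was terrible today'  (note how "I-405" collapses to just "I")
```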

Avoid over-cleaning

The problem is further complicated by each individual font, operating system, and programming language since implementation of the various encoding standards is not always consistent. A common solution is to simply remove all special characters and numeric digits from the text. However, removing all special characters and numbers can have negative consequences.

There is such a thing as too much data cleaning when it comes to text analytics. The more we clean and remove, the more “lost in translation” the textual message may become. We may inadvertently strip information or meaning from our messages so that by the time our machine learning algorithm sees the textual data, much or all of the relevant information has been stripped away.

For each type of cleaning above, there are situations in which you will want to either skip it altogether or selectively apply it. As in all data science situations, experimentation and good domain knowledge are required to achieve the best results.

When do you want to avoid over-cleaning in your text analytics?

Special characters: The advent of email, social media, and text messaging has given rise to text-based emoticons represented by ASCII special characters.

For example, if you were building a sentiment predictor for text, text-based emoticons like “=)” or “>:(” are very indicative of sentiment because they directly express happiness or sadness. Stripping our messages of these emoticons by removing special characters will also strip meaning from our message.

Numbers: Consider the infinitely gridlocked freeway in Washington state, “I-405.” In a sentiment predictor model, anytime someone talks about “I-405,” more likely than not the document should be classified as “negative.” However, by removing numbers and special characters, the word now becomes “I”. Our models will be unable to use this information, which, based on domain knowledge, we would expect to be a strong predictor.

Casing: Even cases can carry useful information sometimes. For instance, the word “trump” may carry a different sentiment than “Trump” with a capital T, representing someone’s last name.

One solution to filter out proper nouns that may contain information is named entity recognition, where we use a combination of predefined dictionaries and scanning of the surrounding syntax (sometimes called “lexical analysis”). Using this, we can identify people, organizations, and locations, as shown in the sketch below.
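As a sketch of what that can look like in practice (assuming the spaCy library and its small English model are installed; the entity labels depend entirely on the pre-trained model):

```python
import spacy

# Setup assumed: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Trump visited Microsoft headquarters near I-405 in Redmond.")
for ent in doc.ents:
    # Prints detected entities with labels such as PERSON, ORG, GPE, FAC.
    print(ent.text, "->", ent.label_)
```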

Next, we’ll talk about stemming and lemmatization as ways to help computers understand that different versions of words can have the same meaning (e.g., run, running, runs).

Learn more

Want to learn more about text analytics? Check out the short video on our curriculum page OR

Process mining: Introducing event log mining
Dave Langer
| January 5, 2017

Process Mining is a critical skill needed by every data scientist and analyst for mining rich and varied data contained in event logs.

Event logs are everywhere and represent a prime source of big data. Event log sources run the gamut from e-commerce web servers to devices participating in globally distributed Internet of Things (IoT) architectures.

Even Enterprise Resource Planning (ERP) systems produce event logs! Given the rich and varied data contained in event logs, process mining these assets is a critical skill needed by every data scientist, business/data analyst, and program/product manager.

At the meetup for this topic, presenter David Langer showed how easy it is to get started process mining your event logs using the OSS tools of R and ProM.

David began the talk by defining which features of a dataset are important for event log mining:

Activity: A well-defined step in some workflow/process.

Timestamp: The date and time at which something worthy of note happened.

Resource: Staff and/or other assets used/consumed in the execution of an activity.

Event: At a minimum, the combination of an activity and a timestamp. Optionally, events may have associated resources, life cycle, and other data.

Case: A related set of events denoted, and connected, by a unique identifier where the events can be ordered.

Event Log: A list of cases and associated events.

Trace: A distinct pattern of case activities within an event log where each activity is present at most once per trace. Event logs typically contain many traces.

Below is an example of IIS Web Server data that may be used for process mining:

Example event log from IIS web server data

 

In this example, the traces for this event log are:

  1. portal, dashboard, purchase order report
  2. portal, help, contact us
  3. portal, my team, expense reports
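David’s demo used R (with edeaR) and ProM; purely as an illustration of the same idea, here is a hypothetical pandas sketch, with made-up timestamps, that derives these three traces from a small event log:

```python
import pandas as pd

# Hypothetical event log covering the three cases above.
events = pd.DataFrame({
    "case_id":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "activity": ["portal", "dashboard", "purchase order report",
                 "portal", "help", "contact us",
                 "portal", "my team", "expense reports"],
    "timestamp": pd.to_datetime([
        "2017-01-05 09:00", "2017-01-05 09:01", "2017-01-05 09:04",
        "2017-01-05 10:15", "2017-01-05 10:16", "2017-01-05 10:20",
        "2017-01-05 11:30", "2017-01-05 11:32", "2017-01-05 11:35",
    ]),
})

# A trace is the ordered sequence of activities within a case.
traces = (events.sort_values("timestamp")
                .groupby("case_id")["activity"]
                .agg(" -> ".join))
print(traces)
```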

David continued his talk with a live demo using the Incident Activity Records dataset from the 2014 Business Process Intelligence Challenge (BPIC).

About the meetup

In this presentation hosted by Data Science Dojo, David covered:

  • The scenarios and benefits of event log mining
  • The minimum data required for event log mining
  • Ingesting and analyzing event log data using R
  • Process Mining with ProM
  • Event log mining techniques to create features suitable for Machine Learning models
  • Where you can learn more about this very handy set of tools and techniques for process mining

Process mining source code

David’s source code can be viewed and cloned here, at his GitHub repository for this meetup. To clean and process the dataset, he ran through his R script step by step. David installed the R package edeaR, which was used specifically to analyze the dataset.

After cleaning the dataset, he loaded the new .csv file into the process mining workbench tool, ProM, for visualization. The resulting visualization helped him gain insights about the flow of incident activities from open to close.


Speaker: David Langer

Marketing analytics tools for success
Nathan Piccini
| December 18, 2018

From customer relationship management to tracking analytics, marketing tools are important in the modern world. Learn how to make the most of these tools.

What do you normally find in a toolbox? A hammer, screwdriver, nails, tape measure? If you’re building a bird house, these would be perfect for you, but what if you’re creating a marketing campaign? What tools do you want at your disposal? It’s okay if you can’t come up with any. We’re here to help.

These days marketing is all about data. Whether it’s a click on an email or an abandoned cart on Amazon, marketers are using data to better cater to the needs of the consumer. In order to analyze and use this data, marketers have a toolbox of their own.

So what are some of these tools and what do they offer? Here, at Data Science Dojo, we’ve come up with our top 5 marketing analytics tools for success.

Customer Relationship Management Platform (CRM)

CRM is a tool used for managing everything there is to know about the customer. It can track where and when a consumer visits your site, track the interactions on your site, and create profiles for leads. A few examples of CRMs are:

HubSpot logo

HubSpot, along with the two others listed above, took the idea of a CRM and made it into an all-inclusive marketing resort. Along with the traditional CRM uses, HubSpot can be used to:
  • Manage social media
  • Send mass email campaigns
  • View traffic, campaign, and customer analytics
  • Associate emails, blogs, and social media posts to specific marketing campaigns
  • Create workflows and sequences
  • Connect to your other analytics tools such as Google Analytics, Facebook Ads, Amazon seller competitor analysis, YouTube, and Slack.

 

HubSpot continues its effectiveness by creating reports allowing its users to analyze what is and isn’t working.

This is just a brief description revealing the tip of the iceberg of what HubSpot does. If you want to see below the water line, visit its website.

Search Software

Search engine optimization (SEO) is the process of improving a website’s ranking on search engines. It’s how you are able to find everything you have ever searched for on Google. Search software helps marketers analyze how to best optimize websites for potential consumers to find.

A few search software companies are:

 

I would love to describe each one of the above businesses, but I only have experience with Moz. Moz focuses on a

“less invasive way (of marketing) where customers are earned rather than bought”.

In fact, its entire business is focused on upgrading your SEO. Moz offers 9 different services through its Moz Pro toolkit:

MOZ_Services

Personally, I love the Moz Keyword Explorer. This is the tool I use to check different variations of titles, keywords, phrases, and hashtags. It gives four different scores, which you can see in the photo below.

keyword Search

Now, there’s not enough data to show the average monthly volume for my name, but, according to Moz, it wouldn’t be that difficult to rank higher than my competitors, and people have a high likelihood of clicking. The Priority score shows that my name is not a “sweet spot” of high volume, low difficulty, and high CTR. In conclusion, using my name as a keyword to optimize the Data Science Dojo Blog probably isn’t the best idea.

Web Analytics Service

We can’t talk about marketing tools and not mention Web Analytics Services. These are one of the most important pieces of equipment in the marketer’s toolbox. Google Analytics (GA) is a free web analytics service that integrates your company’s website data into a neatly organized dashboard. I wouldn’t say GA is the be-all and end-all piece of equipment, and there are many different services and tools out there, however, it can’t be refuted that Google Analytics is a great tool to integrate into your company’s marketing strategy.

Some similar Web Analytics Services include:

Google-Analytics

Some of the analytics you’ll be able to understand are:

  • Real-time data – Who’s on your site right now? Where are the users coming from? What pages are they looking at?
  • Audience Information – Where do your users live, age range, interests, gender, new or returning visitor, etc.?
  • Acquisition – Where did they come from (Organic, Direct, Paid Ads, Referrals, Campaigns)? What day/time did they land on your website? What was the final URL they visited before leaving? You can also link to any Google Ads campaigns you have running.
  • Behavior – What is the path people take to convert? How is your site speed? What events took place (Contact form submission, newsletter signup, social media share)?
  • Conversions – Are you attributing conversions by first touch, last touch, linear, or decay?

 

Understanding these metrics is very effective in narrowing down how users interact with your website.

Another way to integrate Google Analytics into your marketing strategy is by setting up goals. Goals are set up to track specific actions taken on your website. For example, you can set up goals to track purchases, newsletter signups, video plays, live chat, and social media shares.

If you want a more in-depth look at what Google Analytics can offer, you can learn the basics through their Analytics Academy.

Analysis_feedback

Analysis and Feedback Platform (A&F)

A&Fs are another great piece of equipment in the marketer’s toolbox; more specifically for looking at how users are interacting on your website. One such A&F, HotJar, does this in the form of heatmaps and recordings. HotJar’s integrated tracking pixel allows you to see how far users scroll on your website and what items were clicked the most.

You can also watch recordings of a user’s experience and even filter down to the URL of the page you wish to track (e.g., /checkout/). This allows you to really capture the user’s unique journey until they make a purchase. For each recording, you can view audience information such as geographical location, country, browser, operating system, and a documented list of user actions.

In addition to UX/UI metrics, you can also integrate polls and forms on your website for more intricate data about your users.

As a marketing manager, these tools really help to visualize all of my data in ways that can’t be displayed by a pivot table. And while I am a fervent user of these platforms, I must admit that it’s not the tool that makes the man, it’s the strategy. To get the most use out of these platforms, you will need to understand what business problem you are trying to solve and what metrics are important to you.

There is a lot of information that these dashboards can provide you. However, it’s up to you to filter through the noise. Not every accessible metric is applicable to you, so you will need to decide what is the most important for your marketing plan.

A few similar platforms include:

Experimentation Platforms

Experimentation platforms are software for experimenting with different variations of a sample. Their purpose is to run A/B tests, something HubSpot does, but these platforms dive headfirst into them.

Experimentation Platforms

Where HubSpot only tests versions A and B, experimentation platforms let you test versions A, B, C, D, E, F, etc. They don’t just test the different versions; they will also test different audiences and how they respond to each test version. Searching “definition experimentation platforms” is a good place to start in understanding what experimentation platforms are. I can tell you they are a dream come true for marketers who love to get their hands dirty in behavioral targeting.

Optimizely is one such example of a company offering in-depth A/B testing. Optimizely’s goal is to let you spend more time experimenting with the customer experience and less time wading through statistics to learn what works and what doesn’t. If you are unsure what to do, you can test it with Optimizely.

Using companies like Optimizely or Split is just one way to experiment. Many name-brand companies like Netflix, Microsoft, eBay, and Uber have all built their own experimentation platforms to use internally.

Not Perfect

No one toolbox is perfect, and everyone’s is going to be different. One piece of advice I can give is to always understand the problem before deciding which tool is best to solve it. You wouldn’t use a hammer to do a job that a drill would be more effective at, right?

hammer-in-wall gif

You could, it just wouldn’t be the most efficient method. The same concept goes for marketing. Understanding the problem will help you know which tools should be in your toolbox.

 

6 effective email marketing metrics to measure success
Nathan Piccini
| January 3, 2019

Every email marketing campaign will succeed or fail, but how do you categorize something as a success or a failure? That’s where metrics come in.

Marketing campaigns are measured differently depending on your overarching goal, but most of the metrics we all use are the same. Metrics are a way to measure how well your marketing campaign is doing, and they will show you where you need to adjust in order to succeed.

In this post, you will find 6 effective email marketing metrics to measure the success of your email marketing campaign.

Conversions through email marketing

conversions
A customer paying for the service from mobile app

A conversion is characterized as a completed action towards a goal. Whether it’s signing up for a newsletter or buying a pair of sunglasses, someone performed an action that brought you closer to completing your goal.

For example, if a conversion is defined as a subscriber signing up from an email, I calculate the conversion rate by dividing the total number of signups by the total number of successfully delivered emails. Conversion rates vary widely depending on the industry you are in and what the goal of the campaign is. Typically, if you are sitting between 1% and 3%, you’re doing pretty well.

Click-through rate

Don’t conversions all start with a click? The answer is yes in case you didn’t know, but how do I know how effective my email is at getting people to my landing page? Allow me to introduce you to the Click-Through Rate (CTR).

CTR measures clicks on links within your email that take potential customers to a landing page. It could be a button, picture, or text, but the important thing is that someone clicked on something that was meant to be clicked.

Different factors will influence the number of clicks you receive, such as:

  • ad copy
  • imagery
  • call-to-action
  • color of text/buttons

You should always A/B test your emails to measure the effectiveness of different versions of the same email have on CTR. If you don’t know what A/B testing is, watch the short video below.

To calculate CTR, simply divide the total number of clicks by the total number of impressions or, in this case, the total number of people the email was successfully sent to.

Like all metrics, CTR is going to vary depending on the industry, but a good average benchmark is around 3.42%.

Click to Open Rate

Click to open rate

The Click to Open Rate (CTOR) measures the number of unique clicks versus the number of total unique opens an email had. Unlike CTR, it doesn’t take into account the people who didn’t open the email. CTOR gives you an idea of whether or not the content within your email is clicking with the audience.

To calculate this metric, divide the number of unique clicks by the number of unique opens an email has. As always, the CTOR will vary depending on the industry, but a good standard is between 20 – 30%.

Unsubscribe rate

Let’s get this out of the way. No one likes seeing someone unsubscribe from an email list. It hurts knowing someone just sent you away after you sent them something you put your blood, sweat, and tears into.

But it shouldn’t.

Unsubscribes aren’t all bad. In fact, people who unsubscribe are saving you time because you’ll no longer be sending an email to someone who won’t convert. But if your unsubscribe rate is high, it can indicate a few things:

  • You’re emailing too frequently.
  • You’re targeting the wrong audience.
  • You’re offering low quality content.

 

To calculate an unsubscribe rate, divide the number of unsubscribes by the total number of successfully sent emails, and multiply by 100. Generally, a good unsubscribe rate is below 0.5%.

Bounce rate

Bounce rate

The bounce rate is the number of emails that “bounce back” after being sent. The person who the email was intended for never receives the email, and the sender receives a message saying the email was never sent.

Bounce rates can be identified as either a hard or soft bounce:

  • Hard Bounce: The email address doesn’t exist.
  • Soft Bounce: The address exists, but there was a temporary issue when the email was sent.

To calculate an email bounce rate, divide the number of returned emails marked as undelivered by the total emails sent. A high bounce rate is above 2%. If you continue to send emails that get bounced, it will damage your sender reputation and hurt your future deliverability.

Spam percentage

According to statista.com, spam made up 53.5 percent of emails around the world in 2018. It’s important to keep your email out of the spam folder so people see what you have to offer. Spam percentage measures the number of recipients who reported your email as spam versus the total number of emails sent. The higher the percentage, the more likely your emails will automatically be marked as spam.

To calculate spam percentage, divide the number of emails reported as spam by the number of successfully sent emails. The ideal percentage is 0, but we don’t live in a perfect world. So, if your emails receive a spam percentage of less than 0.1% you can sit happily at your computer.
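Pulling the arithmetic from this post together, here is a small sketch with made-up campaign numbers; the denominators follow the definitions above (successfully delivered emails for most rates, total emails sent for bounce rate):

```python
def rate(numerator: int, denominator: int) -> float:
    """Return a percentage, guarding against division by zero."""
    return 100 * numerator / denominator if denominator else 0.0

# Hypothetical campaign numbers, purely to exercise the formulas.
sent, delivered, bounced = 10_000, 9_800, 200
unique_opens, unique_clicks, total_clicks = 3_000, 450, 520
signups, unsubscribes, spam_reports = 150, 40, 5

print(f"Conversion rate:  {rate(signups, delivered):.2f}%")
print(f"Click-through:    {rate(total_clicks, delivered):.2f}%")
print(f"Click-to-open:    {rate(unique_clicks, unique_opens):.2f}%")
print(f"Unsubscribe rate: {rate(unsubscribes, delivered):.2f}%")
print(f"Bounce rate:      {rate(bounced, sent):.2f}%")
print(f"Spam rate:        {rate(spam_reports, delivered):.2f}%")
```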

Thanks for reading! Keep an eye out for more blog posts on different Marketing Metrics!

Role of Data Normalization in predictive modeling before analytics
Anna Kayfitz
| February 6, 2019

There are two key schools of thought on good practice for database management: data normalization and standardization. We will learn why each matters.

Organizations are investing heavily in technology as artificial intelligence techniques, such as machine learning, continue to gain traction across several industries.

  • A PwC (PricewaterhouseCoopers) survey pointed out that in 2018, 40% of business executives made major decisions using data at least once every 30 days, and this is constantly increasing
  • A Gartner study states that 40% of enterprise data is either incomplete, inaccurate, or unavailable

As the speed of data coming into the business increases with the Internet of Things starting to become more mature, the risk of disconnected and siloed data grows if it is poorly managed within the organization. Gartner has suggested that a lack of data quality control costs average businesses up to $14 million per year.

The adage of “garbage in, garbage out” still plagues analytics and decision making and it is fundamental that businesses realize the importance of clean and normalized data before embarking on any such data-driven projects.

When most people talk about organizing data, they think it means getting rid of duplicates from their system which, although important, is only the first step in quality control and there are more advanced methods to truly optimize and streamline your data.

There are two key schools of thought on good practice: data normalization and standardization. Both have their place in data governance and/or preparation strategy.

Why data normalization?

A data normalization strategy takes database management and organizes it into specific tables and columns with the purpose of reducing duplication, avoiding data modification issues, and simplifying queries. All information is stored logically in one central location which reduces the propensity for inconsistent data (sometimes known as a “single source of truth”). In simple terms, it ensures your data looks and reads the same across all records.

In the context of machine learning and data science, it takes the values from the database and where they are numeric columns, changes them into a common scale. For example, imagine you have a table with two columns, and one contains values between 0 and 1 and the other contains values between 10,000 and 100,000.

The huge differences in scale might cause problems if you attempt to do any analytics or modeling. This strategy rescales such columns onto a matching scale while maintaining the distribution, e.g., 10,000 might become 0 and 100,000 becomes 1, with values in between weighted proportionally.
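A minimal sketch of that rescaling (often called min-max normalization) on a hypothetical income column:

```python
import numpy as np

def min_max_scale(column: np.ndarray) -> np.ndarray:
    # Rescale values to the [0, 1] range while preserving their relative spacing.
    return (column - column.min()) / (column.max() - column.min())

income = np.array([10_000, 32_500, 55_000, 77_500, 100_000], dtype=float)  # made-up values
print(min_max_scale(income))  # [0.   0.25 0.5  0.75 1.  ]
```

Libraries such as scikit-learn offer the same transformation (e.g., MinMaxScaler), but the arithmetic is exactly this.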

In real-world terms, consider a dataset of credit card information that has two variables, one for the number of credit cards and the second for income. Using these attributes, you might want to create a cluster and find similar applicants.

Both of these variables will be on completely different types of scale (income being much higher) and would therefore likely have a far greater influence on any results or analytics. Normalization removes the risk of this kind of bias.

The main benefits of this strategy in analytical terms are that it allows faster searching and sorting, as it is better at creating indexes via smaller, logical tables. Also, in having more tables, there is better use of segments to control the physical placement of the data store.

There will be fewer nulls and redundant data after modeling any necessary columns and bias/issues with anomalies are greatly reduced by removing the differences in scale.

This concept should not be confused with data standardization, and it is important that both are considered within any strategy.

What is data standardization?

Data standardization takes disparate datasets and puts them on the same scale to allow easy comparison between different types of variables. It uses the average (mean) and the standard deviation of a dataset to achieve a standardized value of a column.

For example, let’s say a store sells $520 worth of chocolate in a day. We know that on average, the store sells $420 per day and has a standard deviation of $50. To standardize the $520 we would do a calculation as follows:

(520 - 420) / 50 = 100 / 50 = 2, so our standardized value for this day is 2. If the sales were $600, we’d scale in a similar way: (600 - 420) / 50 = 180 / 50 = 3.6.
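The same calculation as a tiny sketch, reusing the numbers from the example above:

```python
def standardize(value: float, mean: float, std: float) -> float:
    # Z-score: how many standard deviations a value sits from the mean.
    return (value - mean) / std

print(standardize(520, mean=420, std=50))  # 2.0
print(standardize(600, mean=420, std=50))  # 3.6
```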

If all columns are standardized on a similar basis, we quickly have a consistent base for analytics that allows us to spot correlations.

In summary, data normalization processes ensure that our data is structured logically and scaled proportionally where required, generally on a scale of 0 to 1. It tends to be used where you have predefined assumptions of your model. Data standardization can be used where you are dealing with multiple variables together and need to find correlations and trends via a weighted ratio.

By ensuring you have normalized data, the likelihood of success in your machine learning and data science projects vastly improves. It is vital that organizations invest as much in ensuring the quality of their data as they do in the analytical and scientific models that are created by it. Preparation is everything in a successful data strategy and that’s what we mainly teach in our data science bootcamp courses.
