
Data Analytics

Data is an essential component of any business, and it is the role of a data analyst to make sense of it all. Power BI is a powerful data visualization tool that helps them turn raw data into meaningful insights and actionable decisions.

In this blog, we will explore the role of data analysts and how they use Power BI to extract insights from data and drive business success. From data discovery and cleaning to report creation and sharing, we will delve into the key steps that can be taken to turn data into decisions. 

A data analyst is a professional who uses data to inform business decisions. They process and analyze large sets of data to identify trends, patterns, and insights that can help organizations make more informed decisions. 

 

Uses of Power BI for a data analyst – Data Science Dojo

Who is a data analyst?

A data analyst is a professional who works with data to extract insights, draw conclusions, and support decision-making. They use a variety of tools and techniques to clean, transform, visualize, and analyze data to understand patterns, relationships, and trends. The role of a data analyst is to turn raw data into actionable information that can inform and drive business strategy.

To do so, they apply techniques such as statistical analysis and data visualization, and they may also work with databases and programming languages such as SQL and Python to manipulate and extract data.

The importance of data analysts in an organization is that they help organizations make data-driven decisions. By analyzing data, analysts can identify new opportunities, optimize processes, and improve overall performance. They also help organizations make more informed decisions by providing insights into customer behavior, market trends, and other key metrics.

Additionally, their role and job can help organizations stay competitive by identifying areas where they may be lagging and providing recommendations for improvement. 

Defining Power BI 

Power BI provides a suite of data visualization and analysis tools to help organizations turn data into actionable insights. It allows users to connect to a variety of data sources, perform data preparation and transformations, create interactive visualizations, and share insights with others. 

Check out this course and learn Power BI today!

The platform includes features such as data modeling, data discovery, data analysis, and interactive dashboards. It enables organizations to quickly create and share visualizations, reports, and dashboards with stakeholders, regardless of their technical skill level.

Power BI also provides collaboration features, allowing team members to work together on data insights, and share information and insights with others through Power BI reports and dashboards. 

Key capabilities of Power BI  

Data Connectivity: It allows users to connect to various data sources, including Excel, SQL Server, Azure SQL, and other cloud-based data sources.

Data Transformation: It provides a wide range of data transformation tools that allow users to clean, shape, and prepare data for analysis. 

Visualization: It offers a wide range of visualization options, including charts, tables, and maps, that allow users to create interactive and visually appealing reports. 

Sharing and Collaboration: It allows users to share and collaborate on reports and visualizations with others in their organization. 

Mobile Access: It also offers mobile apps for iOS and Android, that allow users to access and interact with their data on the go. 

How does a data analyst use Power BI? 

A data analyst uses Power BI to collect, clean, transform, visualize, and analyze data to turn it into meaningful insights and decisions. The following steps outline the process of using Power BI for data analysis: 

  1. Connect to data sources: A data analyst can import data from a variety of sources, such as spreadsheets, databases, or cloud-based services. Power BI provides several ways to import data, including manual upload, data connections, and direct connections to data sources. 
  2. Clean and transform data: Before data can be analyzed, it often needs to be cleaned and prepared. This may include removing extraneous information, correcting errors or inconsistencies, and transforming the data into a format that is usable for analysis (a short Python sketch of this step follows the list).
  3. Create visualizations: Once the data has been prepared, a data analyst can use Power BI to create visualizations of the data. This may include bar charts, line graphs, pie charts, scatter plots, and more. Power BI provides a library of built-in visualizations plus the ability to create custom visuals, giving data analysts a wide range of options for presenting data.
  4. Perform data analysis: Power BI provides a range of data analysis tools, including calculated fields and measures, and the DAX language, which allows data analysts to perform more advanced analysis. These tools allow them to uncover insights and trends that might not be immediately apparent. 
  5. Collaborate and share insights: Once insights have been uncovered, data analysts can share their findings with others through Power BI reports or dashboards. These reports provide a way to present data visualizations and analysis results to stakeholders and can be published and shared with others. 
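For a concrete sense of what step 2 can look like, Power BI's Get Data > Python script connector can run a short pandas script and pick up any DataFrame the script leaves in scope (assuming a local Python installation with pandas is configured). This is only a minimal sketch; the file path and column names are hypothetical placeholders:

# Minimal cleaning sketch for Power BI's "Python script" data source.
# The source file and the OrderDate/Region/Sales columns are hypothetical.
import pandas as pd

raw = pd.read_csv(r"C:\data\sales_export.csv")          # hypothetical export

clean = (
    raw.drop_duplicates()                                # remove duplicate rows
       .dropna(subset=["OrderDate", "Sales"])            # drop rows missing key fields
       .assign(OrderDate=lambda d: pd.to_datetime(d["OrderDate"]),    # fix types
               Region=lambda d: d["Region"].str.strip().str.title())  # tidy text values
)
# Power BI lists any DataFrame left in scope (here, `clean`) in the Navigator window.

The same cleaning could just as easily be done in Power Query; the Python route is shown only to keep the example self-contained.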

 

Learn Power BI with this crash course in no time!

 

By following these steps, a data analyst can use Power BI to turn raw data into meaningful insights and decisions that can inform business strategy and decision-making. 

 

Why should you use data analytics with Power BI? 

User-friendly interface – Power BI has a user-friendly interface, which makes it easy for users with little to no technical skills to create and share interactive dashboards, reports, and visualizations. 

Real-time data visualization – It provides real-time data visualization, allowing users to analyze data in real time and make quick decisions. 

Integration with other Microsoft tools – Power BI integrates seamlessly with other Microsoft tools, such as Excel, SharePoint, and Azure, making it an ideal tool for organizations using Microsoft technology. 

Wide range of data sources – It can connect to a wide range of data sources, including databases, spreadsheets, cloud services, and web APIs, making it easy to consolidate data from multiple sources. 

Cost-effective – It is a cost-effective solution for data analytics, with both free and paid versions available, making it accessible to organizations of all sizes. 

Mobile accessibility – Power BI provides mobile accessibility, allowing users to access and analyze data from anywhere, on any device. 

Collaboration features – With robust collaboration features, it allows users to share dashboards and reports with other team members, encouraging teamwork and decision-making. 

Conclusion 

In conclusion, Power BI is a powerful tool for data analysis that gives organizations the ability to easily visualize, analyze, and share complex data. By preparing, cleaning, and transforming data, creating relationships between tables, and using visualizations and DAX, analysts can create reports and dashboards that provide valuable insights into key business metrics.

The ability to publish reports, share insights, and collaborate with others makes Power BI an essential tool for any organization looking to improve performance and make informed decisions.

February 9, 2023

An overview of data analysis, the data analysis methods, its process, and implications for modern corporations. 

 

Studies show that 73% of corporate executives believe that companies failing to use data analysis on big data lack long-term sustainability. While data analysis can guide enterprises to make smart decisions, it can also be useful for individual decision-making.

Let’s consider an example of using data analysis at an intuitive, individual level. As consumers, we are always choosing between products offered by multiple companies. These decisions, in turn, are guided by individual past experiences. Every individual analyzes the data obtained through their experience to reach a final decision.

Put more concretely, data analysis involves sifting through data, modeling it, and transforming it to yield information that guides strategic decision-making. For businesses, data analytics can provide highly impactful decisions with long-term yield. 

 

Data analysis methods and the data analysis process – Data Science Dojo

 

 So, let’s dive deep and look at how data analytics tools can help businesses make smarter decisions. 

The data analysis process 

The process includes five key steps:  

1. Identify the need

Companies use data analytics for strategic decision-making regarding a specific issue. The first step, therefore, is to identify the particular problem. For example, a company decides it wants to reduce its production costs while maintaining product quality. To do so effectively, the company would need to identify the step(s) of the workflow pipeline where it should implement cost cuts.

Similarly, the company might also have a hypothetical solution to its question. Data analytics can be used to test that hypothesis, allowing the decision-maker to converge on an optimal solution.

A specific question or hypothesis determines the subsequent steps of the process. Hence, this must be as clear and specific as possible. 

 

2. Collect the data 

Once the data analysis need is identified, the kind of data required is also determined. Data collection can involve data in many different types and formats. One broad classification is based on structure and distinguishes structured from unstructured data.

Structured data, for example, is the data a company obtains from its users via internal data acquisition methods such as marketing automation tools. More importantly, it follows the usual row-column database format and is suited to the company's exact needs.

Unstructured data, on the other hand, need not follow any such formatting. It is obtained via third parties such as Google Trends, census bureaus, world health bureaus, and so on. Structured data is easier to work with as it’s already tailored to the company’s needs. However, unstructured data can provide a significantly larger data volume. 

There are many other data types to consider as well. For example, metadata, big data, real-time data, and machine data.  

 

3. Clean the data 

The third step, data cleaning, ensures that error-free data is used for the data analysis. This step includes procedures such as formatting data correctly and consistently, removing any duplicate or anomalous entries, dealing with missing data, and fixing cross-set data errors.  

Performing these tasks manually is tedious, so various tools exist to streamline the data-cleaning process. These include open-source data tools such as OpenRefine, desktop applications like Trifacta Wrangler, cloud-based software-as-a-service (SaaS) offerings like TIBCO Clarity, and other data management tools such as IBM InfoSphere QualityStage, which is often used for big data.
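As a minimal illustration of what these cleaning tasks look like in code, independent of the tools above, here is a pandas sketch; the file name and columns are made up:

import pandas as pd

df = pd.read_csv("survey_responses.csv")                 # hypothetical raw export
df.columns = df.columns.str.strip().str.lower()          # consistent column names
df = df.drop_duplicates()                                # remove duplicate entries
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # consistent date format
df["age"] = df["age"].fillna(df["age"].median())         # deal with missing data
df = df[df["age"].between(0, 120)]                       # drop anomalous entries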

 

4. Perform data analysis 

Data analysis includes several methods as described earlier. The method to be implemented depends closely on the research question to be investigated. Data analysis methods are discussed in detail later in this blog. 

 

5. Present the results 

Presentation determines how well the results are communicated. Visualization tools such as charts, images, and graphs effectively convey findings, establishing visual connections in the viewer's mind. These tools emphasize patterns discovered in existing data and shed light on predicted patterns, assisting the interpretation of the results.

 

Listen to the Data Analysis challenges in cybersecurity

 

Data analysis methods

Data analysts use a variety of approaches, methods, and tools to deal with data. Let’s sift through these methods from an approach-based perspective: 

 

1. Descriptive analysis 

Descriptive analysis involves categorizing and presenting broader datasets in a way that lets emergent patterns be observed in them. Data aggregation techniques are one way of performing descriptive analysis: first collect the data, then sort it to ease manageability.

This can also involve performing statistical analysis on the data to determine, say, the measures of frequency, dispersion, and central tendencies that provide a mathematical description for the data.
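As a small example, descriptive analysis of a hypothetical sales dataset might look like this in pandas (the file and column names are assumed):

import pandas as pd

sales = pd.read_csv("sales.csv")                       # hypothetical dataset

print(sales["revenue"].describe())                     # central tendency and dispersion
print(sales["region"].value_counts())                  # measures of frequency

# Aggregation: sort the data into manageable groups, then summarize each group
summary = sales.groupby("region")["revenue"].agg(["count", "mean", "std"])
print(summary)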
 

2. Exploratory analysis 

Exploratory analysis involves consulting various data sets to see how certain variables may be related, or how certain patterns may be driving others. This analytic approach is crucial in framing potential hypotheses and research questions that can be investigated using data analytic techniques.  

Data mining, for example, requires data analysts to use exploratory analysis to sift through big data and generate hypotheses to be tested. 

 

3. Diagnostic analysis 

Diagnostic analysis is used to answer why a particular pattern exists in the first place. For example, this kind of analysis can assist a company in understanding why its product is performing in a certain way in the market. 

Diagnostic analytics includes methods such as hypothesis testing, distinguishing correlation from causation, and diagnostic regression analysis.

 

4. Predictive analysis 

Predictive analysis answers the question of what will happen. This type of analysis is key for companies in deciding new features or updates on existing products, and in determining what products will perform well in the market.  

 For predictive analysis, data analysts use existing results from the earlier described analyses while also using results from machine learning and artificial intelligence to determine precise predictions for future performance. 
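As a toy illustration of the idea (not a production forecasting pipeline), a simple scikit-learn regression can be fit on historical figures and asked to predict a future value; every number below is invented:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented history: advertising spend (in $k) vs. units sold
X = np.array([[10], [20], [30], [40], [50], [60], [70], [80]])
y = np.array([110, 205, 310, 395, 500, 610, 690, 805])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)                  # learn from past performance
print("Held-out R^2:", model.score(X_test, y_test))               # sanity-check the fit
print("Predicted units at $90k spend:", model.predict([[90]]))    # forecast a future value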

 

5. Prescriptive analysis 

Prescriptive analysis involves determining the most effective strategy for implementing the decision arrived at. For example, an organization can use prescriptive analysis to work out the best way to roll out a new feature. This component of data analytics actively deals with the consumer end, requiring one to work with marketing, human resources, and so on.

 Prescriptive analysis makes use of machine learning algorithms to analyze large amounts of big data for business intelligence. These algorithms can assess large amounts of data by working through them via “if” and “else” statements and making recommendations accordingly. 

 

6. Quantitative and qualitative analysis 

Quantitative analysis computationally fits mathematical models to describe the correlation or causation observed within datasets. This includes regression analysis, null hypothesis testing, and so on.

Qualitative analysis, on the other hand, involves non-numerical data such as interviews and pertains to answering broader social questions. It involves working closely with textual data to derive explanations.  

 

7. Statistical analysis 

Statistical techniques provide answers to essential decision challenges. For example, they can accurately quantify risk probabilities, predict product performance, establish relationships between variables, and so on. These techniques are used by both qualitative and quantitative analysis methods. Some of the invaluable statistical techniques for data analysts include linear regression, classification, resampling methods, and subset selection.  

Statistical analysis, more importantly, lies at the heart of data analysis, providing the essential mathematical framework via which analysis is conducted. 
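To make one of these techniques concrete, here is a small NumPy sketch of a resampling method, a bootstrap confidence interval for a mean, computed on made-up observations:

import numpy as np

rng = np.random.default_rng(42)
sample = np.array([12.1, 9.8, 11.4, 10.9, 13.0, 12.5, 9.5, 11.8, 10.2, 12.9])  # made-up data

# Bootstrap: resample with replacement many times and recompute the statistic each time
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean() for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")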

 

Data-driven businesses

Data-driven businesses use the data analysis methods described above. As a result, they offer many advantages and are particularly suited to modern needs. Their credibility relies on them being evidence-based and using precise mathematical models to determine decisions.

Some of these advantages include a stronger grasp of customer needs, precise identification of business needs, more effective strategic decisions, and better performance in a competitive market. Data-driven businesses are the way forward.

January 16, 2023

It is no surprise that the demand for skilled data analysts is growing across the globe. In this blog, we will explore eight key competencies that aspiring data analysts should focus on developing.

 

Data analysis is a crucial skill in today’s data-driven business world. Companies rely on data analysts to help them make informed decisions, improve their operations, and stay competitive. And so, all healthy businesses actively seek skilled data analysts. 

 

Technical and non-technical skills for a data analyst

 

Becoming a skilled data analyst does not just mean acquiring important technical skills. Certain soft skills, such as creative storytelling or effective communication, make for a more well-rounded profile. These non-technical skills can also be key in shaping how you put your data analytics skills to use.

Technical skills to practice as a data analyst: 

Technical skills are an important aspect of being a data analyst. Data analysts are responsible for collecting, cleaning, and analyzing large sets of data, so a strong foundation in technical skills is necessary for them to be able to do their job effectively.

Some of the key technical skills that are important for a data analyst include:

1. Probability and statistics:  

A solid foundation in probability and statistics ensures your ability to identify patterns in data, prevent any biases and logical errors in the analysis, and lastly, provide accurate results. All these abilities are critical to becoming a skilled data analyst. 

Consider, for example, how various kinds of probability distributions are used in machine learning. Beyond a strong understanding of these distributions, you will need to be able to apply statistical techniques, such as hypothesis testing and regression analysis, to understand and interpret data.
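For example, a two-sample t-test with SciPy, run here on made-up numbers, checks whether an observed difference between two groups is likely to be real rather than noise:

from scipy import stats

# Hypothetical task-completion times (in seconds) for two page designs
design_a = [32, 29, 35, 31, 28, 30, 33, 34]
design_b = [27, 25, 29, 26, 28, 24, 27, 26]

t_stat, p_value = stats.ttest_ind(design_a, design_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # a small p-value suggests a genuine difference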

 

2. Programming:  

As a data analyst, you will need to know how to code in at least one programming language, such as Python, R, or SQL. These languages are the essential tools via which you will be able to clean and manipulate data, implement algorithms and build models. 

Moreover, statistical programming languages like Python and R allow advanced analysis that spreadsheet tools like Excel cannot provide. Additionally, both Python and R are open source.

3. Data visualization 

A crucial part of a data analyst’s job is effective communication both within and outside the data analytics community. This requires the ability to create clear and compelling data visualizations. You will need to know how to use tools like Tableau, Power BI, and D3.js to create interactive charts, graphs, and maps that help others understand your data. 
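Code-based tools work just as well for quick charts. As a minimal matplotlib sketch (the figures are made up):

import matplotlib.pyplot as plt

# Made-up monthly sign-up counts, purely for illustration
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
signups = [120, 135, 160, 150, 190, 240]

plt.figure(figsize=(8, 4))
plt.bar(months, signups, color="steelblue")
plt.title("Monthly sign-ups")
plt.ylabel("Sign-ups")
plt.tight_layout()
plt.show()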

 

The progression of the Datasaurus Dozen dataset through all of the target shapes – Source

 

4. Database management:  

Managing and working with large and complex datasets means having a solid understanding of database management. This covers methods of collecting, arranging, and storing data in a secure and efficient way. Moreover, you will also need to know how to design and maintain databases, as well as how to query and manipulate the data within them.

Certain companies may have roles particularly suited to this task, such as data architects. However, most will expect data analysts to perform these duties, since they are responsible for collecting, organizing, and analyzing data to help inform business decisions.

Organizations use different data management systems. Hence, it helps to gain a general understanding of database operations so that you can later specialize them to a particular management system.  
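To ground this, here is a minimal sketch of querying a database from Python, using sqlite3 only because it ships with the standard library; the table and column names are hypothetical:

import sqlite3
import pandas as pd

conn = sqlite3.connect("company.db")                    # hypothetical database file

query = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sales
WHERE order_date >= '2022-01-01'
GROUP BY region
ORDER BY revenue DESC;
"""

df = pd.read_sql_query(query, conn)                     # pull the result set into pandas
conn.close()
print(df.head())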

Non-technical skills to adopt as a data analyst:  

Data analysts work with various members of the community ranging from business leaders to social scientists. This implies effective communication of ideas to a non-technical audience in a way that drives informed, data-driven decisions. This makes certain soft skills like communication essential.  

Similarly, there are other non-technical skills that you may have acquired outside a formal data analytics education. These skills such as problem-solving and time management are transferable skills that are particularly suited to the everyday work life of a data analyst. 

1. Communication 

As a data analyst, you will need to be able to communicate your findings to a wide range of stakeholders. This includes being able to explain technical concepts concisely and presenting data in a visually compelling way.  

Writing skills can help you communicate your results to a wider audience via blogs and opinion pieces. Moreover, speaking and presentation skills are also invaluable in this regard.

 

Read about Data Storytelling and its importance

2. Problem-solving:   

Problem-solving is a skill that individuals pick up from working in fields ranging from research to mathematics, and many more. This, too, is a transferable skill and not unique to formal data analytics training. It also involves a dash of creativity and thinking about problems outside the box to come up with unique solutions.

Data analysis often involves solving complex problems, so you should be a skilled problem-solver who can think critically and creatively. 

3. Attention to detail: 

Working with data requires attention to detail and an elevated level of accuracy. You should be able to identify patterns and anomalies in data and be meticulous in your work. 

4. Time management:  

Data analysis projects can be time-consuming, so you should be able to manage your time effectively and prioritize tasks to meet deadlines. Time management can also be implemented by tracking your daily work using time management tools.  

 

Final word 

Overall, being a data analyst requires a combination of technical and non-technical skills. By mastering these skills, you can become an invaluable member of any team and make a real impact with your data analysis. 

 

January 10, 2023

How does Expedia determine the hotel price to quote to site users? How come Mac users end up spending as much as 30 percent more per night on hotels? Digital marketing analytics, a torrent flowing into every corner of the global economy, has revolutionized marketing efforts, so much so that it has reset the discipline altogether. It is safe to say that marketing analytics is the science behind persuasion.

Marketers can learn a great deal about users: their likes, dislikes, goals, inspirations, drop-off points, needs, and demands. This wealth of information is a gold mine, but only for those who know how to use it. In fact, one of the top questions that marketing managers struggle with is:

 

“Which metrics to track?” 

 

Furthermore, several platforms report on marketing, such as email marketing software, paid search advertising platforms, social media monitoring tools, blogging platforms, and web analytics packages. It is a marketer’s nightmare to be buried under sets of reports from different platforms while tracking a campaign all the way to conversion.

There are certainly smarter ways to track. But before we take a deep dive into how to track smartly, let me clarify why you should spend as much time measuring as doing:

  • To identify what’s working
  • To identify what’s not working
  • To identify strategies to improve
  • To do more of what works

To gain a trustworthy answer to the above, you must measure everything. While you attempt it, arm yourself with the lexicon of marketing analytics to form statements that communicate results, for example:

 

“Twitter mobile drove 40% of all clicks this week on the corporate website” 

Every statement that you form to communicate analytics must state the source, segment, value, metric, and range. Let us break down the above example:

  • Source: Twitter
  • Segment: Mobile
  • Value: 40%
  • Metric: Clicks
  • Range: This week

To be able to report such glossy statements, you will need to get your hands dirty. You can either take a campaign-based approach or a goals-based approach.

 

Campaign-based approach to marketing analytics

 

In a campaign-based approach, you measure the impact of every campaign, for example, if you have social media platforms, blogs, and emails trying to get users to sign up for an e-learning course, this approach will enable you to get insight into each.

In this approach we will discuss the following in detail:

  1. Measure the impact on the website
  2. Measure the impact of SEO
  3. Measure the impact of paid search advertising
  4. Measure the impact of blogging efforts
  5. Measure the impact of social media marketing
  6. Measure the impact of e-mail marketing

Measure the impact on the website

 

  • Unique visitors

How to use: Unique visitors account for a fresh set of eyes on your site.  If the number of unique visitors is not rising, then it is a clear indication to reassess marketing tactics.

 

  • Repeat visitors

How to use: If visitors are revisiting your site or a landing page, it is a clear indication that your site sticks, or offers content people want to return to. But if your repeat visitor rate is high relative to new visitors, it may indicate that your content is not engaging new audiences.

 

  • Sources

How to use: Sources are of three types: organic, direct, and referrals. Learning about your traffic sources gives you clarity on your SEO performance. It can also help you answer questions such as: what percentage of total traffic is organic?

 

  • Referrals

How to use: This is when the traffic arriving on your site is from another website. Aim for referrals to deliver 20-30% of your total traffic. Referrals can help you identify the types of sites or bloggers that are linking to your site and the type of content they tend to share. This information can be fed back into your SEO strategy, and help you produce relevant content that generates inbound links.

 

  • Bounce rate

How to use: A high bounce rate indicates trouble. Maybe the content is not relevant, or the pages are not compelling enough. Perhaps the experience is not user-friendly, or the call-to-action buttons are too confusing. A high bounce rate reflects problems, and the reasons can be many.


 

Measure the impact of SEO 

Similarly, you can measure the impact of SEO using the following metrics:

 

  • Keyword performance and rankings:

How to use: You can use tools like Google Ads (formerly AdWords) to identify the keywords to optimize your website for. Check whether the chosen keywords are driving traffic to your site and improving its search rankings.

 

  • Total traffic from organic search:

How to use: This metric mirrors how relevant your content is. Low traffic from organic search may mean it is time to ramp up content creation (videos, blogs, webinars) or expand into newer areas, such as e-books and podcasts, that search engines can rank higher.

Measure the impact of paid search advertising

Likewise, it is equally important to measure the impact of your paid search, also known as pay per click (PPC), in which you pay for every click that is generated by paid search advertising. How much are you spending in total? Are those clicks turning into leads? How much profit are you generating from this spend? Some of the following metrics can help you clarify:

 

  • Click through rate:

How to use: This metric helps you determine the quality of your ad. Is it effective enough to prompt a click? Test different copy treatments, headlines, and URLs to figure out the combination that boosts the CTR for a specific term.

 

  • Average cost per click:

How to use: Cost per click is the amount you spend for each click on a paid search ad. Combine it with the conversion rate and the earnings from those clicks to judge whether the spend is paying off.

 

  • Conversion rate:

How to use: Is a conversion always a purchase? No! Each time a user takes the action you want them to take on your site, such as clicking a button, filling out a form, or subscribing, it counts as a conversion. The short sketch below strings these three metrics together.
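Pulling these paid search metrics together, the arithmetic itself is simple; here is a quick sketch with made-up campaign numbers:

# Made-up numbers for one paid search campaign
impressions = 50_000
clicks = 1_200
conversions = 90
spend = 960.00                              # total ad spend in dollars

ctr = clicks / impressions                  # click-through rate
cpc = spend / clicks                        # average cost per click
conversion_rate = conversions / clicks      # share of clicks that convert
cost_per_conversion = spend / conversions

print(f"CTR: {ctr:.2%}, CPC: ${cpc:.2f}, "
      f"Conversion rate: {conversion_rate:.2%}, Cost per conversion: ${cost_per_conversion:.2f}")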

 

Measure the impact of blogging efforts 

Going beyond website and SEO metrics, you can also measure the impact of your blogging efforts, since a considerable amount of organizational resources is invested in creating blogs that build backlinks to the website. Some of the metrics that can give you clarity on whether you are generating relevant content:

  • Post Views
  • Call to action performance
  • Blog leads

Measure the impact of social media marketing

The strategies for measuring social media marketing are well known and widely implemented. Especially now, as the e-commerce industry expands, social media can make or break your image online. Some of the commonly measured metrics are:

  • Reach
  • Engagement
  • Mentions to assess the brand perception
  • Traffic
  • Conversion rate

 

Measure the impact of e-mail marketing

Quite often, the marketing strategy runs on the crutches of e-mail. E-mails are a good place to start visibility efforts and can be very important in maintaining a sustainable relationship with your existing customer base. Some of the metrics that can help you clarify if your emails are working their magic or not are:

  • Bounce rate
  • Delivery rate
  • Click through rate
  • Share/forwarding rate
  • Unsubscribe rate
  • Frequency of emails sent

Goals-based approach

A goals-based approach is defined based on what you’re trying to achieve by a particular campaign. Are you trying to acquire new customers? Or build a loyal customer base, increase engagement, and improve conversion rate? Here are a few examples:

In this approach we will discuss the following in detail:

  • Audience analysis
  • Acquisition analysis
  • Behavioral analysis
  • Conversion analysis
  • A/B testing

 Audience analysis:

The goal is to know:

 

“Who are your customers?” 

 

Audience analysis is a measure that helps you gain clarity on who your customers are. The information can include demographics, location, income, age, and so forth. The following set of metrics can help you know your customers better.

 

  • Unique visitors
  • Lead score
  • Cookies
  • Segment
  • Label
  • Personally Identifiable Information (PII)
  • Properties
  • Taxonomy

Acquisition analysis:

 

The goal is to know:

 

“How do customers get to your website?” 

 

Acquisition analysis helps you understand which channel delivers the most traffic to your site or application. Comparing incoming visitors from different channels helps determine the efficacy of your SEO efforts on organic search traffic and see how well your email campaigns are running. Some of the metrics that can help you are:

 

  • Omnichannel
  • Funnel
  • Impressions
  • Sources
  • UTM parameters
  • Tracking URL
  • Direct traffic
  • Referrers
  • Retargeting
  • Attribution
  • Behavioral targeting


Behavioral analysis:

 The goal is to know:

 

“What do the users do on your website?” 

 

Behavior analytics explains what customers do on your website. What pages do they visit? Which device do they use? From where do they enter the site? What makes them stay? How long do they stay? Where on the site did they drop off? Some of the metrics that can help you gain clarity are:

  • Actions
  • Sessions
  • Engagement rate
  • Events
  • Churn
  • Bounce rate

Conversion analysis

The goal is to know:

 

“Do customers take the actions that you want them to take?”

 

Conversions track whether customers take actions that you want them to take. This typically involves defining funnels for important actions — such as purchases — to see how well the site encourages these actions over time. Metrics that can help you gain more clarity are:

  • Conversion rate
  • Revenue report

A/B testing:

The goal is to know:

 

“What digital assets are likely to be the most effective for higher conversion?” 

 

A/B testing enables marketers to experiment with different digital options to identify which ones are likely to be the most effective. For example, they can compare one version (A, the control) against another (B, the variant). Companies run A/B experiments regularly to learn what works best.

In this article, we discussed what marketing analytics is, its importance, two approaches that marketers can take to report metrics and the marketing lingo they can use while reporting results. Pick the one that addresses your business needs and helps you get clarity on your marketing efforts. This is not an exhaustive list of all the possible metrics that can be used to measure.

Of course, there are more! But this can be a good starting point until the marketing efforts expand into a larger effort that has additional areas that need to be tracked.

 

Upgrade your data science skillset with our Python for Data Science and Data Science Bootcamp training!

December 8, 2022

In this blog, we will discuss what Data Analytics RFP is and the five steps involved in the data analytics RFP process.


December 1, 2022

In this blog, we are going to discuss data storytelling for successful brand building, its components, and data-driven brand storytelling.

What is data storytelling? 

Data storytelling is the process of deriving insights from a dataset through analysis and making them presentable through visualization. It not only helps capture insights but also makes the content visually presentable so that stakeholders can make data-driven decisions.

With data storytelling, you can influence and inform your audience based on your analysis.  

 

There are 3 important components of data storytelling.  

  1. Data: The analysis you perform builds the foundation of your data story. This could be descriptive, diagnostic, predictive, or prescriptive analysis to help get a full picture.
  2. Narrative: Also known as a storyline, the narrative is used to communicate the insights gained from your analysis.
  3. Visualization: Visualization helps communicate the story clearly and effectively, making use of graphs, charts, diagrams, and audio-visuals.

 

The benefits of data storytelling

Data storytelling infographic

 

So, the question arises why do we even need storytelling for data? The simple answer is it helps with decision-making. But let’s take a look at some of the benefits of data storytelling. 

  • Adding value to your data and insights. 
  • Interpreting complex information and highlighting essential key points for the audience. 
  • Providing a human touch to your data. 
  • Offering value to your audience and industry. 
  • Building credibility as an industry and topic thought leader.

 

For example, Airbnb uses data storytelling to help consumers find the right hotel at the right price, and to help hosts set up their listings in the most lucrative locations.

 

Data storytelling helps Airbnb deliver personalized experiences and recommendations. Its price tip feature is constantly updated to guide hosts on how likely they are to get a booking at a chosen price. Other features include host/guest interactions, current events, and local market history, available in real time through its app.

 

Data-driven brand storytelling 

Now that we have an understanding of data storytelling, let’s talk about how brand storytelling works. Data-driven brand storytelling is when a company uses research, studies, and analytics to share information about a brand and tell a story to consumers.  

It turns complex datasets into an insightful, easy-to-understand, visually comprehensible story. It differs from creative storytelling, where the brand focuses only on creating a perception; here the story is based on factual data.

Storytelling is a great way to build brand association and connect with your consumers. Data-driven storytelling uses visualization that captures attention.  

 

Learn how to create and execute data visualization and tell a story with your data by enrolling in our 5-day live Power BI training 

 

Studies show that our brains process images 60,000 times faster than words, 90% of information transmitted to the brain is visual in nature and we’re 65% more likely to retain information that is visual. 

That’s why infographics, charts, and images are so useful.  

For example, Tower Electric Bikes, a direct-to-consumer e-bike brand, used an infographic to rank the most and least bike-friendly cities across the US. This way they turned an enormous amount of data into a visually friendly infographic that bike consumers can interpret at a glance.

 

Bike-friendly cities infographic – Source: Tower Electric Bikes

  

Using the power of storytelling for marketing content 

Even though consumers interpret all content as data, visual content provides the most value in terms of memorability, impact, and capturing attention. The job of any successful brand is to build a positive association in consumers’ minds.

Storytelling helps create those positive associations by providing high-value engaging content, capturing attention, and giving meaning to not-so-visually appealing datasets. 

We live in a world that is highly cluttered by advertising and paid promotional content. To make your content stand out from competitors you need to have good visualization and a story behind it. Storytelling helps assign meaning and context to data that would otherwise look unappealing and dry.  

Consumers gain clarity and a better understanding, and they share more if the content makes sense to them. Data storytelling helps extract and communicate insights that, in turn, support your consumers’ buying journey.

It could be content relevant to any stage of their buyer journey or even outside of the sales cycle. Storytelling helps create engaging and memorable marketing content that would help grow your brand. 

Learn how to use data visualization, narratives, and real-life examples to bring your story to life with our free community event Storytelling Data. 

 

Executing flawless data-driven brand storytelling 

Now that we have a better understanding of brand storytelling, let’s have a look at how to go about crafting a story and important steps involved. 

Craft a compelling narrative 

The most important element in building a story is the narrative. You need a compelling narrative for your story. There are 4 key elements to any story. 

Characters: These are your key players or stakeholders in your story. They can be customers, suppliers, competitors, environmental groups, government, or any other group that has to do with your brand.  

Setting: This is where you use your data to reinforce the narrative, whether it’s an improved feature in your product that increases safety or a manufacturing process that takes environmental impact into account. This is the stage where you define the environment that concerns your stakeholders.

Conflict: Here you describe the root issue or problem you’re trying to solve with data. For example, if certain marketing content generated sales revenue, you may want your team to understand why, so it can create more helpful content for the sales team. Conflict plays a crucial role in making your story relevant and engaging; there needs to be a problem for a data solution.

Resolution: Finally, you want to propose a solution to the identified problem. You can present a short-term fix along with a long-term pivot depending on the type of problem you are solving. At this stage, your marketing outreach should be consistent with a very visible message across all channels.

You don’t want to create confusion: whatever resolution or result you’ve reached through analysis should be clearly indicated, with supporting evidence and compelling visualization, to make your story come to life.

Your storytelling needs to have all these steps to be able to communicate your message effectively to the desired audience. With these steps, your audience will walk through a compelling, engaging and impactful story. 

 

Start learning data storytelling today

Our brains are hard-wired to love stories and visuals. Storytelling is not something new; it stretches back to ancient cave paintings and symbol languages. That is why it resonates so well in today’s fast-paced, cluttered consumer environment.

Brands can use storytelling based on factual data to engage, create positive associations and finally encourage action. The best way to come up with a story narrative is to use internal data, success stories, and insights driven by your research and analysis. Then translate those insights into a story and visuals for better retention and brand building. 

 


 

 


 

 

November 29, 2022

In this article, we’re going to talk about how data analytics can help your business generate more leads and why you should rely on data when making decisions regarding a digital marketing strategy. 

Some people believe that marketing is about creativity – unique and interesting campaigns, quirky content, and beautiful imagery. Contrary to their beliefs, data analytics is what actually powers marketing – creativity is simply a way to accomplish the goals determined by analytics. 

Now, if you’re still not sure how you can use data analytics to generate more leads, here are our top 10 suggestions. 

1. Know how your audience behaves

Most businesses have an idea or two about who their target audience is. But having an idea or two is not good enough if you want to grow your business significantly – you need to be absolutely sure who your audience is and how they behave when they come to your website. 

Now, the best way to do that is to analyze the website data.  

You can tell quite a lot by simply looking at the right numbers. For instance, if you want to know whether the users can easily find the information they’re looking for, keep track of how much time they spend on a certain webpage. If they leave the webpage as soon as it loads, they probably didn’t find what they needed. 

We know that looking at spreadsheets is a bit boring, but you can easily obtain Power BI Certification and use Microsoft Power BI to make data visuals that are easy to understand and pleasing to the eye. 

 

 

 

 

Books on data analytics – Compilation by Data Science Dojo

Read the top 12 data analytics books to learn more about it

 

2. Segment your audience

A great way to satisfy the needs of different subgroups within your target audience is to use audience segmentation. Using that, you can create multiple funnels for the users to move through instead of just one, thereby increasing your lead generation. 

Now, before you segment your audience, you need to have enough information about these subgroups so that you can divide them and identify their needs. Since you can’t individually interview users and ask them for the necessary information, you can use data analytics instead. 

Once you have that, it’s time to identify their pain points and address them differently for each subgroup, and voilà – you’ve got yourself more leads.

3. Use data analytics to improve buyer persona

Knowing your target audience is a must but identifying a buyer persona will take things to the next level. A buyer persona doesn’t only contain basic information about your customers. It goes deeper than that and tells you their exact age, gender, hobbies, location, and interests.  

It’s like describing a specific person instead of a group of people. 

Of course, not all your customers will fit that description to a T, but that’s not the point. The point is to have that one idea of a person (or maybe two or three buyer personas) in your mind when creating content for your business.  

Understanding buyer personas with the help of data analytics [Source: Freepik]

 

4. Use predictive marketing 

While data analytics should absolutely be used in retrospectives, there’s another purpose for the information you obtain through analytics – predictive marketing. 

Predictive marketing is basically using big data to develop accurate forecasts of customers’ behavior. It uses complex machine-learning algorithms to build predictive models. 

A good example of how that works is Amazon’s landing page, which includes personalized recommendations.  

Amazon doesn’t only keep track of the user’s previous purchases, but also what they have clicked on in the past and the types of items they’ve shown interest in. By combining that with the season of purchase and time, they are able to make recommendations that are nearly 100% accurate. 
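As a toy illustration of the underlying idea (not how Amazon actually builds its recommendations), a propensity model can score how likely a visitor is to buy based on past behavior; every feature and number below is invented:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features per user: [past purchases, recent clicks in category, days since last visit]
X = np.array([
    [5, 12, 1],
    [0,  1, 40],
    [2,  6, 7],
    [7, 20, 2],
    [1,  0, 60],
    [3,  9, 5],
])
y = np.array([1, 0, 1, 1, 0, 1])            # 1 = bought on the next visit, 0 = did not

model = LogisticRegression().fit(X, y)      # learn a purchase-propensity score

new_user = np.array([[4, 10, 3]])           # invented behavior for a new visitor
print("Purchase probability:", model.predict_proba(new_user)[0, 1])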

Acquiring customers – Lead generation

 

If you’re curious to find out how data science works, we suggest that you enroll in the Data Science Bootcamp

 

5. Know where website traffic comes from 

Users come to your website from different places.  

Some have searched for it directly on Google, some have run into an interesting blog piece on your website, while others have seen your ad on Instagram. This means that the time and effort you put into optimizing your website and creating interesting content pays off. 

But imagine creating a YouTube ad that doesn’t bring much traffic – that doesn’t pay off at all. You’d then want to rework your campaign or redirect your efforts elsewhere.  

This is exactly why knowing where website traffic comes from is valuable. You don’t want to invest your time and money into something that doesn’t bring you any benefits. 

6. Understand which products work 

Most of the time, you can determine what your target audience will like and dislike. The more information you have about your target audience, the better you can satisfy their needs.  

But no one is perfect, and anyone can make a mistake. 

Heinz, a company known for producing ketchup and other foods, once released a new product: EZ Squirt ketchup in shades of purple, green, and blue. At first, kids loved it, but the novelty didn’t last. Six years later, Heinz halted production of these products.

As you can see, even big and experienced companies flop sometimes. A good way to avoid that is by tracking which product pages have the least traffic and don’t sell well. 

7. Perform competitor analysis 

Keeping an eye on your competitors is never a bad idea. No matter how well you’re doing and how unique you are, others will try to surpass you and become better. 

The good news is that there are quite a few tools online that you can use for competitor analysis. SEMrush, for instance, can help you see what the competition is doing to get qualified leads so that you can use it to your advantage. 

Even if there isn’t a tool that does exactly what you need, you can always enroll in a Python for Data Science course and learn to build your own tools that track the data you need to drive your lead generation.

Performing competitor analysis through data analytics [Source: Freepik]

8. Nurture your leads

Nurturing your leads means developing a personalized relationship with your prospects at every stage of the sales funnel in order to get them to buy your products and become your customers. 

Because lead nurturing offers a personalized approach, you’ll need information about your leads: what is their title, role, industry, and similar info, depending on what your business does. Once you have that, you can provide them with the relevant content that will help them decide to buy your products and build brand loyalty along the way. 

This is something b2b lead generation companies can help you with if you’re hesitant to do it on your own.  

9. Gain more customers

Having an insight into your conversion rate, churn rate, sources of website traffic, and other relevant data will ultimately lead to more customers. For instance, your sales team will be able to calculate which sources convert most effectively and prepare resources before running a campaign. 

The more information you have, the better you’ll perform, and this is exactly why Data Science for Business is important – you’ll be able to see the bigger picture and make better decisions. 

Data analysts performing analysis of customer data

10. Avoid significant losses 

Finally, data can help you avoid certain losses by halting the launch of a product that won’t do well. 

For instance, you can use the Coming soon page to research the market and see if your customers are interested in a new product you planned on launching. If enough people show interest, you can start producing, and if not – you won’t waste your money on something that was bound to fail. 

 

Conclusion:

Applications of data analytics go beyond simple data analysis, especially for advanced analytics projects. The majority of the labour is done up front in the data collection, integration, and preparation stages, followed by the creation, testing, and revision of analytical models to make sure they give reliable findings. Data engineers, who build data pipelines and aid in the preparation of data sets for analysis, are frequently included within analytics teams in addition to data scientists and other data analysts.

November 17, 2022

A hands-on guide to collecting and storing Twitter data for time series analysis

“A couple of weeks back, I was working on a project in which I had to scrape tweets from Twitter and, after storing them in a CSV file, plot some graphs for time series analysis. I requested access to the Twitter developer API, but unfortunately my request was not fulfilled. Then I started searching for Python libraries that would allow me to scrape tweets without the official Twitter API.

To my amazement, there were several libraries through which you can scrape tweets easily, but for my project I found ‘Snscrape’ to be the best library, as it met my requirements!”

What is SNScrape? 

Snscrape is a scraper for social networking services (SNS). It retrieves objects, such as relevant posts, by scraping things like user profiles, hashtags, or searches.

 

Install Snscrape 

Snscrape requires Python 3.8 or higher. The Python package dependencies are installed automatically when you install Snscrape. You can install using the following commands. 

  • pip3 install snscrape 

  • pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git (Development Version) 

 

For this tutorial we will be using the development version of Snscrape. Paste the second command into the command prompt (cmd); make sure you have git installed on your system.

 

Code walkthrough for scraping

Before starting, make sure you have the following Python libraries installed:

  • Pandas 
  • Numpy 
  • Snscrape 
  • Tqdm 
  • Seaborn 
  • Matplotlib

Importing Relevant Libraries 

To run the scraping program, you will first need to import the libraries:

import pandas as pd
import numpy as np
import snscrape.modules.twitter as sntwitter
import datetime
from tqdm.notebook import tqdm_notebook
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

 

 

Taking User Input 

To scrape tweets, you can provide several filters, such as the username, start date, or end date. We will take the following user inputs, which will then be used by Snscrape.

  • Text: The query to be matched. (Optional) 
  • Username: Specific username from twitter account. (Required) 
  • Since: Start Date in this format yyyy-mm-dd. (Optional) 
  • Until: End Date in this format yyyy-mm-dd. (Optional) 
  • Count: Max number of tweets to retrieve. (Required) 
  • Retweet: Include or Exclude Retweets. (Required) 
  • Replies: Include or Exclude Replies. (Required) 

 

For this tutorial we used the following inputs: 

text = input('Enter query text to be matched (or leave it blank by pressing enter)')
username = input('Enter specific username(s) from a twitter account without @ (or leave it blank by pressing enter): ')
since = input('Enter startdate in this format yyyy-mm-dd (or leave it blank by pressing enter): ')
until = input('Enter enddate in this format yyyy-mm-dd (or leave it blank by pressing enter): ')
count = int(input('Enter max number of tweets or enter -1 to retrieve all possible tweets: '))
retweet = input('Exclude Retweets? (y/n): ')
replies = input('Exclude Replies? (y/n): ')

 

Which fields can we scrape?

Here is the list of fields that we can scrape using the Snscrape library.

  • url: str 
  • date: datetime.datetime 
  • rawContent: str 
  • renderedContent: str 
  • id: int 
  • user: ‘User’ 
  • replyCount: int 
  • retweetCount: int 
  • likeCount: int 
  • quoteCount: int 
  • conversationId: int 
  • lang: str 
  • source: str 
  • sourceUrl: typing.Optional[str] = None 
  • sourceLabel: typing.Optional[str] = None 
  • links: typing.Optional[typing.List[‘TextLink’]] = None 
  • media: typing.Optional[typing.List[‘Medium’]] = None 
  • retweetedTweet: typing.Optional[‘Tweet’] = None 
  • quotedTweet: typing.Optional[‘Tweet’] = None 
  • inReplyToTweetId: typing.Optional[int] = None 
  • inReplyToUser: typing.Optional[‘User’] = None 
  • mentionedUsers: typing.Optional[typing.List[‘User’]] = None 
  • coordinates: typing.Optional[‘Coordinates’] = None 
  • place: typing.Optional[‘Place’] = None 
  • hashtags: typing.Optional[typing.List[str]] = None 
  • cashtags: typing.Optional[typing.List[str]] = None 
  • card: typing.Optional[‘Card’] = None 

 

For this tutorial we will not scrape all the fields but a few relevant fields from the above list. 

The search function

Next, we will define a search function which takes the following inputs as arguments and creates a query string to be passed to Snscrape's Twitter search scraper.

  • Text 
  • Username 
  • Since 
  • Until 
  • Retweet 
  • Replies 

 

def search(text,username,since,until,retweet,replies):
    global filename
    q = text
    if username!='':
        q += f" from:{username}"
    if until=='':
        until = datetime.datetime.strftime(datetime.date.today(), '%Y-%m-%d')
    q += f" until:{until}"
    if since=='':
        since = datetime.datetime.strftime(datetime.datetime.strptime(until, '%Y-%m-%d') -
                                           datetime.timedelta(days=7), '%Y-%m-%d')
    q += f" since:{since}"
    if retweet == 'y':
        q += f" exclude:retweets"
    if replies == 'y':
        q += f" exclude:replies"
    if username!='' and text!='':
        filename = f"{since}_{until}_{username}_{text}.csv"
    elif username!="":
        filename = f"{since}_{until}_{username}.csv"
    else:
        filename = f"{since}_{until}_{text}.csv"
    print(filename)
    return q

 

Here we define different conditions and, based on those conditions, build up the query string. For example, if the variable until (end date) is empty, we assign it the current date and append it to the query string; and if the variable since (start date) is empty, we assign it the date seven days before the end date. Along with the query string, we create a filename string which will be used to name our CSV file.

 

 

Calling the Search Function and creating Dataframe 

 

q = search(text,username,since,until,retweet,replies)

# Creating list to append tweet data
tweets_list1 = []

# Using TwitterSearchScraper to scrape data and append tweets to list
if count == -1:
    for i,tweet in enumerate(tqdm_notebook(sntwitter.TwitterSearchScraper(q).get_items())):
        tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username, tweet.lang,
                             tweet.hashtags, tweet.replyCount, tweet.retweetCount, tweet.likeCount, tweet.quoteCount, tweet.media])
else:
    with tqdm_notebook(total=count) as pbar:
        for i,tweet in enumerate(sntwitter.TwitterSearchScraper(q).get_items()):
            if i>=count:  # stop once the requested number of tweets has been scraped
                break
            tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username, tweet.lang, tweet.hashtags, tweet.replyCount,
                                 tweet.retweetCount, tweet.likeCount, tweet.quoteCount, tweet.media])
            pbar.update(1)

# Creating a dataframe from the tweets list above
tweets_df1 = pd.DataFrame(tweets_list1, columns=['DateTime', 'TweetId', 'Text', 'Username', 'Language',
                                                 'Hashtags', 'ReplyCount', 'RetweetCount', 'LikeCount', 'QuoteCount', 'Media'])

 

 

 

In this snippet we invoke the search function and store the query string in the variable q. Next, we define an empty list that will hold the tweet data. If count is specified as -1, the for loop iterates over all the tweets.

The TwitterSearchScraper constructor takes the query string as an argument, and we call its get_items() method to retrieve the tweets. Inside the for loop we append the scraped fields to tweets_list1, which we defined earlier. If count is set to a positive number, we use it to break out of the loop. Finally, we create the pandas dataframe from this list, specifying the column names.

 

tweets_df1.sort_values(by='DateTime',ascending=False) 
Data frame created using the pandas library

 

Data Preprocessing

Before saving the data frame to a CSV file, we will first process the data so that we can easily perform analysis on it.

 

 

Data Description 

tweets_df1.info() 
Data frame created using the pandas library

 

Data Transformation 

Now we will add more columns to facilitate time-series analysis.

tweets_df1['Hour'] = tweets_df1['DateTime'].dt.hour 

tweets_df1['Year'] = tweets_df1['DateTime'].dt.year   

tweets_df1['Month'] = tweets_df1['DateTime'].dt.month 

tweets_df1['MonthName'] = tweets_df1['DateTime'].dt.month_name() 

tweets_df1['MonthDay'] = tweets_df1['DateTime'].dt.day 

tweets_df1['DayName'] = tweets_df1['DateTime'].dt.day_name() 

tweets_df1['Week'] = tweets_df1['DateTime'].dt.isocalendar().week 

 

The DateTime column contains both the date and the time, so it is better to split them into separate columns.

tweets_df1['Date'] = [d.date() for d in tweets_df1['DateTime']] 

tweets_df1['Time'] = [d.time() for d in tweets_df1['DateTime']] 

 

After splitting we will drop the DateTime column. 

tweets_df1.drop('DateTime',axis=1,inplace=True) 

tweets_df1 

 

Finally, our data is prepared. We will now save the dataframe as a CSV using the df.to_csv() function, which takes the filename as an input parameter.

tweets_df1.to_csv(f"{filename}",index=False)

Visualizing timeseries data using barplot, lineplot, histplot and kdeplot 

It is time to visualize our prepared data so that we can find useful insights. First, we will load the saved CSV into a dataframe using the read_csv() function of pandas, which takes the filename as an input parameter.

tweets = pd.read_csv("2018-01-01_2022-09-27_DataScienceDojo.csv") 

tweets 

 

Data frame created using the pandas library

 

Count by Year 

The countplot function of seaborn allows us to plot the count of tweets by year.

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Year']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12) 

 
Plot count of tweets – bar graph

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Year.value_counts()) 

ax.set_xlabel("Year") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(2018,2023,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Year, stat='count', binwidth=1, kde=True, discrete=True)

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Year,fill=True) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Year,fill=True,bw_adjust=3) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

Plot count of tweets per year

 

Count by Month 

We will follow the same steps for count by month, by week, by day of month and by hour. 

 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Month']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12) 

 
Monthly tweet counts – chart

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Month.value_counts()) 

ax.set_xlabel("Month") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(1,13,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Month, stat='count', binwidth=1, kde=True, discrete=True)

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Month,fill=True) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Month,fill=True,bw_adjust=3) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

Monthly tweet counts chart

 

 

Count by Week 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Week']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.005, p.get_height()+5), fontsize = 10) 

 

Weekly tweet counts chart

 

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Week.value_counts()) 

ax.set_xlabel("Week") 

ax.set_ylabel('Count') 

 

plt.subplot(222) 

sns.histplot(x=tweets.Week, stat='count', binwidth=1, kde=True, discrete=True)

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Week,fill=True) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Week,fill=True,bw_adjust=3) 

plt.grid() 

 

plt.tight_layout() 

plt.show()  

 

Weekly tweet counts charts

 

 

Count by Day of Month 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['MonthDay']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+5), fontsize = 12) 

 

 

Daily tweet counts chart
plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.MonthDay.value_counts()) 

ax.set_xlabel("MonthDay") 

ax.set_ylabel('Count') 

 

plt.subplot(222) 

sns.histplot(x=tweets.MonthDay, stat='count', binwidth=1, kde=True, discrete=True)

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.MonthDay,fill=True) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.MonthDay,fill=True,bw_adjust=3) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

 
Daily tweet counts charts

 

 

 

 

 

 

 

Count by Hour 

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Hour']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12) 
Hourly tweet counts chart

 

 

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Hour.value_counts()) 

ax.set_xlabel("Hour") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(0,24,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Hour, stat='count', binwidth=1, kde=True, discrete=True)

plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Hour,fill=True) 

plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Hour,fill=True,bw_adjust=3) 

#plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 

Hourly tweet counts charts

 

 

Conclusion 

From the above time-series visualizations, we can clearly see that the peak tweeting hours for this account are between 7 pm and 9 pm, and that the handle is quiet from 4 am to 1 pm. We can also point out that most of the tweets related to this topic are posted in the month of August. Similarly, we can see that the Twitter handle was not very active before 2021.

In conclusion, we saw how easily we can scrape tweets without the Twitter API using snscrape. We then performed some transformations on the scraped data and stored it in a CSV file. Later, we used that CSV file for time-series visualizations and analysis. We appreciate you following along with this hands-on guide and hope it makes it easy for you to get started on your next data science project.

<<Link to Complete Code>> 

November 16, 2022

Data Science Dojo is offering Metabase for FREE on Azure Marketplace, packaged with a web-accessible Metabase: Open Source server.

Metabase query

 

Introduction 

Organizations often adopt strategies to enhance the productivity of their selling points. One strategy is to use prior business data to identify key patterns for a product and then make decisions about it accordingly. However, this work is hectic, costly, and requires domain experts. Metabase bridges that skill gap: it gives marketing and business professionals an easy-to-use query builder notebook to extract the required data and simultaneously visualize it, without any SQL coding, in just a few clicks.

What is Metabase and what are its questions? 

Metabase is an open-source business intelligence framework that provides a web interface to import data from diverse databases and then analyze and visualize it in a few clicks. The methodology of Metabase is based on questions and the answers to them; these form the foundation of everything else that it provides.

           

A question is any kind of query that you want to run on your data. Once you have specified the query steps in the notebook editor, you can visualize the query results. You can then save the question for reuse and turn it into a data model for business-specific purposes.

Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to become an expert in data science and analytics.

Challenges for businesses  

For businesses that lack expert analysts, engineers, and a substantial IT department, it is costly and time-consuming to hire new domain experts or to have managers learn to code themselves before they can explore and visualize data. Apart from that, not many pre-existing applications provide diverse data source connections, which is also a challenge.

In this regard, a straightforward interactive tool that even newcomers can pick up immediately to get the job done is the ideal solution.

Data analytics with Metabase  

Metabase's concept is based on questions, which are essentially queries, and data models (special saved questions). It provides an easy-to-use notebook through which users can gather raw data, filter it, join tables, summarize information, and add other customizations without writing any SQL.

Users can select columns and dimensions from tables, create various visualizations, and embed them in different sub-dashboards. Metabase is frequently used for pitching business proposals to executive decision-makers because the visualizations are simple to produce from raw data.

 

Figure 1: A visualization on sample data

 

Figure 2: Query builder notebook

 

Major characteristics 

  • Metabase delivers a notebook that enables users to select data, join it with other tables, filter it, and perform other operations just by clicking on options instead of writing a SQL query 
  • For complex queries, users can also use a built-in optimized SQL editor 
  • The choice to select from various data sources like PostgreSQL, MongoDB, Spark SQL, Druid, etc., makes Metabase flexible and adaptable 
  • Under the Metabase admin dashboard, users can troubleshoot the logs for different tasks and jobs 
  • Public sharing can be enabled, allowing admins to create publicly viewable links for questions and dashboards 

What Data Science Dojo has for you  

Metabase instance packaged by Data Science Dojo serves as an open-source easy-to-use web interface for data analytics without the burden of installation. It contains numerous pre-designed visualization categories waiting for data.

It has a query builder which is used to create questions (customized queries) in a few clicks. In our service, users can also use an in-browser SQL editor for complex queries. Any user who wants to identify the impact of their product from raw business data can use this tool.

Features included in this offer:  

  • A rich web interface running Metabase: Open Source 
  • A no-code query building notebook editor 
  • In-browser optimized SQL editor for complex queries 
  • Beautiful interactive visualizations 
  • Ability to create data models 
  • Email configuration and Slack support 
  • Shareability feature 
  • Easy specification for metrics and segments 
  • Feature to download query results in CSV, XLSX and JSON format 

Our instance supports the following major databases: 

  • Druid 
  • PostgreSQL 
  • MySQL 
  • SQL Server 
  • Amazon Redshift 
  • BigQuery 
  • Snowflake 
  • Google Analytics 
  • H2 
  • MongoDB 
  • Presto 
  • Spark SQL 
  • SQLite 

Conclusion  

Metabase is business intelligence software that benefits marketing and product managers. By making it possible to share analytics across teams within an enterprise, Metabase makes it simple for developers to create reports and collaborate on projects. Its responsiveness and processing speed exceed those of a traditional desktop environment because it uses Microsoft cloud services.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Metabase server dedicated specifically to data analytics operations on the Azure Marketplace. Hurry up and install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!

Click on the button below to head over to the Azure Marketplace and deploy Metabase for FREE by clicking on “Get it now”. 


Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

November 5, 2022

Data Science Dojo is offering Countly for FREE on Azure Marketplace packaged with web accessible Countly Server. 

Purpose of product analytics  

Product analytics is a comprehensive collection of mechanisms for evaluating the performance of digital ventures created by product teams and managers. 

Businesses often need to measure the metrics and impact of their products, for example how the audience perceives a product: how many visitors read a particular page or click a specific button. This gives insight into what future decisions need to be made regarding the product. Should it be modified, removed, or kept as it is? Countly makes this work easier by providing a centralized web analytics environment to track user engagement with a product along with monitoring its health.

 

Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to become an expert in data science and analytics.

Challenges for individuals  

Many platforms require developers to write code to visualize analytics, which is not only time-consuming but also comes at a cost. At the application level, an app crash leaves anyone in shock, followed by the hectic, time-consuming task of determining the root cause of the problem. At the corporate level, current and past data need to be analyzed appropriately for the future strength of the company, and that requires robust analysis that anyone can easily perform, which was a challenge faced by many organizations.

Countly analytics 

Countly enables users to monitor and analyze the performance of their applications in real time, irrespective of the platform. It can compile data from numerous sources and present it in a manner that makes it easier for business analysts and managers to evaluate app usage and client behavior. It offers a customizable dashboard with the freedom to innovate and improve your products in order to meet important business and revenue objectives, while also ensuring privacy by design. It is a world leader in product analytics, tracking more than 1.5 billion unique identities on more than 16,000 applications and more than 2,000 servers worldwide.

 

Figure 1: Analytics based on type of technology

 

 

Figure 2: Analytics based on user activity

 

 

Figure 3: Analytics based on views

 

Major characteristics 

  • Interactive web interface: User-friendly web environment with customizable dashboards for easy accessibility along with pre-designed metrics and visualizations 
  • Platform-independent: Supports web analytics, mobile app analytics, and desktop application analytics for macOS and Windows 
  • Alerts and email reporting: Ability to receive alerts based on the metric changes and provides custom email reporting 
  • Users’ role and access manager: Provides global administrators the ability to manage users, groups, and their roles and permissions 
  • Logs Management: Maintains server and audit logs on the web server regarding user actions on data 

What Data Science Dojo has for you  

Countly Server packaged by Data Science Dojo is a web analytics service that delivers insights about your product in real time, whether it is a web application, a mobile app, or even a desktop application, without the burden of installation. It comes with numerous pre-configured metrics and visualization templates to import data and observe trends. It helps businesses identify application usage and determine how clients respond to their apps.

Features included in this offer:  

  • A VM configured with Countly Server: Community Edition accessible from a web browser 
  • Ability to track user analytics, user loyalty, session analytics, technology, and geo insights  
  • Easy-to-use customizable dashboard 
  • Logs manager 
  • Alerting and reporting feature 
  • User permissions and roles manager 
  • Built-in Countly DB viewer 
  • Cache management 
  • Flexibility to define data limits 

Conclusion  

Countly makes it feasible to analyze data in real time. It is highly extensible and has various features to manage different operations like alerting, reporting, logging, and job management. Analytics throughput can be increased by using multiple cores on an Azure Virtual Machine. Countly can also handle applications on different platforms at once; this might slow down the server if you have thousands upon thousands of active client requests across different applications, and CPU and RAM usage may also be affected, but with an Azure Virtual Machine all of these concerns are taken care of.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Countly Server dedicated specifically to data analytics operations on the Azure Marketplace. Hurry up and install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!

Click on the button below to head over to the Azure Marketplace and deploy Countly for FREE by clicking on “Try now”. 


Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

 

October 26, 2022

Get hired as a data analyst by confidently responding to the most frequently asked interview questions. No matter how qualified or experienced you are, stumbling over your thoughts while answering the interviewer can cost you some of your chances of getting on board.

 

Data analyst interview questions – Data Science Dojo

In this blog, you will find the top data analyst interview questions, covering both technical and non-technical areas of expertise.

List of data analyst interview questions 

1. Tell us about your most successful or most challenging data analysis project. 

In this question, you can also share your strengths and weaknesses with the interviewer.   

When answering questions like these, data analysts should aim to share both their strengths and weaknesses. How do you deal with challenges, and how do you measure the success of a data project? You can discuss how you succeeded with your project and what made it successful.

Take a look at the original job description to see if you can incorporate some of the requirements and skills listed. If you are asked the negative version of the question, be honest about what went wrong and what you would do differently in the future to fix the problem. Mistakes are a part of life; what's critical is your ability to learn from them.

Also talk about any SaaS platforms, programming languages, and libraries you used. Why did you choose them, and how did they help you accomplish your goals?

Discuss the entire pipeline of your project, from collecting the data to turning it into valuable insights. Describe the ETL pipeline, including data cleaning, data preprocessing, and exploratory data analysis. What did you learn, what issues did you encounter, and how did you deal with them?

Enroll in Data Science Bootcamp today to begin your journey

2. Tell us about the largest data set you’ve worked with? Or what type of data have you worked with in the past? 

What they’re really asking is: Can you handle large data sets?  

Data sets of varying sizes and compositions are becoming increasingly common in many businesses. Answering questions about data size and variety requires a thorough understanding of the type of data and its nature. What data sets did you handle? What types of data were present? 

It is not necessary to mention only a dataset you worked with on the job. You can also talk about datasets of varying sizes, particularly large ones, that you worked with as part of a data analysis course, bootcamp, certificate program, or degree. As you put together a portfolio, you may also complete some independent projects where you find and analyze a data set. All of this is valid material to build your answer.

The more versatile your experience with datasets, the greater your chances of getting hired.

Read more about several types of datasets here:

32 datasets to uplift your skills in data science

 

3. What is your process for cleaning data? 

The expected answer to this question should include details about how you handle missing data, outliers, duplicate data, etc.

Data analysts are widely responsible for data preparation, also called data cleansing or data cleaning. Organizations expect data analysts to spend a significant amount of time preparing data for analysis. As you answer this question, explain to the employer why data cleaning is so important.

In your answer, give a short description of what data cleaning is and why it’s important to the overall process. Then walk through the steps you typically take to clean a data set. 

 Learn about Data Science Interview Questions and begin your career as a data scientist today.

4. Name some data analytics software you are familiar with. OR what data software have you used in the past? OR What data analytics software are you trained in? 

What they need to know: Do you have basic competency with common tools? How much training will you need? 

Before you appear for the interview, it's a good idea to look at the job listing to see what software is mentioned. As you answer this question, describe how you have used that software, or something similar, in the past. Show your knowledge of the tool by using its terminology.

Mention software solutions you have used across the different phases of data analysis. You don't need to provide a lengthy explanation; stating which data analytics tools you used and for what purpose will satisfy the interviewer.

  

5. What statistical methods have you used in data analysis? OR what is your knowledge of statistics? OR how have you used statistics in your work as a Data Analyst? 

What they’re really asking: Do you have basic statistical knowledge? 

Data analysts should have at least a rudimentary grasp of statistics and know how statistical analysis supports business goals. Organizations look for sound statistical knowledge in data analysts who will handle complex projects. If you have used any statistical calculations in the past, be sure to mention it. If you haven't yet, familiarize yourself with the following statistical concepts:

  • Mean 
  • Standard deviation 
  • Variance
  • Regression 
  • Sample size 
  • Descriptive and inferential statistics 

While speaking of these, share information that you can derive from them. What knowledge can you gain about your dataset? 
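As a quick refresher, here is a minimal sketch, using pandas purely for illustration, of how these basic statistics can be computed on a numeric column; the sales figures below are made-up placeholders.

import pandas as pd

# Hypothetical numeric column, e.g. monthly sales figures
sales = pd.Series([120, 135, 150, 110, 160, 145, 155])

print("Mean:", sales.mean())               # central tendency
print("Standard deviation:", sales.std())  # spread around the mean
print("Variance:", sales.var())            # squared spread
print(sales.describe())                    # full descriptive summary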

Read these amazing 12 Data Analytics books to strengthen your knowledge

12 excellent Data Analytics books you should read in 2022

 

 

6. What scripting languages are you trained in? 

To be a data analyst, you will almost certainly need both SQL and a statistical programming language like R or Python. If you are already proficient in the programming language of your choice at the time of the interview, that's great. If not, you can demonstrate your enthusiasm for learning it.

In addition to your current language expertise, mention how you are developing your skills in other languages. If you plan to complete a programming language course, highlight the details during the interview.

To gain some extra points, do not hesitate to mention why and in which situations SQL is used, and why R and Python are used.
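For instance, a minimal sketch of the kind of contrast you might draw: SQL is typically used to aggregate data directly in the database, while pandas in Python is used for further analysis once the data has been pulled. The table and column names below are hypothetical.

import pandas as pd

# Hypothetical orders data; in practice this might be the result of a SQL query
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0, 25.0],
})

# pandas equivalent of:
# SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
# FROM orders GROUP BY customer_id;
summary = orders.groupby("customer_id")["amount"].agg(order_count="count", total_spent="sum")
print(summary)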

 

7. How can you handle missing values in a dataset? 

This is one of the most frequently asked data analyst interview questions, and the interviewer expects a detailed answer here, not just the names of the methods. There are four common methods to handle missing values in a dataset (a short code sketch follows the list).

  • Listwise Deletion 

In the listwise deletion method, an entire record is excluded from analysis if any single value is missing. 

  • Average Imputation  

Take the average value of the other participants’ responses and fill in the missing value. 

  • Regression Substitution 

You can use multiple-regression analyses to estimate a missing value. 

  • Multiple Imputations 

It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by incorporating random errors in your predictions. 
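As a minimal illustration of the first two approaches, here is a small pandas sketch; the DataFrame and its columns are made up for the example.

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [50000, 42000, np.nan, 61000, 38000],
})

# Listwise deletion: drop any record with at least one missing value
listwise = df.dropna()

# Average (mean) imputation: fill each missing value with the column mean
imputed = df.fillna(df.mean())

print(listwise)
print(imputed)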

 

8. What is Time Series analysis? 

Data analysts are responsible for analyzing data points collected at different time intervals. While answering this question, you also need to talk about the correlation between observations that is evident in time-series data.

Watch this short video to learn in detail:

 

9. What is the difference between data profiling and data mining?

Data profiling examines attributes such as data type, frequency, and length, as well as discrete values and value ranges, which provides valuable information about the data. It also assesses source data to understand its structure and quality through data collection and quality checks.

On the other hand, data mining is a type of analytical process that identifies meaningful trends and relationships in raw data. This is typically done to predict future data. 
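To illustrate the profiling side, a few pandas calls are usually enough to inspect data types, value frequencies, ranges, and missing values; the customer table below is hypothetical.

import pandas as pd

# Hypothetical customer data to profile
customers = pd.DataFrame({
    "country": ["US", "UK", "US", "DE", "US"],
    "age": [23, 35, 41, 29, 52],
})

print(customers.dtypes)                      # data type of each column
print(customers["country"].value_counts())   # frequency of discrete values
print(customers["age"].agg(["min", "max"]))  # value range of a numeric column
print(customers.isna().sum())                # quick data-quality check for missing values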

 

10. Explain the difference between R-Squared and Adjusted R-Squared.

The most vital difference between adjusted R-squared and R-squared is that adjusted R-squared accounts for the number of independent variables tested against the model, while R-squared does not.

An R-squared value is an important statistic for comparing two variables. However, when examining the relationship between a single stock and the rest of the S&P 500, it is important to use adjusted R-squared to account for any discrepancies in correlation.
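The adjustment itself is a standard formula; the sketch below assumes n observations and k independent variables, and the numbers are purely illustrative.

def adjusted_r_squared(r_squared, n, k):
    """Adjust R-squared for the number of independent variables k, given n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Illustrative values: the adjustment penalizes models with more predictors
print(adjusted_r_squared(r_squared=0.85, n=100, k=3))   # roughly 0.845
print(adjusted_r_squared(r_squared=0.85, n=100, k=20))  # noticeably lower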

 

11. Explain univariate, bivariate, and multivariate analysis.

Univariate analysis, the simplest of the three, is used when the data set has only one variable and does not involve causes or effects.

Bivariate analysis, which is more involved than univariate analysis, is used when the data set has two variables and researchers are looking to compare them.

When the data set has more than two variables and researchers are investigating the relationships among them, multivariate analysis is the right type of statistical approach.

 

12. How would you go about measuring the business performance of our company, and what information do you think would be most important to consider?

Before appearing for an interview, make sure you study the company thoroughly and gain enough knowledge about it. This leaves an impression on the employer regarding your interest and enthusiasm to work with them. In your answer, also talk about the added value you will bring to the company by improving its business performance.

 

13. What do you think are the three best qualities that great data analysts share?

List some of the most critical qualities of a data analyst. These may include problem-solving, research, and attention to detail. Apart from these qualities, do not forget to mention soft skills, which are necessary to communicate with team members and across departments.

 

Did we miss any Data Analyst interview questions? 

Share with us in the comments below and help each other to ace the next data analyst job. 

October 24, 2022

Data Science Dojo is offering Apache Superset for FREE on Azure Marketplace packaged with pre-installed SQL lab and interactive visualizations to get started. 

 

What is Business Intelligence?  

 

Business Intelligence (BI) is built on the idea of using information to drive action. It aims to give business leaders actionable insights through data handling and analytics. For instance, a business analyzes its KPIs (Key Performance Indicators) to identify its strengths and weaknesses, so decision-makers can conclude in which departments the organization can work to increase efficiency.

Recently, two elements of BI have produced dramatic improvements in metrics like speed and efficiency. The two elements are:

 

  • Automation  
  • Data Visualization  

 

Apache Superset focuses largely on the latter, which has changed the course of business insights.

 

But what challenges did analysts face before popular exploratory tools like Superset existed?

 

Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to master data science.

 

Challenges of Data Analysts

 

Scalability, framework compatibility, and the absence of business-specific customization were a few of the challenges faced by data analysts. Apart from that, exploring petabytes of data and visualizing it could cause the system to collapse or hang at times.

In these circumstances, a tool that could query data as per business needs and present it in various charts and plots was required. Additionally, a system scalable and elastic enough to handle and explore large volumes of data would be the ideal solution.

 

Data Analytics with Superset  

 

Apache Superset is an open-source tool that equips you with a web-based environment for interactive data analytics, visualization, and exploration. It provides a vast collection of vibrant, interactive visualizations, charts, and tables. Layouts and dashboard elements can be customized, and quick filtering makes it flexible and user-friendly. Apache Superset is extremely beneficial for businesses and researchers who want to identify key trends and patterns in raw data to aid the decision-making process.

 

Video game sales analytics with different visualizations

 

 

It is a powerhouse of SQL: it not only allows connections to several databases but also provides an in-browser SQL editor called SQL Lab.

SQL Lab: a powerful in-browser SQL editor pre-configured for faster querying

 

Key attributes  

 

  • Superset delivers an interactive UI that enriches the plots, charts, and other diagrams. You can customize your dashboard and canvas as per requirement. The hover feature and side-by-side layout make it coherent  
  • An open-source easy-to-use tool with a no-code environment. Drag and drop and one-click alterations make it more user-friendly  
  • Contains a powerful built-in SQL editor to query data from any database quickly  
  • The choice to select from various databases like Druid, Hive, MySQL, SparkSQL, etc., and the ability to connect additional databases makes Superset flexible and adaptable  
  • In-built functionality to create alerts and notifications by setting specific conditions at a particular schedule  
  • Superset provides a section about managing different users and their roles and permissions. It also has a tab for logging the ongoing events  

 

What does Data Science Dojo have for you  

 

The Superset instance packaged by Data Science Dojo serves as a web-accessible, no-code environment with miscellaneous analysis capabilities, without the burden of installation. It includes many sample charts and datasets to get started, and users can customize dashboards and canvases as per business needs.

It comes with drag-and-drop feasibility which makes it user-friendly and easy to use. Users can create different visualizations to detect key trends in any volume of data.  

 

What is included in this offer:  

 

  • A VM configured with a web-accessible Superset application  
  • Many sample charts and datasets to get started  
  • In-browser optimized SQL editor called SQL Lab  
  • User access and roles manager  
  • Alert and report feature  
  • Feasibility of drag and drop  
  • In-built functionality for event logging  

 

Our instance supports the following major databases:  

 

  • Druid  
  • Hive  
  • SparkSQL  
  • MySQL  
  • PostgreSQL  
  • Presto  
  • Oracle  
  • SQLite  
  • Trino  
  • Apart from these, any data engine that has a Python DB-API driver and a SQLAlchemy dialect can be connected  

 

Conclusion  

 

Efficient resource usage when exploring and visualizing large volumes of data was one area of concern with traditional desktop environments. The other was ad-hoc SQL querying of data across different database connections. With our Superset instance, both concerns are put to rest.

When coupled with Microsoft cloud services and their processing speed, it outperforms its traditional counterparts since data-intensive computations aren't performed locally but in the cloud. It has a lightweight semantic layer and a cloud-native architecture.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Superset instance dedicated specifically to Data Science & Analytics on Azure Marketplace. Now hurry up and avail this offer by Data Science Dojo, your ideal companion in your journey to learn data science!  

 

Click on the button below to head over to the Azure Marketplace and deploy Apache Superset for FREE by clicking on “Get it now”. 

 


 

Note: You’ll have to sign up to Azure, for free, if you do not have an existing account. 

 

 

 

 

 

 

 

October 17, 2022

Marketing analytics tells you about the most profitable marketing activities of your business. The more effectively you target the right people with the right approach, the greater value you generate for your business.

However, it is not always clear which of your marketing activities are effective at bringing value to your business. This is where marketing analytics comes in.

It guides you to use the data to evaluate your marketing campaign. It helps you identify which of your activities are effective in engaging with your audience, improving user experience, and driving conversions. 

Grow your business with Data Science Dojo 

 

6 marketing analytics features by Data Science Dojo

Data-driven marketing is imperative for optimizing your campaigns to generate net positive value from all your marketing activities in real time. Without analyzing your marketing data and the customer journey, you cannot identify what you are doing right and what you are doing wrong when engaging with potential customers. The six features listed below can give you the start you need to analyze and optimize your marketing strategy using marketing analytics.

 Learn about marketing analytics tools in this blog

1. Impressions 

In digital marketing, impressions are the number of times any piece of your content, whether an ad, a social media post, or a video, has been shown on a person's screen. However, it is important to remember that impressions are not views: a view is an engagement, counted any time somebody watches your video, while an impression is also counted any time your video appears in the recommended videos on YouTube or in a Facebook newsfeed, regardless of whether they watch it.

Learn more about impressions in this video

 

It is also important to distinguish between impressions and reach. Reach is the number of unique viewers, so for example if the same person views your ad three times, you will have three impressions but a reach of one.  

Impressions and reach are important in understanding how effective your content was at gaining traction. However, these metrics alone are not enough to gauge how effective your digital marketing efforts have been, neither impressions nor reach tell you how many people engaged with your content. So, tracking impressions is important, but it does not specify whether you are reaching the right audience.  

 

2. Engagement rate 

In social media marketing, engagement rate is an important metric. Engagement is when a user comments, likes, clicks, or otherwise interacts with any of your content. Engagement rate is a metric that measures the amount of engagement of your marketing campaign relative to each of the following: 

  • Reach 
  • Post 
  • Impressions  
  • Days
  • Views 

Engagement rate by reach is the percentage of people who chose to interact with the content after seeing it. It is calculated as total engagements divided by reach, multiplied by 100. Reach is a more accurate denominator than follower count, because not all of your brand's followers may see the content, while people who do not follow your brand may still be exposed to it.

Engagement rate by post is the rate at which followers engage with the content. This metric shows how engaged your followers are with your content. However, this metric does not account for organic reach and as your follower count goes up your engagement by post goes down. 

Engagement rate by Impressions is the rate of engagement relative to the number of impressions. If you are running paid ads for your brand, engagement rate by impressions can be used to gauge your ads effectiveness.  

Average Daily engagement rate tells you how much your followers are engaging with your content daily. This is suitable for specific use cases for instance, when you want to know how much your followers are commenting on your posts or other content. 

Engagement rate by views gives the percentage of people who chose to engage with your video after watching them. This metric however does not use unique views so it may double or triple count views from a single user. 
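As a rough sketch, the engagement rate variants described above reduce to simple ratios; the counts below are made-up placeholders, and the formulas are common simplified forms rather than platform-specific definitions.

# Hypothetical campaign numbers, for illustration only
engagements = 450        # likes, comments, clicks, and shares on a post
reach = 9_000            # unique people who saw the post
impressions = 15_000     # total times the post was shown
followers = 12_000       # current follower count
views = 5_000            # video views

rate_by_reach = engagements / reach * 100              # % of reached people who engaged
rate_by_post = engagements / followers * 100           # engagement relative to follower count
rate_by_impressions = engagements / impressions * 100  # engagement relative to impressions
rate_by_views = engagements / views * 100              # % of viewers who engaged

print(f"By reach: {rate_by_reach:.1f}%, by post: {rate_by_post:.1f}%, "
      f"by impressions: {rate_by_impressions:.1f}%, by views: {rate_by_views:.1f}%")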

Learn more about engagement rate in this video

 

3. Sessions 

Sessions are another especially important metric in marketing campaigns that help you analyze engagement on your website. A session is a set of activities by a user within a certain period. For example, a user spent 10 minutes on your website, loading pages, interacting with your content and completed an interaction. All these activities will be recorded in the same 10-minute session.  

In Google Analytics, you can use sessions to check how much time a user spent on your website (session length), how many times they returned to your website (number of sessions), and what interactions users had with your website. Tracking sessions can help you determine how effective your campaigns were in directing traffic towards your website. 

If you have an e-commerce website, another very helpful tool in Google Analytics is behavioral analytics. With behavioral analytics, you see which key actions are driving purchases on your website. The sessions report can be accessed under the conversions tab in Google Analytics. This report can help you understand user behaviors such as abandoned carts, allowing you to target these users with targeted ads or offer incentives to complete their purchase.

Learn more about sessions in this video

 

4. Conversion rate 

Once you have engaged your audience the next step in the customers’ journey is conversion. A conversion is when you make the customer or user complete a specific action. This desired action can be anything from a form submission, purchasing a product or subscribing to a service. The conversion rate is the percentage of visitors who completed the desired action.

So, if you have a form on your website and you want to find out its conversion rate, you simply divide the number of form submissions by the number of visitors to that form's page (total conversions / total interactions).
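In code, that calculation is a one-liner; the numbers below are placeholders.

# Hypothetical numbers for a sign-up form
form_submissions = 38   # total conversions
page_visitors = 950     # total interactions with the form's page

conversion_rate = form_submissions / page_visitors * 100
print(f"Conversion rate: {conversion_rate:.1f}%")  # 4.0% for these numbers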

 

Conversion rate is a very important metric that helps you assess the quality of your leads. While you may generate a large number of leads or visitors, if you cannot get them to perform the desired action you may be targeting the wrong audience. Conversion rate can also help you gauge how effective your conversion strategy is, if you aren’t converting visitors, it might indicate that your campaign needs optimization. 

 

5. Attribution  

Attribution is a sophisticated model that helps you measure which channels are generating the most sales opportunities or conversions. It helps you assign credit to specific touchpoints on the customers journey and understand which touchpoints are driving conversions the most. But how do you know which touchpoint to attribute to a specific conversion?  Well, that depends on which attribution models you are using. There are four common attribution models. 

First touch attribution models assign all the credit to the first touchpoint that drove the prospect to your website. This model focuses on the top of the marketing funnel and tells you what is attracting people to your brand.

Last touch attribution models assign credit to the last touchpoint. It focuses on the last touchpoint the visitor interacted with before they converted. 

Linear attribution model assigns an equal weight to all the touchpoints in the buyer’s journey. 

Time decay attribution is based on how close a touchpoint is to the conversion, with a weighted percentage assigned to the most recent touchpoints. This can be used when the buying cycle is relatively short.

Which model you use depends on what product or subscription you are selling and the length of your buyer cycle. While attribution is very important in identifying the effectiveness of your channels, to get the complete picture you need to look at how each touchpoint drives conversion.
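As a toy sketch of how differently the first three models assign credit to the same journey, consider the touchpoints below; the journey and the credit weights are illustrative only, not a standard library implementation.

# Hypothetical buyer journey, in order of interaction
touchpoints = ["social ad", "email", "webinar", "pricing page"]

def first_touch(tps):
    # All credit to the first touchpoint
    return {tp: (1.0 if i == 0 else 0.0) for i, tp in enumerate(tps)}

def last_touch(tps):
    # All credit to the last touchpoint before conversion
    return {tp: (1.0 if i == len(tps) - 1 else 0.0) for i, tp in enumerate(tps)}

def linear(tps):
    # Equal credit to every touchpoint
    return {tp: 1.0 / len(tps) for tp in tps}

print(first_touch(touchpoints))
print(last_touch(touchpoints))
print(linear(touchpoints))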

 Learn more about attribution in this video

 

6. Customer lifetime value 

Businesses prefer retaining customers over acquiring new ones, and one of the main reasons is that attracting new customers has a cost. The customer acquisition cost (CAC) is the total cost you incur as a business to acquire a customer, calculated by dividing marketing and sales costs by the number of new customers.

Learn more about CLV in this video

 

So, as a business, you must weigh the value of each customer against the associated acquisition cost. This is where customer lifetime value, or CLV, comes in. Customer lifetime value is the total value of a customer to your business over the period of your relationship.

CLV also helps you forecast your revenue: the larger your average CLV, the better your forecasted revenue will be. A simple way to estimate CLV is to multiply the average annual revenue per customer by the average retention period (in years). If your CAC is higher than your CLV, then you are on average losing money on every customer you acquire.
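Putting those two formulas into a quick sketch (the figures are illustrative only):

# Hypothetical annual figures
marketing_and_sales_cost = 120_000
new_customers = 400
annual_revenue_from_customers = 900_000
total_customers = 1_500
average_retention_years = 3

cac = marketing_and_sales_cost / new_customers                    # customer acquisition cost
avg_annual_revenue_per_customer = annual_revenue_from_customers / total_customers
clv = avg_annual_revenue_per_customer * average_retention_years   # customer lifetime value

print(f"CAC: ${cac:,.0f}, CLV: ${clv:,.0f}")
# If CAC exceeds CLV, you are on average losing money on each new customer.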

This presents a huge problem. Metrics like CAC and CLV are very important for driving revenue. They help you identify high-value and low-value customers so you can understand how to serve them better. They also help you make more informed decisions about your marketing efforts and build a healthy customer base.

 

 Integrate marketing analytics into your business 

Marketing analytics is a vast field. There is no one method that suits the needs of all businesses. Using data to analyze and drive your marketing and sales effort is a continuous effort that you will find yourself constantly improving upon. Furthermore, finding the right metrics to track that have a genuine impact on your business activities is a difficult task.

So, this list is by no means exhaustive, however the features listed here can give you the start you need to analyze and understand what actions are important in driving engagement, conversions and eventually value for your business.  

 

September 24, 2022

Data is growing at an exponential rate in the world. It is estimated that the world will generate 181 zettabytes of data by 2025. With this increase, we are also seeing an increase in demand for data-driven techniques and strategies.

According to Forbes, 95% of businesses expressed the need to manage unstructured data as a problem for their business. In fact, Business Analytics vs Data Science is one of the hottest debates among data professionals nowadays.

Many people might wonder – what is the difference between Business Analytics and Data Science? Or which one should they choose as a career path? If you are one of those keep reading to know more about both these fields!

Team working on Business Analytics

First, we need to understand what both these fields are. Let’s take a look. 

What is Business Analytics? 

Business Analytics is the process of deriving insights from business data to inform business decisions. It involves collecting and analyzing data so the business can make better decisions, and the insights it produces help in optimizing processes and improving productivity.

It also helps in identifying potential risks, opportunities, and threats. Business Analytics is an important part of any organization’s decision-making process. It is a combination of different analytical activities like data exploration, data visualization, data transformation, data modeling, and model validation. All of this is done by using various tools and techniques like R programming, machine learning, artificial intelligence, data mining, etc.

Business analytics is a very diverse field that can be used in every industry. It can be used in areas like marketing, sales, supply chain, operations, finance, technology and many more. 

Now that we have a good understanding of what Business Analytics is, let’s move on to Data Science. 

What is Data Science? 

Data science is the process of discovering new information, knowledge, and insights from data. Data scientists apply different machine learning algorithms to any form of data, from numbers and text to images, videos, and audio, to draw various understandings from it. Data science is all about exploring data to identify hidden patterns and making decisions based on them.

It involves implementing the right analytical techniques and tools to transform the data into something meaningful. It is not just about storing data in the database or creating reports about the same. Data scientists collect and clean the data, apply machine learning algorithms, create visualizations, and use data-driven decision-making tools to create an impact on the organization.

Data scientists use tools like programming languages, database management, artificial intelligence, and machine learning to clean, visualize, and explore the data.

Pro tip: Learn more about Data Science for business 

What is the difference between Business Analytics and Data Science? 

Technically, Business analytics is a subset of Data Science. But the two terms are often used interchangeably because of the lack of a clear understanding among people. Let’s discuss the key differences between Business Analytics and Data Science. Business Analytics focuses on creating insights from existing data for making better business decisions.

While Data Science focuses on creating insights from new data by applying the right analytical techniques. Business Analytics is a more established field. It combines several analytical activities like data transformation, modeling, and validation. Data Science is a relatively new field that is evolving every day. Business Analytics is more of a hands-on approach to manage the data whereas Data Science is more focused on the development of the data.

Both the fields also differ a bit in their required skills. Business Analysts mostly use Interpretation, Data visualization, analytical reasoning, statistics, and written communication skills to interpret and communicate their work. Whereas Data Scientists utilize statistical analysis, programming skills, machine learning, calculus and algebra, and data visualization to perform most of their work.

Which should one choose? 

Business analytics is a well-established field, whereas data science is still evolving. If you are inclined towards decisive and logical skills with little or no programming knowledge or computer science skills, you can take up Business Analytics. It is a beginner friendly domain and is easy to catch on to.

But if you are interested in programming and are familiar with machine learning algorithms or even interested in data analysis, you can opt for Data Science. We hope this blog answers your questions about the differences between the two similar and somewhat overlapping fields and helps you make the right data-driven and informed decision for yourself! 

 

September 19, 2022

Looking at the right event metrics not only helps us in gauging the success of the current event but also facilitates understanding the audience’s behavior and preferences for future events.   

Creating, managing, and organizing an event seems like a lot of work and surely it is. The job of an event manager is no doubt a hectic one, and the job doesn’t end once the event is complete. After every event, analyzing it is a crucial task to continuously improve and enhance the experience for your audience and presenters.

In a world completely driven by data, if you are not measuring your events, you are surely missing out on a lot. The questions arise about how to get started and what metrics to look for. The post-Covid world has adopted the culture of virtual events which not only allows the organizers to gather audiences globally but also makes it easier for them to measure it.

There are several platforms and tools available for collecting the data, or if you are hosting it through social media then you can easily use the analytics tool of that channel. You can view our Marketing Analytics videos to better understand the analytical tools and features of each platform. 

Successful event metrics

You can take the assistance of tools and platforms to collect the data but utilizing that data to come up with insightful findings and patterns is a critical task. You need to hear the story your data is trying to tell and understand the patterns in your events.  

Event metrics that you should look at 

1. RSVP to attendance rate 

RSVP is the number of people who sign up for your event (through landing pages or social sites) while attendance rate is the number of people who show up.

Attendance rate

You should expect at least 30% of your RSVPs to actually attend, and if they don't, something is wrong. Possible reasons include:

  • The procedure for joining the event is not provided or clarified 
  • They forgot about the event as they signed up long before 
  • The information provided regarding the event day or date is wrong  

Or any number of other likely reasons. You need to dig into each channel to find out why, because when a person signs up, it shows a clear intent to attend.

2. Retention rate 

A few channels, such as LinkedIn and YouTube, have built-in analytics to gauge retention rate, and you can always integrate third-party tools for other platforms. The retention rate depicts how long your audience stayed in your webinar and the points where they dropped off.

It is usually shown as a line graph with the duration of the webinar on the x-axis and the number of viewers on the y-axis, so you can see how many people were watching at any given moment of the webinar. Through this chart, you can look at the points where you see a drop or a rise in your viewership.

Graph representing retention rate

 

Use-case
For instance, at Data Science Dojo our webinars experienced a huge drop in audience during the first 5 minutes. This was worrisome for the team, so we dug in and conducted a critical analysis of our webinars. We realized it was happening because we usually spend the first 5 minutes waiting for the audience to join, and that is exactly when our existing audience started leaving.

We decided to bring in engaging activities, such as a poll, during those 5 minutes and initiated conversations with our audience directly through chat. This improved our overall retention, as our audience felt more connected and stayed longer. You can explore our webinars here.

3. Demographics of audience 

It is important to know where your audience comes from. To make more targeted decisions in the future, every business must understand its audience demographics and what type of people find its events beneficial.

Understanding the demographics also helps with future events. For example, you can select a time that is viable in your audience's time zone, and you can choose a topic they would be more interested in.

Statistics showing demographic data

The demographics data opens many new avenues for your business, it introduces you to segments of your audience that you might not be targeting already, and you can expand your business. It shows the industries, locations, seniority, and many other crucial factors about your audience.  

By analyzing this data, you can also understand whether your content is attracting the right target audience or not, if not then what kind of audience you are pulling in and whether that’s beneficial for your business or not.  

4. Engagement rate 

Your event might receive a large number of views, but if that audience is not engaging with your content, you should be concerned. The engagement rate depicts how involved your audience is. Today's audience has a lot of distractions, especially when it comes to online events; in that situation, grasping your audience's attention and keeping them involved is a major task.

Audience engagement shown by chat messages

The more engaged the audience is, the higher the chance that they will benefit from the event and come back to you for other services. There are several techniques to keep your audience engaged; you can look up a few engagement activities to build connections.

 Make your event a success with event metrics

On that note, if you have just hosted an event or have an event on your calendar, you know what you need to look at. These metrics will help you continuously improve your event’s quality to match the audience’s expectations and requirements. Planning your strategies based on data will help you stay relevant to your audience and trends.    

September 14, 2022

This blog is based on some exploratory data analysis performed on the corpora provided for the “Spooky Author Identification” challenge at Kaggle.

The Spooky Challenge

A Halloween-based challenge [1] with the following goal: use data analysis to predict which of Edgar Allan Poe, HP Lovecraft, and Mary Wollstonecraft Shelley wrote a given sentence of a possible spooky story.

“Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before.” Edgar Allan Poe

“That is not dead which can eternal lie, And with strange eons, even death may die.” HP Lovecraft

“Life and death appeared to me ideal bounds, which I should first break through, and pour a torrent of light into our dark world.” Mary Wollstonecraft Shelley

The toolset for data analysis

The only tools available to us during this exploration will be our intuition, curiosity, and the selected packages for data analysis. Specifically:

  • tidytext package, text mining for word processing and sentiment analysis using tidy tools
  • tidyverse package, an opinionated collection of R packages designed for data science
  • wordcloud package, pretty word clouds
  • gridExtra package, supporting functions to work with grid graphics
  • caret package, supporting functions for performing stratified random sampling
  • corrplot package, a graphical display of a correlation matrix, confidence interval
# Required libraries
# if packages are not installed
# install.packages("packageName")

library(tidytext)
library(tidyverse)
library(gridExtra)
library(wordcloud)
library(dplyr)
library(corrplot)

The beginning of the exploratory data analysis journey: The Spooky data

We are given a CSV file, the train.csv, containing some information about the authors. The information consists of a set of sentences written by different authors (EAP, HPL, MWS). Each entry (line) in the file is an observation providing the following information:

  • an id, a unique id for the excerpt/sentence (as a string)
  • the text, the excerpt/sentence (as a string)
  • the author, the author of the excerpt/sentence (as a string) – a categorical feature that can assume three possible values: EAP for Edgar Allan Poe, HPL for HP Lovecraft, and MWS for Mary Wollstonecraft Shelley

# loading the data using readr package
spooky_data <- readr::read_csv(file = "./../../../data/train.csv",
                               col_types = "ccc",
                               locale = locale("en"),
                               na = c("", "NA"))

# readr::read_csv does not transform strings into factors;
# as the "author" feature is categorical by nature,
# it is transformed into a factor
spooky_data$author <- as.factor(spooky_data$author)

The overall data includes 19579 observations with 3 features (id, text, author): specifically 7900 excerpts (40.35 %) by Edgar Allan Poe, 5635 excerpts (28.78 %) by HP Lovecraft, and 6044 excerpts (30.87 %) by Mary Wollstonecraft Shelley.

Read about Data Normalization in predictive modeling before analytics in this blog

Avoid the madness!

It is forbidden to use all of the provided spooky data for finding our way through the unique spookiness of each author.

We still want to evaluate how our intuition generalizes on an unseen excerpt/sentence, right?

For this reason, the given training data is split into two parts (using stratified random sampling):

  • an actual training dataset (70% of the excerpts/sentences), used for
    • exploration and insight creation, and
    • training the classification model
  • test dataset (the remaining 30% of the excerpts/sentences), used for
    • evaluation of the accuracy of our model.
# setting the seed for reproducibility
set.seed(19711004)

trainIndex <- caret::createDataPartition(spooky_data$author, p = 0.7, list = FALSE, times = 1)

spooky_training <- spooky_data[trainIndex,]
spooky_testing  <- spooky_data[-trainIndex,]

Specifically 5530 excerpts (40.35 %) of Edgar Allan Poe, 3945 excerpts (28.78 %) of HP Lovecraft, and 4231 excerpts (30.87 %) of Mary Wollstonecraft Shelley.
Moving our first steps: from darkness into the light
Before we start building any model, we need to understand the data, build intuitions about the information contained in the data, and identify a way to use those intuitions to build a great predicting model.

Is the provided data usable?
Question: Does each observation have an id? An excerpt/sentence associated with it? An author?

missingValueSummary <- colSums(is.na(spooky_training))

As we can see from the table below, there are no missing values in the dataset.

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

Some initial facts about the excerpts/sentences

Below we can see, as an example, some of the observations (and excerpts/sentences) available in our dataset.


Question: How many excerpts/sentences are available by author?

no_excerpts_by_author <- spooky_training %>%
  dplyr::group_by(author) %>%
  dplyr::summarise(n = n())

ggplot(data = no_excerpts_by_author,
       mapping = aes(x = author, y = n, fill = author)) +
  geom_col(show.legend = F) +
  ylab(label = "number of excerpts") +
  theme_dark(base_size = 10)
Excerpt graph
Number of excerpts mapped against author-name

Question: How long (# of chars) are the excerpts/sentences by author?

spooky_training$len <- nchar(spooky_training$text)

ggplot(data = spooky_training, mapping = aes(x = len, fill = author)) +
  geom_histogram(binwidth = 50) +
  facet_grid(. ~ author) +
  xlab("# of chars") +
  theme_dark(base_size = 10)
Count graph
Count and number of characters graph
ggplot(data = spooky_training, mapping = aes(x = 1, y = len)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_grid(. ~ author) +
  xlab(NULL) +
  ylab("# of chars") +
  theme_dark(base_size = 10)
characters graph
Number of characters

Some excerpts are very long. As we can see from the boxplot above, there are a few outliers for each author; a possible explanation is that the sentence segmentation has a few hiccups (see details below):

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

For example, Mary Wollstonecraft Shelley (MWS) has an excerpt of around 4600 characters:

“Diotima approached the fountain seated herself on a mossy mound near it and her disciples placed themselves on the grass near her Without noticing me who sat close under her she continued her discourse addressing as it happened one or other of her listeners but before I attempt to repeat her words I will describe the chief of these whom she appeared to wish principally to impress One was a woman of about years of age in the full enjoyment of the most exquisite beauty her golden hair floated in ringlets on her shoulders her hazle eyes were shaded by heavy lids and her mouth the lips apart seemed to breathe sensibility But she appeared thoughtful unhappy her cheek was pale she seemed as if accustomed to suffer and as if the lessons she now heard were the only words of wisdom to which she had ever listened The youth beside her had a far different aspect his form was emaciated nearly to a shadow his features were handsome but thin worn his eyes glistened as if animating the visage of decay his forehead was expansive but there was a doubt perplexity in his looks that seemed to say that although he had sought wisdom he had got entangled in some mysterious mazes from which he in vain endeavoured to extricate himself As Diotima spoke his colour went came with quick changes the flexible muscles of his countenance shewed every impression that his mind received he seemed one who in life had studied hard but whose feeble frame sunk beneath the weight of the mere exertion of life the spark of intelligence burned with uncommon strength within him but that of life seemed ever on the eve of fading At present I shall not describe any other of this groupe but with deep attention try to recall in my memory some of the words of Diotima they were words of fire but their path is faintly marked on my recollection It requires a just hand, said she continuing her discourse, to weigh divide the good from evil On the earth they are inextricably entangled and if you would cast away what there appears an evil a multitude of beneficial causes or effects cling to it mock your labour When I was on earth and have walked in a solitary country during the silence of night have beheld the multitude of stars, the soft radiance of the moon reflected on the sea, which was studded by lovely islands When I have felt the soft breeze steal across my cheek as the words of love it has soothed cherished me then my mind seemed almost to quit the body that confined it to the earth with a quick mental sense to mingle with the scene that I hardly saw I felt Then I have exclaimed, oh world how beautiful thou art Oh brightest universe behold thy worshiper spirit of beauty of sympathy which pervades all things, now lifts my soul as with wings, how have you animated the light the breezes Deep inexplicable spirit give me words to express my adoration; my mind is hurried away but with language I cannot tell how I feel thy loveliness Silence or the song of the nightingale the momentary apparition of some bird that flies quietly past all seems animated with thee more than all the deep sky studded with worlds” If the winds roared tore the sea and the dreadful lightnings seemed falling around me still love was mingled with the sacred terror I felt; the majesty of loveliness was deeply impressed on me So also I have felt when I have seen a lovely countenance or heard solemn music or the eloquence of divine wisdom flowing from the lips of one of its worshippers a lovely animal or even the graceful undulations of trees inanimate objects have excited 
in me the same deep feeling of love beauty; a feeling which while it made me alive eager to seek the cause animator of the scene, yet satisfied me by its very depth as if I had already found the solution to my enquires sic as if in feeling myself a part of the great whole I had found the truth secret of the universe But when retired in my cell I have studied contemplated the various motions and actions in the world the weight of evil has confounded me If I thought of the creation I saw an eternal chain of evil linked one to the other from the great whale who in the sea swallows destroys multitudes the smaller fish that live on him also torment him to madness to the cat whose pleasure it is to torment her prey I saw the whole creation filled with pain each creature seems to exist through the misery of another death havoc is the watchword of the animated world And Man also even in Athens the most civilized spot on the earth what a multitude of mean passions envy, malice a restless desire to depreciate all that was great and good did I see And in the dominions of the great being I saw man reduced?”

Thinking Point: “What do we want to do with those excerpts/outliers?”

Some more facts about the excerpts/sentences using the bag-of-words

The data is transformed into a tidy format (unigrams only) to use the tidy tools to perform some basic and essential NLP operations.

spooky_trainining_tidy_1n <- spooky_training %>%
  select(id, text, author) %>%
  tidytext::unnest_tokens(output = word,
                          input = text,
                          token = "words",
                          to_lower = TRUE)

Each sentence is tokenized into words (normalized to lower case, with punctuation removed). See the example below of how the data (each excerpt/sentence) looked before and after the transformation.

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

Question: Which are the most common words used by each author?

Let’s start by counting how many times each word has been used by each author, and plot the counts.

words_author_1 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "EAP",
                                              greater.than = 500)

words_author_2 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "HPL",
                                              greater.than = 500)

words_author_3 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "MWS",
                                              greater.than = 500)

gridExtra::grid.arrange(words_author_1, words_author_2, words_author_3, nrow = 1)
common words graph
Most common words used by each author

From this initial visualization we can see that the authors quite often use the same set of words – like the, and, of. These words do not give any actual information about the vocabulary used by each author; they are common words that represent just noise when working with unigrams, and they are usually called stopwords.

If the stopwords are removed, using the list of stopwords provided by the tidytext package, it is possible to see that the authors do actually use different words more frequently than others (and it differs from author to author, the author vocabulary footprint).

words_author_1 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "EAP",
                                              greater.than = 70,
                                              remove.stopwords = T)

words_author_2 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "HPL",
                                              greater.than = 70,
                                              remove.stopwords = T)

words_author_3 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "MWS",
                                              greater.than = 70,
                                              remove.stopwords = T)

gridExtra::grid.arrange(words_author_1, words_author_2, words_author_3, nrow = 1)
Data analysis graph
Most common words used comparison between EAP, HPL, and MWS

Another way to visualize the most frequent words by author is to use word clouds. Word clouds make it easy to spot differences: the importance of each word is reflected in its font size and color.

par(mfrow = c(1,3), mar = c(0,0,0,0))

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,
                                           author = "EAP",
                                           remove.stopwords = TRUE)
mypal <- brewer.pal(8, "Spectral")
wordcloud(words = c("EAP", words_author$word),
          freq = c(max(words_author$n) + 100, words_author$n),
          colors = mypal,
          scale = c(7, .5),
          rot.per = .15,
          max.words = 100,
          random.order = F)

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,
                                           author = "HPL",
                                           remove.stopwords = TRUE)
mypal <- brewer.pal(8, "Spectral")
wordcloud(words = c("HPL", words_author$word),
          freq = c(max(words_author$n) + 100, words_author$n),
          colors = mypal,
          scale = c(7, .5),
          rot.per = .15,
          max.words = 100,
          random.order = F)

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,
                                           author = "MWS",
                                           remove.stopwords = TRUE)
mypal <- brewer.pal(8, "Spectral")
wordcloud(words = c("MWS", words_author$word),
          freq = c(max(words_author$n) + 100, words_author$n),
          colors = mypal,
          scale = c(7, .5),
          rot.per = .15,
          max.words = 100,
          random.order = F)
Most common words
Most common words used by authors

From the word clouds, we can infer that EAP loves to use the words time, found, eyes, length, day, etc.; HPL loves to use the words night, time, found, house, etc.; and MWS loves to use the words life, time, love, eyes, etc.

A comparison cloud can be used to compare the different authors. From the R documentation

‘Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents (∑_i p_{i,j} / ndocs). The size of each word is mapped to its maximum deviation (max_i(p_{i,j} − p_j)), and its angular position is determined by the document where that maximum occurs.’

See below the comparison cloud between all authors:

comparison_data <- spooky_trainining_tidy_1n %>%
  dplyr::select(author, word) %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::count(author, word, sort = TRUE)

comparison_data %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "violetred4", "rosybrown1"),
                   random.order = F,
                   scale = c(7, .5),
                   rot.per = .15,
                   max.words = 200)
Comparison cloud
Comparison cloud between authors

Below are the comparison clouds between the authors, taken two at a time.

par(mfrow = c(1,3), mar = c(0,0,0,0))

comparison_EAP_MWS <- comparison_data %>%
  dplyr::filter(author == "EAP" | author == "MWS")

comparison_EAP_MWS %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "rosybrown1"),
                   random.order = F,
                   scale = c(3, .2),
                   rot.per = .15,
                   max.words = 100)

comparison_HPL_MWS <- comparison_data %>%
  dplyr::filter(author == "HPL" | author == "MWS")

comparison_HPL_MWS %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("violetred4", "rosybrown1"),
                   random.order = F,
                   scale = c(3, .2),
                   rot.per = .15,
                   max.words = 100)

comparison_EAP_HPL <- comparison_data %>%
  dplyr::filter(author == "EAP" | author == "HPL")

comparison_EAP_HPL %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "violetred4"),
                   random.order = F,
                   scale = c(3, .2),
                   rot.per = .15,
                   max.words = 100)
Comparison cloud
Comparison cloud between EAP, HPL, and MWS

Question: How many unique words are needed in the author dictionary to cover 90% of the used word instances?

words_cov_author_1 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "EAP")
words_cov_author_2 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "HPL")
words_cov_author_3 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "MWS")

gridExtra::grid.arrange(words_cov_author_1, words_cov_author_2, words_cov_author_3, nrow = 1)
Word coverage plot
Percentage of word instances covered by the unique words of EAP, HPL, and MWS

From the plot above we can see that for the EAP and HPL corpora, we need circa 7500 words to cover 90% of the word instances, while for the MWS corpus, circa 5000 words are needed to cover 90% of the word instances.

Question: Is there any commonality between the dictionaries used by the authors?

Are the authors using the same words? A commonality cloud can be used to answer this specific question: it emphasizes the similarities between authors and plots a cloud showing the common words between the different authors. It shows only those words that are used by all authors, with their combined frequency across authors.

See below the commonality cloud between all authors.

comparison_data <- spooky_trainining_tidy_1n %>%
  dplyr::select(author, word) %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::count(author, word, sort = TRUE)

mypal <- brewer.pal(8, "Spectral")

comparison_data %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  commonality.cloud(colors = mypal,
                    random.order = F,
                    scale = c(7, .5),
                    rot.per = .15,
                    max.words = 200)
Frequency of word usage
Frequency of word usage

Question: Can Word Frequencies be used to compare different authors?

First of all, we need to prepare the data calculating the word frequencies for each author.

word_freqs <- spooky_trainining_tidy_1n %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::count(author, word) %>%
  dplyr::group_by(author) %>%
  dplyr::mutate(word_freq = n / sum(n)) %>%
  dplyr::select(-n)

 

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

Then we need to spread the author (key) and the word frequency (value) across multiple columns (note how NAs have been introduced for words not used by an author).
word_freqs <- word_freqs %>%
  tidyr::spread(author, word_freq)

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

Let’s start by plotting the word frequencies (log scale), comparing two authors at a time, and see how the words are distributed on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts, while words that are far from the line are found more in one set of texts than in the other.
As we can see in the plots below, some words sit close to the line, but most are scattered around it, showing differences between the frequencies.
# Removing incomplete cases - not all words are common for the authors;
# when spreading words to all authors, some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_HPL <- word_freqs %>%
  dplyr::select(word, EAP, HPL) %>%
  dplyr::filter(!is.na(EAP) & !is.na(HPL))

ggplot(data = word_freqs_EAP_vs_HPL, mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "HP Lovecraft", x = "Edgar Allan Poe")

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

# Removing incomplete cases - not all words are common for the authors;
# when spreading words to all authors, some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_MWS <- word_freqs %>%
  dplyr::select(word, EAP, MWS) %>%
  dplyr::filter(!is.na(EAP) & !is.na(MWS))

ggplot(data = word_freqs_EAP_vs_MWS, mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "Edgar Allan Poe")

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

# Removing incomplete cases - not all words are common for the authors;
# when spreading words to all authors, some will get NAs (if not used
# by an author)
word_freqs_HPL_vs_MWS <- word_freqs %>%
  dplyr::select(word, HPL, MWS) %>%
  dplyr::filter(!is.na(HPL) & !is.na(MWS))

ggplot(data = word_freqs_HPL_vs_MWS, mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")

Exploratory data analysis in R | Spooky author identification | Data Science Dojo

In order to quantify how similar or different these sets of word frequencies by author are, we can calculate a correlation measurement between the sets (Spearman’s rank correlation in the plot below). There is a correlation of around 0.48 to 0.5 between the different authors (see plot below).

word_freqs %>%
  select(-word) %>%
  cor(use = "complete.obs", method = "spearman") %>%
  corrplot(type = "lower",
           method = "pie",
           diag = F)
Correlation graph
Correlation between EAP, HPL, and MWS
Get started with R programming with this free course: Beginner R programming course.

References

[1] Kaggle challenge: Spooky Author Identification
[2] “Text Mining with R – A Tidy Approach” by J. Silge & D. Robinson, O’Reilly 2017
[3] “Regular Expressions, Text Normalization, and Edit Distance”, draft chapter by D. Jurafsky & J. H. Martin, 2018

Appendix: Supporting functions

getNoExcerptsFor <- function(x, author){
  sum(x$author == author)
}

getPercentageExcerptsFor <- function(x, author){
  round((sum(x$author == author) / dim(x)[1]) * 100, digits = 2)
}

get_xxx_length <- function(x, author, func){
  round(func(x[x$author == author,]$len), digits = 2)
}

plot_common_words_by_author <- function(x, author, remove.stopwords = FALSE, greater.than = 90){
  the_title = author

  if(remove.stopwords){
    x <- x %>% dplyr::anti_join(stop_words)
  }

  x[x$author == author,] %>%
    dplyr::count(word, sort = TRUE) %>%
    dplyr::filter(n > greater.than) %>%
    dplyr::mutate(word = reorder(word, n)) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    xlab(NULL) +
    ggtitle(the_title) +
    coord_flip() +
    theme_dark(base_size = 10)
}

get_common_words_by_author <- function(x, author, remove.stopwords = FALSE){
  if(remove.stopwords){
    x <- x %>% dplyr::anti_join(stop_words)
  }

  x[x$author == author,] %>%
    dplyr::count(word, sort = TRUE)
}

plot_word_cov_by_author <- function(x, author){
  words_author <- get_common_words_by_author(x, author, remove.stopwords = TRUE)

  words_author %>%
    mutate(cumsum = cumsum(n),
           cumsum_perc = round(100 * cumsum / sum(n), digits = 2)) %>%
    ggplot(mapping = aes(x = 1:dim(words_author)[1], y = cumsum_perc)) +
    geom_line() +
    geom_hline(yintercept = 75, color = "yellow", alpha = 0.5) +
    geom_hline(yintercept = 90, color = "orange", alpha = 0.5) +
    geom_hline(yintercept = 95, color = "red", alpha = 0.5) +
    xlab("no of 'unique' words") +
    ylab("% Coverage") +
    ggtitle(paste("% Coverage unique words -", author, sep = " ")) +
    theme_dark(base_size = 10)
}
sessionInfo()
## R version 3.3.3 (2017-03-06)

## Platform: x86_64-apple-darwin13.4.0 (64-bit)

## Running under: macOS  10.13

##

## locale:

## [1] no_NO.UTF-8/no_NO.UTF-8/no_NO.UTF-8/C/no_NO.UTF-8/no_NO.UTF-8

##

## attached base packages:

## [1] stats     graphics  grDevices utils     datasets  methods   base     

##

## other attached packages:

##  [1] bindrcpp_0.2       corrplot_0.84      wordcloud_2.5     

##  [4] RColorBrewer_1.1-2 gridExtra_2.3      dplyr_0.7.3       

##  [7] purrr_0.2.3        readr_1.1.1        tidyr_0.7.1       

## [10] tibble_1.3.4       ggplot2_2.2.1      tidyverse_1.1.1   

## [13] tidytext_0.1.3    

##

## loaded via a namespace (and not attached):

##  [1] httr_1.3.1         ddalpha_1.2.1      splines_3.3.3     

##  [4] jsonlite_1.5       foreach_1.4.3      prodlim_1.6.1     

##  [7] modelr_0.1.1       assertthat_0.2.0   highr_0.6         

## [10] stats4_3.3.3       DRR_0.0.2          cellranger_1.1.0  

## [13] yaml_2.1.14        robustbase_0.92-7  slam_0.1-40       

## [16] ipred_0.9-6        backports_1.1.0    lattice_0.20-35   

## [19] glue_1.1.1         digest_0.6.12      rvest_0.3.2       

## [22] colorspace_1.3-2   recipes_0.1.0      htmltools_0.3.6   

## [25] Matrix_1.2-11      plyr_1.8.4         psych_1.7.8       

## [28] timeDate_3012.100  pkgconfig_2.0.1    CVST_0.2-1        

## [31] broom_0.4.2        haven_1.1.0        caret_6.0-77      

## [34] scales_0.5.0       gower_0.1.2        lava_1.5          

## [37] withr_2.0.0        nnet_7.3-12        lazyeval_0.2.0    

## [40] mnormt_1.5-5       survival_2.41-3    magrittr_1.5      

## [43] readxl_1.0.0       evaluate_0.10.1    tokenizers_0.1.4  

## [46] janeaustenr_0.1.5  nlme_3.1-131       SnowballC_0.5.1   

## [49] MASS_7.3-47        forcats_0.2.0      xml2_1.1.1        

## [52] dimRed_0.1.0       foreign_0.8-69     class_7.3-14      

## [55] tools_3.3.3        hms_0.3            stringr_1.2.0     

## [58] kernlab_0.9-25     munsell_0.4.3      RcppRoll_0.2.2    

## [61] rlang_0.1.2        grid_3.3.3         iterators_1.0.8   

## [64] labeling_0.3       rmarkdown_1.6      gtable_0.2.0      

## [67] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2    

## [70] R6_2.2.2           lubridate_1.6.0    knitr_1.17        

## [73] bindr_0.1          rprojroot_1.2      stringi_1.1.5     

## [76] parallel_3.3.3     Rcpp_0.12.12       rpart_4.1-11      

## [79] tidyselect_0.2.0   DEoptimR_1.0-8
August 18, 2022

From customer relationship management to tracking analytics, marketing analytics tools are important in the modern world. Learn how to make the most of these tools.

What do you usually find in a toolbox? A hammer, screwdriver, nails, tape measure? If you’re building a bird house, these would be perfect for you, but what if you’re creating a marketing campaign? What tools do you want at your disposal? It’s okay if you can’t come up with any. We’re here to help.

Industry’s leading marketing analytics tools

These days marketing is all about data. Whether it’s a click on an email or an abandoned cart on Amazon, marketers are using data to better cater to the needs of the consumer. To analyze and use this data, marketers have a toolbox of their own.

So what are some of these tools and what do they offer? Here, at Data Science Dojo, we’ve come up with our top 5 marketing analytics tools for success:

Customer relationship management platform (CRM)

CRM is a tool used for managing everything there is to know about the customer. It can track where and when a consumer visits your site, track interactions on your site, and create profiles for leads. A few examples of CRMs are:

HubSpot logo
HubSpot logo

HubSpot, along with the two others listed above, took the idea of a CRM and made it into an all-inclusive marketing resort. Along with the traditional CRM uses, HubSpot can be used to:

  • Manage social media
  • Send mass email campaigns
  • View traffic, campaign, and customer analytics
  • Associate emails, blogs, and social media posts to specific marketing campaigns
  • Create workflows and sequences
  • Connect to your other analytics tools such as Google Analytics, Facebook Ads, YouTube, and Slack.

HubSpot continues its effectiveness by creating reports allowing its users to analyze what is and isn’t working.

This is just a brief description revealing the tip of the iceberg of what HubSpot does. If you want to see below the water line, visit its website.

Search software

Search engine optimization (SEO) is the process of improving a website’s ranking on search engines. It’s how you find everything you have ever searched for on Google. Search software helps marketers analyze how best to optimize websites so potential consumers can find them.

A few search software companies are:

I would love to describe each one of the above businesses, but I only have experience with Moz. Moz focuses on a “less invasive way (of marketing) where customers are earned rather than bought”.

Its entire business is focused on upgrading your SEO. Moz offers 9 different services through its Moz Pro toolkit:

Moz Pro Services
Moz Pro Services

I love Moz Keyword Explorer. This is the tool I use to check different variations of titles, keywords, phrases, and hashtags. It gives four different scores, which you can see in the photo below.

Moz Keyword Explorer
Moz Keyword Explorer

Now, there’s not enough data to show the average monthly volume for my name, but, according to Moz, it wouldn’t be that difficult to rank higher than my competitors, people have a high likelihood of clicking, and the Priority explains that my name is not a “sweet spot” for high volume, low difficulty, and high CTR. In conclusion, using my name as a keyword to optimize the Data Science Dojo Blog isn’t the best idea.

Read more about marketing analytics in this blog

Web analytics service

We can’t talk about marketing tools and not mention web analytics services. These are among the most important pieces of equipment in the marketer’s toolbox. Google Analytics (GA) is a free web analytics service that integrates your company’s website data into a meticulously organized dashboard. I wouldn’t say GA is the be-all and end-all piece of equipment, and there are many other services and tools out there; however, it can’t be refuted that Google Analytics is a great tool to integrate into your company’s marketing strategy.

Some similar Web Analytics Services include:

Google analytics logo
Google Analytics logo

Some of the analytics you’ll be able to understand are:

  • Real-time data – Who’s on your site right now? Where are the users coming from? What pages are they looking at?
  • Audience Information – Where do your users live, age range, interests, gender, new or returning visitor, etc.?
  • Acquisition – Where did they come from (Organic, Direct, Paid Ads, Referrals, Campaigns)? What day/time do they land on your website? What was the final URL they visited before leaving? You can also link to any Google Ads campaigns you have running.
  • Behavior – What is the path people take to convert? How is your site speed? What events took place (Contact form submission, newsletter signup, social media share)?
  • Conversions – Are you attributing conversions by first touch, last touch, linear, or decay?

Understanding these metrics is amazingly effective in narrowing down how users interact with your website.

Another way to integrate Google Analytics into your marketing strategy is by setting up goals. Goals are set up to track specific actions taken on your website. For example, you can set up goals to track purchases, newsletter signups, video plays, live chat, and social media shares.

If you want a more in-depth look at what Google Analytics can offer, you can learn the basics through their Analytics Academy.

marketing analytics tool
Google analysis feedback

Analysis and feedback platform (A&F)

An A&F is another great piece of equipment in the marketer’s toolbox, more specifically for looking at how users interact with your website. One such A&F, HotJar, does this in the form of heatmaps and recordings. HotJar’s integrated tracking pixel allows you to see how far users scroll on your website and which items are clicked the most.

You can also watch recordings of a user’s experience and even filter down to the URL of the page you wish to track, (i.e. /checkout/). This allows you to capture the user’s unique journey until they make a purchase. For each recording, you can view audience information such as geographical location, country, browser, operating system, and a documented list of user actions.

In addition to UX/UI metrics, you can also integrate polls and forms on your website for more intricate data about your users.

As a marketing manager, these tools help to visualize all of my data in ways that a pivot table can’t display. And while I am a genuine user of these platforms, I must admit that it’s not the tool that makes the man, it’s the strategy. To get the most use out of these platforms, you will need to understand what business problem you are trying to solve and what metrics are important to you.

There is a lot of information that these dashboards can provide you. However, it’s up to you to filter through the noise. Not every accessible metric applies to you, so you will need to decide what is the most important for your marketing plan.

A few similar platforms include:

Experimentation platforms

Experimentation platforms are software for experimenting with different variations of a sample. Their purpose is to run A/B tests, something HubSpot does, but these platforms dive head first into them.

Experimentation Platforms
Experimentation Platforms

Where HubSpot only tests versions A and B, experimentation platforms let you test versions A, B, C, D, E, F, etc. They don’t just test the different versions, they will also test different audiences and how they respond to each test version. Searching “definition experimentation platforms” is a good place to start in understanding what experimentation platforms are. I can tell you they are a dream come true for marketers who love to get their hands dirty in behavioral targeting.

Optimizely is one such example of a company offering in-depth A/B testing. Optimizely’s goal is to let you spend more time experimenting with the customer experience and less time wading through statistics to learn what works and what doesn’t. If you are unsure what to do, you can test it with Optimizely.

Using companies like Optimizely or Split is just one way to experiment. Many name-brand companies like Netflix, Microsoft, eBay, and Uber have all built their own experimentation platforms to use internally.

Not perfect

No one toolbox is perfect, and everyone is going to be different. One piece of advice I can give is to always understand the problem before deciding which tool is best to solve the problem. You wouldn’t use a hammer to do a job where a drill would be more effective, right?

Top 5 marketing analytics tools for success | Data Science Dojo

You could, it just wouldn’t be the most efficient method. The same concept goes for marketing. Understanding the problem will help you know which tools should be in your toolbox.

August 18, 2022

Data Science Dojo has launched one of the most in-demand data analytics software, Redash as a virtual machine offer on the Azure Marketplace.

Introduction

With the rising complexity of data, organizations must have complete control over it. In some use cases, analysts face hindrances, especially when working internally with a dedicated team that requires unlimited access to information. A solution is needed to perform data-driven tasks efficiently and extract actionable insights.

What is Redash?

Redash, a data analytics tool, helps organizations become more data-driven by providing tools to democratize data access. It simplifies the creation of dashboards and visualizations of your data by connecting to any data source. 

Data analysis with Redash

As a business intelligence tool, it has more powerful integration capabilities than other data analytics platforms, making it a favorite among businesses that have implemented a variety of apps to manage their business processes. Similarly, reviewers have found it to be more user-friendly, manageable, and business-friendly than other platforms.

PRO TIP: Join our Data Science Bootcamp to learn more about data analytics.

analytics graphs
Data Analytics with Redash

Key features of Redash

  • It offers a user-friendly graphical user interface to carry out complex tasks with a few clicks.
  • Allows users to deal with small as well as big data, it supports many SQL and NoSQL databases.
  • The Query Editor allows users to query the database by utilizing the Schema Browser and autocomplete features.
  • Users can utilize the drag-and-drop feature to build visualizations (like charts, boxplot, cohort, counter, etc.) and then merge them into a single dashboard.
  • Enables peer evaluation of reports and searches and makes it simple for users to share visualizations and the queries that go with them.
  • Allows charts and dashboards to be updated automatically at defined time intervals.

Redash with Azure Services

It leverages the power of Azure services to make integration with data sources quick. Write SQL queries to pull subsets of data for visualizations, plot different charts, and share dashboards within the organization with greater ease.

Conclusion

Other open-source business intelligence solutions pose strong competition to Redash. Deciding to invest in a business intelligence and data analysis tool can be challenging because all corporate departments, including product, finance, marketing, and others, now use multiple platforms to carry out day-to-day operations and analytics tasks to strengthen their control over data.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We, therefore, know the importance of data and the insights it encapsulates. Through this offer, we are confident that you can analyze, visualize, and query your data in a collaborative environment with greater ease. Install the Redash offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Try Redash!

August 16, 2022

All of these written texts are unstructured; text mining algorithms and techniques work best on structured data.

Text analytics for machine learning: Part 1

Have you ever wondered how Siri can understand English? How can you type a question into Google and get what you want?

Over the next week, we will release a five-part blog series on text analytics that will give you a glimpse into the complexities and importance of text mining and natural language processing.

This first section discusses how text is converted to numerical data.

In the past, we have talked about how to build machine learning models on structured data sets. However, life does not always give us data that is clean and structured. Much of the information generated by humans has little or no formal structure: emails, tweets, blogs, reviews, status updates, surveys, legal documents, and so much more. There is a wealth of knowledge stored in these kinds of documents which data scientists and analysts want access to. “Text analytics” is the process by which you extract useful information from text.

Some examples include:

All these written texts are unstructured; machine learning algorithms and techniques work best (or often, work only) on structured data. So, for our machine learning models to operate on these documents, we must convert the unstructured text into a structured matrix. Usually this is done by transforming each document into a sparse matrix (a big but mostly empty table). Each word gets its own column in the dataset, which tracks whether a word appears (binary) in the text OR how often the word appears (term-frequency). For example, consider the two statements below. They have been transformed into a simple term frequency matrix. Each word gets a distinct column, and the frequency of occurrence is tracked. If this were a binary matrix, there would only be ones and zeros instead of a count of the terms.
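
To make this concrete, below is a minimal sketch of how a couple of short documents could be turned into a term-frequency matrix with the tidytext and tidyr packages (the same packages used elsewhere on this blog). The two sentences are illustrative stand-ins, not the exact statements from the figure.

# A minimal sketch: turn two short "documents" into a term-frequency matrix.
# The example sentences are illustrative.
library(dplyr)
library(tidyr)
library(tidytext)

docs <- data.frame(doc_id = c(1, 2),
                   text = c("The team won the game",
                            "The team lost"),
                   stringsAsFactors = FALSE)

term_freq <- docs %>%
  tidytext::unnest_tokens(output = word, input = text) %>%  # one row per word, lower-cased
  dplyr::count(doc_id, word) %>%                            # term frequencies per document
  tidyr::spread(word, n, fill = 0)                          # one column per distinct word

# Result: one row per document, one column per distinct word
#   doc_id game lost team the won
#        1    1    0    1   2   1
#        2    0    1    1   1   0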

Make words usable for machine learning

Text Mining

Why do we want numbers instead of text? Most machine learning algorithms and data analysis techniques assume numerical data (or data that can be ranked or categorized). Similarity between documents is calculated by determining the distance between the frequency of words. For example, if the word “team” appears 4 times in one document and 5 times in a second document, they will be calculated as more similar than a third document where the word “team” only appears once.
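
To see the distance idea at work, here is a tiny, hypothetical illustration: three documents described only by how often the word “team” appears. Real systems compare full term-frequency vectors and often use cosine similarity rather than plain Euclidean distance.

# Hypothetical term frequencies for the word "team" in three documents
doc_tf <- rbind(doc1 = c(team = 4),
                doc2 = c(team = 5),
                doc3 = c(team = 1))

# Pairwise Euclidean distances: doc1 and doc2 (counts 4 and 5) are closer
# to each other than either is to doc3 (count 1)
dist(doc_tf, method = "euclidean")
#      doc1 doc2
# doc2    1
# doc3    3    4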

 

Clusters
Sample clusters

Text mining: Build a matrix

While our example was simple (6 words), term frequency matrices on larger datasets can be tricky.

Imagine turning every word in the Oxford English dictionary into a matrix, that’s 171,476 columns. Now imagine adding everyone’s names, every corporation or product or street name that ever existed. Now feed it slang. Feed it every rap song. Feed it fantasy novels like Lord of the Rings or Harry Potter so that our model will know what to do when it encounters “The Shire” or “Hogwarts.” Good, now that’s just English. Do the same thing again for Russian, Mandarin, and every other language.

After this is accomplished, we are approaching a matrix with several billion columns, and two problems arise. First, it becomes computationally infeasible and memory-intensive to perform calculations over this matrix. Second, the curse of dimensionality kicks in and distance measurements become so absurdly large in scale that they all seem the same. Most of the research and time that goes into natural language processing is less about the syntax of language (which is important) and more about how to reduce the size of this matrix.
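
One common mitigation, already hinted at by the “big but mostly empty table” description above, is to store the matrix in sparse form so that the zero cells take up essentially no memory. A minimal sketch using tidytext::cast_sparse, again on an illustrative toy corpus:

# Sketch: cast tidy word counts into a sparse document-term matrix
library(dplyr)
library(tidytext)

docs <- data.frame(doc_id = c(1, 2),
                   text = c("The team won the game", "The team lost"),
                   stringsAsFactors = FALSE)

word_counts <- docs %>%
  tidytext::unnest_tokens(word, text) %>%
  dplyr::count(doc_id, word)

# cast_sparse keeps only the non-zero cells in memory
sparse_dtm <- tidytext::cast_sparse(word_counts, doc_id, word, n)
dim(sparse_dtm)    # 2 documents x 5 distinct words
class(sparse_dtm)  # "dgCMatrix", a sparse matrix class from the Matrix package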

Now that we know what we must do and the challenges we face to reach our desired result, the next three blogs in the series will address these problems directly. We will introduce you to three concepts: conforming, stemming, and stop word removal.

Want to learn more about text mining and text analytics?

Check out our short video on our data science bootcamp curriculum page OR watch our video on tweet sentiment analysis.

June 15, 2022

Develop an understanding of text analytics, text conforming, and special character cleaning. Learn how to make text machine-readable.

Text analytics for machine learning: Part 2

Last week, in part 1 of our text analytics series, we talked about text processing for machine learning. We wrote about how we must transform text into a numeric table, called a term frequency matrix, so that our machine learning algorithms can apply mathematical computations to the text. However, we found that our textual data requires some data cleaning.

In this blog, we will cover the text conforming and special character cleaning parts of text analytics.

Understand how computers read text

The computer sees text differently from humans. Computers cannot see anything other than numbers. Every character (letter) that we see on a computer is actually a numeric representation, with the mapping between numbers and characters determined by an “encoding table.” The simplest, and most common, encoding in text analytics is ASCII. A small sample ASCII table is shown to the right.

ASCII Code

To the left is a look at six different ways the word “CAFÉ” might be encoded in ASCII. The word on the left is what the human sees and its ASCII representation (what the computer sees) is on the right.

Any human would know that this is just six different spellings for the same word, but to a computer these are six different words. These would spawn six different columns in our term-frequency matrix. This will bloat our already enormous term-frequency matrix, as well as complicate or even prevent useful analysis.
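
To see what the computer sees, you can inspect the numeric code points behind each character directly. A small illustrative sketch in base R (not the exact six variants from the figure):

# What the computer "sees": numeric codes behind each character.
# Visually similar spellings of the same word map to different numbers.
utf8ToInt("CAFE")   # 67 65 70 69
utf8ToInt("CAFÉ")   # 67 65 70 201   (accented É has a different code than E)
utf8ToInt("cafe")   # 99 97 102 101  (lowercase letters are different again)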

 

ASCII Representation

Unify words with the same spelling

To unify the six different “CAFÉ’s”, we can perform two simple global transformations.

Casing: First we must convert all characters to the same casing, uppercase or lowercase. This is a common enough operation; most programming languages have a built-in function that converts all characters in a string to either lowercase or uppercase. We can choose either global lowercasing or global uppercasing; it does not matter as long as it’s applied globally.

String normalization: Second, we must convert all accented characters to their unaccented variants. This is often called Unicode normalization, since accented and other special characters are usually encoded using the Unicode standard rather than the ASCII standard. Not all programming languages have this feature out of the box, but most have at least one package which will perform this function.

Note that implementations vary, so you should not mix and match Unicode normalization packages. What kind of normalization you do is highly language dependent, as characters which are interchangeable in English may not be in other languages (such as Italian, French, or Vietnamese).
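
A minimal sketch of both transformations, assuming the stringi package for the accent-stripping step (one common choice, not necessarily the one used here; base R’s iconv can do something similar):

# Unify casing and strip accents so all the "CAFÉ" variants collapse into one token
library(stringi)

variants <- c("CAFÉ", "Café", "café", "CAFE", "cafe", "Cafe")

lowered    <- tolower(variants)                                   # global lowercasing
normalized <- stringi::stri_trans_general(lowered, "Latin-ASCII") # é -> e, etc.

unique(normalized)
# [1] "cafe"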

Remove special characters and numbers

The next thing we have to do is remove special characters and numbers. Numbers rarely carry useful meaning; examples of such irrelevant numbers include footnote numbering and page numbering. Special characters, as discussed in the string normalization section, have a habit of bloating our term-frequency matrix. For instance, representing a quotation mark has been a pain point since the beginning of computer science.

Unlike a letter, which may only be capital or not capital, quotation marks have many popular representations. A quotation character has three main properties: curly, straight, or angled; left or right; single, double, or triple. Depending on the text analytics encoding used, not all of these may exist.

ASCII Quotations
Properties of quotation characters

The table below shows how quoting the word “café” in both straight quote and left-right quotes would look in a UTF-8 table in Arial font.

UTF 8 Form

Avoid over-cleaning

The problem is further complicated by each individual font, operating system, and programming language since implementation of the various encoding standards is not always consistent. A common solution is to simply remove all special characters and numeric digits from the text. However, removing all special characters and numbers can have negative consequences.
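
The blunt form of that common solution fits in a single regular expression. A minimal illustrative sketch (which, as discussed below, should be applied selectively):

# Remove everything that is not a lowercase letter or a space.
# As discussed below, this can strip meaning (emoticons, road names like "I-405"),
# so it should be applied selectively.
text <- 'She said: "I hate the I-405" >:('

gsub("[^a-z ]", "", tolower(text))
# [1] "she said i hate the i "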

There is such a thing as too much data cleaning when it comes to text analytics. The more we clean and remove, the more “lost in translation” the textual message may become. We may inadvertently strip information or meaning from our messages, so that by the time our machine learning algorithm sees the textual data, much or all of the relevant information has been stripped away.

For each type of cleaning above, there are situations in which you will want to either skip it altogether or selectively apply it. As in all data science situations, experimentation and good domain knowledge are required to achieve the best results.

When do we want to avoid over-cleaning in your text analytics?

Special characters: The advent of email, social media, and text messaging have given rise to text-based emoticons represented by ASCII special characters.

For example, if you were building a sentiment predictor for text, text-based emoticons like “=)” or “>:(” are very indicative of sentiment because they directly express happiness or sadness. Stripping our messages of these emoticons by removing special characters also strips meaning from our message.

Numbers: Consider the infinitely gridlocked freeway in Washington state, “I-405.” In a sentiment predictor model, anytime someone talks about “I-405,” more likely than not the document should be classified as “negative.” However, by removing numbers and special characters, the word now becomes “I”. Our models will be unable to use this information, which, based on domain knowledge, we would expect to be a strong predictor.

Casing: Even casing can carry useful information sometimes. For instance, the word “trump” may carry a different sentiment than “Trump” with a capital T, representing someone’s last name.

One solution to filter out proper nouns that may contain information is named entity recognition, where we use a combination of predefined dictionaries and scanning of the surrounding syntax (sometimes called “lexical analysis”). Using this, we can identify people, organizations, and locations.

Next, we’ll talk about stemming and lemmatization as ways to help computers understand that different versions of words can have the same meaning (e.g., run, running, runs).

Learn more

Want to learn more about text analytics? Check out the short video on our curriculum page OR

June 15, 2022
