Data is an essential component of any business, and it is the role of a data analyst to make sense of it all. Power BI is a powerful data visualization tool that helps analysts turn raw data into meaningful insights and actionable decisions.
In this blog, we will explore the role of data analysts and how they use Power BI to extract insights from data and drive business success. From data discovery and cleaning to report creation and sharing, we will delve into the key steps that can be taken to turn data into decisions.
Who is a data analyst?
A data analyst is a professional who works with data to extract insights, draw conclusions, and support decision-making. They process and analyze large datasets to identify trends, patterns, and relationships, using a variety of tools and techniques to clean, transform, visualize, and analyze data. The role of a data analyst is to turn raw data into actionable information that can inform and drive business strategy.
They use various tools and techniques to extract insights from data, such as statistical analysis and data visualization. They may also work with databases and programming languages such as SQL and Python to manipulate and extract data.
Data analysts matter to an organization because they enable data-driven decisions. By analyzing data, analysts can identify new opportunities, optimize processes, and improve overall performance. They provide insights into customer behavior, market trends, and other key metrics.
Additionally, their work can help organizations stay competitive by identifying areas where they may be lagging and providing recommendations for improvement.
Defining Power BI
Power BI provides a suite of data visualization and analysis tools to help organizations turn data into actionable insights. It allows users to connect to a variety of data sources, perform data preparation and transformations, create interactive visualizations, and share insights with others.
The platform includes features such as data modeling, data discovery, data analysis, and interactive dashboards. It enables organizations to quickly create and share visualizations, reports, and dashboards with stakeholders, regardless of their technical skill level.
Power BI also provides collaboration features, allowing team members to work together on data insights, and share information and insights with others through Power BI reports and dashboards.
Key capabilities of Power BI
Data Connectivity: It allows users to connect to various data sources, including Excel, SQL Server, Azure SQL, and other cloud-based data sources.
Data Transformation: It provides a wide range of data transformation tools that allow users to clean, shape, and prepare data for analysis.
Visualization: It offers a wide range of visualization options, including charts, tables, and maps, that allow users to create interactive and visually appealing reports.
Sharing and Collaboration: It allows users to share and collaborate on reports and visualizations with others in their organization.
Mobile Access: It also offers mobile apps for iOS and Android that allow users to access and interact with their data on the go.
How does a data analyst use Power BI?
A data analyst uses Power BI to collect, clean, transform, visualize, and analyze data to turn it into meaningful insights and decisions. The following steps outline the process of using Power BI for data analysis:
Connect to data sources: A data analyst can import data from a variety of sources, such as spreadsheets, databases, or cloud-based services. Power BI provides several ways to import data, including manual upload, data connections, and direct connections to data sources.
Clean and transform data: Before data can be analyzed, it often needs to be cleaned and prepared. This may include removing any extraneous information, correcting errors or inconsistencies, and transforming data into a format that is usable for analysis.
Create visualizations: Once the data has been prepared, a data analyst can use Power BI to create visualizations of the data. This may include bar charts, line graphs, pie charts, scatter plots, and more. Power BI provides a range of built-in visualizations and the ability to create custom ones, giving data analysts many options for presenting data.
Perform data analysis: Power BI provides a range of data analysis tools, including calculated fields and measures, and the DAX language, which allows data analysts to perform more advanced analysis. These tools allow them to uncover insights and trends that might not be immediately apparent.
Collaborate and share insights: Once insights have been uncovered, data analysts can share their findings with others through Power BI reports or dashboards. These reports provide a way to present data visualizations and analysis results to stakeholders and can be published and shared with others.
By following these steps, a data analyst can use Power BI to turn raw data into meaningful insights and decisions that can inform business strategy and decision-making.
Why should you use data analytics with Power BI?
User-friendly interface – Power BI has a user-friendly interface, which makes it easy for users with little to no technical skills to create and share interactive dashboards, reports, and visualizations.
Real-time data visualization – It provides real-time data visualization, allowing users to analyze data in real time and make quick decisions.
Integration with other Microsoft tools – Power BI integrates seamlessly with other Microsoft tools, such as Excel, SharePoint, and Azure, making it an ideal tool for organizations using Microsoft technology.
Wide range of data sources – It can connect to a wide range of data sources, including databases, spreadsheets, cloud services, and web APIs, making it easy to consolidate data from multiple sources.
Cost-effective – It is a cost-effective solution for data analytics, with both free and paid versions available, making it accessible to organizations of all sizes.
Mobile accessibility – Power BI provides mobile accessibility, allowing users to access and analyze data from anywhere, on any device.
Collaboration features – With robust collaboration features, it allows users to share dashboards and reports with other team members, encouraging teamwork and decision-making.
Conclusion
In conclusion, Power BI is a powerful tool for data analysis that gives organizations the ability to easily visualize, analyze, and share complex data. By preparing, cleaning, and transforming data, creating relationships between tables, and using visualizations and DAX, analysts can create reports and dashboards that provide valuable insights into key business metrics.
The ability to publish reports, share insights, and collaborate with others makes Power BI an essential tool for any organization looking to improve performance and make informed decisions.
An overview of data analysis: its methods, its process, and its implications for modern corporations.
Studies show that 73% of corporate executives believe that companies failing to use data analysis on big data lack long-term sustainability. While data analysis can guide enterprises to make smart decisions, it can also be useful for individual decision-making.
Let’s consider an example of using data analysis at an intuitive, individual level. As consumers, we are always choosing between products offered by multiple companies. These decisions, in turn, are guided by individual past experiences. Every individual analyzes the data obtained via their experience to reach a final decision.
Put more concretely, data analysis involves sifting through data, modeling it, and transforming it to yield information that guides strategic decision-making. For businesses, data analytics can provide highly impactful decisions with long-term yield.
So, let’s dive deep and look at how data analytics tools can help businesses make smarter decisions.
The data analysis process
The process includes five key steps:
1. Identify the need
Companies use data analytics for strategic decision-making regarding a specific issue. The first step, therefore, is to identify the particular problem. For example, a company decides it wants to reduce its production costs while maintaining product quality. To do so effectively, the company would need to identify the step(s) of the workflow pipeline where it should implement cost cuts.
Similarly, the company might also have a hypothetical solution to its question. Data analytics can be used to test that hypothesis, allowing the decision-maker to reach an optimized solution.
A specific question or hypothesis determines the subsequent steps of the process. Hence, this must be as clear and specific as possible.
2. Collect the data
Once the data analysis need is identified, it determines the kind of data to collect. Data comes in different types and formats; one broad classification, based on structure, distinguishes structured from unstructured data.
Structured data, for example, is the data a company obtains from its users via internal data acquisition methods such as marketing automation tools. It follows the familiar row-column database format and is suited to the company’s exact needs.
Unstructured data, on the other hand, need not follow any such formatting. It is obtained via third parties such as Google Trends, census bureaus, world health bureaus, and so on. Structured data is easier to work with as it’s already tailored to the company’s needs. However, unstructured data can provide a significantly larger data volume.
There are many other data types to consider as well. For example, metadata, big data, real-time data, and machine data.
3. Clean the data
The third step, data cleaning, ensures that error-free data is used for the data analysis. This step includes procedures such as formatting data correctly and consistently, removing any duplicate or anomalous entries, dealing with missing data, and fixing cross-set data errors.
Performing these tasks manually is tedious, so various tools exist to streamline the data-cleaning process. These include open-source data tools such as OpenRefine, desktop applications like Trifacta Wrangler, cloud-based software-as-a-service (SaaS) offerings like TIBCO Clarity, and other data management tools such as IBM InfoSphere QualityStage, used especially for big data.
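For intuition, the core cleaning tasks, formatting consistently, removing duplicates, and dealing with missing data, can be sketched in plain Python (in practice a tool like OpenRefine or pandas would do this at scale); the records and field names here are hypothetical:

```python
from datetime import datetime

# Hypothetical raw records: inconsistent date formats, a duplicate, a missing value
raw = [
    {"id": 1, "date": "2024-01-05", "revenue": "1200"},
    {"id": 2, "date": "05/01/2024", "revenue": ""},      # missing revenue
    {"id": 1, "date": "2024-01-05", "revenue": "1200"},  # duplicate of the first record
]

def normalize_date(value):
    """Format dates consistently as ISO 8601, trying each known input format."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable dates are flagged rather than guessed

seen, cleaned = set(), []
for row in raw:
    if row["id"] in seen:  # remove duplicate entries
        continue
    seen.add(row["id"])
    cleaned.append({
        "id": row["id"],
        "date": normalize_date(row["date"]),
        "revenue": float(row["revenue"]) if row["revenue"] else None,  # flag missing data
    })

print(cleaned)
```

The same dedupe-normalize-flag pattern scales up regardless of which cleaning tool performs it.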
4. Perform data analysis
Data analysis includes several methods as described earlier. The method to be implemented depends closely on the research question to be investigated. Data analysis methods are discussed in detail later in this blog.
5. Present the results
The final step determines how well the results are communicated. Visualization tools such as charts, images, and graphs convey findings effectively, establishing visual connections in the viewer’s mind. These tools emphasize patterns discovered in existing data and shed light on predicted patterns, aiding interpretation of the results.
Data analysis methods
Data analysts use a variety of approaches, methods, and tools to deal with data. Let’s sift through these methods from an approach-based perspective:
1. Descriptive analysis
Descriptive analysis involves categorizing and presenting broader datasets in a way that allows emergent patterns to be observed. Data aggregation techniques are one way of performing descriptive analysis: the data is first collected, then sorted to ease manageability.
This can also involve performing statistical analysis on the data to determine, say, the measures of frequency, dispersion, and central tendencies that provide a mathematical description for the data.
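The measures of frequency, dispersion, and central tendency mentioned above can be computed directly with Python's standard library; the daily order counts below are hypothetical:

```python
import statistics
from collections import Counter

# Hypothetical daily order counts over two weeks
orders = [12, 15, 11, 15, 18, 22, 15, 13, 17, 15, 19, 21, 14, 16]

# Measures of central tendency
mean = statistics.mean(orders)
median = statistics.median(orders)
mode = statistics.mode(orders)

# Measures of dispersion
stdev = statistics.stdev(orders)    # sample standard deviation
spread = max(orders) - min(orders)  # range

# Measures of frequency
freq = Counter(orders)

print(f"mean={mean:.2f} median={median} mode={mode} stdev={stdev:.2f} range={spread}")
print("most common value:", freq.most_common(1)[0])
```

Together these numbers give the mathematical description of the dataset that descriptive analysis aims for.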
2. Exploratory analysis
Exploratory analysis involves consulting various data sets to see how certain variables may be related, or how certain patterns may be driving others. This analytic approach is crucial in framing potential hypotheses and research questions that can be investigated using data analytic techniques.
Data mining, for example, requires data analysts to use exploratory analysis to sift through big data and generate hypotheses to be tested.
3. Diagnostic analysis
Diagnostic analysis is used to answer why a particular pattern exists in the first place. For example, this kind of analysis can assist a company in understanding why its product is performing in a certain way in the market.
Diagnostic analytics includes methods such as hypothesis testing, distinguishing correlation from causation, and diagnostic regression analysis.
4. Predictive analysis
Predictive analysis answers the question of what will happen. This type of analysis is key for companies in deciding new features or updates on existing products, and in determining what products will perform well in the market.
For predictive analysis, data analysts use existing results from the earlier described analyses while also using results from machine learning and artificial intelligence to determine precise predictions for future performance.
5. Prescriptive analysis
Prescriptive analysis involves determining the most effective strategy for implementing the decision arrived at. For example, an organization can use prescriptive analysis to sift through the best way to unroll a new feature. This component of data analytics actively deals with the consumer end, requiring one to work with marketing, human resources, and so on.
Prescriptive analysis makes use of machine learning algorithms to analyze large amounts of big data for business intelligence. These algorithms can assess large amounts of data by working through them via “if” and “else” statements and making recommendations accordingly.
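A toy version of the rule-based recommendations described above, using plain if/else logic over hypothetical metrics and thresholds (a real prescriptive system would derive its rules from data rather than hard-code them):

```python
def recommend(ctr, churn_rate, support_tickets):
    """Prescriptive if/else rules over hypothetical metrics; thresholds are illustrative."""
    recommendations = []
    if ctr < 0.02:
        recommendations.append("Rework ad copy: click-through rate is below 2%")
    if churn_rate > 0.05:
        recommendations.append("Launch a retention campaign: monthly churn exceeds 5%")
    else:
        recommendations.append("Maintain current retention strategy")
    if support_tickets > 100:
        recommendations.append("Staff up support ahead of the feature rollout")
    return recommendations

recs = recommend(ctr=0.015, churn_rate=0.08, support_tickets=120)
print(recs)
```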
6. Quantitative and qualitative analysis
Quantitative analysis computationally implements algorithms that test a mathematical fit describing correlation or causation observed within datasets. This includes regression analysis, null hypothesis testing, and so on.
Qualitative analysis, on the other hand, involves non-numerical data such as interviews and pertains to answering broader social questions. It involves working closely with textual data to derive explanations.
7. Statistical analysis
Statistical techniques provide answers to essential decision challenges. For example, they can accurately quantify risk probabilities, predict product performance, establish relationships between variables, and so on. These techniques are used by both qualitative and quantitative analysis methods. Some of the invaluable statistical techniques for data analysts include linear regression, classification, resampling methods, and subset selection.
Statistical analysis, more importantly, lies at the heart of data analysis, providing the essential mathematical framework via which analysis is conducted.
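As a small worked example of that mathematical framework, here is an ordinary least-squares linear regression computed from first principles in Python; the data points (ad spend versus units sold) are hypothetical:

```python
# Ordinary least-squares fit of y = a + b*x, computed from first principles
# Hypothetical data: ad spend (x, in $1000s) vs. units sold (y)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope b = cov(x, y) / var(x); intercept a = mean_y - b * mean_x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

print(f"fitted model: y = {a:.2f} + {b:.2f}x")
```

Libraries such as R or Python's statistics module implement the same computation, but seeing the covariance-over-variance formula once makes the regression output easier to interpret.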
Data-driven businesses
Data-driven businesses use the data analysis methods described above. As a result, they offer many advantages and are particularly suited to modern needs. Their credibility relies on them being evidence-based and using precise mathematical models to determine decisions.
Some of these advantages include a stronger grasp of customer needs, precise identification of business needs, more effective strategic decisions, and better performance in a competitive market. Data-driven businesses are the way forward.
It is no surprise that demand for skilled data analysts is growing across the globe. In this blog, we will explore eight key competencies that aspiring data analysts should focus on developing.
Data analysis is a crucial skill in today’s data-driven business world. Companies rely on data analysts to help them make informed decisions, improve their operations, and stay competitive. And so, all healthy businesses actively seek skilled data analysts.
Becoming a skilled data analyst does not just mean acquiring important technical skills. Certain soft skills, such as creative storytelling or effective communication, make for a more well-rounded profile. Additionally, these non-technical skills can be key in shaping how you put your data analytics skills to use.
Technical skills to practice as a data analyst:
Technical skills are an important aspect of being a data analyst. Data analysts are responsible for collecting, cleaning, and analyzing large sets of data, so a strong foundation in technical skills is necessary for them to be able to do their job effectively.
Some of the key technical skills that are important for a data analyst include:
1. Probability and statistics:
A solid foundation in probability and statistics ensures your ability to identify patterns in data, prevent any biases and logical errors in the analysis, and lastly, provide accurate results. All these abilities are critical to becoming a skilled data analyst.
Consider, for example, how various kinds of probability distributions are used in machine learning. Beyond a strong understanding of these distributions, you will need to be able to apply statistical techniques, such as hypothesis testing and regression analysis, to understand and interpret data.
2. Programming:
As a data analyst, you will need to know how to code in at least one programming language, such as Python, R, or SQL. These languages are the essential tools via which you will be able to clean and manipulate data, implement algorithms and build models.
Moreover, statistical programming languages like Python and R allow advanced analysis that interfaces like Excel cannot provide. Additionally, both Python and R are open source.
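As a minimal illustration of the scripted manipulation a programming language enables, here is a group-by aggregation in plain Python (in practice pandas or SQL would handle real workloads); the transactions are hypothetical:

```python
from collections import defaultdict

# Hypothetical transactions: (region, amount)
transactions = [
    ("north", 120.0), ("south", 80.0), ("north", 200.0),
    ("east", 150.0), ("south", 95.0),
]

# Group-by-and-sum: the kind of operation a spreadsheet pivot table performs,
# but expressed as repeatable, version-controllable code
totals = defaultdict(float)
for region, amount in transactions:
    totals[region] += amount

for region, total in sorted(totals.items()):
    print(region, total)
```

Because the logic lives in code, the same aggregation can be rerun on next month's data without rebuilding anything by hand.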
3. Data visualization:
A crucial part of a data analyst’s job is effective communication both within and outside the data analytics community. This requires the ability to create clear and compelling data visualizations. You will need to know how to use tools like Tableau, Power BI, and D3.js to create interactive charts, graphs, and maps that help others understand your data.
4. Database management:
Managing and working with large and complex datasets means having a solid understanding of database management. This includes everything from methods of collecting, arranging, and storing data in a secure and efficient way. Moreover, you will also need to know how to design and maintain databases, as well as how to query and manipulate data within them.
Certain companies may have roles particularly suited to this task, such as data architects. However, most will expect data analysts to perform these duties, since data analysts are responsible for collecting, organizing, and analyzing data to help inform business decisions.
Organizations use different data management systems, so it helps to gain a general understanding of database operations that you can later specialize to a particular system.
Non-technical skills to adopt as a data analyst:
Data analysts work with various members of the community ranging from business leaders to social scientists. This implies effective communication of ideas to a non-technical audience in a way that drives informed, data-driven decisions. This makes certain soft skills like communication essential.
Similarly, there are other non-technical skills that you may have acquired outside a formal data analytics education. These skills such as problem-solving and time management are transferable skills that are particularly suited to the everyday work life of a data analyst.
1. Communication:
As a data analyst, you will need to be able to communicate your findings to a wide range of stakeholders. This includes being able to explain technical concepts concisely and presenting data in a visually compelling way.
Writing skills can help you communicate your results to the wider public via blogs and opinion pieces. Speaking and presentation skills are also invaluable in this regard.
2. Problem-solving:
Problem-solving is a skill that individuals pick up from working in fields ranging from research to mathematics and beyond. This, too, is a transferable skill rather than one unique to formal data analytics training, and it involves a dash of creativity: thinking about problems outside the box to come up with unique solutions.
Data analysis often involves solving complex problems, so you should be a skilled problem-solver who can think critically and creatively.
3. Attention to detail:
Working with data requires attention to detail and an elevated level of accuracy. You should be able to identify patterns and anomalies in data and be meticulous in your work.
4. Time management:
Data analysis projects can be time-consuming, so you should be able to manage your time effectively and prioritize tasks to meet deadlines. Time management can also be implemented by tracking your daily work using time management tools.
Final word
Overall, being a data analyst requires a combination of technical and non-technical skills. By mastering these skills, you can become an invaluable member of any team and make a real impact with your data analysis.
How does Expedia determine the hotel price to quote to site users? How come Mac users end up spending as much as 30 percent more per night on hotels? Digital marketing analytics, a torrent flowing into every corner of the global economy, has revolutionized marketing efforts, so much so that it has reset the field altogether. It is safe to say that marketing analytics is the science behind persuasion.
Marketers can learn so much about users: their likes, dislikes, goals, inspirations, drop-off points, needs, and demands. This wealth of information is a gold mine, but only for those who know how to use it. In fact, one of the top questions that marketing managers struggle with is:
“Which metrics to track?”
Furthermore, several platforms report on marketing, such as email marketing software, paid search advertising platforms, social media monitoring tools, blogging platforms, and web analytics packages. It is a marketer’s nightmare to be buried under sets of reports from different platforms while tracking a campaign all the way to conversion.
There are certainly smarter ways to track. But before we take a deep dive into how to track smartly, let me clarify why you should spend as much time measuring as doing:
To identify what’s working
To identify what’s not working
To identify strategies to improve
To do more of what works
To gain a trustworthy answer to the above, you must measure everything. As you do, arm yourself with the lexicon of marketing analytics to form statements that communicate results, for example:
“Twitter mobile drove 40% of all clicks this week on the corporate website”
Every statement that you form to communicate analytics must state the source, segment, value, metric, and range. Let us break down the above example:
Source: Twitter
Segment: Mobile
Value: 40%
Metric: Clicks
Range: This week
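The five-part anatomy of such a statement can be captured in a small data structure; this Python sketch is illustrative rather than any standard reporting format:

```python
from dataclasses import dataclass

@dataclass
class AnalyticsStatement:
    """The five parts every reported metric statement should carry."""
    source: str
    segment: str
    value: str
    metric: str
    range: str

    def sentence(self) -> str:
        # Reassemble the parts into the reporting sentence
        return (f"{self.source} {self.segment.lower()} drove {self.value} "
                f"of all {self.metric.lower()} {self.range.lower()} on the corporate website")

stmt = AnalyticsStatement("Twitter", "Mobile", "40%", "Clicks", "This week")
print(stmt.sentence())
```

Forcing every reported number through a structure like this guarantees no statement ships without its source, segment, value, metric, and range.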
To be able to report such glossy statements, you will need to get your hands dirty. You can either take a campaign-based approach or a goals-based approach.
Campaign-based approach to marketing analytics
In a campaign-based approach, you measure the impact of every campaign. For example, if you use social media platforms, blogs, and emails to get users to sign up for an e-learning course, this approach gives you insight into each channel.
In this approach we will discuss the following in detail:
Measure the impact on the website
Measure the impact of SEO
Measure the impact of paid search advertising
Measure the impact of blogging efforts
Measure the impact of social media marketing
Measure the impact of e-mail marketing
Measure the impact on the website
Unique visitors
How to use: Unique visitors account for a fresh set of eyes on your site. If the number of unique visitors is not rising, then it is a clear indication to reassess marketing tactics.
Repeat visitors
How to use: If visitors are revisiting your site or a landing page, it is a clear indication that your site sticks, offering content people want to return to. But if your repeat-visitor rate is too high, it may indicate that your content is not engaging new audiences.
Sources
How to use: Sources are of three types: organic, direct, and referral. Learning about your traffic sources will give you clarity on your SEO performance. It can also help you answer questions like: what percentage of total traffic is organic?
Referrals
How to use: This is when the traffic arriving on your site is from another website. Aim for referrals to deliver 20-30% of your total traffic. Referrals can help you identify the types of sites or bloggers that are linking to your site and the type of content they tend to share. This information can be fed back into your SEO strategy, and help you produce relevant content that generates inbound links.
Bounce rate
How to use: A high bounce rate indicates trouble. Maybe the content is not relevant, or the pages are not compelling enough. Perhaps the experience is not user-friendly, or the call-to-action buttons are too confusing. A high bounce rate reflects problems, and the reasons can be many.
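Bounce rate itself is a simple ratio of single-page sessions to total sessions; a minimal sketch with hypothetical traffic numbers:

```python
def bounce_rate(single_page_sessions: int, total_sessions: int) -> float:
    """Share of sessions that viewed only one page, as a percentage."""
    return 100.0 * single_page_sessions / total_sessions

# Hypothetical week of traffic: 1,800 of 4,000 sessions left after one page
rate = bounce_rate(1800, 4000)
print(f"bounce rate: {rate:.1f}%")
```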
Measure the impact of SEO
Similarly, you can measure the impact of SEO using the following metrics:
Keyword performance and rankings:
How to use: You can use tools like Google AdWords to identify keywords to optimize your website for. Check whether the chosen keywords are driving traffic to your site and improving its rankings.
Total traffic from organic search:
How to use: This metric mirrors how relevant your content is. Low traffic from organic search may mean it is time to ramp up content creation (videos, blogs, webinars) or expand into newer areas, such as e-books and podcasts, that search engines can rank higher.
Measure the impact of paid search advertising
Likewise, it is equally important to measure the impact of your paid search, also known as pay per click (PPC), in which you pay for every click that is generated by paid search advertising. How much are you spending in total? Are those clicks turning into leads? How much profit are you generating from this spend? Some of the following metrics can help you clarify:
Click through rate:
How to use: This metric helps you determine the quality of your ad. Is it effective enough to prompt a click? Test different copy treatments, headlines, and URLs to figure out the combination that boosts the CTR for a specific term.
Average cost per click:
How to use: Cost per click is the amount you spend for each click on a paid search ad. Combine it with the conversion rate and the earnings from those clicks to judge whether the spend is paying off.
Conversion rate:
How to use: Is a conversion always a purchase? No! Each time a user takes an action you want them to take on your site, such as clicking a button, signing up via a form, or subscribing, it counts as a conversion.
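The three paid-search metrics above reduce to simple ratios; a sketch with hypothetical campaign numbers:

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks per impression, as a percentage."""
    return 100.0 * clicks / impressions

def avg_cpc(total_spend: float, clicks: int) -> float:
    """Average cost per click: total spend divided by clicks."""
    return total_spend / clicks

def conversion_rate(conversions: int, clicks: int) -> float:
    """Share of clicks that led to a desired action, as a percentage."""
    return 100.0 * conversions / clicks

# Hypothetical campaign: 50,000 impressions, 1,000 clicks, $500 spend, 40 sign-ups
print(f"CTR: {ctr(1000, 50000):.2f}%")
print(f"CPC: ${avg_cpc(500.0, 1000):.2f}")
print(f"Conversion rate: {conversion_rate(40, 1000):.1f}%")
```

Computing all three from the same campaign numbers makes it easy to see how spend, clicks, and conversions connect.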
Measure the impact of blogging efforts
Going beyond website and SEO metrics, you can also measure the impact of your blogging efforts, since a considerable amount of organizational resources is invested in creating blogs that develop backlinks to the website. Some metrics that can clarify whether you are generating relevant content:
Post Views
Call to action performance
Blog leads
Measure the impact of social media marketing
Strategies to measure social media marketing are well known and widely implemented. Especially now, as the e-commerce industry expands, social media can make or break your image online. Some of the commonly measured metrics are:
Reach
Engagement
Mentions to assess the brand perception
Traffic
Conversion rate
Measure the impact of e-mail marketing
Quite often, the marketing strategy runs on the crutches of e-mail. E-mails are a good place to start visibility efforts and can be vital in maintaining a sustainable relationship with your existing customer base. Some metrics that can help you tell whether your emails are working their magic:
Bounce rate
Delivery rate
Click through rate
Share/forwarding rate
Unsubscribe rate
Frequency of emails sent
Goals-based approach
A goals-based approach is defined by what you’re trying to achieve with a particular campaign. Are you trying to acquire new customers? Build a loyal customer base? Increase engagement, or improve conversion rate?
In this approach we will discuss the following in detail:
Audience analysis
Acquisition analysis
Behavioral analysis
Conversion analysis
A/B testing
Audience analysis:
The goal is to know:
“Who are your customers?”
Audience analysis is a measure that helps you gain clarity on who your customers are. The information can include demographics, location, income, age, and so forth. The following set of metrics can help you know your customers better.
Unique visitors
Lead score
Cookies
Segment
Label
Personally Identifiable Information (PII)
Properties
Taxonomy
Acquisition analysis:
The goal is to know:
“How do customers get to your website?”
Acquisition analysis helps you understand which channel delivers the most traffic to your site or application. Comparing incoming visitors from different channels helps determine the efficacy of your SEO efforts on organic search traffic and shows how well your email campaigns are running. Some of the metrics that can help you are:
Omnichannel
Funnel
Impressions
Sources
UTM parameters
Tracking URL
Direct traffic
Referrers
Retargeting
Attribution
Behavioral targeting
Behavioral analysis:
The goal is to know:
“What do the users do on your website?”
Behavior analytics explains what customers do on your website. What pages do they visit? Which devices do they use? From where do they enter the site? What makes them stay? How long do they stay? Where on the site did they drop off? Some of the metrics that can help you gain clarity are:
Actions
Sessions
Engagement rate
Events
Churn
Bounce rate
Conversion analysis
The goal is to know:
“Do customers take the actions you want them to take?”
Conversions track whether customers take actions that you want them to take. This typically involves defining funnels for important actions — such as purchases — to see how well the site encourages these actions over time. Metrics that can help you gain more clarity are:
Conversion rate
Revenue report
A/B testing:
The goal is to know:
“What digital assets are likely to be the most effective for higher conversion?”
A/B testing enables marketers to experiment with different digital options to identify which are likely to be most effective. For example, they can compare one intervention (A, the control group) against another (B, the variant). Companies run A/B experiments regularly to learn what works best.
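One common way to judge whether a variant truly beats the control is a two-proportion z-test; this sketch uses the normal approximation and hypothetical conversion counts:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: control converts 200/5000, variant converts 260/5000
z, p = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests the variant's lift is unlikely to be chance, though sample size and repeated testing caveats still apply.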
In this article, we discussed what marketing analytics is, its importance, two approaches that marketers can take to report metrics and the marketing lingo they can use while reporting results. Pick the one that addresses your business needs and helps you get clarity on your marketing efforts. This is not an exhaustive list of all the possible metrics that can be used to measure.
Of course, there are more! But this can be a good starting point until the marketing efforts expand into a larger effort that has additional areas that need to be tracked.
In this blog, we are going to discuss data storytelling for successful brand building, its components, and data-driven brand storytelling.
What is data storytelling?
Data storytelling is the process of deriving insights from a dataset through analysis and making them presentable through visualization. It not only helps capture insights but also makes content visually presentable so that stakeholders can make data-driven decisions.
With data storytelling, you can influence and inform your audience based on your analysis.
There are 3 important components of data storytelling.
Data: The analysis of your data builds the foundation of your story. This could be descriptive, diagnostic, predictive, or prescriptive analysis to help get a full picture.
Narrative: Also known as a storyline, a narrative is used to communicate insights gained from your analysis.
Visualization: Visualization helps communicate that story clearly and effectively, making use of graphs, charts, diagrams, and audio-visuals.
The benefits of data storytelling
So, the question arises: why do we even need storytelling for data? The simple answer is that it helps with decision-making. But let’s take a look at some of the benefits of data storytelling.
Adding value to your data and insights.
Interpreting complex information and highlighting essential key points for the audience.
Providing a human touch to your data.
Offering value to your audience and industry.
Building credibility as an industry and topic thought leader.
For example, Airbnb uses data storytelling to help consumers find the right accommodation at the right price, and to help hosts set up their listings in the most lucrative places.
Data storytelling helps Airbnb deliver personalized experiences and recommendations. Their price tip feature is constantly updated to guide hosts on how likely they are to get a booking at a chosen price. Other features include host/guest interactions, current events, and local market history, available in real time through its app.
Data-driven brand storytelling
Now that we have an understanding of data storytelling, let’s talk about how brand storytelling works. Data-driven brand storytelling is when a company uses research, studies, and analytics to share information about a brand and tell a story to consumers.
It turns complex datasets into an insightful, easy-to-understand, visually comprehensible story. It is different from creative storytelling, where the brand only focuses on creating a perception; here the story is based on factual data.
Storytelling is a great way to build brand association and connect with your consumers. Data-driven storytelling uses visualization that captures attention.
Learn how to create and execute data visualization and tell a story with your data by enrolling in our 5-day live Power BI training.
That’s why infographics, charts, and images are so useful.
For example, Tower Electric Bikes, a direct-to-consumer e-bike brand, used an infographic to rank the most and least bike-friendly cities across the US. This way, they turned an enormous amount of data into a visually friendly infographic that bike consumers can interpret at a glance.
Using the power of storytelling for marketing content
Even though all content is interpreted as data by consumers, visual content provides the most value in terms of memorability, impact, and capturing their attention. The job of any successful brand is to build a positive association in consumers’ minds.
Storytelling helps create those positive associations by providing high-value engaging content, capturing attention, and giving meaning to not-so-visually appealing datasets.
We live in a world that is highly cluttered by advertising and paid promotional content. To make your content stand out from competitors you need to have good visualization and a story behind it. Storytelling helps assign meaning and context to data that would otherwise look unappealing and dry.
Consumers gain clarity, and better understanding, and share more if it makes sense to them. Data storytelling helps extract and communicate insight that in turn helps your consumer’s buying journey.
It could be content relevant to any stage of their buyer journey or even outside of the sales cycle. Storytelling helps create engaging and memorable marketing content that would help grow your brand.
Learn how to use data visualization, narratives, and real-life examples to bring your story to life with our free community event Storytelling Data.
Executing flawless data-driven brand storytelling
Now that we have a better understanding of brand storytelling, let’s have a look at how to go about crafting a story and important steps involved.
Craft a compelling narrative
The most important element in building a story is the narrative. You need a compelling narrative for your story. There are 4 key elements to any story.
Characters: These are your key players or stakeholders in your story. They can be customers, suppliers, competitors, environmental groups, government, or any other group that has to do with your brand.
Setting: This is where you use your data to reinforce the narrative, whether it’s an improved feature in your product that increases safety or a manufacturing process that takes environmental impact into account. This is the stage where you define the environment that concerns your stakeholders.
Conflict: Here you describe the root issue or problem you’re trying to solve with data. For example, if certain marketing content generated sales revenue, you may want your team to understand why, so it can create more helpful content for the sales team. Conflict plays a crucial role in making your story relevant and engaging: there needs to be a problem for a data solution.
Resolution: Finally, you want to propose a solution to the identified problem. You can present a short-term fix along with a long-term pivot depending on the type of problem you are solving. At this stage, your marketing outreach should be consistent with a very visible message across all channels.
You don’t want to create confusion; whatever resolution or result you’ve achieved through analysis should be clearly indicated with supporting evidence and compelling visualization to make your story come to life.
Your storytelling needs to have all these steps to be able to communicate your message effectively to the desired audience. With these steps, your audience will walk through a compelling, engaging and impactful story.
Start learning data storytelling today
Our brains are hard-wired to love stories and visuals. Storytelling is not something new; it dates back thousands of years, from cave paintings to early symbol languages. That is why it resonates so well in today’s fast-paced, cluttered consumer environment.
Brands can use storytelling based on factual data to engage, create positive associations and finally encourage action. The best way to come up with a story narrative is to use internal data, success stories, and insights driven by your research and analysis. Then translate those insights into a story and visuals for better retention and brand building.
In this article, we’re going to talk about how data analytics can help your business generate more leads and why you should rely on data when making decisions regarding a digital marketing strategy.
Some people believe that marketing is about creativity – unique and interesting campaigns, quirky content, and beautiful imagery. Contrary to their beliefs, data analytics is what actually powers marketing – creativity is simply a way to accomplish the goals determined by analytics.
Now, if you’re still not sure how you can use data analytics to generate more leads, here are our top 10 suggestions.
1. Know how your audience behaves
Most businesses have an idea or two about who their target audience is. But having an idea or two is not good enough if you want to grow your business significantly – you need to be absolutely sure who your audience is and how they behave when they come to your website.
Now, the best way to do that is to analyze the website data.
You can tell quite a lot by simply looking at the right numbers. For instance, if you want to know whether the users can easily find the information they’re looking for, keep track of how much time they spend on a certain webpage. If they leave the webpage as soon as it loads, they probably didn’t find what they needed.
We know that looking at spreadsheets is a bit boring, but you can easily obtain Power BI Certification and use Microsoft Power BI to make data visuals that are easy to understand and pleasing to the eye.
2. Segment your audience
A great way to satisfy the needs of different subgroups within your target audience is audience segmentation. Using it, you can create multiple funnels for users to move through instead of just one, thereby increasing your lead generation.
Now, before you segment your audience, you need to have enough information about these subgroups so that you can divide them and identify their needs. Since you can’t individually interview users and ask them for the necessary information, you can use data analytics instead.
Once you have that, it’s time to identify their pain points and address them differently for different subgroups, and voilà – you’ve got yourself more leads.
3. Use data analytics to improve buyer persona
Knowing your target audience is a must but identifying a buyer persona will take things to the next level. A buyer persona doesn’t only contain basic information about your customers. It goes deeper than that and tells you their exact age, gender, hobbies, location, and interests.
It’s like describing a specific person instead of a group of people.
Of course, not all your customers will fit that description to a T, but that’s not the point. The point is to have that one idea of a person (or maybe two or three buyer personas) in your mind when creating content for your business.
4. Use predictive marketing
While data analytics should absolutely be used in retrospectives, there’s another purpose for the information you obtain through analytics – predictive marketing.
Predictive marketing is basically using big data to develop accurate forecasts of customers’ behavior. It uses complex machine-learning algorithms to build predictive models.
A good example of how that works is Amazon’s landing page, which includes personalized recommendations.
Amazon doesn’t only keep track of the user’s previous purchases, but also what they have clicked on in the past and the types of items they’ve shown interest in. By combining that with the time and season of purchase, they are able to make remarkably accurate recommendations.
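As a hedged, toy-scale illustration of the idea behind purchase-based recommendations (this is our own sketch, not Amazon’s algorithm; the function name and sample baskets are invented), a simple “customers also bought” recommender can be built from item co-occurrence counts:

```python
from collections import Counter
from itertools import combinations

def recommend(purchase_histories, user_items, k=3):
    """Recommend up to k items that co-occur most often with the user's items.

    purchase_histories: list of past baskets (lists of item names)
    user_items: items the current user already has
    """
    # Count how often each pair of items appears in the same basket
    co = Counter()
    for basket in purchase_histories:
        for a, b in combinations(set(basket), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1

    # Score candidate items by co-occurrence with what the user already owns
    scores = Counter()
    for item in user_items:
        for (a, b), n in co.items():
            if a == item and b not in user_items:
                scores[b] += n
    return [item for item, _ in scores.most_common(k)]
```

Real predictive-marketing systems add far richer signals (clicks, seasonality, timing), but the core is the same: learn patterns from past behavior and rank what a user is likely to want next.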
If you’re curious to find out how data science works, we suggest that you enroll in the Data Science Bootcamp.
5. Know where website traffic comes from
Users come to your website from different places.
Some have searched for it directly on Google, some have run into an interesting blog piece on your website, while others have seen your ad on Instagram. This means that the time and effort you put into optimizing your website and creating interesting content pay off.
But imagine creating a YouTube ad that doesn’t bring much traffic – that doesn’t pay off at all. You’d then want to rework your campaign or redirect your efforts elsewhere.
This is exactly why knowing where website traffic comes from is valuable. You don’t want to invest your time and money into something that doesn’t bring you any benefits.
6. Understand which products work
Most of the time, you can determine what your target audience will like and dislike. The more information you have about your target audience, the better you can satisfy their needs.
But no one is perfect, and anyone can make a mistake.
Heinz, a company known for producing ketchup and other foods, once released a new product: EZ Squirt ketchup in shades of purple, green, and blue. At first, kids loved it, but the novelty didn’t last. Six years later, Heinz halted production of these products.
As you can see, even big and experienced companies flop sometimes. A good way to avoid that is by tracking which product pages have the least traffic and don’t sell well.
7. Perform competitor analysis
Keeping an eye on your competitors is never a bad idea. No matter how well you’re doing and how unique you are, others will try to surpass you and become better.
The good news is that there are quite a few tools online that you can use for competitor analysis. SEMrush, for instance, can help you see what the competition is doing to get qualified leads so that you can use it to your advantage.
Even if there wasn’t a tool you need, you can always enroll in a Python for Data Science course and learn to build your own tools that can track the data you need to drive your lead generation.
8. Nurture your leads
Nurturing your leads means developing a personalized relationship with your prospects at every stage of the sales funnel in order to get them to buy your products and become your customers.
Because lead nurturing relies on a personalized approach, you’ll need information about your leads – their title, role, industry, and similar details, depending on what your business does. Once you have that, you can provide them with relevant content that will help them decide to buy your products and build brand loyalty along the way.
9. Make data-driven decisions
Having insight into your conversion rate, churn rate, sources of website traffic, and other relevant data will ultimately lead to more customers. For instance, your sales team will be able to calculate which sources convert most effectively and prepare resources before running a campaign.
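Calculating which sources convert most effectively is straightforward once visits and conversions are tallied per source. A minimal sketch (the source names and counts here are hypothetical, not from the article):

```python
def conversion_rate_by_source(visits, conversions):
    """Return conversion rate per traffic source.

    visits: dict mapping source name -> number of visitors
    conversions: dict mapping source name -> number of conversions
    Sources with zero visits are skipped to avoid division by zero.
    """
    return {
        src: conversions.get(src, 0) / n
        for src, n in visits.items()
        if n > 0
    }

# e.g., paid ads brought 200 visits / 10 sales; organic search 100 / 8
rates = conversion_rate_by_source(
    {'ads': 200, 'organic': 100},
    {'ads': 10, 'organic': 8},
)
```

In this made-up example, organic search converts at a higher rate despite driving less raw traffic, which is exactly the kind of insight that tells a team where to invest before the next campaign.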
The more information you have, the better you’ll perform, and this is exactly why Data Science for Business is important – you’ll be able to see the bigger picture and make better decisions.
10. Avoid significant losses
Finally, data can help you avoid certain losses by halting the launch of a product that won’t do well.
For instance, you can use a “Coming soon” page to research the market and see if your customers are interested in a new product you planned on launching. If enough people show interest, you can start producing, and if not – you won’t waste your money on something that was bound to fail.
Conclusion:
Applications of data analytics go beyond simple data analysis, especially for advanced analytics projects. The majority of the work is done up front in the data collection, integration, and preparation stages, followed by the creation, testing, and revision of analytical models to make sure they give reliable findings. Analytics teams frequently include data engineers, who build data pipelines and help prepare data sets for analysis, in addition to data scientists and other data analysts.
A hands-on guide to collect and store twitter data for timeseries analysis
“A couple of weeks back, I was working on a project in which I had to scrape tweets from twitter and after storing them in a csv file, I had to plot some graphs for timeseries analysis. I requested Twitter for Twitter developer API, but unfortunately my request was not fulfilled. Then I started searching for python libraries which can allow me to scrape tweets without the official Twitter API.
To my amazement, there were several libraries through which you can scrape tweets easily but for my project I found ‘Snscrape’ to be the best library, which met my requirements!”
What is SNScrape?
Snscrape is a scraper for social networking services (SNS). It scrapes things like user profiles, hashtags, or searches and returns the discovered items, such as relevant posts.
Install Snscrape
Snscrape requires Python 3.8 or higher. The Python package dependencies are installed automatically when you install Snscrape. You can install using the following commands.
For this tutorial we will be using the development version of Snscrape. Paste the second command in the command prompt (cmd), and make sure you have git installed on your system.
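The install commands were not preserved here; the following is the usual way to install the stable release from PyPI and (the “second command” referenced above) the development version from the project’s GitHub repository:

```shell
# Stable release from PyPI
pip3 install snscrape

# Development version (requires git) -- used in this tutorial
pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
```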
Code walkthrough for scraping
Before starting make sure you have the following python libraries:
Pandas
Numpy
Snscrape
Tqdm
Seaborn
Matplotlib
Importing Relevant Libraries
To run the scraping program, you will first need to import the libraries
import pandas as pd
import numpy as np
import snscrape.modules.twitter as sntwitter
import datetime
from tqdm.notebook import tqdm_notebook
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
Taking User Input
To scrape tweets, you can provide many filters, such as the username, start date, or end date. We will take the following user inputs, which will then be used in Snscrape.
Text: The query to be matched. (Optional)
Username: Specific username from twitter account. (Required)
Since: Start Date in this format yyyy-mm-dd. (Optional)
Until: End Date in this format yyyy-mm-dd. (Optional)
Count: Max number of tweets to retrieve. (Required)
Retweet: Include or Exclude Retweets. (Required)
Replies: Include or Exclude Replies. (Required)
For this tutorial we used the following inputs:
text = input('Enter query text to be matched (or leave it blank by pressing enter): ')
username = input('Enter specific username(s) from a twitter account without @ (or leave it blank by pressing enter): ')
since = input('Enter startdate in this format yyyy-mm-dd (or leave it blank by pressing enter): ')
until = input('Enter enddate in this format yyyy-mm-dd (or leave it blank by pressing enter): ')
count = int(input('Enter max number of tweets or enter -1 to retrieve all possible tweets: '))
retweet = input('Exclude Retweets? (y/n): ')
replies = input('Exclude Replies? (y/n): ')
Which fields can we scrape?
Here is the list of fields which we can scrape using Snscrape Library.
For this tutorial we will not scrape all the fields but a few relevant fields from the above list.
The search function
Next, we will define a search function which takes the following inputs as arguments and creates a query string to be passed to snscrape’s Twitter search scraper function.
Here we have defined different conditions, and based on those conditions we create the query string. For example, if the variable until (end date) is empty, we assign it the current date and append it to the query string; if the variable since (start date) is empty, we assign it the date 7 days before the current date. Along with the query string, we create a filename string which will be used to name our csv file.
Calling the Search Function and creating Dataframe
q = search(text, username, since, until, retweet, replies)

# Creating list to append tweet data
tweets_list1 = []

# Using TwitterSearchScraper to scrape data and append tweets to list
if count == -1:
    for i, tweet in enumerate(tqdm_notebook(sntwitter.TwitterSearchScraper(q).get_items())):
        tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username,
                             tweet.lang, tweet.hashtags, tweet.replyCount, tweet.retweetCount,
                             tweet.likeCount, tweet.quoteCount, tweet.media])
else:
    with tqdm_notebook(total=count) as pbar:
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper(q).get_items()):
            if i >= count:  # number of tweets you want to scrape
                break
            tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username,
                                 tweet.lang, tweet.hashtags, tweet.replyCount, tweet.retweetCount,
                                 tweet.likeCount, tweet.quoteCount, tweet.media])
            pbar.update(1)

# Creating a dataframe from the tweets list above
tweets_df1 = pd.DataFrame(tweets_list1,
                          columns=['DateTime', 'TweetId', 'Text', 'Username', 'Language',
                                   'Hashtags', 'ReplyCount', 'RetweetCount', 'LikeCount',
                                   'QuoteCount', 'Media'])
In this snippet we have invoked the search function, and the query string is stored in the variable ‘q’. Next, we define an empty list which will be used for appending tweet data. If the count is specified as -1, the for loop will iterate over all the tweets.
The TwitterSearchScraper class constructor takes the query string as an argument, and we then invoke its get_items() method to retrieve the tweets. Inside the for loop we append the scraped data to the tweets_list1 variable defined earlier. If count is defined, we use it to break out of the loop. Finally, using this list, we create the pandas dataframe by specifying the column names.
Now that our data is prepared, we will save the dataframe as csv using the df.to_csv() function, which takes the filename as an input parameter.
tweets_df1.to_csv(filename, index=False)
Visualizing timeseries data using barplot, lineplot, histplot and kdeplot
It is time to visualize our prepared data so that we can find useful insights. First, we will load the saved csv into a dataframe using the read_csv() function of pandas, which takes the filename as an input parameter.
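The plotting code itself was not preserved here; the following is a minimal sketch of loading the CSV and drawing two simple time-series views. The function name, output file names, and column choices are ours (the original likely used seaborn’s histplot/kdeplot, as imported earlier), but it assumes only the ‘DateTime’ column written by the scraping step above:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so this also runs outside a notebook
import matplotlib.pyplot as plt

def plot_tweet_timeseries(csv_path):
    """Load the scraped CSV and draw simple time-series views of tweet activity."""
    df = pd.read_csv(csv_path, parse_dates=['DateTime'])
    df['Hour'] = df['DateTime'].dt.hour

    # Tweets per hour of day: reveals quiet hours vs. peak hours
    df['Hour'].plot(kind='hist', bins=24, title='Tweets per hour of day')
    plt.savefig('tweets_per_hour.png')
    plt.clf()

    # Daily tweet volume over time: reveals active vs. inactive periods
    df.set_index('DateTime').resample('D').size().plot(title='Tweets per day')
    plt.savefig('tweets_per_day.png')
    plt.clf()
    return df
```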
From the above time-series visualizations, we can clearly see that the peak hours of tweets from this account are between 7 pm and 9 pm, while from 4 am to 1 pm the twitter handle is quiet. We can also point out that most of the tweets related to that topic were posted in the month of August. Similarly, we can see that the Twitter handle was not very active before 2021.
In conclusion, we saw how we can easily scrape tweets without the Twitter API by using Snscrape. Then we performed some transformations on the scraped data and stored it in a csv file. Later, we used that csv file for time-series visualizations and analysis. We appreciate you following along with this hands-on guide and hope it makes it easy for you to get started on your upcoming data science project.
Data Science Dojo is offering Metabase for FREE on Azure Marketplace packaged with web accessible Metabase: Open-Source server.
Introduction
Organizations often adopt strategies that enhance the productivity of their selling points. One strategy is to use prior business data to identify key patterns for a product and then make decisions about it accordingly. However, this work is hectic and costly, and it traditionally requires domain experts. Metabase bridges that skill gap by providing marketing and business professionals with an easy-to-use query builder notebook to extract the required data and simultaneously visualize it without any SQL coding, with just a few clicks.
What is Metabase and its question?
Metabase is an open-source business intelligence framework that provides a web interface to import data from diverse databases and then analyze and visualize it with a few clicks. The methodology of Metabase is based on questions and the answers to them; they form the foundation of everything else that it provides.
A question is any kind of query that you want to perform on data. Once you are done specifying query functions in the notebook editor, you can visualize the query results. After that, you can save the question for reusability and turn it into a data model for business-specific purposes.
Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to become an expert in data science and analytics.
Challenges for businesses
For businesses that lack expert analysts, engineers, and a substantial IT department, it is costly and time-consuming to hire new domain experts, or for managers to learn to code themselves and then explore and visualize data. Apart from that, few pre-existing applications provide diverse data source connections, which is also a challenge.
In this regard, a straightforward interactive tool that even newbies could adapt immediately and thus get the job done would be the most ideal solution.
Data analytics with Metabase
The Metabase concept is based on questions (which are basically queries) and data models (special saved questions). It provides an easy-to-use notebook through which users can gather raw data, filter it, join tables, summarize information, and add other customizations without any need for SQL coding.
Users can select the dimensions of columns from tables and then create various visualizations and embed them in different sub-dashboards. Metabase is frequently utilized for pitching business proposals to executive decision-makers because the visualizations are very simple to achieve from raw data.
Major characteristics
Metabase delivers a notebook that enables users to select data, join with other tables, filter, and other operations just by clicking on options instead of writing a SQL query
In case of complex queries, a user can also use an in-built optimized SQL editor
The choice to select from various data sources like PostgreSQL, MongoDB, Spark SQL, Druid, etc., makes Metabase flexible and adaptable
Under the Metabase admin dashboard, users can troubleshoot the logs regarding different tasks and jobs
Has the ability to enable public sharing. It enables admins to create publicly viewable links for Questions and Dashboards
What Data Science Dojo has for you
Metabase instance packaged by Data Science Dojo serves as an open-source easy-to-use web interface for data analytics without the burden of installation. It contains numerous pre-designed visualization categories waiting for data.
It has a query builder which is used to create questions (customized queries) with a few clicks. In our service, users can also use an in-browser SQL editor for performing complex queries. Any user who wants to identify the impact of their product from raw business data can use this tool.
Features included in this offer:
A rich web interface running Metabase: Open Source
A no-code query building notebook editor
In-browser optimized SQL editor for complex queries
Beautiful interactive visualizations
Ability to create data models
Email configuration and Slack support
Shareability feature
Easy specification for metrics and segments
Feature to download query results in CSV, XLSX and JSON format
Our instance supports the following major databases:
Druid
PostgreSQL
MySQL
SQL Server
Amazon Redshift
Big Query
Snowflake
Google Analytics
H2
MongoDB
Presto
Spark SQL
SQLite
Conclusion
Metabase is business intelligence software that is beneficial for marketing and product managers. By making it possible to share analytics with various teams within an enterprise, Metabase makes it simple for developers to create reports and collaborate on projects. The responsiveness and processing speed are faster than in a traditional desktop environment because it uses Microsoft cloud services.
At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Metabase server dedicated specifically for Data Analytics operations on Azure Market Place. Hurry up and install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!
Click on the button below to head over to the Azure Marketplace and deploy Metabase for FREE by clicking on “Get it now”.
Note: You’ll have to sign up to Azure, for free, if you do not have an existing account.
Data Science Dojo is offering Countly for FREE on Azure Marketplace packaged with web accessible Countly Server.
Purpose of product analytics
Product analytics is a comprehensive collection of mechanisms for evaluating the performance of digital ventures created by product teams and managers.
Businesses often need to measure the metrics and impact of their products – e.g., how the audience perceives a product, such as how many visitors are reading a particular page or clicking on a specific button. This gives insight into what future decisions need to be taken regarding the product: should it be modified, removed, or kept as it is? Countly makes this work easier by providing a centralized web analytics environment to track user engagement with a product along with monitoring its health.
Challenges for individuals
Many platforms require developers to code visualizations of analytics, which is not only time-consuming but also comes at a cost. At the application level, an app crash leaves anyone in shock, followed by the hectic, time-consuming task of determining the root cause of the problem. At the corporate level, current and past data need to be analyzed appropriately for the future strength of the company, and that requires robust analysis easily accessible to anyone – a challenge faced by many organizations.
Countly analytics
Countly enables users to monitor and analyze the performance of their applications in real time, irrespective of the platform. It can compile data from numerous sources and present it in a manner that makes it easier for business analysts and managers to evaluate app usage and client behavior. It offers a customizable dashboard with the freedom to innovate and improve your products in order to meet important business and revenue objectives while also ensuring privacy by design. It is a world leader in product analytics, tracking more than 1.5 billion unique identities on more than 16,000 applications and more than 2,000 servers worldwide.
Major characteristics
Interactive web interface: User-friendly web environment with customizable dashboards for easy accessibility along with pre-designed metrics and visualizations
Platform-independent: Supports web analytics, mobile app analytics, and desktop application analytics for macOS and Windows
Alerts and email reporting: Ability to receive alerts based on the metric changes and provides custom email reporting
Users’ role and access manager: Provides global administrators the ability to manage users, groups, and their roles and permissions
Logs Management: Maintains server and audit logs on the web server regarding user actions on data
What Data Science Dojo has for you
Countly Server packaged by Data Science Dojo provides a web analytics service that delivers insights about your product in real time – whether it’s a web application, a mobile app, or a desktop application – without the burden of installation. It comes with numerous pre-configured metrics and visualization templates to import data and observe trends. It helps businesses identify application usage and determine the client response to their apps.
Features included in this offer:
A VM configured with Countly Server: Community Edition accessible from a web browser
Ability to track user analytics, user loyalty, session analytics, technology, and geo insights
Easy-to-use customizable dashboard
Logs manager
Alerting and reporting feature
User permissions and roles manager
Built-in Countly DB viewer
Cache management
Flexibility to define data limits
Conclusion
Countly provides the ability to analyze data in real time. It is highly extensible and offers various features to manage different operations like alerting, reporting, logging, and job management. Analytics throughput can be increased by using multiple cores on an Azure Virtual Machine. Countly can also handle applications on different platforms at once; this might slow down the server if you have thousands upon thousands of active client requests across applications, and CPU and RAM usage may be affected, but an Azure Virtual Machine takes care of these concerns.
At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Countly Server dedicated specifically for Data Analytics operations on Azure Market Place. Hurry up and install this offer by Data Science Dojo, your ideal companion in your journey to learn data science!
Click on the button below to head over to the Azure Marketplace and deploy Countly for FREE by clicking on “Try now”.
Note: You’ll have to sign up to Azure, for free, if you do not have an existing account.
Get hired as a Data Analyst by confidently responding to the most frequently asked interview questions. No matter how qualified or experienced you are, if you stumble over your thoughts while answering the interviewer, it might take away some of your chances of getting onboard.
1. Tell us about your most successful/most challenging data analysis project
In this question, you can also share your strengths and weaknesses with the interviewer.
When answering questions like these, data analysts must attempt to share both their strengths and weaknesses. How do you deal with challenges and how do you measure the success of a data project? You can discuss how you succeeded with your project and what made it successful.
Take a look at the original job description to see if you can incorporate some of the requirements and skills listed. If you were asked the negative version of the question, be honest about what went wrong and what you would do differently in the future to fix the problem. Despite our human nature, mistakes are a part of life. What’s critical is your ability to learn from them.
Further, talk about any SaaS platforms, programming languages, and libraries you used: why you chose them and how they helped you accomplish your goals. Discuss the entire pipeline of your project, from collecting data to turning it into valuable insights. Describe the ETL pipeline, including data cleaning, data preprocessing, and exploratory data analysis. What did you learn, what issues did you encounter, and how did you deal with them?
2. Tell us about the largest data set you’ve worked with? Or what type of data have you worked with in the past?
What they’re really asking is: Can you handle large data sets?
Data sets of varying sizes and compositions are becoming increasingly common in many businesses. Answering questions about data size and variety requires a thorough understanding of the type of data and its nature. What data sets did you handle? What types of data were present?
It is not necessary that you only mention a dataset you worked with at your job. But you can also share about varying sizes, specifically large datasets, you worked with as a part of a data analysis course, Bootcamp, certificate program, or degree. As you put together a portfolio, you may also complete some independent projects where you find and analyze a data set. All of this is valid material to build your answer.
The more versatile your experience with datasets will be, the greater the chances there are of getting hired.
3. Describe your data cleaning process. The expected answer to this question will include details about how you handle missing data, outliers, duplicate data, etc.
Data analysts are widely responsible for data preparation, data cleansing, or data cleaning. Organizations expect data analysts to spend a significant amount of time preparing data for an employer. As you answer this question, share in detail with the employer why data cleaning is so important.
In your answer, give a short description of what data cleaning is and why it’s important to the overall process. Then walk through the steps you typically take to clean a data set.
4. Name some data analytics software you are familiar with. OR what data software have you used in the past? OR What data analytics software are you trained in?
What they need to know: Do you have basic competency with common tools? How much training will you need?
Before you appear for the interview, it’s a good time to look at the job listing to see what software was mentioned. As you answer this question, describe how you have used that software or something similar in the past. Show your knowledge of the tool by employing associated words.
Mention software solutions you have used for a variety of data analysis phases. You don’t need to provide a lengthy explanation. What data analytics tools you used and for what purpose will satisfy the interviewer.
5. What statistical methods have you used in data analysis? OR what is your knowledge of statistics? OR how have you used statistics in your work as a Data Analyst?
What they’re really asking: Do you have basic statistical knowledge?
Data analysts should have at least a rudimentary grasp of statistics and know-how that statistical analysis helps business goals. Organizations look for a sound knowledge of statistics in Data analysts to handle complex projects conveniently. If you used any statistical calculations in the past, be sure to mention it. If you haven’t yet, familiarize yourself with the following statistical concepts:
Mean
Standard deviation
Variance
Regression
Sample size
Descriptive and inferential statistics
While speaking of these, share information that you can derive from them. What knowledge can you gain about your dataset?
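As a quick refresher, most of these concepts can be computed in a few lines of Python; the sales figures below are purely illustrative:

```python
import numpy as np

# Hypothetical sample: monthly sales figures (illustrative data only)
sales = np.array([120.0, 135.0, 150.0, 128.0, 160.0, 155.0])
months = np.arange(len(sales))

mean = sales.mean()            # central tendency
variance = sales.var(ddof=1)   # sample variance
std_dev = sales.std(ddof=1)    # sample standard deviation

# Simple linear regression (least squares): trend of sales over time
slope, intercept = np.polyfit(months, sales, 1)

print(f"mean={mean:.1f}, variance={variance:.1f}, std={std_dev:.2f}")
print(f"trend: sales ~ {slope:.2f}*month + {intercept:.2f}")
```

Being able to say what each number tells you about the dataset (center, spread, trend) matters more in the interview than the code itself.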
Read these amazing 12 Data Analytics books to strengthen your knowledge
6. What programming languages are you proficient in? To work as a data analyst, you will almost certainly need both SQL and a statistical programming language like R or Python. If you are already proficient in the programming language of your choice by the job interview, that's fine. If not, you can demonstrate your enthusiasm for learning it.
In addition to your current languages’ expertise, mention how you are developing your expertise in other languages. If there are any plans for completing a programming language course, highlight its details during the interview.
To gain some extra points, do not hesitate to mention why and in which situations SQL is used, and why R and Python are used.
7. How can you handle missing values in a dataset?
This is one of the most frequently asked data analyst interview questions, and the interviewer expects you to give a detailed answer here, and not just the name of the methods. There are four methods to handle missing values in a dataset.
Listwise Deletion
In the listwise deletion method, an entire record is excluded from analysis if any single value is missing.
Average Imputation
Take the average value of the other participants’ responses and fill in the missing value.
Regression Substitution
You can use multiple-regression analyses to estimate a missing value.
Multiple Imputations
It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by incorporating random errors in your predictions.
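A minimal pandas sketch of the first two methods, using a made-up survey table (regression substitution and multiple imputation are typically handled by a library such as scikit-learn's `IterativeImputer` rather than written by hand):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing values (illustrative only)
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [40000, np.nan, 52000, 61000, 45000],
})

# 1. Listwise deletion: drop any record containing a missing value
listwise = df.dropna()

# 2. Average imputation: fill each gap with the column mean
mean_imputed = df.fillna(df.mean())

print(listwise.shape)   # one row dropped per record with any NaN
print(mean_imputed)
```

Being able to explain the trade-off (listwise deletion discards information; mean imputation understates variance) is exactly the kind of detail interviewers look for here.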
8. What is Time Series analysis?
Data analysts are responsible for analyzing data points collected at different intervals. While answering this question you also need to talk about the correlation between the data evident in time-series data.
Watch this short video to learn in detail:
9. What is the difference between data profiling and data mining?
Data profiling examines attributes of a dataset, such as data type, frequency, and length, along with their discrete values and value ranges. It assesses source data to understand its structure and quality through data collection and quality checks.
On the other hand, data mining is a type of analytical process that identifies meaningful trends and relationships in raw data. This is typically done to predict future data.
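As an illustration, a basic profiling pass over a hypothetical source table might look like this in pandas:

```python
import pandas as pd

# Hypothetical source table to profile (illustrative data only)
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "country": ["US", "US", "DE", "FR"],
    "amount": [19.99, 5.50, 120.00, 42.25],
})

# Data types of each attribute
print(df.dtypes)

# Frequency of discrete values
print(df["country"].value_counts())

# Value ranges and summary statistics
print(df["amount"].agg(["min", "max", "mean"]))

# Completeness check: share of non-null values per column
print(df.notna().mean())
```

Data mining, by contrast, would run algorithms over this data to surface patterns, not just describe its shape and quality.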
10. Explain the difference between R-Squared and Adjusted R-Squared.
The most vital difference is that adjusted R-squared accounts for the number of independent variables in the model, penalizing variables that do not improve the fit, while R-squared does not.
An R-squared value is an important statistic for comparing two variables. However, when examining the relationship between a single stock and the rest of the S&P 500, it is important to use adjusted R-squared so that adding predictors does not overstate the correlation.
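To make the difference concrete, here is a small Python sketch computing both statistics from their standard formulas; the fitted values are invented for illustration:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # n = number of observations, p = number of predictors;
    # the adjustment penalizes predictors that add no explanatory power
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical fit: 10 observations, 3 predictors (illustrative numbers)
y = np.array([3.0, 4.1, 5.2, 5.9, 7.1, 8.0, 9.2, 9.8, 11.1, 12.0])
y_hat = np.array([3.1, 4.0, 5.0, 6.1, 7.0, 8.1, 9.0, 10.0, 11.0, 12.1])

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(y), p=3)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

Note that adjusted R-squared is always at most R-squared, and the gap widens as you add predictors without improving the fit.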
11. Explain univariate, bivariate, and multivariate analysis.
Univariate analysis, the simplest of the three, is used when the data set has only one variable; it describes that variable's distribution and does not involve causes or relationships.
Bivariate analysis is used when the data set has two variables and researchers are looking to compare them or study the relationship between them.
When the data set has more than two variables and researchers are investigating the relationships among them, multivariate analysis is the right type of statistical approach.
12. How would you go about measuring the business performance of our company, and what information do you think would be most important to consider?
Before appearing for an interview, make sure you study the company thoroughly and gain enough knowledge about it. It will leave an impression on the employer regarding your interest and enthusiasm to work with them. Also, your answer should mention the added value you would bring to the company by improving its business performance.
13. What do you think are the three best qualities that great data analysts share?
List down some of the most critical qualities of a Data Analyst. This may include problem-solving, research, and attention to detail. Apart from these qualities, do not forget to mention soft skills, which are necessary to communicate with team members and across the department.
Did we miss any Data Analyst interview questions?
Share with us in the comments below and help each other to ace the next data analyst job.
Data Science Dojo is offering Apache Superset for FREE on Azure Marketplace packaged with pre-installed SQL lab and interactive visualizations to get started.
What is Business Intelligence?
Business Intelligence (BI) is built on the idea of using data to drive action. It aims to give business leaders actionable insights through data processing and analytics. For instance, a business analyzes its KPIs (Key Performance Indicators) to identify its strengths and weaknesses, so decision-makers can determine in which departments the organization can work to increase efficiency.
Recently, two elements in BI have driven dramatic improvements in metrics like speed and efficiency. The two elements are:
Automation
Data Visualization
Apache Superset focuses largely on the latter, which has changed the course of business insights.
But what were the challenges faced by analysts before there were popular exploratory tools like Superset?
Pro Tip: Join our 6-months instructor-led Data Science Bootcamp to master data science.
Challenges of Data Analysts
Scalability, framework compatibility, and absence of business-explicit customization were a few challenges faced by data analysts. Apart from that exploring petabytes of data and visualizing it would cause the system to collapse or hang at times.
In these circumstances, a tool having the ability to query data as per business needs and envision it in various diagrams and plots was required. Additionally, a system scalable and elastic enough to handle and explore large volumes of data would be an ideal solution.
Data Analytics with Superset
Apache Superset is an open-source tool that equips you with a web-based environment for interactive data analytics, visualization, and exploration. It provides a vast collection of different types of vibrant and interactive visualizations, charts, and tables. It can customize the layouts and the dynamic dashboard elements along with quick filtering, making it flexible and user-friendly. Apache Superset is extremely beneficial for businesses and researchers who want to identify key trends and patterns from raw data to aid in the decision-making process.
It is a powerhouse of SQL: it not only allows connections to several databases but also provides an in-browser SQL editor called SQL Lab.
Key attributes
Superset delivers an interactive UI that enriches the plots, charts, and other diagrams. You can customize your dashboard and canvas as per requirement. The hover feature and side-by-side layout make it coherent
An open-source easy-to-use tool with a no-code environment. Drag and drop and one-click alterations make it more user-friendly
Contains a powerful built-in SQL editor to query data from any database quickly
The choice to select from various databases like Druid, Hive, MySQL, SparkSQL, etc., and the ability to connect additional databases makes Superset flexible and adaptable
In-built functionality to create alerts and notifications by setting specific conditions at a particular schedule
Superset provides a section about managing different users and their roles and permissions. It also has a tab for logging the ongoing events
What does Data Science Dojo have for you
The Superset instance packaged by Data Science Dojo serves as a web-accessible, no-code environment with a wide range of analysis capabilities and no installation burden. It includes many sample chart and dataset projects to get started. In our service, users can customize dashboards and canvases as per business needs.
It comes with drag-and-drop feasibility which makes it user-friendly and easy to use. Users can create different visualizations to detect key trends in any volume of data.
What is included in this offer:
A VM configured with a web-accessible Superset application
Many sample charts and datasets to get started
In-browser optimized SQL editor called SQL Lab
User access and roles manager
Alert and report feature
Feasibility of drag and drop
Built-in functionality for event logging
Our instance supports the following major databases:
Druid
Hive
SparkSQL
MySQL
PostgreSQL
Presto
Oracle
SQLite
Trino
Apart from these, any data engine that has a Python DB-API driver and a SQLAlchemy dialect can be connected.
Conclusion
The resource demands of exploring and visualizing large volumes of data were one area of concern when working in traditional desktop environments. The other was ad-hoc SQL querying of data across different database connections. With our Superset instance, both concerns are put to rest.
When coupled with Microsoft cloud services and processing speed, it outperforms its traditional counterparts since data-intensive computations aren’t performed locally but in the cloud. It has a lightweight semantic layer and is designed as a cloud-native architecture.
At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Superset instance dedicated specifically to Data Science & Analytics on Azure Marketplace. Now hurry up and avail this offer by Data Science Dojo, your ideal companion in your journey to learn data science!
Click on the button below to head over to the Azure Marketplace and deploy Apache Superset for FREE by clicking on “Get it now”.
Note: You’ll have to sign up to Azure, for free, if you do not have an existing account.
Marketing analytics tells you about the most profitable marketing activities of your business. The more effectively you target the right people with the right approach, the greater value you generate for your business.
However, it is not always clear which of your marketing activities are effective at bringing value to your business. This is where marketing analytics comes in.
It guides you to use the data to evaluate your marketing campaign. It helps you identify which of your activities are effective in engaging with your audience, improving user experience, and driving conversions.
Data-driven marketing is imperative for optimizing your campaigns to generate a net positive value from all your marketing activities in real time. Without analyzing your marketing data and customer journey, you cannot identify what you are doing right and what you are doing wrong when engaging with potential customers. The six features listed below can give you the start you need to begin analyzing and optimizing your marketing strategy using marketing analytics.
1. Impressions
In digital marketing, impressions are the number of times any piece of your content has been shown on a person's screen, whether an ad, a social media post, or a video. However, it is important to remember that impressions are not views. A view is an engagement: any time somebody watches your video, that is a view. An impression, by contrast, is counted any time your video appears in the recommended videos on YouTube or in a Facebook newsfeed, regardless of whether the person watches it.
Learn more about impressions in this video
It is also important to distinguish between impressions and reach. Reach is the number of unique viewers, so for example if the same person views your ad three times, you will have three impressions but a reach of one.
Impressions and reach are important in understanding how effective your content was at gaining traction. However, these metrics alone are not enough to gauge how effective your digital marketing efforts have been, neither impressions nor reach tell you how many people engaged with your content. So, tracking impressions is important, but it does not specify whether you are reaching the right audience.
2. Engagement rate
In social media marketing, engagement rate is an important metric. Engagement is when a user comments, likes, clicks, or otherwise interacts with any of your content. Engagement rate is a metric that measures the amount of engagement of your marketing campaign relative to each of the following:
Reach
Post
Impressions
Days
Views
Engagement rate by reach is the percentage of people who chose to interact with the content after seeing it: total engagements divided by reach, multiplied by 100. Reach is a more accurate denominator than follower count, because not all of your brand's followers may see the content, while people who do not follow your brand may still be exposed to it.
Engagement rate by post is the rate at which followers engage with the content. This metric shows how engaged your followers are with your content. However, this metric does not account for organic reach and as your follower count goes up your engagement by post goes down.
Engagement rate by impressions is the rate of engagement relative to the number of impressions. If you are running paid ads for your brand, engagement rate by impressions can be used to gauge your ads' effectiveness.
Average Daily engagement rate tells you how much your followers are engaging with your content daily. This is suitable for specific use cases for instance, when you want to know how much your followers are commenting on your posts or other content.
Engagement rate by views gives the percentage of people who chose to engage with your video after watching it. This metric, however, does not use unique views, so it may double- or triple-count views from a single user.
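The variants above all reduce to simple ratios. A sketch with hypothetical campaign numbers:

```python
# Hypothetical campaign numbers (illustrative only)
engagements = 450      # likes + comments + clicks + shares
reach = 9000           # unique viewers
impressions = 15000    # total times the content was shown
followers = 12000
views = 6000

rate_by_reach = engagements / reach * 100
rate_by_impressions = engagements / impressions * 100
rate_by_post = engagements / followers * 100   # per-post rate vs. follower count
rate_by_views = engagements / views * 100

print(f"by reach: {rate_by_reach:.1f}%")              # 5.0%
print(f"by impressions: {rate_by_impressions:.1f}%")  # 3.0%
print(f"by post: {rate_by_post:.2f}%")                # 3.75%
print(f"by views: {rate_by_views:.1f}%")              # 7.5%
```

Note how the same 450 engagements yield quite different rates depending on the denominator, which is why it matters to state which variant you are reporting.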
Learn more about engagement rate in this video
3. Sessions
Sessions are another especially important metric in marketing campaigns that help you analyze engagement on your website. A session is a set of activities by a user within a certain period. For example, a user spent 10 minutes on your website, loading pages, interacting with your content and completed an interaction. All these activities will be recorded in the same 10-minute session.
In Google Analytics, you can use sessions to check how much time a user spent on your website (session length), how many times they returned to your website (number of sessions), and what interactions users had with your website. Tracking sessions can help you determine how effective your campaigns were in directing traffic towards your website.
If you have an e-commerce website, another very helpful tool in Google Analytics is behavioral analytics, which shows what key actions are driving purchases on your website. The sessions report can be accessed under the conversions tab in Google Analytics. This report can help you understand user behaviors such as abandoned carts, allowing you to target those users with ads or offer incentives to complete their purchase.
Learn more about sessions in this video
4. Conversion rate
Once you have engaged your audience, the next step in the customer's journey is conversion. A conversion is when a customer or user completes a specific desired action, anything from a form submission to purchasing a product or subscribing to a service. The conversion rate is the percentage of visitors who completed the desired action.
So, if you have a form on your website and want to find its conversion rate, simply divide the number of form submissions by the number of visitors to that form's page (total conversions / total interactions).
Conversion rate is a very important metric that helps you assess the quality of your leads. While you may generate a large number of leads or visitors, if you cannot get them to perform the desired action you may be targeting the wrong audience. Conversion rate can also help you gauge how effective your conversion strategy is, if you aren’t converting visitors, it might indicate that your campaign needs optimization.
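The calculation itself is a one-liner; the figures below are hypothetical:

```python
def conversion_rate(conversions, total_visitors):
    """Percentage of visitors who completed the desired action."""
    if total_visitors == 0:
        return 0.0
    return conversions / total_visitors * 100

# Hypothetical form page: 120 submissions out of 4,000 visitors
print(conversion_rate(120, 4000))  # 3.0
```

The zero-visitor guard matters in practice when new pages report before receiving any traffic.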
5. Attribution
Attribution is a sophisticated model that helps you measure which channels are generating the most sales opportunities or conversions. It helps you assign credit to specific touchpoints on the customers journey and understand which touchpoints are driving conversions the most. But how do you know which touchpoint to attribute to a specific conversion? Well, that depends on which attribution models you are using. There are four common attribution models.
First touch attribution models assign all the credit to the first touchpoint that drove the prospect to your website. It focuses on the top of the marketing efforts funnel and tells you what is attracting people to your brand
Last touch attribution models assign credit to the last touchpoint. It focuses on the last touchpoint the visitor interacted with before they converted.
Linear attribution model assigns an equal weight to all the touchpoints in the buyer’s journey.
Time decay attributions is based on how close the touchpoint is to the conversion, where a weighted percentage is assigned to the most recent touchpoints. This can be used when the buying cycle is relatively short.
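A small Python sketch of how the four models assign credit; the touchpoint names are hypothetical:

```python
# Hypothetical buyer journey: ordered touchpoints before one conversion
touchpoints = ["organic search", "email", "retargeting ad", "direct visit"]

def first_touch(tps):
    # All credit to the touchpoint that first attracted the prospect
    return {tps[0]: 1.0}

def last_touch(tps):
    # All credit to the final touchpoint before conversion
    return {tps[-1]: 1.0}

def linear(tps):
    # Equal weight across the whole journey
    share = 1.0 / len(tps)
    return {tp: share for tp in tps}

def time_decay(tps, decay=0.5):
    # More recent touchpoints get exponentially more weight
    raw = [decay ** (len(tps) - 1 - i) for i in range(len(tps))]
    total = sum(raw)
    return {tp: w / total for tp, w in zip(tps, raw)}

print(linear(touchpoints))      # each touchpoint gets 0.25 credit
print(time_decay(touchpoints))  # "direct visit" gets the largest share
```

The `decay` factor here is an assumed parameter; real analytics tools fix their own decay schedule (often a 7-day half-life).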
Which model you use depends on what product or subscription you are selling and the length of your buyer cycle. While attribution is very important in identifying the effectiveness of your channels, to get the complete picture you need to look at how each touchpoint drives conversion.
Learn more about attribution in this video
6. Customer lifetime value
Businesses prefer retaining customers over acquiring new ones, and one of the main reasons is that attracting new customers has a cost. The customer acquisition cost is the total cost that you incur as a business acquiring a customer. The customer acquisition cost is calculated by dividing the marketing and sales cost by the number of new customers.
Learn more about CLV in this video
So, as a business, you must weigh the value of each customer with the associated acquisition cost. This is where the customer lifetime value or CLV comes in. The Customer lifetime value is the total value of your customer to your business during the period of your relationship.
The CLV also helps you forecast revenue: the larger your average CLV, the better your forecasted revenue will be. A simple way to estimate CLV is to multiply the average annual revenue generated per customer by the average retention period (in years). If your CAC is higher than your CLV, then you are, on average, losing money on every customer you acquire.
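Putting CAC and the simple CLV estimate together, with hypothetical figures:

```python
# Hypothetical annual figures (illustrative only)
marketing_and_sales_cost = 50_000.0
new_customers = 250

avg_annual_revenue_per_customer = 400.0
avg_retention_years = 3.0

# Customer acquisition cost: total spend / customers acquired
cac = marketing_and_sales_cost / new_customers

# Simple CLV estimate: annual revenue per customer x retention period
clv = avg_annual_revenue_per_customer * avg_retention_years

print(f"CAC = ${cac:.0f}, CLV = ${clv:.0f}")
if cac > clv:
    print("On average you lose money on every customer acquired")
```

This is the simplest CLV formula; more refined models discount future revenue and factor in gross margin rather than raw revenue.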
This presents a huge problem. Metrics like CAC and CLV are very important for driving revenue. They help you identify high-value customers and identify low value customers so you can understand how to serve these customers better. They help you make more informed decisions regarding your marketing effort and build a healthy customer base.
Integrate marketing analytics into your business
Marketing analytics is a vast field. There is no one method that suits the needs of all businesses. Using data to analyze and drive your marketing and sales effort is a continuous effort that you will find yourself constantly improving upon. Furthermore, finding the right metrics to track that have a genuine impact on your business activities is a difficult task.
So, this list is by no means exhaustive, however the features listed here can give you the start you need to analyze and understand what actions are important in driving engagement, conversions and eventually value for your business.
Data is growing at an exponential rate in the world. It is estimated that the world will generate 181 zettabytes of data by 2025. With this increase, we are also seeing an increase in demand for data-driven techniques and strategies.
According to Forbes, 95% of businesses expressed the need to manage unstructured data as a problem for their business. In fact, Business Analytics vs Data Science is one of the hottest debates among data professionals nowadays.
Many people might wonder: what is the difference between Business Analytics and Data Science? Or which one should they choose as a career path? If you are one of those people, keep reading to learn more about both fields!
First, we need to understand what both these fields are. Let’s take a look.
What is Business Analytics?
Business Analytics is the process of deriving insights from business data to inform business decisions. By collecting and analyzing data, it gives organizations the insight they need to make better decisions, optimize processes, and improve productivity.
It also helps in identifying potential risks, opportunities, and threats. Business Analytics is an important part of any organization’s decision-making process. It is a combination of different analytical activities like data exploration, data visualization, data transformation, data modeling, and model validation. All of this is done by using various tools and techniques like R programming, machine learning, artificial intelligence, data mining, etc.
Business analytics is a very diverse field that can be used in every industry. It can be used in areas like marketing, sales, supply chain, operations, finance, technology and many more.
Now that we have a good understanding of what Business Analytics is, let’s move on to Data Science.
What is Data Science?
Data science is the process of discovering new information, knowledge, and insights from data. Data scientists apply different machine-learning algorithms to any form of data, from numbers and text to images, videos, and audio, to draw understanding from it. Data science is all about exploring data to identify hidden patterns and make decisions based on them.
It involves implementing the right analytical techniques and tools to transform the data into something meaningful. It is not just about storing data in the database or creating reports about the same. Data scientists collect and clean the data, apply machine learning algorithms, create visualizations, and use data-driven decision-making tools to create an impact on the organization.
Data scientists use tools like programming languages, database management, artificial intelligence, and machine learning to clean, visualize, and explore the data.
What is the difference between Business Analytics and Data Science?
Technically, Business analytics is a subset of Data Science. But the two terms are often used interchangeably because of the lack of a clear understanding among people. Let’s discuss the key differences between Business Analytics and Data Science. Business Analytics focuses on creating insights from existing data for making better business decisions.
Data Science, by contrast, focuses on creating insights from new data by applying the right analytical techniques. Business Analytics is a more established field that combines analytical activities like data transformation, modeling, and validation, while Data Science is a relatively new field that is evolving every day. Business Analytics takes a hands-on approach to managing and interpreting existing data, whereas Data Science focuses more on building models and algorithms on top of the data.
Both the fields also differ a bit in their required skills. Business Analysts mostly use Interpretation, Data visualization, analytical reasoning, statistics, and written communication skills to interpret and communicate their work. Whereas Data Scientists utilize statistical analysis, programming skills, machine learning, calculus and algebra, and data visualization to perform most of their work.
Which should one choose?
Business analytics is a well-established field, whereas data science is still evolving. If you are inclined towards decisive and logical skills with little or no programming knowledge or computer science skills, you can take up Business Analytics. It is a beginner-friendly domain and is easy to catch on to.
But if you are interested in programming and are familiar with machine learning algorithms or even interested in data analysis, you can opt for Data Science. We hope this blog answers your questions about the differences between the two similar and somewhat overlapping fields and helps you make the right data-driven and informed decision for yourself!
Looking at the right event metrics not only helps us in gauging the success of the current event but also facilitates understanding the audience’s behavior and preferences for future events.
Creating, managing, and organizing an event seems like a lot of work and surely it is. The job of an event manager is no doubt a hectic one, and the job doesn’t end once the event is complete. After every event, analyzing it is a crucial task to continuously improve and enhance the experience for your audience and presenters.
In a world completely driven by data, if you are not measuring your events, you are surely missing out on a lot. The questions arise: how do you get started, and what metrics should you look for? The post-Covid world has adopted the culture of virtual events, which not only allows organizers to gather audiences globally but also makes events easier to measure.
There are several platforms and tools available for collecting the data, or if you are hosting it through social media then you can easily use the analytics tool of that channel. You can view our Marketing Analytics videos to better understand the analytical tools and features of each platform.
You can take the assistance of tools and platforms to collect the data but utilizing that data to come up with insightful findings and patterns is a critical task. You need to hear the story your data is trying to tell and understand the patterns in your events.
Event metrics that you should look at
1. RSVP to attendance rate
RSVPs are the number of people who sign up for your event (through landing pages or social sites), while attendance is the number of people who actually show up; the ratio of the two is your RSVP-to-attendance rate.
You should expect at least 30% of your RSVPs to actually attend; if they don't, something is wrong. Possible reasons include:
The procedure for joining the event is not provided or clarified
They forgot about the event as they signed up long before
The information provided regarding the event day or date is wrong
Or any of many other likely reasons. You need to dig into each channel to find the cause, because when a person signs up, it shows a clear intent to attend on their end.
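Computing the rate itself is straightforward; the webinar numbers below are hypothetical:

```python
def rsvp_to_attendance_rate(attendees, rsvps):
    """Share of sign-ups who actually showed up, as a percentage."""
    if rsvps == 0:
        return 0.0
    return attendees / rsvps * 100

# Hypothetical webinar: 600 RSVPs, 210 attendees
rate = rsvp_to_attendance_rate(210, 600)
print(f"{rate:.0f}%")  # 35%
```

Tracking this per channel (landing page vs. each social site) is what lets you pinpoint where sign-ups are failing to convert into attendees.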
2. Retention rate
A few channels, such as LinkedIn and YouTube, have built-in analytics to gauge retention rate, but you can always integrate third-party tools for other platforms. The retention rate depicts how long your audience stayed in your webinar and the points where they dropped off.
It is usually shown as a line graph with the duration of the webinar on the x-axis and the number of viewers on the y-axis, so you can see how many people were watching at any point in the webinar. Through this chart, you can spot the points where views drop or rise.
Use case: at Data Science Dojo, our webinars experienced a huge drop in audience during the first 5 minutes. It was worrisome for the team, so we dug in and conducted a critical analysis of our webinars. We realized this was happening because we usually spend the first 5 minutes waiting for the audience to join, and that is exactly when our existing audience started leaving.
We decided to fill those 5 minutes with engaging activities such as a poll and initiated conversations with our audience directly through chat. This improved our overall retention, as our audience felt more connected and stayed longer. You can explore our webinars here.
3. Demographics of audience
It pays to know where your audience comes from. To make more targeted decisions in the future, every business should understand its audience demographics and what type of people find its events beneficial.
Working through the demographics helps with future events. For example, you can select a time that is viable in your audience's time zone, and a topic they are more interested in.
The demographics data opens many new avenues for your business, it introduces you to segments of your audience that you might not be targeting already, and you can expand your business. It shows the industries, locations, seniority, and many other crucial factors about your audience.
By analyzing this data, you can also see whether your content is attracting the right target audience; if not, you can determine what kind of audience you are pulling in and whether that is beneficial for your business.
4. Engagement rate
Your event might receive a large number of views, but if that audience is not engaging with your content, you should be concerned. The engagement rate shows how involved your audience is. Today's audiences face many distractions, especially during online events, so grasping and holding their attention is a major task.
The more engaged the audience is, the higher the chance that they will benefit from it and come back to you for other services. There are several techniques to keep your audience engaged, you can look up a few engagement activities to build connections.
Make your event a success with event metrics
On that note, if you have just hosted an event or have an event on your calendar, you know what you need to look at. These metrics will help you continuously improve your event’s quality to match the audience’s expectations and requirements. Planning your strategies based on data will help you stay relevant to your audience and trends.
This blog is based on some exploratory data analysis performed on the corpora provided for the “Spooky Author Identification” challenge at Kaggle.
The Spooky Challenge
A Halloween-themed challenge [1] with the following goal: use data analysis to predict which of Edgar Allan Poe, HP Lovecraft, and Mary Wollstonecraft Shelley wrote a given sentence of a possible spooky story.
“Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before.” Edgar Allan Poe
“That is not dead which can eternal lie, And with strange eons, even death may die.” HP Lovecraft
“Life and death appeared to me ideal bounds, which I should first break through, and pour a torrent of light into our dark world.” Mary Wollstonecraft Shelley
The toolset for data analysis
The only tools available to us during this exploration will be our intuition, curiosity, and the selected packages for data analysis. Specifically:
tidytext package, text mining for word processing and sentiment analysis using tidy tools
tidyverse package, an opinionated collection of R packages designed for data science
wordcloud package, pretty word clouds
gridExtra package, supporting functions to work with grid graphics
caret package, supporting function for performing stratified random sampling
corrplot package, a graphical display of a correlation matrix and confidence intervals
# Required libraries
# if packages are not installed:
# install.packages(c("tidytext", "tidyverse", "wordcloud",
#                    "gridExtra", "caret", "corrplot"))
library(tidytext); library(tidyverse); library(wordcloud)
library(gridExtra); library(caret); library(corrplot)
The beginning of the exploratory data analysis journey: The Spooky data
We are given a CSV file, the train.csv, containing some information about the authors. The information consists of a set of sentences written by different authors (EAP, HPL, MWS). Each entry (line) in the file is an observation providing the following information:
an id, a unique id for the excerpt/sentence (as a string)
the text, the excerpt/sentence (as a string)
the author, the author of the excerpt/sentence (as a string), a categorical feature with three possible values: EAP for Edgar Allan Poe, HPL for HP Lovecraft, and MWS for Mary Wollstonecraft Shelley
# loading the data using readr package
spooky_data <- readr::read_csv(file = "./../../../data/train.csv",
col_types = "ccc",
locale = locale("en"),
na = c("", "NA"))
# readr::read_csv does not transform string into factor
# as the "author" feature is categorical by nature
# it is transformed into a factor
spooky_data$author <- as.factor(spooky_data$author)
The overall data includes 19579 observations with 3 features (id, text, author): specifically, 7900 excerpts (40.35%) by Edgar Allan Poe, 5635 excerpts (28.78%) by HP Lovecraft, and 6044 excerpts (30.87%) by Mary Wollstonecraft Shelley.
We must not use all of the provided spooky data to find our way through the unique spookiness of each author.
We still want to evaluate how our intuitions generalize to an unseen excerpt/sentence, right?
For this reason, the given training data is split into two parts (using stratified random sampling)
an actual training dataset (70% of the excerpts/sentences), used for
exploration and insight creation, and
training the classification model
a test dataset (the remaining 30% of the excerpts/sentences), used for
evaluation of the accuracy of our model.
# setting the seed for reproducibility
set.seed(19711004)
trainIndex <- caret::createDataPartition(spooky_data$author, p = 0.7, list = FALSE, times = 1)
spooky_training <- spooky_data[trainIndex,]
spooky_testing <- spooky_data[-trainIndex,]
The training dataset specifically contains 5530 excerpts (40.35%) by Edgar Allan Poe, 3945 excerpts (28.78%) by HP Lovecraft, and 4231 excerpts (30.87%) by Mary Wollstonecraft Shelley.
Moving our first steps: from darkness into the light
Before we start building any model, we need to understand the data, build intuitions about the information contained in the data, and identify a way to use those intuitions to build a great predicting model.
Is the provided data usable?
Question: Does each observation have an id? An excerpt/sentence associated with it? An author?
Some excerpts are very long. As we can see from the boxplot above, there are a few outliers for each author; a possible explanation is that the sentence segmentation has a few hiccups (see details below):
For example, Mary Wollstonecraft Shelley (MWS) has an excerpt of around 4600 characters:
“Diotima approached the fountain seated herself on a mossy mound near it and her disciples placed themselves on the grass near her Without noticing me who sat close under her she continued her discourse addressing as it happened one or other of her listeners but before I attempt to repeat her words I will describe the chief of these whom she appeared to wish principally to impress One was a woman of about years of age in the full enjoyment of the most exquisite beauty her golden hair floated in ringlets on her shoulders her hazle eyes were shaded by heavy lids and her mouth the lips apart seemed to breathe sensibility But she appeared thoughtful unhappy her cheek was pale she seemed as if accustomed to suffer and as if the lessons she now heard were the only words of wisdom to which she had ever listened The youth beside her had a far different aspect his form was emaciated nearly to a shadow his features were handsome but thin worn his eyes glistened as if animating the visage of decay his forehead was expansive but there was a doubt perplexity in his looks that seemed to say that although he had sought wisdom he had got entangled in some mysterious mazes from which he in vain endeavoured to extricate himself As Diotima spoke his colour went came with quick changes the flexible muscles of his countenance shewed every impression that his mind received he seemed one who in life had studied hard but whose feeble frame sunk beneath the weight of the mere exertion of life the spark of intelligence burned with uncommon strength within him but that of life seemed ever on the eve of fading At present I shall not describe any other of this groupe but with deep attention try to recall in my memory some of the words of Diotima they were words of fire but their path is faintly marked on my recollection It requires a just hand, said she continuing her discourse, to weigh divide the good from evil On the earth they are inextricably entangled and if you would cast away what 
there appears an evil a multitude of beneficial causes or effects cling to it mock your labour When I was on earth and have walked in a solitary country during the silence of night have beheld the multitude of stars, the soft radiance of the moon reflected on the sea, which was studded by lovely islands When I have felt the soft breeze steal across my cheek as the words of love it has soothed cherished me then my mind seemed almost to quit the body that confined it to the earth with a quick mental sense to mingle with the scene that I hardly saw I felt Then I have exclaimed, oh world how beautiful thou art Oh brightest universe behold thy worshiper spirit of beauty of sympathy which pervades all things, now lifts my soul as with wings, how have you animated the light the breezes Deep inexplicable spirit give me words to express my adoration; my mind is hurried away but with language I cannot tell how I feel thy loveliness Silence or the song of the nightingale the momentary apparition of some bird that flies quietly past all seems animated with thee more than all the deep sky studded with worlds” If the winds roared tore the sea and the dreadful lightnings seemed falling around me still love was mingled with the sacred terror I felt; the majesty of loveliness was deeply impressed on me So also I have felt when I have seen a lovely countenance or heard solemn music or the eloquence of divine wisdom flowing from the lips of one of its worshippers a lovely animal or even the graceful undulations of trees inanimate objects have excited in me the same deep feeling of love beauty; a feeling which while it made me alive eager to seek the cause animator of the scene, yet satisfied me by its very depth as if I had already found the solution to my enquires sic as if in feeling myself a part of the great whole I had found the truth secret of the universe But when retired in my cell I have studied contemplated the various motions and actions in the world the weight of evil has 
confounded me If I thought of the creation I saw an eternal chain of evil linked one to the other from the great whale who in the sea swallows destroys multitudes the smaller fish that live on him also torment him to madness to the cat whose pleasure it is to torment her prey I saw the whole creation filled with pain each creature seems to exist through the misery of another death havoc is the watchword of the animated world And Man also even in Athens the most civilized spot on the earth what a multitude of mean passions envy, malice a restless desire to depreciate all that was great and good did I see And in the dominions of the great being I saw man reduced?”
Thinking Point: “What do we want to do with those excerpts/outliers?”
Some more facts about the excerpts/sentences using the bag-of-words
The data is transformed into a tidy format (unigrams only) to use the tidy tools to perform some basic and essential NLP operations.
Each sentence is tokenized into words (normalized to lower case, punctuation removed). See the example below of how the data (each excerpt/sentence) looked before and after the transformation.
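The analysis itself uses R's tidytext (`unnest_tokens`) for this step; purely as a language-agnostic illustration, the same normalization (lowercasing, punctuation removal, one word per row) can be sketched in Python:

```python
import re

def tokenize(excerpt_id, text):
    """One (id, word) row per token, lowercased, punctuation stripped."""
    words = re.findall(r"[a-z']+", text.lower())
    return [(excerpt_id, word) for word in words]

# A made-up excerpt id, with a sentence quoted earlier in this post
rows = tokenize("id00001", "Deep into that darkness peering, long I stood there.")
# rows[0] -> ("id00001", "deep")
```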
Question: Which are the most common words used by each author?
Let's start by counting how many times each word has been used by each author, and plot the results.
From this initial visualization we can see that the authors quite often use the same set of words, like "the", "and", and "of". These words give no real information about the vocabulary actually used by each author; they are common words that represent just noise when working with unigrams, and they are usually called stopwords.
If the stopwords are removed, using the list of stopwords provided by the tidytext package, it is possible to see that the authors do actually use different words more frequently than others (and it differs from author to author, the author vocabulary footprint).
Another way to visualize the most frequent words by author is to use wordclouds. Wordclouds make it easy to spot differences, the importance of each word matches its font size and color.
From the word clouds, we can infer that EAP loves to use the words time, found, eyes, length, day, etc.; HPL loves to use the words night, time, found, house, etc.; and MWS loves to use the words life, time, love, eyes, etc.
A comparison cloud can be used to compare the different authors. From the R documentation
‘Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents (∑_i p_{i,j}/ndocs). The size of each word is mapped to its maximum deviation (max_i(p_{i,j} - p_j)), and its angular position is determined by the document where that maximum occurs.’
See below the comparison cloud between all authors:
From the plot above we can see that, for the corpora provided for EAP and HPL, we need circa 7500 words to cover 90% of word instances, while for the MWS corpus circa 5000 words are needed.
Question: Is there any commonality between the dictionaries used by the authors?
Are the authors using the same words? A commonality cloud can be used to answer this specific question: it emphasizes the similarities between authors and plots a cloud showing the words common to all of them. It shows only those words that are used by all authors, with their combined frequency across authors.
See below the commonality cloud between all authors.
Then we need to spread the author (key) and the word frequency (value) across multiple columns (note how NAs have been introduced for words not used by an author).
Let’s start to plot the word frequencies (log scale) comparing two authors at a time and see how words distribute on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts. While words that are far from the line are words that are found more in one set of texts than another.
As we can see in the plots below, some words sit close to the line, but most are scattered around it, showing differences between the frequencies.
# Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_HPL <- word_freqs %>%
  dplyr::select(word, EAP, HPL) %>%
  dplyr::filter(!is.na(EAP) & !is.na(HPL))

ggplot(data = word_freqs_EAP_vs_HPL,
       mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "HP Lovecraft", x = "Edgar Allan Poe")
# Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_MWS <- word_freqs %>%
  dplyr::select(word, EAP, MWS) %>%
  dplyr::filter(!is.na(EAP) & !is.na(MWS))

ggplot(data = word_freqs_EAP_vs_MWS,
       mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "Edgar Allan Poe")
# Removing incomplete cases - not all words are common for the authors
# when spreading words to all authors - some will get NAs (if not used
# by an author)
word_freqs_HPL_vs_MWS <- word_freqs %>%
  dplyr::select(word, HPL, MWS) %>%
  dplyr::filter(!is.na(HPL) & !is.na(MWS))

ggplot(data = word_freqs_HPL_vs_MWS,
       mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")
To quantify how similar or different these sets of word frequencies by author are, we can calculate a correlation (Pearson, for linearity) between the sets. There is a correlation of around 0.48 to 0.5 between the different authors (see plot below).
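For reference, the Pearson coefficient being computed can be sketched from first principles (any stats package, R's `cor()` included, does this for you); the frequency vectors below are toy values, not the actual corpus figures:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length frequency vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy word-frequency vectors for two hypothetical authors
r = pearson([0.010, 0.002, 0.005], [0.008, 0.001, 0.006])
```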
[1] Kaggle challenge: Spooky Author Identification
[2] “Text Mining with R: A Tidy Approach” by J. Silge & D. Robinson, O’Reilly 2017
[3] “Regular Expressions, Text Normalization, and Edit Distance”, draft chapter by D. Jurafsky & J. H. Martin, 2018
From customer relationship management to tracking analytics, marketing analytics tools are important in the modern world. Learn how to make the most of these tools.
What do you usually find in a toolbox? A hammer, screwdriver, nails, tape measure? If you’re building a bird house, these would be perfect for you, but what if you’re creating a marketing campaign? What tools do you want at your disposal? It’s okay if you can’t come up with any. We’re here to help.
Industry’s leading marketing analytics tools
These days marketing is all about data. Whether it’s a click on an email or an abandoned cart on Amazon, marketers are using data to better cater to the needs of the consumer. To analyze and use this data, marketers have a toolbox of their own.
So what are some of these tools and what do they offer? Here, at Data Science Dojo, we’ve come up with our top 5 marketing analytics tools for success:
Customer relationship management platform (CRM)
A CRM is a tool for managing everything there is to know about the customer. It can track where and when a consumer visits your site, track interactions on your site, and create profiles for leads. A few examples of CRMs are:
HubSpot, along with the two others listed above, took the idea of a CRM and made it into an all-inclusive marketing resort. Along with the traditional CRM uses, HubSpot can be used to:
Manage social media
Send mass email campaigns
View traffic, campaign, and customer analytics
Associate emails, blogs, and social media posts to specific marketing campaigns
Create workflows and sequences
Connect to your other analytics tools such as Google Analytics, Facebook Ads, YouTube, and Slack.
HubSpot continues its effectiveness by creating reports allowing its users to analyze what is and isn’t working.
This is just a brief description revealing the tip of the iceberg of what HubSpot does. If you want to see below the water line, visit its website.
Search software
Search engine optimization (SEO) is the process of improving a website's ranking on search engines. It's how you find everything you have ever searched for on Google. Search software helps marketers analyze how best to optimize websites so potential consumers can find them.
I would love to describe each one of the above businesses, but I only have experience with Moz. Moz focuses on a “less invasive way (of marketing) where customers are earned rather than bought”.
Its entire business is focused on upgrading your SEO. Moz offers 9 different services through its Moz Pro toolkit:
I love Moz Keyword Explorer. This is the tool I use to check different variations of titles, keywords, phrases, and hashtags. It gives four different scores, which you can see in the photo below.
Now, there isn't enough data to show the average monthly volume for my name, but, according to Moz, it wouldn't be difficult to rank higher than my competitors, and people have a high likelihood of clicking. The Priority score explains that my name is not a “sweet spot” of high volume, low difficulty, and high CTR. In conclusion, using my name as a keyword to optimize the Data Science Dojo blog isn't the best idea.
We can't talk about marketing tools and not mention web analytics services. These are among the most important pieces of equipment in the marketer's toolbox. Google Analytics (GA) is a free web analytics service that integrates your company's website data into a meticulously organized dashboard. I wouldn't say GA is the be-all and end-all piece of equipment, and there are many other services and tools out there; however, it can't be refuted that Google Analytics is a great tool to integrate into your company's marketing strategy.
Some of the analytics you’ll be able to understand are
Real-time data – Who’s on your site right now? Where are the users coming from? What pages are they looking at?
Audience Information – Where do your users live, age range, interests, gender, new or returning visitor, etc.?
Acquisition – Where did they come from (Organic, Direct, Paid Ads, Referrals, Campaigns)? What day/time do they land on your website? What was the final URL they visited before leaving? You can also link to any Google Ads campaigns you have running.
Behavior – What is the path people take to convert? How is your site speed? What events took place (Contact form submission, newsletter signup, social media share)?
Conversions – Are you attributing conversions by first touch, last touch, linear, or decay?
Understanding these metrics is amazingly effective in narrowing down how users interact with your website.
Another way to integrate Google Analytics into your marketing strategy is by setting up goals. Goals are set up to track specific actions taken on your website. For example, you can set up goals to track purchases, newsletter signups, video plays, live chat, and social media shares.
If you want a more in-depth look at what Google Analytics can offer, you can learn the basics through their Analytics Academy.
Analysis and feedback platform (A&F)
A&Fs are another great piece of equipment in the marketer’s toolbox; more specifically for looking at how users are interacting on your website. One such A&F, HotJar, does this in the form of heatmaps and recordings. HotJar’s integrated tracking pixel allows you to see how far users scroll on your website and what items were clicked the most.
You can also watch recordings of a user’s experience and even filter down to the URL of the page you wish to track, (i.e. /checkout/). This allows you to capture the user’s unique journey until they make a purchase. For each recording, you can view audience information such as geographical location, country, browser, operating system, and a documented list of user actions.
In addition to UX/UI metrics, you can also integrate polls and forms on your website for more intricate data about your users.
As a marketing manager, these tools help to visualize all of my data in ways that a pivot table can’t display. And while I am a genuine user of these platforms, I must admit that it’s not the tool that makes the man, it’s the strategy. To get the most use out of these platforms, you will need to understand what business problem you are trying to solve and what metrics are important to you.
There is a lot of information that these dashboards can provide you. However, it’s up to you to filter through the noise. Not every accessible metric applies to you, so you will need to decide what is the most important for your marketing plan.
Experimentation platforms are software for testing different variations of a sample. Their purpose is to run A/B tests, something HubSpot also does, but these platforms dive head first into them.
Where HubSpot only tests versions A and B, experimentation platforms let you test versions A, B, C, D, E, F, and so on. They don't just test the different versions; they also test different audiences and how each responds to each variation. Searching "definition experimentation platforms" is a good place to start in understanding what they are. I can tell you they are a dream come true for marketers who love to get their hands dirty in behavioral targeting.
Optimizely is one such example of a company offering in-depth A/B testing. Optimizely’s goal is to let you spend more time experimenting with the customer experience and less time wading through statistics to learn what works and what doesn’t. If you are unsure what to do, you can test it with Optimizely.
Using companies like Optimizely or Split is just one way to experiment. Many name brand companies like Netflix, Microsoft, eBay, and Uber have all built their experimentation platforms to use internally.
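Under the hood, the textbook way to compare versions A and B is a two-proportion z-test (commercial platforms such as Optimizely use more sophisticated sequential methods); a minimal sketch, with made-up conversion numbers, looks like this:

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical test: version A converts 120/2400, version B 156/2400
z = two_proportion_z(120, 2400, 156, 2400)
# |z| > 1.96 means the difference is significant at the 5% level
```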
Not perfect
No one toolbox is perfect, and everyone is going to be different. One piece of advice I can give is to always understand the problem before deciding which tool is best to solve the problem. You wouldn’t use a hammer to do a job where a drill would be more effective, right?
You could, it just wouldn’t be the most efficient method. The same concept goes for marketing. Understanding the problem will help you know which tools should be in your toolbox.
Data Science Dojo has launched one of the most in-demand data analytics software, Redash as a virtual machine offer on the Azure Marketplace.
Introduction
With the rising complexity of the data, organizations must have complete control over their data. Sometimes there is a hindrance for the analysts in the specific use cases. Especially when working internally with a dedicated team that requires unlimited access to information. A solution is needed to perform the data-driven tasks efficiently and extract actionable insights.
What is Redash?
Redash, a data analytics tool, helps organizations become more data-driven by providing tools to democratize data access. It simplifies the creation of dashboards and visualizations of your data by connecting to any data source.
Data analysis with Redash
As a business intelligence tool, it has more powerful integration capabilities than other data analytics platforms, making it a favorite among businesses that have implemented a variety of apps to manage their business processes. Similarly, reviewers found it more user-friendly, manageable, and business-friendly than other platforms.
It offers a user-friendly graphical user interface to carry out complex tasks with a few clicks.
Allows users to work with small as well as big data; it supports many SQL and NoSQL databases.
The Query Editor allows users to query the database by utilizing the Schema Browser and autocomplete features.
Users can utilize the drag-and-drop feature to build visualizations (like charts, boxplot, cohort, counter, etc.) and then merge them into a single dashboard.
Enables peer evaluation of reports and searches and makes it simple for users to share visualizations and the queries that go with them.
Allows charts and dashboards to be updated automatically at defined time intervals.
Redash with Azure Services
It leverages the power of Azure services to make integration with data sources quick. Write SQL queries to pull subsets of data for visualization, plot different charts, and share dashboards within the organization with greater ease.
Conclusion
Redash faces strong competition from other open-source business intelligence solutions. Deciding to invest in a business intelligence and data analysis tool can be challenging because all corporate departments, including product, finance, and marketing, now use multiple platforms to carry out day-to-day operations and analytics tasks to strengthen their control over data.
At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We, therefore, know the importance of data and the insights it encapsulates. Through this offer, we are confident that you can analyze, visualize, and query your data in a collaborative environment with greater ease. Install the Redash offer now from the Azure Marketplace by Data Science Dojo, your ideal companion on your journey to learn data science!
All of these written texts are unstructured; text mining algorithms and techniques work best on structured data.
Text analytics for machine learning: Part 1
Have you ever wondered how Siri can understand English? How can you type a question into Google and get what you want?
Over the next week, we will release a five-part blog series on text analytics that will give you a glimpse into the complexities and importance of text mining and natural language processing.
This first section discusses how text is converted to numerical data.
In the past, we have talked about how to build machine learning models on structured data sets. However, life does not always give us data that is clean and structured. Much of the information generated by humans has little or no formal structure: emails, tweets, blogs, reviews, status updates, surveys, legal documents, and so much more. There is a wealth of knowledge stored in these kinds of documents which data scientists and analysts want access to. “Text analytics” is the process by which you extract useful information from text.
All these written texts are unstructured; machine learning algorithms and techniques work best (or often, work only) on structured data. So, for our machine learning models to operate on these documents, we must convert the unstructured text into a structured matrix. Usually this is done by transforming each document into a sparse matrix (a big but mostly empty table). Each word gets its own column in the dataset, which tracks whether a word appears (binary) in the text OR how often the word appears (term-frequency). For example, consider the two statements below. They have been transformed into a simple term frequency matrix. Each word gets a distinct column, and the frequency of occurrence is tracked. If this were a binary matrix, there would only be ones and zeros instead of a count of the terms.
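A term-frequency matrix of this kind can be built in a few lines; this toy sketch uses two made-up documents:

```python
from collections import Counter

docs = ["the team won the game", "the team lost"]

# Vocabulary: one column per distinct word, sorted for a stable order
vocab = sorted({word for doc in docs for word in doc.split()})

# Term-frequency matrix: one row per document, one count per word
tf = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
# vocab -> ['game', 'lost', 'team', 'the', 'won']
# tf    -> [[1, 0, 1, 2, 1], [0, 1, 1, 1, 0]]
```

Replacing each count with `min(count, 1)` would give the binary variant described above.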
Make words usable for machine learning
Why do we want numbers instead of text? Most machine learning algorithms and data analysis techniques assume numerical data (or data that can be ranked or categorized). Similarity between documents is calculated by determining the distance between the frequency of words. For example, if the word “team” appears 4 times in one document and 5 times in a second document, they will be calculated as more similar than a third document where the word “team” only appears once.
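The "team" example can be made concrete with a distance measure over the count vectors; Euclidean distance is used here for simplicity, though cosine distance is also common:

```python
from math import sqrt

def euclidean(u, v):
    """Euclidean distance between two term-count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# One-word vocabulary ("team"): counts of 4, 5, and 1 in three documents
doc1, doc2, doc3 = [4], [5], [1]
assert euclidean(doc1, doc2) < euclidean(doc1, doc3)  # doc1 is closer to doc2
```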
Text mining: Build a matrix
While our example was simple (6 words), term frequency matrices on larger datasets can be tricky.
Imagine turning every word in the Oxford English dictionary into a matrix, that’s 171,476 columns. Now imagine adding everyone’s names, every corporation or product or street name that ever existed. Now feed it slang. Feed it every rap song. Feed it fantasy novels like Lord of the Rings or Harry Potter so that our model will know what to do when it encounters “The Shire” or “Hogwarts.” Good, now that’s just English. Do the same thing again for Russian, Mandarin, and every other language.
After this is accomplished, we are approaching a matrix with several billion columns, and two problems arise. First, it becomes computationally infeasible and memory-intensive to perform calculations over this matrix. Second, the curse of dimensionality kicks in, and distance measurements become so absurdly large in scale that they all seem the same. Much of the research and time that goes into natural language processing is less about the syntax of language (which is important) and more about how to reduce the size of this matrix.
Now we know what we must do and the challenges we must face to reach our desired result. The next three blogs in the series will address these problems directly. We will introduce you to three concepts: conforming, stemming, and stop word removal.
Want to learn more about text mining and text analytics?
Check out our short video on our data science bootcamp curriculum page OR watch our video on tweet sentiment analysis.
Develop an understanding of text analytics, text conforming, and special character cleaning. Learn how to make text machine-readable.
Text analytics for machine learning: Part 2
Last week, in part 1 of our text analytics series, we talked about text processing for machine learning. We wrote about how we must transform text into a numeric table, called a term frequency matrix, so that our machine learning algorithms can apply mathematical computations to the text. However, we found that our textual data requires some data cleaning.
In this blog, we will cover the text conforming and special character cleaning parts of text analytics.
Understand how computers read text
The computer sees text differently from humans. Computers cannot see anything other than numbers: every character (letter) that we see on a computer is actually a numeric representation, with the mapping between numbers and characters determined by an “encoding table.” The simplest, and most common, encoding in text analytics is ASCII. A small sample ASCII table is shown to the right.
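In Python, for example, the built-in `ord` function exposes the numeric code point behind each character, which is one way to see this mapping in action:

```python
# Each character is stored as a number; ord() reveals its code point.
for ch in "CAFE":
    print(ch, ord(ch))  # C 67, A 65, F 70, E 69

# "É" falls outside basic ASCII, so it gets a higher (Unicode) code point.
print("É", ord("É"))  # É 201
```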
To the left is a look at six different ways the word “CAFÉ” might be encoded in ASCII. The word on the left is what the human sees and its ASCII representation (what the computer sees) is on the right.
Any human would know that these are just six different spellings of the same word, but to a computer they are six different words. They would spawn six different columns in our term-frequency matrix, bloating our already enormous matrix and complicating or even preventing useful analysis.
Unify words with the same spelling
To unify the six different “CAFÉ”s, we can perform two simple global transformations.
Casing: First, we must convert all characters to the same casing, uppercase or lowercase. This is a common operation; most programming languages have a built-in function that converts all characters in a string to either lowercase or uppercase. We can choose either global lowercasing or global uppercasing; it does not matter which, as long as it is applied globally.
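In Python, for instance, the built-in string method handles this in one call:

```python
variants = ["CAFÉ", "Café", "café"]

# Global lowercasing collapses all casing variants to one form.
lowered = [v.lower() for v in variants]
print(lowered)  # ['café', 'café', 'café']
```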
String normalization: Second, we must convert all accented characters to their unaccented variants. This is often called Unicode normalization, since accented and other special characters are usually encoded using the Unicode standard rather than the ASCII standard. Not all programming languages have this feature out of the box, but most have at least one package which will perform this function.
Note that implementations vary, so you should not mix and match Unicode normalization packages. What kind of normalization you do is highly language dependent, as characters which are interchangeable in English may not be in other languages (such as Italian, French, or Vietnamese).
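In Python, the standard library’s `unicodedata` module can perform this transformation. A minimal sketch of accent stripping via NFKD decomposition (suitable for English text, per the caveat above):

```python
import unicodedata

def strip_accents(text):
    # NFKD splits "é" into "e" plus a combining accent mark,
    # which we can then filter out.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café"))  # cafe
print(strip_accents("CAFÉ"))  # CAFE
```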
Remove special characters and numbers
The next thing we have to do is remove special characters and numbers. Numbers rarely contain useful meaning; examples of such irrelevant numbers include footnote numbering and page numbering. Special characters, as discussed in the string normalization section, have a habit of bloating our term-frequency matrix. Representing a quotation mark, for instance, has been a pain point since the beginning of computer science.
Unlike a letter, which is either capital or not capital, quotation marks have many popular representations. A quotation character has three main properties: curly, straight, or angled; left or right; single, double, or triple. Depending on the encoding used, not all of these may exist.
The table below shows how quoting the word “café” in both straight quote and left-right quotes would look in a UTF-8 table in Arial font.
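A common, if blunt, implementation of this cleaning step is a regular expression that keeps only letters and whitespace. This sketch also shows the danger: the accented “é” is an illustration of exactly the kind of information that can be lost.

```python
import re

text = 'See page 12: the "café" was great!'
# Keep only ASCII letters and whitespace; digits and punctuation are dropped.
cleaned = re.sub(r"[^A-Za-z\s]", "", text)
print(cleaned)  # note "café" has become "caf"
```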
Avoid over-cleaning
The problem is further complicated by each individual font, operating system, and programming language since implementation of the various encoding standards is not always consistent. A common solution is to simply remove all special characters and numeric digits from the text. However, removing all special characters and numbers can have negative consequences.
There is such a thing as too much data cleaning when it comes to text analytics. The more we clean and remove, the more “lost in translation” the textual message may become. We may inadvertently strip information or meaning from our messages, so that by the time our machine learning algorithm sees the textual data, much or all of the relevant information has been stripped away.
For each type of cleaning above, there are situations in which you will want to either skip it altogether or selectively apply it. As in all data science situations, experimentation and good domain knowledge are required to achieve the best results.
When should we avoid over-cleaning in text analytics?
Special characters: The advent of email, social media, and text messaging has given rise to text-based emoticons represented by ASCII special characters.
For example, if you were building a sentiment predictor for text, text-based emoticons like “=)” or “>:(” are very indicative of sentiment because they directly signal happiness or sadness. Stripping our messages of these emoticons by removing special characters also strips meaning from our message.
Numbers: Consider the infinitely gridlocked freeway in Washington state, “I-405.” In a sentiment predictor model, anytime someone talks about “I-405,” more likely than not the document should be classified as “negative.” However, by removing numbers and special characters, the word now becomes “I”. Our models will be unable to use this information, which, based on domain knowledge, we would expect to be a strong predictor.
Casing: Even casing can sometimes carry useful information. For instance, the word “trump” may carry a different sentiment than “Trump” with a capital T, which represents someone’s last name.
One way to filter out proper nouns that may contain information is named entity recognition, which uses a combination of predefined dictionaries and scanning of the surrounding syntax (sometimes called “lexical analysis”). Using this, we can identify people, organizations, and locations.
Next, we’ll talk about stemming and lemmatization as ways to help computers understand that different versions of words can have the same meaning (e.g., run, running, runs).
Learn more
Want to learn more about text analytics? Check out the short video on our curriculum page OR