
Essential types of data analysis methods and processes for business success
Hudaiba Soomro
| January 17, 2023

An overview of data analysis, the data analysis process, its various methods, and implications for modern corporations. 

 

Studies show that 73% of corporate executives believe that companies failing to use data analysis on big data lack long-term sustainability. While data analysis can guide enterprises to make smart decisions, it can also be useful for individual decision-making.

Let’s consider an example of using data analysis at an intuitive, individual level. As consumers, we are always choosing between products offered by multiple companies. These decisions, in turn, are guided by individual past experiences. Every individual analyzes the data obtained via their experience to reach a final decision.

Put more concretely, data analysis involves sifting through data, modeling it, and transforming it to yield information that guides strategic decision-making. For businesses, data analytics can drive highly impactful decisions with long-term yield.

 

Data analysis methods and data analysis processes – Data Science Dojo

 

 So, let’s dive deep and look at how data analytics tools can help businesses make smarter decisions. 

 The data analysis process 

The process includes five key steps:  

1. Identify the need

Companies use data analytics for strategic decision-making regarding a specific issue. The first step, therefore, is to identify the particular problem. For example, suppose a company decides it wants to reduce its production costs while maintaining product quality. To do so effectively, the company would need to identify the step(s) of the workflow pipeline where it should implement cost cuts.

Similarly, the company might also have a hypothetical solution to its question. Data analytics can be used to test whether that hypothesis holds, allowing the decision-maker to reach an optimized solution.

A specific question or hypothesis determines the subsequent steps of the process. Hence, this must be as clear and specific as possible. 

 

2. Collect the data 

Once the data analysis need is identified, the kind of data required is also determined. Data collection can involve data in different types and formats. One broad classification is based on structure and includes structured and unstructured data.

Structured data, for example, is the data a company obtains from its users via internal data acquisition methods such as marketing automation tools. It follows the usual row-column database format and is suited to the company’s exact needs.

Unstructured data, on the other hand, need not follow any such formatting. It is obtained via third parties such as Google Trends, census bureaus, world health organizations, and so on. Structured data is easier to work with as it’s already tailored to the company’s needs. However, unstructured data can provide a significantly larger data volume.

There are many other data types to consider as well, such as metadata, big data, real-time data, and machine data.

 

3. Clean the data 

The third step, data cleaning, ensures that error-free data is used for the analysis. This step includes procedures such as formatting data correctly and consistently, removing duplicate or anomalous entries, dealing with missing data, and fixing cross-set data errors.

Performing these tasks manually is tedious, so various tools exist to streamline the data cleaning process. These include open-source data tools such as OpenRefine, desktop applications like Trifacta Wrangler, cloud-based software-as-a-service (SaaS) offerings like TIBCO Clarity, and other data management tools such as IBM InfoSphere QualityStage, which is especially used for big data.
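As an illustration of these cleaning steps (consistent formatting, de-duplication, and missing values), here is a minimal pandas sketch on a made-up customer table; the column names and the median-imputation choice are illustrative, not tied to any of the tools above:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with typical quality problems:
# inconsistent formatting, duplicates, and a missing value
raw = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", "Cara", "Cara"],
    "spend": [120.0, 120.0, np.nan, 95.5, 95.5],
})

# Format consistently: trim whitespace and normalize capitalization
raw["customer"] = raw["customer"].str.strip().str.title()

# Remove duplicates revealed by the consistent formatting
clean = raw.drop_duplicates().copy()

# Deal with missing data: here, impute the column median
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

print(clean)
```

Note that de-duplication only works after formatting is normalized, which is why the ordering of the steps matters.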

 

4. Perform data analysis 

Data analysis includes several methods as described earlier. The method to be implemented depends closely on the research question to be investigated. Data analysis methods are discussed in detail later in this blog. 

 

5. Present the results 

Presentation of results defines how well the results are to be communicated. Visualization tools such as charts, images, and graphs effectively convey findings, establishing visual connections in the viewer’s mind. These tools emphasize patterns discovered in existing data and shed light on predicted patterns, assisting the results’ interpretation. 

 


 

Methods for data analysis 

Data analysts use a variety of approaches, methods, and tools to deal with data. Let’s sift through these methods from an approach-based perspective: 

 

1. Descriptive analysis 

Descriptive analysis involves categorizing and presenting broader datasets in a way that allows emergent patterns to be observed. Data aggregation techniques are one way of performing descriptive analysis: first collecting the data and then sorting it to ease manageability.

This can also involve performing statistical analysis on the data to determine, say, measures of frequency, dispersion, and central tendency that provide a mathematical description of the data.
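As a small illustration of those descriptive measures, the pandas sketch below computes them on a made-up series of daily sales; the data and variable names are invented for the example:

```python
import pandas as pd

# Hypothetical daily sales counts
sales = pd.Series([12, 15, 15, 18, 22, 15, 30])

# Measures of central tendency
central = {"mean": sales.mean(), "median": sales.median(), "mode": sales.mode()[0]}

# Measures of dispersion
dispersion = {"std": sales.std(), "range": sales.max() - sales.min()}

# Measures of frequency: how often each value occurs
frequency = sales.value_counts()

print(central)
print(dispersion)
print(frequency.to_dict())
```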
 

2. Exploratory analysis 

Exploratory analysis involves consulting various data sets to see how certain variables may be related, or how certain patterns may be driving others. This analytic approach is crucial in framing potential hypotheses and research questions that can be investigated using data analytic techniques.  

Data mining, for example, requires data analysts to use exploratory analysis to sift through big data and generate hypotheses to be tested out. 

 

3. Diagnostic analysis 

Diagnostic analysis is used to answer why a particular pattern exists in the first place. For example, this kind of analysis can assist a company in understanding why its product is performing in a certain way in the market. 

Diagnostic analytics includes methods such as hypothesis testing, distinguishing correlation from causation, and diagnostic regression analysis.
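To illustrate the correlation-versus-causation point, here is a sketch using SciPy's Pearson correlation on synthetic ad-spend and sales figures; the variables and effect size are invented, and a strong correlation here still would not by itself establish that ad spend causes the sales:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical weekly figures: sales constructed to depend on ad spend
ad_spend = rng.normal(100, 10, size=30)
sales = 2.0 * ad_spend + rng.normal(0, 5, size=30)

# Pearson correlation quantifies association, not causation
r, p_value = stats.pearsonr(ad_spend, sales)
print(f"r = {r:.2f}, p = {p_value:.3g}")
```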

 

4. Predictive analysis 

Predictive analysis answers the question of what will happen. This type of analysis is key for companies in deciding new features or updates on existing products, and in determining what products will perform well in the market.  

For predictive analysis, data analysts use existing results from the analyses described earlier, along with results from machine learning and artificial intelligence, to generate precise predictions of future performance.

 

5. Prescriptive analysis 

Prescriptive analysis involves determining the most effective strategy for implementing the decision arrived at. For example, an organization can use prescriptive analysis to determine the best way to roll out a new feature. This component of data analytics actively deals with the consumer end, requiring one to work with marketing, human resources, and so on.

Prescriptive analysis makes use of machine learning algorithms to analyze large amounts of big data for business intelligence. These algorithms are able to assess large amounts of data by working through them via “if” and “else” statements and making recommendations accordingly.

 

6. Quantitative and qualitative analysis 

Quantitative analysis computationally implements algorithms that test a mathematical fit describing correlation or causation observed within datasets. This includes regression analysis, null hypothesis testing, and so on.
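As a minimal example of fitting such a mathematical model, the sketch below runs a least-squares linear regression with NumPy on a handful of made-up observations that roughly follow y = 3x + 2:

```python
import numpy as np

# Hypothetical observations roughly following y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 11.0, 14.1])

# Least-squares regression: find the straight line that best fits the data
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The fitted slope and intercept land close to the true generating values, which is exactly the "mathematical fit" the quantitative approach is after.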

Qualitative analysis, on the other hand, involves non-numerical data such as interviews and pertains to answering broader social questions. It involves working closely with textual data to derive explanations.  

 

7. Statistical analysis 

Statistical techniques provide answers to essential decision challenges. For example, they can accurately quantify risk probabilities, predict product performance, establish relationships between variables, and so on. These techniques are used by both qualitative and quantitative analysis methods. Some of the invaluable statistical techniques for data analysts include linear regression, classification, resampling methods, and subset selection.

Statistical analysis, more importantly, lies at the heart of data analysis, providing the essential mathematical framework via which analysis is conducted. 

 

Data-driven businesses 

Data-driven businesses use the data analysis methods described above. As a result, they offer many advantages and are particularly suited to modern needs. Their credibility relies on being evidence-based and using precise mathematical models to inform decisions. These advantages include a stronger grasp of customer needs, precise identification of business needs, effective strategic decisions, and strong performance in a competitive market. Data-driven businesses are the way forward.

Hazel Jones
| October 11, 2022

Data analysis and data science are very closely related professions in many respects. If one enjoys problem-solving, data-driven decision-making, and critical thinking, both occupations are a good fit. While both draw on the same core skill set and strive toward comparable goals, there are differences in schooling, skills, daily responsibilities, and compensation ranges.

 

The data science certification course offers insight into the tools, technology, and trends driving the data science revolution. We have developed this guide to walk you through the abilities and background required to become a data scientist or a data analyst, and the corresponding course fees.

 

Data Scientist vs. Data Analyst

Data analysis and data science are often misunderstood since they rely on the same fundamental skills, not to mention the very same broad educational foundation (e.g., advanced mathematics, and statistical analysis). 

However, the day-to-day responsibilities of each role are vastly different. The difference, in its most basic form, is how they utilize the data they collect.

Key differences between a data analyst and a data scientist

Role of a Data Analyst

A data analyst examines gathered information, organizes it, and cleans it to make it clear and helpful. Based on the data acquired, they make recommendations and judgments. They are part of a team that converts raw data into knowledge that can assist organizations in making sound choices and investments.

 

Role of a Data Scientist

A data scientist creates the tools that will be used by an analyst. They write programs, algorithms, and data-gathering technologies. Data scientists are innovative problem solvers who are constantly thinking of new methods to acquire, store, and view data.

 

Differences in the role of data scientist and data analyst

Job roles of data analyst and data scientist

 

While both data analysts and data scientists deal with data, the primary distinction is what they do with it. Data analysts evaluate big data sets for insights, generate infographics, and create visualizations to assist corporations in making better strategic choices. Data scientists, on the other hand, use models, methods, predictive analytics, and specialized analyses to design and build new approaches to data modeling and production.

 

Data analysts and data scientists typically have comparable academic qualifications. Most have Bachelor’s degrees in economics, statistics, computer programming, or machine intelligence. They have in-depth knowledge of data, marketing, communication, and algorithms. They can work with advanced systems, databases, and programming environments.

 

What is data analysis?

Data analysis is the thorough examination of data to uncover trends that can be turned into meaningful information. When formatted and analyzed correctly, previously meaningless data can become a wealth of useful and valuable information that firms in various industries can use.

 

Data analysis, for example, can tell a technical store what product is most successful at what period and with which population, which can then help employees decide what kind of incentives to run. Data analysis may also assist social media companies in determining when, what, and how they should promote particular users to optimize clicks.

 

What is data science?

Data science and data analysis both aim to unearth significant insights within piles of complicated or seemingly minor information. Rather than performing the actual analytics, data science frequently aims at developing the models and implementing the techniques that will be used during the process of data analysis.

 

While data analysis seeks to reveal insights from previous data to influence future actions, data science seeks to anticipate the result of future decisions. Artificial image processing and pattern recognition, which are still in their early stages, are used to create predictions based on large amounts of historical data.

 

Responsibilities: Data Scientist vs Data Analyst

Professionals in data science and data analysis must be familiar with managing data, information systems, statistics, and data analysis. They must alter and organize data for relevant stakeholders to find it useful and comprehensible. They also assess how effectively firms perform on predefined metrics, uncover trends, and explain the differentiated strategy. While job responsibilities frequently overlap, there are contrasts between data scientists and data analysts, and the methods they utilize to attain these goals.

 

| Data Analyst | Data Scientist |
| --- | --- |
| Data analysts are expert interpreters. They use massive amounts of information to comprehend what is going on in the industry and how corporate actions affect how customers perceive and engage with the company. They are motivated by the need to understand people’s perspectives and behaviors through data analysis. | Data scientists build the framework for capturing data and better understanding the narrative it conveys about the industry, enterprise, and decisions taken. They are designers who can create a system that can handle the volume of data required while also making it valuable for understanding patterns and advising the management team. |
| Everyday tasks may involve examining both historical and current patterns and trends. | Typically responsible for data scrubbing and information retrieval. |
| Creating operational and financial reports. | Statistical analysis of collected data. |
| Forecasting in tools such as Excel. | Deep learning framework training and development. |
| Designing infographics. | Creating architecture that can manage large amounts of data. |
| Data interpretation and clear communication. | Developing automation that streamlines daily data gathering and processing chores. |
| Data screening by analyzing documents and fixing data corruption. | Presenting insights to the executive team and assisting with data-driven decision making. |
| | Using predictive modeling to discover and influence future trends. |

 

Role: Data Scientist vs Data Analyst

Data Analyst job description

A data analyst, unsurprisingly, analyzes data. This entails gathering information from various sources and processing it via data manipulation and statistical techniques. These procedures organize and extract insights from data, which are subsequently given to individuals who may act on them.


Users and decision-makers frequently ask data analysts to discover answers to their inquiries. This entails gathering and comparing pertinent facts and stitching them together to form a larger picture.

 

Data Scientist job description

A data scientist can have various tasks inside a corporation, among which are very comparable to those of a data analyst, such as gathering, processing, and analyzing data to get meaningful information. 

 

Whereas a data analyst is likely to have been given particular questions to answer, a data scientist may evaluate the same collection of data in search of different variables that may lead to a new line of inquiry. In other words, a data scientist must identify both the appropriate questions and the proper answers.

 

A data scientist will design and write algorithms and software to assist themselves as well as their research analyst team members with the analysis of data. A data scientist is also deeply engaged in the field of artificial intelligence and tries to push its limits and develop new methods to apply this technology in a corporate context.

 

How can Data Scientists become ethical hackers?

Yes, you heard it right: data scientists can definitely become ethical hackers. Data scientists possess several skills that can help them make a smooth transition to ethical hacking, including extensive knowledge of programming languages, databases, and operating systems. Data science is also an important tool for preventing hacking.

 

The necessary skills for a data scientist to become an ethical hacker include mathematical and statistical expertise, and extensive hacking skills. With the rise of cybercrimes, the need for cyber security is increasing. When data scientists become ethical hackers, they can protect an organization’s data and prevent cyber-attacks. 

 

Skill set required for data analysis and data science

 

Data analysis

Qualification: A Bachelor’s or Master’s degree in a related discipline, such as mathematics or statistics.

Language skills: Proficiency in languages used for data analysis, such as Python, SQL, CQL, and R.

Soft skills:

  • Written and verbal communication skills
  • Exceptional analytical skills
  • Organizational skills
  • The ability to manage many projects at the same time may be required

Technical skills: Expertise in data gathering and some of the most recent data analytics technology.

Microsoft Office proficiency: Proficient in Microsoft Office applications, notably Excel, to properly explain findings and translate them for others to grasp.

Data science

Qualification: An advanced degree, such as a master’s degree or possibly a Ph.D., in a relevant discipline, such as statistics, computer science, or mathematics.

Language skills: Demonstrated proficiency in data-related programming languages such as SQL, R, Java, and Python.

Soft skills:

  • Substantial experience with data mining
  • Specialized statistical activities and tools
  • Generating generalized linear model regressions, statistical tests, designing data structures, and text mining

Technical skills:

  • Experience with data sources and web services such as Spark, Hadoop, DigitalOcean, and S3
  • Trained to use information obtained from third-party suppliers such as Google Analytics, Crimson Hexagon, Coremetrics, and Site Catalyst

Knowledge of statistical techniques and technology: Data processing technologies such as MySQL and Gurobi, as well as machine learning models, deep learning, artificial intelligence, artificial neural networks, and decision tree learning, will play a significant role.

 

Conclusion 

Each career is a good fit for an individual who enjoys statistics, analytics, and evaluating business decisions. As a data analyst or data scientist, you will make logical sense of large amounts of data, articulate patterns and trends, and take on significant responsibilities in a corporate or government organization.

When picking between a data analytics and a data science profession, evaluate your career aspirations, skills, and how much time you want to devote to higher learning and intensive training. Start your data analyst or data scientist journey with a data science course with a nominal course fee to learn in-demand skills used in realistic, long-term projects, strengthening your resume and commercial viability.

 

FAQs

 

  1. Which is better: Data science or data analyst?

Data science is suitable for candidates who want to develop advanced machine learning models and make human tasks easier. On the other hand, the data analyst role is appropriate for candidates who want to begin their career in data analysis. 

 

  2. What is the career path for data analytics and data science?

Most data analysts will begin their careers as junior members of a bigger data analysis team, where they will learn the fundamentals of the work in a hands-on environment and gain valuable experience in data manipulation. At senior level, data analysts become team leaders, in control of project selection and allocation.

A junior data scientist will most likely obtain a post with a focus on data manipulation before delving into the depths of learning algorithms and mapping out forecasts. The procedure of preparing data for analysis varies so much from case to case that it’s far simpler to learn by doing. 

Once conversant with the mechanics of data analysis, data scientists might expand their understanding of artificial intelligence and its applications by designing algorithms and tools. A more experienced data scientist may pursue team lead or management positions, distributing projects and collaborating closely with users and decision-makers. Alternatively, they could use their seniority to tackle the most difficult and valuable problems using their specialist expertise in patterns and machine learning.

 

  3. What is the salary for a data scientist and a data analyst in India?

With 2 to 4 years of experience, a senior data analyst earns an average of $98,682, whereas the average data scientist salary is $100,560, according to the U.S. Bureau of Labor Statistics.

 


Ayesha Saleem
| October 1, 2022

To perform a systematic study of data, we use the data science life cycle, which provides testable methods for making predictions.

Before you apply science to data, you must be aware of the important steps. A data science life cycle will help you get a clear understanding of the end-to-end actions of a data scientist. It provides us with a framework to fulfill business requirements using data science tools and technologies. 

Follow these steps to accomplish your data science life cycle

In this blog, we will study the iterative steps used to develop, deliver, and maintain any data science product.  

6 steps of data science life cycle – Data Science Dojo

1. Problem identification 

Let us say you are going to work on a project in the healthcare industry. Your team has identified that there is a problem of patient data management in this industry, and this is affecting the quality of healthcare services provided to patients. 

Before you start your data science project, you need to identify the problem and its effects on patients. You can do this by conducting research on various sources, including: 

  • Online forums 
  • Social media (Twitter and Facebook) 
  • Company websites 

 

Understanding the aim of the analysis before extracting data is mandatory. It sets the direction for using data science on the specific task. For instance, you need to know whether the customer wants to minimize savings loss or prefers to predict the rate of a commodity.

To be precise, this step involves the following actions:

  • Clearly state the problem to be solved 
  • Reason to solve the problem 
  • State the potential value of the project to motivate everyone 
  • Identify the stakeholders and risks associated with the project 
  • Perform high-level research with your data science team 
  • Determine and communicate the project plan 

Pro-tip: Enroll yourself in Data Science boot camp and become a Data Scientist today

2. Data investigation 

To complete this step, you need to dive into the enterprise’s data collection methods and data repositories. It is important to gather all the relevant and required data to maintain the quality of research. Data scientists contact the enterprise group to understand the available data.

In this step, we: 

  • Describe the data 
  • Define its structure 
  • Figure out the relevance of the data, and
  • Assess the type of data record 
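The steps above can be sketched minimally with pandas; the table and column names below are made up for illustration:

```python
import pandas as pd

# A hypothetical extract from an enterprise data repository
df = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "region": ["EU", "US", "EU"],
    "amount": [250.0, 99.9, 180.5],
})

# Define the structure: column names and record types
structure = df.dtypes.astype(str).to_dict()

# Describe the data: summary statistics of the numeric columns
summary = df.describe()

print(structure)
print(summary)
```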

 

Here you need to intently explore the data to find any available information related to the problem, because the historical data present in the archive contributes to a better understanding of the business.

In any business, data collection is a continual process. At various steps, information on key stakeholders is recorded in various software systems. To study that data and successfully conduct a data science project, it is important to understand the process followed from product development to deployment and delivery.

Data scientists also use many statistical methods to extract critical data and derive meaningful insights from it.

3. Pre-processing of data 

Organizing the scattered data of any business is a pre-requisite to data exploration. First, we gather data from multiple sources in various formats, then convert the data into a unified format for smooth data processing.  

All the data processing happens in a data warehouse, in which data scientists together extract, transform and load (ETL) the data. Once the data is collected, and the ETL process is completed, data science operations are carried out.  

It is important to realize the role of the ETL process in every data science project. A data architect also contributes widely at the pre-processing stage, as they decide the structure of the data warehouse and perform the steps of the ETL operations.

The actions to be performed at this stage of a data science project are: 

  • Selection of the applicable data
  • Data integration by merging the data sets
  • Data cleaning and filtration of relevant information
  • Treating missing values by either eliminating or imputing them
  • Treating inaccurate data by eliminating it
  • Additionally, testing for outliers using box plots and handling them
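As a sketch of the outlier check in the last step, the snippet below applies the 1.5 × IQR box-plot rule with pandas to a made-up series containing one obvious outlier:

```python
import pandas as pd

# Hypothetical measurements containing one obvious outlier (95)
values = pd.Series([10, 12, 11, 13, 12, 95, 11])

# The box-plot rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
inliers = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]

print(inliers.tolist())
```

Whether flagged points are removed, capped, or imputed is a project-specific judgment call, as with the other treatments listed above.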

 

This step also emphasizes the importance of elements essential to constructing new data. Often, we mistakenly start data research for a project from scratch. However, data pre-processing suggests constructing new data by refining the existing information and eliminating undesirable columns and features.

Data preparation is the most time-consuming but most essential step in the complete life cycle. Your model will only be as accurate as your data.

4. Exploratory data analysis  

Applause to us! We now have the data ready to work on. At this stage, make sure that you have the data in the required format. Data analysis is carried out using various statistical tools, and the support of a data engineer is crucial. They perform the following steps to conduct the exploratory data analysis:

  • Examine the data by formulating the various statistical functions  
  • Identify dependent and independent variables or features 
  • Analyze key features of data to work on 
  • Define the spread of data 

 

Moreover, for thorough data analysis, various plots are used to visualize the data for everyone’s better understanding. Data scientists explore the distribution of data within individual variables graphically using bar graphs. Relations between distinct features are captured via graphical representations like scatter plots and heat maps.

Tools like Tableau, Power BI, and so on are well known for performing exploratory data analysis and visualization. Knowledge of data science with Python and R is significant for performing EDA on a dataset.
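While the plots themselves would be drawn in tools like Tableau, Power BI, or matplotlib, the quantities behind a heat map can be computed directly. The sketch below builds a pairwise correlation matrix with pandas on synthetic price/demand data; all columns are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical dataset: demand falls as price rises, plus an unrelated column
price = rng.uniform(10, 50, size=100)
df = pd.DataFrame({
    "price": price,
    "demand": 500 - 8 * price + rng.normal(0, 20, size=100),
    "noise": rng.normal(0, 1, size=100),
})

# The numbers behind a heat map: the pairwise correlation matrix
corr = df.corr()
print(corr.round(2))
```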

5. Data modeling 

Data modeling refers to the process of converting raw data into a form that can be carried over into other applications as well. Mostly, this step is performed in spreadsheets, but data scientists also prefer to use statistical tools and databases for data modeling.

The following elements are required for data modeling: 

 

Data dictionary: A list of all the properties describing your data that you want to maintain in your system, for example, spreadsheet, database, or statistical software. 

 

Entity relationship diagram: This diagram shows the relationship between entities in your data model. It shows how each element is related to the others, as well as any constraints on that relationship.

 

Data model: A set of classes representing each piece of information in your system, along with its attributes and relationships with other objects in the system.  

 

The machine learning engineer applies different algorithms to the data and delivers the result. While modeling the data, often multiple times, the models are first tried on mock data resembling the genuine data.
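A minimal sketch of this try-it-on-held-out-data workflow, using NumPy on synthetic data with a known linear relationship (the data, split, and model choice are illustrative, not a prescription):

```python
import numpy as np

rng = np.random.default_rng(1)

# Mock data resembling the genuine data: y = 4x - 1 plus noise
x = rng.uniform(0, 10, size=200)
y = 4 * x - 1 + rng.normal(0, 0.5, size=200)

# Hold out a test portion before fitting, then check the model on it
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

slope, intercept = np.polyfit(x_train, y_train, deg=1)
rmse = float(np.sqrt(np.mean((slope * x_test + intercept - y_test) ** 2)))

print(f"slope={slope:.2f}, intercept={intercept:.2f}, test RMSE={rmse:.2f}")
```

Evaluating on data the model never saw during fitting is what makes the reported error an honest estimate.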

6. Model evaluation/ Monitoring 

Before we learn what model evaluation is all about, we need to know that it can be done in parallel with the other stages of the data science life cycle. It helps you to know at every step whether your model is working as intended or whether you need to make changes, and to eradicate errors at an early stage to avoid false predictions at the end of the project.

If you fail to acquire a quality result in the evaluation, the complete modeling procedure must be reiterated until the preferred level of metrics is achieved.

As we assess the model toward the end of the project, the data might change, and the result will change contingent upon those changes. Thus, while assessing the model, the following two stages are significant:

 

  • Data drift analysis: 

Data drift refers to changes in the input data over time; examination of this change is called data drift analysis. The accuracy of the model relies heavily on how well it handles this drift. The changes in data are significantly a result of changes in the statistical properties of the data.

 

  • Model drift analysis:

We use drift detection machine learning techniques to find the change. Additionally, more complex techniques like Adaptive Windowing and Page-Hinkley are available for use. Model drift analysis is significant because, as we know, change is fast. Incremental learning can also be used, where the model is exposed to added data gradually.
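A minimal sketch of detecting such a shift in statistical properties, using a two-sample Kolmogorov-Smirnov test from SciPy on made-up reference and production samples (the shift size is illustrative; streaming methods like Adaptive Windowing or Page-Hinkley would be used in production settings):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Reference data from training time vs. hypothetical new production data
reference = rng.normal(loc=0.0, scale=1.0, size=500)
incoming = rng.normal(loc=0.8, scale=1.0, size=500)  # mean has shifted

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests drift
stat, p_value = stats.ks_2samp(reference, incoming)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```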

Start your data science project today

The data science life cycle is a collection of individual steps that need to be taken to prepare for and execute a data science project. The steps include identifying the project goals, gathering relevant data, analyzing it using appropriate tools and techniques, and presenting the results in a meaningful way. It is not an effortless process, but with some planning and preparation, you can make it much easier on yourself.

Pier Lorenzo Paracchini
| October 31, 2017

This blog is based on some exploratory data analysis performed on the corpora provided for the “Spooky Author Identification” challenge at Kaggle.

The Spooky challenge

A Halloween-based challenge [1] with the following data analysis goal: predict which of Edgar Allan Poe, HP Lovecraft, and Mary Wollstonecraft Shelley wrote a given sentence of a possible spooky story.

“Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before.” Edgar Allan Poe

“That is not dead which can eternal lie, And with strange eons, even death may die.” HP Lovecraft

“Life and death appeared to me ideal bounds, which I should first break through, and pour a torrent of light into our dark world.” Mary Wollstonecraft Shelley

The toolset for data analysis

The only tools available to us during this exploration will be our intuition, curiosity, and the selected packages for data analysis. Specifically:

  • tidytext package, text mining for word processing, and sentiment analysis using tidy tools
  • tidyverse package, an opinionated collection of R packages designed for data science
  • wordcloud package, pretty word clouds
  • gridExtra package, supporting functions to work with grid graphics
  • caret package, supporting function for performing stratified random sampling
  • corrplot package, a graphical display of a correlation matrix, confidence interval
# Required libraries
# if packages are not installed:
# install.packages("packageName")
library(tidytext)
library(tidyverse)
library(gridExtra)
library(wordcloud)
library(dplyr)
library(corrplot)

The beginning of the data analysis journey: The Spooky data

We are given a CSV file, the train.csv, containing some information about the authors. The information consists of a set of sentences written by different authors (EAP, HPL, MWS). Each entry (line) in the file is an observation providing the following information:

  • id, a unique id for the excerpt/sentence (as a string)
  • text, the excerpt/sentence (as a string)
  • author, the author of the excerpt/sentence (as a string) – a categorical feature that can assume three possible values: EAP for Edgar Allan Poe, HPL for HP Lovecraft, and MWS for Mary Wollstonecraft Shelley

# loading the data using the readr package
spooky_data <- readr::read_csv(file = "./../../../data/train.csv",
                               col_types = "ccc",
                               locale = locale("en"),
                               na = c("", "NA"))

# readr::read_csv does not transform string into factor;
# as the "author" feature is categorical by nature,
# it is transformed into a factor
spooky_data$author <- as.factor(spooky_data$author)

The overall data includes 19579 observations with 3 features (id, text, author): specifically, 7900 excerpts (40.35 %) by Edgar Allan Poe, 5635 excerpts (28.78 %) by HP Lovecraft, and 6044 excerpts (30.87 %) by Mary Wollstonecraft Shelley.
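The counts and percentages above can be reproduced in one step once the data is loaded; a base-R sketch (using a tiny stand-in vector here, since train.csv is not bundled with this post – with the real data you would pass spooky_data$author):

```r
# Sketch: per-author excerpt counts and percentages. `authors` is a tiny
# stand-in vector; substitute spooky_data$author to get the real figures.
authors <- factor(c("EAP", "EAP", "HPL", "MWS", "MWS"))
counts  <- table(authors)                          # excerpts per author
percs   <- round(100 * prop.table(counts), digits = 2)  # share in percent
counts
percs
```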


Avoid the madness!

It is forbidden to use all of the provided spooky data for finding our way through the unique spookiness of each author.

We still want to evaluate how our intuition generalizes on an unseen excerpt/sentence, right?

For this reason, the given training data is split into two parts (using stratified random sampling)

  • an actual training dataset (70% of the excerpts/sentences), used for
    • exploration and insight creation, and
    • training the classification model
  • test dataset (the remaining 30% of the excerpts/sentences), used for
    • evaluation of the accuracy of our model.
# setting the seed for reproducibility
set.seed(19711004)

trainIndex <- caret::createDataPartition(spooky_data$author, p = 0.7, list = FALSE, times = 1)
spooky_training <- spooky_data[trainIndex,]
spooky_testing <- spooky_data[-trainIndex,]

The training dataset specifically contains 5530 excerpts (40.35 %) by Edgar Allan Poe, 3945 excerpts (28.78 %) by HP Lovecraft, and 4231 excerpts (30.87 %) by Mary Wollstonecraft Shelley.
Moving our first steps: from darkness into the light
Before we start building any model, we need to understand the data, build intuitions about the information contained in the data, and identify a way to use those intuitions to build a great predicting model.

Is the provided data usable?
Question: Does each observation have an id? An excerpt/sentence associated with it? An author?

missingValueSummary <- colSums(is.na(spooky_training))

As we can see from the summary below, there are no missing values in the dataset.


Some initial facts about the excerpts/sentences

Below we can see, as an example, some of the observations (and excerpts/sentences) available in our dataset.


Question: How many excerpts/sentences are available per author?

no_excerpts_by_author <- spooky_training %>%
  dplyr::group_by(author) %>%
  dplyr::summarise(n = n())

ggplot(data = no_excerpts_by_author,
       mapping = aes(x = author, y = n, fill = author)) +
  geom_col(show.legend = F) +
  ylab(label = "number of excerpts") +
  theme_dark(base_size = 10)
Excerpt graph
Number of excerpts mapped against author name

Question: How long (# of chars) are the excerpts/sentences by author?

spooky_training$len <- nchar(spooky_training$text)

ggplot(data = spooky_training, mapping = aes(x = len, fill = author)) +
  geom_histogram(binwidth = 50) +
  facet_grid(. ~ author) +
  xlab("# of chars") +
  theme_dark(base_size = 10)
Count graph
Count and number of characters graph
ggplot(data = spooky_training, mapping = aes(x = 1, y = len)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_grid(. ~ author) +
  xlab(NULL) +
  ylab("# of chars") +
  theme_dark(base_size = 10)
characters graph
Number of characters

There are some excerpts that are very long. As we can see from the boxplot above, there are a few outliers for each author; a possible explanation is that the sentence segmentation has a few hiccups (see details below):


For example, Mary Wollstonecraft Shelley (MWS) has an excerpt of around 4600 characters:

“Diotima approached the fountain seated herself on a mossy mound near it and her disciples placed themselves on the grass near her Without noticing me who sat close under her she continued her discourse addressing as it happened one or other of her listeners but before I attempt to repeat her words I will describe the chief of these whom she appeared to wish principally to impress One was a woman of about years of age in the full enjoyment of the most exquisite beauty her golden hair floated in ringlets on her shoulders her hazle eyes were shaded by heavy lids and her mouth the lips apart seemed to breathe sensibility But she appeared thoughtful unhappy her cheek was pale she seemed as if accustomed to suffer and as if the lessons she now heard were the only words of wisdom to which she had ever listened The youth beside her had a far different aspect his form was emaciated nearly to a shadow his features were handsome but thin worn his eyes glistened as if animating the visage of decay his forehead was expansive but there was a doubt perplexity in his looks that seemed to say that although he had sought wisdom he had got entangled in some mysterious mazes from which he in vain endeavoured to extricate himself As Diotima spoke his colour went came with quick changes the flexible muscles of his countenance shewed every impression that his mind received he seemed one who in life had studied hard but whose feeble frame sunk beneath the weight of the mere exertion of life the spark of intelligence burned with uncommon strength within him but that of life seemed ever on the eve of fading At present I shall not describe any other of this groupe but with deep attention try to recall in my memory some of the words of Diotima they were words of fire but their path is faintly marked on my recollection It requires a just hand, said she continuing her discourse, to weigh divide the good from evil On the earth they are inextricably entangled and if you would cast away what 
there appears an evil a multitude of beneficial causes or effects cling to it mock your labour When I was on earth and have walked in a solitary country during the silence of night have beheld the multitude of stars, the soft radiance of the moon reflected on the sea, which was studded by lovely islands When I have felt the soft breeze steal across my cheek as the words of love it has soothed cherished me then my mind seemed almost to quit the body that confined it to the earth with a quick mental sense to mingle with the scene that I hardly saw I felt Then I have exclaimed, oh world how beautiful thou art Oh brightest universe behold thy worshiper spirit of beauty of sympathy which pervades all things, now lifts my soul as with wings, how have you animated the light the breezes Deep inexplicable spirit give me words to express my adoration; my mind is hurried away but with language I cannot tell how I feel thy loveliness Silence or the song of the nightingale the momentary apparition of some bird that flies quietly past all seems animated with thee more than all the deep sky studded with worlds” If the winds roared tore the sea and the dreadful lightnings seemed falling around me still love was mingled with the sacred terror I felt; the majesty of loveliness was deeply impressed on me So also I have felt when I have seen a lovely countenance or heard solemn music or the eloquence of divine wisdom flowing from the lips of one of its worshippers a lovely animal or even the graceful undulations of trees inanimate objects have excited in me the same deep feeling of love beauty; a feeling which while it made me alive eager to seek the cause animator of the scene, yet satisfied me by its very depth as if I had already found the solution to my enquires sic as if in feeling myself a part of the great whole I had found the truth secret of the universe But when retired in my cell I have studied contemplated the various motions and actions in the world the weight of evil has 
confounded me If I thought of the creation I saw an eternal chain of evil linked one to the other from the great whale who in the sea swallows destroys multitudes the smaller fish that live on him also torment him to madness to the cat whose pleasure it is to torment her prey I saw the whole creation filled with pain each creature seems to exist through the misery of another death havoc is the watchword of the animated world And Man also even in Athens the most civilized spot on the earth what a multitude of mean passions envy, malice a restless desire to depreciate all that was great and good did I see And in the dominions of the great being I saw man reduced?”
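The outliers drawn in the boxplots above can also be listed programmatically with base R's boxplot.stats, which applies the same 1.5 × IQR rule the plot uses; a toy sketch (with the real data, you would split spooky_training$len by spooky_training$author):

```r
# Sketch: extracting outlying excerpt lengths per author. Toy lengths are
# used here; boxplot.stats()$out returns the points a boxplot would draw
# as outlier dots.
len    <- c(rep(100, 20), 4600)   # twenty typical excerpts, one huge one
author <- factor(rep("MWS", 21))
outliers_by_author <- lapply(split(len, author),
                             function(v) boxplot.stats(v)$out)
outliers_by_author$MWS
```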

Thinking Point: What do we want to do with those excerpts/outliers?

Some more facts about the excerpts/sentences using the bag-of-words

The data is transformed into a tidy format (unigrams only) in order to use the tidy tools to perform some basic and essential NLP operations.

spooky_trainining_tidy_1n <- spooky_training %>%
  select(id, text, author) %>%
  tidytext::unnest_tokens(output = word,
                          input = text,
                          token = "words",
                          to_lower = TRUE)

Each sentence is tokenized into words (normalized to lower case, punctuation removed). The example below shows what the data (each excerpt/sentence) looked like before and after the transformation.


Question: Which are the most common words used by each author?

Let's start by counting how many times each word has been used by each author, and plot the counts.

words_author_1 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "EAP",
                                              greater.than = 500)
words_author_2 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "HPL",
                                              greater.than = 500)
words_author_3 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "MWS",
                                              greater.than = 500)

gridExtra::grid.arrange(words_author_1, words_author_2, words_author_3, nrow = 1)
common words graph
Most common words used by each author

From this initial visualization we can see that the authors quite often use the same set of words – like the, and, and of. These words do not give any actual information about the vocabulary of each author; they are common words that represent just noise when working with unigrams: they are usually called stopwords.

If the stopwords are removed, using the list of stopwords provided by the tidytext package, it is possible to see that the authors do actually use some words more frequently than others (and this differs from author to author – each author's vocabulary footprint).

words_author_1 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "EAP",
                                              greater.than = 70,
                                              remove.stopwords = T)
words_author_2 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "HPL",
                                              greater.than = 70,
                                              remove.stopwords = T)
words_author_3 <- plot_common_words_by_author(x = spooky_trainining_tidy_1n,
                                              author = "MWS",
                                              greater.than = 70,
                                              remove.stopwords = T)

gridExtra::grid.arrange(words_author_1, words_author_2, words_author_3, nrow = 1)
Data analysis graph
Most common words used comparison between EAP, HPL, and MWS

Another way to visualize the most frequent words by author is to use wordclouds. Wordclouds make it easy to spot differences: the importance of each word matches its font size and color.

par(mfrow = c(1,3), mar = c(0,0,0,0))

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,
                                           author = "EAP",
                                           remove.stopwords = TRUE)
mypal <- brewer.pal(8, "Spectral")
wordcloud(words = c("EAP", words_author$word),
          freq = c(max(words_author$n) + 100, words_author$n),
          colors = mypal,
          scale = c(7, .5),
          rot.per = .15,
          max.words = 100,
          random.order = F)

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,
                                           author = "HPL",
                                           remove.stopwords = TRUE)
mypal <- brewer.pal(8, "Spectral")
wordcloud(words = c("HPL", words_author$word),
          freq = c(max(words_author$n) + 100, words_author$n),
          colors = mypal,
          scale = c(7, .5),
          rot.per = .15,
          max.words = 100,
          random.order = F)

words_author <- get_common_words_by_author(x = spooky_trainining_tidy_1n,
                                           author = "MWS",
                                           remove.stopwords = TRUE)
mypal <- brewer.pal(8, "Spectral")
wordcloud(words = c("MWS", words_author$word),
          freq = c(max(words_author$n) + 100, words_author$n),
          colors = mypal,
          scale = c(7, .5),
          rot.per = .15,
          max.words = 100,
          random.order = F)
Most common words
Most common words used by authors

From the word clouds, we can infer that EAP favors the words time, found, eyes, length, day, etc.; HPL favors night, time, found, house, etc.; and MWS favors life, time, love, eyes, etc.

A comparison cloud can be used to compare the different authors. From the R documentation

‘Let p_{i,j} be the rate at which word i occurs in document j, and p_j be the average across documents (∑_i p_{i,j} / ndocs). The size of each word is mapped to its maximum deviation (max_i(p_{i,j} − p_j)), and its angular position is determined by the document where that maximum occurs.’

See below the comparison cloud between all authors:

comparison_data <- spooky_trainining_tidy_1n %>%
  dplyr::select(author, word) %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::count(author, word, sort = TRUE)

comparison_data %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "violetred4", "rosybrown1"),
                   random.order = F,
                   scale = c(7, .5),
                   rot.per = .15,
                   max.words = 200)
Comparison cloud
Comparison cloud between authors

Below are the comparison clouds between the authors, two authors at a time.

par(mfrow = c(1,3), mar = c(0,0,0,0))

comparison_EAP_MWS <- comparison_data %>%
  dplyr::filter(author == "EAP" | author == "MWS")
comparison_EAP_MWS %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "rosybrown1"),
                   random.order = F,
                   scale = c(3, .2),
                   rot.per = .15,
                   max.words = 100)

comparison_HPL_MWS <- comparison_data %>%
  dplyr::filter(author == "HPL" | author == "MWS")
comparison_HPL_MWS %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("violetred4", "rosybrown1"),
                   random.order = F,
                   scale = c(3, .2),
                   rot.per = .15,
                   max.words = 100)

comparison_EAP_HPL <- comparison_data %>%
  dplyr::filter(author == "EAP" | author == "HPL")
comparison_EAP_HPL %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "violetred4"),
                   random.order = F,
                   scale = c(3, .2),
                   rot.per = .15,
                   max.words = 100)
Comparison cloud
Comparison cloud between EAP, HPL, and MWS

Question: How many unique words are needed in the author dictionary to cover 90% of the used word instances?

words_cov_author_1 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "EAP")
words_cov_author_2 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "HPL")
words_cov_author_3 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "MWS")

gridExtra::grid.arrange(words_cov_author_1, words_cov_author_2, words_cov_author_3, nrow = 1)
Word coverage plots
% coverage of unique words for EAP, HPL, and MWS

From the plot above we can see that for the EAP and HPL corpora, circa 7500 words are needed to cover 90% of word instances, while for the MWS corpus, circa 5000 words are needed.

Question: Is there any commonality between the dictionaries used by the authors?

Are the authors using the same words? A commonality cloud can be used to answer this specific question: it emphasizes the similarities between authors by plotting a cloud of their shared words. It shows only those words that are used by all authors, with their combined frequency across authors.

See below the commonality cloud between all authors.

comparison_data <- spooky_trainining_tidy_1n %>%
  dplyr::select(author, word) %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::count(author, word, sort = TRUE)

mypal <- brewer.pal(8, "Spectral")
comparison_data %>%
  reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
  commonality.cloud(colors = mypal,
                    random.order = F,
                    scale = c(7, .5),
                    rot.per = .15,
                    max.words = 200)
Commonality cloud
Commonality cloud of shared word usage between authors

Question: Can Word Frequencies be used to compare different authors?

First of all, we need to prepare the data by calculating the word frequencies for each author.

word_freqs <- spooky_trainining_tidy_1n %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::count(author, word) %>%
  dplyr::group_by(author) %>%
  dplyr::mutate(word_freq = n / sum(n)) %>%
  dplyr::select(-n)

 


Then we need to spread the author (key) and the word frequency (value) across multiple columns (note how NAs have been introduced for words not used by an author).
word_freqs <- word_freqs %>%
  tidyr::spread(author, word_freq)


Let's start by plotting the word frequencies (log scale), comparing two authors at a time, and see how the words distribute on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts, while words that are far from the line are found more in one set of texts than in the other.
As we can see in the plots below, some words lie close to the line, but most are scattered around it, showing differences between the frequencies.
# Removing incomplete cases - not all words are common to the authors;
# when spreading words to all authors, some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_HPL <- word_freqs %>%
  dplyr::select(word, EAP, HPL) %>%
  dplyr::filter(!is.na(EAP) & !is.na(HPL))

ggplot(data = word_freqs_EAP_vs_HPL, mapping = aes(x = EAP, y = HPL, color = abs(EAP - HPL))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "HP Lovecraft", x = "Edgar Allan Poe")


# Removing incomplete cases - not all words are common to the authors;
# when spreading words to all authors, some will get NAs (if not used
# by an author)
word_freqs_EAP_vs_MWS <- word_freqs %>%
  dplyr::select(word, EAP, MWS) %>%
  dplyr::filter(!is.na(EAP) & !is.na(MWS))

ggplot(data = word_freqs_EAP_vs_MWS, mapping = aes(x = EAP, y = MWS, color = abs(EAP - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "Edgar Allan Poe")


# Removing incomplete cases - not all words are common to the authors;
# when spreading words to all authors, some will get NAs (if not used
# by an author)
word_freqs_HPL_vs_MWS <- word_freqs %>%
  dplyr::select(word, HPL, MWS) %>%
  dplyr::filter(!is.na(HPL) & !is.na(MWS))

ggplot(data = word_freqs_HPL_vs_MWS, mapping = aes(x = HPL, y = MWS, color = abs(HPL - MWS))) +
  geom_abline(color = "red", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  theme(legend.position = "none") +
  labs(y = "Mary Wollstonecraft Shelley", x = "HP Lovecraft")


In order to quantify how similar/different these sets of word frequencies by author are, we can calculate a correlation measurement between the sets (Spearman's rank correlation, as used in the code below). There is a correlation of around 0.48 to 0.5 between the different authors (see plot below).

word_freqs %>%
  select(-word) %>%
  cor(use = "complete.obs", method = "spearman") %>%
  corrplot(type = "lower",
           method = "pie",
           diag = F)
Correlation graph
Correlation between EAP, HPL, and MWS

References

[1] Kaggle challenge: Spooky Author Identification
[2] "Text Mining with R – A Tidy Approach" by J. Silge & D. Robinson, O'Reilly 2017
[3] "Regular Expressions, Text Normalization, and Edit Distance", draft chapter by D. Jurafsky & J. H. Martin, 2018

Appendix: Supporting functions

getNoExcerptsFor <- function(x, author){
  sum(x$author == author)
}

getPercentageExcerptsFor <- function(x, author){
  round((sum(x$author == author) / dim(x)[1]) * 100, digits = 2)
}

get_xxx_length <- function(x, author, func){
  round(func(x[x$author == author,]$len), digits = 2)
}

plot_common_words_by_author <- function(x, author, remove.stopwords = FALSE, greater.than = 90){
  the_title = author
  if(remove.stopwords){
    x <- x %>% dplyr::anti_join(stop_words)
  }
  x[x$author == author,] %>%
    dplyr::count(word, sort = TRUE) %>%
    dplyr::filter(n > greater.than) %>%
    dplyr::mutate(word = reorder(word, n)) %>%
    ggplot(mapping = aes(x = word, y = n)) +
    geom_col() +
    xlab(NULL) +
    ggtitle(the_title) +
    coord_flip() +
    theme_dark(base_size = 10)
}

get_common_words_by_author <- function(x, author, remove.stopwords = FALSE){
  if(remove.stopwords){
    x <- x %>% dplyr::anti_join(stop_words)
  }
  x[x$author == author,] %>%
    dplyr::count(word, sort = TRUE)
}

plot_word_cov_by_author <- function(x, author){
  words_author <- get_common_words_by_author(x, author, remove.stopwords = TRUE)
  words_author %>%
    mutate(cumsum = cumsum(n),
           cumsum_perc = round(100 * cumsum / sum(n), digits = 2)) %>%
    ggplot(mapping = aes(x = 1:dim(words_author)[1], y = cumsum_perc)) +
    geom_line() +
    geom_hline(yintercept = 75, color = "yellow", alpha = 0.5) +
    geom_hline(yintercept = 90, color = "orange", alpha = 0.5) +
    geom_hline(yintercept = 95, color = "red", alpha = 0.5) +
    xlab("no of 'unique' words") +
    ylab("% Coverage") +
    ggtitle(paste("% Coverage unique words -", author, sep = " ")) +
    theme_dark(base_size = 10)
}
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS 10.13
##
## locale:
## [1] no_NO.UTF-8/no_NO.UTF-8/no_NO.UTF-8/C/no_NO.UTF-8/no_NO.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] bindrcpp_0.2       corrplot_0.84      wordcloud_2.5
##  [4] RColorBrewer_1.1-2 gridExtra_2.3      dplyr_0.7.3
##  [7] purrr_0.2.3        readr_1.1.1        tidyr_0.7.1
## [10] tibble_1.3.4       ggplot2_2.2.1      tidyverse_1.1.1
## [13] tidytext_0.1.3
##
## loaded via a namespace (and not attached):
##  [1] httr_1.3.1         ddalpha_1.2.1      splines_3.3.3
##  [4] jsonlite_1.5       foreach_1.4.3      prodlim_1.6.1
##  [7] modelr_0.1.1       assertthat_0.2.0   highr_0.6
## [10] stats4_3.3.3       DRR_0.0.2          cellranger_1.1.0
## [13] yaml_2.1.14        robustbase_0.92-7  slam_0.1-40
## [16] ipred_0.9-6        backports_1.1.0    lattice_0.20-35
## [19] glue_1.1.1         digest_0.6.12      rvest_0.3.2
## [22] colorspace_1.3-2   recipes_0.1.0      htmltools_0.3.6
## [25] Matrix_1.2-11      plyr_1.8.4         psych_1.7.8
## [28] timeDate_3012.100  pkgconfig_2.0.1    CVST_0.2-1
## [31] broom_0.4.2        haven_1.1.0        caret_6.0-77
## [34] scales_0.5.0       gower_0.1.2        lava_1.5
## [37] withr_2.0.0        nnet_7.3-12        lazyeval_0.2.0
## [40] mnormt_1.5-5       survival_2.41-3    magrittr_1.5
## [43] readxl_1.0.0       evaluate_0.10.1    tokenizers_0.1.4
## [46] janeaustenr_0.1.5  nlme_3.1-131       SnowballC_0.5.1
## [49] MASS_7.3-47        forcats_0.2.0      xml2_1.1.1
## [52] dimRed_0.1.0       foreign_0.8-69     class_7.3-14
## [55] tools_3.3.3        hms_0.3            stringr_1.2.0
## [58] kernlab_0.9-25     munsell_0.4.3      RcppRoll_0.2.2
## [61] rlang_0.1.2        grid_3.3.3         iterators_1.0.8
## [64] labeling_0.3       rmarkdown_1.6      gtable_0.2.0
## [67] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2
## [70] R6_2.2.2           lubridate_1.6.0    knitr_1.17
## [73] bindr_0.1          rprojroot_1.2      stringi_1.1.5
## [76] parallel_3.3.3     Rcpp_0.12.12       rpart_4.1-11
## [79] tidyselect_0.2.0   DEoptimR_1.0-8
Jasmine Wilkerson
| July 2, 2015

What does the data look like for political contributions when we look at each state? How does generosity appear in each state, and what does state activism look like?

Generosity and activism by the state

A few days ago, I published an article about analyzing financial contributions to political campaigns.

When we look at the total individual contributions to political committees by state, it is apparent that California, New York, and Texas take the lead. Given that these states have the largest populations, can we justify a claim that their residents are more generous when it comes to political contributions?

Generosity and Activism by State
Individual contributions from 2011-2014 by State

Individual political contributions per capita

In contrast, the contribution per capita tells a different story. After this adjustment for population by state, Massachusetts and Connecticut lead for political generosity, while Idaho and Mississippi consistently collect fewer total contributions and less per person. Other generous states are New York, Virginia, Wyoming, California, and Colorado.

Individual Political Contributions per Capita
A map of individual political contributions per capita
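The per-capita adjustment is simply each state's total divided by its population; a sketch with made-up numbers (the real analysis would use FEC contribution totals and census population estimates):

```r
# Hypothetical figures only - illustrating why the per-capita ranking can
# differ from the ranking by raw totals.
states <- data.frame(
  state      = c("California", "Massachusetts", "Idaho"),
  total_usd  = c(900e6, 200e6, 10e6),   # assumed contribution totals ($)
  population = c(38e6, 6.7e6, 1.6e6)    # assumed populations
)
states$per_capita <- states$total_usd / states$population
states[order(-states$per_capita), c("state", "per_capita")]
```

With these assumed figures, Massachusetts ranks above California per capita despite a much smaller total, mirroring the pattern described above.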

Measuring political activism

Can we measure political activism by analyzing the individual contribution data? When we look at the number of donors as a share of the total population by state, surprisingly Montana seems to have a high proportion of political donors.

Measuring Political Activism
Percentage of state population donated

 
