A hands-on guide to collect and store twitter data for timeseries analysis

“A couple of weeks back, I was working on a project in which I had to scrape tweets from twitter and after storing them in a csv file, I had to plot some graphs for timeseries analysis. I requested Twitter for Twitter developer API, but unfortunately my request was not fulfilled. Then I started searching for python libraries which can allow me to scrape tweets without the official Twitter API.

To my amazement, there were several libraries through which you can scrape tweets easily but for my project I found ‘Snscrape’ to be the best library, which met my requirements!”

What is SNScrape?

A scraper for social networking platforms known as snscrape (SNS). It retrieves objects, such as pertinent posts, by scraping things like user profiles, hashtags, or searches.

Install Snscrape

Snscrape requires Python 3.8 or higher. The Python package dependencies are installed automatically when you install Snscrape. You can install using the following commands.

pip3 install snscrape
pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git (Development Version)

For this tutorial we will be using the development version of Snscrape. Paste the second command in command prompt(cmd), make sure you have git installed on your system.

Code walkthrough for scraping

Before starting make sure you have the following python libraries:

Pandas
Numpy
Snscrape
Tqdm
Seaborn

Matplotlit

Importing Relevant Libraries

To run the scraping program, you will first need to import the libraries

import pandas as pd 

import numpy as np 

import snscrape.modules.twitter as sntwitter 

import datetime 

from tqdm.notebook import tqdm_notebook 

import seaborn as sns 

import matplotlib.pyplot as plt 

sns.set_theme(style="whitegrid")

Taking User Input

To scrape tweets, you can provide many filters such as the username or start date or end date etc. We will be taking the following user inputs which will then be used in Snscrape.

Text: The query to be matched. (Optional)
Username: Specific username from twitter account. (Required)
Since: Start Date in this format yyyy-mm-dd. (Optional)
Until: End Date in this format yyyy-mm-dd. (Optional)
Count: Max number of tweets to retrieve. (Required)

Retweet: Include or Exclude Retweets. (Required)
Replies: Include or Exclude Replies. (Required)

For this tutorial we used the following inputs:

text = input('Enter query text to be matched (or leave it blank by pressing enter)') 

username = input('Enter specific username(s) from a twitter account without @ (or leave it blank by pressing enter): ') 

since = input('Enter startdate in this format yyyy-mm-dd (or leave it blank by pressing enter): ') 

until = input('Enter enddate in this format yyyy-mm-dd (or leave it blank by pressing enter): ') 

count = int(input('Enter max number of tweets or enter -1 to retrieve all possible tweets: ')) 

retweet = input('Exclude Retweets? (y/n): ') 

replies = input('Exclude Replies? (y/n): ')

Which field can we Scrape?

Here is the list of fields which we can scrape using Snscrape Library.

url: str

date: datetime.datetime
rawContent: str
renderedContent: str
id: int
user: ‘User’

replyCount: int
retweetCount: int
likeCount: int
quoteCount: int
conversationId: int

lang: str
source: str
sourceUrl: typing.Optional[str] = None
sourceLabel: typing.Optional[str] = None
links: typing.Optional[typing.List[‘TextLink’]] = None

media: typing.Optional[typing.List[‘Medium’]] = None
retweetedTweet: typing.Optional[‘Tweet’] = None
quotedTweet: typing.Optional[‘Tweet’] = None
inReplyToTweetId: typing.Optional[int] = None
inReplyToUser: typing.Optional[‘User’] = None

mentionedUsers: typing.Optional[typing.List[‘User’]] = None
coordinates: typing.Optional[‘Coordinates’] = None
place: typing.Optional[‘Place’] = None
hashtags: typing.Optional[typing.List[str]] = None
cashtags: typing.Optional[typing.List[str]] = None

card: typing.Optional[‘Card’] = None

For this tutorial we will not scrape all the fields but a few relevant fields from the above list.

The search function

Next, we will define a search function which takes in the following inputs as arguments and creates a query string to be passed inside SNS twitter search scraper function.

Text
Username
Since
Until
Retweet

Replies

def search(text,username,since,until,retweet,replies): 

    global filename 

    q = text 

    if username!='': 

        q += f" from:{username}"     

    if until=='': 

        until = datetime.datetime.strftime(datetime.date.today(), '%Y-%m-%d') 

    q += f" until:{until}" 

    if since=='': 

        since = datetime.datetime.strftime(datetime.datetime.strptime(until, '%Y-%m-%d') -  

                                           datetime.timedelta(days=7), '%Y-%m-%d') 

    q += f" since:{since}" 

    if retweet == 'y': 

        q += f" exclude:retweets" 

    if replies == 'y': 

        q += f" exclude:replies" 

    if username!='' and text!='': 

        filename = f"{since}_{until}_{username}_{text}.csv" 

    elif username!="": 

        filename = f"{since}_{until}_{username}.csv" 

    else: 

        filename = f"{since}_{until}_{text}.csv" 

    print(filename) 

    return q

Here we have defined different conditions and based on those conditions we are creating the query string. For example, if variable until (end date) is empty then we are assigning it the current date and appending it in a query string and if the variable since (start date) is empty then we are assigning it a date of past 7 days from the current date. Along with the query string, we are creating filename string which will be used to name our csv file.

Calling the Search Function and creating Dataframe

q = search(text,username,since,until,retweet,replies) 

# Creating list to append tweet data  

tweets_list1 = [] 

 

# Using TwitterSearchScraper to scrape data and append tweets to list 

if count == -1: 

    for i,tweet in enumerate(tqdm_notebook(sntwitter.TwitterSearchScraper(q).get_items())): 

        tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username,tweet.lang, 

        tweet.hashtags,tweet.replyCount,tweet.retweetCount, tweet.likeCount,tweet.quoteCount,tweet.media]) 

else: 

    with tqdm_notebook(total=count) as pbar: 

        for i,tweet in enumerate(sntwitter.TwitterSearchScraper(q).get_items()): #declare a username  

            if i>=count: #number of tweets you want to scrape 

                break 

            tweets_list1.append([tweet. Date, tweet.id, tweet.rawContent, tweet.user.username,tweet.lang,tweet.hashtags,tweet.replyCount, 

                                tweet.retweetCount,tweet.likeCount,tweet.quoteCount,tweet.media]) 

            pbar.update(1) 

# Creating a dataframe from the tweets list above  

tweets_df1 = pd.DataFrame(tweets_list1, columns=['DateTime', 'TweetId', 'Text', 'Username','Language', 

                                'Hashtags','ReplyCount','RetweetCount','LikeCount','QuoteCount','Media'])

In this snippet we have invoked the search function and the query string is stored inside variable ‘q’. Next, we have defined an empty list which will be used for appending tweet data. If the count is specified as -1 then the for loop will iterate over all the tweets.

TwitterSearchScraper class constructor takes the query string as an argument and then we invoke the get_items() method of TwitterSearchScraper class to retrieve all the tweets. Inside for loop we append scraped data in the tweets_list1 variable which we defined earlier. If count is defined, then we use it to break the for loop. Finally, using this list, we create the pandas dataframe by specifying the column names.

tweets_df1.sort_values(by='DateTime',ascending=False)

Data frame - Panda's library — Data frame created using Panda’s library

Data Preprocessing

Before saving the data frame in a csv file, we will first process the data, so that we can easily perform analysis on it.

Data Description

tweets_df1.info()

Data Transformation

Now we will add more columns to facilitate timeseries analysis

tweets_df1['Hour'] = tweets_df1['DateTime'].dt.hour 

tweets_df1['Year'] = tweets_df1['DateTime'].dt.year   

tweets_df1['Month'] = tweets_df1['DateTime'].dt.month 

tweets_df1['MonthName'] = tweets_df1['DateTime'].dt.month_name() 

tweets_df1['MonthDay'] = tweets_df1['DateTime'].dt.day 

tweets_df1['DayName'] = tweets_df1['DateTime'].dt.day_name() 

tweets_df1['Week'] = tweets_df1['DateTime'].dt.isocalendar().week

The Datetime column contains both date and time, therefore it is better to split data and time in separate columns.

tweets_df1['Date'] = [d.date() for d in tweets_df1['DateTime']] 

tweets_df1['Time'] = [d.time() for d in tweets_df1['DateTime']]

After splitting we will drop the DateTime column.

tweets_df1.drop('DateTime',axis=1,inplace=True) 

tweets_df1

Finally our data is prepared, we will now save the dataframe as csv using df.to_csv() function which takes filename as an input parameter.

tweets_df1.to_csv(f"{filename}",index=False)

Visualizing timeseries data using barplot, lineplot, histplot and kdeplot

It is time to visualize our prepared data so that we can find useful insights. Firstly, we will load the saved csv in a dateframe using the read_csv() function of pandas which take filename as input parameter.

tweets = pd.read_csv("2018-01-01_2022-09-27_DataScienceDojo.csv") 

tweets

Count by Year

The countplot function of seaborn allows us to plot count of tweets by year.

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Year']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12)

Plot count of tweets - Bar graph — Plot count of tweets – Bar graph

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Year.value_counts()) 

ax.set_xlabel("Year") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(2018,2023,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Year,stat='count',binwidth=1,kde='true',discrete=True) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Year,fill=True) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Year,fill=True,bw_adjust=3) 

plt.xticks(np.arange(2018,2023,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show()

Plot count of tweets - per year — Plot count of tweets – per year

Count by Month

We will follow the same steps for count by month, by week, by day of month and by hour.

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Month']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12)

Monthly Tweet counts - chart — Monthly Tweet counts – chart

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Month.value_counts()) 

ax.set_xlabel("Month") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(1,13,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Month,stat='count',binwidth=1,kde='true',discrete=True) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Month,fill=True) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Month,fill=True,bw_adjust=3) 

plt.xticks(np.arange(1,13,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show()

Count by Week

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Week']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.005, p.get_height()+5), fontsize = 10)

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Week.value_counts()) 

ax.set_xlabel("Week") 

ax.set_ylabel('Count') 

 

plt.subplot(222) 

sns.histplot(x=tweets.Week,stat='count',binwidth=1,kde='true',discrete=True) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Week,fill=True) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Week,fill=True,bw_adjust=3) 

plt.grid() 

 

plt.tight_layout() 

plt.show()

Count by Day of Month

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['MonthDay']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+5), fontsize = 12)

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.MonthDay.value_counts()) 

ax.set_xlabel("MonthDay") 

ax.set_ylabel('Count') 

 

plt.subplot(222) 

sns.histplot(x=tweets.MonthDay,stat='count',binwidth=1,kde='true',discrete=True) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.MonthDay,fill=True) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.MonthDay,fill=True,bw_adjust=3) 

plt.grid() 

 

plt.tight_layout() 

plt.show()

Count by Hour

f, ax = plt.subplots(figsize=(15, 10)) 

sns.countplot(x= tweets['Hour']) 

for p in ax.patches: 

    ax.annotate(int(p.get_height()), (p.get_x()+0.05, p.get_height()+20), fontsize = 12)

plt.figure(figsize=(15, 8)) 

 

ax=plt.subplot(221) 

sns.lineplot(tweets.Hour.value_counts()) 

ax.set_xlabel("Hour") 

ax.set_ylabel('Count') 

plt.xticks(np.arange(0,24,1)) 

 

plt.subplot(222) 

sns.histplot(x=tweets.Hour,stat='count',binwidth=1,kde='true',discrete=True) 

plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.subplot(223) 

sns.kdeplot(x=tweets.Hour,fill=True) 

plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.subplot(224) 

sns.kdeplot(x=tweets.Hour,fill=True,bw_adjust=3) 

#plt.xticks(np.arange(0,24,1)) 

plt.grid() 

 

plt.tight_layout() 

plt.show() 

 
Hourly tweets count charts

Conclusion

From the above time series visualizations, we can clearly understand that the peak hours of tweets from this account is between 7pm-9pm and from 4am -1pm the twitter handle is quiet. We can also point out that most of the tweets related to that topic are done in the month of August. Similarly, we can identify that the Twitter handle was not very active before 2021.

Conclusively, we saw how we can easily scrape tweets without using Twitter API through Snscrape. Then we performed some transformations on the scraped data and stored it in csv file. Later, we used that csv file for time-series visualizations and analysis. We appreciate you following along with this hands-on guide. We hope that this guide will make it easy for you to get started on your upcoming data science project.

<<Link to Complete Code>>

LLM - Online Courses

Reviews

Consulting

Community

twitter data

Syed Umair Hasan

Scrape Twitter data without Twitter API using SNScrape for timeseries analysis

What is SNScrape?

Install Snscrape

Code walkthrough for scraping

Which field can we Scrape?

The search function

Data Preprocessing

Visualizing timeseries data using barplot, lineplot, histplot and kdeplot

Conclusion

Related Topics

Training Programs

Enterprise

Community

About