Social platforms are hugely popular these days. Websites like YouTube, Facebook, and Instagram are used by billions of people, and they hold a lot of data that can be used for sentiment analysis around an incident, election forecasting, predicting the outcome of a big event, and so on. With this data in hand, you can analyze the risk behind a decision.
In this post, we are going to web-scrape public Facebook pages using Python and Selenium. We will also discuss the libraries and tools required for the process. So, if you’re interested in web scraping and data analysis, keep reading!
Read more about web scraping with Python and BeautifulSoup and kickstart your analysis today.
What do we need before writing the code?
We will use Python 3.x for this tutorial, and I am assuming that you have already installed it on your machine. Other than that, we need to install two third-party libraries: BeautifulSoup and Selenium.
- BeautifulSoup — This will help us parse raw HTML and extract the data we need. It is also known as BS4.
- Selenium — It will help us render JavaScript websites.
- We also need chromedriver, the Chromium driver, so Selenium can render websites. You can download it from the official ChromeDriver downloads page.
Before installing these libraries, you have to create a folder where you will keep the Python script.
Now, create a Python file inside this folder. You can use any name for it, and then, finally, install the libraries.
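Assuming pip is set up on your machine, both libraries install in one command:

```
pip install beautifulsoup4 selenium
```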
What will we extract from a Facebook page?
We are going to scrape addresses, phone numbers, and emails from our target page.
First, we will extract the raw HTML from the Facebook page using Selenium, and then we will use the .find() and .find_all() methods of BS4 to parse the data we need out of that HTML. Chromium will be used in coordination with Selenium to load the website.
Read about: How to scrape Twitter data without Twitter API using SNScrape.
Let’s start scraping
Let’s first write a small code to see if everything works fine for us.
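Here is a minimal sketch of such a script. The chromedriver path and the page URL are placeholders; substitute your own.

```python
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Placeholder path -- point this at wherever you kept chromedriver
PATH = "/usr/local/bin/chromedriver"

l = []  # list that will hold the scraped objects
o = {}  # object (dictionary) that will store the details

# Placeholder URL -- replace with the public Facebook page you want to scrape
target_url = "https://www.facebook.com/example-page"

# Selenium 3 style; on Selenium 4 pass a Service object instead:
# driver = webdriver.Chrome(service=Service(PATH))
driver = webdriver.Chrome(PATH)

driver.get(target_url)     # open the target page
time.sleep(2)              # pause for two seconds while the page loads

resp = driver.page_source  # collect the raw HTML of the page
driver.close()             # shut down the Chrome instance
```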
Let’s understand the above code step by step.
- We imported all the libraries we installed earlier, plus the time library, which is used to make the script wait a little before closing the Chromium driver.
- Then we declared the PATH of our Chromium driver. This is the path where you kept chromedriver.
- We declared an empty list and an object (a dictionary) to store the data.
- target_url holds the page we are going to scrape.
- Then, using the .Chrome() method, we created an instance for rendering the website.
- Then, using the .get() method of the Selenium API, we opened the target page.
- The .sleep() method pauses the script for two seconds.
- Using .page_source, we collected all the raw HTML of the page.
- The .close() method shuts down the Chrome instance.
Once you run this code, it will open a Chrome instance, load the target page, and then, after waiting for two seconds, close the instance. The first run will open Chrome a little slowly, but after two or three runs it will get faster.
Once you inspect the page, you will find that the intro section, the contact-details section, and the photo-gallery section all share the same class name on their div tags. Since our main focus in this tutorial is the contact details, we will target the second div tag.
Let’s find this element using the .find_all() method provided by the BS4 API.
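Here is a sketch of that step. The class string below is a placeholder: Facebook's class names are auto-generated and change over time, so copy the real one from your browser's inspector.

```python
# Build a parse tree from the raw HTML collected by Selenium
soup = BeautifulSoup(resp, "html.parser")

# "SHARED_CLASS_NAME" is a placeholder -- use the class shown in the inspector.
# The intro, contact-details, and photo-gallery sections all share it.
allBoxes = soup.find_all("div", {"class": "SHARED_CLASS_NAME"})

# The contact details live in the second of those divs
contactBox = allBoxes[1]
```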
We have created a parse tree using BeautifulSoup and now we are going to extract crucial data from it.
Using the .find_all() method, we searched for all the div tags with that class and then selected the second element from the list.
Now, here is a catch. Every element in this list has the same class and tag. So, we have to look at patterns in the text itself in order to figure out which element holds which piece of information.
Let’s find all of these element tags; later we will use a for loop to iterate over each of them and identify which element is what.
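A sketch of that lookup, with the class string again standing in for whatever you find via inspection:

```python
# Every contact-detail row shares one tag and class, so grab them all.
# "DETAIL_CLASS_NAME" is a placeholder for the class found in the inspector.
allDetails = contactBox.find_all("div", {"class": "DETAIL_CLASS_NAME"})
```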
Here is how we will identify the address, number, and email.
- The address can be identified if the text contains more than two commas.
- The number can be identified if the text contains more than two dashes (-).
- Email can be identified if the text contains “@” in it.
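Translating those three rules into code, the loop could look like this sketch:

```python
for detail in allDetails:
    text = detail.text

    if text.count(",") > 2:      # address: more than two commas
        o["address"] = text
    elif text.count("-") > 2:    # phone number: more than two dashes
        o["phone"] = text
    elif "@" in text:            # email: contains an @
        o["email"] = text
```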
We ran a for loop over the allDetails variable, identifying the elements one by one. Whenever an element satisfies one of the if conditions, we store its text in the object o.
In the end, you can append the object o to the list l and print it.
Once you run this code, you should see the scraped address, phone number, and email printed as the result.
Complete Code
We can make further changes to this code to scrape more information from the page. But for now, the code will look like this.
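Putting all the pieces together (the class names, the chromedriver path, and the URL are still placeholders you must replace after inspecting the page yourself):

```python
from bs4 import BeautifulSoup
from selenium import webdriver
import time

PATH = "/usr/local/bin/chromedriver"  # placeholder path to chromedriver

l = []
o = {}

target_url = "https://www.facebook.com/example-page"  # placeholder URL

# Render the page with Selenium and collect its raw HTML
driver = webdriver.Chrome(PATH)
driver.get(target_url)
time.sleep(2)
resp = driver.page_source
driver.close()

# Parse the HTML and narrow down to the contact-details section.
# Both class strings are placeholders -- copy the real, auto-generated
# classes from your browser's inspector.
soup = BeautifulSoup(resp, "html.parser")
allBoxes = soup.find_all("div", {"class": "SHARED_CLASS_NAME"})
contactBox = allBoxes[1]
allDetails = contactBox.find_all("div", {"class": "DETAIL_CLASS_NAME"})

# Classify each detail by simple patterns in its text
for detail in allDetails:
    text = detail.text
    if text.count(",") > 2:
        o["address"] = text
    elif text.count("-") > 2:
        o["phone"] = text
    elif "@" in text:
        o["email"] = text

l.append(o)
print(l)
```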
Conclusion
Today we scraped a Facebook page to collect emails for lead generation. This is just an example of scraping a single page; if you have thousands of pages, you can use the pandas library to store all the data in a CSV file. I leave that task to you as homework.
I hope you liked this little tutorial. If you did, please don't forget to share it with your friends and on your social media.
Written by Manthan Koolwal