This post is part 1 of the 'Advanced Scraping' series.
The Python documentation, Wikipedia, and most blogs (including this one) serve static content. When we request the URL, the final HTML is returned to us. If that's the case, then a parser like BeautifulSoup is all you need. A short example of scraping a static page is demonstrated below. I have an overview of BeautifulSoup here.
A site with dynamic content is one where requesting the URL returns incomplete HTML. The HTML includes JavaScript for the browser to execute. Only once the JavaScript finishes running is the HTML in its final state. This is common for sites that update frequently. For example, weather.com would use JavaScript to look up the latest weather. An Amazon webpage would use JavaScript to load the latest reviews from its database. If you use a parser on a dynamically generated page, you get a skeleton of the page with the unexecuted JavaScript in it.
This post will outline different strategies for scraping dynamic pages.
An example of scraping a static page
Let's start with an example of scraping a static page. This code demonstrates how to get the Introduction section of the Python style guide, PEP8:
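A minimal sketch of such a script, assuming PEP 8 is served at `https://peps.python.org/pep-0008/` and that the Introduction is wrapped in an element with `id="introduction"` (both are assumptions about the live page, and the helper names are illustrative):

```python
import requests
from bs4 import BeautifulSoup

PEP8_URL = "https://peps.python.org/pep-0008/"  # assumed location of PEP 8


def section_text(html, section_id):
    """Return the text of the element with the given id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find(id=section_id)
    return section.get_text() if section is not None else None


def fetch_introduction(url=PEP8_URL):
    """Download a static page and pull out its Introduction section."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return section_text(response.text, "introduction")
```

Calling `print(fetch_introduction())` performs the request and prints the section text.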
This prints
Introduction This document gives coding conventions for the Python code comprising the standard library in the main Python distribution. Please see the companion informational PEP describing style guidelines for the C code in the C implementation of Python [1]...
Voilà! If all you have is a static page, you are done!
The straightforward way to scrape a dynamic page
The easiest way of scraping a dynamic page is to actually execute the JavaScript and allow it to alter the HTML to finish the page. We can then pass the rendered (i.e. finalized) HTML to Python and use the same parsing techniques we used on static sites. The Python module Selenium allows us to control a browser directly from Python. The steps to parse a dynamic page using Selenium are:
- Initialize a driver (a Python object that controls a browser window)
- Direct the driver to the URL we want to scrape.
- Wait for the driver to finish executing the JavaScript and changing the HTML. The driver is typically a Chrome driver, so the page is treated the same way as if you were visiting it in Chrome.
- Use `driver.page_source` to get the HTML as it appears after JavaScript has rendered it.
- Use a parser on the returned HTML.
The website https://webscraper.io has some fake pages to test scraping on. Let's use Selenium on the page https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops to get the product name and the price for the six items listed on the first page. These are randomly generated; at the time of writing the products were an Asus VivoBook (295.99), two Prestigio SmartBs (299 each), an Acer Aspire ES1 (306.99), and two Lenovo V110s (322 and 356).
Once the HTML has been rendered by Selenium, each item has a div with class `caption` that contains the information we want. The product name is in a subdiv with class `title`, and the price is in a subdiv with the classes `pull-right` and `price`. Here is code for scraping the product names and prices:
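A sketch of the parsing half, assuming the rendered markup (for example the value of `driver.page_source`) is passed in as a string and that the class names are still the ones described above; the site's markup may have changed since, and `parse_products` is a hypothetical name:

```python
from bs4 import BeautifulSoup


def parse_products(html):
    """Return (name, price) pairs found in rendered listing HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for caption in soup.find_all("div", class_="caption"):
        title = caption.select_one(".title")
        price = caption.select_one(".pull-right.price")
        if title is None or price is None:
            continue  # skip captions that do not match the expected layout
        products.append((title.get_text(strip=True), price.get_text(strip=True)))
    return products
```

Selecting by class only, without naming the tag, keeps the sketch independent of whether the title and price sit in `div`, `a`, or heading elements.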
Trying to scrape a dynamic site using requests
What would happen if we tried to load this e-commerce site using requests? That is, what if we didn't know it was a dynamic site?
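One way to run that experiment, sketched under the assumption that the plain `requests` library is used and that products are identified by `div.caption`; the function names are illustrative:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops"


def find_captions(html):
    """Return every product caption present in the given HTML."""
    return BeautifulSoup(html, "html.parser").find_all("div", class_="caption")


def fetch_without_browser(url=URL):
    """Download the page with requests only; no JavaScript is executed."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

Evaluating `find_captions(fetch_without_browser())` shows how many items are present before any JavaScript has run.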
The HTML we get out can be a little difficult to read directly. If you are using a terminal, then you can save the results from `r.html` to a file and then load it in a browser. If you are using a Jupyter notebook, you can actually use a neat trick to render the output in your browser:
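Both options can be sketched as follows, assuming the downloaded markup is held in a string (called `html` here). The helper names are hypothetical, and `show_in_notebook` relies on IPython's `display` and `HTML`, which are only importable where IPython is installed:

```python
import webbrowser
from pathlib import Path


def save_html(html, path):
    """Write the markup to a file and return its path."""
    path = Path(path)
    path.write_text(html, encoding="utf-8")
    return path


def open_in_browser(path):
    """Open a saved HTML file in the default browser."""
    webbrowser.open(Path(path).resolve().as_uri())


def show_in_notebook(html):
    """Render the markup inline in a Jupyter notebook cell."""
    # Imported here because IPython is only needed for the notebook case.
    from IPython.display import HTML, display

    display(HTML(html))
```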
The output in the notebook is an empty list, because javascript hasn't generated the items yet.
Using Selenium is an (almost) sure-fire way of being able to generate any of the dynamic content that you need, because the pages are actually visited by a browser (albeit one controlled by Python rather than you). If you can see it while browsing, Selenium will be able to see it as well.
There are some drawbacks to using Selenium over pure requests:
- It's slow.
We have to wait for pages to render, rather than just grabbing the data we want.
- We have to download images and assets, using bandwidth
Related to the previous point, even if we are just parsing for text, our browser will download all ads and images on the site.
- Chrome takes a lot of memory
When scraping, we might want to have parallel scrapers running (e.g. one for each category of items on an e-commerce site) to allow us to finish faster. If we use Selenium, we will have to have enough memory to have multiple copies running.
- We might not need to parse
Often sites will make API calls to get the data in a nicely formatted JSON object, which is then processed by Javascript into HTML entities. When using a parser such as BeautifulSoup, we are reading in the HTML entities, and trying to reconstruct the original data. It would be a lot slicker (and less error prone) if we are able to get the JSON objects directly.
- Selenium (like parsing) is often tedious and error-prone
The bad news for using the alternative methods is that there are so many different ways of loading data that no single technique is guaranteed to work. The biggest advantage Selenium has is that it uses a browser, and with enough care, should be indistinguishable from you browsing the web yourself.
Other techniques
This is the first in a series of articles that will look at other techniques to get data from dynamic webpages. Because scraping requires a custom approach to each site we scrape, each technique will be presented as a case study. The examples will be detailed enough to enable you to try the technique on other sites.
Technique | Description | Examples |
---|---|---|
Schema or OpenGraph metadata | OpenGraph is a standard that lets sites like Facebook easily find what your page is 'about'. We can scrape the relevant data directly from these tags | ??? Need example ??? |
JSON for Linking Data (JSON-LD) | This is a standard for embedding JSON inside script tags | Yelp |
XHR | Use the same API requests that the browser does to get the data | Sephora lipsticks, Apple jobs |
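
As a rough illustration of the first two rows, the sketch below reads OpenGraph `<meta>` tags and JSON-LD `<script>` blocks from a page's HTML. The function names are hypothetical, and it assumes the page follows the usual conventions (`property="og:..."` and `type="application/ld+json"`):

```python
import json

from bs4 import BeautifulSoup


def opengraph_tags(html):
    """Map OpenGraph property names (e.g. 'og:title') to their content."""
    soup = BeautifulSoup(html, "html.parser")
    tags = {}
    for meta in soup.find_all("meta"):
        prop = meta.get("property", "")
        if prop.startswith("og:") and meta.has_attr("content"):
            tags[prop] = meta["content"]
    return tags


def json_ld_blocks(html):
    """Parse every JSON-LD script block into Python objects."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            continue  # ignore empty or malformed blocks
    return blocks
```

Neither function needs a browser: when a site embeds its data this way, the structured values are already present in the HTML that a plain request returns.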
Selenium summary
The short list of pros and cons for using Selenium to scrape dynamic sites.
Pros | Cons |
---|---|
Will work | Slow |
| | Bandwidth and memory intensive |
| | Requires error-prone parsing |
In this tutorial, we will learn how to scrape the web using Selenium and Beautiful Soup. I am going to use these tools to collect recipes from a food website and store them in a structured format in a database. The two tasks involved in collecting the recipes are:
- Get all the recipe URLs from the website using Selenium.
- Convert the HTML of each recipe webpage into structured JSON using Beautiful Soup.
For our task, I picked NDTV Food as the source for extracting recipes.
Selenium
Selenium WebDriver automates web browsers. Its main use case is automating web applications for testing purposes, but it can also be used for web scraping. In our case, I used it to extract all the URLs corresponding to the recipes.
Installation
I used the Selenium Python bindings to drive Selenium WebDriver. Through this Python API, we can access all the functionality of Selenium web drivers such as Firefox, IE, and Chrome. The bindings can be installed with pip: `pip install selenium`.
The Selenium Python API requires a web driver to interface with your chosen browser. The corresponding web drivers can be downloaded from the following links. Also make sure the driver is in your PATH, e.g. `/usr/bin` or `/usr/local/bin`. For more information regarding installation, please refer to the link.
Web browser | Web driver link |
---|---|
Chrome | chromedriver |
Firefox | geckodriver |
Safari | safaridriver |
I used chromedriver to automate the Google Chrome web browser. The following block of code opens the website in a separate window.
Traversing the sitemap of the website
The website that we want to scrape looks like this:
We need to collect all the groups of recipes, such as categories, cuisine, festivals, occasion, member recipes, chefs, and restaurant, as shown in the above image. To do this, we will select the tab element and extract the text in it. We can find the id of the tab and its attributes by inspecting the source. In our case, the id is `insidetab`. We can extract the tab contents and their hyperlinks using the following lines.
We need to follow each of these collected links and construct a link hierarchy for the second level.
When you load a leaf of the above `sub_category_links` dictionary, you will encounter pages with a 'Show More' button, as shown in the image below. Selenium shines at tasks like this, where we can actually click the button using the `element.click()` method.
For the click automation, we will use the block of code below.
Now let’s get all the recipes in NDTV!
Beautiful Soup
Now that we have extracted all the recipe URLs, the next task is to open these URLs and parse the HTML to extract the relevant information. We will use the Requests Python library to open the URLs and the excellent Beautiful Soup library to parse the returned HTML.
Here’s what an example recipe page looks like:
`soup` is the root of the parsed tree of our HTML page, which allows us to navigate and search elements in the tree. Let’s get the `div` containing the recipe and restrict our further search to this subtree.
Inspect the page source and get the class name of the recipe container. In our case the recipe container class name is `recp-det-cont`.
Let’s start by extracting the name of the dish. `get_text()` extracts all the text inside the subtree.
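A sketch of these first steps, assuming the dish name sits in an `<h1>` inside the recipe container (the tag is a guess, and the helper names are hypothetical):

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a recipe page and return its HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def recipe_container(html):
    """Parse the page and return the div that wraps the recipe, or None."""
    soup = BeautifulSoup(html, "html.parser")  # root of the parsed tree
    return soup.find("div", class_="recp-det-cont")


def dish_name(container):
    """Return the dish name from the container, or None if not found."""
    heading = container.find("h1")  # assumption: the name is the <h1>
    return heading.get_text(strip=True) if heading is not None else None
```

Searching from `container` rather than `soup` keeps headings elsewhere on the page out of the result.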
Now let’s extract the source of the image of the dish. Inspecting the element reveals that the `img` is wrapped in a `picture` inside a `div` of class `art_imgwrap`.
BeautifulSoup allows us to navigate the tree as desired.
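For example, tag names can be used as attributes to step down the tree. A sketch, assuming `container` is the parsed recipe subtree and `dish_image` is a hypothetical helper:

```python
from bs4 import BeautifulSoup


def dish_image(container):
    """Return the src of the dish image, or None if the layout differs."""
    wrapper = container.find("div", class_="art_imgwrap")
    if wrapper is None or wrapper.picture is None or wrapper.picture.img is None:
        return None
    # wrapper.picture.img walks div -> picture -> img by tag name.
    return wrapper.picture.img.get("src")
```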
Finally, ingredients and instructions are `li` elements contained in `div`s of classes `ingredients` and `method` respectively. While `find` gets the first element matching the query, `find_all` returns a list of all matched elements.
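A sketch of that last step, combining `find` for the enclosing `div` with `find_all` for its `li` children. The output keys and helper names are illustrative, not the original schema:

```python
import json

from bs4 import BeautifulSoup


def list_items(container, class_name):
    """Return the text of every <li> inside the div with the given class."""
    block = container.find("div", class_=class_name)  # first match only
    if block is None:
        return []
    return [li.get_text(" ", strip=True) for li in block.find_all("li")]


def recipe_json(container):
    """Serialize ingredients and instructions as a JSON string."""
    recipe = {
        "ingredients": list_items(container, "ingredients"),
        "instructions": list_items(container, "method"),
    }
    return json.dumps(recipe, indent=2, ensure_ascii=False)
```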
Overall, this project allowed me to extract 2031 recipes, each stored as JSON that looks like this: