Web Scraping Ecommerce Websites Using Python



Sometimes we need to extract information from websites. We can extract data from websites by using there available API’s. But there are websites where API’s are not available.

Photo by Waldemar Brandt on Unsplash My Web Scraping Workflow. Before starting any web scraping project, we have to define which websites will be covered in the project. I decided to cover 10 websites which are the most visited online shops in Turkey for the hand-bags category. Mar 18, 2021 Web scraping with Python is easy due to the many useful libraries available A barebones installation isn’t enough for web scraping. One of the Python advantages is a large selection of libraries for web scraping. For this Python web scraping tutorial, we’ll be using three important libraries – BeautifulSoup v4, Pandas, and Selenium. Use a Web Scraping Framework like PySpider or Scrapy When you’re crawling a massive site like Amazon.com, you need to spend some time to figure out how to run your entire crawl smoothly. Choose an open-source framework for building your scraper, like Scrapy or PySpider which are both based in Python. I am new to the concept of scraping data off the internet and need some help. I am using python 3.6.1 to scrape product details from Paytm(e-commerce website in India). I am using the following w. Sep 13, 2018 Price Scraping involves gathering price information of a product from an eCommerce website using web scraping.A price scraper can help you easily scrape prices from website for price monitoring purposes of your competitor and your products.

Here, Web scraping comes into play!

Python is widely being used in web scraping, for the ease it provides in writing the core logic. Whether you are a data scientist, developer, engineer or someone who works with large amounts of data, web scraping with Python is of great help.

Without a direct way to download the data, you are left with web scraping in Python as it can extract massive quantities of data without any hassle and within a short period of time.

In this tutorial , we shall be looking into scraping using some very powerful Python based libraries like BeautifulSoup and Selenium.

Scraping

BeautifulSoup and urllib

BeautifulSoup is a Python library for pulling data out of HTML and XML files. But it does not get data directly from a webpage. So here we will use urllib library to extract webpage.

First we need to install Python web scraping BeautifulSoup4 plugin in our system using following command :

$ sudo pip install BeatifulSoup4

$ pip install lxml

OR

$ sudo apt-get install python3-bs4

$ sudo apt-get install python-lxml

So here I am going to extract homepage from a website https://www.botreetechnologies.com

from urllib.request import urlopen

from bs4 import BeautifulSoup

We import our package that we are going to use in our program. Now we will extract our webpage using following.

response = urlopen('https://www.botreetechnologies.com/case-studies')

Beautiful Soup does not get data directly from content we just extract. So we need to parse it in html/XML data.

data = BeautifulSoup(response.read(),'lxml')

Here we parsed our webpage html content into XML using lxml parser.

As you can see in our web page there are many case studies available. I just want to read all the case studies available here.

There is a title of case studies at the top and then some details related to that case. I want to extract all that information.

We can extract an element based on tag , class, id , Xpath etc.

You can get class of an element by simply right click on that element and select inspect element.

case_studies = data.find('div', { 'class' : 'content-section' })

In case of multiple elements of this class in our page, it will return only first. So if you want to get all the elements having this class use findAll() method.

case_studies = data.find('div', { 'class' : 'content-section' })

Now we have div having class ‘content-section’ containing its child elements. We will get all <h2> tags to get our ‘TITLE’ and <ul> tag to get all children, the <li> elements.

case_stud.find('h2').find('a').text

case_stud_details = case_stud.find(‘ul’).findAll(‘li’)

Now we got the list of all children of ul element.

To get first element from the children list simply write:

case_stud_details[0]

We can extract all attribute of a element . i.e we can get text for this element by using:

case_stud_details[2].text

But here I want to click on the ‘TITLE’ of any case study and open details page to get all information.

Web Scraping Ecommerce Websites Using Python 3.7

Since we want to interact with the website to get the dynamic content, we need to imitate the normal user interaction. Such behaviour cannot be achieved using BeautifulSoup or urllib, hence we need a webdriver to do this.

Webdriver basically creates a new browser window which we can control pragmatically. It also let us capture the user events like click and scroll.

Selenium is one such webdriver.

Selenium Webdriver

Selenium webdriver accepts cthe ommand and sends them to ba rowser and retrieves results.

You can install selenium in your system using fthe ollowing simple command:

$ sudo pip install selenium

In order to use we need to import selenium in our Python script.

from selenium import webdriver

I am using Firefox webdriver in this tutorial. Now we are ready to extract our webpage and we can do this by using fthe ollowing:

self.url = 'https://www.botreetechnologies.com/'

self.browser = webdriver.Firefox()

Now we need to click on ‘CASE-STUDIES’ to open that page.

We can click on a selenium element by using following piece of code:

self.browser.find_element_by_xpath('//div[contains(@id,'navbar')]/ul[2]/li[1]').click()

Now we are transferred to case-studies page and here all the case studies are listed with some information.

Web Scraping With Python Pdf

Here, I want to click on each case study and open details page to extract all available information.

So, I created a list of links for all case studies and load them one after the other.

To load previous page you can use following piece of code:

self.browser.execute_script('window.history.go(-1)')

Final script for using Selenium will looks as under:

And we are done, Now you can extract static webpages or interact with webpages using the above script.

Conclusion: Web Scraping Python is an essential Skill to have

Today, more than ever, companies are working with huge amounts of data. Learning how to scrape data in Python web scraping projects will take you a long way. In this tutorial, you learn Python web scraping with beautiful soup.

Along with that, Python web scraping with selenium is also a useful skill. Companies need data engineers who can extract data and deliver it to them for gathering useful insights. You have a high chance of success in data extraction if you are working on Python web scraping projects.

If you want to hire Python developers for web scraping, then contact BoTree Technologies. We have a team of engineers who are experts in web scraping. Give us a call today.

Consulting is free – let us help you grow!

Price Scraping involves gathering price information of a product from an eCommerce website using web scraping. A price scraper can help you easily scrape prices from website for price monitoring purposes of your competitor and your products.

How to Scrape Prices

1. Create your own Price Monitoring Tool to Scrape Prices

There are plenty of web scraping tutorials on the internet where you can learn how to create your own price scraper to gather pricing from eCommerce websites. However, writing a new scraper for every different eCommerce site could get very expensive and tedious. Below we demonstrate some advanced techniques to build a basic web scraper that could scrape prices from any eCommerce page.

2. Web Scraping using Price Scraping Tools

Web scraping tools such as ScrapeHero Cloud can help you scrape prices without coding, downloading and learning how to use a tool. ScrapeHero Cloud has pre-built crawlers that can help you scrape popular eCommerce websites such as Amazon, Walmart, Target easily. ScrapeHero Cloud also has scraping APIs to help you scrape prices from Amazon and Walmart in real-time, web scraping APIs can help you get pricing details within seconds.

3. Custom Price Monitoring Solution

ScrapeHero Price Monitoring Solutions are cost-effective and can be built within weeks and in some cases days. Our price monitoring solution can easily be scaled to include multiple websites and/or products within a short span of time. We have considerable experience in handling all the challenges involved in price monitoring and have the sufficient know-how about the essentials of product monitoring.

Learn how to scrape prices for FREE –

How to Build a Price Scraper

In this tutorial, we will show you how to build a basic web scraper which will help you in scraping prices from eCommerce websites by taking a few common websites as an example.

Let’s start by taking a look at a few product pages, and identify certain design patterns on how product prices are displayed on the websites.

Amazon.com

Sephora.com

Observations and Patterns

Some patterns that we identified by looking at these product pages are:

  • Price appears as currency figures (never as words)
  • The price is the currency figure with the largest font size
  • Price comes inside first 600 pixels height
  • Usually the price comes above other currency figures

Of course, there could be exceptions to these observations, we’ll discuss how to deal with exceptions later in this article. We can combine these observations to create a fairly effective and generic crawler for scraping prices from eCommerce websites.

Implementation of a generic eCommerce scraper to scrape prices

Step 1: Installation

This tutorial uses the Google Chrome web browser. If you don’t have Google Chrome installed, you can follow the installation instructions.

Instead of Google Chrome, advanced developers can use a programmable version of Google Chrome called Puppeteer. This will remove the necessity of a running GUI application to run the scraper. However, that is beyond the scope of this tutorial.

Step 2: Chrome Developer Tools

The code presented in this tutorial is designed for scraping prices as simple as possible. Therefore, it will not be capable of fetching the price from every product page out there.

For now, we’ll visit an Amazon product page or a Sephora product page in Google Chrome.

  • Visit the product page in Google Chrome
  • Right-click anywhere on the page and select ‘Inspect Element’ to open up Chrome DevTools
  • Click on the Console tab of DevTools

Inside the Console tab, you can enter any JavaScript code. The browser will execute the code in the context of the web page that has been loaded. You can learn more about DevTools using their official documentation.

Step 3: Run the JavaScript snippet

Copy the following JavaScript snippet and paste it into the console.

Press ‘Enter’ and you should now be seeing the price of the product displayed on the console.

If you don’t, then you have probably visited a product page which is an exception to our observations. This is completely normal, we’ll discuss how we can expand our script to cover more product pages of these kinds. You could try one of the sample pages provided in step 2. Itns 300 review.

The animated GIF below shows how we get the price from Amazon.com

How it works

First, we have to fetch all the HTML DOM elements in the page.

We need to convert each of these elements to simple JavaScript objects which stores their XY position values, text content and font size, which looks something like {'text':'Tennis Ball', 'fontSize':'14px', 'x':100,'y':200}. So we have to write a function for that, as follows.

Now, convert all the elements collected to JavaScript objects by applying our function on all elements using the JavaScript map function.

Remember the observations we made regarding how a price is displayed. We can now filter just those records which match our design observations. So we need a function that says whether a given record matches with our design observations.

We have used a Regular Expressionto check if a given text is a currency figure or not. You can modify this regular expression in case it doesn’t cover any web pages that you’re experimenting with.

Now we can filter just the records that are possibly price records

Finally, as we’ve observed, the Price comes as the currency figure having the highest font size. If there are multiple currency figures with equally high font size, then Price probably corresponds to the one residing at a higher position. We are going to sort out our records based on these conditions, using the JavaScript sort<em><strong>function.

Now we just need to display it on the console

Taking it further

Moving to a GUI-less based scalable program

You can replace Google Chrome with a headless version of it called Puppeteer. Puppeteer is arguably the fastest option for headless web rendering. It works entirely based on the same ecosystem provided in Google Chrome. Once Puppeteer is set up, you can inject our script programmatically to the headless browser, and have the price returned to a function in your program. To learn more, visit our tutorial on Puppeteer.

Learn more: Web Scraping with Puppeteer and Node.Js

Improving and enhancing this script

You will quickly notice that some product pages will not work with such a script because they don’t follow the assumptions we have made about how the product price is displayed and the patterns we identified.

Unfortunately, there is no “holy grail” or a perfect solution to this problem. It is possible to generalize more web pages and identify more patterns and enhance this scraper.

A few suggestions for enhancements are:

  • Figuring out more features, such as font-weight, font color, etc.
  • Class names or IDs of the elements containing price would probably have the word price. You could figure out such other commonly occurring words.
  • Currency figures with strike-through are probably regular prices, those could be ignored.

There could be pages that follow some of our design observations but violates some others. The snippet provided above strictly filters out elements that violate even one of the observations. In order to deal with this, you can try creating a score basedsystem. This would award points for following certain observations and penalize for violating certain observations. Those elements scoring above a particular threshold could be considered as price.

The next significant step that you would use to handle other pages is to employ Artificial Intelligence/Machine Learning based techniques. You can identify and classify patterns and automate the process to a larger degree this way. However, this field is an evolving field of study and we at ScrapeHero are using such techniques already with varying degrees of success.

If you need help to scrape prices from Amazon.com you can check out our tutorial specifically designed for Amazon.com:

Learn More:How to Scrape Prices from Amazon using Python

We can help with your data or automation needs

Turn the Internet into meaningful, structured and usable data



Disclaimer:Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.