Crawling a Website that Loads Content Using JavaScript with Selenium WebDriver in Python

Selenium is a browser automation tool used primarily for testing web applications. It lets you simulate real user actions and interactions with your web applications. Selenium supports all the major browsers and operating systems, and there are bindings for all the popular programming languages. Its power is not restricted to testing web apps: it can also be used to crawl or scrape websites, in particular those that don’t provide an API and load content lazily using JavaScript.

We will crawl the online merchant website www.jabong.com with Selenium using its Python bindings. Jabong loads more products as you scroll down a page. We will simulate this scrolling and then retrieve all the product titles and the corresponding links to the product detail pages.

Platform: Ubuntu/Debian

Setup:

sudo pip install selenium

sudo apt-get install xvfb

sudo pip install pyvirtualdisplay

We use pyvirtualdisplay, which is a wrapper around Xvfb and lets you run Firefox headlessly.

A quick “Inspect Element” on a shoe on the page to crawl shows that each product is wrapped in a “div” element with class “hover-box”, and that the title and link are embedded in an “a” element within that “div”.

from selenium import webdriver
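from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from pyvirtualdisplay import Display

def correct_url(url):
    # Prepend a scheme when the URL has none.
    if not url.startswith("http://") and not url.startswith("https://"):
        url = "http://" + url
    return url

def scrollDown(browser, numberOfScrollDowns):
    # Send PAGE_DOWN to the page body so that lazily loaded products appear.
    body = browser.find_element(By.TAG_NAME, "body")
    while numberOfScrollDowns >= 0:
        body.send_keys(Keys.PAGE_DOWN)
        numberOfScrollDowns -= 1
    return browser

def crawl_url(url, run_headless=True):
    if run_headless:
        # Start a virtual X display so Firefox can run without a screen.
        display = Display(visible=0, size=(1024, 768))
        display.start()

    url = correct_url(url)
    browser = webdriver.Firefox()
    browser.get(url)
    browser = scrollDown(browser, 10)

    # Each product is wrapped in a div with class "hover-box".
    all_hover_elements = browser.find_elements(By.CLASS_NAME, "hover-box")

    for hover_element in all_hover_elements:
        a_element = hover_element.find_element(By.TAG_NAME, "a")
        product_title = a_element.get_attribute("title")
        product_link = a_element.get_attribute("href")
        print("%s %s" % (product_title, product_link))

    # Close the browser and the virtual display when done.
    browser.quit()
    if run_headless:
        display.stop()

if __name__ == '__main__':
    url = ""  # URL of the listing page to crawl
    crawl_url(url)

Here is the output of the above script:

javascript:sendSizeFormWithSize('','6')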
Nike Ballista Iv Msl Grey Running Shoes
U.S. Polo Assn. Navy Blue Sneakers
Nike Dewired Navy Blue Sneakers
United Colors of Benetton Black Boat Shoes
Phosphorus Brown Loafers
U.S. Polo Assn. Delta Beige Sneakers
Asics Kayano 20 Black Running Shoes
Nike Air Max 2014 Blue Running Shoes
Asics Excel 33 3 Navy Blue Running Shoes
Nike Air Relentless 3 Msl Blue Running Shoes
Asics Kayano 20 Navy Blue Running Shoes
Nike Lunarinternationalist Grey Running Shoes
Nike Free 5.0+ Blue Running Shoes
Nike Fs Lite Run Black Running Shoes
Nike Lunarinternationalist Blue Running Shoes
Nike Free Trainer 5.0 Grey Running Shoes
Andrew Hill Brown Dress Shoes
Nike Lunar Forever 3 Msl Grey Running Shoes
Asics Kayano 20 Red Running Shoes
Nike Free Trainer 5.0 Black Running Shoes
Nike Chroma Thong Iii Green Slippers
Nike Aquahype Blue Flip Flops
Z Collection Green Loafers
Nike Flex 2013 Rn Black Running Shoes
U.S. Polo Assn. Brown Sneakers
Phosphorus Black Loafers
Phosphorus Black Loafers
Nike Eliminate Ii Leather Grey Sneakers
Nike Fs Lite Trainer Blue Running Shoes
Nike Suketo 2 Leather Red Sneakers
Nike Free Flyknit+ Red Running Shoes
Nike Air Pegasus+ 30 Grey Running Shoes
Nike Flex Supreme Tr 2 Grey Running Shoes
Phosphorus Black Loafers
Phosphorus Black Loafers
Phosphorus Black Loafers
Nike Zoom Structure+ 17 Black Running Shoes
Nike Flyknit Lunar2 Black Running Shoes
Phosphorus Brown Loafers
Andrew Hill Black Dress Shoes
Phosphorus Tan Loafers
Phosphorus Black Loafers
U.S. Polo Assn. Delta Navy Blue Sneakers
Phosphorus Brown Loafers
Nike Lunar Forever 3 Msl White Running Shoes
Asics Kayano 20 White Running Shoes
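
The output above contains repeated titles (for example “Phosphorus Black Loafers”) and one entry whose link is a javascript: pseudo-URL rather than a product page. Below is a minimal sketch of a post-processing step, assuming the script collects (title, link) pairs into a list instead of printing them. The helper name clean_products and the example.com links are hypothetical and are not part of the original script.

```python
def clean_products(pairs):
    """Drop unusable entries and exact duplicates, preserving order.

    `pairs` is an iterable of (title, link) tuples as they might be
    collected from the "a" elements (hypothetical helper).
    """
    seen = set()
    cleaned = []
    for title, link in pairs:
        title = (title or "").strip()
        link = (link or "").strip()
        # Skip anchors that carry no title or that trigger JavaScript
        # instead of pointing at a product detail page.
        if not title or not link or link.lower().startswith("javascript:"):
            continue
        # Skip pairs that were already emitted.
        if (title, link) in seen:
            continue
        seen.add((title, link))
        cleaned.append((title, link))
    return cleaned


if __name__ == "__main__":
    # Hypothetical sample data; the example.com links are placeholders.
    sample = [
        ("", "javascript:sendSizeFormWithSize('','6')"),
        ("Phosphorus Black Loafers", "http://example.com/p/1"),
        ("Phosphorus Black Loafers", "http://example.com/p/1"),
        ("Phosphorus Brown Loafers", "http://example.com/p/2"),
    ]
    for title, link in clean_products(sample):
        print("%s %s" % (title, link))
```

The sketch deduplicates on the (title, link) pair rather than on the title alone, because the output does not show whether the repeated titles are the same product or different products that share a name.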

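The script scrolls a fixed number of times, which may load too few or too many products depending on the page. One alternative, sketched here under assumptions, is to keep scrolling until the document height stops growing. The helper name scroll_until_stable, the pause length, and the round limit are illustrative choices; the only WebDriver call it relies on is execute_script.

```python
import time


def scroll_until_stable(browser, pause=1.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops changing.

    `browser` is any object with an execute_script(js) method, such as a
    Selenium WebDriver instance. Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended; assume the end was reached
        last_height = new_height
    return rounds
```

In crawl_url, a call such as scroll_until_stable(browser) could take the place of scrollDown(browser, 10). The max_rounds limit guards against pages that keep loading content indefinitely.
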