Beyond scraping

Description

Anthon van der Neut - Beyond scraping [EuroPython 2016] [20 July 2016] [Bilbao, Euskadi, Spain] (https://ep2016.europython.eu//conference/talks/beyond-scraping-getting-data-from-dynamic-heavily-javascript-driven-websites)

This talk shows how to create a simple, evolving client-server architecture combining zeromq, selenium and beautifulsoup, which allows you to scrape data even from variable, dynamic sites like Sporcle and KhanAcademy. Once the page analysis has been implemented, regular "downloads" can easily be deployed without cluttering your desktop, on your headless server, and/or anonymously.


Scraping static websites can be done with urllib2 from the standard library, or with slightly more sophisticated packages like requests. However, as soon as JavaScript comes into play on the website you want to download information from, for things like logging in via OpenID or constructing the page's content, you almost always have to fall back to driving a real browser. For websites with variable content this can be a time-consuming and cumbersome process.
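To make the limitation concrete, here is a minimal sketch using only the Python standard library. The two HTML strings are made up for illustration: STATIC_HTML is what a plain HTTP download might return for a page that builds its list in JavaScript, and RENDERED_HTML is what a real browser might report after running the script. Neither is taken from an actual site.

```python
from html.parser import HTMLParser

# Hypothetical page as downloaded: the list is empty until the script runs.
STATIC_HTML = """
<html><body>
<ul id="quizzes"></ul>
<script>
  var li = document.createElement("li");
  li.textContent = "Capitals of Europe";
  document.getElementById("quizzes").appendChild(li);
</script>
</body></html>
"""

# Hypothetical page after a browser has executed the script.
RENDERED_HTML = """
<html><body>
<ul id="quizzes"><li>Capitals of Europe</li></ul>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> start tags in the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.items = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items += 1


def count_list_items(html):
    parser = ListItemCounter()
    parser.feed(html)
    return parser.items


if __name__ == "__main__":
    print("static download:", count_list_items(STATIC_HTML))
    print("after rendering:", count_list_items(RENDERED_HTML))
```

The parser never executes the script, so the downloaded markup contains no list items to find. Only the browser-rendered markup does, which is why a real browser has to be driven for such pages.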

This talk shows how to create a simple, evolving client-server architecture combining zeromq, selenium and beautifulsoup, which allows you to scrape data from sites like Sporcle, StackOverflow and KhanAcademy. Once the page analysis has been implemented, regular "downloads" can easily be deployed without cluttering your desktop, on your headless server, and/or anonymously.

The described client-server setup allows you to restart your changed analysis program without having to redo all the previous steps of logging in and stepping through instructions to get back to the page where you got "stuck" earlier on. This often decreases the time between entering a possible fix in your HTML analysis code and testing it from a few tens of seconds, when you have to restart a browser, to less than a second.
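The shape of such a setup can be sketched with the standard library alone. This is not the code from the talk. Here multiprocessing.connection stands in for zeromq, the hypothetical FakeBrowser class stands in for a selenium-driven browser, and html.parser stands in for beautifulsoup. All names, the "page_source" request string and the sample page are assumptions made for illustration.

```python
import threading
from html.parser import HTMLParser
from multiprocessing.connection import Client, Listener


class FakeBrowser:
    """Hypothetical stand-in for a selenium-driven browser session."""

    def __init__(self):
        self.logins = 0
        self.page = ""

    def login_and_navigate(self):
        # With a real browser this is the slow part: start up, log in, click through.
        self.logins += 1
        self.page = "<html><body><h1 id='score'>42</h1></body></html>"

    def page_source(self):
        return self.page


def serve(listener, browser, max_requests):
    """Server side: set the browser up once, then answer page-source requests."""
    browser.login_and_navigate()
    for _ in range(max_requests):
        with listener.accept() as conn:
            if conn.recv() == "page_source":
                conn.send(browser.page_source())


class IdTextParser(HTMLParser):
    """Collects the text of one element by id (no nested tags handled)."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.wanted_id:
            self.inside = True

    def handle_endtag(self, tag):
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)


def analyse(address, wanted_id):
    """Client side: the cheap, often-restarted analysis step."""
    with Client(address) as conn:
        conn.send("page_source")
        html = conn.recv()
    parser = IdTextParser(wanted_id)
    parser.feed(html)
    return "".join(parser.text).strip()


def demo():
    browser = FakeBrowser()
    # Port 0 lets the operating system pick a free port.
    with Listener(("127.0.0.1", 0)) as listener:
        server = threading.Thread(target=serve, args=(listener, browser, 2))
        server.start()
        first = analyse(listener.address, "score")   # first attempt
        second = analyse(listener.address, "score")  # "restarted" analysis
        server.join()
    return first, second, browser.logins


if __name__ == "__main__":
    print(demo())
```

The analysis runs twice while the expensive login step runs only once, because the browser state lives in the server. In a real deployment the server would be a separate long-running process rather than a thread, so that the analysis program can be edited and restarted freely.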

Using such a setup, you have time to focus on writing robust code instead of code that breaks with every little change the site's designers make.
