
Python + XPath = Extra Parsing Power

Description

Python's power in manipulating strings and handling nested data structures is well known. So much so that for many mild XML and HTML processing use cases one can get the job done using only built-ins and common parts of the standard library. But the markup language world offers many powerful tools which do not map so directly onto Python's data model, and there are large gains to be had if we use native XML tools alongside Python and give each component the chance to shine when it can. At the same time, learning new tools takes time, and adding new parsing and query engines to a project consumes resources. The aim of this talk is to highlight those situations where the benefits of calling in heavy machinery from the XML world outweigh the costs.

We begin with an overview of the XPath query language and use example queries to highlight differences between Python's nested data model and that of common markup languages. For example, HTML distinguishes between attributes and content, while a nested collection of Python dicts, lists and tuples only has content. To be sure, we can express the same information in both models. But we can write shorter, clearer and more-efficient-to-process queries when we retain the distinction. Similarly, we can traverse Python's built-in data structures with combinations of various brackets and parentheses, but it is not so simple to pass references into such nested structures and then navigate around. With an XPath processor and a common document object model such actions are straightforward, and arguably more pythonic than a solution built entirely on native language features.

Finally, we connect things back together with some simple web-scraping examples. Here we use XPath queries to quickly extract elements of interest and then leverage Python's string handling capabilities to swiftly convert that content into native data types. Examples will employ both the lxml parsing library and the selenium web scraping framework.
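As a rough illustration of the attribute/content distinction and of navigating from a node inside the tree, here is a minimal sketch. It is not taken from the talk: the markup and the names `DOC`, `rows`, `bond_prices` and `dict_prices` are invented for this example, and it uses the standard library's `xml.etree.ElementTree`, which implements only a subset of XPath. The lxml library mentioned above exposes fuller XPath support through its `xpath()` method.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this illustration.
DOC = """
<table>
  <tr class="bond"><td class="name">UST 10Y</td><td class="price">98.50</td></tr>
  <tr class="fx"><td class="name">USDJPY</td><td class="price">151.20</td></tr>
  <tr class="bond"><td class="name">JGB 10Y</td><td class="price">99.75</td></tr>
</table>
"""

root = ET.fromstring(DOC)

# Attributes do the filtering, content is what comes back:
# the two roles stay separate in the query.
bond_prices = [
    float(td.text)
    for td in root.findall("./tr[@class='bond']/td[@class='price']")
]

# The same data as nested Python containers: "class" is just another key,
# so the filter and the payload live in the same namespace.
rows = [
    {"class": "bond", "name": "UST 10Y", "price": "98.50"},
    {"class": "fx", "name": "USDJPY", "price": "151.20"},
    {"class": "bond", "name": "JGB 10Y", "price": "99.75"},
]
dict_prices = [float(r["price"]) for r in rows if r["class"] == "bond"]

# Navigation: locate a cell by its text, step up to the enclosing row
# with '..', then read the row's attribute and a sibling cell.
row = root.find(".//td[.='USDJPY']/..")
print(bond_prices, row.get("class"), row.find("td[@class='price']").text)
```

Both approaches recover the same prices. The difference shows up in the last step: the tree lets a query move from a matched cell back up to its row, whereas a plain dict holds no reference to the list that contains it.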
The goal is to focus on use cases where the XML machinery is worth employing. All the wrappers are similar, and we wish to highlight that it does not particularly matter which package you learn; what matters is that you learn when to employ XPath and a proper DOM.
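The scraping pattern described above (query for the elements of interest, then convert their text into native types) might look something like the following sketch. Everything in it is an assumption made for illustration: the inline `PAGE` snippet stands in for a downloaded page, and the element names, the `data-date` attribute and the values are invented. It uses the standard library's `xml.etree.ElementTree`, so the input has to be well-formed; real-world HTML would normally go through lxml or a browser driven by selenium instead.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical, well-formed stand-in for a scraped page.
PAGE = """
<html><body>
  <ul id="trades">
    <li data-date="2017-03-01">USD/SGD <span class="px">1.4085</span></li>
    <li data-date="2017-03-02">USD/JPY <span class="px">113.72</span></li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE)

trades = []
# The path query picks out only the list items we care about.
for li in root.findall(".//ul[@id='trades']/li"):
    pair = li.text.strip()                        # text before the <span>
    px = float(li.find("span[@class='px']").text)  # string -> float
    day = date.fromisoformat(li.get("data-date"))  # attribute -> date
    trades.append((pair, day, px))

print(trades)
```

The query handles the selection and ordinary Python handles the conversion, which is the division of labour the description argues for.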


Jon is Managing Director of Data Finnovation, a Singapore-based startup that is changing the way the financial services industry handles data. Before joining the Fintech movement, he spent 15 years modelling and trading fixed income and currency derivatives for banks in New York, Tokyo, London and Singapore. During this time Jon worked as a quant and trader, and managed both market-making and electronic trading teams. Prior to working in the capital markets, Jon studied at Brown University, where he earned an ScM in Computer Science and an A.B. in both Mathematical Economics and Computer Science.
